A UnicodeEncodeError normally happens when encoding a unicode string into a particular coding. Since many codings map only a limited number of unicode characters to str strings, a character that the coding does not represent will cause the coding-specific encode() to fail.
r"""
Encoding from unicode to str.

>>> import codecs
>>> u"a".encode("iso-8859-15")
'a'
>>> u"\u0411".encode("iso-8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/iso8859_15.py", line 12, in encode
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0411' in position 0: character maps to <undefined>

>>> iso_8859_15_encoder = codecs.getencoder("iso-8859-15")
>>> encode_to_iso_8859_15 = lambda s, iso_8859_15_encoder=iso_8859_15_encoder, errors="strict": \
...     iso_8859_15_encoder(s, errors)[0]
>>> encode_to_iso_8859_15(u"a\u0411b", errors="backslashreplace")
'a\\u0411b'
>>> encode_to_iso_8859_15(u"a\u0411b", errors="replace")
'a?b'
>>> encode_to_iso_8859_15(u"a\u0411b", errors="xmlcharrefreplace")
'a&#1041;b'
"""
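The error handlers used above can also be passed straight to encode(), without going through codecs.getencoder(). The following is a minimal sketch of that; the names text and encode_with are made up for illustration, and the snippet is written so that it should run on both Python 2 and Python 3 (where the results are bytes objects).

```python
# Illustrative sketch: pass the error handler directly to encode().
# `text` and `encode_with` are hypothetical names, not part of any library.
text = u"a\u0411b"  # U+0411 has no mapping in ISO-8859-15


def encode_with(errors):
    """Encode `text` to ISO-8859-15 using the given error handler."""
    return text.encode("iso-8859-15", errors)


try:
    encode_with("strict")
    strict_failed = False
except UnicodeEncodeError:
    # The default handler raises on the unmappable character.
    strict_failed = True

replaced = encode_with("replace")            # unmappable character becomes "?"
escaped = encode_with("backslashreplace")    # becomes a backslash escape
xml_refs = encode_with("xmlcharrefreplace")  # becomes a numeric XML reference
```

Which handler is appropriate depends on the consumer of the output: "replace" loses information, while the other two keep the code point recoverable.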
Paradoxically, a UnicodeEncodeError may also happen when _decoding_. The cause seems to be the coding-specific decode() functions, which normally expect a parameter of type str. It appears that on receiving a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be in their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
The choice of the ASCII encoder for "down-conversion" might be considered wise because ASCII is a common subset of most codings. The subsequent decoding step can only accept a coding-specific str.
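The apparent "down-conversion" can be emulated explicitly. The sketch below is only a model of the behaviour described above, not the actual implementation inside Python 2; py2_style_decode is a hypothetical helper, and the implicit ASCII step is the assumption being illustrated.

```python
def py2_style_decode(value, encoding):
    """Model of how decode() appears to treat a unicode argument.

    Assumption: text is first encoded with the ASCII codec, and the
    resulting byte string is then decoded with the requested codec.
    """
    if isinstance(value, bytes):
        # Expected argument type: decode directly.
        return value.decode(encoding)
    # Unexpected argument type: implicit ASCII "down-conversion" first.
    return value.encode("ascii").decode(encoding)


# Pure-ASCII text survives the round trip unchanged.
ascii_result = py2_style_decode(u"a", "utf-8")

# Non-ASCII text fails in the hidden encoding step, not in the decoder.
try:
    py2_style_decode(u"\u0411", "utf-8")
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__
```

In this model the exception is raised by the ASCII encoder, which matches the 'ascii' codec named in the error message even though the caller asked for a UTF-8 decode.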
However, unlike the similar issue of a UnicodeDecodeError raised while encoding, there would be no ambiguity if decode() simply returned the unicode argument unmodified. There seems to be no such shortcut in the decode() functions as of Python 2.5.
Alternatively, a TypeError exception could always be raised when a decode() function receives a unicode argument. (This would require the underlying stream.read() to produce only str for StreamReader.read(), which in turn would produce only unicode.)
r"""
Decoding from str to unicode.

>>> "a".decode("utf-8")
u'a'
>>> "\xd0\x91".decode("utf-8")
u'\u0411'
>>> u"a".decode("utf-8") # Unexpected argument type.
u'a'
>>> u"\u0411".decode("utf-8") # Unexpected argument type.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/utf_8.py", line 16, in decode
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
"""
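The two alternatives discussed above, returning a unicode argument unmodified or rejecting it with a TypeError, could be approximated in user code with small wrappers. This is a hedged sketch only: decode_passthrough and decode_strict are hypothetical helper names, not functions offered by Python or its codecs module.

```python
def decode_passthrough(value, encoding):
    """First alternative: decode byte strings, return text unmodified."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: nothing to decode, so no hidden ASCII step can fail.
    return value


def decode_strict(value, encoding):
    """Second alternative: refuse anything that is not a byte string."""
    if not isinstance(value, bytes):
        raise TypeError(
            "decode() expects a byte string, got %s" % type(value).__name__
        )
    return value.decode(encoding)


passthrough_text = decode_passthrough(u"\u0411", "utf-8")
passthrough_bytes = decode_passthrough(b"\xd0\x91", "utf-8")

try:
    decode_strict(u"\u0411", "utf-8")
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

The pass-through variant is more forgiving for callers that mix both types, while the strict variant surfaces the type confusion at the point where it happens.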