Revision 1 as of 2007-07-13 03:25:31


A UnicodeEncodeError normally happens when encoding a unicode string into a particular coding. Since many codings map only a limited number of unicode characters to str strings, a character that the coding cannot represent will cause the coding-specific encode() to fail.
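As a minimal sketch of the ordinary case, the snippet below tries to encode one Cyrillic character with several codings. The helper name try_encode and the choice of codings are illustrative assumptions, not part of the original page. The syntax used here needs Python 2.6 or later and also runs on Python 3, where str plays the role of unicode and bytes the role of str.

```python
text = u"\u0411"  # CYRILLIC CAPITAL LETTER BE


def try_encode(value, encoding):
    """Hypothetical helper: return the encoded bytes, or the exception raised."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError as exc:
        # The coding has no representation for at least one character.
        return exc


# ASCII and Latin-1 cannot represent the character; cp1251 and UTF-8 can.
results = dict(
    (name, try_encode(text, name))
    for name in ("ascii", "latin-1", "cp1251", "utf-8")
)
```

Here results["ascii"] and results["latin-1"] hold UnicodeEncodeError instances, while the other two entries hold the encoded byte strings.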

Paradoxically, a UnicodeEncodeError may also happen when _decoding_. The cause seems to be that the coding-specific decode() functions normally expect a parameter of type str. It appears that on seeing a unicode parameter, decode() first "down-converts" it into str and then decodes the result, assuming it to be in its own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
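The behaviour described above can be modelled roughly as follows. This is only a sketch of what decode() appears to do, not the actual codec source; the function name decode_with_downconversion is hypothetical, and the model is written so that it runs on Python 2.6+ as well as Python 3.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3


def decode_with_downconversion(value, encoding):
    """Hypothetical model of the apparent Python 2 behaviour of decode()."""
    if isinstance(value, text_type):
        # Assumed implicit step: the unicode argument is first turned into
        # a byte string with the ASCII encoder. This is where a
        # UnicodeEncodeError can be raised during a "decode".
        value = value.encode("ascii")
    # Only now is the byte string decoded with the requested coding.
    return value.decode(encoding)
```

Under this model, decode_with_downconversion(u"a", "utf-8") succeeds because "a" survives the ASCII step, while decode_with_downconversion(u"\u0411", "utf-8") raises UnicodeEncodeError before any UTF-8 decoding takes place.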

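The choice of the ASCII encoder for the "down-conversion" might be considered wise because ASCII is a common subset of most codings (UTF-16 being one exception), so a str produced by it is valid input for most coding-specific decoders. The subsequent decoding step can only accept a str in its own coding.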
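The claim that ASCII is shared between codings can be checked for a handful of them. The sketch below decodes all 128 ASCII byte values with a few codings and compares the results; the list of codings is an arbitrary sample chosen for illustration, not an exhaustive survey.

```python
# All 128 ASCII byte values as one byte string (works on Python 2.6+ and 3).
ascii_bytes = bytes(bytearray(range(128)))
reference = ascii_bytes.decode("ascii")

# A small, arbitrary sample of ASCII-compatible codings.
codings = ["utf-8", "latin-1", "cp1251", "koi8_r"]
agree = dict(
    (name, ascii_bytes.decode(name) == reference) for name in codings
)

# Counterexample: UTF-16 does not encode "a" as the single ASCII byte.
utf16_a = u"a".encode("utf-16-le")
```

Every entry in agree is True for this sample, whereas utf16_a is two bytes long, which shows that the "intersection" holds for ASCII-compatible codings only.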

However, unlike the similar issue with UnicodeDecodeError while encoding, there would be no ambiguity if decode() simply returned the unicode argument unmodified. There seems to be no such shortcut in the decode() functions as of Python 2.5.

    >>> "a".decode("utf-8")
    u'a'
    >>> "\xd0\x91".decode("utf-8")
    u'\u0411'
    >>> u"\u0411".encode("utf-8")
    '\xd0\x91'
    >>> u"\u0411".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/utf_8.py", line 16, in decode
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
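The shortcut discussed above, returning a unicode argument unmodified, is not built into decode(), but it can be applied by the caller. The helper below is a hypothetical sketch (the name to_unicode and the UTF-8 default are assumptions) that runs on Python 2.6+ and Python 3.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass unicode through as-is."""
    if isinstance(value, text_type):
        # Already unicode: skip decode() so no implicit ASCII encoding occurs.
        return value
    return value.decode(encoding)
```

With this helper, to_unicode(u"\u0411") returns its argument unchanged instead of raising UnicodeEncodeError, and to_unicode("\xd0\x91") on Python 2 still decodes the byte string as before.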


CategoryUnicode
