Diff for "UnicodeEncodeError"

Differences between revisions 8 and 9

The UnicodeEncodeError normally happens when encoding a unicode string into a certain coding. Since a coding maps only a limited number of unicode characters to str strings, a character that it cannot represent will cause the coding-specific encode() to fail.

    r"""
    Encoding from unicode to str.

    >>> u"a".encode("iso-8859-15")
    'a'
    >>> u"\u0411".encode("iso-8859-15")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/iso8859_15.py", line 12, in encode
    UnicodeEncodeError: 'charmap' codec can't encode character u'\u0411' in position 0: character maps to <undefined>
    """
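When such a failure is expected, encode() also accepts an error-handler argument that replaces the exception with a lossy or escaped result. The following is only a minimal sketch: it assumes the standard handler names "replace", "ignore", "xmlcharrefreplace" and "backslashreplace", the variable names are illustrative, and it is written to run on Python 2.6+ and Python 3 rather than on Python 2.5 (which lacks the b"" literal and the "except ... as" syntax).

```python
text = u"a\u0411"  # CYRILLIC CAPITAL LETTER BE is not part of ISO-8859-15

# The default handler is "strict", which raises UnicodeEncodeError.
try:
    text.encode("iso-8859-15")
except UnicodeEncodeError as exc:
    # The exception records the offending input and the failing range.
    failed = exc.object[exc.start:exc.end]

# Alternative handlers trade the exception for a lossy or escaped result.
replaced = text.encode("iso-8859-15", "replace")          # b"a?"
ignored = text.encode("iso-8859-15", "ignore")            # b"a"
as_xml = text.encode("iso-8859-15", "xmlcharrefreplace")  # b"a&#1041;"
escaped = text.encode("iso-8859-15", "backslashreplace")  # b"a\\u0411"
```

Which handler is appropriate depends on whether the receiving side can tolerate lost or escaped characters; "strict" remains the safest default.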

Paradoxically, a UnicodeEncodeError may happen when _decoding_. The cause seems to be the coding-specific decode() functions, which normally expect a parameter of type str. It appears that on seeing a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be of their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.

The choice of the ASCII encoder for the "down-conversion" might be considered wise because ASCII is a common subset of most codings, so its output is a valid str in nearly any of them. The subsequent decoding step can accept only a coding-specific str.
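The two-step behaviour described above can be modelled in a few lines. This is a sketch of the observed behaviour only, not the interpreter's actual implementation; the function name py2_style_decode is hypothetical, and the code is written to run on Python 2.6+ and Python 3.

```python
def py2_style_decode(text, encoding):
    """Model of the behaviour described above (not the real implementation).

    Step 1: implicit "down-conversion" of the unicode argument with ASCII.
    Step 2: decoding of the resulting byte string with the requested coding.
    """
    raw = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII
    return raw.decode(encoding)


ok = py2_style_decode(u"a", "utf-8")  # pure ASCII survives both steps

try:
    py2_style_decode(u"\u0411", "utf-8")
    error_name = None
except UnicodeEncodeError as exc:
    # An encoding error raised from what looks like a decoding call.
    error_name = type(exc).__name__
```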

However, unlike the similar issue with UnicodeDecodeError while encoding, there would be no ambiguity if decode() simply returned the unicode argument unmodified. There seems to be no such shortcut in the decode() functions as of Python 2.5.
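Such a shortcut can be approximated in user code. The helper below is a hypothetical wrapper (lenient_decode and TEXT_TYPE are names invented for this sketch, not part of any library), written to run on Python 2.6+ and Python 3: it passes already-decoded text through untouched and decodes only byte strings.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def lenient_decode(value, encoding):
    """Hypothetical helper: decode byte strings, pass text through as-is."""
    if isinstance(value, TEXT_TYPE):
        return value  # already decoded, so there is nothing to do
    return value.decode(encoding)


from_bytes = lenient_decode(b"\xd0\x91", "utf-8")  # decoded normally
from_text = lenient_decode(u"\u0411", "utf-8")     # returned unmodified
```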

Alternatively, decode() could always raise a TypeError on receiving an argument of an unexpected type. (This would require the underlying stream.read() to return only str when called from StreamReader.read(); the latter would then return only unicode.)
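The stricter alternative can be sketched in the same way. Here strict_decode and TEXT_TYPE are hypothetical names used for illustration only; codecs.getreader() and io.BytesIO are standard-library calls, used to show a byte-producing stream wrapped by a text-producing StreamReader. The sketch assumes Python 2.6+ or Python 3.

```python
import codecs
import io

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def strict_decode(value, encoding):
    """Hypothetical helper: refuse already-decoded text instead of guessing."""
    if isinstance(value, TEXT_TYPE):
        raise TypeError("expected a byte string, got %s" % type(value).__name__)
    return value.decode(encoding)


# The underlying stream yields byte strings; the StreamReader yields text.
stream = io.BytesIO(b"\xd0\x91")
reader = codecs.getreader("utf-8")(stream)
decoded = reader.read()
```

Under this design a caller that passes text to strict_decode() gets an immediate TypeError rather than an encoding error that surfaces only for non-ASCII input.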

    r"""
    Decoding from str to unicode.

    >>> "a".decode("utf-8")
    u'a'
    >>> "\xd0\x91".decode("utf-8")
    u'\u0411'
    >>> u"a".decode("utf-8")      # Unexpected argument type.
    u'a'
    >>> u"\u0411".decode("utf-8") # Unexpected argument type.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/utf_8.py", line 16, in decode
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
    """

CategoryUnicode

-  ⇤ ← Revision 8 as of 2007-07-13 04:32:24 → 
  Size: 2386
  Editor: cscfpc15
  Comment:
+   ← Revision 9 as of 2007-07-13 04:44:51 → ⇥
  Size: 2430
  Editor: cscfpc15
  Comment:
 Line 24:
-Alternatively, a TypeError exception could always be thrown on receiving an argument of unexpected type in {{{decode()}}}.  (This would disallow {{{stream.read()}}} from producing {{{unicode}}} data in StreamReader.read()).
+Alternatively, a TypeError exception could always be thrown on receiving an argument of unexpected type in {{{decode()}}}.  (This would require {{{stream.read()}}} to produce only {{{str}}} for StreamReader{{{.read()}}}.  The latter would only produce {{{unicode}}}).
