Diff for "UnicodeEncodeError"

Differences between revisions 1 and 14 (spanning 13 versions)

A UnicodeEncodeError normally occurs when encoding a unicode string into a particular coding. Since a coding maps only a limited number of unicode characters to str strings, a character that the coding cannot represent will cause the coding-specific encode() to fail.

    r"""
    Encoding from unicode to str.

    >>> import codecs
    >>> u"a".encode("iso-8859-15")
    'a'
    >>> u"\u0411".encode("iso-8859-15")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/iso8859_15.py", line 12, in encode
    UnicodeEncodeError: 'charmap' codec can't encode character u'\u0411' in position 0: character maps to <undefined>

    >>> iso_8859_15_encoder = codecs.getencoder("iso-8859-15")
    >>> encode_to_iso_8859_15 = lambda s, iso_8859_15_encoder=iso_8859_15_encoder, errors="strict": \
    ...     iso_8859_15_encoder(s, errors)[0]
    >>> encode_to_iso_8859_15(u"a\u0411b", errors="backslashreplace")
    'a\\u0411b'
    >>> encode_to_iso_8859_15(u"a\u0411b", errors="replace")
    'a?b'
    >>> encode_to_iso_8859_15(u"a\u0411b", errors="xmlcharrefreplace")
    'a&#1041;b'
    """
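
When the failure has to be reported rather than papered over with an error handler, the exception object itself records which character was rejected. The following is a minimal sketch and not part of the original page: the helper name describe_encode_failure is hypothetical, and the code assumes Python 2.6 or later (it should also run unchanged on Python 3, where unicode is called str).

```python
def describe_encode_failure(text, coding):
    """Return None if text encodes cleanly, else details of the first failure."""
    try:
        text.encode(coding)
    except UnicodeEncodeError as exc:
        # exc.object is the input text; exc.start and exc.end delimit
        # the slice of it that the codec could not encode.
        return {
            "codec": exc.encoding,
            "position": exc.start,
            "character": exc.object[exc.start:exc.end],
            "reason": exc.reason,
        }
    return None


# U+0411 (Cyrillic BE) has no mapping in ISO 8859-15.
failure = describe_encode_failure(u"a\u0411b", "iso-8859-15")

# U+20AC (the euro sign) is part of ISO 8859-15, so nothing fails here.
clean = describe_encode_failure(u"a\u20acb", "iso-8859-15")
```

Here failure describes the character at position 1, while clean is None.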

Paradoxically, a UnicodeEncodeError may happen when _decoding_. The cause seems to be that the coding-specific decode() functions normally expect a parameter of type str. It appears that, on seeing a unicode parameter, the decode() functions "down-convert" it into str, then decode the result assuming it to be in their own coding. It also appears that the "down-conversion" is performed using the ASCII encoder. Hence an encoding failure inside a decoder.
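
The behaviour described above can be emulated explicitly. This is only a sketch of the apparent two-step process, not the interpreter's actual implementation: the function name decode_with_implicit_ascii and the TEXT_TYPE alias are hypothetical, and the code is written to run on Python 2.6+ as well as Python 3.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def decode_with_implicit_ascii(value, coding):
    """Emulate what decode() appears to do with a unicode argument."""
    if isinstance(value, TEXT_TYPE):
        # The "down-conversion": text is first encoded with the ASCII codec.
        # This is the step that can raise UnicodeEncodeError.
        value = value.encode("ascii")
    # The actual decoding, which only ever sees a byte string.
    return value.decode(coding)


# Pure-ASCII text survives the round trip unnoticed.
ascii_only = decode_with_implicit_ascii(u"a", "utf-8")

# Non-ASCII text fails in the encoding step, before any decoding happens.
try:
    decode_with_implicit_ascii(u"\u0411", "utf-8")
    failing_codec = None
except UnicodeEncodeError as exc:
    failing_codec = exc.encoding
```

Note that the codec named in the exception is the ASCII one, not the UTF-8 codec that was requested for decoding.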

The choice of the ASCII encoder for the "down-conversion" might be considered wise because ASCII is a common subset of most codings, so its output is valid input for the subsequent decoding, which may only accept a coding-specific str.
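
How far ASCII is shared between codings can be checked directly. The sketch below is an illustration under assumptions: the list of codings is an arbitrary sample chosen for this example, and the code is written to run on both Python 2.6+ and Python 3. It also shows that the property does not hold for every coding, UTF-16 being one exception.

```python
# All 128 ASCII byte values as one byte string.
ascii_bytes = bytes(bytearray(range(128)))
ascii_text = ascii_bytes.decode("ascii")

# An arbitrary sample of ASCII-compatible codings.
sample_codings = ["utf-8", "iso-8859-1", "iso-8859-15", "koi8-r", "cp1251"]

# Keep the codings that decode every ASCII byte to the same character.
agreeing = [c for c in sample_codings if ascii_bytes.decode(c) == ascii_text]

# UTF-16 is not ASCII-compatible: two ASCII bytes form a single code unit.
utf16_result = b"ab".decode("utf-16-le")
```

For the sampled codings, agreeing equals sample_codings, whereas utf16_result is the single character U+6261 rather than u"ab".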

However, unlike the similar issue with UnicodeDecodeError while encoding, there would be no ambiguity if decode() simply returned the unicode argument unmodified. There seems to be no such shortcut in the decode() functions as of Python 2.5.

Alternatively, decode() functions could always raise a TypeError on receiving a unicode argument. (This would require stream.read() to produce only str for StreamReader.read(). The latter would then produce only unicode.)
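
The two alternatives discussed above (returning a unicode argument unmodified, or rejecting it with a TypeError) can be sketched as wrapper functions. Both names, decode_passthrough and decode_strict, are hypothetical and exist only for this illustration; neither is part of Python. The code is written to run on Python 2.6+ and Python 3.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def decode_passthrough(value, coding):
    """Alternative 1: return text unmodified, decode only byte strings."""
    if isinstance(value, TEXT_TYPE):
        return value
    return value.decode(coding)


def decode_strict(value, coding):
    """Alternative 2: refuse text outright with a TypeError."""
    if isinstance(value, TEXT_TYPE):
        raise TypeError("decode() expects a byte string, got text")
    return value.decode(coding)


# Text passes through untouched, so no ASCII step can fail.
passthrough_result = decode_passthrough(u"\u0411", "utf-8")

# The strict variant reports the wrong argument type immediately.
try:
    decode_strict(u"\u0411", "utf-8")
    strict_raised = False
except TypeError:
    strict_raised = True
```

Either variant still decodes a genuine byte string as usual; they differ only in how they treat an argument that is already text.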

    r"""
    Decoding from str to unicode.

    >>> "a".decode("utf-8")
    u'a'
    >>> "\xd0\x91".decode("utf-8")
    u'\u0411'
    >>> u"a".decode("utf-8")      # Unexpected argument type.
    u'a'
    >>> u"\u0411".decode("utf-8") # Unexpected argument type.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/utf_8.py", line 16, in decode
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
    """

CategoryUnicode

- Revision 1 as of 2007-07-13 03:25:31
  Size: 1689
  Editor: cscfpc15
  Comment:
+ Revision 14 as of 2007-07-13 05:14:13
  Size: 2864
  Editor: cscfpc15
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
+The {{{UnicodeEncodeError}}} normally happens when encoding a {{{unicode}}} string into a certain coding.  Since codings map only a limited number of {{{unicode}}} characters to {{{str}}} strings, a non-presented character will cause the coding-specific {{{encode()}}} to fail.
-Line 2:
+Line 3:
-The {{{UnicodeEncodeError}}} normally happens when encoding a {{{unicode}}} string into a certain coding.  Since encodings map only limited number of {{{unicode}}} characters to {{{str}}} strings, a non-presented character will cause the coding-specific {{{encode()}}} to fail.
+{{{
#!python
r"""
Encoding from unicode to str.
-Line 4:
+Line 8:
-Paradoxically, a {{{UnicodeEncodeError}}} may happen when _decoding_.  The cause of it seems to be the coding-specific {{{decode()}}} functions that normally expect a parameter of type {{{str}}}.  It appears that on seeing a {{{unicode}}} parameter, the {{{decode()}}} functions "down-convert" it into {{{str}}}, then decode the result assuming it to be of their own coding.  It also appears that the "down-conversion" is performed using the {{{ASCII}}} encoder.  Hence a decoding failure inside an encoder.
+>>> u"a".encode("iso-8859-15")
'a'
>>> u"\u0411".encode("iso-8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/iso8859_15.py", line 12, in encode
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0411' in position 0: character maps to <undefined>

>>> iso_8859_15_encoder = codecs.getencoder("iso-8859-15")
>>> encode_to_iso_8859_15 = lambda s, iso_8859_15_encoder=iso_8859_15_encoder, errors="strict": \
...     iso_8859_15_encoder(s, errors)[0]
>>> encode_to_iso_8859_15(u"a\u0411b", errors="backslashreplace")
'a\\u0411b'
>>> encode_to_iso_8859_15(u"a\u0411b", errors="replace")
'a?b'
>>> encode_to_iso_8859_15(u"a\u0411b", errors="xmlcharrefreplace")
'a&#1041;b'
"""
}}}

Paradoxically, a {{{UnicodeEncodeError}}} may happen when _decoding_.  The cause of it seems to be the coding-specific {{{decode()}}} functions that normally expect a parameter of type {{{str}}}.  It appears that on seeing a {{{unicode}}} parameter, the {{{decode()}}} functions "down-convert" it into {{{str}}}, then decode the result assuming it to be of their own coding.  It also appears that the "down-conversion" is performed using the {{{ASCII}}} encoder.  Hence an encoding failure inside a decoder.
-Line 10:
+Line 34:
+Alternatively, a TypeError exception could always be thrown on receiving a {{{unicode}}} argument in {{{decode()}}} functions.  (This would require {{{stream.read()}}} to produce only {{{str}}} for StreamReader{{{.read()}}}.  The latter would only produce {{{unicode}}}).
-Line 12:
+Line 38:
+r"""
Decoding from str to unicode.
-Line 16:
+Line 45:
->>> u"\u0411".encode("utf-8")
'\xd0\x91'
>>> u"\u0411".decode("utf-8")
+>>> u"a".decode("utf-8")      # Unexpected argument type.
u'a'
>>> u"\u0411".decode("utf-8") # Unexpected argument type.
-Line 23:
+Line 52:
->>>
+"""
