Diff for "UnicodeDecodeError"

Differences between revisions 4 and 11 (spanning 7 versions)

A UnicodeDecodeError normally occurs when decoding a str string from a particular coding. Since a coding maps only a limited set of str sequences to unicode characters, an illegal sequence of str characters causes the coding-specific decode() to fail.

Paradoxically, a UnicodeDecodeError may also occur when _encoding_. The cause seems to be the coding-specific encode() functions, which normally expect a parameter of type unicode. It appears that, on seeing a str parameter, the encode() functions "up-convert" it to unicode before converting it to their own coding. It also appears that this "up-conversion" makes no assumption about the str parameter's coding and falls back on the default ascii decoder. Hence a decoding failure inside an encoder.
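
The implicit step described above can be spelled out by hand. The following is only a minimal sketch, written in Python 3 syntax so that it runs on a current interpreter: there, bytes plays the role of Python 2's str and str plays the role of unicode. The helper name encode_like_python2 is hypothetical and merely imitates the behaviour this page describes; it is not how the interpreter is actually implemented.

```python
def encode_like_python2(data, coding):
    """Imitate what Python 2's str.encode() appears to do (hypothetical helper)."""
    # Step 1: the implicit "up-conversion" with the default ascii decoder,
    # regardless of how the bytes were really encoded.
    text = data.decode("ascii")
    # Step 2: the coding-specific encoding the caller actually asked for.
    return text.encode(coding)


# Pure ASCII input survives both steps.
print(encode_like_python2(b"a", "utf-8"))

# Non-ASCII input fails in step 1, so an *encode* call reports a *decode* error.
try:
    encode_like_python2(b"\xd0\x91", "utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

Under this reading, the error names the ascii codec rather than the requested coding because the failure happens before the requested coding is ever consulted.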

Unlike the similar case with UnicodeEncodeError, such a failure cannot always be avoided, because the str result of encode() must be a legal coding-specific sequence. However, a more flexible treatment of an unexpected str argument could first validate the argument by attempting to decode it, and then return it unmodified if the validation succeeds. As of Python 2.5, this is not implemented.
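
The more flexible treatment suggested above might look roughly like the sketch below. It is again written in Python 3 syntax (bytes standing in for Python 2's str, str for unicode), and flexible_encode is a hypothetical name used for illustration, not an existing function.

```python
def flexible_encode(value, coding):
    """Encode text, or validate bytes that are already encoded (hypothetical helper)."""
    if isinstance(value, bytes):
        # Validation only: decode with the *target* coding and discard the result.
        # An illegal sequence still raises UnicodeDecodeError here.
        value.decode(coding)
        return value
    # The expected case: a text argument is encoded as usual.
    return value.encode(coding)


print(flexible_encode("\u0411", "utf-8"))       # text is encoded
print(flexible_encode(b"\xd0\x91", "utf-8"))    # valid bytes pass through unmodified
```

Even with such a helper, a call like flexible_encode(b"\x81", "utf-8") would still raise UnicodeDecodeError, which illustrates why the failure cannot always be avoided.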

    r"""
    Decoding from str to unicode.

    >>> "a".decode("utf-8")
    u'a'
    >>> "\x81".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "encodings/utf_8.py", line 16, in decode
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

    Encoding from unicode to str.

    >>> u"a".encode("utf-8")
    'a'
    >>> u"\u0411".encode("utf-8")
    '\xd0\x91'
    >>> "a".encode("utf-8")         # Unexpected argument type.
    'a'
    >>> "\xd0\x91".encode("utf-8")  # Unexpected argument type.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
    """

CategoryUnicode

- Revision 4 as of 2007-07-13 02:55:07
  Size: 891
  Editor: cscfpc15
  Comment:
+ Revision 11 as of 2007-07-13 03:49:14
  Size: 2038
  Editor: cscfpc15
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-Paradoxically, a {{{UnicodeDecodeError}}} happens when _encoding_.  The cause of it seems to be the encoding-specific {{{encode()}}} functions that normally expect a parameter of type {{{unicode}}}.  It appears that on seeing an {{{str}}} parameter, the {{{encode()}}} functions "up-convert" it into {{{unicode}}} before applying their own encoding.  It also appears that the "up-conversion" makes no assumption of {{{str}}} parameter's encoding, assuming it to be {{{ascii}}}.  Hence a decoding failure inside an encoder.
+The {{{UnicodeDecodeError}}} normally happens when decoding an {{{str}}} string from a certain coding.  Since codings map only a limited number of {{{str}}} strings to {{{unicode}}} characters, an illegal sequence of {{{str}}} characters will cause the coding-specific {{{decode()}}} to fail.

Paradoxically, a {{{UnicodeDecodeError}}} may happen when _encoding_.  The cause of it seems to be the coding-specific {{{encode()}}} functions that normally expect a parameter of type {{{unicode}}}.  It appears that on seeing an {{{str}}} parameter, the {{{encode()}}} functions "up-convert" it into {{{unicode}}} before converting to their own coding.  It also appears that such "up-conversion" makes no assumption of {{{str}}} parameter's coding, choosing a default {{{ascii}}} decoder.  Hence a decoding failure inside an encoder.

Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided.  This is because the {{{str}}} result of {{{encode()}}} must be a legal coding-specific sequence.  However, a more flexible treatment of an unexpected {{{str}}} argument might first validate the {{{str}}} argument by attempting to decode it, then return it unmodified if the validation was successful.  As of Python2.5, this is not implemented.
-Line 5:
+Line 9:
+r"""
Decoding from str to unicode.

>>> "a".decode("utf-8")
u'a'
>>> "\x81".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Encoding from unicode to str.

>>> u"a".encode("utf-8")
'a'
-Line 7:
+Line 26:
->>> u"\u0411".encode("utf-8")
'\xd0\x91'
>>> "a".encode("utf-8")
+>>> "a".encode("utf-8")         # Unexpected argument type.
-Line 11:
+Line 28:
->>> "\xd0\x91".encode("utf-8")
+>>> "\xd0\x91".encode("utf-8")  # Unexpected argument type.
