The UnicodeDecodeError normally happens when decoding a str string from a certain coding. Since a coding maps only a limited set of str byte sequences to unicode characters, an illegal sequence of str characters causes the coding-specific decode() to fail.
Paradoxically, a UnicodeDecodeError may also happen when _encoding_. The cause seems to be that the coding-specific encode() functions normally expect a parameter of type unicode. On seeing a str parameter, the encode() functions apparently "up-convert" it into unicode before converting it to their own coding. This "up-conversion" makes no assumption about the str parameter's coding and uses the default ascii decoder. Hence a decoding failure inside an encoder.
Unlike the similar case with UnicodeEncodeError, such a failure cannot be avoided, because the str result of encode() must be a legal sequence in the target coding. However, a more flexible treatment of an unexpected str argument might first validate the str argument by attempting to decode it, then return it unmodified if the validation was successful.
>>> u"a".encode("utf-8")
'a'
>>> u"\u0411".encode("utf-8")
'\xd0\x91'
>>> "a".encode("utf-8")
'a'
>>> "\xd0\x91".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
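The "validate, then return unmodified" treatment suggested above can be sketched as follows. This is a hypothetical helper, not anything offered by the standard library, and it is written in Python 3 terms, where bytes plays the role of Python 2's str, so the idea is explicit rather than hidden inside encode():

```python
def flexible_encode(s, coding="utf-8"):
    """Return s encoded as bytes in the given coding.

    Hypothetical sketch: if s is already a bytes object, validate it by
    attempting to decode it in the target coding and return it unmodified,
    instead of up-converting it through the default ascii decoder.
    """
    if isinstance(s, bytes):
        s.decode(coding)  # raises UnicodeDecodeError on an illegal sequence
        return s
    return s.encode(coding)

print(flexible_encode("\u0411"))     # text input is encoded: b'\xd0\x91'
print(flexible_encode(b"\xd0\x91"))  # valid UTF-8 bytes pass through unchanged
```

An illegal sequence such as b"\xd0" still raises UnicodeDecodeError during validation, which matches the constraint that the result must be a legal sequence in the target coding.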