Differences between revisions 6 and 36 (spanning 30 versions)
Revision 6 as of 2007-04-02 16:12:40
Size: 3944
Editor: cscfpc15
Comment: There is a built-in print codec somewhere in the argument type coercion.
Revision 36 as of 2011-08-03 03:40:28
Size: 11000
Editor: 2001-44b8-317b-6500-e5fe-41f6-965e-3bce
Comment: Added clarification on solution for setting PYTHONIOENCODING and clarifications to Windows section (and fixed broken link syntax)
Line 1: Line 1:
== Issue ==
Line 12: Line 14:
That means you're using a legacy, limited, or misconfigured console. If you're just trying to play with unicode at the interactive prompt, move to a modern unicode-aware console. Most modern Python distributions come with IDLE, where you'll be able to print all unicode characters.

===== Standard Microsoft Windows console =====

By default, the console in Microsoft Windows is able to display 256 characters. Python will automatically detect which characters are supported by this console. If you try to print an unprintable character you will get {{{UnicodeEncodeError}}}.

===== Various UNIX consoles =====
This means that the Python console application cannot encode the given character in the console's encoding.

More specifically, the Python console application created an _io.TextIOWrapper instance with an encoding that cannot represent the given character.

'''sys.stdout''' --> '''_io.TextIOWrapper''' --> (your console)

To understand it more clearly, look at:
 * sys.stdout
 * sys.stdout.encoding -- /!\ note: cannot be altered!
 * sys.stdout.errors -- /!\ note: cannot be altered!


== Solutions ==

Set the environment variable "PYTHONIOENCODING" appropriately for the capabilities of your output console and your preferred error handling. This does not require changing your source code; however, other users of your code will need to set it correctly if their consoles have similar limitations. See [[http://docs.python.org/py3k/using/cmdline.html?highlight=pythonioencoding#PYTHONIOENCODING|PYTHONIOENCODING in the Python docs]].

"PYTHONIOENCODING=utf_8" can be used where [[http://daveagp.wordpress.com/2010/10/26/what-a-character/|output is destined for web platforms]] and utf-8 is the actual intended encoding of the rendered markup (i.e. where you're piping stdout to another process, not reading on the console). For human-readable output on the console, instead set "PYTHONIOENCODING" to "<code page>:<error handler>" where <code page> is your console's code page, e.g. "cp850:backslashreplace". Error handlers other than "backslashreplace" can be used; see [[http://docs.python.org/py3k/using/cmdline.html?highlight=pythonioencoding#PYTHONIOENCODING|the docs]].

Another solution is to use IDLE. IDLE can print all unicode characters.

Another is to put an intercept between sys.stdout and the text wrapper.

{{{
#!python
import sys


class StreamTee:
    """Intercept a stream.

    Invoke like so:
    sys.stdout = StreamTee(sys.stdout)

    See: grid 109 for notes on older version (StdoutTee).
    """

    def __init__(self, target):
        self.target = target

    def write(self, s):
        s = self.intercept(s)
        self.target.write(s)

    def flush(self):
        # Callers may flush sys.stdout, so forward the call to the target.
        self.target.flush()

    def intercept(self, s):
        """Pass-through -- override this."""
        return s


class SafeStreamFilter(StreamTee):
    """Convert string traffic to something safe."""

    def __init__(self, target):
        StreamTee.__init__(self, target)
        self.encoding = 'utf-8'
        self.errors = 'replace'
        # Fall back to ASCII when the target reports no encoding.
        self.encode_to = getattr(self.target, 'encoding', None) or 'ascii'

    def intercept(self, s):
        # Round-trip through the console encoding so that unencodable
        # characters are replaced instead of raising UnicodeEncodeError.
        return s.encode(self.encode_to, self.errors).decode(self.encode_to)


def console_mode():
    """Console mode."""
    sys.stdout = SafeStreamFilter(sys.stdout)
}}}

/!\ There's work yet to be done for this solution. For example, when you do help(''module-name''), ordinarily, it paginates. With this answer, there is no pagination.

/!\ This seems to work on one of my computers (Vista) but not on another (XP). I haven't looked into the differences between the two setups in detail.

=== Windows ===

By default, the console in Microsoft Windows can display only 256 characters: those of its OEM code page, for example cp437 ([[http://en.wikipedia.org/wiki/Code_page_437 | "Code page 437"]], the original IBM PC extended ASCII character set from 1981). The default code page depends on the system locale.

If you try to print an unprintable character you will get {{{UnicodeEncodeError}}}.

Setting the PYTHONIOENCODING environment variable as described above can be used to suppress the error messages. Setting to "utf-8" is not recommended as this produces an inaccurate, garbled representation of the output to the console. For best results, use your console's correct default codepage and a suitable error handler other than "strict".

=== Various UNIX consoles ===
Line 30: Line 101:

===== print, write, read and Unicode in pre-3.0 Python =====
== print, write and Unicode in pre-3.0 Python ==
Line 35: Line 105:
$ echo $LANG
en_CA.utf8

$ python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
[TYPING: абв ENTER]
<type 'str'> 7
абв

$ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
<type 'str'> 7
абв
$ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;' | cat
<type 'str'> 7
абв
}}}

Since programmers need to convert 8-bit input streams to Unicode and write Unicode to 8-bit output streams, the designers of the {{{print}}} statement built the required transformation into the argument type coercion routine.
 * When Python finds the output to be a terminal and sets the .encoding attributes, the {{{print}}} statement's handler will automatically convert {{{unicode}}} strings into {{{str}}} strings in the course of argument coercion.
 * When Python does not see the desired character set of the output, it sets .encoding to None, and {{{print}}}'s coercion will invoke the "ascii" codec.
  $ echo $LANG
  en_CA.utf8

  $ python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;'
  [TYPING: абв ENTER]
  <type 'str'> 7
  абв

  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;'
  <type 'str'> 7
  абв
  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;' | cat
  <type 'str'> 7
  абв
}}}

Since programmers need to display {{{unicode}}} strings, the designers of the {{{print}}} statement built the required transformation into it.
 * When Python finds its output attached to a terminal, it sets the {{{sys.stdout.encoding}}} attribute to the terminal's encoding. The {{{print}}} statement's handler will automatically encode {{{unicode}}} arguments into {{{str}}} output.
{{{
    $ python -c 'import sys; print sys.stdout.encoding; print u"\u0411\n"'
    UTF-8
    Б
}}}
 * When Python does not detect the desired character set of the output, it sets {{{sys.stdout.encoding}}} to None, and {{{print}}} will invoke the "ascii" codec.
{{{
    $ python -c 'import sys; print sys.stdout.encoding; print u"\u0411\n"' 2>&1 | cat
    None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
}}}

I (IL) understand the implementation of Python2.5's {{{print}}} statement as follows.
{{{
#!python
    # Pseudocode, not runnable: tty_enc and stringify_to_str_or_unicode
    # are placeholders for interpreter internals.

    # At Python startup.
    sys.stdout.encoding = tty_enc
    if tty_enc is not None:
        class_tty_enc_sw = codecs.getwriter(tty_enc)
    else:
        class_tty_enc_sw = None

    def print(*args):
        if class_tty_enc_sw is not None:
            eout = class_tty_enc_sw(sys.stdout)
        else:
            eout = None
        for arg in args:
            sarg = stringify_to_str_or_unicode(arg)
            if type(sarg) == str or eout is None:
                # Avoid coercion to unicode in eout.write().
                sys.stdout.write(sarg)
            else:
                eout.write(sarg)
}}}

 * At startup, Python will detect the encoding of the standard output and, probably, store the respective StreamWriter class definition. The {{{print}}} statement stringifies all its arguments to narrow {{{str}}} and wide {{{unicode}}} strings based on the width of the original arguments. Then {{{print}}} passes narrow strings to {{{sys.stdout}}} directly and wide strings to the instance of StreamWriter wrapped around {{{sys.stdout}}}.
 * If the user does not replace {{{sys.stdout}}} as shown below and Python does not detect an output encoding, the {{{write}}} method will coerce {{{unicode}}} values to {{{str}}} by invoking the ASCII codec (DefaultEncoding).

The {{{write}}} and {{{read}}} methods of Python's file objects do not invoke codecs internally. The {{{open}}} built-in in Python 2.5 sets the {{{.encoding}}} attribute of the resulting file instance to {{{None}}}.

Wrapping {{{sys.stdout}}} into an instance of StreamWriter will allow writing {{{unicode}}} data with {{{sys.stdout.write()}}} and {{{print}}}.
{{{
  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б
}}}

The {{{write}}} call executes {{{StreamWriter.write}}} which in turn invokes codec-specific {{{encode}}} and passes the result to the underlying file. It appears that the {{{print}}} statement will not fail due to the argument type coercion when {{{sys.stdout}}} is wrapped. My (IL's) understanding of {{{print}}}'s implementation above agrees with that.

== read and Unicode in pre-3.0 Python ==
Line 57: Line 193:
    uline = typed_read(unicode) # Generally, a list of input data types along with an optional parsing format line.
}}}



See also: ["Unicode"]
  line = typed_read(unicode) # Generally, a list of input data types along with an optional parsing format line.
}}}

The {{{print}}} statement encodes {{{unicode}}} strings to {{{str}}} strings. One can complement this by decoding {{{str}}} input data into {{{unicode}}} strings in {{{sys.stdin.read/readline}}}. To do this, wrap {{{sys.stdin}}} in a StreamReader instance:
{{{
  $ python -c 'import sys, codecs, locale; \
    print sys.stdin.encoding; \
    sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); \
    line = sys.stdin.readline(); print type(line), len(line)' 2>&1
  UTF-8
  [TYPING: абв ENTER]
  <type 'unicode'> 4
  $ echo "абв" | python -c 'import sys, codecs, locale; \
    print sys.stdin.encoding; \
    sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); \
    line = sys.stdin.readline(); print type(line), len(line)'
  None
  <type 'unicode'> 4
}}}



See also: [[Unicode]]

Issue

If you try to print a unicode string to the console and get a message like this one:

>>> print u"\u03A9"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\encodings\cp866.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03a9' in position 0: character maps to <undefined>

This means that the Python console application cannot encode the given character in the console's encoding.

More specifically, the Python console application created an _io.TextIOWrapper instance with an encoding that cannot represent the given character.

sys.stdout --> _io.TextIOWrapper --> (your console)

To understand it more clearly, look at:

  • sys.stdout
  • sys.stdout.encoding -- /!\ note: cannot be altered!

  • sys.stdout.errors -- /!\ note: cannot be altered!
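
The attributes listed above can be inspected from code, and a character can be tested against the console encoding before it is printed. The following is only a sketch for Python 3; the helper names can_encode and describe_stdout are made up for this example and are not part of any library.

```python
import sys


def can_encode(text, encoding, errors="strict"):
    """Return True if `text` can be encoded with `encoding` and `errors`."""
    try:
        text.encode(encoding, errors)
    except UnicodeEncodeError:
        return False
    return True


def describe_stdout():
    """Collect the type, encoding and error handler of the current sys.stdout."""
    stream = sys.stdout
    return {
        "type": type(stream).__name__,
        "encoding": getattr(stream, "encoding", None),
        "errors": getattr(stream, "errors", None),
    }


info = describe_stdout()
print(info)
# Test the character first instead of waiting for UnicodeEncodeError.
# A stream without an encoding is treated as ASCII here (an assumption).
print(can_encode("\u03a9", info["encoding"] or "ascii"))
```

If the last line prints False, the console encoding cannot represent the character and one of the solutions below is needed.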

Solutions

Set the environment variable "PYTHONIOENCODING" appropriately for the capabilities of your output console and your preferred error handling. This does not require changing your source code; however, other users of your code will need to set it correctly if their consoles have similar limitations. See PYTHONIOENCODING in the Python docs.

"PYTHONIOENCODING=utf_8" can be used where output is destined for web platforms and utf-8 is the actual intended encoding of the rendered markup (i.e. where you're piping stdout to another process, not reading on the console). For human-readable output on the console, instead set "PYTHONIOENCODING" to "<code page>:<error handler>" where <code page> is your console's code page, e.g. "cp850:backslashreplace". Error handlers other than "backslashreplace" can be used; see the docs.

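To get a feeling for what the "<error handler>" part does, the same handlers can be tried directly with str.encode in Python 3. This is only an illustration: the helper name render is made up, and cp850 is used because it is the example code page from the text. The Cyrillic letter U+0411 is not part of cp850.

```python
def render(text, code_page, handler):
    """Return what a console using `code_page` would show for `text`."""
    return text.encode(code_page, handler).decode(code_page)


sample = "B is \u0411"  # U+0411 cannot be represented in cp850

for handler in ("replace", "backslashreplace", "xmlcharrefreplace", "ignore"):
    print(handler, "->", render(sample, "cp850", handler))

# The default handler, "strict", raises instead of substituting.
try:
    render(sample, "cp850", "strict")
except UnicodeEncodeError as exc:
    print("strict ->", type(exc).__name__)
```

With "backslashreplace" the reader can still tell which character was meant, which is why it is a reasonable choice for human-readable console output.
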
Another solution is to use IDLE. IDLE can print all unicode characters.

Another is to put an intercept between sys.stdout and the text wrapper.

  import sys


  class StreamTee:
      """Intercept a stream.

      Invoke like so:
      sys.stdout = StreamTee(sys.stdout)

      See: grid 109 for notes on older version (StdoutTee).
      """

      def __init__(self, target):
          self.target = target

      def write(self, s):
          s = self.intercept(s)
          self.target.write(s)

      def flush(self):
          # Callers may flush sys.stdout, so forward the call to the target.
          self.target.flush()

      def intercept(self, s):
          """Pass-through -- override this."""
          return s


  class SafeStreamFilter(StreamTee):
      """Convert string traffic to something safe."""

      def __init__(self, target):
          StreamTee.__init__(self, target)
          self.encoding = 'utf-8'
          self.errors = 'replace'
          # Fall back to ASCII when the target reports no encoding.
          self.encode_to = getattr(self.target, 'encoding', None) or 'ascii'

      def intercept(self, s):
          # Round-trip through the console encoding so that unencodable
          # characters are replaced instead of raising UnicodeEncodeError.
          return s.encode(self.encode_to, self.errors).decode(self.encode_to)


  def console_mode():
      """Console mode."""
      sys.stdout = SafeStreamFilter(sys.stdout)

/!\ There's work yet to be done for this solution. For example, when you do help(module-name), ordinarily, it paginates. With this answer, there is no pagination.

/!\ This seems to work on one of my computers (Vista) but not on another (XP). I haven't looked into the differences between the two setups in detail.
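
The intercept idea can be tried without a real console. The sketch below is for Python 3 and uses two made-up classes: AsciiTarget stands in for a console stream that accepts only ASCII, and SafeWriter is a compact variant of the filter above. Neither name comes from the original code.

```python
import io


class AsciiTarget:
    """Hypothetical stand-in for a console stream that only accepts ASCII."""

    encoding = "ascii"

    def __init__(self):
        self.buffer = io.StringIO()

    def write(self, s):
        # Behave like a strict console: raise on unencodable characters.
        s.encode(self.encoding)
        return self.buffer.write(s)


class SafeWriter:
    """Sanitize text for the target's encoding, then forward it."""

    def __init__(self, target, errors="replace"):
        self.target = target
        self.errors = errors

    def write(self, s):
        enc = self.target.encoding
        return self.target.write(s.encode(enc, self.errors).decode(enc))


target = AsciiTarget()
SafeWriter(target).write("Omega: \u03a9\n")
print(target.buffer.getvalue(), end="")
```

Writing the same string to AsciiTarget directly raises UnicodeEncodeError; through the wrapper the unencodable character arrives as "?".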

Windows

By default, the console in Microsoft Windows can display only 256 characters: those of its OEM code page, for example cp437 ("Code page 437", the original IBM PC extended ASCII character set from 1981). The default code page depends on the system locale.

If you try to print an unprintable character you will get UnicodeEncodeError.

Setting the PYTHONIOENCODING environment variable as described above can be used to suppress the error messages. Setting to "utf-8" is not recommended as this produces an inaccurate, garbled representation of the output to the console. For best results, use your console's correct default codepage and a suitable error handler other than "strict".

Various UNIX consoles

There is no standard way to query a UNIX console to find out which characters it supports, but fortunately there is a way to find out which characters are considered printable. The locale category LC_CTYPE defines which characters are printable. To find out its value, type at the Python prompt:

  >>> import locale
  >>> locale.getdefaultlocale()[1]
  'utf-8'

If you get any other value, you will not be able to print all unicode characters. As soon as you try to print an unprintable character you will get UnicodeEncodeError. To fix this, set the environment variable LANG to one of the unicode locales supported by your system. To get the full list of locales, run the command "locale -a" and look for locales whose names end with ".utf8" or ".UTF-8". If you have set LANG but now see garbage on your screen instead of UnicodeEncodeError, you need to set up your terminal to use a unicode font. Consult your terminal's manual on how to do that.
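
Because the encoding name can be spelled in several ways ("utf-8", "UTF-8", "utf8"), comparing strings by hand is fragile. A small sketch for Python 3 that normalizes the name through the codecs registry follows. The helper name is_utf8 is made up, and locale.getpreferredencoding is used here as an alternative to the getdefaultlocale call shown above.

```python
import codecs
import locale


def is_utf8(encoding_name):
    """Return True if `encoding_name` is any alias of UTF-8."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown codec names are certainly not UTF-8.
        return False


preferred = locale.getpreferredencoding(False)
print(preferred, "is UTF-8" if is_utf8(preferred) else "is not UTF-8")
```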

print, write and Unicode in pre-3.0 Python

Because file operations are 8-bit clean, reading data from the original stdin will return str's containing data in the input character set. Writing these str's to stdout without any codecs will result in the output identical to the input.

  $ echo $LANG
  en_CA.utf8

  $ python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;'
  [TYPING: абв ENTER]
  <type 'str'> 7
  абв

  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;'
  <type 'str'> 7
  абв
  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print type(line), len(line); print line;' | cat
  <type 'str'> 7
  абв
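
The length of 7 reported above is a byte count: three Cyrillic letters at two bytes each in UTF-8, plus the newline. A rough Python 3 counterpart, with in-memory byte streams standing in for the pipe (an assumption made so the example runs anywhere), shows the same 8-bit clean behaviour:

```python
import io

# Bytes that a pipe delivers for `echo "абв"` in a UTF-8 locale.
raw_in = io.BytesIO("\u0430\u0431\u0432\n".encode("utf-8"))
raw_out = io.BytesIO()

line = raw_in.readline()   # bytes in Python 3, the counterpart of a Python 2 str
raw_out.write(line)        # no codec involved: written exactly as read

print(type(line).__name__, len(line))   # 3 two-byte characters + newline
print(raw_out.getvalue() == line)
```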

Since programmers need to display unicode strings, the designers of the print statement built the required transformation into it.

  • When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

    $ python -c 'import sys; print sys.stdout.encoding; print u"\u0411\n"'
    UTF-8
    Б
  • When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.

    $ python -c 'import sys; print sys.stdout.encoding; print u"\u0411\n"' 2>&1 | cat
    None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
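
The two cases above can be reproduced in Python 3 without a terminal by building the text layer by hand. This is only a sketch: write_through is a made-up helper, and io.BytesIO stands in for the real output file.

```python
import io


def write_through(text, encoding):
    """Write `text` via a text wrapper and return the bytes that came out."""
    raw = io.BytesIO()
    # newline="\n" keeps the line ending the same on every platform.
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()


print(write_through("\u0411\n", "utf-8"))

try:
    write_through("\u0411\n", "ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)
```

The first call corresponds to the terminal case with a UTF-8 encoding, the second to the piped case where the "ascii" codec is used.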

I (IL) understand the implementation of Python2.5's print statement as follows.

    # Pseudocode, not runnable: tty_enc and stringify_to_str_or_unicode
    # are placeholders for interpreter internals.

    # At Python startup.
    sys.stdout.encoding = tty_enc
    if tty_enc is not None:
        class_tty_enc_sw = codecs.getwriter(tty_enc)
    else:
        class_tty_enc_sw = None

    def print(*args):
        if class_tty_enc_sw is not None:
            eout = class_tty_enc_sw(sys.stdout)
        else:
            eout = None
        for arg in args:
            sarg = stringify_to_str_or_unicode(arg)
            if type(sarg) == str or eout is None:
                # Avoid coercion to unicode in eout.write().
                sys.stdout.write(sarg)
            else:
                eout.write(sarg)

  • At startup, Python will detect the encoding of the standard output and, probably, store the respective StreamWriter class definition. The print statement stringifies all its arguments to narrow str and wide unicode strings based on the width of the original arguments. Then print passes narrow strings to sys.stdout directly and wide strings to the instance of StreamWriter wrapped around sys.stdout.

  • If the user does not replace sys.stdout as shown below and Python does not detect an output encoding, the write method will coerce unicode values to str by invoking the ASCII codec (DefaultEncoding).

The write and read methods of Python's file objects do not invoke codecs internally. The open built-in in Python 2.5 sets the .encoding attribute of the resulting file instance to None.

Wrapping sys.stdout into an instance of StreamWriter will allow writing unicode data with sys.stdout.write() and print.

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б

The write call executes StreamWriter.write which in turn invokes codec-specific encode and passes the result to the underlying file. It appears that the print statement will not fail due to the argument type coercion when sys.stdout is wrapped. My (IL's) understanding of print's implementation above agrees with that.
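
The chain described above (StreamWriter.write, then the codec's encode, then the underlying file) can be observed in Python 3 with an in-memory byte stream in place of the real output. This is a sketch only; on Python 3 the object to wrap would typically be sys.stdout.buffer rather than sys.stdout.

```python
import codecs
import io

raw = io.BytesIO()                        # stands in for the underlying file
writer = codecs.getwriter("utf-8")(raw)   # StreamWriter wrapped around it

writer.write("\u0411\n")                  # encode, then pass bytes to raw.write
print(raw.getvalue())

# A StreamWriter also accepts an error handler for limited encodings.
raw_ascii = io.BytesIO()
codecs.getwriter("ascii")(raw_ascii, errors="backslashreplace").write("\u0411\n")
print(raw_ascii.getvalue())
```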

read and Unicode in pre-3.0 Python

I (IL) believe reading from stdin does not involve coercion at all because the existing ways to read from stdin such as "for line in sys.stdin" do not convey the expected type of the returned value to the stdin handler. A function that would complement the print statement might look like this:

  line = typed_read(unicode)   # Generally, a list of input data types along with an optional parsing format line.

The print statement encodes unicode strings to str strings. One can complement this by decoding str input data into unicode strings in sys.stdin.read/readline. To do this, wrap sys.stdin in a StreamReader instance:

  $ python -c 'import sys, codecs, locale; \
    print sys.stdin.encoding; \
    sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); \
    line = sys.stdin.readline(); print type(line), len(line)' 2>&1
  UTF-8
  [TYPING: абв ENTER]
  <type 'unicode'> 4
  $ echo "абв" | python -c 'import sys, codecs, locale; \
    print sys.stdin.encoding; \
    sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); \
    line = sys.stdin.readline(); print type(line), len(line)'
  None
  <type 'unicode'> 4

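The length of 4 reported above counts characters (three letters plus the newline), not bytes. The same decoding step can be sketched in Python 3 with an in-memory byte stream standing in for the piped stdin (an assumption made so the example does not need a terminal):

```python
import codecs
import io

# Bytes as they would arrive from `echo "абв"` in a UTF-8 locale.
raw = io.BytesIO("\u0430\u0431\u0432\n".encode("utf-8"))

reader = codecs.getreader("utf-8")(raw)   # StreamReader wrapped around the bytes
line = reader.readline()                  # decoded text, newline kept

print(type(line).__name__, len(line))
```
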
See also: Unicode


CategoryUnicode

PrintFails (last edited 2012-11-25 11:32:18 by techtonik)
