
If you try to print a unicode string to console and get a message like this one:

  >>> print u"\u03A9"
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    File "C:\Python24\lib\encodings\cp866.py", line 18, in encode
      return codecs.charmap_encode(input,errors,encoding_map)
  UnicodeEncodeError: 'charmap' codec can't encode character u'\u03a9' in position 0: character maps to <undefined>

That means you're using a legacy, limited, or misconfigured console. If you're just trying to play with unicode at the interactive prompt, move to a modern unicode-aware console. Most modern Python distributions come with IDLE, where you'll be able to print all unicode characters.

Standard Microsoft Windows console

By default, the console in Microsoft Windows can display only 256 characters. Python will automatically detect which characters this console supports. If you try to print an unsupported character, you will get a UnicodeEncodeError.
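
The failure can be reproduced with a charmap codec directly, no console involved. A minimal sketch in Python 3 syntax (so it runs today; in pre-3.0 Python the literals would be u"..." strings):

```python
# cp866 is the DOS Cyrillic code page used by the Russian Windows console.
# Characters present in its 256-entry table encode fine:
assert "\u0411".encode("cp866") == b"\x81"   # CYRILLIC CAPITAL LETTER BE

# Characters absent from the table raise UnicodeEncodeError, exactly as
# printing them to such a console does:
err = None
try:
    "\u03a9".encode("cp866")                 # GREEK CAPITAL LETTER OMEGA
except UnicodeEncodeError as exc:
    err = exc
print(err)   # 'charmap' codec can't encode character '\u03a9' ...
```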

Various UNIX consoles

There is no standard way to query a UNIX console to find out which characters it supports, but fortunately there is a way to find out which characters are considered printable. The locale category LC_CTYPE defines which characters are printable. To find out its value, type at the Python prompt:

   >>> import locale
   >>> locale.getdefaultlocale()[1]
   'utf-8'

If you got any other value, you won't be able to print all unicode characters: as soon as you try to print an unsupported character, you will get a UnicodeEncodeError. To fix this situation, set the environment variable LANG to one of the unicode locales supported by your system. To get the full list of locales, use the command "locale -a" and look for locales that end with ".utf-8". If you have set the LANG variable but now see garbage on your screen instead of a UnicodeEncodeError, you need to set up your terminal to use a unicode font. Consult your terminal's manual on how to do that.

print, write and Unicode in pre-3.0 Python

Because file operations are 8-bit clean, reading data from the original stdin returns str values containing data in the input character set. Writing these str values to stdout without any codecs produces output identical to the input.

  $ echo $LANG
  en_CA.utf8

  $ python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
  [TYPING: абв ENTER]
  <type 'str'> 7
  абв

  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
  <type 'str'> 7
  абв
  $ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;' | cat
  <type 'str'> 7
  абв
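
The lengths printed above follow directly from UTF-8: each of the three Cyrillic letters occupies two bytes, plus one byte for the newline. A quick check (Python 3 syntax; in pre-3.0 Python the same byte string is what sys.stdin.readline() returns):

```python
text = "абв\n"                       # the typed line, as characters
raw = text.encode("utf-8")           # the 8-bit data the terminal actually sends

assert len(raw) == 7                 # 3 letters x 2 bytes + 1 newline byte
assert raw.decode("utf-8") == text   # writing the bytes back reproduces the input
```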

Since programmers need to convert 8-bit input streams to Unicode and write Unicode to 8-bit output streams, the designers of the print statement built the required transformation into it.

  • When Python finds its output attached to a terminal, it sets the sys.stdout.encoding attribute to the terminal's encoding. The print statement's handler will automatically encode unicode arguments into str output.

    $ python -c 'import sys; print str(sys.stdout.encoding); print u"\u0411\n"'
    UTF-8
    Б
  • When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.

    $ python -c 'import sys; print str(sys.stdout.encoding); print u"\u0411\n"' 2>&1 | cat
    None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)
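
The ascii codec accepts only code points below 128, so any non-ASCII character triggers the same error; it can be reproduced without a pipe (Python 3 syntax; the pre-3.0 equivalent uses u"..." literals):

```python
# The ascii codec covers code points 0..127 only.
assert "B".encode("ascii") == b"B"

err = None
try:
    "\u0411".encode("ascii")   # CYRILLIC CAPITAL LETTER BE, code point 0x411
except UnicodeEncodeError as exc:
    err = exc
print(err)   # 'ascii' codec can't encode character '\u0411' ...
```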

Here is my (IL's) understanding of the implementation of Python 2.5's print statement.

  #!python
  # At Python startup.
  eout = codecs.open(sys.stdout, "w+", encoding = tty_enc or None)

  # The implementation of the print statement:
  def print(*args):
      global eout
      for arg in args:
          sarg = stringify_to_str_or_unicode(arg)
          if type(sarg) == str:
              # Avoid coercion to unicode in eout.write().
              sys.stdout.write(sarg)
          else:                # type(sarg) == unicode
              eout.write(sarg)

  • The codecs.open() call will return either the original sys.stdout or an instance of StreamReaderWriter wrapping it, depending on whether the terminal's encoding was detected at startup. The print statement stringifies all its arguments to narrow str and wide unicode strings based on the width of the original arguments. Then print passes the strings to the eout.write() handler.
  • If Python detects the encoding, eout is a StreamWriter that will call the respective codec's encode method on wide strings.
  • Conversely, if Python does not detect the encoding at startup, eout equals sys.stdout, and its .write() method will coerce unicode values to str.

The write and read methods of Python's file objects do not invoke codecs internally. Python 2.5's open built-in sets the .encoding attribute of the resulting file to None.

Wrapping sys.stdout into an instance of StreamWriter will allow writing unicode data with sys.stdout.write() and print.

  $ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); line = u"\u0411\n"; print type(line), len(line); sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); line = u"\u0411\n"; print type(line), len(line); sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б

The write call will execute StreamWriter.write, which in turn invokes the codec-specific encode and passes the result to the underlying file. It appears that the print statement will not fail due to argument type coercion when sys.stdout is wrapped. One can explain this behaviour by assuming that print hands unicode arguments directly to sys.stdout.write(), so the wrapping StreamWriter encodes them before any coercion to str can be attempted.
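
This encode-on-write path can be checked with an in-memory byte stream standing in for the underlying file (Python 3 syntax; codecs.getwriter exists in pre-3.0 Python as well):

```python
import codecs
import io

raw = io.BytesIO()                        # stands in for the 8-bit output file
writer = codecs.getwriter("utf-8")(raw)   # StreamWriter wrapping it

writer.write("\u0411\n")                  # StreamWriter.write -> utf-8 encode
assert raw.getvalue() == b"\xd0\x91\n"    # encoded bytes reach the underlying file
```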

read and Unicode in pre-3.0 Python

I (IL) believe reading from stdin does not involve coercion at all because the existing ways to read from stdin such as "for line in sys.stdin" do not convey the expected type of the returned value to the stdin handler. A function that would complement the print statement might look like this:

  uline = typed_read(unicode)   # Generally, a list of input data types along with an optional parsing format string.

To complement print statement's coercion of unicode strings with decoding of input data into unicode strings in sys.stdin.read/readline, one can wrap sys.stdin into a StreamReader instance:

  $ python -c 'import sys, codecs, locale; print str(sys.stdin.encoding); sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); line = sys.stdin.readline(); print type(line), len(line)' 2>&1
  UTF-8
  [TYPING: абв ENTER]
  <type 'unicode'> 4
  $ echo "абв" | python -c 'import sys, codecs, locale; print str(sys.stdin.encoding); sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin); line = sys.stdin.readline(); print type(line), len(line)'
  None
  <type 'unicode'> 4
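
The decoding path mirrors the encoding one and can likewise be checked with an in-memory stream instead of stdin (Python 3 syntax; pre-3.0 codecs.getreader behaves analogously):

```python
import codecs
import io

raw = io.BytesIO("абв\n".encode("utf-8"))   # stands in for the 8-bit stdin
reader = codecs.getreader("utf-8")(raw)     # StreamReader wrapping it

line = reader.readline()
assert line == "абв\n"
assert len(line) == 4                       # 3 letters + newline, in code points
```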

See also: Unicode


CategoryUnicode

PrintFails (last edited 2012-11-25 11:32:18 by techtonik)
