Differences between revisions 5 and 6
Revision 5 as of 2007-03-11 02:27:51
Size: 2029
Editor: MatsWichmann
Comment: typo
Revision 6 as of 2007-04-02 16:12:40
Size: 3944
Editor: cscfpc15
Comment: There is a built-in print codec somewhere in the argument type coercion.

If you try to print a Unicode string to the console and get a message like this one:

>>> print u"\u03A9"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\encodings\cp866.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03a9' in position 0: character maps to <undefined>

That means you're using a legacy, limited, or misconfigured console. If you're just trying to play with Unicode at the interactive prompt, move to a modern Unicode-aware console. Most modern Python distributions come with IDLE, where you'll be able to print all Unicode characters.

Standard Microsoft Windows console

By default, the console in Microsoft Windows can display only 256 characters (a single 8-bit code page). Python automatically detects which characters this console supports. If you try to print a character outside that set, you will get a UnicodeEncodeError.
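
To see ahead of time whether a given code page can represent a string, you can try encoding it yourself. The following is a minimal sketch, not part of the original page: the helper name encodable is hypothetical, and cp866 and cp437 are used only as sample code pages. It is written so that it should run on both pre-3.0 and current Python.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


omega = u"\u03a9"                 # GREEK CAPITAL LETTER OMEGA
cyrillic = u"\u0430\u0431\u0432"  # the letters used in the examples on this page

# cp866 is a Cyrillic code page, cp437 is the original IBM PC code page.
for name in ("cp866", "cp437"):
    print("%s: omega=%s cyrillic=%s" % (
        name, encodable(omega, name), encodable(cyrillic, name)))
```

A character that fails this check is the kind that triggers the UnicodeEncodeError shown above when it is printed to a console using that code page.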

Various UNIX consoles

There is no standard way to query a UNIX console to find out which characters it supports, but fortunately there is a way to find out which characters are considered printable: the locale category LC_CTYPE defines them. To find out its value, type at the Python prompt:

>>> import locale
>>> locale.getdefaultlocale()[1]
'utf-8'

If you get any other value, you won't be able to print all Unicode characters: as soon as you try to print an unprintable character you will get a UnicodeEncodeError. To fix this, set the environment variable LANG to one of the Unicode locales supported by your system. To get the full list of locales, run the command "locale -a" and look for names that end with ".utf-8" (the suffix is often spelled ".utf8" or ".UTF-8"). If you have set LANG and now see garbage on your screen instead of a UnicodeEncodeError, you need to set up your terminal to use a Unicode font. Consult your terminal's manual on how to do that.
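
The same check can be scripted. The sketch below is an illustration under stated assumptions rather than the page's own method: it uses locale.getpreferredencoding instead of locale.getdefaultlocale, and the helper name is_utf8_name is hypothetical. Going through codecs.lookup means spellings such as "UTF-8" and "utf8" are treated alike.

```python
import codecs
import locale


def is_utf8_name(name):
    """Return True if name is any spelling of the UTF-8 codec."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown codec names are treated as "not UTF-8".
        return False


# Encoding Python derives from the current locale settings;
# fall back to "ascii" if nothing is reported.
current = locale.getpreferredencoding(False) or "ascii"
print("locale encoding: %s, UTF-8: %s" % (current, is_utf8_name(current)))
```

If the last line reports False, setting LANG as described above is the first thing to try.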

print, write, read and Unicode in pre-3.0 Python

Because file operations are 8-bit clean, reading data from the original stdin returns str objects containing bytes in the input character set. Writing these str objects to stdout without any codecs produces output identical to the input.

$ echo $LANG
en_CA.utf8

$ python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
[TYPING: абв ENTER]
<type 'str'> 7
абв

$ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;'
<type 'str'> 7
абв
$ echo "абв" | python -c 'import sys; line = sys.stdin.readline(); print str(type(line)), len(line); print line;' | cat
<type 'str'> 7
абв
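
The length of 7 in the session above is a byte count: three two-byte UTF-8 characters plus the newline. The following sketch reproduces that with in-memory byte streams standing in for stdin and stdout (an assumption made so that it runs without a terminal); the helper name copy_line is hypothetical.

```python
import io


def copy_line(src, dst):
    """Read one line of raw bytes from src and write it to dst unchanged."""
    line = src.readline()
    dst.write(line)
    return line


raw = u"\u0430\u0431\u0432\n".encode("utf-8")  # the three letters plus newline
src, dst = io.BytesIO(raw), io.BytesIO()

line = copy_line(src, dst)
print(len(line))                  # number of bytes, as in the session above
print(len(line.decode("utf-8")))  # number of characters after decoding
print(dst.getvalue() == raw)      # True: no codec touched the data
```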

Since programmers need to convert 8-bit input streams to Unicode and write Unicode to 8-bit output streams, the designers of the print statement built the required transformation into the argument type coercion routine.

  • When Python finds that the output is a terminal, it sets the .encoding attribute of sys.stdout, and the print statement's handler automatically converts unicode strings into str strings in the course of argument coercion.

  • When Python cannot determine the character set of the output (for example, when stdout is redirected to a pipe or a file), it sets .encoding to None, and print's coercion invokes the "ascii" codec.
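
The two cases above can be emulated in a few lines. This is only a sketch of the behaviour as described here, not the interpreter's actual implementation; the function name coerce_for_print and its stream_encoding parameter are hypothetical.

```python
def coerce_for_print(value, stream_encoding):
    """Emulate the coercion described above for a single print argument."""
    text_type = type(u"")
    if isinstance(value, text_type):
        # No known output encoding: fall back to the "ascii" codec.
        return value.encode(stream_encoding or "ascii")
    return value  # byte strings pass through untouched


omega = u"\u03a9"
print(repr(coerce_for_print(omega, "utf-8")))  # terminal with a known encoding

try:
    coerce_for_print(omega, None)              # encoding unknown, e.g. a pipe
except UnicodeEncodeError as exc:
    print("ascii fallback failed: %s" % exc.reason)
```

This is why a program that prints unicode strings correctly on a terminal may raise UnicodeEncodeError as soon as its output is piped, while one that prints str objects, like the examples above, does not.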

I (IL) believe reading from stdin does not involve coercion at all, because the existing ways to read from stdin, such as "for line in sys.stdin", do not convey the expected type of the returned value to the stdin handler. A function that would complement the print statement might look like this:

    uline = typed_read(unicode)   # Generally, a list of input data types along with an optional parsing format line.
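
No such function exists in the standard library. Purely as an illustration of the idea, a much reduced version could look like the sketch below. Everything about it is an assumption: the stream and encoding parameters, the default of UTF-8, and the restriction to a single type instead of a list of types with a format line.

```python
import io

text_type = type(u"")  # unicode on pre-3.0 Python, str on 3.x


def typed_read(kind, stream, encoding="utf-8"):
    """Hypothetical reader: return one line from stream as the requested type."""
    line = stream.readline()      # raw bytes, exactly as they arrived
    if kind is text_type:
        return line.decode(encoding)
    return line                   # any other type: hand back the bytes


raw = u"\u0430\u0431\u0432\n".encode("utf-8")
uline = typed_read(text_type, io.BytesIO(raw))
print(len(raw))    # bytes read
print(len(uline))  # characters returned
```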

See also: Unicode


CategoryUnicode

PrintFails (last edited 2012-11-25 11:32:18 by techtonik)