Diff for "Python3kStringRepr"

Differences between revisions 1 and 9 (spanning 8 versions)

PEP:
Title: String representation in Python 3000
Version: $Revision$
Last-Modified: $Date$
Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created:
Post-History:

Abstract

This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converts arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered 'printable'.

Motivation

The current repr() converts 8-bit strings to ASCII using the following algorithm.

Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII characters(>=0x80) to '\xXX'.
Backslash-escape quote characters(apostrophe, ') and add the quote character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

Convert leading surrogate pair characters without trailing character(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
Convert 16-bit characters(>=0x100) to '\uXXXX'.
Convert 21-bit characters(>=0x10000) and surrogate pair characters to '\U00xxxxxx'.
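The escaping rules listed above can be sketched in a few lines of Python. This is only an illustration, not the interpreter's implementation: the name legacy_repr is hypothetical, and it assumes a build where each element of a string is a full code point, so the surrogate-pair rules do not come into play.

```python
def legacy_repr(s):
    """Approximate the pre-3000 escaping rules (illustrative sketch only)."""
    # CR, LF, TAB, backslash and the apostrophe get short escapes.
    simple = {"\r": "\\r", "\n": "\\n", "\t": "\\t", "\\": "\\\\", "'": "\\'"}
    out = []
    for ch in s:
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif 0x20 <= cp < 0x7f:
            out.append(ch)                # printable ASCII is kept as-is
        elif cp < 0x100:
            out.append("\\x%02x" % cp)    # control characters and 0x80-0xff
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)    # 16-bit characters
        else:
            out.append("\\U%08x" % cp)    # characters above 0xffff
    return "'" + "".join(out) + "'"


print(legacy_repr("日本語\n"))  # '\u65e5\u672c\u8a9e\n'
```

Every character of the Japanese text is replaced by an escape, which is the inconvenience this PEP is concerned with.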

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string's characters are ASCII. But for other languages, such as Japanese, where most characters in a string are not ASCII, this is very inconvenient. Python 3000 has a lot of nice features for non-Latin users, such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.

Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won't mess it up.

This issue was once raised by Hye-Shik Chang [1], but the proposal was rejected.

Specification

Add Python API int PY_UNICODE_ISPRINTABLE(Py_UNICODE ch). PY_UNICODE_ISPRINTABLE() returns 0 if repr() should escape the Unicode character ch, and 1 otherwise. The characters that should be escaped are:
- Characters defined in the Unicode character database as "Other" (Cc, Cf, Cs, Co, Cn).
- Characters defined in the Unicode character database as "Separator" (Zl, Zp, Zs) other than ASCII space (0x20).
The algorithm to build repr() strings should be changed to:
- Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
- Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to '\xXX'.
- Convert leading surrogate pair characters without a trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
- Convert non-printable characters (PY_UNICODE_ISPRINTABLE() returns 0) to '\xXX', '\uXXXX' or '\U00xxxxxx'.
- Backslash-escape quote characters (apostrophe, ') and add the quote character at the beginning and the end.
Set the Unicode error-handler for sys.stderr to 'backslashreplace' by default.
Set the Unicode error-handler for sys.stdout in the Python interactive session to 'backslashreplace' by default.
Add a '%a' string format operator. '%a' converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. The '%a' operator generates the same string as '%r' in Python 2.
Add an ascii() builtin function. ascii() converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. ascii() generates the same string as repr() in Python 2.
Add an isprintable() method to the string type. str.isprintable() returns False if repr() should escape any of the characters in the string, and True otherwise. The isprintable() method calls PY_UNICODE_ISPRINTABLE() internally.
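As a rough illustration of the rules above, the printable test and the revised repr() algorithm might be sketched in pure Python as follows. The names is_printable and proposed_repr are hypothetical stand-ins (the proposal itself is a C-level API), the category lookup relies on the standard unicodedata module, and the sketch assumes each element of a string is a full code point, so lone surrogates simply fall under category Cs.

```python
import unicodedata

# Categories that the proposal escapes: "Other" and "Separator".
_ESCAPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable(ch):
    """Hypothetical pure-Python stand-in for PY_UNICODE_ISPRINTABLE()."""
    if ch == " ":
        return True  # ASCII space is the one separator left unescaped
    return unicodedata.category(ch) not in _ESCAPED_CATEGORIES


def proposed_repr(s):
    """Sketch of the revised algorithm; always quotes with apostrophes."""
    simple = {"\r": "\\r", "\n": "\\n", "\t": "\\t", "\\": "\\\\", "'": "\\'"}
    out = []
    for ch in s:
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif is_printable(ch):
            out.append(ch)                # printable characters pass through
        elif cp < 0x100:
            out.append("\\x%02x" % cp)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "'" + "".join(out) + "'"


print(proposed_repr("日本語\u3000\n"))  # '日本語\u3000\n'
```

The Japanese letters are kept, while the ideographic space (category Zs) and the newline are still escaped, so invisible characters remain distinguishable in the output.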

Rationale

The repr() in Python 3000 should be Unicode-based, not ASCII-based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device's locale. For example, it is common for a daemon process to be invoked in an ASCII setting but to write UTF-8 to its log files. Also, web applications might want to report error information in a more readable form based on the HTML page's encoding.

Characters not supported by the user's console are hex-escaped on printing by the Unicode encoder's error-handler. If the error-handler of the output file is 'backslashreplace', such characters are hex-escaped without raising UnicodeEncodeError. For example, if your default encoding is ASCII, print('Hello ¢') prints 'Hello \xa2'. If your encoding is ISO-8859-1, 'Hello ¢' is printed.

For non-interactive sessions, the default error-handler of sys.stdout should be 'strict'. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing.
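The effect of the two error-handlers can be observed directly with str.encode(), which applies the same codec machinery that an output stream uses. The snippet below is a small illustration; the variable names are only for this example.

```python
text = "Hello ¢"

# ASCII output with 'backslashreplace': the unsupported character is escaped.
escaped = text.encode("ascii", "backslashreplace")
print(escaped)   # b'Hello \\xa2'

# ISO-8859-1 can represent the character, so nothing needs escaping.
latin1 = text.encode("iso-8859-1", "backslashreplace")
print(latin1)    # b'Hello \xa2'

# 'strict' traps the unsupported character instead of escaping it.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```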

Printable characters

The Unicode standard doesn't define non-printable characters, so we must create our own definition. Here we propose to define non-printable characters as follows.

Non-printable ASCII characters, as in Python 2.
Broken surrogate pair characters.
Characters defined in the Unicode character database as
- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate)
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line) ('\u2028', LINE SEPARATOR)
- Zp (Separator, Paragraph) ('\u2029', PARAGRAPH SEPARATOR)
- Zs (Separator, Space) other than ASCII space('\x20'). Characters in this category should be escaped to avoid ambiguity.

Alternate Solutions

To help debugging in non-Latin languages without changing repr(), other suggestions were made.

Supply a tool to print lists or dicts.

Strings to be printed for debugging are contained not only in lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.

Use sys.displayhook and sys.excepthook.

For interactive sessions, we can write hooks to restore hex-escaped characters to the original characters. But these hooks are called only to display the result of evaluating an expression entered in an interactive Python session, and do not work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.

Subclass sys.stdout and sys.stderr.

It is difficult to implement a subclass to restore hex-escaped characters since there isn't enough information left by the time it's a string to undo the escaping correctly in all cases. For example, print("\\"+"u0041") should be printed as '\u0041', not 'A'. But a file object has no way to tell the two cases apart.

Make the encoding used by unicode_repr() adjustable, and make the current repr() the default.

With an adjustable repr(), the result of repr() is unpredictable, which would make it impossible to write correct code involving repr(). And if the current repr() is the default, then the old convention remains intact and users may expect ASCII strings as the result of repr(). Third-party applications or libraries could choke when a custom repr() function is used.
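The problem with restoring escapes in a sys.stdout subclass can be shown with a short experiment. It is only a sketch: the "unicode_escape" codec stands in for whatever un-escaping such a subclass would perform, and the variable names are made up for the example.

```python
# Text that really consists of a backslash followed by "u03b1".
literal = "\\" + "u03b1"

# The Greek letter alpha after it has been hex-escaped for an ASCII stream.
escaped = "\u03b1".encode("ascii", "backslashreplace").decode("ascii")

# By the time the text reaches the file object, the two are identical.
print(literal == escaped)  # True

# A naive un-escaping step therefore turns both into alpha,
# which is wrong for the literal text.
restored = literal.encode("ascii").decode("unicode_escape")
print(restored == "\u03b1")  # True
```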

Backwards Compatibility

Changing repr() may break some existing code, especially testing code. Five of Python's regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.

def repr_ascii(obj):
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

For logging or for debugging, the following code can raise UnicodeEncodeError.

log = open("logfile", "w")
log.write(repr(data))     # UnicodeEncodeError will be raised
                          # if data contains unsupported characters.

To avoid raising exceptions, you can specify the error-handler explicitly.

log = open("logfile", "w", errors="backslashreplace")
log.write(repr(data))  # Unsupported characters will be escaped.
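A self-contained variant of the two snippets above is sketched here. It assumes a Python 3 interpreter whose repr() keeps non-ASCII characters as this PEP proposes, forces the log file to ASCII so the outcome does not depend on the locale, and writes to a temporary directory; the sample data and variable names are only illustrative.

```python
import os
import tempfile

data = ["日本語"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "logfile")

    # With the default 'strict' handler an ASCII log rejects the text.
    try:
        with open(path, "w", encoding="ascii") as log:
            log.write(repr(data))
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False

    # With 'backslashreplace' the same write succeeds and escapes the text.
    with open(path, "w", encoding="ascii", errors="backslashreplace") as log:
        log.write(repr(data))
    with open(path, encoding="ascii") as log:
        written = log.read()

print(strict_ok)  # False
print(written)    # ['\u65e5\u672c\u8a9e']
```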

For a console with a Unicode-based encoding, for example en_US.utf8 or de_DE.utf8, the 'backslashreplace' trick doesn't work and no printable characters are escaped. This causes a problem with look-alike characters in Western, Greek and Cyrillic languages. These languages use similar (but different) alphabets (descended from a common ancestor) and contain letters that look similar but have different character codes. For example, it is hard to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual representation, of course, very much depends on the fonts used, but usually these letters are almost indistinguishable.) To avoid the problem, users can adjust the terminal encoding to get a result suitable for their environment.
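The look-alike problem is easy to reproduce. The sketch below uses the standard unicodedata module to show that the letters differ even though they render almost identically, and that hex-escaping (as an ASCII terminal with 'backslashreplace' would do) makes the difference visible.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"   # Cyrillic letters that resemble a, e, o

# The strings look alike on screen but compare unequal.
print(latin == cyrillic)  # False

# Character names reveal which script each letter belongs to.
for ch in latin + cyrillic:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# Hex-escaping makes the Cyrillic letters stand out.
hex_escaped = cyrillic.encode("ascii", "backslashreplace").decode("ascii")
print(hex_escaped)  # \u0430\u0435\u043e
```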

Open Issues

Is the ascii() function necessary, or is documentation sufficient? If it is necessary, should ascii() belong to the builtin namespace?

Rejected Proposals

Add encoding and errors arguments to the builtin print() function, with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

Complicated to implement, and in general, this does not seem to be a good idea. [2]

Use character names to escape characters, instead of hex character codes. For example, repr('\u03b1') can be converted to "\N{GREEK SMALL LETTER ALPHA}".

Using character names gets verbose compared to hex-escapes. For example, repr("\ufbf9") is converted to "\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}".
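The verbosity of name-based escapes can be checked with a small sketch built on unicodedata.name(). The helper name_escape is hypothetical and only handles characters that have a name in the Unicode database; unicodedata.name() raises ValueError for the rest.

```python
import unicodedata


def name_escape(s):
    """Replace every non-ASCII character with a \\N{...} escape (sketch)."""
    out = []
    for ch in s:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("\\N{%s}" % unicodedata.name(ch))
    return "".join(out)


alpha = name_escape("\u03b1")
print(alpha)  # \N{GREEK SMALL LETTER ALPHA}

# Compare the length of the name-based escape with the 6-character hex form.
long_form = name_escape("\ufbf9")
hex_form = "\\ufbf9"
print(len(long_form), len(hex_form))
```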

Reference Implementation

http://bugs.python.org/issue2630

References

[1]	Multibyte string on string::string_print (http://bugs.python.org/issue479898)

[2]	[Python-3000] Displaying strings containing unicode escapes (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)

Copyright

This document has been placed in the public domain.

-  ⇤ ← Revision 1 as of 2008-05-01 02:43:48 → 
  Size: 6575
  Editor: i218-47-192-49
  Comment:
+   ← Revision 9 as of 2008-05-24 10:11:48 → ⇥
  Size: 10642
  Editor: i218-47-192-49
  Comment: Moved ambiguity issue to Backwards Compatibility section
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
 PEP:
-Line 11:
+Line 12:
-Created: 
Post-History:
+Created:
Post-History:
-Line 18:
+Line 19:
-This PEP proposes new string repr for Python 3000. In Python prior to Python 3000, repr() built-in function is used to convert arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, wider range of characters defined in Unicode standard should be considered 'printable'.
+This PEP proposes new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converts arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered 'printable'.
-Line 24:
+Line 25:
-Current repr() converts 8-bit strings to ASCII by following algorithm.
+The current repr() converts 8-bit strings to ASCII using following algorithm.
-Line 26:
+Line 27:
-- Convert CR, LF, TAB and '\\' to '\r', '\n', '\t', '\\'.
+- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
-Line 28:
+Line 29:
-- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII characters(>=0x80) to '\xXX'.
+- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII characters(>=0x80) to '\\xXX'.
-Line 30:
+Line 31:
-- Backslash-escape quote characters(' or ") and add quote character at head and tail.
+- Backslash-escape quote characters(apostrophe, ') and add the quote character at the beginning and the end.
-Line 32:
+Line 33:
-For Unicode strings, following conversions are added.
+For Unicode strings, the following additional conversions are done.
-Line 34:
+Line 35:
-- Leading surrogate pair characters without trailing character(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
+- Convert leading surrogate pair characters without trailing character(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
-Line 36:
+Line 37:
-- Convert 16-bit characters(>=0x100) to '\uXXXX'.
+- Convert 16-bit characters(>=0x100) to '\\uXXXX'.
-Line 38:
+Line 39:
-- Convert 21-bit characters(>=0x10000) and surrogate pair characters to '\U00xxxxxx'.
+- Convert 21-bit characters(>=0x10000) and surrogate pair characters to '\\U00xxxxxx'.
-Line 40:
+Line 41:
-This algorithm converts any strings to printable ASCII, and repr() is used as handy and safe way when printing strings for debug or logging. Although all non-ASCII characters are escaped, it is not a problem when most characters in string are ASCII. But this is an inconvenience for other languages such as Japanese which most characters in string are not ASCII. Python 3000 has a lot of nice features for non-Latin people such as non-ASCII identifiers, so progressing in this area would be desired.
+This algorithm converts any string to printable ASCII, and repr() is used as handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string's characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient. Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.
-Line 42:
+Line 43:
-People might concern such output will mess their console up if they print binary data like images. But such ruin is unlikely to happen because bytes and strings are different type in Python 3000, so printing image to console doesn't break your display.
+Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won't mess it up.
-Line 50:
+Line 51:
-- Algorithm to build repr string is changed to:
+- Add Python API ``int PY_UNICODE_ISPRINTABLE(Py_UNICODE ch)``. ``PY_UNICODE_ISPRINTABLE()`` return 0 if repr() should escape the Unicode character ``ch``, 1 otherwise. Characters should be escaped are
-Line 52:
+Line 53:
- * Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
+  * Characters defined in the Unicode character database as "Other"(Cc, Cf, Cs, Co, Cn).
-Line 54:
+Line 55:
- * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to '\xXX'.
+  * Characters defined in the Unicode character database as "Separator"(Zl, Zp, Zs) other than ASCII space(0x20).
-Line 56:
+Line 57:
- * Leading surrogate pair characters without trailing character(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
+- The algorithm to build repr() strings should be changed to:
-Line 58:
+Line 59:
- * Convert Unicode whitespace other than ASCII space('\x20') and control characters (categories Z* and C* in Unicode database) to '\uXXXX' or '\U00xxxxxx'.
+  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
-Line 60:
+Line 61:
-- Set Unicode error-handler for sys.stdout and sys.stderr to 'backslashreplace' as default.
+  * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.
-Line 62:
+Line 63:
-- Add encoding and errors arguments to print() built-in function, which defaults are ``sys.getfilesystemencoding()`` and 'backslashreplace'.
+  * Convert leading surrogate pair characters without trailing character(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

  * Convert non-printable characters(PY_UNICODE_ISPRINTABLE() returns 0) to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.

  * Backslash-escape quote characters(apostrophe, ') and add quote character at the beginning and the end.

- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by default.

- Set the Unicode error-handler for sys.stdout in the Python interactive session to 'backslashreplace'  by default.

- Add ``'%a'`` string format operator. ``'%a'`` converts any python object to string using repr() and then hex-escape all non-ASCII characters. ``'%a'`` operator generates same string as ``'%r'`` in Python 2.

- Add ``ascii()`` builtin function. ``ascii()`` converts any python object to string using repr() and then hex-escape all non-ASCII characters. ``ascii()`` generates same string as ``repr()`` in Python 2.

- Add ``isprintable()`` method to the string type. ``str.isprintable()`` returns False if repr() should escape the characters in the string, True otherwise. ``isprintable()`` method calls ``PY_UNICODE_ISPRINTABLE()`` internally.
-Line 68:
+Line 83:
-repr() in Python 3000 should not rely on ASCII but Unicode standard. Also conversion should not be affected by locale setting, because locale is not necessary to same as output device's locale. For example, daemon process invoked in ASCII setting, but emits log in UTF-8 is pretty common.
+The repr() in Python 3000 should be Unicode not ASCII based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device's locale. For example, it is common for a daemon process to be invoked in an ASCII setting, but writes UTF-8 to its log files. Also, web applications might want to report the error information in more readable form based on the HTML page's encoding.
-Line 70:
+Line 85:
-Characters not supported by user's console are hex-escaped on printing, by error-handler of Unicode encoders. If error-handler of the output file is 'backslashreplace', such characters are hex-escaped by error handler without raising UnicodeEncodeError. For example, if your default encoding is ASCII, ``print('¢')`` will prints '\xa2'. If your encoding is ISO-8859-1, '¢' will be printed. If you want to print same strings as Python 2, you can set encoding and errors of output file to 'ASCII' and 'strict' respectively. You can also specify encoding and error-handler when printing, e.g. ``print('¢', encoding='ASCII', errors='backslashreplace')``.
+Characters not supported by user's console are hex-escaped on printing, by the Unicode encoder's error-handler. If the error-handler of the output file is 'backslashreplace', such characters are hex-escaped without raising UnicodeEncodeError. For example, if your default encoding is ASCII, ``print('Hello ¢')`` will prints 'Hello \\xa2'. If your encoding is ISO-8859-1, 'Hello ¢' will be printed.
-Line 72:
+Line 87:
+For non-interactive session, default error-handler of sys.stdout should be default to 'strict'. Other applications reading the output might not understand hex-escaped characters, so un-supported characters should be trapped when writing.
-Line 76:
+Line 92:
-Unicode standard doesn't define Non-printable characters to be escaped in repr(). So we define Non-printable characters as follows.
+The Unicode standard doesn't define Non-printable characters, so we must create our own definition. Here we propose to define Non-printable characters as follows.
-Line 82:
+Line 98:
-- Characters defined in Unicode character database as
+- Characters defined in the Unicode character database as
-Line 89:
+Line 105:
-  * Zl Separator, Line ('\u2028', LINE SEPARATOR)
  * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
  * Zs (Separator, Space) other than ASCII space('\x20'). Characters in this category should be escaped to avoid ambiguity.
+  * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
  * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in this category should be escaped to avoid ambiguity.
-Line 96:
+Line 112:
-To help debugging in non-Latin language without changing repr(), other suggestion were made.
+To help debugging in non-Latin languages without changing repr(), other suggestion were made.
-Line 100:
+Line 116:
-  Strings to be printed for debugging are not only contained by lists or dicts, but a lot of complex objects. File objects contain a file name in Unicode, exception objects contain message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is impossible to implement a tool to print all possible object types.
+  Strings to be printed for debugging are not only contained by lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.
-Line 104:
+Line 120:
-  At the interactive session, we can write hooks to restore hex escaped characters to original. But these hooks are called only when the result of evaluating an expression entered in an interactive Python session, doesn't work for print function or non-interactive session.
+  For interactive sessions, we can write hooks to restore hex escaped characters to the original characters. But these hooks are called only when the result of evaluating an expression entered in an interactive Python session, and doesn't work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.
-Line 108:
+Line 124:
-  It is difficult to implement a subclass to restore hex-escaped characters since there isn't enough information left by the time it's a string to undo the escaping correctly in all cases. For example, ``print("\\"+"u0041")`` should be printed as '\u0041', not 'A'. But no chance to tell file objects apart.
+  It is difficult to implement a subclass to restore hex-escaped characters since there isn't enough information left by the time it's a string to undo the escaping correctly in all cases. For example, ``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But there is no chance to tell file objects apart.
-Line 110:
+Line 126:
-- Make the encoding used by unicode_repr() adjustable.
+- Make the encoding used by unicode_repr() adjustable, and make current repr() as default.
-Line 112:
+Line 128:
-  I don't want to preserve current repr() behavior to make application/library authors aware of non-ASCII repr(). And I think selecting an encoding on printing is more flexible than having global setting.


Open Issues 
===========

- A lot of people uses UTF-8 for their encoding, such as de_DE.utf8. In this case, backslashescape trick doesn't work.
+  With adjustable repr(), result of repr() is unpredictable and would make impossible to write correct code involving repr(). And if current repr() is default, then old convention remains intact and user may expect ASCII strings as the result of repr(). Third party applications or libraries could be choked when custom repr() function is used.
-Line 124:
+Line 134:
-Changing repr() result break some of existing codes, especially testing code. Five tests in Python's regression test failed by this modification.
+Changing repr() may break some existing codes, especially testing code. Five of Python's regression test fail with this modification. If you need repr() strings without non-ASCII character as Python 2, you can use following function. ::

    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

For logging or for debugging, following code can raise UnicodeEncodeError. ::

    log = open("logfile", "w")
    log.write(repr(data))     # UnicodeEncodeError will be raised
                              # if data contains unsupported characters.

To avoid exceptions raised, you can specify error-handler explicitly. ::

    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.


For the console with Unicode-based encoding, for example, en_US.utf8 and de_DE.utf8, the backslashescape trick doesn't work and all printable characters are not escaped. This will cause a problem of similarly drawing characters in Western,Greek and Cyrillic languages. These languages use similar (but different) alphabets (descended from the common ancestor) and contain letters that look similar but has different character codes. For example, it is hard to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual representation, of course, very much depends on the fonts used but usually these letters are almost indistinguishable.) To avoid the problem, user can adjust terminal encoding to get desired result suitable for their environment.



Open Issues
===========

- Is ``ascii()`` function necessary, or documentation is just fine? If necessary, should ``ascii()`` belong to builtin namespace?


Rejected Proposals
==================

- Add encoding and errors arguments to the builtin print() function, with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

  Complicated to implement, and in general, this is not seem to good idea. [2]_

- Use character names to escape characters, instead of hex character codes. For example, ``repr('\u03b1')`` can be converted to ``"\N{GREEK SMALL LETTER ALPHA}"``.

  Using character names get verbose compared to hex-escape. e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.

Reference Implementation
========================

http://bugs.python.org/issue2630
-Line 133:
+Line 184:
+.. [2] [Python-3000] Displaying strings containing unicode escapes
        (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)
