PEP 393 — Flexible String Representation

This document describes the initial implementation of PEP 393, carried out during the Google Summer of Code 2011. The project's repository is hosted on Bitbucket in the pep-393 branch.

PEP 393 proposes to change Python's unicode string representation so that strings use a flexible character size and require only the minimum necessary space. Strings containing only Latin-1 code points use 1-byte characters, strings whose code points are all below 65536 use 2-byte characters, and all other strings use 4-byte characters.

Please see the PEP document for details on the rationale and the proposed approach. The announcement email on python-dev also gives some more context information.

Implemented Features

During the Google Summer of Code, all changes to Python's structures and all additional functions specified in the PEP have been implemented. The update to the GDB debugging hooks is still missing, however.

New unicode objects are now allocated in one memory block instead of two. Previously one block was used for the struct and another for the unicode buffer.

PEP 393 supports two string representations, the legacy Py_UNICODE representation and a new flexible representation. The patch is source compatible with the previous API, but using the legacy representation can now be less efficient because the Py_UNICODE representation may have to be computed on demand.

The interpreter startup has been profiled, and the functions called most often during startup have been ported to the new unicode representation. Currently, about 3000 legacy Py_UNICODE representations are still generated during startup and about 200 strings are still created through the old API, while about 62000 strings are created in the new representation.

Currently all unit tests except test_packaging (which is also broken in the default branch) pass with the patch applied.

Changed Paradigms

Besides the changes already outlined in PEP 393, such as conversion to the old representation through PyUnicode_AsUnicodeAndSize(), the following paradigm changes apply when working with the new flexible string representation instead of Py_UNICODE buffers:

  1. Strings can no longer be accessed directly via C's subscript operator. Instead, use the PyUnicode_READ() macro or one of the PyUnicode_1BYTE_DATA(), PyUnicode_2BYTE_DATA(), or PyUnicode_4BYTE_DATA() macros. Use PyUnicode_KIND() to check the string's kind and PyUnicode_DATA() to access the data buffer.

  2. Functions that are passed strings should call PyUnicode_FAST_READY() to ensure the canonical representation is initialized.

  3. Use a for-loop with index-based access instead of pointer-based access to iterate over all characters of a string. This sample code shows the three points above in action:

    int kind;
    void *data;
    Py_ssize_t i, size;
    if (PyUnicode_FAST_READY(unicode_obj) == -1)
        return NULL;
    kind = PyUnicode_KIND(unicode_obj);
    data = PyUnicode_DATA(unicode_obj);
    size = PyUnicode_GET_LENGTH(unicode_obj);
    for (i = 0; i < size; ++i) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        /* ... process ch ... */
    }

  1. PyUnicode_GET_LENGTH() returns the length of the canonical representation. Calling PyUnicode_GET_SIZE() will generate the legacy representation on demand and return its size.

  2. Since Py_UNICODE might be only 2 bytes wide (depending on sizeof(wchar_t)), use Py_UCS4 to store individual characters.
  3. The preferred way to UTF-8-encode unicode objects with a custom error handler from C code is now

    PyObject *utf8 = PyUnicode_AsEncodedString(u, "utf8", "surrogatepass"); 

instead of

    PyObject *utf8 = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u), PyUnicode_GET_SIZE(u), "surrogatepass");

  1. Resizing strings is discouraged because new unicode objects are allocated in a single memory block and only Py_UNICODE representations are resizable. Resizing is still possible for unicode objects allocated through the old API, such as PyUnicode_FromUnicode(NULL, len), and is used by error handlers in several places for convenience.

  2. Some functions, such as PyUnicode_FromFormatV() and PyUnicode_DecodeUTF8Stateful(), perform an additional pre-processing step to calculate exactly how much space the new string requires, thereby avoiding resizing.

  3. It is now possible to check whether a string is ASCII-only in constant time, by testing PyUnicode_MAX_CHAR_VALUE(u) < 128.

  4. The autoconf script was changed: Py_UNICODE is now a typedef for wchar_t, Py_UNICODE_SIZE is the same as SIZEOF_WCHAR_T, and Py_UNICODE_WIDE is set when SIZEOF_WCHAR_T >= 4.

New Functions and Macros

New functions are documented in the source code (see the diff of the unicodeobject.h header), but as an overview, here are the most important new functions and macros:

PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

int PyUnicode_Ready(PyUnicodeObject *unicode)

int PyUnicode_FAST_READY(unicode)

Py_ssize_t PyUnicode_GET_SIZE(unicode)
Py_UNICODE* PyUnicode_AS_UNICODE(unicode)

Py_ssize_t PyUnicode_GET_LENGTH(unicode)

int PyUnicode_KIND(unicode)

void * PyUnicode_DATA(unicode)

void PyUnicode_WRITE(kind, data, index, value)

Py_UCS4 PyUnicode_READ(kind, data, index)

Py_UCS4 PyUnicode_READ_CHAR(unicode, index)

Py_UCS4 PyUnicode_MAX_CHAR_VALUE(unicode)

int PyUnicode_CHARACTER_SIZE(unicode)

int PyUnicode_IS_COMPACT(unicode)

void PyUnicode_CopyCharacters(PyUnicodeObject *to, Py_ssize_t to_start, PyUnicodeObject *from, Py_ssize_t from_start, Py_ssize_t how_many)

Removed Optimizations and Incompatible Changes

Some optimizations were removed during the implementation of PEP 393, and some changes are incompatible with previous behaviour:

Performance Measurements

Initial micro benchmarks show a reduction in CPU and memory usage for certain ported operations. No comprehensive overall benchmark has been done yet, as not all functions have been ported to the new representation. Some chains of operations cause the creation of many legacy representations and thus unnecessary copying.

The debug build has been instrumented with counters which measure how often certain key functions are being called:

pep-393$ gdb python.exe
(gdb) r
Python 3.3.0a0 (pep-393:c603eabf4a52+, Aug 22 2011, 12:17:59) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ^Z
(gdb) p unicode_as_unicode_calls
$1 = 3244
(gdb) p unicode_ready_calls
$2 = 216
(gdb) p unicode_old_new_calls
$3 = 263
(gdb) p unicode_new_new_calls
$4 = 62321

This shows that a plain interpreter startup still creates 3244 Py_UNICODE representations. 62321 strings are created using the new representation and 263 using the old one.

Additionally, a set of micro benchmarks for testing the behaviour of the implementation exists in a separate Bitbucket repository.

Running the measurement script via ./ produces the following measurements. The most impressive is the join benchmark, which demonstrates a CPU and memory reduction at the same time.

=== Results ===
with pep-393-release       : CPU_min=0.81904 MEM=5052 KiB
with py3k-default-release  : CPU_min=0.86020 MEM=5120 KiB
with pep-393-release       : CPU_min=0.19980 MEM=4976 KiB
with py3k-default-release  : CPU_min=0.22005 MEM=4956 KiB
with pep-393-release       : CPU_min=0.39217 MEM=5048 KiB
with py3k-default-release  : CPU_min=0.38138 MEM=5120 KiB
with pep-393-release       : CPU_min=1.89221 MEM=7180 KiB
with py3k-default-release  : CPU_min=3.85390 MEM=9208 KiB

I am confident that converting more of Python's code base will lead to memory reductions in more benchmarks.

Future Work

The parts of Python most likely to be ported next to the new string representation are:

Profiling actual Python applications for further hot spots that convert to the old representation will reveal additional porting targets.

SummerOfCode/2011/PEP393 (last edited 2011-08-24 01:05:23 by bas9-toronto12-1177579710)
