Differences between revisions 3 and 4
Revision 3 as of 2007-12-26 23:04:09
Size: 820
Editor: JamesAbley
Comment: Added Unicode Ate My Brain link.
Revision 4 as of 2008-01-22 01:04:31
Size: 1682
Editor: JamesAbley
Comment: dumping out what little knowledge I've gained

Initially, I am just going to dump stuff out so that I don't forget it. Later on, I hope to savagely re-edit it into a more coherent structure.

Overview

unicodedata is essentially a lookup table over the [http://www.unicode.org/ucd/ Unicode Character Database], which is published as part of the Unicode specification. There seem to be effectively two types of lookup:

  1. Given a Unicode character, retrieve a property of that character (for example, its general category).
  2. Given a Unicode character name, retrieve the Unicode character (the unicodedata.lookup function).
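
As a quick illustration of the two lookup types, here is a minimal sketch against the standard unicodedata module. It assumes a Python 3 interpreter (where every str is Unicode); the sample character and the variable names are only illustrative.

```python
import unicodedata

ch = "\u00e9"  # illustrative sample character

# Lookup type 1: character -> property of that character.
char_name = unicodedata.name(ch)
char_category = unicodedata.category(ch)
digit_value = unicodedata.decimal("7")

# Lookup type 2: character name -> character.
found = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")

print(char_name, char_category, digit_value, found == ch)
```

Both directions go through the same database, but they are keyed differently (by code point versus by name), which is what motivates treating them as separate tables.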

It is probably worth compiling a list of known clients of this module, so that we have a clear idea of which usages should be optimized.

Breaking it down further, the obvious implementation is to have two lookup tables.

  1. The first one will take an integer (or unichr - a Unicode character is effectively an integer in this context) and return, in O(1) time, an index into a table that defines the properties of the character.
  2. The second one will be a dictionary lookup or similar, to find a code point given a name.
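
The two-table idea above could be sketched roughly as follows. This is a toy under stated assumptions, not a description of how CPython encodes its tables: it covers only a small range of code points (the hypothetical LIMIT), bootstraps its data from the existing unicodedata module, and uses made-up names (records, index, by_name, properties, lookup). It is written for Python 3, where chr plays the role of unichr.

```python
import unicodedata

LIMIT = 0x0250  # assumption: cover only a small range to keep the sketch short

records = []     # table of distinct property tuples
record_ids = {}  # property tuple -> position in records (build-time helper)
index = []       # table 1: code point -> position in records
by_name = {}     # table 2: character name -> code point

for cp in range(LIMIT):
    ch = chr(cp)
    props = (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )
    # Many characters share identical properties, so store each tuple once.
    if props not in record_ids:
        record_ids[props] = len(records)
        records.append(props)
    index.append(record_ids[props])
    # Some code points (e.g. control characters) have no name.
    name = unicodedata.name(ch, None)
    if name is not None:
        by_name[name] = cp


def properties(cp):
    """Return the property tuple for a code point via two list indexings."""
    return records[index[cp]]


def lookup(name):
    """Return the code point for an exact character name."""
    return by_name[name]
```

Because many code points share the same property tuple, records stays much shorter than index, which is the kind of space-time trade-off the table encoding has to balance.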

TODO: better analysis of the CPython version, describing the table encoding and space-time trade-offs.

Bibliography

UnicodeData (last edited 2008-11-15 09:16:01 by localhost)