Size: 820
Comment: Added Unicode Ate My Brain link.
|
Size: 1682
Comment: dumping out what little knowledge I've gained
|
Deletions are marked like this. | Additions are marked like this. |
Line 4: | Line 4: |
unicodedata is essentially a lookup table of the [http://www.unicode.org/ucd/ Unicode Character Database] that is published as part of the Unicode specification. | unicodedata is essentially a lookup table of the [http://www.unicode.org/ucd/ Unicode Character Database] that is published as part of the Unicode specification. There seem to be effectively two types of lookup: 1. Given a unicode character, retrieve a property of that character. 1. Given a unicode character name, retrieve the unicode character (the unicodedata.lookup function). It's probably worthwhile trying to compile a list of known clients of this module, so that we have a clear idea of the usages that should be optimized. Breaking it down further, the obvious implementation is to have two lookup tables. 1. The first one will take an integer (or unichr - a Unicode character is effectively an integer in this context) and return an O(1) index into a table which defines the properties of the character. 1. The second one will be a dictionary lookup or similar, to find a codepoint given a name. TODO better analysis of CPython version, describing the table encoding and space-time trade-offs. |
Initially, I am just going to dump stuff out so that I don't forget it. Later on, I hope to savagely re-edit it into a more coherent structure.
Overview
unicodedata is essentially a lookup table of the [http://www.unicode.org/ucd/ Unicode Character Database] that is published as part of the Unicode specification. There seem to be effectively two types of lookup:
- Given a unicode character, retrieve a property of that character.
- Given a unicode character name, retrieve the unicode character (the unicodedata.lookup function).
It's probably worthwhile trying to compile a list of known clients of this module, so that we have a clear idea of the usages that should be optimized.
Breaking it down further, the obvious implementation is to have two lookup tables.
- The first one will take an integer (or unichr - a Unicode character is effectively an integer in this context) and return an O(1) index into a table which defines the properties of the character.
- The second one will be a dictionary lookup or similar, to find a codepoint given a name.
TODO better analysis of CPython version, describing the table encoding and space-time trade-offs.
Bibliography
[http://www.unicode.org/ The Unicode Consortium]
[http://en.wikipedia.org/wiki/Unicode Wikipedia article]
A [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ discussion] of Unicode support in the Java platform.
[http://home.ccil.org/~cowan/uamb.pdf Unicode Ate My Brain] Nice discussion about the different table lookups, for people like myself without a classical CS background or strong knowledge about data structures and algorithms.