Size: 1682
Comment: dumping out what little knowledge I've gained
|
Size: 1940
Comment: notes about testing
|
Deletions are marked like this. | Additions are marked like this. |
Line 5: | Line 5: |
Line 7: | Line 8: |
Line 10: | Line 10: |
Breaking it down further, the obvious implementation is to have two lookup tables. | Breaking it down further, the obvious implementation is to have two lookup tables. |
Line 13: | Line 14: |
TODO better analysis of CPython version, describing the table encoding and space-time trade-offs. | |
Line 14: | Line 16: |
TODO better analysis of CPython version, describing the table encoding and space-time trade-offs. | TODO testing notes. Existing tests, why java.lang.Character and ICU aren't considered suitable (the latter without OSGi / classloader shenanigans anyway), more tests around surrogate characters (c.f. Sam Ruby on the subject, which is always entertaining). |
Initially, I am just going to dump stuff out so that I don't forget it. Later on, I hope to savagely re-edit it into a more coherent structure.
Overview
unicodedata is essentially a lookup table of the [http://www.unicode.org/ucd/ Unicode Character Database] that is published as part of the Unicode specification. There seem to be effectively two types of lookup:
- Given a unicode character, retrieve a property of that character.
- Given a unicode character name, retrieve the unicode character (the unicodedata.lookup function).
It's probably worthwhile trying to compile a list of known clients of this module, so that we have a clear idea of the usages that should be optimized.
Breaking it down further, the obvious implementation is to have two lookup tables.
- The first one will take an integer (or unichr - a Unicode character is effectively an integer in this context) and return an O(1) index into a table which defines the properties of the character.
- The second one will be a dictionary lookup or similar, to find a codepoint given a name.
TODO better analysis of CPython version, describing the table encoding and space-time trade-offs.
TODO testing notes. Existing tests, why java.lang.Character and ICU aren't considered suitable (the latter without OSGi / classloader shenanigans anyway), more tests around surrogate characters (c.f. Sam Ruby on the subject, which is always entertaining).
Bibliography
[http://www.unicode.org/ The Unicode Consortium]
[http://en.wikipedia.org/wiki/Unicode Wikipedia article]
A [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ discussion] of Unicode support in the Java platform.
[http://home.ccil.org/~cowan/uamb.pdf Unicode Ate My Brain] Nice discussion about the different table lookups, for people like myself without a classical CS background or strong knowledge about data structures and algorithms.