Diff for "UnicodeData"

Differences between revisions 4 and 5

Initially, I am just going to dump stuff out so that I don't forget it. Later on, I hope to savagely re-edit it into a more coherent structure.

Overview

unicodedata is essentially a lookup table over the [http://www.unicode.org/ucd/ Unicode Character Database], which is published as part of the Unicode Standard. There are effectively two types of lookup:

Given a Unicode character, retrieve a property of that character.
Given a Unicode character name, retrieve the Unicode character (the unicodedata.lookup function).
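
As a quick reminder of what the two kinds of lookup look like from the caller's side, here is a minimal sketch against CPython's standard unicodedata module (written for Python 3, where chr replaces unichr). The sample character is just an arbitrary choice for illustration.

```python
import unicodedata

# Lookup type 1: character -> property of that character.
ch = "\u00e9"
category = unicodedata.category(ch)  # general category code
name = unicodedata.name(ch)          # official character name

# Lookup type 2: character name -> character.
found = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")

print(category, name, found == ch)
```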

It's probably worthwhile trying to compile a list of known clients of this module, so that we have a clear idea of the usages that should be optimized.

Breaking it down further, the obvious implementation is to have two lookup tables.

The first one will take an integer code point (or a unichr; a Unicode character is effectively an integer in this context) and use it as an O(1) index into a table that defines the properties of the character.
The second one will be a dictionary lookup or similar, to find a code point given a name.
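
A rough sketch of that two-table idea, to make it concrete. Everything here is hypothetical and only for illustration: the CharProps record, the three-entry data set and the tiny table size stand in for the real database, and this is not how any existing implementation is laid out.

```python
from collections import namedtuple

# Hypothetical record type; the real database has many more properties.
CharProps = namedtuple("CharProps", ["name", "category"])

# Toy data standing in for the Unicode Character Database.
_RAW = [
    (0x0030, "DIGIT ZERO", "Nd"),
    (0x0041, "LATIN CAPITAL LETTER A", "Lu"),
    (0x0061, "LATIN SMALL LETTER A", "Ll"),
]

TABLE_SIZE = 0x80  # toy range; a full table would cover 0..0x10FFFF

# Table 1: dense list indexed directly by code point (O(1) access).
props_by_codepoint = [None] * TABLE_SIZE
# Table 2: dictionary from character name to code point.
codepoint_by_name = {}

for cp, name, cat in _RAW:
    props_by_codepoint[cp] = CharProps(name, cat)
    codepoint_by_name[name] = cp


def category(ch):
    """Return the category of ch, or "Cn" if the toy table has no entry."""
    cp = ord(ch)
    if cp >= TABLE_SIZE or props_by_codepoint[cp] is None:
        return "Cn"
    return props_by_codepoint[cp].category


def lookup(name):
    """Return the character with the given name (KeyError if unknown)."""
    return chr(codepoint_by_name[name])
```

The dense list is the simplest thing that gives O(1) access, but at full size it would have over a million slots, most of them identical, which is where the space-time trade-off comes in.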

TODO better analysis of CPython version, describing the table encoding and space-time trade-offs.
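
Until that analysis is written, here is a sketch of one common way to trade a little time for a lot of space: a two-stage table, where the code point space is cut into fixed-size blocks and identical blocks are stored only once. The block size and function names are my own assumptions for illustration; this is not a description of the actual CPython encoding.

```python
BLOCK = 256  # hypothetical block size; a real implementation would tune this


def build_two_stage(flat):
    """Compress a flat per-code-point list into (stage1, stage2)."""
    stage1 = []  # block number -> index of a unique block in stage2
    stage2 = []  # unique blocks, each a tuple of BLOCK values
    seen = {}    # block contents -> index in stage2
    for start in range(0, len(flat), BLOCK):
        block = tuple(flat[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2


def get(stage1, stage2, cp):
    """Two list indexings instead of one, but still O(1)."""
    return stage2[stage1[cp // BLOCK]][cp % BLOCK]


# Toy data: a mostly empty table with two interesting entries.
flat = [0] * 0x110000
flat[0x41] = 1
flat[0x1F600] = 2
stage1, stage2 = build_two_stage(flat)
```

With this toy data the flat table has 0x110000 entries, while the compressed form needs one small index per block plus only the distinct blocks.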

TODO testing notes: existing tests; why java.lang.Character and ICU aren't considered suitable (the latter at least not without OSGi / classloader shenanigans); more tests around surrogate characters (cf. Sam Ruby on the subject, which is always entertaining).
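
For the surrogate tests, the arithmetic worth pinning down is the standard UTF-16 mapping between a supplementary code point and its surrogate pair. A small sketch, with helper names that are mine and purely illustrative:

```python
def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E lies outside the Basic Multilingual Plane.
high, low = to_surrogates(0x1D11E)
roundtrip = from_surrogates(high, low)
```

A test around this could check that property lookups give the same answer whether a character arrives as a single code point or as a pair of UTF-16 code units.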

Bibliography

[http://www.unicode.org/ The Unicode Consortium]
[http://en.wikipedia.org/wiki/Unicode Wikipedia article]
A [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ discussion] of Unicode support in the Java platform.
[http://home.ccil.org/~cowan/uamb.pdf Unicode Ate My Brain] Nice discussion about the different table lookups, for people like myself without a classical CS background or strong knowledge about data structures and algorithms.

- Revision 4 as of 2008-01-22 01:04:31
  Size: 1682
  Editor: JamesAbley
  Comment: dumping out what little knowledge I've gained
+ Revision 5 as of 2008-01-22 01:07:33
  Size: 1940
  Editor: JamesAbley
  Comment: notes about testing
 Line 5:
-Line 7:
+Line 8:
 Line 10:
 Breaking it down further, the obvious implementation is to have two lookup tables.
-Line 13:
+Line 14:
+TODO better analysis of CPython version, describing the table encoding and space-time trade-offs.
-Line 14:
+Line 16:
-TODO better analysis of CPython version, describing the table encoding and space-time trade-offs.
+TODO testing notes. Existing tests, why java.lang.Character and ICU aren't considered suitable (the latter without OSGi / classloader shenanigans anyway), more tests around surrogate characters (c.f. Sam Ruby on the subject, which is always entertaining).