Differences between revisions 4 and 5
Revision 4 as of 2010-09-12 06:33:09
Size: 2493
Editor: AES-Static-002
Comment:
Revision 5 as of 2010-09-20 10:58:27
Size: 2513
Editor: 115
Comment: Adding more affected languages

Introduction

Python 3 supports non-ASCII identifiers as per PEP 3131. However, this support is incomplete for languages that make extensive use of special characters such as ZWJ (U+200D ZERO WIDTH JOINER) and ZWNJ (U+200C ZERO WIDTH NON-JOINER). Examples of such languages are Malayalam, Kannada, Sinhala and Farsi.

Unicode Standard on Using ZWJ/ZWNJ in Identifiers

ZWJ and ZWNJ are format control characters. Unicode defines the usage of these characters in identifiers in TR31 (Unicode Standard Annex #31), section 2.3, "Layout and Format Control Characters".
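As a quick illustration, the standard `unicodedata` module can be used to confirm that both characters carry the general category `Cf` (format control). The helper name `describe` is made up for this sketch and is not part of any API.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character.

    `describe` is a hypothetical helper used only for this illustration.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in ("\u200c", "\u200d"):
    print(*describe(ch))
# U+200C ZERO WIDTH NON-JOINER Cf
# U+200D ZERO WIDTH JOINER Cf
```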

Unicode recommends allowing ZWJ and ZWNJ (the "Join_Control characters") in identifiers only in the following three contexts:

  • Allow ZWNJ where it breaks a cursive connection. That is, in the context based on the Joining_Type property, consisting of:
    • A Left-Joining or Dual-Joining character, followed by zero or more Transparent characters, followed by a ZWNJ, followed by zero or more Transparent characters, followed by a Right-Joining or Dual-Joining character
    • This corresponds to the following regular expression (in Perl-style syntax): /$LJ $T* ZWNJ $T* $RJ/
      • where:
        • $T = [:Joining_Type=Transparent:]
        • $RJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:] ]
        • $LJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:] ]
  • Allow ZWNJ in a conjunct context. That is, a sequence of the form:
    • A Letter, followed by a Virama, followed by a ZWNJ
    • This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWNJ/
      • where:
        • $L = [:General_Category=Letter:]
        • $V = [:Canonical_Combining_Class=Virama:]
  • Allow ZWJ in a conjunct context. That is, a sequence of the form:
    • A Letter, followed by a Virama, followed by a ZWJ
    • This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWJ/
      • where:
        • $L = [:General_Category=Letter:]
        • $V = [:Canonical_Combining_Class=Virama:]
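The three contexts above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not how CPython tokenizes identifiers. `unicodedata.combining(ch) == 9` stands in for Canonical_Combining_Class=Virama. The standard library has no Joining_Type lookup, so `SAMPLE_JOINING_TYPE` is a tiny hand-written sample table. All function and table names are hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# ASSUMPTION: a hand-picked sample of Joining_Type values for illustration.
# A real checker would load the full property data from the Unicode
# Character Database instead of this table.
SAMPLE_JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF      (Right_Joining)
    "\u0628": "D",  # ARABIC LETTER BEH       (Dual_Joining)
    "\u062e": "D",  # ARABIC LETTER KHAH      (Dual_Joining)
    "\u06cc": "D",  # ARABIC LETTER FARSI YEH (Dual_Joining)
    "\u064e": "T",  # ARABIC FATHA            (Transparent)
}


def is_letter(ch):
    # $L = [:General_Category=Letter:]
    return unicodedata.category(ch).startswith("L")


def is_virama(ch):
    # $V = [:Canonical_Combining_Class=Virama:], i.e. combining class 9
    return unicodedata.combining(ch) == 9


def in_conjunct_context(text, i):
    # /$L $V ZWNJ/ or /$L $V ZWJ/, with the join control at text[i]
    return i >= 2 and is_virama(text[i - 1]) and is_letter(text[i - 2])


def breaks_cursive_connection(text, i, joining_type):
    # /$LJ $T* ZWNJ $T* $RJ/, with the ZWNJ at text[i]
    left = i - 1
    while left >= 0 and joining_type.get(text[left]) == "T":
        left -= 1  # skip Transparent characters before the ZWNJ
    right = i + 1
    while right < len(text) and joining_type.get(text[right]) == "T":
        right += 1  # skip Transparent characters after the ZWNJ
    return (
        left >= 0
        and joining_type.get(text[left]) in ("L", "D")
        and right < len(text)
        and joining_type.get(text[right]) in ("R", "D")
    )


def join_controls_allowed(text, joining_type=SAMPLE_JOINING_TYPE):
    """Return True if every ZWJ/ZWNJ in text sits in one of the three contexts."""
    for i, ch in enumerate(text):
        if ch == ZWNJ:
            if not (in_conjunct_context(text, i)
                    or breaks_cursive_connection(text, i, joining_type)):
                return False
        elif ch == ZWJ:
            if not in_conjunct_context(text, i):
                return False
    return True


# Malayalam NA + VIRAMA + ZWJ: conjunct context
print(join_controls_allowed("\u0d28\u0d4d\u200d"))  # True
# FARSI YEH + ZWNJ + KHAH: breaks a cursive connection
print(join_controls_allowed("\u06cc\u200c\u062e"))  # True
# ZWJ between two Latin letters: none of the three contexts
print(join_controls_allowed("a\u200db"))            # False
```

The checker only answers whether the join controls appear in a permitted context. It does not check that the remaining characters are otherwise valid in an identifier.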

Affected Languages

  • Malayalam
  • Kannada
  • Bengali
  • Farsi
  • Sinhala
  • Arabic
  • Khmer

ZwjAndZwnjAsIdentifiers (last edited 2010-09-21 03:52:17 by BaijuMuthukadan)
