 * [[http://www.reddit.com/r/Python/comments/dgf1q/how_to_approach_a_complex_issue_where_python_core/|Suggestions from /r/Python community]]

Introduction

Python 3 supports non-ASCII identifiers as per PEP 3131. However, this support is incomplete for languages in which special characters such as ZWJ (zero width joiner, U+200D) and ZWNJ (zero width non-joiner, U+200C) are used extensively. Examples of such languages are Malayalam, Kannada, Sinhala and Farsi.

Unicode Standard on Using ZWJ/ZWNJ in Identifiers

ZWJ and ZWNJ are format control characters. Unicode defines the usage of these characters in identifiers in TR31, section 2.3, "Layout and Format Control Characters".
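As a quick illustration of the "format control" classification, the following sketch uses Python's standard `unicodedata` module to print the name and General_Category of both characters. The constant names `ZWNJ` and `ZWJ` are chosen here for illustration only.

```python
import unicodedata

ZWNJ = "\u200c"  # zero width non-joiner
ZWJ = "\u200d"   # zero width joiner

# Both characters have General_Category "Cf" (Other, Format),
# which is why they are called format control characters.
for ch in (ZWNJ, ZWJ):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```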

Unicode recommends allowing ZWJ/ZWNJ ("the Join_Control characters") in identifiers only in the following three contexts.

  • Allow ZWNJ to break a cursive connection. That is, in a context based on the Joining_Type property, consisting of:
    • A Left-Joining or Dual-Joining character, followed by zero or more Transparent characters, followed by a ZWNJ, followed by zero or more Transparent characters, followed by a Right-Joining or Dual-Joining character
    • This corresponds to the following regular expression (in Perl-style syntax): /$LJ $T* ZWNJ $T* $RJ/
      • where:
        • $T = [:Joining_Type=Transparent:]
        • $RJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:] ]
        • $LJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:] ]
  • Allow ZWNJ in a conjunct context. That is, a sequence of the form:
    • A Letter, followed by a Virama, followed by a ZWNJ
    • This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWNJ/
      • where:
        • $L = [:General_Category=Letter:]
        • $V = [:Canonical_Combining_Class=Virama:]
  • Allow ZWJ in a conjunct context. That is, a sequence of the form:
    • A Letter, followed by a Virama, followed by a ZWJ
    • This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWJ/
      • where:
        • $L = [:General_Category=Letter:]
        • $V = [:Canonical_Combining_Class=Virama:]
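
The three contexts above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how CPython's tokenizer works or a proposed patch. The function names are hypothetical. The standard `unicodedata` module exposes General_Category and Canonical_Combining_Class (Virama is class 9), but not Joining_Type, so `SAMPLE_JOINING_TYPE` is a tiny hand-written table that a real implementation would replace with the full Unicode Character Database data.

```python
import unicodedata

ZWNJ = "\u200c"  # zero width non-joiner
ZWJ = "\u200d"   # zero width joiner

# Hypothetical, deliberately tiny Joining_Type table for illustration:
# D = Dual_Joining, R = Right_Joining, L = Left_Joining, T = Transparent.
SAMPLE_JOINING_TYPE = {
    "\u06cc": "D",  # ARABIC LETTER FARSI YEH
    "\u062e": "D",  # ARABIC LETTER KHAH
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u064e": "T",  # ARABIC FATHA
}

def is_letter(ch):
    # $L = [:General_Category=Letter:]
    return unicodedata.category(ch).startswith("L")

def is_virama(ch):
    # $V = [:Canonical_Combining_Class=Virama:] (class 9)
    return unicodedata.combining(ch) == 9

def in_conjunct_context(text, i):
    # /$L $V ZWNJ/ or /$L $V ZWJ/: a Letter and a Virama precede text[i].
    return i >= 2 and is_letter(text[i - 2]) and is_virama(text[i - 1])

def in_cursive_context(text, i, joining_type=SAMPLE_JOINING_TYPE):
    # /$LJ $T* ZWNJ $T* $RJ/: skip Transparent characters on both sides,
    # then check the joining types of the nearest remaining neighbours.
    left = i - 1
    while left >= 0 and joining_type.get(text[left]) == "T":
        left -= 1
    right = i + 1
    while right < len(text) and joining_type.get(text[right]) == "T":
        right += 1
    return (
        left >= 0
        and joining_type.get(text[left]) in ("L", "D")
        and right < len(text)
        and joining_type.get(text[right]) in ("R", "D")
    )

def join_controls_allowed(text, joining_type=SAMPLE_JOINING_TYPE):
    # True if every ZWNJ/ZWJ in text sits in one of the three contexts.
    for i, ch in enumerate(text):
        if ch == ZWNJ:
            if not (in_conjunct_context(text, i)
                    or in_cursive_context(text, i, joining_type)):
                return False
        elif ch == ZWJ:
            if not in_conjunct_context(text, i):
                return False
    return True
```

For example, `join_controls_allowed("\u0d28\u0d4d\u200d")` (Malayalam NA, virama, ZWJ) accepts the ZWJ because it follows a Letter and a Virama, while `join_controls_allowed("ab\u200dc")` rejects it. The check only covers the placement of the join controls; whether the remaining characters are valid identifier characters is a separate question.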

Affected Languages

  • Malayalam
  • Kannada
  • Bengali
  • Languages that use the Devanagari script (Hindi, Marathi, etc.)
  • Telugu
  • Farsi
  • Sinhala
  • Arabic
  • Khmer

References

ZwjAndZwnjAsIdentifiers (last edited 2010-09-21 03:52:17 by BaijuMuthukadan)