Inroduction
Python 3 Supports Non-ASCII Identifiers as per PEP 3131. But this support is incomplete for certain languages where special characters such as ZWJ, ZWNJ are used extensively. Example for such languages are Malayalam, Kannada, Sinhala, Farsi etc.
Unicode standard on Using ZWJ/ZWNJ etc in Identifiers
ZWJ and ZWNJ are format control characters and unicode defines the usage of these characters in identifiers in TR31 in section 2.3 Layout and Format Control Characters
Unicode recommends allowing usage of ZWJ/ZWNJ or "the Join_Control characters" in Identifiers limited to 3 contexts.
- Allow ZWNJ in breaking a cursive connection : That is, in the context based on the Joining_Type property, consisting of:
- A Left-Joining or Dual-Joining character, followed by zero or more Transparent characters, followed by a ZWNJ, followed by zero or more Transparent characters, followed by a Right-Joining or Dual-Joining character
- This corresponds to the following regular expression (in Perl-style syntax): /$LJ $T* ZWNJ $T* $RJ/
- where:
- $T = [:Joining_Type=Transparent:] $RJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:] ] $LJ = [ [:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:] ]
- where:
- Allow ZWNJ in a conjunct context. That is, a sequence of the form:
- A Letter, followed by a Virama, followed by a ZWNJ
- This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWNJ/
- where:
- $L = [:General_Category=Letter:] $V = [:Canonical_Combining_Class=Virama:]
- where:
- Allow ZWJ in a conjunct context. That is, a sequence of the form:
- A Letter, followed by a Virama, followed by a ZWJ
- This corresponds to the following regular expression (in Perl-style syntax): /$L $V ZWJ/ where:
- $L= [:General_Category=Letter:] $V = [:Canonical_Combining_Class=Virama:]
Affected Languages
- Malayalam
- Kannada
- Bengali
- Languages that use Devanagari Script (Hindi, Marathi..)
- Telugu
- Farsi
- Sinhala
- Arabi
- Khmer