Revision 13 as of 2019-10-19 22:00:22
Comment: Remove Python 2-specific information, leaving a link to previous revision for accessibility
Text handling in Python 3
Python 3 uses two very different types:
- bytes: intended to represent raw byte data. For more information on this type, please consult PEP 358 (http://www.python.org/dev/peps/pep-0358/).
- str: a Unicode character string
Choosing Between "bytes" and "str"
When choosing which type to use for a piece of data, ask yourself: am I manipulating characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
- a network socket manipulates bytes
- a text parser manipulates characters (uses lower, strip, etc. methods)
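As a rough illustration of the distinction, the sketch below moves between the two types with str.encode and bytes.decode. The variable names and the choice of UTF-8 are assumptions made for the example; a real protocol or file format defines its own encoding.

```python
# Text is a sequence of characters, so str methods such as strip() and lower() apply.
text = "  Hello, World  "
cleaned = text.strip().lower()

# Encoding turns characters into bytes, for example before sending them over a socket.
# UTF-8 is an assumption of this sketch.
payload = cleaned.encode("utf-8")

# Decoding turns received bytes back into characters.
roundtrip = payload.decode("utf-8")

print(type(payload).__name__, payload)
print(type(roundtrip).__name__, roundtrip)
```

Keeping the encode and decode steps at the boundary of the program (socket, file, pipe) lets the rest of the code work with str only.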
Iterating over "bytes"
Note that iterating over a bytes object yields integers, not characters:
>>> for item in b'abc':
...     print(item)
...
97
98
99
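If characters or one-byte bytes objects are wanted instead of integers, the integers can be converted explicitly. The following is a small sketch; using chr() this way assumes the data is ASCII or Latin-1 text, which the page does not state.

```python
data = b"abc"

# Iterating yields one integer in range(256) per byte.
values = [item for item in data]

# chr() maps each integer to the character with that code point.
# This is only meaningful for ASCII/Latin-1 data (an assumption of this sketch).
chars = [chr(item) for item in data]

# bytes([item]) wraps a single integer back into a one-byte bytes object.
singles = [bytes([item]) for item in data]

print(values)   # [97, 98, 99]
print(chars)    # ['a', 'b', 'c']
print(singles)  # [b'a', b'b', b'c']
```

For data in any other encoding, decoding the whole bytes object with bytes.decode is the safer route.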
Comparing "bytes"
Comparing one bytes object to another works as expected:
>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False
However, the bytes type is completely distinct from the str type in Python 3. A bytes object never compares equal to a str object, even when both appear to hold the same text:
>>> b'xyz' == 'xyz'
False
Because the comparison silently returns False instead of failing, an incomplete transition can go unnoticed. Running the interpreter with the -b option issues a BytesWarning for such comparisons, and -bb turns them into errors. It also means that the two types don't mix well in the same container:
>>> L = ["1", b"1"]
>>> "1" in L
True
>>> "2" in L
False
>>> b"2" in L
False
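In released versions of Python 3 with default interpreter options, equality between bytes and str evaluates to False rather than raising, while ordering comparisons raise TypeError. The sketch below shows that behaviour and one way to get a meaningful comparison by decoding first; the ASCII encoding and the variable names are assumptions for the example.

```python
mixed = ["1", b"1"]

# Equality across the two types is False by default (no exception).
same = (b"1" == "1")

# Membership tests rely on ==, so each literal only matches its own type.
found_str = "1" in mixed
found_bytes = b"1" in mixed
missing = "2" in mixed

# Decoding first (ASCII assumed here) gives a meaningful comparison.
decoded_equal = (b"1".decode("ascii") == "1")

# Ordering comparisons between the two types are rejected outright.
try:
    b"1" < "1"
    ordering_error = False
except TypeError:
    ordering_error = True

print(same, found_str, found_bytes, missing, decoded_equal, ordering_error)
```

Normalizing every element to a single type before storing it in a container avoids these silent mismatches.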
As with iteration, indexing a bytes object returns an integer, not a bytes object:
>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
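A common way around this is to slice instead of index, since a slice of a bytes object is itself a bytes object. The sketch below contrasts the two; the variable names are illustrative only.

```python
data = b"xyz"

# Indexing returns an integer.
first_int = data[0]

# Slicing returns a bytes object of length 1.
first_bytes = data[0:1]

# Compare an indexed item against an integer, for example via ord().
matches_int = (data[0] == ord("x"))

# Compare a slice against a bytes literal.
matches_slice = (data[0:1] == b"x")

# startswith() avoids the index/slice question altogether.
starts_with_x = data.startswith(b"x")

print(first_int, first_bytes, matches_int, matches_slice, starts_with_x)
```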
Hashing "bytes"
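In released versions of Python 3, bytes is immutable and therefore hashable, so bytes objects can be used as keys in dictionaries. Its mutable counterpart, bytearray, is not hashable and can't be used as a key.
Workarounds for using the contents of a bytearray as a key include:
- convert it to an immutable copy with bytes(value)
- avoid hashing the value at all
The Python 2 buffer(value) builtin does not exist in Python 3, and a separate immutable frozenbytes type is not needed, because bytes already fills that role.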
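In released versions of Python 3, bytes objects are immutable and hashable, while bytearray objects are mutable and unhashable. The following sketch, with illustrative variable names, shows a bytes key working in a dictionary and a bytearray being converted with bytes() before use as a key.

```python
# A bytes object works directly as a dictionary key.
table = {b"abc": 1}

# A bytearray is mutable, so hashing it raises TypeError.
mutable = bytearray(b"abc")
try:
    table[mutable] = 2
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# Converting to bytes makes an immutable, hashable copy with the same contents,
# so this assignment updates the existing b"abc" entry.
table[bytes(mutable)] = 2

print(bytearray_hashable, table)
```

Note that b"abc" and "abc" are different keys, because bytes and str objects never compare equal.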
Historical information
For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see previous page revisions (https://wiki.python.org/moin/BytesStr?action=recall&rev=12).