Comment: Bytes can be stripped.
|
← Revision 13 as of 2019-10-19 22:00:22 ⇥
Remove Python 2-specific information, leaving a link to previous revision for accessibility
Line 1: | Line 1: |
== Strings in Python 2.x == | = Text handling in Python 3 = |
Line 3: | Line 3: |
Python 2.x has two types that can be used to store a string: * {{{str}}}: raw byte data; each element represents a single byte, which can range in value from 0-255. This is the default type for string literals, which is widely considered to be a mistake due to the encoding problems it raises (for more information, see Joel Spolsky's [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets]). * {{{unicode}}}: a string in which each element represents a unicode character. |
Python 3 uses two very different types: * {{{bytes}}}: intended to represent raw byte data. For more information on this type, please consult [[http://www.python.org/dev/peps/pep-0358/|PEP 358]]. * {{{str}}}: a unicode character string |
Line 7: | Line 7: |
Both classes have the same methods and are very similar. | == Choosing Between "bytes" and "str" == |
Line 9: | Line 9: |
== Strings in Python 3000 == Python 3000 uses two very different types: * {{{bytes}}}: similar, but not identical, to Python 2.x's {{{str}}} type. It is intended to represent raw byte data. For more information on this type, please consult [http://www.python.org/dev/peps/pep-0358/ PEP 358]. * {{{str}}}: a unicode character string which is exactly the same type as Python 2.x's {{{unicode}}} type. == Differences Between Python 2.x's "str" and Python 3000's "bytes" == Differences between Python 2.x's {{{str}}} and Python 3000's {{{bytes}}}include: * {{{str}}} is immutable, whereas {{{bytes}}} is mutable. * {{{bytes}}} "lacks" many methods present in {{{str}}}: {{{lower()}}}, {{{upper()}}}, {{{splitlines()}}}, etc. * indexing an item of a {{{bytes}}} object yields an ''integer'', not a bytes object, whereas indexing an item of a Python 2.x {{{str}}} yields another {{{str}}} instance. == Choosing Between "bytes" and "str" in Python 3000 == When you migrate from Python 2.x to Python 3000, you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples: |
When choosing the type you want to use to work with text you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples: |
Line 27: | Line 12: |
* a text parser manipulates characters (use lower, strip, etc. methods) | * a text parser manipulates characters (uses lower, strip, etc. methods) |
Line 52: | Line 37: |
However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3000, and comparisons between them do ''not'' work: | However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3, and comparisons between them do ''not'' work: |
Line 61: | Line 46: |
This should make clearly evident some incomplete transitions. But you also means that you really cant mix then very well: | This should make clearly evident some incomplete transitions. But it also means that you really can't mix then very well: |
Line 83: | Line 68: |
This behaviour is different than Python 2.x: {{{ # In Python 2.x >>> "xyz"[0] 'x' >>> type("xyz"), type("xyz"[0]) (<type 'str'>, <type 'str'>) }}} |
|
Line 103: | Line 79: |
= Historical information = For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see [[https://wiki.python.org/moin/BytesStr?action=recall&rev=12|previous page revisions]]. |
Text handling in Python 3
Python 3 uses two very different types:
- bytes: intended to represent raw byte data. For more information on this type, please consult PEP 358.
- str: a Unicode character string
Choosing Between "bytes" and "str"
When choosing the type to use for working with text, ask yourself: am I manipulating characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
- a network socket manipulates bytes
- a text parser manipulates characters (uses lower, strip, etc. methods)
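The two examples above meet whenever text crosses a byte-oriented boundary. The following is a minimal sketch of that round trip; the payload is made up for illustration, and the choice of UTF-8 is an assumption rather than something a socket would tell you.

```python
# Hypothetical payload, as it might arrive from a socket: raw bytes.
payload = b"Caf\xc3\xa9 MENU\r\n"

# Decode bytes -> str before doing text work (UTF-8 is assumed here).
text = payload.decode("utf-8")
cleaned = text.strip().lower()

# Encode str -> bytes before handing data back to a byte-oriented API.
reply = cleaned.encode("utf-8")

print(cleaned)
print(reply)
```

Decoding once at the boundary and encoding once on the way out keeps the text-handling code free of bytes objects.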
Iterating over "bytes"
It's important to note that the bytes iterator generates integers and not characters:
>>> for item in b'abc':
...     print(item)
...
97
98
99
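If characters or one-byte bytes objects are wanted instead of integers, the integers can be converted explicitly. This is only a sketch; using chr() this way is meaningful here because the sample data is pure ASCII, which is an assumption about the input.

```python
data = b"abc"

# Iterating yields integers in the range 0-255.
as_ints = list(data)

# chr() maps each integer to a character; safe here because the data is ASCII.
as_chars = [chr(i) for i in data]

# bytes([i]) builds a one-byte bytes object from each integer.
as_single_bytes = [bytes([i]) for i in data]

print(as_ints, as_chars, as_single_bytes)
```

For data that is not ASCII, decoding the whole object with an explicit encoding is usually the safer route than converting byte by byte.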
Comparing "bytes"
Comparing one bytes object to another works as expected:
>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False
However, it is important to note that the bytes type is completely distinct from the str type in Python 3. An equality comparison between them is always False, even when the contents look identical, and an ordering comparison raises TypeError:
>>> b'xyz' == 'xyz'
False
>>> b'xyz' < 'xyz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'str'
Because the equality test fails silently, an incomplete transition can go unnoticed. Running the interpreter with the -b option reports such comparisons as a BytesWarning, and -bb turns them into errors. It also means that you really can't mix them very well:
>>> L = ["1", b"2"]
>>> "1" in L
True
>>> "2" in L
False
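In current Python 3 releases an equality test between bytes and str evaluates to False, so a membership test over mixed data can miss values that look the same. One way to avoid this is to normalise everything to str before comparing. The helper name as_text and the UTF-8 default below are hypothetical choices for this sketch.

```python
def as_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


items = ["1", b"1", b"2"]
normalised = [as_text(item) for item in items]

# b"2" is in the list, but it never compares equal to the str "2".
mixed_hit = "2" in items

# After normalisation every element is a str, so the lookup succeeds.
clean_hit = "2" in normalised

print(mixed_hit, clean_hit)
```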
As mentioned earlier, indexing a bytes object returns an integer, not a bytes object:
>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
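When a one-byte bytes object is needed rather than an integer, slicing is one option, since a slice of bytes is always bytes. A short sketch:

```python
data = b"xyz"

first_as_int = data[0]        # indexing gives an integer
first_as_bytes = data[0:1]    # slicing gives a bytes object of length 1

# Either form can be compared, as long as both sides have the same type.
int_match = first_as_int == ord("x")
bytes_match = first_as_bytes == b"x"

# startswith() avoids indexing altogether.
prefix_match = data.startswith(b"x")

print(first_as_int, first_as_bytes, int_match, bytes_match, prefix_match)
```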
Hashing "bytes"
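bytes is immutable and therefore hashable, so bytes objects can be used as keys in dictionaries and as members of sets. Its mutable counterpart, bytearray, is not hashable.
Workarounds for using the contents of a bytearray as a key include:
- convert it to an immutable copy with bytes(value); the older buffer(value) hack does not apply, because buffer does not exist in Python 3, and a separate frozenbytes type is unnecessary, because bytes already fills that role
- avoid using hash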
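In released versions of Python 3, bytes objects are immutable and hashable, while bytearray objects are mutable and unhashable. The sketch below shows a bytes key in a dictionary and a bytearray converted with bytes() before the lookup; the names cache and buf are illustrative only.

```python
# bytes objects work directly as dictionary keys.
cache = {b"key": "value"}

# A bytearray holds the same data but is mutable, so hash() rejects it.
buf = bytearray(b"key")
try:
    hash(buf)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# Converting to bytes makes an immutable copy that can be used for lookup.
lookup = cache[bytes(buf)]

print(bytearray_hashable, lookup)
```

The copy made by bytes(buf) is independent of the bytearray, so later changes to buf do not affect keys already stored.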
Historical information
For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see previous page revisions.