Strings in Python 2.x
Python 2.x has two types that can be used to store a string:
str: raw byte data; each element represents a single byte, which can range in value from 0-255. This is the default type for string literals, which is widely considered to be a mistake due to the encoding problems it raises (for more information, see Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets).
unicode: a string in which each element represents a unicode character.
Both classes have the same methods and are very similar.
Strings in Python 3000
Python 3000 uses two very different types:
bytes: similar, but not identical, to Python 2.x's str type. It is intended to represent raw byte data. For more information on this type, please consult PEP 358.
str: a unicode character string which is exactly the same type as Python 2.x's unicode type.
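As a minimal sketch of how the two Python 3000 types relate, the snippet below converts between them with str.encode and bytes.decode. The sample text and the choice of UTF-8 are assumptions made purely for illustration.

```python
# Assumed sample text; UTF-8 is chosen only for illustration.
text = "caf\u00e9"                 # str: four unicode characters

data = text.encode("utf-8")       # bytes: the raw encoded form
restored = data.decode("utf-8")   # back to str

print(len(text))         # number of characters
print(len(data))         # number of bytes (the accented letter takes two in UTF-8)
print(restored == text)
```

The lengths differ because one character may need more than one byte, which is the reason the two types are kept apart.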
Differences Between Python 2.x's "str" and Python 3000's "bytes"
Differences between Python 2.x's str and Python 3000's bytes include:
str is immutable, whereas bytes is mutable.
bytes "lacks" many methods present in str: lower(), upper(), splitlines(), etc.
indexing an item of a bytes object yields an integer, not a bytes object, whereas indexing an item of a Python 2.x str yields another str instance.
Choosing Between "bytes" and "str" in Python 3000
When you migrate from Python 2.x to Python 3000, you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
- a network socket manipulates bytes
- a text parser manipulates characters (use lower, strip, etc. methods)
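The distinction can be sketched as follows: bytes arrive at an I/O boundary, are decoded once, and the text methods are then applied to characters. The function name normalise_header, the sample payload, and the ASCII encoding are hypothetical, chosen only for illustration.

```python
def normalise_header(raw, encoding="ascii"):
    """Decode raw bytes (as a socket would deliver them) and clean up the text."""
    text = raw.decode(encoding)    # bytes -> characters
    return text.strip().lower()    # character-level operations

# Hypothetical payload standing in for data read from a socket.
payload = b"  Content-Type: TEXT/HTML \r\n"
header = normalise_header(payload)
print(header)
```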
Iterating over "bytes"
It's important to note that the bytes iterator generates integers and not characters:
>>> for item in b'abc':
...     print(item)
...
97
98
99
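If characters are what you want, one option is to decode first and iterate over the resulting str. A minimal sketch, assuming the data is ASCII:

```python
data = b"abc"

# Iterating over bytes yields integers.
as_ints = [item for item in data]

# Decoding first (ASCII assumed here) yields one-character strings instead.
as_chars = [ch for ch in data.decode("ascii")]

print(as_ints)
print(as_chars)
```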
Comparing one bytes object to another works as expected:
>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False
However, it is important to note that the bytes type is completely distinct from the str type in Python 3000, and comparisons between them do not work:
>>> b'xyz' == 'xyz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't compare bytes and str
This should make incomplete transitions clearly evident. But it also means that you really can't mix the two types very well:
>>> L = ["1", b"1"]
>>> "1" in L
True
>>> "2" in L
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't compare str and bytes
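One way to avoid such mixed containers is to normalise every element to a single type before testing membership. A sketch, in which the helper name to_text and the ASCII encoding are assumptions:

```python
def to_text(value, encoding="ascii"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

mixed = ["1", b"1", b"2"]
texts = [to_text(item) for item in mixed]   # every element is now str

print("2" in texts)
print("3" in texts)
```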
As mentioned earlier, indexing a bytes object returns an integer, not a bytes object:
>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
This behaviour differs from that of Python 2.x, where indexing a str yields another str:
# In Python 2.x
>>> "xyz"[0]
'x'
>>> type("xyz"), type("xyz"[0])
(<type 'str'>, <type 'str'>)
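Where a one-element bytes object is wanted rather than an integer, a one-element slice is one option, and comparing the integer against ord() is another. A short sketch:

```python
data = b"xyz"

first_int = data[0]        # indexing: an integer
first_bytes = data[0:1]    # slicing: a bytes object of length 1

print(first_int == ord("x"))
print(first_bytes == b"x")
```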
bytes is mutable, and as a result, it's not hashable. Among other things, this means that bytes objects can't be used as keys in dictionaries.
Hacks and workarounds for this include:
Other solutions include:
- create an immutable frozenbytes type
- avoid using hash
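As a sketch of one possible workaround, a mutable byte sequence can be converted to an immutable snapshot before it is used as a dictionary key. The example uses bytearray, the mutable byte type in released Python 3 versions, as a stand-in for the mutable bytes described here. That substitution and the choice of a tuple as the snapshot are assumptions, not part of the original proposal.

```python
buf = bytearray(b"key")    # mutable, so it cannot be hashed

try:
    hash(buf)
    hashable = True
except TypeError:
    hashable = False

# Take an immutable snapshot (a tuple of integers) and use that as the key.
snapshot = tuple(buf)
table = {snapshot: "value"}

print(hashable)
print(table[tuple(bytearray(b"key"))])
```

The snapshot is a copy, so later changes to buf do not affect the key already stored in the dictionary.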