Differences between revisions 11 and 13 (spanning 2 versions)
Revision 11 as of 2007-09-06 05:04:55
Size: 3596
Comment:
Revision 13 as of 2019-10-19 22:00:22
Size: 2343
Comment: Remove Python 2-specific information, leaving a link to previous revision for accessibility
Line 1: Line 1:
== Strings in Python 2.x == = Text handling in Python 3 =
Line 3: Line 3:
Python 2.x has two types that can be used to store a string:
 * {{{str}}}: raw byte data; each element represents a single byte, which can range in value from 0-255. This is the default type for string literals, which is widely considered to be a mistake due to the encoding problems it raises (for more information, see Joel Spolsky's [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets]).
 * {{{unicode}}}: a string in which each element represents a unicode character.
Python 3 uses two very different types:
 * {{{bytes}}}: intended to represent raw byte data. For more information on this type, please consult [[http://www.python.org/dev/peps/pep-0358/|PEP 358]].
 * {{{str}}}: a unicode character string
Line 7: Line 7:
Both classes have the same methods and are very similar. == Choosing Between "bytes" and "str" ==
Line 9: Line 9:
== Strings in Python 3000 ==

Python 3000 uses two very different types:
 * {{{bytes}}}: similar, but not identical, to Python 2.x's {{{str}}} type. It is intended to represent raw byte data. For more information on this type, please consult [http://www.python.org/dev/peps/pep-0358/ PEP 358].
 * {{{str}}}: a unicode character string which is exactly the same type as Python 2.x's {{{unicode}}} type.

== Differences Between Python 2.x's "str" and Python 3000's "bytes" ==

Differences between Python 2.x's {{{str}}} and Python 3000's {{{bytes}}}include:
 * {{{str}}} is immutable, whereas {{{bytes}}} is mutable.
 * {{{bytes}}} "lacks" many methods present in {{{str}}}: {{{lower()}}}, {{{upper()}}}, {{{splitlines()}}}, etc.
 * indexing an item of a {{{bytes}}} object yields an ''integer'', not a bytes object, whereas indexing an item of a Python 2.x {{{str}}} yields another {{{str}}} instance.

== Choosing Between "bytes" and "str" in Python 3000 ==

When you migrate from Python 2.x to Python 3000, you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
When choosing the type you want to use to work with text you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
Line 27: Line 12:
 * a text parser manipulates characters (use lower, strip, etc. methods)  * a text parser manipulates characters (uses lower, strip, etc. methods)
Line 52: Line 37:
However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3000, and comparisons between them do ''not'' work: However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3, and comparisons between them do ''not'' work:
Line 61: Line 46:
This should make clearly evident some incomplete transitions. But it also means that you really cant mix then very well: This should make clearly evident some incomplete transitions. But it also means that you really can't mix then very well:
Line 83: Line 68:
This behaviour is different than Python 2.x:

{{{
# In Python 2.x
>>> "xyz"[0]
'x'
>>> type("xyz"), type("xyz"[0])
(<type 'str'>, <type 'str'>)
}}}
Line 103: Line 79:


= Historical information =

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see [[https://wiki.python.org/moin/BytesStr?action=recall&rev=12|previous page revisions]].

Text handling in Python 3

Python 3 uses two very different types:

  • bytes: intended to represent raw byte data. For more information on this type, please consult PEP 358.

  • str: a Unicode character string

Choosing Between "bytes" and "str"

When choosing which type to use, ask yourself: am I manipulating characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:

  • a network socket manipulates bytes
  • a text parser manipulates characters (uses lower, strip, etc. methods)
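
The distinction between a character and its integer value can be illustrated with a minimal sketch. The ASCII encoding is an assumption made here for simplicity; real data may need a different one.

```python
# Characters live in str; their encoded form lives in bytes.
text = "A"
code_point = ord(text)             # integer value of the character
encoded = text.encode("ascii")     # str -> bytes (ASCII assumed)
first_byte = encoded[0]            # indexing bytes gives an integer
decoded = encoded.decode("ascii")  # bytes -> str

print(code_point, encoded, first_byte, decoded)  # 65 b'A' 65 A
```

Going from str to bytes always involves an encoding, and going back requires the same one.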

Iterating over "bytes"

Iterating over a bytes object yields integers, not characters:

>>> for item in b'abc':
...   print(item)
97
98
99
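
If single-byte bytes objects or characters are wanted instead of integers, the following sketch shows two possible approaches. The variable names are illustrative, and the decoding step assumes the data is ASCII text.

```python
data = b"abc"

# Wrap each integer back into a length-1 bytes object.
single_bytes = [bytes([value]) for value in data]

# Decode first, then iterate over characters (assumes ASCII text).
characters = list(data.decode("ascii"))

print(single_bytes)  # [b'a', b'b', b'c']
print(characters)    # ['a', 'b', 'c']
```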

Comparing "bytes"

Comparing one bytes object to another works as expected:

>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False

However, the bytes type is completely distinct from the str type in Python 3. An equality comparison between the two never matches (running the interpreter with the -b option emits a BytesWarning for it), and an ordering comparison raises TypeError:

>>> b'xyz' == 'xyz'
False
>>> b'xyz' < 'xyz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'str'

Because the equality test fails silently, an incomplete transition from one type to the other can go unnoticed. It also means that you really can't mix them very well:

>>> L = ["1", b"2"]
>>> "1" in L
True
>>> "2" in L
False
>>> b"2" in L
True
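
One way to avoid mixed comparisons is to convert explicitly before comparing. The sketch below uses a hypothetical helper, same_text, which is not part of the standard library, and assumes the bytes values hold UTF-8 text.

```python
def same_text(left, right, encoding="utf-8"):
    """Compare two values as text, decoding bytes first (hypothetical helper)."""
    def as_text(value):
        # Only bytes need decoding; str values pass through unchanged.
        return value.decode(encoding) if isinstance(value, bytes) else value
    return as_text(left) == as_text(right)

items = ["1", b"2"]
found = any(same_text(item, "2") for item in items)
print(found)  # True
```

In most programs it is simpler to decode once at the input boundary and keep only str values internally.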

As with iteration, indexing a bytes object returns an integer, not a bytes object:

>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
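
When a bytes object is wanted rather than an integer, slicing can be used instead of indexing. A minimal sketch:

```python
data = b"xyz"

first_as_int = data[0]      # indexing returns an integer
first_as_bytes = data[0:1]  # slicing returns a bytes object

print(first_as_int, first_as_bytes, first_as_bytes == b"x")  # 120 b'x' True
```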

Hashing "bytes"

bytes is immutable and therefore hashable, so bytes objects can be used as keys in dictionaries. Its mutable counterpart, bytearray, is not hashable and can't be used as a key.

Workarounds for bytearray include:

  • use bytes(value) to make an immutable, hashable copy

Other solutions include:

  • avoid using hash
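
In current Python 3 releases, hashability can be checked directly. The sketch below, with illustrative variable names, shows a bytes object used as a dictionary key and a bytearray converted with bytes() before lookup.

```python
key = b"header"
table = {key: 1}  # bytes is hashable, so it works as a key

buffer_value = bytearray(b"header")
try:
    hash(buffer_value)
    bytearray_hashable = True
except TypeError:
    # bytearray is mutable and therefore unhashable
    bytearray_hashable = False

# An immutable copy of the bytearray can be used for the lookup.
lookup = table[bytes(buffer_value)]
print(bytearray_hashable, lookup)  # False 1
```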

Historical information

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see previous page revisions.

BytesStr (last edited 2019-10-19 22:00:22 by FrancesHocutt)
