Differences between revisions 11 and 13 (spanning 2 versions)

Text handling in Python 3

Python 3 uses two very different types:

bytes: intended to represent raw byte data. For more information on this type, please consult PEP 358.
str: a Unicode character string

Choosing Between "bytes" and "str"

When choosing a type for working with text, ask yourself: am I manipulating characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:

a network socket manipulates bytes
a text parser manipulates characters (uses lower, strip, etc. methods)
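
A minimal sketch of the distinction, assuming UTF-8 as the encoding (the page does not name one): bytes arrive from an I/O boundary, are decoded to str for character-level work, and are encoded back to bytes before being sent out. The variable names and data are illustrative only.

```python
# Bytes as they might arrive from a socket or a binary file (hypothetical data).
raw = b"  Hello, World  \n"

# Decode to str to work with characters; UTF-8 is an assumption here.
text = raw.decode("utf-8")
cleaned = text.strip().lower()

# Encode back to bytes before writing to a byte-oriented channel.
payload = cleaned.encode("utf-8")

print(cleaned)   # hello, world
print(payload)   # b'hello, world'
```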

Iterating over "bytes"

Iterating over a bytes object yields integers, not characters:

>>> for item in b'abc':
...     print(item)
...
97
98
99
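
If characters or one-byte bytes objects are needed while iterating, each integer can be converted explicitly. This is only a short sketch: chr() maps the integer to the code point with the same number, which matches the intended character only for ASCII data.

```python
data = b"abc"

# Each item produced by iteration is an int in range(256).
codes = list(data)

# chr() gives the str character with that code point (fine for ASCII).
chars = [chr(n) for n in data]

# bytes([n]) wraps a single integer back into a length-1 bytes object.
singles = [bytes([n]) for n in data]

print(codes)    # [97, 98, 99]
print(chars)    # ['a', 'b', 'c']
print(singles)  # [b'a', b'b', b'c']
```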

Comparing "bytes"

Comparing one bytes object to another works as expected:

>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False

However, the bytes type is completely distinct from the str type in Python 3, and a bytes object never compares equal to a str, even when the two look alike:

>>> b'xyz' == 'xyz'
False

Early Python 3000 development versions raised "TypeError: can't compare bytes and str" here. Released versions of Python 3 return False instead, and emit a BytesWarning only when the interpreter is started with the -b option (-bb turns the warning into an error). As a result, an incomplete transition is not flagged by default, and mixing the two types in one container silently fails to match:

>>> L = ["1", b"1"]
>>> "1" in L
True
>>> "2" in L
False
>>> "1" in [b"1"]
False
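
One way to avoid mixed-type comparisons is to normalize every value to str at the boundary. The helper below, to_text, is a hypothetical name introduced for illustration, and the UTF-8 default is an assumption rather than something the page prescribes.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


mixed = ["1", b"1", b"2"]
normalized = [to_text(item) for item in mixed]

print(normalized)         # ['1', '1', '2']
print("2" in normalized)  # True
```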

As mentioned earlier, indexing a bytes object returns an integer, not a bytes object:

>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
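
When a length-1 bytes object is wanted rather than an integer, slicing is one option, since a slice of a bytes object is itself a bytes object. A brief sketch:

```python
data = b"xyz"

first_int = data[0]      # indexing yields an int
first_bytes = data[0:1]  # slicing yields a bytes object of length 1

print(first_int)              # 120
print(first_bytes)            # b'x'
print(first_bytes == b"x")    # True
print(data[0] == ord("x"))    # True
print(data.startswith(b"x"))  # True
```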

Hashing "bytes"

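In released versions of Python 3, bytes is immutable and therefore hashable, so bytes objects can be used as dictionary keys and set members. Its mutable counterpart, bytearray, is not hashable. (Early Python 3000 development versions had a mutable, unhashable bytes type.)

To use the contents of a bytearray as a key:

convert it to an immutable copy with bytes(value); the buffer() built-in from Python 2 no longer exists in Python 3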
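
The sketch below illustrates hashing behavior as it stands in released Python 3 versions, where bytes is immutable and hashable while bytearray is mutable and is not. The dictionary contents are made up for the example.

```python
# bytes is immutable and hashable, so it works as a dictionary key.
counts = {b"GET": 3, b"POST": 1}
print(counts[b"GET"])  # 3

# bytearray is mutable and unhashable, so using it as a key raises TypeError.
buf = bytearray(b"GET")
try:
    counts[buf]
    lookup_failed = False
except TypeError:
    lookup_failed = True
print(lookup_failed)  # True

# An immutable copy of the bytearray can be used instead.
print(counts[bytes(buf)])  # 3
```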

Historical information

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see previous page revisions.

-  ⇤ ← Revision 11 as of 2007-09-06 05:04:55 → 
  Size: 3596
  Editor: EduardoPadoan
  Comment:
+   ← Revision 13 as of 2019-10-19 22:00:22 → ⇥
  Size: 2343
  Editor: FrancesHocutt
  Comment: Remove Python 2-specific information, leaving a link to previous revision for accessibility
 Line 1:
-== Strings in Python 2.x ==
+= Text handling in Python 3 =
 Line 3:
-Python 2.x has two types that can be used to store a string:
 * {{{str}}}: raw byte data; each element represents a single byte, which can range in value from 0-255.  This is the default type for string literals, which is widely considered to be a mistake due to the encoding problems it raises (for more information, see Joel Spolsky's [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets]).
 * {{{unicode}}}: a string in which each element represents a unicode character.
+Python 3 uses two very different types:
 * {{{bytes}}}: intended to represent raw byte data.  For more information on this type, please consult [[http://www.python.org/dev/peps/pep-0358/|PEP 358]].
 * {{{str}}}: a unicode character string
 Line 7:
-Both classes have the same methods and are very similar.
+== Choosing Between "bytes" and "str" ==
 Line 9:
-== Strings in Python 3000 ==

Python 3000 uses two very different types:
 * {{{bytes}}}: similar, but not identical, to Python 2.x's {{{str}}} type.  It is intended to represent raw byte data.  For more information on this type, please consult [http://www.python.org/dev/peps/pep-0358/ PEP 358].
 * {{{str}}}: a unicode character string which is exactly the same type as Python 2.x's {{{unicode}}} type.

== Differences Between Python 2.x's "str" and Python 3000's "bytes" ==

Differences between Python 2.x's {{{str}}} and Python 3000's {{{bytes}}}include:
 * {{{str}}} is immutable, whereas {{{bytes}}} is mutable.
 * {{{bytes}}} "lacks" many methods present in {{{str}}}: {{{lower()}}}, {{{upper()}}}, {{{splitlines()}}}, etc.
 * indexing an item of a {{{bytes}}} object yields an ''integer'', not a bytes object, whereas indexing an item of a Python 2.x {{{str}}} yields another {{{str}}} instance.

== Choosing Between "bytes" and "str" in Python 3000 ==

When you migrate from Python 2.x to Python 3000, you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
+When choosing the type you want to use to work with text you have to ask yourself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
-Line 27:
+Line 12:
- * a text parser manipulates characters (use lower, strip, etc. methods)
+ * a text parser manipulates characters (uses lower, strip, etc. methods)
-Line 52:
+Line 37:
-However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3000, and comparisons between them do ''not'' work:
+However, it is important to note that the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3, and comparisons between them do ''not'' work:
-Line 61:
+Line 46:
-This should make clearly evident some incomplete transitions. But it also means that you really cant mix then very well:
+This should make clearly evident some incomplete transitions. But it also means that you really can't mix then very well:
-Line 83:
+Line 68:
-This behaviour is different than Python 2.x:

{{{
# In Python 2.x
>>> "xyz"[0]
'x'
>>> type("xyz"), type("xyz"[0])
(<type 'str'>, <type 'str'>)
}}}
-Line 103:
+Line 79:
+= Historical information =

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see [[https://wiki.python.org/moin/BytesStr?action=recall&rev=12|previous page revisions]].
