Differences between revisions 2 and 13 (spanning 11 versions)

Text handling in Python 3

Python 3 uses two very different types:

bytes: intended to represent raw byte data. For more information on this type, please consult PEP 358.
str: a unicode character string

Choosing Between "bytes" and "str"

To choose a type, ask whether the code manipulates characters or bytes (integers in the range 0 to 255). "A" is a character; 65 is the integer that encodes it in ASCII. Examples:

a network socket manipulates bytes
a text parser manipulates characters (uses lower, strip, etc. methods)
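
The split above can be sketched as a decode/process/encode round trip. This is only an illustration: the helper name normalize_line is hypothetical, and UTF-8 is an assumed encoding that the page does not specify.

```python
def normalize_line(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes to str, apply text methods, and encode the result again.

    The name and the default encoding are assumptions made for this sketch.
    """
    text = raw.decode(encoding)      # bytes -> str: from here on we handle characters
    text = text.strip().lower()      # character-level operations
    return text.encode(encoding)     # str -> bytes: suitable for a socket or binary file


result = normalize_line(b"  H\xc3\x89LLO \r\n")
print(result)
```

Decoding once at the input boundary and encoding once at the output boundary keeps the character-handling code free of byte-level details.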

Iterating over "bytes"

It's important to note that the bytes iterator generates integers and not characters:

>>> for item in b'abc':
...     print(item)
97
98
99
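
A short sketch of ways to get something other than integers out of an iteration, assuming ASCII content for the decoding step:

```python
data = b"abc"

# Iterating yields integers in range(256), one per byte.
values = list(data)

# To get one-byte bytes objects, wrap each integer in a list before calling bytes().
pieces = [bytes([n]) for n in data]

# To get characters, decode the whole object first, then iterate over the str.
chars = list(data.decode("ascii"))

print(values, pieces, chars)
```

Note that bytes(n) with a bare integer n creates n zero bytes, which is why the integer is wrapped in a list above.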

Comparing "bytes"

Comparing one bytes object to another works as expected:

>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False

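However, the bytes type is completely distinct from the str type in Python 3. An equality comparison between them does not raise an error; it always evaluates to False, even when the bytes object holds the encoded form of the string:

>>> b'xyz' == 'xyz'
False

Ordering comparisons between the two types do raise TypeError:

>>> b'xyz' < 'xyz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'str'

Because equality fails silently, an incomplete transition from one type to the other is easy to miss. Running the interpreter with the -b option issues a BytesWarning for such comparisons. Mixing the two types in one container also gives results that depend on the type of the value being looked up:

>>> L = ["1", b"2"]
>>> "1" in L
True
>>> "2" in L
False
>>> b"2" in L
True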
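
In released versions of Python 3, == between bytes and str evaluates to False rather than raising. One way to avoid surprises in mixed containers is to normalize everything to a single type first. A minimal sketch, where to_text is a hypothetical helper and UTF-8 is an assumed encoding:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


mixed = ["1", b"1", b"2"]

# Equality across the two types is False, so membership depends on the lookup type.
str_lookup = "2" in mixed
bytes_lookup = b"2" in mixed

# After normalizing, membership no longer depends on how each item arrived.
normalized = [to_text(item) for item in mixed]
normalized_lookup = "2" in normalized

print(str_lookup, bytes_lookup, normalized_lookup)
```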

As mentioned earlier, indexing a bytes object returns an integer, not a bytes object of length one:

>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
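
When a one-byte bytes object is wanted instead of an integer, slicing is one option. A brief sketch:

```python
data = b"xyz"

first_int = data[0]          # indexing gives an int
first_bytes = data[0:1]      # slicing gives a bytes object of length one

int_matches = first_int == ord("x")        # compare an index against an integer
slice_matches = first_bytes == b"x"        # compare a slice against bytes
prefix_matches = data.startswith(b"x")     # or avoid indexing altogether

print(first_int, first_bytes, int_matches, slice_matches, prefix_matches)
```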

Hashing "bytes"

bytes is immutable and therefore hashable, so bytes objects can be used as keys in dictionaries and as members of sets. The mutable counterpart, bytearray, is not hashable.

To use the contents of a bytearray as a key:

make an immutable copy with bytes(value)
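
In released versions of Python 3, bytes objects are immutable and hashable, while bytearray objects are mutable and unhashable. The sketch below illustrates the difference and the conversion with bytes(); the variable names are only illustrative.

```python
cache = {b"key": "value"}            # bytes is hashable, so it works as a dict key
print(cache[b"key"])

buf = bytearray(b"key")              # bytearray is mutable and therefore unhashable
try:
    hash(buf)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False
print(bytearray_hashable)

# An immutable copy of the bytearray can be used for the lookup instead.
looked_up = cache[bytes(buf)]
print(looked_up)
```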

Historical information

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see previous page revisions.

- Revision 2 as of 2007-08-10 23:23:48
  Size: 989
  Editor: neu67-4-88-160-66-91
  Comment:
+ Revision 13 as of 2019-10-19 22:00:22
  Size: 2343
  Editor: FrancesHocutt
  Comment: Remove Python 2-specific information, leaving a link to previous revision for accessibility
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-== Python 2.x ==
+= Text handling in Python 3 =
 Line 3:
-Python 2.x has two types to store a string:
 * str: bytes string procced as character string which is a mistake
 * unicode: character string (unicode)
+Python 3 uses two very different types:
 * {{{bytes}}}: intended to represent raw byte data.  For more information on this type, please consult [[http://www.python.org/dev/peps/pep-0358/|PEP 358]].
 * {{{str}}}: a unicode character string
 Line 7:
-Both classes has same methods and are very similar.
+== Choosing Between "bytes" and "str" ==
 Line 9:
-== Python 3000 ==
+To choose a type, ask whether the code manipulates characters or bytes (integers in the range 0 to 255). "A" is a character; 65 is the integer that encodes it in ASCII. Examples:
 Line 11:
-Python 3000 use two very different types:
 * bytes: bytes string which can be see as a list of [0..255] integers
 * str: character string (unicode), exactly the same type than Python 2.x "unicode"
+ * a network socket manipulates bytes
 * a text parser manipulates characters (uses lower, strip, etc. methods)
-Line 15:
+Line 14:
-== old str and new bytes ==
+== Iterating over "bytes" ==
-Line 17:
+Line 16:
-Differences between Python 2.x "str" and Python 3000 "bytes":
 * str is immutable, bytes is mutable
 * bytes "lacks" many methods: strip, lstrip, rstrip, lower, upper, etc.
+It's important to note that the {{{bytes}}} iterator generates integers and not characters:
-Line 21:
+Line 18:
-== choose between bytes and str ==
+{{{
>>> for item in b'abc':
...     print(item)
97
98
99
}}}
-Line 23:
+Line 26:
-When you migration from Python 2.x to Python 3000, you have to ask youself: do I manipulate characters or bytes (integers)? "A" is a character and 65 is an integer. Examples:
 * a network socket manipulate bytes
 * a text parser manipulates characters (use lower, strip, etc. methods)
+== Comparing "bytes" ==

Comparing one {{{bytes}}} object to another works as expected:

{{{
>>> b'xyz' == b'xyz'
True
>>> b'xyz' == b'abc'
False
}}}

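However, the {{{bytes}}} type is completely distinct from the {{{str}}} type in Python 3. An equality comparison between them does ''not'' raise an error; it always evaluates to {{{False}}}, even when the {{{bytes}}} object holds the encoded form of the string:

{{{
>>> b'xyz' == 'xyz'
False
}}}

Ordering comparisons between the two types do raise {{{TypeError}}}:

{{{
>>> b'xyz' < 'xyz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'str'
}}}

Because equality fails silently, an incomplete transition from one type to the other is easy to miss. Running the interpreter with the {{{-b}}} option issues a {{{BytesWarning}}} for such comparisons. Mixing the two types in one container also gives results that depend on the type of the value being looked up:

{{{
>>> L = ["1", b"2"]
>>> "1" in L
True
>>> "2" in L
False
>>> b"2" in L
True
}}}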


As mentioned earlier, indexing a {{{bytes}}} object returns an integer, not a {{{bytes}}} object of length one:

{{{
>>> b'xyz'[0] == b'x'
False
>>> b'xyz'[0]
120
}}}


=== Hashing "bytes" ===

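{{{bytes}}} is immutable and therefore hashable, so {{{bytes}}} objects can be used as keys in dictionaries and as members of sets. The mutable counterpart, {{{bytearray}}}, is not hashable.

To use the contents of a {{{bytearray}}} as a key:
 * make an immutable copy with {{{bytes(value)}}}

Other options include:
 * keep the data in {{{bytes}}} from the start; a separate immutable {{{frozenbytes}}} type is not needed
 * avoid hash-based containers when the data must stay mutable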


= Historical information =

For historical information that may be useful in porting or maintaining remaining Python 2 systems, please see [[https://wiki.python.org/moin/BytesStr?action=recall&rev=12|previous page revisions]].
