Differences between revisions 3 and 4
Revision 3 as of 2005-01-14 11:30:02
Size: 6061
Editor: 212
Comment:
Revision 4 as of 2005-01-14 23:16:20
Size: 7525
Editor: c-67-165-222-53
Comment: organizational rewrite
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
= Why lists are not used as dictionary keys = = Why Lists Can't Be Dictionary Keys =
Line 5: Line 5:
The existence of lists and tuples and the impossibility to use
lists as dictionary keys is a constant source of irritation for
Python newcomers. To understand why tuples but not lists can be
used as dictionary keys let us first take a closer look at
dictionaries.
Newcomers to Python often wonder why, while the language includes both a tuple and a list type, tuples are usable as a dictionary keys, while lists are not. This was a deliberate design decision, and can best be explained by first understanding how Python dictionaries work.
Line 11: Line 7:
== How dictionaries work == == How Dictionaries Work ==
Line 13: Line 9:
Unlike strings, lists and tuples a dictionary is not a sequence but
a mapping. A dictionary d maps key values to data values:
Dictionaries, in Python, are also known as "mappings", because they "map" or "associate" key objects to value objects:
Line 16: Line 11:
{{{
data = d[key]
{{{#!python
# retrieve the value for a particular key
value = d[key]
Line 20: Line 16:
To accomplish this the dictionary has to search the key in a key set
and to return the associated data value. If the dictionary would search
the key set sequentially this could take some time because the search
time is proportional to the size of the key set: O(n). Fortunately the
dictionary does better: It computes the storage location from the key
and jumps directly to this storage location. The search time is the
same no matter how large the key set is: O(1).
Thus, Python mappings must be able to, given a particular key object, determine which (if any) value object is associated with a given key.
One simple approach would be to store a list of (key, value) pairs, and then search the list sequentially every time a value was requested.
However, this approach would be very slow with a large number of items - in complexity terms, this algorithm would be O(n), where n is the number of items in the mapping.
Line 28: Line 20:
This search method is called hashing and the function that computes
the storage location is called a hash function. Different hashes must
be a proof of inequality (otherwise equal items could be inserted into
different storage locations) leading to the following strict condition
for a hash function hf:

{{{
for all i1, i2: hf(i1) != hf(i2) => i1 != i2
}}}

With Python's functions hash() and cmp() this reads:

{{{
(*) for all i1, i2: hash(i1) != hash(i2) => cmp(i1, i2) != 0
}}}
 
Example of a hash function:

{{{
#!python
def hashfunc(string):
    return ord(string[0])
}}}

This function maps strings to integers in the range [0:256] and could
be used to find strings in an array of length 256. What happens if
two strings start with the same character? They have to be stored
at the same array location and we have a hash collision. If too many
strings are stored in the array then hash collisions occur frequently
and the strings stored at the same location have to be searched
sequentially. Of course this is not desired and the hash function has
to be carefully designed in order to avoid frequent hash collisions.
This means that equal hashes should be be a good hint for equality.
Thus a hash function should satisfy in most cases:

{{{
for all i1, i2: hash(i1) == hash(i2) => cmp(i1, i2) == 0
}}}

Fortunately this is the case for Python's dictionary hash function
so that dictionary search is very fast.
 
== Pseudo code of a dictionary lookup ==

To get a complete picture here is a sketch of a dictionary lookup
(data = d[key] expressed as: data = lookup(d, key)):

{{{
#!python
Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function.
Such a hash function takes the information in a key object and uses it to produce an integer, called a hash value.
This hash value is then used to determine which "bucket" this (key, value) pair should be placed into.
Pseudocode for this lookup function might look something like: {{{#!python
Line 82: Line 29:
          supposed to be an array of collision lists. A collision
          list contains (key,value) pairs and should have only one
          entry
.
          supposed to be an array of "buckets" or "collision lists"
          which contain the (key,value) pairs.
Line 88: Line 34:
          return value of the lookup is then pair[1]. Equality (==) is
          defined by the function cmp(). The return value for equality
          is 0, for inequality some other value.
          return value of the lookup is then pair[1].
Line 95: Line 39:
        if cmp(key, pair[0]) == 0:         if key == pair[0]:
Line 101: Line 45:
== Data types usable as dictionary keys == For such a lookup algorithm to work ''correctly'', the hash functions provided must guarantee that if two keys produce different hash values then the two key objects are not equivalent, that is,
{{{
for all i1, i2, if hash(i1) != hash(i2), then i1 != i2
}}}
Otherwise, checking the hash value of a key object might make us look in the wrong bucket and thus never find the associated value.
Line 103: Line 51:
The lookup function above shows which data types can be used as dictionary
keys:
For such a lookup algorithm to work ''efficiently'', most buckets should have only a small number of items (preferably only one).
Consider what would happen with the following hash function: {{{#!python
def hash(obj):
    return 1
}}}
Note that this function meets the requirements of a hash function - every time two keys have different hash values, they represent different objects.
(This is trivially true because ''no'' keys have different hash values - they all have the value 1.)
But this is a bad hash function because it means that all (key, value) pairs will be placed in a single list, and so each lookup will require searching this entire list.
Thus a (very) desirable property of a hash function is that if two keys produce the same hash values, then the key objects are equivalent, that is,
{{{
for all i1, i2, if hash(i1) == hash(i2), then i1 == i2
}}}
Hash functions that can approximate this property well will distribute (key, value) pairs evenly across the buckets, and keep lookup time down.
Line 106: Line 65:
----
'''A data type must be a valid input for hash() and cmp() to be usable
as a dictionary key. hash() and cmp() must satisfy the condition (*)
above for this data type.'''
----
== Types Usable as Dictionary Keys ==
Line 112: Line 67:
=== Why tuples are OK and lists are not === The discussion above should explain why Python requires that:
Line 114: Line 69:
Is a list a suitable candidate for a dictionary key? hash(list) = id(list) and cmp(list1, list2) = cmp(id(list1), id(list2)) would satisfy condition (*). But this definition completely ignores list contents, causing big surprises for programmers: '''To be used as a dictionary key, an object must support the hash function (e.g. through `__hash__`), equality comparison (e.g. through `__eq__` or `__cmp__`), and must satisfy the correctness condition above.'''
Line 116: Line 71:
 * Different lists with the same content would be mapped to different dictionary values.
 * Dictionary lookup with list literals would be impossible.
=== Lists as Dictionary Keys ===
Line 119: Line 73:
To avoid these surprises dictionary lookup would have to use list contents
instead of list id. Consider the (hypothetical, not working) code:
That said, the simple answer to why lists cannot be used as dictionary keys is that lists do not provide a valid `__hash__` method. Of course, the obvious question is, "Why not?"
Line 122: Line 75:
{{{
>>> d = dict()
>>> li = [1,2,3]
>>> d[li] = '123'
>>> d[[1,2,3]]
'123'
Consider what kinds of hash functions could be provided for lists.

If lists hashed by id, this would certainly be valid given Python's definition of a hash function -- lists with different hash values would have different ids.
But lists are containers, and most other operations on them deal with them as such.
So hashing lists by their id instead would produce unexpected behavior such as:
 * Looking up different lists with the same contents would produce different results, even though ''comparing'' lists with the same contents would indicate them as equivalent.
 * Using a list literal in a dictionary lookup would be pointless -- it would ''always'' produce a KeyError.

If lists were hashed by their contents (as tuples are), this, too, would be a valid hash function - lists with different hash values would have different contents.
So, again, the problem is not in the definition of the hash function.
But what should happen when a list, being used as a dictionary key, is modified?
If the modification changed the hash value of the list (because it changed the contents of the list), then the list would be in the wrong "bucket" of the dictionary.
This could end up with unexpected errors like:
{{{#!python
>>> l = [1, 2]
>>> d = {}
>>> d[l] = 42
>>> l.append(3)
>>> d[l]
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
KeyError: [1, 2, 3]
>>> d[[1, 2]]
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
KeyError: [1, 2]
Line 129: Line 102:
where the value `42` is no longer available because the the list that hashes to the same value, `[1, 2]`, is not equivalent to the modified list, and the value that is equivalent to the modified list, `[1, 2, 3]` does not hash to the same value.
Since the dictionary doesn't know when a key object is modified, such errors could only be produced at key lookup time, not at object modification time, which could make such errors quite hard to debug.
Line 130: Line 105:
Now assume li is changed (e.g. li[2] = 4). There are two options to handle this.
Let the dictionary ignore changes
Line 133: Line 106:
{{{
>>> d[li]
KeyError: [1,2,4] not found in dictionary. (even worse: a previously existing [1,2,4] map is returned).
}}}
Having found that both ways of hashing lists have some undesirable side-effects, it should be more obvious why Python takes the stance that:
Line 138: Line 108:
or let the dictionary follow changes '''
The builtin list type should not be used as a dictionary key.
'''
Line 140: Line 112:
{{{
>>> d[[1,2,3]]
MovedError: Please address future lookups to [1,2,4] :)
}}}
Note that since tuples are immutable, they do not run into the troubles of lists - they can be hashed by their contents without worries about modification.
Thus, in Python, they provide a valid `__hash__` method, and are thus usable as dictionary keys.
Line 145: Line 115:
Both are pretty unsatisfactory. Conclusion: === User Defined Types as Dictionary Keys ===
Line 147: Line 117:
----
'''
Dictionary lookup with
mutable types like lists is a source of unpleasant surprises for the
programmer and therefore impossible in Python.
'''
----
What about instances of user defined types?
Line 155: Line 119:
This is the point where tuples come in. They have nearly the same interface as lists but cannot
be changed once they have been created thereby avoiding the problems discussed for lists. Thus tuples can be used as dictionary keys.
By default, all user defined types are usable as dictionary keys with `hash(object)` defaulting to `id(object)`, and `cmp(object1, object2)` defaulting to `cmp(id(object1), id(object2))`.
This same suggestion was discussed above for lists and found unsatisfactory. Why are user defined types different?
Line 158: Line 122:
=== Instances of user defined classes ===  1. In the cases where an object must be placed in a mapping, object identity is often much more important than object contents.
 2. In the cases where object content really is important, the default settings can be redefined by overridding `__hash__` and `__cmp__` or `__eq__`.
Line 160: Line 125:
What about instances of user defined classes?

User defined classes can be anything the programmer likes but yet their instances are usable as dictionary keys because there is a default: hash(object) = id(object) and cmp(object1, object2) = cmp(id(object1), id(object2)). The same setting has been discussed above for lists and has been found unsatisfactory. What is different here?

 1. It is a default setting which can be adapted to each class by defining/overriding the methods `__hash__` and `__cmp__`.
 2. For objects identity is more important than for lists.
Note that it is often better practice, when an object is to be associated with a value, to simply assign that value as one of the object's attributes.

Why Lists Can't Be Dictionary Keys

Newcomers to Python often wonder why, while the language includes both a tuple and a list type, tuples are usable as a dictionary keys, while lists are not. This was a deliberate design decision, and can best be explained by first understanding how Python dictionaries work.

How Dictionaries Work

Dictionaries, in Python, are also known as "mappings", because they "map" or "associate" key objects to value objects:

   1 # retrieve the value for a particular key
   2 value = d[key]

Thus, Python mappings must be able to, given a particular key object, determine which (if any) value object is associated with a given key. One simple approach would be to store a list of (key, value) pairs, and then search the list sequentially every time a value was requested. However, this approach would be very slow with a large number of items - in complexity terms, this algorithm would be O(n), where n is the number of items in the mapping.

Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function. Such a hash function takes the information in a key object and uses it to produce an integer, called a hash value. This hash value is then used to determine which "bucket" this (key, value) pair should be placed into. Pseudocode for this lookup function might look something like:

   1 def lookup(d, key):
   2     '''dictionary lookup is done in three steps:
   3        1. A hash value of the key is computed using a hash function.
   4 
   5        2. The hash value addresses a location in d.data which is
   6           supposed to be an array of "buckets" or "collision lists"
   7           which contain the (key,value) pairs.
   8 
   9        3. The collision list addressed by the hash value is searched
  10           sequentially until a pair is found with pair[0] == key. The
  11           return value of the lookup is then pair[1].
  12     '''
  13     h = hash(key)                  # step 1
  14     cl = d.data[h]                 # step 2
  15     for pair in cl:                # step 3
  16         if key == pair[0]:
  17             return pair[1]
  18     else:
  19         raise KeyError, "Key %s not found." % key

For such a lookup algorithm to work correctly, the hash functions provided must guarantee that if two keys produce different hash values then the two key objects are not equivalent, that is,

for all i1, i2, if hash(i1) != hash(i2), then i1 != i2

Otherwise, checking the hash value of a key object might make us look in the wrong bucket and thus never find the associated value.

For such a lookup algorithm to work efficiently, most buckets should have only a small number of items (preferably only one). Consider what would happen with the following hash function:

   1 def hash(obj):
   2     return 1

Note that this function meets the requirements of a hash function - every time two keys have different hash values, they represent different objects. (This is trivially true because no keys have different hash values - they all have the value 1.) But this is a bad hash function because it means that all (key, value) pairs will be placed in a single list, and so each lookup will require searching this entire list. Thus a (very) desirable property of a hash function is that if two keys produce the same hash values, then the key objects are equivalent, that is,

for all i1, i2, if hash(i1) == hash(i2), then i1 == i2

Hash functions that can approximate this property well will distribute (key, value) pairs evenly across the buckets, and keep lookup time down.

Types Usable as Dictionary Keys

The discussion above should explain why Python requires that:

To be used as a dictionary key, an object must support the hash function (e.g. through __hash__), equality comparison (e.g. through __eq__ or __cmp__), and must satisfy the correctness condition above.

Lists as Dictionary Keys

That said, the simple answer to why lists cannot be used as dictionary keys is that lists do not provide a valid __hash__ method. Of course, the obvious question is, "Why not?"

Consider what kinds of hash functions could be provided for lists.

If lists hashed by id, this would certainly be valid given Python's definition of a hash function -- lists with different hash values would have different ids. But lists are containers, and most other operations on them deal with them as such. So hashing lists by their id instead would produce unexpected behavior such as:

  • Looking up different lists with the same contents would produce different results, even though comparing lists with the same contents would indicate them as equivalent.

  • Using a list literal in a dictionary lookup would be pointless -- it would always produce a KeyError.

If lists were hashed by their contents (as tuples are), this, too, would be a valid hash function - lists with different hash values would have different contents. So, again, the problem is not in the definition of the hash function. But what should happen when a list, being used as a dictionary key, is modified? If the modification changed the hash value of the list (because it changed the contents of the list), then the list would be in the wrong "bucket" of the dictionary. This could end up with unexpected errors like:

   1 >>> l = [1, 2]
   2 >>> d = {}
   3 >>> d[l] = 42
   4 >>> l.append(3)
   5 >>> d[l]
   6 Traceback (most recent call last):
   7   File "<interactive input>", line 1, in ?
   8 KeyError: [1, 2, 3]
   9 >>> d[[1, 2]]
  10 Traceback (most recent call last):
  11   File "<interactive input>", line 1, in ?
  12 KeyError: [1, 2]

where the value 42 is no longer available because the the list that hashes to the same value, [1, 2], is not equivalent to the modified list, and the value that is equivalent to the modified list, [1, 2, 3] does not hash to the same value. Since the dictionary doesn't know when a key object is modified, such errors could only be produced at key lookup time, not at object modification time, which could make such errors quite hard to debug.

Having found that both ways of hashing lists have some undesirable side-effects, it should be more obvious why Python takes the stance that:

The builtin list type should not be used as a dictionary key.

Note that since tuples are immutable, they do not run into the troubles of lists - they can be hashed by their contents without worries about modification. Thus, in Python, they provide a valid __hash__ method, and are thus usable as dictionary keys.

User Defined Types as Dictionary Keys

What about instances of user defined types?

By default, all user defined types are usable as dictionary keys with hash(object) defaulting to id(object), and cmp(object1, object2) defaulting to cmp(id(object1), id(object2)). This same suggestion was discussed above for lists and found unsatisfactory. Why are user defined types different?

  1. In the cases where an object must be placed in a mapping, object identity is often much more important than object contents.
  2. In the cases where object content really is important, the default settings can be redefined by overridding __hash__ and __cmp__ or __eq__.

Note that it is often better practice, when an object is to be associated with a value, to simply assign that value as one of the object's attributes.

DictionaryKeys (last edited 2022-01-21 13:47:38 by eriky)

Unable to edit the page? See the FrontPage for instructions.