The addresses of keys are stored very far from each other - python

I'd like to explore the hash table:
In [1]: book = {"apple":0.67, "milk":1.49, "avocado":1.49, "python":2}
In [5]: [hex(id(key)) for key in book]
Out[5]: ['0x10ffffc70', '0x10ffffab0', '0x10ffffe68', '0x10ee1cca8']
The addresses show that the keys are stored far away from each other, especially the key "python".
I assumed that they would be adjacent to one another.
How could this happen? Is this still good for performance?

There are two ways we can interpret your confusion: either you expected the id() to be the hash function for the keys, or you expected keys to be relocated to the hash table and, since in CPython the id() value is a memory location, that the id() values would say something about the hash table size. We can address both by talking about Python's dictionary implementation and how Python deals with objects in general.
Python dictionaries are implemented as a hash table, which is a table of limited size. To store keys, a hash function generates an integer (same integer for equal values), and the key is stored in a slot based on that number using a modulo function:
slot = hash(key) % len(table)
This can lead to collisions, so a hash function that produces values across a large range helps reduce the chance of such collisions. You still have to deal with collisions anyway, but you want to minimise them.
Python does not use the id() function as the hash function here, because that would not produce the same hash for equal values! If equal values did not produce the same hash, then you couldn't use a second "hello world" string to find the right slot again: after dictionary["hello world"] = "value", the test "hello world" in dictionary would produce a different id() value, hash to a different slot, and you would not know that that specific string value had already been used as a key.
Instead, objects are expected to implement a __hash__ method, and you can see what that method produces for various objects with the hash() function.
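For example (a quick sketch; the exact hash values vary between interpreter runs because string hashing is randomised, but the comparisons always hold):
a = "hello world"
b = "".join(["hello", " ", "world"])  # built at runtime, so a separate object
print(a == b)              # True
print(hash(a) == hash(b))  # True  -> equal values map to the same slot
print(a is b)              # False -> distinct objects, different id() values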
Because keys stored in a dictionary must hash consistently, Python won't let you use the mutable built-in container types (such as lists) as dictionary keys. Otherwise, if you could change their value, they would no longer be equal to another such object with the old value and same hash, and you wouldn't find them in the slot that their new hash would map to.
Note that Python puts all objects in a dynamic heap and uses references everywhere to relate the objects. Dictionaries hold references to keys and values; putting a key into a dictionary does not relocate the key in memory, and the id() of the key won't change. If keys were relocated, a requirement of the id() function would be violated; the documentation states: This is an integer which is guaranteed to be unique and constant for this object during its lifetime.
As for those collisions: Python deals with collisions by looking for a new slot with a fixed formula, finding an empty slot in a predictable but pseudorandom series of slot numbers; see the dictobject.c source code comments if you want to know the details. As the table fills up, Python dynamically grows the table to fit more elements, so there will always be empty slots.
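If you want a feel for how finding another slot works, here is a toy open-addressing table, for illustration only; CPython's real probe sequence in dictobject.c uses a perturbation scheme rather than simple linear probing, and the real table also resizes:
class ToyTable:
    def __init__(self, size=8):
        self.slots = [None] * size  # a fixed-size table of (key, value) slots

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # collision: probe the next slot
        self.slots[i] = (key, value)

    def lookup(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)

t = ToyTable()
t.insert("apple", 0.67)
t.insert("milk", 1.49)
print(t.lookup("milk"))  # 1.49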

Related

handling hash collisions in python dictionaries

I have a bunch of dictionaries in python, each dictionary containing user information e.g.:
NewUserDict={'name': 'John', 'age':27}
I collect all these user info dictionaries within a larger dictionary container, using the hash value of each dictionary as the key (Hashing a dictionary?).
What would be the best way to handle hash collisions, when adding new unique users to the dictionary? I was going to manually compare the dictionaries with colliding hash values, and just add some random number to the more recent hash value, e.g.:
if new_hash in larger_dictionary:
    if larger_dictionary[new_hash] != NewUserDict:
        new_hash = new_hash + somerandomnumber
What would be standard way of handling this? Alternatively, how do I know if I should be worrying about collisions in the first place?
Generally, you would use the most unique element of your user record. And this usually means that the system in general has a username or a unique ID per record (user), which is guaranteed to be unique. The username or ID would be the unique key for the record. Since this is enforced by the system itself, for example by means of an auto-increment key in a database table, you can be sure that there is no collision.
THAT unique key therefore should be the key in your map to allow you to find a user record.
However, if for some reason you don't have access to such a guaranteed-to-be-unique key, you can certainly create a hash from the record (as described by you) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision, but you simply deal with it.
A quick and commonly used algorithm for that goes like this: Use a hash over the record to create a key, as you already do. This key may potentially not be unique. Now store a list of records at the location indicated by the key. We call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, find the entry, then sequentially search through the list/bucket at that location to find the entry you want.
Here's an example:
mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]
In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, but there is a collision: Both the 'John' and 'Bob' entry have this hash value. No matter, you find the bucket stored at mymap[123] and iterate over it to find the value.
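A minimal, runnable sketch of that bucket approach (the record_key(), store() and find() helper names are just illustrative, not from your code):
def record_key(record):
    return hash(frozenset(record.items()))

def store(table, record):
    bucket = table.setdefault(record_key(record), [])  # create the bucket on demand
    if record not in bucket:                           # keep entries unique
        bucket.append(record)

def find(table, record):
    for candidate in table.get(record_key(record), []):  # scan the bucket
        if candidate == record:
            return candidate
    return None

mymap = {}
store(mymap, {'name': 'John', 'age': 27})
store(mymap, {'name': 'Bob', 'age': 19})
print(find(mymap, {'name': 'Bob', 'age': 19}))  # {'name': 'Bob', 'age': 19}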
This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).
Performance generally only becomes an issue when you have lots of collisions (when the lists for each bucket get really long), which you'll avoid with a good hash function.
However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.
using the hash value of each dictionary as the key
You are not using the hash value of a dict. Dicts don't have hash values. From the link, it looks like you're using
hash(frozenset(my_dict.items()))
in which case you should instead just be using
frozenset(my_dict.items())
as the key directly. Hash collisions will then be handled for you by the normal dict collision handling.
In general, you should not use hashes as dict keys, as doing so defeats collision resolution. You should use whatever hashed to that hash value as the key.
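For illustration, a minimal sketch of that suggestion (the users container name is just an assumption for the example):
users = {}
NewUserDict = {'name': 'John', 'age': 27}

key = frozenset(NewUserDict.items())  # hashable, and equal for equal dicts
if key not in users:                  # the dict resolves hash collisions itself
    users[key] = NewUserDict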
In general, a collision happens when multiple keys hash to the same bucket. In that case, we need to make sure that we can distinguish between those keys.
Chaining is one of the popular techniques used for collision resolution in hash tables. For example, say the two strings "welcome to stackoverflow" and "how to earn reputation in SO?" yield hash codes 100 and 200 respectively. Assuming the total array size is 10, both of them end up in the same bucket (100 % 10 and 200 % 10 are both 0). Another approach to resolving collisions while hashing is open addressing.
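To make the bucket arithmetic concrete (using the made-up hash codes from the example above):
table_size = 10
print(100 % table_size)  # 0
print(200 % table_size)  # 0 -> both keys land in bucket 0, a collision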
You can read this article on Python Dictionary Implementations, which talks about handling collisions, since Python dictionaries are implemented using hash tables.

How to accept a database and print out a list in python

So I have to create a simple program that accepts a database with a bunch of artists and their works of art with the details of the art. Given an artist, I have to search through the database to find all the entries that have the same artist and return them. I'm not allowed to use other built-in functions nor import anything. Can someone tell me why it's creating the error and what it means?
def works_by_artists(db,artists):
    newlist = {}
    for a in db.keys():
        for b in db[artists]:
            if a == b:
                newlist.append(a);
    return newlist
This prints out an error:
for b in db[artists]:
TypeError: unhashable type: 'list'
A dictionary can accept only some kinds of values as keys. In particular, they must be hashable, which means they cannot change while in the dictionary ("immutable"), and there must be a function known to Python that takes the value and returns an integer that somehow represents that value. Integers, floats, strings, and tuples are immutable and hashable, while standard lists, dictionaries, and sets are not. The hash value of the key is used to store the key-value pair in the standard Python dictionary. This method was chosen for the sake of speed when looking up a key, but the speed comes at the cost of limiting the possible key values. Other ways could have been chosen, but this one works well for the dictionary's intended purpose in Python.
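A quick illustration (nothing here is from your program; it just shows which kinds of keys work and reproduces the same error):
d = {}
d[(1, 2)] = "tuples are hashable"
d["Monet"] = "strings too"
try:
    d[["Monet", "Manet"]] = "lists are not"
except TypeError as e:
    print(e)  # unhashable type: 'list'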
You tried to execute the line for b in db[artists]: while the current value of artists was not hashable. This is an error. I cannot say what the value of artists was, since the value was a parameter to the function and you did not show the calling code.
So check which value of artists was given to function works_by_artists() that caused the displayed error. Whatever created that value is the actual error in your code.

Python - hash() and dict

If we have 2 separate dicts, both with the same keys and values, they can print in different orders, as expected.
So, let's say I want to use hash() on those dicts:
hash(frozenset(dict1.items()))
hash(frozenset(dict2.items()))
I'm doing this to make a new dict with the hash() values as the new keys.
Even though the dicts show up differently when printed, will the value created by hash() always be equal? If not, how do I make it always the same so that I can make comparisons successfully?
If the keys and values hash the same, frozenset is designed to be a stable and unique representation of the underlying values. The docs explicitly state:
Two sets are equal if and only if every element of each set is contained in the other (each is a subset of the other).
And the rules for hashable types require that:
Hashable objects which compare equal must have the same hash value.
So by definition frozensets with equal, hashable elements are equal and hash to the same value. This can only be violated if a user-defined class which does not obey the rules for hashing and equality is contained in the resulting frozenset (but then you've got bigger problems).
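A quick check of that claim, with two dicts built from the same key/value pairs in different orders:
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 2, 'a': 1}
print(frozenset(dict1.items()) == frozenset(dict2.items()))              # True
print(hash(frozenset(dict1.items())) == hash(frozenset(dict2.items())))  # True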
Note that this does not mean they'll iterate in the same order or produce the same repr; thanks to how hash collisions are resolved, two frozensets constructed from the same elements in a different order need not iterate in the same order. But they're still equal to one another, and hash the same (the precise outputs and ordering are implementation dependent and could easily vary between different versions of Python; this just happens to work on my Py 3.5 install to create the desired "different iteration order" behavior):
>>> frozenset([1,9])
frozenset({1, 9})
>>> frozenset([9,1])
frozenset({9, 1}) # <-- Different order; consequence of 8 buckets colliding for 1 and 9
>>> hash(frozenset([1,9]))
-7625378979602737914
>>> hash(frozenset([9,1]))
-7625378979602737914 # <-- Still the same hash though
>>> frozenset([1,9]) == frozenset([9,1])
True # <-- And still equal

What makes lists unhashable?

So lists are unhashable:
>>> { [1,2]:3 }
TypeError: unhashable type: 'list'
The following page gives an explanation:
A list is a mutable type, and cannot be used as a key in a dictionary
(it could change in-place making the key no longer locatable in the
internal hash table of the dictionary).
I understand why it is undesirable to use mutable objects as dictionary keys. However, Python raises the same exception even when I am simply trying to hash a list (independently of dictionary creation)
>>> hash( [1,2] )
TypeError: unhashable type: 'list'
Does Python do this as a guarantee that mutable types will never be used as dictionary keys? Or is there another reason that makes mutable objects impossible to hash, regardless of how I plan to use them?
Dictionaries and sets use hashing algorithms to uniquely determine an item. Those algorithms make use of the item used as a key to come up with a unique hash value. Since lists are mutable, the contents of a list can change. If a list were allowed to be a dictionary key and its contents then changed, its hash value would also change. If the hash value changes after the key gets stored at a particular slot in the dictionary, it will lead to an inconsistent dictionary. For example, initially the list would have been stored at location A, which was determined based on its hash value. If the hash value changes and we look for the list, we might not find it at location A, or, as per the new hash value, we might find some other object.
Since it is not possible to come up with a stable hash value, no hashing function is defined for lists internally; in CPython's list type definition the hash slot is set to:
PyObject_HashNotImplemented, /* tp_hash */
As the hashing function is not implemented, when you use a list as a key in a dictionary, or forcefully try to get its hash value with the hash() function, it fails to hash it, and so you get:
TypeError: unhashable type: 'list'
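If you need list-like data as a key, the usual workaround is to convert the list to a tuple, which is hashable as long as its elements are hashable:
>>> d = { tuple([1, 2]): 3 }
>>> d[(1, 2)]
3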

What dictates the order of data in a dictionary in Python?

What determines the order of items in a dictionary (specifically in Python, though this may apply to other languages)? For example:
>>> spam = {'what':4, 'shibby':'cream', 'party':'rock'}
>>> spam
{'party': 'rock', 'what': 4, 'shibby': 'cream'}
If I call on spam again, the items will still be in that same order. But how is this order decided?
According to the Python docs,
Dictionaries are sometimes found in other languages as “associative
memories” or “associative arrays”. Unlike sequences, which are indexed
by a range of numbers, dictionaries are indexed by keys, which can be
any immutable type; strings and numbers can always be keys.
They are arbitrary; again from the docs:
A dictionary’s keys are almost arbitrary values. Values that are not
hashable, that is, values containing lists, dictionaries or other
mutable types (that are compared by value rather than by object
identity) may not be used as keys. Numeric types used for keys obey
the normal rules for numeric comparison: if two numbers compare equal
(such as 1 and 1.0) then they can be used interchangeably to index the
same dictionary entry. (Note however, that since computers store
floating-point numbers as approximations it is usually unwise to use
them as dictionary keys.)
The order in an ordinary dictionary is based on the internal hash table, so you're not supposed to make any assumptions about it (note that in CPython 3.7 and later, plain dicts do preserve insertion order).
Use collections.OrderedDict for a dictionary whose order you control.
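For example:
>>> from collections import OrderedDict
>>> spam = OrderedDict([('what', 4), ('shibby', 'cream'), ('party', 'rock')])
>>> list(spam)
['what', 'shibby', 'party']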
Because dictionary keys are stored in a hash table. According to http://en.wikipedia.org/wiki/Hash_table:
The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order.
