Handling hash collisions in Python dictionaries

I have a bunch of dictionaries in python, each dictionary containing user information e.g.:
NewUserDict={'name': 'John', 'age':27}
I collect all these user info dictionaries within a larger dictionary container, using the hash value of each dictionary as the key (Hashing a dictionary?).
What would be the best way to handle hash collisions, when adding new unique users to the dictionary? I was going to manually compare the dictionaries with colliding hash values, and just add some random number to the more recent hash value, e.g.:
if new_hash in larger_dictionary:
    if larger_dictionary[new_hash] != NewUserDict:
        new_hash = new_hash + somerandomnumber
What would be standard way of handling this? Alternatively, how do I know if I should be worrying about collisions in the first place?

Generally, you would use the most unique element of your user record. And this usually means that the system in general has a username or a unique ID per record (user), which is guaranteed to be unique. The username or ID would be the unique key for the record. Since this is enforced by the system itself, for example by means of an auto-increment key in a database table, you can be sure that there is no collision.
THAT unique key therefore should be the key in your map to allow you to find a user record.
However, if for some reason you don't have access to such a guaranteed-to-be-unique key, you can certainly create a hash from the record (as you describe) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision, but you simply deal with it.
A quick and commonly used algorithm for that goes like this: Use a hash over the record to create a key, as you already do. This key may potentially not be unique. Now store a list of records at the location indicated by the key. We call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, find the entry, then sequentially search through the list/bucket at that location to find the entry you want.
Here's an example:
mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]
In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, but there is a collision: Both the 'John' and 'Bob' entry have this hash value. No matter, you find the bucket stored at mymap[123] and iterate over it to find the value.
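Here's a minimal sketch of that approach, assuming the record hash is computed with hash(frozenset(record.items())) (one common way to hash a dict's contents; all values must be hashable):

def record_hash(record):
    # Hash the dict's contents; equal records produce equal keys.
    return hash(frozenset(record.items()))

def add_record(table, record):
    key = record_hash(record)
    # setdefault creates an empty bucket (list) the first time a key is seen.
    bucket = table.setdefault(key, [])
    if record not in bucket:  # skip exact duplicates
        bucket.append(record)

def find_record(table, record):
    # Scan the bucket sequentially; a collision just means a longer scan.
    for candidate in table.get(record_hash(record), []):
        if candidate == record:
            return candidate
    return None

mymap = {}
add_record(mymap, {'name': 'John', 'age': 27})
add_record(mymap, {'name': 'Bob', 'age': 19})
print(find_record(mymap, {'name': 'Bob', 'age': 19}))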
This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).
Performance generally only becomes an issue when you have lots of collisions (when the lists for each bucket get really long), which a good hash function will help you avoid.
However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.

using the hash value of each dictionary as the key
You are not using the hash value of a dict. Dicts don't have hash values. From the link, it looks like you're using
hash(frozenset(my_dict.items()))
in which case you should instead just be using
frozenset(my_dict.items())
as the key directly. Hash collisions will then be handled for you by the normal dict collision handling.
In general, you should not use hashes as dict keys, as doing so defeats collision resolution. You should use whatever hashed to that hash value as the key.
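For example, a short sketch of keying the container by the frozenset directly:

users = {}

new_user = {'name': 'John', 'age': 27}
key = frozenset(new_user.items())  # hashable, and equal for equal dicts

users[key] = new_user

# Lookup works by value; the dict's own collision resolution handles the rest.
print(users[frozenset({'name': 'John', 'age': 27}.items())])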

In general, a collision happens when multiple keys hash to the same bucket. In that case, we need to make sure that we can distinguish between those keys.
Chaining is one of the popular techniques used for collision resolution in hash tables. For example, suppose two strings "welcome to stackoverflow" and "how to earn reputation in SO?" yield hash codes 100 and 200 respectively. Assuming a total array size of 10, both of them end up in the same bucket (100 % 10 == 200 % 10 == 0). Open addressing is another approach to resolving collisions while hashing.
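To make the bucket arithmetic concrete (the hash codes below are the made-up values from the example, not real Python hashes):

table_size = 10

hash_a = 100  # hypothetical hash of "welcome to stackoverflow"
hash_b = 200  # hypothetical hash of "how to earn reputation in SO?"

# Both hash codes map to bucket 0, so the two keys collide.
print(hash_a % table_size, hash_b % table_size)  # 0 0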
You can read this article on Python Dictionary Implementations which talks about handling collision because python dictionaries are implemented using hash tables.

Related

What determines the ordering of dictionaries in Python 3.6

I am using a dictionary of dataframes to do some analysis on NFL teams. I need to loop through the dictionaries (backwards, ordered by time of insertion) for the analysis I plan to do. Each NFL team gets its own dictionary.
My functions iterate through the dictionary with code similar to the line displayed at the top. Each key is a tuple, and the second entry in the tuple denotes the week (of the NFL season) the game was played. I initially inserted week 1's key and value, then week 2's key and value, then week 3's key and value. Seeing the output, this works as planned and means my functions should work as they are meant to. No problems in practice. However, if you view the dictionary itself, the keys are out of order (see the second output).
So what exactly determines the order of the keys when you view the dictionary? The Buccaneers dictionary goes 2 -> 1 -> 3. But this is not the case for each team's dictionary; the order seems completely random. What determines this order? I am curious (I definitely inserted them in 1 -> 2 -> 3 order for every team). I am using Python 3.6
See this question for details. To summarize, dictionaries have preserved insertion order in CPython since 3.6, but that was an implementation detail until the Python 3.7 specification made it official. The doc states:
Changed in version 3.7: Dictionary order is guaranteed to be insertion order.
Hence the answer to your question is:
if you mean CPython specifically, the dictionary order is the insertion order (though that is not guaranteed by the 3.6 specs, and one can imagine, in theory, a patch to CPython 3.6 that breaks this behaviour)
if you mean any implementation (CPython, Jython, PyPy...), the implementation determines the dictionary order: there is no guarantee on the order (unless specified by the implementation).
You might ask why there are implementations of dictionaries that are not ordered by insertion order. I suggest you check the hash table data structure. Basically, the values are put in an array, depending on the hash of the key. The hash is a function that maps a key to the index of an array cell. This is why the lookup is so fast: take the key, compute the hash, read the value in the cell (I ignore the collision resolution details), instead of scanning a whole list of (key, value) pairs for instance.
There is no guarantee that the order of the hashed keys is the same as the order of insertion of the keys (or the order of the keys themselves). If you list the keys by scanning the array, the order of the keys appears to be random.
Remark: you can use the collections.OrderedDict class to force the keys to keep their insertion order on any implementation; if you instead want the keys sorted (e.g. 'Opponent' < 'Reference'), you have to sort them yourself.
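For example, a small sketch using OrderedDict (the tuple keys mimic the question's (team, week) layout):

from collections import OrderedDict

d = OrderedDict()
d[('Buccaneers', 1)] = 'week 1 frame'
d[('Buccaneers', 2)] = 'week 2 frame'
d[('Buccaneers', 3)] = 'week 3 frame'

# OrderedDict guarantees iteration in insertion order on every
# implementation; reversed() then walks backwards through time.
for key in reversed(d):
    print(key)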

The addresses of keys are stored very far from each other

I'd like to explore the hash table,
In [1]: book = {"apple":0.67, "milk":1.49, "avocado":1.49, "python":2}
In [5]: [hex(id(key)) for key in book]
Out[5]: ['0x10ffffc70', '0x10ffffab0', '0x10ffffe68', '0x10ee1cca8']
The addresses show that the keys are stored far away from each other, especially the key "python".
I had assumed that they would be adjacent to one another.
How could this happen? Does it hurt performance?
There are two ways we can interpret your confusion: either you expected id() to be the hash function for the keys, or you expected the keys to be relocated into the hash table itself, in which case (since in CPython the id() value is a memory location) the id() values would say something about the hash table size. We can address both by talking about Python's dictionary implementation and how Python deals with objects in general.
Python dictionaries are implemented as a hash table, which is a table of limited size. To store keys, a hash function generates an integer (same integer for equal values), and the key is stored in a slot based on that number using a modulo function:
slot = hash(key) % len(table)
This can lead to collisions, so having a large range of numbers for the hash function to pick from is going to help reduce the chances there are such collisions. You still have to deal with collisions anyway, but you want to minimise that.
Python does not use the id() function as a hash function here, because that would not produce the same hash for equal values! If you didn't produce the same hash for equal values, then you couldn't use multiple "hello world" strings as a means to find the right slot again: dictionary["hello world"] = "value" followed by "hello world" in dictionary would produce different id() values, and thus hash to different slots, and you would not know that that specific string value had already been used as a key.
Instead, objects are expected to implement a __hash__ method, and you can see what that method produces for various objects with the hash() function.
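You can see this for yourself: two equal strings always agree on hash() but not necessarily on id() (a quick illustration; the second string is built at runtime to avoid constant sharing):

a = "hello world"
b = "".join(["hello ", "world"])  # equal value, distinct object

print(a == b)              # True: the values are equal
print(hash(a) == hash(b))  # True: equal values must hash equal
print(a is b)              # False here: two different objects, different id()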
Because keys stored in a dictionary must remain unchanged, Python won't let you use mutable types as dictionary keys. Otherwise, if you could change their value, they would no longer be equal to another such object with the old value and the same hash, and you wouldn't find them in the slot that their new hash would map to.
Note that Python puts all objects in a dynamic heap, and uses references everywhere to relate the objects. Dictionaries hold references to keys and values; putting a key into a dictionary does not relocate the key in memory, and the id() of the key won't change. If keys were relocated, a requirement for the id() function would be violated; the documentation states: This is an integer which is guaranteed to be unique and constant for this object during its lifetime.
As for those collisions: Python deals with collisions by looking for a new slot with a fixed formula, finding an empty slot in a predictable but pseudorandom series of slot numbers; see the dictobject.c source code comments if you want to know the details. As the table fills up, Python will dynamically grow the table to fit more elements, so there will always be empty slots.
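As a rough sketch, the probing recurrence described in the dictobject.c comments looks like this (simplified; the real code is C and interleaves this with the table bookkeeping):

import itertools

PERTURB_SHIFT = 5  # constant from CPython's dictobject.c

def probe_sequence(hash_value, table_size):
    """Yield the slot indices tried for a key, per the dictobject.c comments.

    table_size must be a power of two, as in CPython's dict."""
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (5 * i + perturb + 1) & mask

# First few slots probed for a key whose hash is 123, in an 8-slot table:
print(list(itertools.islice(probe_sequence(123, 8), 5)))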

Dictionaries/hashmaps setting in Python

I've been studying Python for a few days now with the famous tutorial Learn Python The Hard Way. At a certain point, talking about dictionaries in Exercise 39, there are a couple of little functions that read like this:
def hash_key(aMap, key):
    """Given a key this will create a number and then convert it to
    an index for the aMap's buckets."""
    return hash(key) % len(aMap)

def get_bucket(aMap, key):
    """Given a key, find the bucket where it would go."""
    bucket_id = hash_key(aMap, key)
    return aMap[bucket_id]
Now, what sounds obscure to me is the way the bucket id is decided in the first function.
Assuming I wanted to find the bucket for the key 'myCoolKey', Python would compute hash('myCoolKey') % len(aMap), which, with len(aMap) being 256, would result in something like 139.
So reading on afterwards, if I'm not wrong, 'myCoolKey' is going to be assigned to aMap slot 139.
Now:
Is there a particular reason I can't see for this being done?
What about collisions? Isn't it possible that, the map being limited in size, two keys could end up assigned to the same slot while other slots remain unused?
The purpose of a hash table is to provide you with immediate lookup times. The % (modulo) operation is used to ensure that the computed index always falls within the bounds of your hash table (so there are no IndexError issues). There is often additional hashing before this (as in your case) to try to ensure that the keys are distributed as evenly as possible, which reduces collisions.
Yes, it's possible for a general hash table. Hash tables can resolve this by 1) re-hashing the value to put it into another slot, 2) just putting it in the next available slot, or 3) putting a list of values into that slot, instead of just a single value. It appears that your code goes with option 3.
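For option 3, the set/get functions built on top of hash_key and get_bucket might look like the sketch below (my own illustration, not quoted from the exercise; here aMap is a list of 256 empty bucket lists):

def set(aMap, key, value):
    """Sets the key to the value, replacing any existing value."""
    bucket = get_bucket(aMap, key)
    # If the key is already in the bucket, replace its (key, value) pair ...
    for i, (k, v) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)
            return
    # ... otherwise append it; colliding keys simply share the bucket.
    bucket.append((key, value))

def get(aMap, key, default=None):
    """Gets the value in a bucket for the given key, or the default."""
    bucket = get_bucket(aMap, key)
    for k, v in bucket:
        if k == key:
            return v
    return default

aMap = [[] for _ in range(256)]   # 256 buckets, each an empty list
set(aMap, 'myCoolKey', 42)
print(get(aMap, 'myCoolKey'))     # 42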
I think this link gives a good walkthrough of how the dictionary or hashmap works for exercise 39 - https://nicolasgkruk.wordpress.com/2014/07/11/understanding-making-your-own-dictionary-module-from-learning-python-the-hard-way-exercise-39/

Storing key-value pairs with or without a hash in redis

I am storing key-value pairs in a Redis database via the redis-py client. All keys are unique; there are no duplicates. Here's an example:
key = 133735570
value = {"key":133735570,"value":[[141565041,1.2],[22592300,1.0],[162439394,1.0],[19397942,1.0],[79996146,1.0],[84352985,1.0],[123276403,1.0],[18356816,1.0],[113839687,1.0],[16235789,1.0],[144779115,1.0],[94628304,1.0],[134973120,1.0],[138501363,1.0],[34351681,1.0],[80202522,1.0],[81561595,1.0],[18913677,1.0],[130488590,1.0],[128208311,1.0],[93912155,0.5]]}
Would adding a hash (same as the key name) improve performance? For example,
key = 133735570
hash = 133735570
value = {"key":133735570,"value":[[141565041,1.2],[22592300,1.0],[162439394,1.0],[19397942,1.0],[79996146,1.0],[84352985,1.0],[123276403,1.0],[18356816,1.0],[113839687,1.0],[16235789,1.0],[144779115,1.0],[94628304,1.0],[134973120,1.0],[138501363,1.0],[34351681,1.0],[80202522,1.0],[81561595,1.0],[18913677,1.0],[130488590,1.0],[128208311,1.0],[93912155,0.5]]}
My requirement is to look-up keys so as to retrieve the corresponding values from it.
You can try storing your key-value pairs (the value part in your example) in a Redis hash data structure, where the key part of each inner pair becomes a hash field and the value part becomes that field's value (check out HMSET). A hash is more flexible for data manipulation and might consume less memory than plain string values.
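A minimal sketch with redis-py, assuming a local Redis server and treating each inner [id, score] pair as a field/value entry of one hash (recent redis-py versions use hset with a mapping instead of the deprecated HMSET):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

key = 133735570
pairs = [[141565041, 1.2], [22592300, 1.0], [93912155, 0.5]]  # truncated example data

# Store each [id, score] pair as one field of the hash stored at `key`.
r.hset(key, mapping={str(inner_id): score for inner_id, score in pairs})

# Retrieve a single score without fetching and parsing the whole value:
print(r.hget(key, '141565041'))  # b'1.2'

# Or fetch all pairs at once:
print(r.hgetall(key))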

Most efficient way to update attribute of one instance

I'm creating an arbitrary number of instances (using for loops and ranges). At some event in the future, I need to change an attribute for only one of the instances. What's the best way to do this?
Right now, I'm doing the following:
1) Manage the instances in a list.
2) Iterate through the list to find a key value.
3) Once I find the right object within the list (i.e. key value = value I'm looking for), change whatever attribute I need to change.
for Instance in ListofInstances:
    if Instance.KeyValue == SearchValue:
        Instance.AttributeToChange = 10
This feels really inefficient: I'm basically iterating over the entire list of instances, even though I only need to change an attribute in one of them.
Should I be storing the Instance references in a structure more suitable for random access (e.g. dictionary with KeyValue as the dictionary key?) Is a dictionary any more efficient in this case? Should I be using something else?
Thanks,
Mike
Should I be storing the Instance references in a structure more suitable for random access (e.g. dictionary with KeyValue as the dictionary key?)
Yes, if you are mapping from a key to a value (which you are in this case), such that one typically accesses an element via its key, then a dict rather than a list is better.
Is a dictionary any more efficient in this case?
Yes, it is much more efficient. A dictionary takes O(1) time on average to look up an item by its key, whereas a list takes O(n) because you must scan it for a matching item, which is what you are currently doing.
Using a Dictionary
# Construct the dictionary
d = {}
# Insert items into the dictionary
d[key1] = value1
d[key2] = value2
# ...
# Checking if an item exists
if key in d:
# Do something requiring d[key]
# such as updating an attribute:
d[key].attr = val
As you mention, you need to keep an auxiliary dictionary with the key value as the key and the instance (or a list of instances with that value for their attribute) as the value(s) -- way more efficient. Indeed, there's nothing more efficient than a dictionary for such uses.
It depends on what the other needs of your program are. If all you ever do with these objects is access the one with that particular key value, then sure, a dictionary is perfect. But if you need to preserve the order of the elements, storing them in a dictionary won't do that. (You could store them in both a dict and a list, or there might be a data structure that provides a compromise between random access and order preservation.) Alternatively, if more than one object can have the same key value, then you can't store both of them in a single dict at the same time, at least not directly. (You could have a dict of lists, as sketched below.)
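A small sketch of that dict-of-lists variant (the Instance class and its attribute names here are stand-ins for whatever your objects actually look like):

from collections import defaultdict

class Instance:
    def __init__(self, key_value):
        self.KeyValue = key_value
        self.AttributeToChange = 0

ListofInstances = [Instance(1), Instance(2), Instance(2)]

# Index the instances by key; duplicate keys share a bucket list.
by_key = defaultdict(list)
for instance in ListofInstances:
    by_key[instance.KeyValue].append(instance)

# O(1) average-time access to every instance whose KeyValue is 2:
for instance in by_key[2]:
    instance.AttributeToChange = 10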
