Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
class FrozenCounter(collections.Counter):
# ...
def __hash__(self):
return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator) and hashing will occur in an extremely memory intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be order-able.
I'm thinking of using this algorithm:
def __hash__(self):
return functools.reduce(lambda a, b: a ^ b, self.items(), 0)
I figure using bitwise XOR means order doesn't matter for the hash value unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing alogrithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
Thoughts? Suggestions? Thanks.
(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure I have unique of a bunch of things is to put them in a set(), but the things must be hashable.)
What I've gathered from the comments
Both #marcin and #senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. #marcin was first but I gave the check mark to #senderle because of the good research on the big-O running times for different solutions. #marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
class FrozenCounter(collections.Counter):
# Edit: A previous version of this code included a __slots__ definition.
# But, from the Python documentation: "When inheriting from a class without
# __slots__, the __dict__ attribute of that class will always be accessible,
# so a __slots__ definition in the subclass is meaningless."
# http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
# ...
def __hash__(self):
"Implements hash(self) -> int"
if not hasattr(self, '_hash'):
self._hash = hash(frozenset(self.items()))
return self._hash
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items (in 3+; iteritems in 2.7), hash it, and store the hash.
To provide an explicit example:
>>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, and so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order insensitive value.
Have you considered hash(sorted(hash(x) for x in self.items()))? That way, you are only sorting integers, and don't have to build a list.
You could also xor the element hashes together, but frankly I don't how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).
Related
I've been trying to understand built-in view objects return by .items(), .values(), .keys() in Python 3 or similarly by .viewitems(), .viewvalues(), .viewkeys(). There are other threads on that subject but none (even the doc) seems to described how they work internally.
The main gain here seems to be efficienty compared to the copy of type list returned in Python 2. There are often compared to a window to the dictionnary items (like in this thread).
But what is that window and why is it more efficient ?
The only thing I can see is that the view objects seems to be set-like objects, which are generally faster for membership testing. But is this the only factor ?
Code sample
>>> example_dict = {'test':'test'}
>>> example_dict.items()
dict_items([('test', 'test')])
>>> type(example_dict.items())
<class 'dict_items'>
So, my question is regarding this dict_items class. How does that work internally?
Dict views store a reference to their parent dict, and they translate operations on the view to corresponding operations on the dict.
Iteration over a dict view is more efficient than building a list and iterating over that, because building a list takes time and memory that you don't have to spend with the view. The old way, Python would iterate over the dict's underlying storage to build a new list, and then you would iterate over the list. Iterating over a dict view uses an iterator that walks through the dict's underlying storage directly, skipping the unnecessary list step.
Dict views also support efficient containment tests and setlike intersection/difference/etc. operations, because they get to perform direct hash lookups on the underlying dict instead of iterating through a list and checking equality element by element.
If you want to see the concrete implementation used by CPython, you can take a look in the official repository, but this implementation is subject to change. It has changed, repeatedly.
One of the main advantages is that views are dynamic:
>>> di={1:'one',2:'two',3:'three'}
>>> view=di.viewitems()
>>> view
dict_items([(1, 'one'), (2, 'two'), (3, 'three')])
>>> di[2]='new two'
>>> view
dict_items([(1, 'one'), (2, 'new two'), (3, 'three')])
Therefore you do not need to regenerate the item, key or value list (as you would with dict.items()) if the dictionary changes.
Think of the Python 2 dict.items() as a type of copy of the dict -- the way it was when the copy was made.
Think of Python 3 dict.items() or the Python 2 equivalent of dict.viewitems() as an up-to-date copy of the way the dict is now. (Same with .viewkeys(), .viewvalues() obviously.)
The Python 3.6 documents have good examples of why and when you would use one.
Value views are not set-like, since dicts can have duplicate values. Key views are set-like, and items views are set-like for dicts with hashable values.
Note: With Python 3, the view replaces what Python 2 had with .keys() .values() or .items() Some may relying on dict.keys() or dict.values() being a static representation of a dict's previous state may have a surprise.
For legibility purposes, I would like to have a custom class that behaves exactly like a dict (but carries a meaningful type instead of the more general dict type):
class Derivatives(dict):
"Dictionary that represents the derivatives."
Now, is there a way of building new objects of this class in a way that does not involve copies? The naive usage
derivs = Derivatives({var: 1}) # var is a Python object
in fact creates a copy of the dictionary passed as an argument, which I would like to avoid, for efficiency reasons.
I tried to bypass the copy but then the class of the dict cannot be changed, in CPython:
class Derivatives(dict):
def __new__(cls, init_dict):
init_dict.__class__ = cls # Fails with __class__ assignment: only for heap types
return init_dict
I would like to have both the ability to give an explicit class name to the dictionaries that the program manipulates and an efficient way of building such dictionaries (instead of being forced to copy a Python dict). Is this doable efficiently in Python?
PS: The use case is maybe 100,000 creations of single-key Derivatives, where the key is a variable (not a string, so no keyword initialization). This is actually not slow, so "efficiency reasons" here means more something like "elegance": there is ideally no need to waste time doing a copy when the copy is not needed. So, in this particular case the question is more about the elegance/clarity that Python can bring here than about running speed.
By inheriting from dict you are given three possibilities for constructor arguments: (baring the {} literal)
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
This means that, in order to instantiate your instance you must do one of the following:
Pass the variables as keywords D(x=1) which are then packed into an intermediate dictionary anyway.
Create a plain dictionary and pass it as a mapping.
Pass an iterable of (key,value) pairs.
So in all three of these cases you will need to create intermediate objects to satisfy the dict constructor.
The third option for a single pair it would look like D(((var,1),)) which I highly recommend against for readability sake.
So if you want your class to inherit from a dictionary, using Derivatives({var: 1}) is your most efficient and most readable option.
As a personal note if you will have thousands of single pair dictionaries I'm not sure how the dict setup is the best in the first place, you may just reconsider the basis of your class.
TL;DR: There's not general-purpose way to do it unless you do it in C.
Long answer:
The dict class is implemented in C. Thus, there is no way to access it's internal properties - and most importantly, it's internal hash table, unless you use C.
In C, you could simply copy the pointer representing the hash table into your object without having to iterate over the dict (key, value) pairs and insert them into your object. (Of course, it's a bit more complicated than this. Note that I omit memory management details).
Longer answer:
I'm not sure why you are concerned about efficiency.
Python passes arguments as references. It rarely every copies unless you explicitly tell it to.
I read in the comments that you can't use named parameters, as the keys are actual Python objects. That leaves me to understand that you're worried about copying the dict keys (and maybe values). However, even the dictionary keys are not copied, and passed by reference! Consider this code:
class Test:
def __init__(self, x, y):
self.x = x
self.y = y
def __hash__(self):
return self.x
t = Test(1, 2)
print(t.y) # prints 2
d = {t: 1}
print(d[t]) # prints 1
keys = list(d.keys())
keys[0].y = 10
print(t.y) # prints 10! No copying was made when inserting object into dictionary.
Thus, the only remaining area of concern is iterating through the dict and inserting the values in your Derivatives class. This is unavoidable, unless you can somehow set the internal hash table of your class to the dict's internal hash table. There is no way to do this in pure python, as the dict class is implemented in C (as mentioned above).
Note that others have suggested using generators. This seems like a good idea too - say if you were reading the derivatives from a file or if you were generating them with a simple formula. It would avoid creating the dict object in the first place. However, there will be no noticable improvements in efficiency if the generators are just wrappers around lists (or any other data structure that can contain an arbritary set of values).
Your best bet is do stick with your original method. Generators are great, but they can't efficiently represent an arbritary set of values (which might be the case in your scenario). It's also not worth it to do it in C.
EDIT: It might be worth it to do it in C, after all!
I'm not too big on the details of the Python C API, but consider defining a class in C, for example,DerivativesBase (deriving from dict). All you do is define an __init__ function in C for DerivativesBase that takes a dict as a parameter and copies the hash table pointer from the dict into your DerivativesBase object. Then, in python, your Derivatives class derives from DerivativesBase and implements the bulk of the functionality.
In Python 3, dict_values, dict_keys and dict_items do not support indexing
my_dict = {'a': 0', 'b': 1', 'c': 2}
All of the queries below fail:
my_dict.keys()[1]
my_dict.values()[1]
my_dict.items()[1]
for that reason.
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert them their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries).
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert them their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries).
The key types are explained under Dictionary view objects, and also guaranteed to be subclasses of collections.abc.KeysView and friends. Basically, this means you can only count on them having __contains__, __iter__, and __len__.
They don't directly support indexing because their ordering can be invalidated.* But practically, in any implementation of Python, they're only actually invalidated if you mutate the dictionary. Which means you can safely do things like this:
next(itertools.islice(my_dict.keys(), i, None))
Basically, the same way you'd index a set, or any other non-iterator iterable.
* The actual rules as to what behavior is documented have changed a few times. The current version actually says "They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes," which implies the practical rule can now be relied on. But even if you're using an older version that, e.g., explicitly only guarantees consistency between adjacent calls to keys, values, items, and related functions, unless you're worried about someone writing a new implementation of Python 2.6 or 3.1 or something, there's no reason to worry about that.
Of course you probably want to wrap that up in a function that's more readable. In fact, I'd do it in two steps. First, use the nth function from the itertools recipes:
def nth(iterable, n, default=None):
return next(itertools.islice(iterable, n, None), default)
Then wrap up the key indexing:
def getkey(mapping, index, default=None):
return nth(mapping.keys(), index, default)
What if you want a random sample? Well, dictionary views are Sized, as are dictionaries themselves, so you can always use randrange:
def choosekey(mapping):
return getkey(mapping, random.randrange(len(mapping)))
If you just want a key, value or item, use next() and iter():
next(iter(my_dict))
next(iter(my_dict.values()))
next(iter(my_dict.items()))
I would've expected Python's keys method to return a set instead of a list. Since it most closely resembles the kind of guarantees that keys of a hashmap would give. Specifically, they are unique and not sorted, like a set. However, this method returns a list:
>>> d = {}
>>> d.keys().__class__
<type 'list'>
Is this just a mistake in the Python API or is there some other reason I am missing?
One reason is that dict.keys() predates the introduction of sets into the language.
Note that the return type of dict.keys() has changed in Python 3: the function now returns a "set-like" view rather than a list.
For set-like views, all of the operations defined for the abstract base class collections.abc.Set are available (for example, ==, <, or ^).
In python 2, it's less efficient to construct a set than a list.
Don't want to make an assumption that the user of the return value will want to search within the result. Iteration is also likely.
In python 3, it's no longer a list. It's an ordered iterable because ordering is guaranteed.
I am trying to understand the Python hash function under the hood. I created a custom class where all instances return the same hash value.
class C:
def __hash__(self):
return 42
I just assumed that only one instance of the above class can be in a dict at any time, but in fact a dict can have multiple elements with the same hash.
c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements
I experimented a little more and found that if I override the __eq__ method such that all the instances of the class compare equal, then the dict only allows one instance.
class D:
def __hash__(self):
return 42
def __eq__(self, other):
return True
p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element
So I am curious to know how a dict can have multiple elements with the same hash.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions i.e. even if two keys have same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
Python hash table is just a continguous block of memory (sort of like an array, so you can do O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important
Each entry in the table actually a combination of the three values - . This is implemented as a C struct (see dictobject.h:51-56)
The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1| ... |
-+-----------------+
.| ... |
-+-----------------+
i| ... |
-+-----------------+
.| ... |
-+-----------------+
n| ... |
-+-----------------+
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
When adding entries to the table, we start with some slot, i that is based on the hash of the key. CPython uses initial i = hash(key) & mask. Where mask = PyDictMINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
If the slot is occupied, CPython (and even PyPy) compares the the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.
Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.
The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)
There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they cannot both be in the same dict.
Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal, have same hash?
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
Edit: the answer below is one of possible ways to deal with hash collisions, it is however not how Python does it. Python's wiki referenced below is also incorrect. The best source given by #Duncan below is the implementation itself: https://github.com/python/cpython/blob/master/Objects/dictobject.c I apologize for the mix-up.
It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:
Here you see John Smith and Sandra Dee both hash to 152. Bucket 152 contains both of them. When looking up Sandra Dee it first finds the list in bucket 152, then loops through that list until Sandra Dee is found and returns 521-6955.
The following is wrong it's only here for context: On Python's wiki you can find (pseudo?) code how Python performs the lookup.
There's actually several possible solutions to this problem, check out the wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution
Hash tables, in general have to allow for hash collisions! You will get unlucky and two things will eventually hash to the same thing. Underneath, there is a set of objects in a list of items that has that same hash key. Usually, there is only one thing in that list, but in this case, it'll keep stacking them into the same one. The only way it knows they are different is through the equals operator.
When this happens, your performance will degrade over time, which is why you want your hash function to be as "random as possible".
In the thread I did not see what exactly python does with instances of a user-defined classes when we put it into a dictionary as a keys. Let's read some documentation: it declares that only hashable objects can be used as a keys. Hashable are all immutable built-in classes and all user-defined classes.
User-defined classes have __cmp__() and
__hash__() methods by default; with them, all objects
compare unequal (except with themselves) and
x.__hash__() returns a result derived from id(x).
So if you have a constantly __hash__ in your class, but not providing any __cmp__ or __eq__ method, then all your instances are unequal for the dictionary.
In the other hand, if you providing any __cmp__ or __eq__ method, but not providing __hash__, your instances are still unequal in terms of dictionary.
class A(object):
def __hash__(self):
return 42
class B(object):
def __eq__(self, other):
return True
class C(A, B):
pass
dict_a = {A(): 1, A(): 2, A(): 3}
dict_b = {B(): 1, B(): 2, B(): 3}
dict_c = {C(): 1, C(): 2, C(): 3}
print(dict_a)
print(dict_b)
print(dict_c)
Output
{<__main__.A object at 0x7f9672f04850>: 1, <__main__.A object at 0x7f9672f04910>: 3, <__main__.A object at 0x7f9672f048d0>: 2}
{<__main__.B object at 0x7f9672f04990>: 2, <__main__.B object at 0x7f9672f04950>: 1, <__main__.B object at 0x7f9672f049d0>: 3}
{<__main__.C object at 0x7f9672f04a10>: 3}