Related
Perhaps my understanding of python's dictionary is not good. But here's the problem.
Does it ever happen that a {yolk: shell} pair exists in the dictionary say eggs, but a eggs.get(yolk) can return None?
So, in a large code, I do multiple get operations for a dictionary, and after certain iterations, I observe this situation.
>>> for key, value in nodehashes.items():
... print(key, nodehashes.get(key), value)
............................
...........................
<Graph.Node object at 0x00000264128C4DA0> 3309678211443697093 3309678211443697093
<Graph.Node object at 0x00000264128C4DD8> 3554035049990170053 3554035049990170053
<Graph.Node object at 0x00000264128C4E10> None -7182124040890112571 # Look at this!!
<Graph.Node object at 0x00000264128C4E48> 3268020121048950213 3268020121048950213
<Graph.Node object at 0x00000264128C4E80> -1243862058694105659 -1243862058694105659
............................
............................
At first sight, It looks like somewhere in the code, the key is deleted, but then how does nodehashes.items() return the correct key-value pair? I swept the entire region, I am not popping an item at all. How can this happen?
I know it's wrong on my part not to post an example, but I really don't know where to start looking in the code, The Nodes are hashed in the beginning and they are only accessed with get. Surprisingly, even PyCharm's debugger shows the key-value pair to exist. But the get returns None. So if anyone else has hit upon this before, I am all ears.
def __eq__(self, other):
if (self.x == other.x) and (self.y == other.y):
return True
else:
return False
def __hash__(self):
return hash(tuple([self.x, self.y]))
You can reproduce that if you have a custom __hash__ method on mutable objects:
class A:
def __hash__(self):
return hash(self.a)
>>> a1 = A()
>>> a2 = A()
>>> a1.a = 1
>>> a2.a = 2
>>> d = {a1: 1, a2: 2}
>>> a1.a = 3
>>> d.items()
dict_items([(<__main__.A object at 0x7f1762a8b668>, 1), (<__main__.A object at 0x7f17623d76d8>, 2)])
>>> d.get(a1)
None
You can see that d.items() still has access to both A objects, but get can't find it anymore, because the hash value has changed.
Having a class with overridden __eq__ and __hash__ method I use sets for easy lookups (has also other reasons).
class Foo:
def __init__(self, name, value):
self.name = name
self.value = value
def __eq__(self, other):
return self.value == other.value
def __hash__(self):
return self.value
def __repr__(self):
return "{} {}".format(self.name, self.value)
mylist = [Foo(None, 1), Foo(None, 2), Foo(None,3)]
reference = [Foo("a", 1), Foo("b", 2), Foo("c", 3)]
Apparently python uses the first set when I perform operations like a union for the result set:
print(set(mylist) | set(reference)) # {None 1, None 2, None 3}
print(set(reference) | set(mylist)) # {a 1, b 2, c 3}
I could not find any documentation on this behaviour.
Is there a formal definition for this?
Or is it just undefined which set the interpreter takes on unions?
EDIT To make it clear:
A union on two sets is mathematically a symmetric operation, the behavior here is not symmetric. Can I rely on it?
I think the answer is in your class definition.
__eq__ checks for value equality, while __hash__ returns a value. As a result, when sets are unified, the elements of one set are equal to their counterparts of the other set (respective values are equal), which causes the unified set return the first specified set.
You're eq method implementation is based only on 'value' property
Changing it to the following implementation should solve the problem:
def __eq__(self, other):
return (self.value == other.value) and (self.name == other.name)
I have nested dictionaries that may contain other dictionaries or lists. I need to be able to compare a list (or set, really) of these dictionaries to show that they are equal.
The order of the list is not uniform. Typically, I would turn the list into a set, but it is not possible since there are values that are also dictionaries.
a = {'color': 'red'}
b = {'shape': 'triangle'}
c = {'children': [{'color': 'red'}, {'age': 8},]}
test_a = [a, b, c]
test_b = [b, c, a]
print(test_a == test_b) # False
print(set(test_a) == set(test_b)) # TypeError: unhashable type: 'dict'
Is there a good way to approach this to show that test_a has the same contents as test_b?
You can use a simple loop to check if each of one list is in the other:
def areEqual(a, b):
if len(a) != len(b):
return False
for d in a:
if d not in b:
return False
return True
I suggest writing a function that turns any Python object into something orderable, with its contents, if it has any, in sorted order. If we call it canonicalize, we can compare nested objects with:
canonicalize(test_a) == canonicalize(test_b)
Here's my attempt at writing a canonicalize function:
def canonicalize(x):
if isinstance(x, dict):
x = sorted((canonicalize(k), canonicalize(v)) for k, v in x.items())
elif isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
x = sorted(map(canonicalize, x))
else:
try:
bool(x < x) # test for unorderable types like complex
except TypeError:
x = repr(x) # replace with something orderable
return x
This should work for most Python objects. It won't work for lists of heterogeneous items, containers that contain themselves (which will cause the function to hit the recursion limit), nor float('nan') (which has bizarre comparison behavior, and so may mess up the sorting of any container it's in).
It's possible that this code will do the wrong thing for non-iterable, unorderable objects, if they don't have a repr function that describes all the data that makes up their value (e.g. what is tested by ==). I picked repr as it will work on any kind of object and might get it right (it works for complex, for example). It should also work as desired for classes that have a repr that looks like a constructor call. For classes that have inherited object.__repr__ and so have repr output like <Foo object at 0xXXXXXXXX> it at least won't crash, though the objects will be compared by identity rather than value. I don't think there's any truly universal solution, and you can add some special cases for classes you expect to find in your data if they don't work with repr.
If the elements in both lists are shallow, the idea of sorting them, and then comparing with equality can work. The problem with #Alex's solution is that he is only using "id" - but if instead of id, one uses a function that will sort dictionaries properly, things shuld just work:
def sortkey(element):
if isinstance(element, dict):
element = sorted(element.items())
return repr(element)
sorted(test_a, key=sortkey) == sorted(test_b, key=sotrkey)
(I use an repr to wrap the key because it will cast all elements to string before comparison, which will avoid typerror if different elements are of unorderable types - which would almost certainly happen if you are using Python 3.x)
Just to be clear, if your dictionaries and lists have nested dictionaries themselves, you should use the answer by #m_callens. If your inner lists are also unorderd, you can fix this to work, jsut sorting them inside the key function as well.
In this case they are the same dicts so you can compare ids (docs). Note that if you introduced a new dict whose values were identical it would still be treated differently. I.e. d = {'color': 'red'} would be treated as not equal to a.
sorted(map(id, test_a)) == sorted(map(id, test_b))
As #jsbueno points out, you can do this with the kwarg key.
sorted(test_a, key=id) == sorted(test_b, key=id)
An elegant and relatively fast solution:
class QuasiUnorderedList(list):
def __eq__(self, other):
"""This method isn't as ineffiecient as you think! It runs in O(1 + 2 + 3 + ... + n) time,
possibly better than recursively freezing/checking all the elements."""
for item in self:
for otheritem in other:
if otheritem == item:
break
else:
# no break was reached, item not found.
return False
return True
This runs in O(1 + 2 + 3 + ... + n) flat. While slow for dictionaries of low depth, this is faster for dictionaries of high depth.
Here's a considerably longer snippet which is faster for dictionaries where depth is low and length is high.
class FrozenDict(collections.Mapping, collections.Hashable): # collections.Hashable = portability
"""Adapated from http://stackoverflow.com/a/2704866/1459669"""
def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)
self._hash = None
def __iter__(self):
return iter(self._d)
def __len__(self):
return len(self._d)
def __getitem__(self, key):
return self._d[key]
def __hash__(self):
# It would have been simpler and maybe more obvious to
# use hash(tuple(sorted(self._d.iteritems()))) from this discussion
# so far, but this solution is O(n). I don't know what kind of
# n we are going to run into, but sometimes it's hard to resist the
# urge to optimize when it will gain improved algorithmic performance.
# Now thread safe by CrazyPython
if self._hash is None:
_hash = 0
for pair in self.iteritems():
_hash ^= hash(pair)
self._hash = _hash
return _hash
def freeze(obj):
if type(obj) in (str, int, ...): # other immutable atoms you store in your data structure
return obj
elif issubclass(type(obj), list): # ugly but needed
return set(freeze(item) for item in obj)
elif issubclass(type(obj), dict): # for defaultdict, etc.
return FrozenDict({key: freeze(value) for key, value in obj.items()})
else:
raise NotImplementedError("freeze() doesn't know how to freeze " + type(obj).__name__ + " objects!")
class FreezableList(list, collections.Hashable):
_stored_freeze = None
_hashed_self = None
def __eq__(self, other):
if self._stored_freeze and (self._hashed_self == self):
frozen = self._stored_freeze
else:
frozen = freeze(self)
if frozen is not self._stored_freeze:
self._stored_hash = frozen
return frozen == freeze(other)
def __hash__(self):
if self._stored_freeze and (self._hashed_self == self):
frozen = self._stored_freeze
else:
frozen = freeze(self)
if frozen is not self._stored_freeze:
self._stored_hash = frozen
return hash(frozen)
class UncachedFreezableList(list, collections.Hashable):
def __eq__(self, other):
"""No caching version of __eq__. May be faster.
Don't forget to get rid of the declarations at the top of the class!
Considerably more elegant."""
return freeze(self) == freeze(other)
def __hash__(self):
"""No caching version of __hash__. See the notes in the docstring of __eq__2"""
return hash(freeze(self))
Test all three (QuasiUnorderedList, FreezableList, and UncachedFreezableList) and see which one is faster in your situation. I'll betcha it's faster than the other solutions.
What's a correct and good way to implement __hash__()?
I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.
As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions).
What's a good practice to get such values? Are collisions a problem?
In my case I have a small class which acts as a container class holding some ints, some floats and a string.
An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.
Here's an example of using a key for hash and equality:
class A:
def __key(self):
return (self.attr_a, self.attr_b, self.attr_c)
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
if isinstance(other, A):
return self.__key() == other.__key()
return NotImplemented
Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.
John Millikin proposed a solution similar to this:
class A(object):
def __init__(self, a, b, c):
self._a = a
self._b = b
self._c = c
def __eq__(self, othr):
return (isinstance(othr, type(self))
and (self._a, self._b, self._c) ==
(othr._a, othr._b, othr._c))
def __hash__(self):
return hash((self._a, self._b, self._c))
The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?
Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states
The only required property is that objects which compare equal have the same hash value
Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.
Outdated/bad solution
The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:
class B(object):
def __init__(self, a, b, c):
self._a = a
self._b = b
self._c = c
def __eq__(self, othr):
if isinstance(othr, type(self)):
return ((self._a, self._b, self._c) ==
(othr._a, othr._b, othr._c))
return NotImplemented
def __hash__(self):
return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
hash((self._a, self._b, self._c)))
Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).
Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that
for c in some_string:
hash = 101 * hash + ord(c)
worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.
A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:
class A:
def __init__(self, a, b, c):
self._a = a
self._b = b
self._c = c
def __iter__(self):
yield "a", self._a
yield "b", self._b
yield "c", self._c
def __hash__(self):
return hash(tuple(self))
def __eq__(self, other):
return (isinstance(other, type(self))
and tuple(self) == tuple(other))
(here __eq__ is not required for hash, but it's easy to implement).
Now add some mutable members to see how it works:
a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c)) # -5279839567404192660
dict(A(a, b, c)) # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c)) # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))
things only fall apart if you try to put non-hashable members in the object model:
hash(A(a, b, [1])) # TypeError: unhashable type: 'list'
I can try to answer the second part of your question.
The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!
Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try and branch different colliding entries out so that entries to cluster in one particular spot.
Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.
Here are some links if you're interested more:
coalesced hashing on wikipedia
Wikipedia also has a summary of various collision resolution methods:
Also, "File Organization And Processing" by Tharp covers alot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.
A very good explanation on when and how implement the __hash__ function is on programiz website:
Just a screenshot to provide an overview:
(Retrieved 2019-12-13)
As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.
class Person:
def __init__(self, age, name):
self.age = age
self.name = name
def __eq__(self, other):
return self.age == other.age and self.name == other.name
def __hash__(self):
print('The hash is:')
return hash((self.age, self.name))
person = Person(23, 'Adam')
print(hash(person))
Depends on the size of the hash value you return. It's simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you're gonna get collisions.
I would favor bit operations. Like, the following C pseudo code:
int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0 | (d & 0x000F000F);
Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value, maybe better.
For strings, I've got little/no idea.
#dataclass(frozen=True) (Python 3.7)
This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets:
dataclass_cheat.py
from dataclasses import dataclass, FrozenInstanceError
#dataclass(frozen=True)
class MyClass1:
n: int
s: str
#dataclass(frozen=True)
class MyClass2:
n: int
my_class_1: MyClass1
d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6
assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6
# Due to `frozen=True`
o = MyClass1(n=1, s='a')
try:
o.n = 2
except FrozenInstanceError as e:
pass
else:
raise 'error'
As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:
d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1
works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.
frozen=True is mandatory, otherwise the class is not hashable, otherwise it would make it possible for users to inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html
Tested on Python 3.10.7, Ubuntu 22.10.
As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is i'd like to have a better idea about how hygenic macros would work in an algol-like language (as apposed to the syntax free lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid, unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine)
The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
I guess it has to be tuples all the way down. Now the python standard library provides approximately what i'd need, collections.namedtuple has a very different syntax, but can be used as a key. continuing from above session:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
tl'dr: How do I get dicts that can be used as keys to other dicts?
Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value) but we're all adults here.
class hashdict(dict):
"""
hashable dict implementation, suitable for use as a key into
other dicts.
>>> h1 = hashdict({"apples": 1, "bananas":2})
>>> h2 = hashdict({"bananas": 3, "mangoes": 5})
>>> h1+h2
hashdict(apples=1, bananas=3, mangoes=5)
>>> d1 = {}
>>> d1[h1] = "salad"
>>> d1[h1]
'salad'
>>> d1[h2]
Traceback (most recent call last):
...
KeyError: hashdict(bananas=3, mangoes=5)
based on answers from
http://stackoverflow.com/questions/1151658/python-hashable-dicts
"""
def __key(self):
return tuple(sorted(self.items()))
def __repr__(self):
return "{0}({1})".format(self.__class__.__name__,
", ".join("{0}={1}".format(
str(i[0]),repr(i[1])) for i in self.__key()))
def __hash__(self):
return hash(self.__key())
def __setitem__(self, key, value):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def __delitem__(self, key):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def clear(self):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def pop(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def popitem(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def setdefault(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def update(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
# update is not ok because it mutates the object
# __add__ is ok because it creates a new object
# while the new object is under construction, it's ok to mutate it
def __add__(self, right):
result = hashdict(self)
dict.update(result, right)
return result
if __name__ == "__main__":
import doctest
doctest.testmod()
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.
class hashabledict(dict):
def __hash__(self):
return hash(tuple(sorted(self.items())))
Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:
class hashabledict(dict):
def __key(self):
return tuple((k,self[k]) for k in sorted(self))
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
return self.__key() == other.__key()
If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes hundredfolds -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
class Hashabledict(dict):
def __hash__(self):
return hash(frozenset(self))
Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:
class Hashabledict(dict):
def __hash__(self):
return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues will access the values directly and avoid the many memory allocator calls using by items to form new many key/value tuples in memory every time you do a lookup.
The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
A reasonably clean, straightforward implementation is
import collections
class FrozenDict(collections.Mapping):
"""Don't forget the docstrings!!"""
def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)
def __iter__(self):
return iter(self._d)
def __len__(self):
return len(self._d)
def __getitem__(self, key):
return self._d[key]
def __hash__(self):
return hash(tuple(sorted(self._d.iteritems())))
I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; There's virtually no escape from the problem that dict's are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.
What if you jam key, value pairs into a frozenset? What would that require, how would it work?
Part 1, you need a way of encoding the 'item's in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
def __hash__(self):
return hash((self.key, None))
def __eq__(self, other):
if type(self) != type(other):
return NotImplemented
return self.key == other.key
def __repr__(self):
return repr((self.key, self.value))
That alone puts you in spitting distance of an immutable mapping:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object; which one ends up in the return value is undefined, we'll have to go through some more gyrations.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:
class Thief(object):
def __init__(self, key):
self.key = key
def __hash__(self):
return hash(pair(self.key, None))
def __eq__(self, other):
self.value = other.value
return pair(self.key, None) == other
Which results in the less problematic:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:
class FrozenDict(frozenset, collections.Mapping):
def __new__(cls, seq=()):
return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
def __getitem__(self, key):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
raise KeyError(key)
def __eq__(self, other):
if not isinstance(other, FrozenDict):
return dict(self.iteritems()) == other
if len(self) != len(other):
return False
for key, value in self.iteritems():
try:
if value != other[key]:
return False
except KeyError:
return False
return True
def __hash__(self):
return hash(frozenset(self.iteritems()))
def get(self, key, default=None):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
return default
def __iter__(self):
for item in frozenset.__iter__(self):
yield item.key
def iteritems(self):
for item in frozenset.__iter__(self):
yield (item.key, item.value)
def iterkeys(self):
for item in frozenset.__iter__(self):
yield item.key
def itervalues(self):
for item in frozenset.__iter__(self):
yield item.value
def __contains__(self, key):
return frozenset.__contains__(self, pair(key, None))
has_key = __contains__
def __repr__(self):
return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
#classmethod
def fromkeys(cls, keys, value=None):
return cls((key, value) for key in keys)
which, ultimately, does answer my own question:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
The accepted answer by #Unknown, as well as the answer by #AlexMartelli work perfectly fine, but only under the following constraints:
The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
Keys must support comparison operation. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
The comparison operator on keys imposes total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Therefore, sorting the items of a dictionary with such keys can result in an arbitrary order, and therefore will violate the rule that equal objects must have the same hash value.
The much faster answer by #ObenSonne lifts the constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).
The faster yet answer by #RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if:
Most of the (non-equal) dictionaries that need to be hashed have do not identical .keys().
If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.
I'd much appreciate any feedback on this, since I only tested this lightly so far.
# python 3.4
import collections
import operator
import sys
import itertools
import reprlib
# a wrapper to make an object hashable, while preserving equality
class AutoHash:
# for each known container type, we can optionally provide a tuple
# specifying: type, transform, aggregator
# even immutable types need to be included, since their items
# may make them unhashable
# transformation may be used to enforce the desired iteration
# the result of a transformation must be an iterable
# default: no change; for dictionaries, we use .items() to see values
# usually transformation choice only affects efficiency, not correctness
# aggregator is the function that combines all items into one object
# default: frozenset; for ordered containers, we can use tuple
# aggregator choice affects both efficiency and correctness
# e.g., using a tuple aggregator for a set is incorrect,
# since identical sets may end up with different hash values
# frozenset is safe since at worst it just causes more collisions
# unfortunately, no collections.ABC class is available that helps
# distinguish ordered from unordered containers
# so we need to just list them out manually as needed
type_info = collections.namedtuple(
'type_info',
'type transformation aggregator')
ident = lambda x: x
# order matters; first match is used to handle a datatype
known_types = (
# dict also handles defaultdict
type_info(dict, lambda d: d.items(), frozenset),
# no need to include set and frozenset, since they are fine with defaults
type_info(collections.OrderedDict, ident, tuple),
type_info(list, ident, tuple),
type_info(tuple, ident, tuple),
type_info(collections.deque, ident, tuple),
type_info(collections.Iterable, ident, frozenset) # other iterables
)
# hash_func can be set to replace the built-in hash function
# cache can be turned on; if it is, cycles will be detected,
# otherwise cycles in a data structure will cause failure
def __init__(self, data, hash_func=hash, cache=False, verbose=False):
self._data=data
self.hash_func=hash_func
self.verbose=verbose
self.cache=cache
# cache objects' hashes for performance and to deal with cycles
if self.cache:
self.seen={}
def hash_ex(self, o):
# note: isinstance(o, Hashable) won't check inner types
try:
if self.verbose:
print(type(o),
reprlib.repr(o),
self.hash_func(o),
file=sys.stderr)
return self.hash_func(o)
except TypeError:
pass
# we let built-in hash decide if the hash value is worth caching
# so we don't cache the built-in hash results
if self.cache and id(o) in self.seen:
return self.seen[id(o)][0] # found in cache
# check if o can be handled by decomposing it into components
for typ, transformation, aggregator in AutoHash.known_types:
if isinstance(o, typ):
# another option is:
# result = reduce(operator.xor, map(_hash_ex, handler(o)))
# but collisions are more likely with xor than with frozenset
# e.g. hash_ex([1,2,3,4])==0 with xor
try:
# try to frozenset the actual components, it's faster
h = self.hash_func(aggregator(transformation(o)))
except TypeError:
# components not hashable with built-in;
# apply our extended hash function to them
h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
if self.cache:
# storing the object too, otherwise memory location will be reused
self.seen[id(o)] = (h, o)
if self.verbose:
print(type(o), reprlib.repr(o), h, file=sys.stderr)
return h
raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))
def __hash__(self):
return self.hash_ex(self._data)
def __eq__(self, other):
# short circuit to save time
if self is other:
return True
# 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
# 2) any other situation => lhs.__eq__ will be called first
# case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
# => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
# case 2. neither side is a subclass of the other; self is lhs
# => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
# case 3. neither side is a subclass of the other; self is rhs
# => we can't compare to another type, and the other side already tried and failed;
# we should return False, but NotImplemented will have the same effect
# any other case: we won't reach the __eq__ code in this class, no need to worry about it
if isinstance(self, type(other)): # identifies case 1
return self._data == other._data
else: # identifies cases 2 and 3
return NotImplemented
d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))
d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))
You might also want to add these two methods to get the v2 pickling protocol work with hashdict instances. Otherwise cPickle will try to use hashdict.____setitem____ resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.
def __setstate__(self, objstate):
for k,v in objstate.items():
dict.__setitem__(self,k,v)
def __reduce__(self):
return (hashdict, (), dict(self),)
serialize the dict as string with json package:
d = {'a': 1, 'b': 2}
s = json.dumps(d)
restore the dict when you need:
d2 = json.loads(s)
If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:
cache[id(rule)] = "whatever"
since id() is unique for every dictionary
EDIT:
Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like
cache[ 'foo:bar' ] = 'baz'
If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like
cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )
I guess the advantage of this is that you wouldn't have to write as much code.