Suppose I want to create a dict (or dict-like object) that returns a default value if I attempt to access a key that's not in the dict.
I can do this either by using a defaultdict:
from collections import defaultdict
foo = defaultdict(lambda: "bar")
print(foo["hello"]) # "bar"
or by using a regular dict and always using dict.get(key, default) to retrieve values:
foo = dict()
print(foo.get("hello", "bar")) # "bar"
print(foo["hello"]) # KeyError (as expected)
Other than the obvious ergonomic overhead of having to remember to use .get() with a default value instead of the expected bracket syntax, what's the difference between these 2 approaches?
Aside from the ergonomics of having .get everywhere, one important difference is that if you look up a missing key in a defaultdict, it inserts a new element into itself rather than just returning the default. The most important implications of this are:
Later iterations will see every key that was ever looked up in a defaultdict
As more ends up stored in the dictionary, more memory is typically used
Mutating the default will store that mutation in a defaultdict; with .get the mutation is lost unless stored explicitly
from collections import defaultdict

default_foo = defaultdict(list)
dict_foo = dict()

for i in range(1024):
    default_foo[i]
    dict_foo.get(i, [])

print(len(default_foo.items()))  # 1024
print(len(dict_foo.items()))  # 0

# Defaults in a defaultdict can be mutated, whereas with .get mutations are lost
default_foo[1025].append("123")
dict_foo.get(1025, []).append("123")

print(default_foo[1025])  # ["123"]
print(dict_foo.get(1025, []))  # []
The difference here really comes down to how you want your program to handle a KeyError.
foo = dict()

def do_stuff_with_foo():
    print(foo["hello"])
    # Do something here

if __name__ == "__main__":
    try:
        foo["hello"]  # The key exists and has a value
    except KeyError:
        # The first code snippet does this
        foo["hello"] = "bar"
        do_stuff_with_foo()
        # The second code snippet does this
        exit(-1)
It's a matter of: do we want to stop the program entirely? Do we want the user to fill in a value for foo["hello"], or do we want to use a default value?
The first approach is a more compact way to do foo.get("hello", "bar"). But the kicker is: is this what we really want to happen?
My code looks something like this:
class SomeClass(str):
    pass

some_dict = {'s': 42}
>>> type(list(some_dict.keys())[0])
<class 'str'>
>>> s = SomeClass('s')
>>> some_dict[s] = 40
>>> some_dict  # expected: two different key-value pairs
{'s': 40}
>>> type(list(some_dict.keys())[0])
<class 'str'>
Why did Python convert the object s to the string "s" while updating the dictionary some_dict?
Whilst the hash value is related, it is not the main factor.
It is equality that is more important here. That is, objects may have the same hash value and not be equal, but equal objects must have the same hash value (though this is not strictly enforced). Otherwise you will end up with some strange bugs when using dict and set.
Since you have not defined the __eq__ method on SomeClass, you inherit the one on str. Python's builtins are designed to allow subclassing, so str.__eq__ returns True if the objects would be equal were it not for their differing types; e.g. 's' == SomeClass('s') is True. Thus it is right and proper that 's' and SomeClass('s') are equivalent as keys in a dictionary.
To get the behaviour you want, you must override the __eq__ dunder method to take the type into account. However, when you define a custom __eq__, Python stops giving you the automatic __hash__ dunder method, so you must define that as well. In this case we can just reuse str.__hash__.
class SomeClass(str):
    def __eq__(self, other):
        return (
            type(self) is SomeClass
            and type(other) is SomeClass
            and super().__eq__(other)
        )

    __hash__ = str.__hash__
d = {'s': 1}
d[SomeClass('s')] = 2
assert len(d) == 2
print(d)
prints: {'s': 2, 's': 1}
This is a really good question. When you put a (key, value) pair into a dict, it uses the hash function to get the hash value of the key and checks whether that hash value is already present. If it is, the dict compares the objects that share the hash value; if two objects are equal (__eq__(self, other) returns True), it updates the value, which is why your code encounters this behavior.
Since SomeClass does not override anything, 's' and SomeClass('s') have the same hash value, and 's'.__eq__(SomeClass('s')) returns True.
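You can verify both conditions quickly (a minimal sketch reusing SomeClass from the question):

class SomeClass(str):
    pass

s = SomeClass('s')
print(hash('s') == hash(s))  # True: str.__hash__ is inherited unchanged
print('s' == s)              # True: str.__eq__ ignores the subclass type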
I have a list called columnVariable. This list contains a bunch of instances of a class. Each class has a property sequenceNumber
when I run the following command:
for obj in columnVariable:
    print obj.sequenceNumber
I get the following output
3
1
2
10
11
I'd like to sort the list by the property sequenceNumber. The output I'd like is:
1
2
3
10
11
When I try the following:
for obj in columnVariable:
    print sorted(obj.sequenceNumber)
I get the output
[u'3']
[u'1']
[u'2']
[u'0', u'1']
[u'1', u'1']
It looks like each individual sequence number is being sorted instead of sorting the items in the list based on the sequence number.
I'm new to Python, so a little help would be appreciated. Thanks!
You may want to try this:

print sorted(columnVariable, key=lambda x: int(x.sequenceNumber))
You should use sorted with a key argument, e.g.
from operator import attrgetter
for obj in sorted(columnVariable, key=attrgetter('sequenceNumber')):
    print(obj)
Edit:
Since you want to sort the strings numerically, it's more appropriate to use a lambda function here:
for obj in sorted(columnVariable, key=lambda x: int(x.sequenceNumber)):
    print(obj)
Some people struggle to understand lambda functions, so I'll mention that it's fine to write a normal function definition for your key function:
def int_sequence_number_key(obj):
    return int(obj.sequenceNumber)

for obj in sorted(columnVariable, key=int_sequence_number_key):
    print(obj)
Now it's also possible to write tests for int_sequence_number_key, which is important for covering corner cases (e.g. what happens if sequenceNumber can be None or some other object that can't be converted to an int).
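For example, a couple of minimal tests for the int_sequence_number_key defined above might look like this (FakeObj is a hypothetical stand-in for the real class, and the TypeError expectation assumes int() is what fails for None):

import pytest

def test_converts_string_sequence_number():
    class FakeObj:
        sequenceNumber = "10"
    assert int_sequence_number_key(FakeObj()) == 10

def test_rejects_none_sequence_number():
    class FakeObj:
        sequenceNumber = None
    with pytest.raises(TypeError):
        int_sequence_number_key(FakeObj())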
The Python sorted function takes a key function as an argument, which can be used to specify how to sort. In this case, you would use sorted(columnVariable, key=lambda x: x.sequenceNumber).
The above code can be made faster by using the operator module:

from operator import attrgetter

sorted(columnVariable, key=attrgetter("sequenceNumber"))
How do I check if two instances of a
class FooBar(object):
    def __init__(self, param):
        self.param = param
        self.param_2 = self.function_2(param)
        self.param_3 = self.function_3()
are identical? By identical I mean they have the same values in all of their variables.
a = FooBar(param)
b = FooBar(param)
I thought of
if a == b:
    print "a and b are identical!"
Will this do it without side effects?
The background for my question is unit testing. I want to achieve something like:
self.failUnlessEqual(self.my_object.a_function(), another_object)
If you want the == to work, then implement the __eq__ method in your class to perform the rich comparison.
If all you want to do is compare the equality of all attributes, you can do that succinctly by comparison of __dict__ in each object:
class MyClass:
    def __eq__(self, other):
        return self.__dict__ == other.__dict__
For an arbitrary object, the == operator will only return True if the two objects are the same object (i.e. if they refer to the same address in memory).
To get more 'bespoke' behaviour, you'll want to override the rich comparison operators, in this case specifically __eq__. Try adding this to your class:
def __eq__(self, other):
    if self.param == other.param \
            and self.param_2 == other.param_2 \
            and self.param_3 == other.param_3:
        return True
    else:
        return False
(the comparison of all params could be neatened up here, but I've left them in for clarity).
Note that if the parameters are themselves objects you've defined, those objects will have to define __eq__ in a similar way for this to work.
Another point to note is that if you try to compare a FooBar object with another type of object in the way I've done above, Python will try to access the param, param_2 and param_3 attributes of that other object, which will throw an AttributeError. You'll probably want to first check that the object you're comparing with is an instance of FooBar, using isinstance(other, FooBar). This is not done by default, as there may be situations where you would like to return True for comparisons between different types.
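A hedged sketch of what that check could look like (returning NotImplemented instead of raising lets Python try the other operand's __eq__):

def __eq__(self, other):
    if not isinstance(other, FooBar):
        return NotImplemented
    return (self.param == other.param
            and self.param_2 == other.param_2
            and self.param_3 == other.param_3)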
See AJ's answer for a tidier way to simply compare all parameters that also shouldn't throw an attribute error.
For more information on rich comparison, see the Python docs.
For Python 3.7 onwards, you can also use a dataclass to check exactly what you want very easily. For example:
from dataclasses import dataclass

@dataclass
class FooBar:
    param: str
    param2: float
    param3: int

a = FooBar("test_text", 2.0, 3)
b = FooBar("test_text", 2.0, 3)
print(a == b)
which prints True.
According to Learning Python by Lutz, the "==" operator tests value equivalence, comparing all nested objects recursively. The "is" operator tests whether two objects are the same object, i.e. at the same address in memory (the same pointer value).
Except for the caching/reuse of small integers and short strings, two objects such as x = [1, 2] and y = [1, 2] are equal ("==") in value, but y is x returns False. The same is true of two floats x = 3.567 and y = 3.567. Their addresses differ, in other words id(x) != id(y).
For class objects, we have to override the __eq__() method to make two class A objects like x = A(1, [2, 3]) and y = A(1, [2, 3]) compare "==" by content. By default, "==" on class objects falls back to comparing identities, and since id(x) != id(y) here, x != y.
In summary: if x is y, then x == y, but the opposite is not true.
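A short session makes the distinction concrete:

x = [1, 2]
y = [1, 2]
print(x == y)  # True: same value
print(x is y)  # False: different objects, so id(x) != id(y)
z = x
print(z is x)  # True: both names refer to the same object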
If this is something you want to use in your tests where you just want to verify fields of simple object to be equal, look at compare from testfixtures:
from testfixtures import compare
compare(a, b)
To avoid adding or removing attributes on the model and then forgetting to make the corresponding changes to your __eq__ function, you can define it as follows (this assumes a Django model, where _meta.fields lists the model's fields).
def __eq__(self, other):
    if self.__class__ == other.__class__:
        fields = [field.name for field in self._meta.fields]
        for field in fields:
            if not getattr(self, field) == getattr(other, field):
                return False
        return True
    else:
        raise TypeError('Comparing object is not of the same type.')
In this way, all the object attributes are compared. Now you can check for attribute equality either with object.__eq__(other) or object == other.
For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
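For example, pinning those options explicitly could look like this (stable_hash is an illustrative name, not an existing API, and cross-version stability is still not guaranteed):

import hashlib
import json

def stable_hash(d):
    # Fix every option that affects the serialized form, rather than
    # relying on the library defaults never changing.
    encoded = json.dumps(d, sort_keys=True, separators=(',', ':'),
                         ensure_ascii=True).encode()
    return hashlib.sha1(encoded).hexdigest()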
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below, why this approach might not produce a stable result.
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass

foo = Foo()
print(hash(foo))  # 1209812346789
foo.a = 1
print(hash(foo))  # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass

foo = Foo()
print(make_hash(foo.__dict__))  # 1209812346789
foo.a = 1
print(make_hash(foo.__dict__))  # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print(make_hash(Foo.__dict__))  # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print(type(Foo.__dict__))  # <type 'dict_proxy'>
Here is a mechanism similar to the previous one that will handle classes appropriately:
import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

        make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print(make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print(make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print(make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that in Python 2, func.__code__ should be replaced with func.func_code.
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))
    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))
    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))
    return o

o = dict(x=1, b=2, c=[3, 4, 5], d={6, 7})

print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())
    if isinstance(o, list):
        return tuple([freeze(v) for v in o])
    return o

def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
MD5 HASH
The method which gave me the most stable results was using md5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} hashes
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
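Usage, confirming that key order doesn't change the result:

print(dict_hash({'a': 1, 'b': 2}) == dict_hash({'b': 2, 'a': 1}))  # True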
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor

class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees its keys are unique. And sum works because Python truncates oversized __hash__ results.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
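Usage: since hashable defines __hash__, its instances can go straight into a set or be used as dict keys:

d1 = hashable({'a': 1, 'b': 2})
d2 = hashable({'b': 2, 'a': 1})
print(len({d1, d2}))  # 1: equal contents hash equally, regardless of key order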
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        # encode: sha1 requires bytes on Python 3
        ('%s' % sorted(the_dict.items())).encode()
    ).hexdigest()
    return result
Use DeepHash from the DeepDiff module:

from deepdiff import DeepHash

obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer this quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like datetime and others that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
If you want to support more types, use functools.singledispatch (Python 3.7+):

import functools

# collections.abc and frozendict are imported as above

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
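Usage on a nested structure (a small sketch; the inner containers become frozendict, tuple and frozenset respectively, so the whole result is hashable):

nested = {'a': [1, 2], 'b': {'c': {3, 4}}}
print(hash(make_hashable(nested)))  # prints an integer, no TypeError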
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map

d = dict(a=1, b=2)
immap = Map(d)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys at the top-level dict, you can use pickle(protocol=5) and hash the resulting bytes object. If you need safety, you can use a safe serializer.
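A minimal sketch of that approach (pickle_hash is an illustrative name; note that pickle output is not guaranteed identical across Python versions, so this suits caching within one environment):

import hashlib
import pickle

def pickle_hash(obj):
    # protocol=5 pins the pickle format (available in Python 3.8+)
    data = pickle.dumps(obj, protocol=5)
    return hashlib.sha256(data).hexdigest()

print(pickle_hash({'a': 1, 'b': [2, 3], 'c': {'d': 4}}))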
I do it like this:
hash(str(my_dict))