Hashing a dictionary? - python

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.

Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
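A minimal sketch of that suggestion, assuming every key is a string and every value is JSON-serializable. The name cache_key is hypothetical. Passing separators and ensure_ascii explicitly means the encoded string does not depend on the library defaults.

```python
import hashlib
import json


def cache_key(params):
    """Return a hex SHA-1 digest of a canonical JSON encoding of params."""
    canonical = json.dumps(
        params,
        sort_keys=True,         # nested dicts are sorted as well
        separators=(",", ":"),  # pin separators instead of relying on defaults
        ensure_ascii=True,      # pin how non-ASCII characters are escaped
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = {"q": "python", "filter": {"lang": "en", "page": 2}}
b = {"filter": {"page": 2, "lang": "en"}, "q": "python"}
print(cache_key(a) == cache_key(b))  # True: key order does not matter
```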

If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below, why this approach might not produce a stable result.
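One way to see the instability is to compute the same hash in fresh interpreters. The sketch below assumes CPython 3.7 or later and uses the PYTHONHASHSEED environment variable to control string hash randomization; the helper name and the sample dictionary are made up for illustration.

```python
import os
import subprocess
import sys

# Code executed in a separate interpreter process.
SNIPPET = "print(hash(frozenset({'q': 'python', 'page': '2'}.items())))"


def hash_in_fresh_interpreter(seed):
    """Run SNIPPET in a new Python process with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Same seed: same hash. Different seed: (almost certainly) a different hash.
print(hash_in_fresh_interpreter(1) == hash_in_fresh_interpreter(1))
print(hash_in_fresh_interpreter(1) == hash_in_fresh_interpreter(2))
```

Without a fixed seed, each interpreter start picks a random one, so a value from hash() is only suitable as a cache key within a single process.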

EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy
def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that contains
    only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)
    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)
    return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy
DictProxyType = type(object.__dict__)
def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:
        make_hash([cls.__dict__, cls.__name__])
    A function can be hashed like so:
        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)
    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)
    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. Did not test in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code

The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64
def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()
def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))
    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))
    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))
    return o
o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))
print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=

Here is a clearer solution.
def freeze(o):
    if isinstance(o,dict):
        return frozenset({ k:freeze(v) for k,v in o.items()}.items())
    if isinstance(o,list):
        return tuple([freeze(v) for v in o])
    return o
def make_hash(o):
    """
    makes a hash out of anything that contains only list,dict and hashable types including string and numeric types
    """
    return hash(freeze(o))

MD5 HASH
The method which resulted in the most stable results for me was using MD5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json
def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()

While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor
class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)
    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python will bitwise truncate the results.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
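A small illustration of the multiset point. It uses small integers because in CPython hash(7) == 7, which makes the arithmetic easy to follow; the helper names are made up for this sketch.

```python
from functools import reduce
from operator import xor


def xor_hash(items):
    """Combine element hashes with xor (order-independent)."""
    return reduce(xor, map(hash, items), 0)


def sum_hash(items):
    """Combine element hashes with addition (order-independent)."""
    return sum(map(hash, items))


once = [7]
thrice = [7, 7, 7]
print(xor_hash(once) == xor_hash(thrice))  # True: 7 ^ 7 ^ 7 == 7
print(sum_hash(once) == sum_hash(thrice))  # False: 7 != 21
```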
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.

Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib
def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    # hashlib needs bytes in Python 3, so encode the string form first
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')
    ).hexdigest()
    return result

Use DeepHash from DeepDiff Module
from deepdiff import DeepHash
obj = {'a':'1','b':'2'}
hashes = DeepHash(obj)[obj]

To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)) I would prefer quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like DateTime and more that are not JSON serializable.

You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.

You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc
def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
If you want to support more types, use functools.singledispatch (Python 3.7):
import functools
@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
@make_hashable.register
def _(x: collections.abc.Hashable):
    return x
@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)
@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)
@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})
# add your own types here

One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))

This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.

For nested structures, having string keys at the top level dict, you can use pickle(protocol=5) and hash the bytes object. If you need safety, you can use a safe serializer.
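A rough sketch of the pickle idea, assuming Python 3.8 or later (needed for protocol 5) and picklable contents. The helper name pickle_digest is made up. Note that pickle writes dict entries in iteration order, so two equal dicts built in a different key order produce different bytes.

```python
import hashlib
import pickle


def pickle_digest(obj):
    """SHA-256 hex digest of the protocol-5 pickle of obj."""
    data = pickle.dumps(obj, protocol=5)
    return hashlib.sha256(data).hexdigest()


nested = {"a": [1, 2, {"b": (3, 4)}], "c": {"d": None}}
print(pickle_digest(nested) == pickle_digest(nested))  # True

# Caveat: equal dicts with a different insertion order pickle differently.
reordered = {"c": {"d": None}, "a": [1, 2, {"b": (3, 4)}]}
print(nested == reordered)                                # True
print(pickle_digest(nested) == pickle_digest(reordered))  # False
```

If key order should not matter, the entries would have to be sorted before pickling.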

I do it like this:
hash(str(my_dict))

Related

How to properly specify argument type accepting dictionary values?

Here are a couple of functions:
from typing import Sequence
def avg(vals: Sequence[float]):
    return sum(val for val in vals) / len(vals)
def foo():
    the_dict = {'a': 1., 'b': 2.}
    return avg(the_dict.values())
PyCharm 2022.3 warns about the_dict.values() in the last line:
Expected type 'Sequence[float]', got _dict_values[float, str] instead
But those values can be iterated across and have their length taken.
I tried
from typing import Sequence, Union
def avg(vals: Union[Sequence[float], _dict_values]):
    ...
which seems insane, but that also didn't work.
Suggestions?
I can turn off the typing for that argument, but I am curious what the right annotation is.
_dict_values is not a Sequence (it is a view that can be iterated and measured with len(), but not indexed). Luckily, avg doesn't require everything Sequence ensures. You only need Iterable[float] for sum and Sized for len().
from collections.abc import Iterable, Sized
from typing import Protocol
class SupportsFloatMean(Iterable[float], Sized, Protocol):
    ...
def avg(vals: SupportsFloatMean):
    return sum(val for val in vals) / len(vals)
If you want the broadest possible type compatible with your avg function, you just need one that supports the Iterable protocol (used in the for-loop) and the Sized protocol (used by the len function). Unfortunately, neither of the two inherits from the other, as you can see here.
Thus, an intersection of the two is what you'll need to create. This is made possible (in this case) with typing.Protocol as mentioned here in PEP 544 and as used beautifully in @PeterSutton's answer.
If you want to be a bit less verbose, without the need for a custom protocol, and almost just as broad, you can simply use the Collection ABC. It inherits from Iterable and Sized as well as Container. The latter defines the __contains__ method, meaning you can do in-checks with it, which you technically do not need in your avg function.
The Collection is obviously still a superclass of ValuesView, as you can see here, so you'll have no trouble calling avg with your dict.values.
from collections.abc import Collection
def avg(vals: Collection[float]) -> float:
    return sum(val for val in vals) / len(vals)
def foo() -> float:
    the_dict = {'a': 1., 'b': 2.}
    return avg(the_dict.values())
Also, don't forget your return type annotations. ;-)
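The ABC relationships discussed in these answers can also be checked at runtime. This is only a quick sanity check with a made-up sample dict, not a substitute for what the type checker reports.

```python
from collections.abc import Collection, Iterable, Sequence, Sized

values = {"a": 1.0, "b": 2.0}.values()

print(isinstance(values, Sequence))    # False: no indexing, so not a Sequence
print(isinstance(values, Iterable))    # True
print(isinstance(values, Sized))       # True
print(isinstance(values, Collection))  # True: Iterable + Sized + Container
```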

What's the point of using [object instance].__self__?

I was checking the code of the toolz library's groupby function in Python and I found this:
def groupby(key, seq):
    """ Group a collection by a key function
    """
    if not callable(key):
        key = getter(key)
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[key(item)](item)
    rv = {}
    for k, v in d.items():
        rv[k] = v.__self__
    return rv
Is there any reason to use rv[k] = v.__self__ instead of rv[k] = v?
This is a somewhat confusing trick to save a small amount of time:
We are creating a defaultdict with a factory function that returns the bound append method of a new list instance ([].append). Then we can just do d[key(item)](item) instead of d[key(item)].append(item), as we would have to if the defaultdict contained lists. By not looking up append every time, we gain a small amount of time.
But now the dict contains bound methods instead of the lists, so we have to get the original list instance back via __self__.
__self__ is an attribute described for instance methods that returns the original instance. You can verify that with this for example:
>>> a = []
>>> a.append.__self__ is a
True
This is a somewhat convoluted, but possibly more efficient approach to creating and using a defaultdict of lists.
First, remember that the default item is lambda: [].append. This means create a new list, and store a bound append method in the dictionary. This saves you a method bind on every further append to the same key, and the garbage collect that follows. For example, the following more standard approach is less efficient:
d = collections.defaultdict(list)
for item in seq:
    d[key(item)].append(item)
The problem then becomes how to get the original lists back out of the dictionary, since the reference is not stored explicitly. Luckily, bound methods have a __self__ attribute which does just that. Here, [].append.__self__ is a reference to the original [].
As a side note, the last loop could be a comprehension:
return {k: v.__self__ for k, v in d.items()}
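Putting the pieces together, here is a self-contained sketch of the trick. It is a simplified stand-in for the toolz function (it only accepts a callable key), not the library's actual code.

```python
import collections


def groupby(key, seq):
    """Group the items of seq by key(item), using the bound-append trick."""
    # Each missing key gets the bound append method of a brand-new list.
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[key(item)](item)  # calls that list's append directly
    # __self__ recovers the list each bound method belongs to.
    return {k: v.__self__ for k, v in d.items()}


print(groupby(len, ["a", "bb", "c", "dd"]))  # {1: ['a', 'c'], 2: ['bb', 'dd']}
```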

Approximation and modification of dictionary variables

I have a question
I have a dictionary called (plane) where I can call variables such as 'name' or 'speed'.
These values give me this kind of output for example:
print(plane.get('name'))
print(plane.get('altitude'))
Output:
'name'= FLRDD8EFC
'speed'= 136.054323
My question is, how can I approximate the values as follows?
'name'= DD8EFC (Always deleting the first three lines)
'speed'= 136. (Approximate whole)
Thank you so much for your help!
G
My question is, how can I approximate the values as follows?
You have to write the code explicitly.
'name'= DD8EFC (Always deleting the first three lines)
Fetch the string, then slice it:
name = plane.get('name')[3:]
print(f"'name' = {name}")
However, the fact that you're using get rather than [] implies that you're expecting to handle the possibility that name doesn't exist in plane.
If that isn't a possibility, you should just use []:
name = plane['name'][3:]
If it is, you'll need to provide a default that can be sliced:
name = plane.get('name', '')[3:]
'speed'= 136. (Approximate whole)
It looks like you want to round to 0 fractional digits, but keep it a float? Call round with 0 digits on it. And again, either you don't need get, or you need a different default:
speed = round(plane['speed'], 0)
… or:
speed = round(plane.get('speed', 0.0), 0)
As for printing it: Python doesn't like to print a . after a float without also printing any fractional values. You can monkey with format fields, but it's probably simpler to just put the . in manually:
print(f"'speed': {speed:.0f}.")
>>> plane.get('name')[3:]
'DD8EFC'
>>> round(plane.get('speed'))
136
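A complete version of the same idea, using the sample values from the question and assuming both keys are always present:

```python
plane = {"name": "FLRDD8EFC", "speed": 136.054323}

name = plane["name"][3:]       # drop the first three characters
speed = round(plane["speed"])  # round() with no digits argument returns an int

print(f"'name' = {name}")     # 'name' = DD8EFC
print(f"'speed' = {speed}.")  # 'speed' = 136.
```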
You could, possibly, subclass collections.UserDict (or maybe dict as @abamert suggested), and code some sort of switch/filter/formatter in the special method __getitem__:
something like this, maybe:
from collections import UserDict
class MyDict(UserDict):
    def __getitem__(self, key):
        if key == 'name':
            return self.data[key][3:]
        if key == 'speed':
            return round(self.data[key], 3)
        return self.data[key]
if __name__ == '__main__':
    mydata = MyDict({'name': 'ABCDEF', 'speed': 12.123456})
    print(mydata['name'], mydata['speed'])
output:
DEF 12.123
or subclassing dict:
class MyDict(dict):
    def __getitem__(self, key):
        if key == 'name':
            return self.get(key)[3:]
        if key == 'speed':
            return round(self.get(key), 3)
        return self[key]
Disclaimer: This is more a proof of concept than anything; I do not recommend this approach; the 'switch' is ugly and could get out of hand as soon as the list of constraints grows a bit.
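If the list of constraints does grow, one possible way to keep it manageable is a table of per-key formatter functions instead of a chain of if statements. This is only a sketch; the class name FormattedDict and the formatters attribute are made up for illustration.

```python
from collections import UserDict


class FormattedDict(UserDict):
    # Maps a key to the function applied to its value on lookup.
    formatters = {
        "name": lambda s: s[3:],
        "speed": lambda x: round(x, 3),
    }

    def __getitem__(self, key):
        value = self.data[key]
        formatter = self.formatters.get(key)
        return formatter(value) if formatter else value


mydata = FormattedDict({"name": "ABCDEF", "speed": 12.123456, "id": 7})
print(mydata["name"], mydata["speed"], mydata["id"])  # DEF 12.123 7
```

Adding a new rule then means adding one entry to the table rather than another branch.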

Python 3 changing value of dictionary key in for loop not working

I have python 3 code that is not working as expected:
def addFunc(x,y):
    print (x+y)
def subABC(x,y,z):
    print (x-y-z)
def doublePower(base,exp):
    print(2*base**exp)
def RootFunc(inputDict):
    for k,v in inputDict.items():
        if v[0]==1:
            d[k] = addFunc(*v[1:])
        elif v[0] ==2:
            d[k] = subABC(*v[1:])
        elif v[0]==3:
            d[k] = doublePower(*v[1:])
d={"s1_7":[1,5,2],"d1_6":[2,12,3,3],"e1_3200":[3,40,2],"s2_13":[1,6,7],"d2_30":[2,42,2,10]}
RootFunc(d)
#test to make sure key var assignment works
print(d)
I get:
{'d2_30': None, 's2_13': None, 's1_7': None, 'e1_3200': None, 'd1_6': None}
I expected:
{'d2_30': 30, 's2_13': 13, 's1_7': 7, 'e1_3200': 3200, 'd1_6': 6}
What's wrong?
Semi related: I know dictionaries are unordered but is there any reason why python picked this order? Does it run the keys through a randomizer?
print does not return a value. It returns None, so every time you call your functions, they're printing to standard output and returning None. Try changing all print statements to return like so:
def addFunc(x,y):
    return x+y
This will give the value x+y back to whatever called the function.
Another problem with your code (unless you meant to do this) is that you define a dictionary d and then when you define your function, you are working on this dictionary d and not the dictionary that is 'input':
def RootFunc(inputDict):
    for k,v in inputDict.items():
        if v[0]==1:
            d[k] = addFunc(*v[1:])
Are you planning to always change d and not the dictionary that you are iterating over, inputDict?
There may be other issues as well (accepting a variable number of arguments within your functions, for instance), but it's good to address one problem at a time.
Additional Notes on Functions:
Here's some sort-of pseudocode that attempts to convey how functions are often used:
def sample_function(some_data):
    modified_data = []
    for element in some_data:
        processed = element  # do some processing here
        modified_data.append(processed)  # add the processed element to modified_data
    return modified_data
Functions are considered 'black box', which means you structure them so that you can dump some data into them and they always do the same stuff and you can call them over and over again. They will either return values or yield values or update some value or attribute or something (the latter are called 'side effects'). For the moment, just pay attention to the return statement.
Another interesting thing is that functions have 'scope' which means that when I just defined it with a fake-name for the argument, I don't actually have to have a variable called "some_data". I can pass whatever I want to the function, but inside the function I can refer to the fake name and create other variables that really only matter within the context of the function.
Now, if we run my function above, it will go ahead and process the data:
sample_function(my_data_set)
But this is often kind of pointless because the function is supposed to return something and I didn't do anything with what it returned. What I should do is assign the value of the function and its arguments to some container so I can keep the processed information.
my_modified_data = sample_function(my_data_set)
This is a really common way to use functions and you'll probably see it again.
One Simple Way to Approach Your Problem:
Taking all this into consideration, here is one way to solve your problem that comes from a really common programming paradigm:
def RootFunc(inputDict):
    temp_dict = {}
    for k,v in inputDict.items():
        if v[0]==1:
            temp_dict[k] = addFunc(*v[1:])
        elif v[0] ==2:
            temp_dict[k] = subABC(*v[1:])
        elif v[0]==3:
            temp_dict[k] = doublePower(*v[1:])
    return temp_dict
inputDict={"s1_7":[1,5,2],"d1_6":[2,12,3,3],"e1_3200":[3,40,2],"s2_13":[1,6,7],"d2_30":[2,42,2,10]}
final_dict = RootFunc(inputDict)
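For reference, here is an end-to-end sketch with the helper functions changed to return their results. It swaps the if/elif chain for a lookup table named OPERATIONS, which is a choice made for this sketch and not part of the original code. Unlike the original, an unknown operation code raises KeyError instead of being skipped.

```python
def addFunc(x, y):
    return x + y


def subABC(x, y, z):
    return x - y - z


def doublePower(base, exp):
    return 2 * base ** exp


# Maps the leading code in each list to the function that handles it.
OPERATIONS = {1: addFunc, 2: subABC, 3: doublePower}


def RootFunc(inputDict):
    # Build and return a new dict instead of mutating a global.
    return {k: OPERATIONS[v[0]](*v[1:]) for k, v in inputDict.items()}


inputDict = {"s1_7": [1, 5, 2], "d1_6": [2, 12, 3, 3], "e1_3200": [3, 40, 2],
             "s2_13": [1, 6, 7], "d2_30": [2, 42, 2, 10]}
print(RootFunc(inputDict))
# {'s1_7': 7, 'd1_6': 6, 'e1_3200': 3200, 's2_13': 13, 'd2_30': 30}
```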
As erewok stated, you are using print rather than return, which is the source of the None values. As for the ordering: as you already know, dictionaries are unordered (according to the Python docs, at least). The ordering is not random, though; it follows from the fact that dictionaries are implemented as hash tables.
Excerpt from the python doc: [...]A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is currently only one standard mapping type, the dictionary. [...]
Now key here is that the order of the element is not really random. I have often noticed that the order stays the same no matter how I construct a dictionary on some values... using lambda or just creating it outright, the order has always remained the same, so it can't be random, but it's definitely arbitrary.
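A note for readers on newer interpreters: from Python 3.7 on, the language guarantees that a dict iterates in insertion order, so the shuffled output shown in the question is specific to older versions. A quick check, assuming Python 3.7 or later:

```python
d = {}
d["s1_7"] = 7
d["d1_6"] = 6
d["e1_3200"] = 3200

# Keys come back in the order they were inserted.
print(list(d))  # ['s1_7', 'd1_6', 'e1_3200']
```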

comparing itemgetter objects

I noticed that operator.itemgetter objects don't define __eq__, and so their comparison defaults to checking identity (is).
Is there any disadvantage to defining two itemgetter instances as equal whenever their initialization argument lists compare as equal?
Here's one use case of such a comparison. Suppose you define a sorted data structure whose constructor requires a key function to define the sort. Suppose you want to check if two such data structures have identical key functions (e.g., in an assert statement; or to verify that they can be safely merged; etc.).
It would be nice if we could answer that question in the affirmative when the two key functions are itemgetter('id'). But currently, itemgetter('id') == itemgetter('id') would evaluate to False.
Niklas's answer is quite clever, but needs a stronger condition as itemgetter can take multiple arguments
from collections import defaultdict
from operator import itemgetter
from itertools import count
def cmp_getters(ig1, ig2):
    if any(not isinstance(x, itemgetter) for x in (ig1, ig2)):
        return False
    # count().__next__ is the Python 3 spelling of count().next
    d1 = defaultdict(count().__next__)
    d2 = defaultdict(count().__next__)
    ig1(d1)  # populate d1 as a side effect
    ig2(d2)  # populate d2 as a side effect
    return d1==d2
Some testcases
>>> cmp_getters(itemgetter('foo'), itemgetter('bar'))
False
>>> cmp_getters(itemgetter('foo'), itemgetter('bar','foo'))
False
>>> cmp_getters(itemgetter('foo','bar'), itemgetter('bar','foo'))
False
>>> cmp_getters(itemgetter('bar','foo'), itemgetter('bar','foo'))
True
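Another option, if you control how the key functions are created, is a thin wrapper that remembers its arguments and compares on them. This is only a sketch: ComparableGetter is a made-up name, it needs at least one item (as itemgetter does), and __hash__ assumes the items themselves are hashable.

```python
from operator import itemgetter


class ComparableGetter:
    """Behaves like itemgetter(*items), but compares equal by its items."""

    def __init__(self, *items):
        self.items = items
        self._getter = itemgetter(*items)

    def __call__(self, obj):
        return self._getter(obj)

    def __eq__(self, other):
        if not isinstance(other, ComparableGetter):
            return NotImplemented
        return self.items == other.items

    def __hash__(self):
        return hash(self.items)


print(ComparableGetter("id") == ComparableGetter("id"))          # True
print(ComparableGetter("id") == ComparableGetter("id", "name"))  # False
print(ComparableGetter("id")({"id": 3}))                         # 3
```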
itemgetter returns a callable. I hope you do not want to compare callables. Correct? Because the ids of the returned callables are not guaranteed to be the same even if you pass the same arguments.
def fun(a):
    def bar(b):
        return a*b
    return bar
a = fun(10)
print(id(a))
b = fun(10)
print(id(b))  # a different id: fun(10) built a new function object
On the other hand, when you use the itemgetter callable as an accessor to access the underlying object, then that object's comparison would be used to perform the comparison
It is illustrated in the Sorting HOWTO using the operator module functions.
