I would like to ask what the cheapest data type is (in terms of memory consumption and the cost to hold and process it) to use as a dummy value in a Python dict. Only the keys of the dict matter to me; the values are just placeholders.
For example:
d1 = {1: None, 2: None, 3: None}
d2 = {1: -1, 2: -1, 3: -1}
d3 = {1: False, 2: False, 3: False}
Here only the keys (1, 2, 3) are useful to me; the values can be anything, since they are just placeholders. What I want to know is which dummy value I should use here. For now I use None, but I am not sure whether it is the "cheapest" one.
P.S. I know the best option for storing only keys may be to use a set instead of a dict with dummy values. However, I am doing this because I want to exchange data between Python and C++ using SWIG. For now I have figured out how to pass a Python dict to C++ as a std::map using SWIG, but I cannot find anything about how to pass a Python set to C++ as a std::set...
Any help or guidance is much appreciated!
python 3.4 64bit:
>>> import sys
>>> sys.getsizeof(None)
16
>>> sys.getsizeof(False)
24
>>> sys.getsizeof(1)
28
>>>
So None would appear to be the best choice (I've only listed immutable objects, and disregarded strings and tuples). Note that it doesn't matter much, as those objects are usually cached, so the size isn't multiplied by the number of elements in your dictionary (furthermore, None is guaranteed to be a singleton).
That said, the cost of the actual object is negligible compared to the cost of storing a reference to that object for each key/value pair. If your dictionary holds 1000 values, you have 1000 references to store, whatever the size of the value.
Conclusion: it doesn't matter much as long as you're using the same reference everywhere, and it's going to cost much more than a set anyway because of the references being stored as the values of each dictionary entry.
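As a rough illustration of the point about shared references, the sketch below builds a placeholder dict with dict.fromkeys (which stores the same None object for every key) and prints the container sizes next to an equivalent set. The names keys, placeholder_dict and key_set are made up for this example, and the byte counts depend on the Python version and platform, so none are shown here.

```python
import sys

keys = range(1000)

# dict.fromkeys stores a reference to the same None object for every key.
placeholder_dict = dict.fromkeys(keys)
key_set = set(keys)

# Every value is the one None singleton, so no per-entry value object exists.
all_values_shared = all(value is None for value in placeholder_dict.values())
print(all_values_shared)

# Container overhead only; getsizeof does not follow the stored references.
print(sys.getsizeof(placeholder_dict), sys.getsizeof(key_set))
```

Whatever placeholder is chosen, the per-entry cost is the reference slot in the dict, which is why the choice between None, False and -1 makes little practical difference.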
One possible alternative would be to pass the set to the C++ side as a JSON representation (serialized as a list) in a character pointer, and parse it there with a good JSON parser. Unless your values are big floating-point values (or huge integers), that would save memory, because the per-object overhead is eliminated by the serialization.
>>> import json
>>> json.dumps(list(set(range(4,10))))
'[4, 5, 6, 7, 8, 9]' # hard to beat that in terms of size!
You can use a std::set on the C++ side, but without writing your own typemap SWIG seems to only accept a Python list (or an instance of the named template) as the set parameter. Example (Windows):
test.i
%module test
%include <std_set.i>
%template(seti) std::set<int>;
%inline %{
#include <set>
#include <iostream>
void func(std::set<int> a)
{
    for(auto i : a)
        std::cout << i << std::endl;
}
%}
Output:
>>> import test
>>> s = test.seti([1,1,2,2,3,3]) # pass named template
>>> test.func(s)
1
2
3
>>> test.func([1,2,3,3,4,4]) # pass a list that converts to a set
1
2
3
4
>>> test.func({1,1,2,2,3}) # Actual set doesn't work.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: in method 'func', argument 1 of type 'std::set< int,std::less< int >,std::allocator< int > >'
Even though sets are unhashable, a membership check in another set works:
>>> set() in {frozenset()}
True
I expected TypeError: unhashable type: 'set', consistent with other behaviours in Python:
>>> set() in {} # doesn't work when checking in dict
TypeError: unhashable type: 'set'
>>> {} in {frozenset()} # looking up some other unhashable type doesn't work
TypeError: unhashable type: 'dict'
So, how is the membership check of a set in another set implemented?
set_contains is implemented like this:
static int
set_contains(PySetObject *so, PyObject *key)
{
    PyObject *tmpkey;
    int rv;
    rv = set_contains_key(so, key);
    if (rv < 0) {
        if (!PySet_Check(key) || !PyErr_ExceptionMatches(PyExc_TypeError))
            return -1;
        PyErr_Clear();
        tmpkey = make_new_set(&PyFrozenSet_Type, key);
        if (tmpkey == NULL)
            return -1;
        rv = set_contains_key(so, tmpkey);
        Py_DECREF(tmpkey);
    }
    return rv;
}
So this will delegate directly to set_contains_key which will essentially hash the object and then look up the element using its hash.
If the object is unhashable, set_contains_key returns -1, so we get inside that if. Here, we check explicitly whether the passed key object is a set (or an instance of a set subtype) and whether we previously got a type error. This would suggest that we tried a containment check with a set but that failed because it is unhashable.
In that exact situation, we now create a new frozenset from that set and attempt the containment check using set_contains_key again. And since frozensets are properly hashable, we are able to find our result that way.
This explains why the following examples will work properly even though the set itself is not hashable:
>>> set() in {frozenset()}
True
>>> set(('a')) in { frozenset(('a')) }
True
The last line of the documentation for sets discusses this:
Note, the elem argument to the __contains__(), remove(), and discard()
methods may be a set. To support searching for an equivalent
frozenset, a temporary one is created from elem.
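The same fallback applies to remove() and discard(), as the quoted note says. A small sketch of all three operations (the variable names outer and found are only illustrative):

```python
outer = {frozenset({1, 2}), frozenset()}

# A plain set is unhashable, yet each call succeeds because a temporary
# frozenset is built from the argument for the lookup.
found = {1, 2} in outer
outer.discard({1, 2})
outer.remove(set())

print(found, outer)
```

After the two removals, outer is empty, which shows that the equivalent frozensets were located and deleted.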
While trying to create a custom case-insensitive dictionary, I came across the following inconvenient and (from my point of view) unexpected behaviour. When deriving a class from dict, the overridden __iter__, keys and values methods are ignored when converting back to dict. I have condensed it to the following test case:
try:
    from collections.abc import MutableMapping  # Python 3.3+
except ImportError:
    from collections import MutableMapping  # Python 2
class Dict(dict):
    def __init__(self):
        super(Dict, self).__init__(x = 1)
    def __getitem__(self, key):
        return 2
    def values(self):
        return 3
    def __iter__(self):
        yield 'y'
    def keys(self):
        return 'z'
    # Borrow the generic items()/iteritems() built on __iter__ and __getitem__
    if hasattr(MutableMapping, 'items'):
        items = MutableMapping.items
    if hasattr(MutableMapping, 'iteritems'):
        iteritems = MutableMapping.iteritems
d = Dict()
print(dict(d)) # {'x': 1}
print(dict(d.items())) # {'y': 2}
The return values of keys, values, __iter__ and __getitem__ are deliberately inconsistent, purely to demonstrate which methods are actually called.
The documentation for dict.__init__ says:
If a positional argument is given and it is a mapping object, a
dictionary is created with the same key-value pairs as the mapping
object. Otherwise, the positional argument must be an iterator object.
I guess it has something to do with the first sentence and maybe with optimizations for builtin dictionaries.
Why exactly does the call to dict(d) not use any of keys, __iter__?
Is it possible to overload the 'mapping' somehow to force the dict constructor to use my presentation of key-value pairs?
Why did I use this? For a case-insensitive but case-preserving dictionary, I wanted to:
store (lowercase => (original_case, value)) internally, while appearing as (any_case => value);
derive from dict in order to work with some external library code that uses isinstance checks;
avoid two dictionary lookups: lower_case => original_case, followed by original_case => value (this is the workaround I am using now instead).
If you are interested in the application case: here is corresponding branch
In the file dictobject.c, you see in line 1795ff. the relevant code:
static int
dict_update_common(PyObject *self, PyObject *args, PyObject *kwds, char *methname)
{
    PyObject *arg = NULL;
    int result = 0;
    if (!PyArg_UnpackTuple(args, methname, 0, 1, &arg))
        result = -1;
    else if (arg != NULL) {
        _Py_IDENTIFIER(keys);
        if (_PyObject_HasAttrId(arg, &PyId_keys))
            result = PyDict_Merge(self, arg, 1);
        else
            result = PyDict_MergeFromSeq2(self, arg, 1);
    }
    if (result == 0 && kwds != NULL) {
        if (PyArg_ValidateKeywordArguments(kwds))
            result = PyDict_Merge(self, kwds, 1);
        else
            result = -1;
    }
    return result;
}
This tells us that if the object has a keys attribute, the code that gets called is a plain merge. The code called there (l. 1915 ff.) distinguishes between real dicts and other objects. In the case of real dicts, the items are read out with PyDict_GetItem(), which is the innermost interface to the object and doesn't bother with any user-defined methods.
So instead of inheriting from dict, you should use UserDict (collections.UserDict in Python 3).
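A minimal sketch of that suggestion, assuming Python 3's collections.UserDict. The class name CaseInsensitiveDict and the storage layout (lowercase key mapped to an (original_key, value) pair) are illustrative choices based on the question, not code from the asker's project:

```python
from collections import UserDict


class CaseInsensitiveDict(UserDict):
    """Case-insensitive lookups that still remember the original key case."""

    def __setitem__(self, key, value):
        # self.data is the plain dict that UserDict wraps.
        self.data[key.lower()] = (key, value)

    def __getitem__(self, key):
        return self.data[key.lower()][1]

    def __delitem__(self, key):
        del self.data[key.lower()]

    def __contains__(self, key):
        return key.lower() in self.data

    def __iter__(self):
        # Iterate over the keys in their original case.
        return (original for original, _ in self.data.values())


headers = CaseInsensitiveDict()
headers['Content-Type'] = 'text/html'

print(headers['content-type'])
print(dict(headers))
print(isinstance(headers, dict))
```

Because the object is not a real dict, dict(headers) goes through keys() and __getitem__, so the overridden behaviour is respected. The trade-off raised in the question remains: the isinstance(headers, dict) check is False.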
Is it possible to overload the 'mapping' somehow to force the dict constructor to use my presentation of key-value pairs?
No.
Because dict is a built-in type, redefining its semantics would certainly cause outright breakage elsewhere.
You've got a library in which you can't override the behavior of dict. That's tough, but redefining the language primitives isn't the answer. You'd probably find it irksome if someone screwed with the commutative property of integer addition behind your back; that's why they can't.
And with regard to your comment "UserDict (correctly) gives False in isinstance(d, dict) checks": of course it does, because it isn't a dict, and dict has very specific invariants which UserDict can't assure.
Without subclassing dict, what would a class need in order to be considered a mapping, so that it can be passed to a function with **?
from abc import ABCMeta
class uobj:
    __metaclass__ = ABCMeta
uobj.register(dict)
def f(**k): return k
o = uobj()
f(**o)
# outputs: f() argument after ** must be a mapping, not uobj
I would like to get at least to the point where it raises errors about missing mapping functionality, so that I can begin implementing it.
I reviewed "emulating container types", but simply defining magic methods has no effect. Using ABCMeta to override and register it as a dict validates assertions as a subclass, but fails isinstance(o, dict). Ideally, I don't even want to use ABCMeta.
The __getitem__() and keys() methods will suffice:
>>> class D:
...     def keys(self):
...         return ['a', 'b']
...     def __getitem__(self, key):
...         return key.upper()
...
>>> def f(**kwds):
...     print(kwds)
...
>>> f(**D())
{'a': 'A', 'b': 'B'}
If you're trying to create a Mapping — not just satisfy the requirements for passing to a function — then you really should inherit from collections.abc.Mapping. As described in the documentation, you need to implement just:
__getitem__
__len__
__iter__
The Mixin will implement everything else for you: __contains__, keys, items, values, get, __eq__, and __ne__.
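A short sketch of that approach, assuming Python 3. The class UpperCased and its behaviour (mapping each name to its upper-cased form) are invented for the example:

```python
from collections.abc import Mapping


class UpperCased(Mapping):
    """Read-only mapping from each given name to its upper-cased form."""

    def __init__(self, *names):
        self._names = names

    def __getitem__(self, key):
        if key not in self._names:
            raise KeyError(key)
        return key.upper()

    def __len__(self):
        return len(self._names)

    def __iter__(self):
        return iter(self._names)


def f(**kwds):
    return kwds


m = UpperCased('a', 'b')
print(f(**m))

# Methods supplied by the Mapping mixin, without any extra code:
print(m.get('zzz'), 'a' in m, list(m.items()))
```

Only the three abstract methods are written by hand; keys(), items(), get() and __contains__ come from the base class, and keys() plus __getitem__ are what the ** unpacking relies on.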
The answer can be found by digging through the source.
When attempting to use a non-mapping object with **, the following error is given:
TypeError: 'Foo' object is not a mapping
If we search CPython's source for that error, we can find the code that causes that error to be raised:
case TARGET(DICT_UPDATE): {
    PyObject *update = POP();
    PyObject *dict = PEEK(oparg);
    if (PyDict_Update(dict, update) < 0) {
        if (_PyErr_ExceptionMatches(tstate, PyExc_AttributeError)) {
            _PyErr_Format(tstate, PyExc_TypeError,
                          "'%.200s' object is not a mapping",
                          Py_TYPE(update)->tp_name);
PyDict_Update is actually dict_merge, and the error is thrown when dict_merge returns a negative number. If we check the source for dict_merge, we can see what leads to -1 being returned:
/* We accept for the argument either a concrete dictionary object,
 * or an abstract "mapping" object.  For the former, we can do
 * things quite efficiently.  For the latter, we only require that
 * PyMapping_Keys() and PyObject_GetItem() be supported.
 */
if (a == NULL || !PyDict_Check(a) || b == NULL) {
    PyErr_BadInternalCall();
    return -1;
The key part being:
For the latter, we only require that PyMapping_Keys() and PyObject_GetItem() be supported.
For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
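One way this could look in practice is sketched below, with the separators and ensure_ascii arguments pinned explicitly as suggested. The function name cache_key and the sample parameters are assumptions for illustration; sha1 is used only because the question already uses it.

```python
import hashlib
import json


def cache_key(params):
    """Hex digest for a (possibly nested) dict whose keys are all strings."""
    payload = json.dumps(
        params,
        sort_keys=True,          # canonical key order at every nesting level
        separators=(",", ":"),   # pin the separators instead of using defaults
        ensure_ascii=True,       # pin the escaping of non-ASCII characters
    )
    return hashlib.sha1(payload.encode("ascii")).hexdigest()


a = cache_key({"q": "dict", "filter": {"y": 2, "x": 1}})
b = cache_key({"filter": {"x": 1, "y": 2}, "q": "dict"})
print(a == b)
```

The two calls receive the same data with different key order at both levels and produce the same digest. Values that json cannot serialize (for example datetime objects) would still need a custom default= handler.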
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below for why this approach might not produce a stable result.
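A brief sketch of what this approach does and does not guarantee. Within a single interpreter process the result is independent of insertion order, but since Python 3.3 the hashes of str objects are randomized per process by default (controlled by PYTHONHASHSEED), so the value should not be stored or shared between runs. The variable names are illustrative.

```python
flat_a = {"page": 1, "q": "dict"}
flat_b = {"q": "dict", "page": 1}   # same items, different insertion order

# Within one interpreter process, insertion order does not affect the result.
same_in_process = (
    hash(frozenset(flat_a.items())) == hash(frozenset(flat_b.items()))
)
print(same_in_process)

# A nested dict value is unhashable, so the approach fails for nested input.
nested = {"filter": {"x": 1}}
try:
    hash(frozenset(nested.items()))
    nested_ok = True
except TypeError:
    nested_ok = False
print(nested_ok)
```

This makes the approach suitable for in-memory use, such as keys of a dict or members of a set, rather than for persistent cache keys.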
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy
def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that contains
    only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)
    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)
    return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy
DictProxyType = type(object.__dict__)
def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:
        make_hash([cls.__dict__, cls.__name__])
    A function can be hashed like so:
        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)
    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)
    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. Did not test in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64
def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()
def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))
    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))
    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))
    return o
o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))
print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o,dict):
        return frozenset({ k:freeze(v) for k,v in o.items()}.items())
    if isinstance(o,list):
        return tuple([freeze(v) for v in o])
    return o
def make_hash(o):
    """
    makes a hash out of anything that contains only list,dict and hashable types including string and numeric types
    """
    return hash(freeze(o))
MD5 HASH
The method that gave me the most stable results was using MD5 hashes together with json.dumps.
from typing import Dict, Any
import hashlib
import json
def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, they do a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor
class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)
    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python will bitwise truncate the results.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
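The multiset point can be checked with a small sketch. The helper names xor_hash and sum_hash are made up here, and small integers are used as elements because CPython hashes them to themselves, which keeps the arithmetic easy to follow:

```python
from functools import reduce
from operator import xor


def xor_hash(items):
    # Order-independent: xor is commutative and associative.
    return reduce(xor, map(hash, items), 0)


def sum_hash(items):
    # Order-independent as well, and sensitive to repeated elements.
    return sum(map(hash, items))


once = [7]
thrice = [7, 7, 7]

print(xor_hash(once) == xor_hash(thrice))   # repeated elements cancel in pairs
print(sum_hash(once) == sum_hash(thrice))   # 7 versus 21
print(xor_hash([1, 2, 3]) == xor_hash([3, 2, 1]))
```

With xor, the single element and the tripled element collide, while the sum distinguishes them; both combine hashes independently of order.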
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib
def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    # hashlib needs bytes in Python 3, so encode the string first
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')
    ).hexdigest()
    return result
Use DeepHash from DeepDiff Module
from deepdiff import DeepHash
obj = {'a':'1','b':'2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)) I would prefer this quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like DateTime and more that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc
def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
If you want to support more types, use functools.singledispatch (Python 3.7):
import functools
@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
@make_hashable.register
def _(x: collections.abc.Hashable):
    return x
@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)
@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)
@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})
# add your own types here
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
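One further caveat is worth illustrating: a tuple of items depends on insertion order, so two equal dicts can produce different tuples. The sketch below assumes Python 3.7+, where dicts preserve insertion order; the variable names are illustrative.

```python
first = {"a": 1, "b": 2}
second = {"b": 2, "a": 1}   # equal dict, different insertion order

print(first == second)

# The item tuples differ because they follow insertion order...
same_unsorted = tuple(first.items()) == tuple(second.items())
print(same_unsorted)

# ...so sort the items first if equal dicts should hash equally.
same_sorted = (
    hash(tuple(sorted(first.items()))) == hash(tuple(sorted(second.items())))
)
print(same_sorted)
```

Sorting requires the keys to be mutually comparable, which holds for the common case of all-string keys.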
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys in the top-level dict, you can serialize with pickle (protocol=5) and hash the resulting bytes object. If you need safety, you can use a safe serializer.
I do it like this:
hash(str(my_dict))