How to implement a lazy setdefault? - python

One minor annoyance with dict.setdefault is that it always evaluates its second argument (when given, of course), even when the first argument is already a key in the dictionary.
For example:
import random
def noisy_default():
    ret = random.randint(0, 10000000)
    print 'noisy_default: returning %d' % ret
    return ret
d = dict()
print d.setdefault(1, noisy_default())
print d.setdefault(1, noisy_default())
This produces output like the following:
noisy_default: returning 4063267
4063267
noisy_default: returning 628989
4063267
As the last line confirms, the second execution of noisy_default is unnecessary, since by this point the key 1 is already present in d (with value 4063267).
Is it possible to implement a subclass of dict whose setdefault method evaluates its second argument lazily?
EDIT:
Below is an implementation inspired by BrenBarn's comment and Pavel Anossov's answer. While at it, I went ahead and implemented a lazy version of get as well, since the underlying idea is essentially the same.
class LazyDict(dict):
    def get(self, key, thunk=None):
        return (self[key] if key in self else
                thunk() if callable(thunk) else
                thunk)

    def setdefault(self, key, thunk=None):
        return (self[key] if key in self else
                dict.setdefault(self, key,
                                thunk() if callable(thunk) else
                                thunk))
Now, the snippet
d = LazyDict()
print d.setdefault(1, noisy_default)
print d.setdefault(1, noisy_default)
produces output like this:
noisy_default: returning 5025427
5025427
5025427
Notice that the second argument to d.setdefault above is now a callable, not a function call.
When the second argument to LazyDict.get or LazyDict.setdefault is not a callable, they behave the same way as the corresponding dict methods.
If one wants to pass a callable as the default value itself (i.e., not meant to be called), or if the callable to be called requires arguments, prepend lambda: to the appropriate argument. E.g.:
d1.setdefault('div', lambda: div_callback)
d2.setdefault('foo', lambda: bar('frobozz'))
Those who don't like the idea of overriding get and setdefault, and/or the resulting need to test for callability, etc., can use this version instead:
class LazyButHonestDict(dict):
    def lazyget(self, key, thunk=lambda: None):
        return self[key] if key in self else thunk()

    def lazysetdefault(self, key, thunk=lambda: None):
        return (self[key] if key in self else
                self.setdefault(key, thunk()))
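A quick check, reusing noisy_default from above (the random value will of course differ between runs):

d = LazyButHonestDict()
print d.lazysetdefault(1, noisy_default) # noise
print d.lazysetdefault(1, noisy_default) # no noise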

This can be accomplished with defaultdict, too. It is instantiated with a callable, which is then called when a nonexistent element is accessed.
from collections import defaultdict
d = defaultdict(noisy_default)
d[1] # noise
d[1] # no noise
The caveat with defaultdict is that the callable gets no arguments, so you cannot derive the default value from the key as you could with dict.setdefault. This can be mitigated by overriding __missing__ in a subclass:
from collections import defaultdict
class defaultdict2(defaultdict):
    def __missing__(self, key):
        value = self.default_factory(key)
        self[key] = value
        return value

def noisy_default_with_key(key):
    print key
    return key + 1
d = defaultdict2(noisy_default_with_key)
d[1] # prints 1, sets 2, returns 2
d[1] # does not print anything, does not set anything, returns 2
For more information, see the collections module.

You can do that in a one-liner using a ternary operator:
value = cache[key] if key in cache else cache.setdefault(key, func(key))
If you are sure that the cache will never store falsy values, you can simplify it a little bit:
value = cache.get(key) or cache.setdefault(key, func(key))

No, evaluation of arguments happens before the call. You can implement a setdefault-like function that takes a callable as its second argument and calls it only if it is needed.
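A minimal sketch of such a helper (the name and signature are illustrative, not from any library):

def lazy_setdefault(d, key, factory):
    # factory() runs only when the key is genuinely missing
    if key not in d:
        d[key] = factory()
    return d[key]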

There seems to be no one-liner that doesn't require an extra class or extra lookups. For the record, here is an easy (if not concise) way of achieving it without either of them.
try:
    value = dct[key]
except KeyError:
    value = noisy_default()
    dct[key] = value
return value

Related

python dictionary getter with default value not behaving as expected [duplicate]

I am trying to provide a function as the default argument for the dictionary's get function, like this
def run():
    print "RUNNING"
test = {'store':1}
test.get('store', run())
However, when this is run, it displays the following output:
RUNNING
1
so my question is, as the title says, is there a way to provide a callable as the default value for the get method without it being called if the key exists?
Another option, assuming you don't intend to store falsy values in your dictionary:
test.get('store') or run()
In python, the or operator does not evaluate arguments that are not needed (it short-circuits)
If you do need to support falsy values, then you can use get_or_run(test, 'store', run) where:
def get_or_run(d, k, f):
    sentinel = object()  # guaranteed not to be in d
    v = d.get(k, sentinel)
    return f() if v is sentinel else v
See the discussion in the answers and comments of dict.get() method returns a pointer. You have to break it into two steps.
Your options are:
Use a defaultdict with the callable if you always want that value as the default, and want to store it in the dict.
Use a conditional expression:
item = test['store'] if 'store' in test else run()
Use try / except:
try:
    item = test['store']
except KeyError:
    item = run()
Use get:
item = test.get('store')
if item is None:
    item = run()
And variations on those themes.
glglgl shows a way to subclass defaultdict; you can also just subclass dict for some situations:
def run():
    print "RUNNING"
    return 1

class dict_nokeyerror(dict):
    def __missing__(self, key):
        return run()
test = dict_nokeyerror()
print test['a']
# RUNNING
# 1
Subclassing only really makes sense if you always want the dict to have some nonstandard behavior; if you generally want it to behave like a normal dict and just want a lazy get in one place, use one of my methods 2-4.
I suppose you want to have the callable applied only if the key does not exist.
There are several approaches to do so.
One would be to use a defaultdict, which calls run() if key is missing.
from collections import defaultdict
def run():
    print "RUNNING"
test = defaultdict(run, store=1) # provides a value for store
test['store'] # gets 1
test['runthatstuff'] # gets None
Another, rather ugly one, would be to save only callables in the dict, which return the appropriate value:
test = {'store': lambda:1}
test.get('store', run)() # -> 1
test.get('runrun', run)() # -> None, prints "RUNNING".
If you want to have the return value depend on the missing key, you have to subclass defaultdict:
class mydefaultdict(defaultdict):
    def __missing__(self, key):
        val = self[key] = self.default_factory(key)
        return val

d = mydefaultdict(lambda k: k*k)
d[10] # yields 100

@mydefaultdict # decorators are fine
def d2(key):
    return -key

d2[5] # yields -5
And if you do not want this value added to the dict for the next call, you can use

    def __missing__(self, key):
        return self.default_factory(key)

instead, which calls the default factory every time a key that was not explicitly added is looked up.
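A sketch of that non-caching variant (the class name is mine), to make the difference concrete:

from collections import defaultdict

class mydefaultdict2(defaultdict):
    def __missing__(self, key):
        # compute from the key, but never store the result
        return self.default_factory(key)

d = mydefaultdict2(lambda k: k*k)
d[10]   # yields 100, recomputed on every access
10 in d # False: nothing was stored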
If you only know what the callable is likely to be at the get call site, you could subclass dict, something like this:
class MyDict(dict):
    def get_callable(self, key, func, *args, **kwargs):
        '''Like ordinary get but uses a callable to
        generate the default value'''
        if key not in self:
            val = func(*args, **kwargs)
        else:
            val = self[key]
        return val
This can then be used like so:
>>> d = MyDict()
>>> d.get_callable(1,complex,2,3)
(2+3j)
>>> d[1] = 2
>>> d.get_callable(1,complex,2,3)
2
>>> def run(): print "run"
>>> repr(d.get_callable(1,run))
'2'
>>> repr(d.get_callable(2,run))
run
'None'
This is probably most useful when the callable is expensive to compute.
I have a util directory in my project with qt.py, general.py, geom.py, etc. In general.py I have a bunch of python tools like the one you need:
# Use whenever you need a lambda default
def dictGet(dict_, key, default):
    if key not in dict_:
        return default()
    return dict_[key]
Add *args, **kwargs if you want to support calling default more than once with differing args:
def dictGet(dict_, key, default, *args, **kwargs):
    if key not in dict_:
        return default(*args, **kwargs)
    return dict_[key]
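For example, with the extended signature the callable and its arguments are used only on a miss:

d = {}
dictGet(d, 1, complex, 2, 3) # -> (2+3j); complex(2, 3) is called because 1 is missing
d[1] = 7
dictGet(d, 1, complex, 2, 3) # -> 7; complex is never called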
Here's what I use:
def lazy_get(d, k, f):
    return d[k] if k in d else f(k)
The fallback function f takes the key as an argument, e.g.
lazy_get({'a': 13}, 'a', lambda k: k) # --> 13
lazy_get({'a': 13}, 'b', lambda k: k) # --> 'b'
You would obviously use a more meaningful fallback function, but this illustrates the flexibility of lazy_get.
Here's what the function looks like with type annotation:
from typing import Callable, Mapping, TypeVar
K = TypeVar('K')
V = TypeVar('V')
def lazy_get(d: Mapping[K, V], k: K, f: Callable[[K], V]) -> V:
    return d[k] if k in d else f(k)
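As a slightly more concrete example, the fallback can stand in for any expensive computation derived from the key:

word_lengths = {'apple': 5}
lazy_get(word_lengths, 'banana', len) # -> 6, len is called only on the miss
lazy_get(word_lengths, 'apple', len)  # -> 5, len is never called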

Insert into dictionary or fail if key already present without hashing twice

Is there a way to either insert a new key into a dict or fail if that key already exists without hashing twice?
From something like this:
class MyClass:
    def __init__(self):
        pass

    def __hash__(self):
        print('MyClass.__hash__ called')
        return object.__hash__(self)

my_key = MyClass()
my_value = "my_string"
my_dict = {}

if my_key not in my_dict:
    my_dict[my_key] = my_value
else:
    raise ValueError
you can see that __hash__ is called twice and this code doesn't express the desired behavior of insertion or failure as an atomic operation.
Use the setdefault method of the dictionary:
if my_dict.setdefault(my_key, my_value) != my_value:
    raise ValueError
setdefault assigns the second argument to the key given by the first argument, but only if the key doesn't already exist in the dictionary. In any case, it returns the value that's in the dictionary afterwards (so either the original value, or the new default value if there was no old value).
My code checks the return value to see if the dictionary had a value other than my_value. It will fail to detect the same value being added twice under the same key. I don't think there's a way to handle that situation without hashing twice.
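If the values involved are distinct objects, one workaround is an identity check instead of an equality check; a sketch (it still cannot distinguish re-inserting the very same object, and interned values such as short strings may compare identical anyway):

stored = my_dict.setdefault(my_key, my_value)
if stored is not my_value:
    raise ValueError # some other object already lived under my_key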
my_dict.setdefault(my_key, my_value)
setdefault(key[, default])
If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.
Using the method __contains__(key):
my_dictionary = {"a":1, "b":2}
print(my_dictionary.__contains__('a'))
print(my_dictionary.__contains__('b'))
print(my_dictionary.__contains__('c'))
True
True
False
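Note that the idiomatic spelling is the in operator, which calls __contains__ under the hood:

my_dictionary = {"a":1, "b":2}
print('a' in my_dictionary) # True
print('c' in my_dictionary) # False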
What about storing the hashed key in a variable?
class MyClass:
    def __init__(self):
        pass

    def __hash__(self):
        print('MyClass.__hash__ called')
        return object.__hash__(self)

my_key = MyClass()
my_value = "my_string"
my_dict = {}

hashed_key = my_key.__hash__()
if hashed_key not in my_dict:
    my_dict[hashed_key] = my_value
else:
    raise ValueError
gives
MyClass.__hash__ called

Adding items to a list if it's not a function

I'm trying to write a function right now, and its purpose is to go through an object's __dict__ and add an item to a dictionary if the item is not a function.
Here is my code:
import inspect

def dict_into_list(self):
    result = {}
    for each_key, each_item in self.__dict__.items():
        if inspect.isfunction(each_key):
            continue
        else:
            result[each_key] = each_item
    return result
If I'm not mistaken, inspect.isfunction is supposed to recognize lambdas as functions as well, correct? However, if I write
c = some_object(3)
c.whatever = lambda x : x*3
then my function still includes the lambda. Can somebody explain why this is?
For example, if I have a class like this:
class WhateverObject:
    def __init__(self, value):
        self._value = value

    def blahblah(self):
        print('hello')

a = WhateverObject(5)
So if I say print(a.__dict__), it should give back {'_value': 5}
You are actually checking whether each_key is a function, which it most likely is not. You actually have to check the value, like this:

if inspect.isfunction(each_item):

You can confirm this by including a print, like this:
def dict_into_list(self):
    result = {}
    for each_key, each_item in self.__dict__.items():
        print(type(each_key), type(each_item))
        if not inspect.isfunction(each_item):
            result[each_key] = each_item
    return result
Also, you can write your code with a dictionary comprehension, like this:
def dict_into_list(self):
    return {key: value for key, value in self.__dict__.items()
            if not inspect.isfunction(value)}
I can think of an easy way to find the variables of an object using python's built-in dir and callable functions instead of the inspect module:
{var: getattr(self, var) for var in dir(self) if not callable(getattr(self, var))}
Please note that this indeed assumes that you have not overridden the __getattr__ method of the class to do something other than getting the attributes.

inserting into python dictionary

The default behavior for a python dictionary is to create a new key in the dictionary if that key does not already exist. For example:
d = {}
d['did not exist before'] = 'now it does'
this is all well and good for most purposes, but what if I'd like python to do nothing if the key isn't already in the dictionary? In my situation:
for x in exceptions:
    if masterlist.has_key(x):
        masterlist[x] = False
in other words, I don't want some incorrect elements in exceptions to corrupt my masterlist. Is this as simple as it gets? It FEELS like I should be able to do this in one line inside the for loop (i.e., without explicitly checking that x is a key of masterlist).
UPDATE:
To me, my question is asking about the lack of a parallel between a list and a dict. For example:
l = []
l[0] = 2 #fails
l.append(2) #works
with the subclassing answer, you could modify the dictionary (maybe "safe_dict" or "explicit_dict") to do something similar:
d = {}
d['a'] = '1' #would fail in my world
d.insert('a','1') #what my world is missing
You could use .update:
masterlist.update((x, False) for x in exceptions if masterlist.has_key(x))
You can inherit from dict and override its __setitem__ to check for existence of the key (or do the same with monkey-patching only one instance).
Sample class:
class a(dict):
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        dict.__setitem__(self, 'a', 'b')

    def __setitem__(self, key, value):
        if self.has_key(key):
            dict.__setitem__(self, key, value)

a = a()
print a['a'] # prints 'b'
a['c'] = 'd'
# print a['c'] - would fail
a['a'] = 'e'
print a['a'] # prints 'e'
You could also add some helper function to set values without checking for existence, as sketched below. However, I thought it would be shorter... Don't use it unless you need it in many places.
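A sketch of such a helper (the name is mine), which bypasses the guarded __setitem__ by going through the base class:

def force_set(d, key, value):
    # write unconditionally, even on instances of the guarded subclass
    dict.__setitem__(d, key, value)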
You can also use in instead of has_key, which is a little nicer.
for x in exceptions:
    if x in masterlist:
        masterlist[x] = False
But I don't see the issue with having an if statement for this purpose.
For long lists, use the & operator on set()s, wrapping the intersection in parentheses:
for x in (set(exceptions) & set(masterlist)):
    masterlist[x] = False
    # or masterlist[x] = exceptions[x]
It improves readability and the iteration at the same time, since masterlist's keys are read only once.

Python hashable dicts

As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is that I'd like to have a better idea about how hygienic macros would work in an algol-like language (as opposed to the syntax-free lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid, unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine.)
The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
I guess it has to be tuples all the way down. Now, the python standard library provides approximately what I'd need: collections.namedtuple has a very different syntax, but can be used as a key. Continuing from the above session:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
tl;dr: How do I get dicts that can be used as keys to other dicts?
Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit of extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value), but we're all adults here.
class hashdict(dict):
    """
    hashable dict implementation, suitable for use as a key into
    other dicts.

    >>> h1 = hashdict({"apples": 1, "bananas":2})
    >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
    >>> h1+h2
    hashdict(apples=1, bananas=3, mangoes=5)
    >>> d1 = {}
    >>> d1[h1] = "salad"
    >>> d1[h1]
    'salad'
    >>> d1[h2]
    Traceback (most recent call last):
    ...
    KeyError: hashdict(bananas=3, mangoes=5)

    based on answers from
    http://stackoverflow.com/questions/1151658/python-hashable-dicts
    """
    def __key(self):
        return tuple(sorted(self.items()))

    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__,
            ", ".join("{0}={1}".format(
                str(i[0]), repr(i[1])) for i in self.__key()))

    def __hash__(self):
        return hash(self.__key())

    def __setitem__(self, key, value):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def __delitem__(self, key):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def clear(self):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def pop(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def popitem(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def setdefault(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    def update(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))

    # update is not ok because it mutates the object
    # __add__ is ok because it creates a new object
    # while the new object is under construction, it's ok to mutate it
    def __add__(self, right):
        result = hashdict(self)
        dict.update(result, right)
        return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.
class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
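With that, the cache example from the question works; a quick check:

cache = {}
rule = hashabledict({"foo":"bar"})
cache[(rule, "baz")] = "quux" # no TypeError anymore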
Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:
class hashabledict(dict):
    def __key(self):
        return tuple((k, self[k]) for k in sorted(self))

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        return self.__key() == other.__key()
If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes hundredfolds -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))
Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:
class Hashabledict(dict):
    def __hash__(self):
        return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues accesses the values directly and avoids the many memory allocator calls made by items, which forms new key/value tuples in memory every time you do a lookup.
The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
A reasonably clean, straightforward implementation is
import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        return hash(tuple(sorted(self._d.iteritems())))
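Usage is then what you would expect; a small sketch:

fd = FrozenDict(foo='bar')
cache = {(fd, 'baz'): 'quux'} # hashable, so usable as part of a key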
I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; there's virtually no escape from the problem that dicts are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. Although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.
What if you jam key, value pairs into a frozenset? What would that require, how would it work?
Part 1, you need a way of encoding the 'item's in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
    def __hash__(self):
        return hash((self.key, None))

    def __eq__(self, other):
        if type(self) != type(other):
            return NotImplemented
        return self.key == other.key

    def __repr__(self):
        return repr((self.key, self.value))
That alone puts you in spitting distance of an immutable mapping:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object, which one ends up in the return value is undefined; we'll have to go through some more gyrations.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:
class Thief(object):
    def __init__(self, key):
        self.key = key

    def __hash__(self):
        return hash(pair(self.key, None))

    def __eq__(self, other):
        self.value = other.value
        return pair(self.key, None) == other
Which results in the less problematic:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:
class FrozenDict(frozenset, collections.Mapping):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (pair(k, v) for k, v in seq))

    def __getitem__(self, key):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        raise KeyError(key)

    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return dict(self.iteritems()) == other
        if len(self) != len(other):
            return False
        for key, value in self.iteritems():
            try:
                if value != other[key]:
                    return False
            except KeyError:
                return False
        return True

    def __hash__(self):
        return hash(frozenset(self.iteritems()))

    def get(self, key, default=None):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        return default

    def __iter__(self):
        for item in frozenset.__iter__(self):
            yield item.key

    def iteritems(self):
        for item in frozenset.__iter__(self):
            yield (item.key, item.value)

    def iterkeys(self):
        for item in frozenset.__iter__(self):
            yield item.key

    def itervalues(self):
        for item in frozenset.__iter__(self):
            yield item.value

    def __contains__(self, key):
        return frozenset.__contains__(self, pair(key, None))

    has_key = __contains__

    def __repr__(self):
        return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')

    @classmethod
    def fromkeys(cls, keys, value=None):
        return cls((key, value) for key in keys)
which, ultimately, does answer my own question:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
The accepted answer by @Unknown, as well as the answer by @AlexMartelli, work perfectly fine, but only under the following constraints:
1. The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
2. Keys must support comparison. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
3. The comparison operator on keys imposes total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Sorting the items of a dictionary with such keys can therefore result in an arbitrary order, violating the rule that equal objects must have the same hash value.
The much faster answer by @ObenSonne lifts constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).
The faster yet answer by @RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if most of the (non-equal) dictionaries that need to be hashed do not have identical .keys().
If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.
I'd much appreciate any feedback on this, since I only tested this lightly so far.
# python 3.4
import collections
import operator
import sys
import itertools
import reprlib

# a wrapper to make an object hashable, while preserving equality
class AutoHash:
    # for each known container type, we can optionally provide a tuple
    # specifying: type, transform, aggregator
    # even immutable types need to be included, since their items
    # may make them unhashable

    # transformation may be used to enforce the desired iteration
    # the result of a transformation must be an iterable
    # default: no change; for dictionaries, we use .items() to see values

    # usually transformation choice only affects efficiency, not correctness

    # aggregator is the function that combines all items into one object
    # default: frozenset; for ordered containers, we can use tuple
    # aggregator choice affects both efficiency and correctness
    # e.g., using a tuple aggregator for a set is incorrect,
    # since identical sets may end up with different hash values
    # frozenset is safe since at worst it just causes more collisions
    # unfortunately, no collections.ABC class is available that helps
    # distinguish ordered from unordered containers
    # so we need to just list them out manually as needed

    type_info = collections.namedtuple(
        'type_info',
        'type transformation aggregator')

    ident = lambda x: x
    # order matters; first match is used to handle a datatype
    known_types = (
        # dict also handles defaultdict
        type_info(dict, lambda d: d.items(), frozenset),
        # no need to include set and frozenset, since they are fine with defaults
        type_info(collections.OrderedDict, ident, tuple),
        type_info(list, ident, tuple),
        type_info(tuple, ident, tuple),
        type_info(collections.deque, ident, tuple),
        type_info(collections.Iterable, ident, frozenset)  # other iterables
    )

    # hash_func can be set to replace the built-in hash function
    # cache can be turned on; if it is, cycles will be detected,
    # otherwise cycles in a data structure will cause failure
    def __init__(self, data, hash_func=hash, cache=False, verbose=False):
        self._data = data
        self.hash_func = hash_func
        self.verbose = verbose
        self.cache = cache
        # cache objects' hashes for performance and to deal with cycles
        if self.cache:
            self.seen = {}

    def hash_ex(self, o):
        # note: isinstance(o, Hashable) won't check inner types
        try:
            if self.verbose:
                print(type(o),
                      reprlib.repr(o),
                      self.hash_func(o),
                      file=sys.stderr)
            return self.hash_func(o)
        except TypeError:
            pass

        # we let built-in hash decide if the hash value is worth caching
        # so we don't cache the built-in hash results
        if self.cache and id(o) in self.seen:
            return self.seen[id(o)][0]  # found in cache

        # check if o can be handled by decomposing it into components
        for typ, transformation, aggregator in AutoHash.known_types:
            if isinstance(o, typ):
                # another option is:
                # result = reduce(operator.xor, map(_hash_ex, handler(o)))
                # but collisions are more likely with xor than with frozenset
                # e.g. hash_ex([1,2,3,4])==0 with xor
                try:
                    # try to frozenset the actual components, it's faster
                    h = self.hash_func(aggregator(transformation(o)))
                except TypeError:
                    # components not hashable with built-in;
                    # apply our extended hash function to them
                    h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                if self.cache:
                    # storing the object too, otherwise memory location will be reused
                    self.seen[id(o)] = (h, o)
                if self.verbose:
                    print(type(o), reprlib.repr(o), h, file=sys.stderr)
                return h

        raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

    def __hash__(self):
        return self.hash_ex(self._data)

    def __eq__(self, other):
        # short circuit to save time
        if self is other:
            return True

        # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
        # 2) any other situation => lhs.__eq__ will be called first

        # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
        # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
        # case 2. neither side is a subclass of the other; self is lhs
        # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
        # case 3. neither side is a subclass of the other; self is rhs
        # => we can't compare to another type, and the other side already tried and failed;
        # we should return False, but NotImplemented will have the same effect
        # any other case: we won't reach the __eq__ code in this class, no need to worry about it
        if isinstance(self, type(other)):  # identifies case 1
            return self._data == other._data
        else:  # identifies cases 2 and 3
            return NotImplemented

d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))

d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'), cache=True, verbose=True)
print(hash(d))
You might also want to add these two methods to get the v2 pickling protocol to work with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.
def __setstate__(self, objstate):
    for k, v in objstate.items():
        dict.__setitem__(self, k, v)

def __reduce__(self):
    return (hashdict, (), dict(self),)
serialize the dict as a string with the json package:

import json

d = {'a': 1, 'b': 2}
s = json.dumps(d)
restore the dict when you need it:
d2 = json.loads(s)
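If the string is meant to serve as a dictionary key, it is worth passing sort_keys=True so that equal dicts serialize identically regardless of insertion order:

s = json.dumps({'b': 2, 'a': 1}, sort_keys=True) # '{"a": 1, "b": 2}'
cache = {s: 'value'}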
If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:
cache[id(rule)] = "whatever"
since id() is unique for every dictionary
EDIT:
Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like
cache[ 'foo:bar' ] = 'baz'
If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like
cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )
I guess the advantage of this is that you wouldn't have to write as much code.
