How to determine that a named tuple is a namedtuple object? [duplicate] - python

How do I check if an object is an instance of a Named tuple?

Calling the function collections.namedtuple gives you a new type that's a subclass of tuple (and no other classes) with a member named _fields that's a tuple whose items are all strings. So you could check for each and every one of these things:
def isnamedtupleinstance(x):
    t = type(x)
    b = t.__bases__
    if len(b) != 1 or b[0] != tuple: return False
    f = getattr(t, '_fields', None)
    if not isinstance(f, tuple): return False
    return all(type(n)==str for n in f)
It IS possible to get a false positive from this, but only if somebody's going out of their way to make a type that looks a lot like a named tuple but isn't one ;-).
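A quick sanity check of the function above, as a sketch assuming Python 3 (Point and Point3D are made-up names for illustration). It also shows a limitation: a class that subclasses a namedtuple has that namedtuple, not tuple, as its direct base, so the strict __bases__ test rejects it.

```python
from collections import namedtuple

def isnamedtupleinstance(x):
    t = type(x)
    b = t.__bases__
    if len(b) != 1 or b[0] is not tuple:
        return False
    f = getattr(t, '_fields', None)
    if not isinstance(f, tuple):
        return False
    return all(type(n) == str for n in f)

Point = namedtuple('Point', 'x y')  # hypothetical example type

class Point3D(Point):  # subclass of a namedtuple; its direct base is Point
    pass

print(isnamedtupleinstance(Point(1, 2)))    # True
print(isnamedtupleinstance((1, 2)))         # False
print(isnamedtupleinstance(Point3D(1, 2)))  # False
```

If subclasses should count too, a looser check based on isinstance(x, tuple) plus hasattr(x, '_fields') may be a better fit.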

If you want to determine whether an object is an instance of a specific namedtuple, you can do this:
from collections import namedtuple
SomeThing = namedtuple('SomeThing', 'prop another_prop')
SomeOtherThing = namedtuple('SomeOtherThing', 'prop still_another_prop')
a = SomeThing(1, 2)
isinstance(a, SomeThing) # True
isinstance(a, SomeOtherThing) # False

3.7+
def isinstance_namedtuple(obj) -> bool:
    return (
        isinstance(obj, tuple) and
        hasattr(obj, '_asdict') and
        hasattr(obj, '_fields')
    )

If you need to check before calling namedtuple-specific functions on it, then just call them and catch the exception instead. That's the preferred way to do it in Python.
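A minimal sketch of that "just call it and catch the exception" style. The helper name fields_as_dict is an assumption made up for this example; _asdict() is the real namedtuple method.

```python
from collections import namedtuple

def fields_as_dict(obj):
    """Return obj's fields as a dict, or None if obj has no _asdict()."""
    try:
        return dict(obj._asdict())
    except AttributeError:
        # Plain tuples (and most other objects) land here.
        return None

Point = namedtuple('Point', 'x y')  # hypothetical example type
print(fields_as_dict(Point(1, 2)))  # {'x': 1, 'y': 2}
print(fields_as_dict((1, 2)))       # None
```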

Improving on what Lutz posted:
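import collections.abc
def isinstance_namedtuple(x):
    # collections.Mapping was removed in Python 3.10; use collections.abc.Mapping.
    # Caveat: this relies on instances exposing a __dict__ mapping, which
    # namedtuple instances only did on older Python versions.
    return (isinstance(x, tuple) and
            isinstance(getattr(x, '__dict__', None), collections.abc.Mapping) and
            getattr(x, '_fields', None) is not None)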

I use
isinstance(x, tuple) and isinstance(x.__dict__, collections.abc.Mapping)
which to me appears to best reflect the dictionary aspect of the nature of named tuples.
It appears robust against some conceivable future changes too and might also work with many third-party namedtuple-ish classes, if such things happen to exist.

IMO this might be the best solution for Python 3.6 and later.
You can set a custom __module__ when you instantiate your namedtuple, and check for it later
from collections import namedtuple
# module parameter added in python 3.6
namespace = namedtuple("namespace", "foo bar", module=__name__ + ".namespace")
then check for __module__
if getattr(x, "__module__", None) == "xxxx.namespace":
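Put together, the tagging idea might look like the sketch below. TAG, Namespace, Other and is_tagged are hypothetical names, and the marker string is arbitrary. One possible downside, stated as an assumption: because pickle locates classes by module name, a made-up module string may interfere with pickling.

```python
from collections import namedtuple

TAG = "myapp.namespace"  # hypothetical marker; any unique string works

# The module keyword argument is available from Python 3.6 on.
Namespace = namedtuple("Namespace", "foo bar", module=TAG)
Other = namedtuple("Other", "foo bar")  # not tagged

def is_tagged(x):
    # Read __module__ from the class so that plain tuples are handled too.
    return isinstance(x, tuple) and getattr(type(x), "__module__", None) == TAG

print(is_tagged(Namespace(1, 2)))  # True
print(is_tagged(Other(1, 2)))      # False
print(is_tagged((1, 2)))           # False
```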

Related

__iter__: int and str vs list and tuple

some_obj = "scalar"
list_like = "__iter__" in dir(some_obj) # Py2: False; Py3: True
I used it in python 2 to distinguish between "non-iterables" (str, int, bool, None) and iterables (list, dict, tuples).
This does not work with python3 anymore, since str has now the __iter__ attribute (Why do strings in python 2.7 not have the "__iter__" attribute, but strings in python 3.7 have the "__iter__" attribute).
Well, often it is desirable to regard str not as list-like. So is there a better py2+py3 way than "__iter__" in dir(some_obj) and not type(some_obj)==str or all the case checks in this question?
Do I miss other objects that are disputable like str?
I'm not sure if it's good to use __iter__ to check the type, there is a better choice for this one, the Iterable type.
It's your own opinion to divide the groups, so I think the easiest way is to set blacklists...
try:
    from collections.abc import Iterable # py3
except ImportError:
    from collections import Iterable #py2
def check(arg):
    if not isinstance(arg, Iterable):
        return False
    elif isinstance(arg, (str, bytes)):
        return False
    else:
        return True
Edited:
To not trigger confusion with my answer I am quoting the docs here.
class collections.abc.Iterable
ABC for classes that provide the __iter__() method.
Checking isinstance(obj, Iterable) detects classes that are registered
as Iterable or that have an __iter__() method, but it does not detect
classes that iterate with the __getitem__() method. The only reliable
way to determine whether an object is iterable is to call iter(obj).
This works in 2 and 3.
str is iterable, of course.
n1 = 1
s1 = 'abc'
objs = [n1, s1]
for o in objs:
    try:
        iter(o)
    except TypeError:
        print(o, 'is not Iterable!')
    else:
        print(o, 'is Iterable!')
Output:
1 is not Iterable!
abc is Iterable!
The quote was taken from here
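Combining the two ideas (call iter(), but exclude strings) could look like the following Python 3 sketch. The names is_listlike and Legacy are made up for this example. Legacy shows the case from the docs quote: a class that iterates only through __getitem__, which isinstance(obj, Iterable) misses but iter(obj) accepts.

```python
from collections.abc import Iterable

def is_listlike(obj):
    # Treat text and byte strings as scalars, everything else by iter().
    if isinstance(obj, (str, bytes)):
        return False
    try:
        iter(obj)
    except TypeError:
        return False
    return True

class Legacy:
    """Iterates via the old __getitem__ protocol only (hypothetical example)."""
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return index * 10

print(is_listlike([1, 2]))               # True
print(is_listlike("abc"))                # False
print(is_listlike(1))                    # False
print(is_listlike(Legacy()))             # True
print(isinstance(Legacy(), Iterable))    # False
```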

building a python function from an arbitrarily long list of parameters

I am trying to construct a python lambda function from either a single parameter or a list, and am unsure what syntax to use to build the lambda:
def check_classes_filter(*class):
return lambda x: isinstance(x, class) and isinstance(x, class[1]...)
The lambda should check if x is an instance of any number of classes that is passed to the function (either one or many).
Is there a general way to build functions from an arbitrary number of parameters in python, maybe as a kind of comprehension?
You could loop over your classes args using all.
def check_classes_filter(*classes):
    return lambda x: all(isinstance(x, c) for c in classes)
>>> fun = check_classes_filter(str)
>>> fun('hello')
True
>>> fun = check_classes_filter(int, str)
>>> fun('hello')
False
Although I'd prefer just passing in a list of classes
def check_classes_filter(classes):
    return lambda x: all(isinstance(x, c) for c in classes)
and calling the function as
check_classes_filter([int, str])
If you want to check that an object is an instance of any listed class, you don't need to define anything:
>>> isinstance('test', int)
False
>>> isinstance('test', str)
True
>>> isinstance('test', (int, str))
True
From isinstance documentation :
If classinfo is a tuple of type objects (or recursively, other such
tuples), return true if object is an instance of any of the types.
In your question, you mention "if x is an instance of any number of classes that is passed to the function (either one or many)", which would mean that you should use or instead of and in your boolean logic. If you want to check that an object is an instance of every listed class, see the answer with all.
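For the "any of the classes" reading, a filter factory can lean on the tuple form of isinstance directly. A small sketch; check_any_class_filter and is_int_or_str are hypothetical names chosen to avoid clashing with the all-based version:

```python
def check_any_class_filter(*classes):
    # *classes arrives as a tuple, which isinstance accepts as-is
    # and treats as "instance of any of these".
    return lambda x: isinstance(x, classes)

is_int_or_str = check_any_class_filter(int, str)
print(is_int_or_str('hello'))  # True
print(is_int_or_str(3.5))      # False
```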

Overriding __eq__ and __hash__ to compare a dict attribute of two instances

I'm struggling to understand how to correctly compare objects based on an underlying dict attribute that each instance possesses.
Since I'm overriding __eq__, do I need to override __hash__ as well? I haven't a firm grasp on when/where to do so and could really use some help.
I created a simple example below to illustrate the maximum recursion exception that I've run into. A RegionalCustomerCollection organizes account IDs by geographical region. RegionalCustomerCollection objects are said to be equal if the regions and their respective accountids are. Essentially, all items() should be equal in content.
from collections import defaultdict
class RegionalCustomerCollection(object):
    def __init__(self):
        self.region_accountids = defaultdict(set)
    def get_region_accountid(self, region_name=None):
        return self.region_accountids.get(region_name, None)
    def set_region_accountid(self, region_name, accountid):
        self.region_accountids[region_name].add(accountid)
    def __eq__(self, other):
        if (other == self):
            return True
        if isinstance(other, RegionalCustomerCollection):
            return self.region_accountids == other.region_accountids
        return False
    def __repr__(self):
        return ', '.join(["{0}: {1}".format(region, acctids)
                          for region, acctids
                          in self.region_accountids.items()])
Let's create two object instances and populate them with some sample data:
>>> a = RegionalCustomerCollection()
>>> b = RegionalCustomerCollection()
>>> a.set_region_accountid('northeast',1)
>>> a.set_region_accountid('northeast',2)
>>> a.set_region_accountid('northeast',3)
>>> a.set_region_accountid('southwest',4)
>>> a.set_region_accountid('southwest',5)
>>> b.set_region_accountid('northeast',1)
>>> b.set_region_accountid('northeast',2)
>>> b.set_region_accountid('northeast',3)
>>> b.set_region_accountid('southwest',4)
>>> b.set_region_accountid('southwest',5)
Now let's try to compare the two instances and generate the recursion exception:
>>> a == b
...
RuntimeError: maximum recursion depth exceeded while calling a Python object
Your object shouldn't return a hash because it's mutable. If you put this object into a dictionary or set and then change it afterward, you may never be able to find it again.
In order to make an object unhashable, you need to do the following:
class MyClass(object):
    __hash__ = None
This will ensure that the object is unhashable.
>>> m = MyClass()
>>> hash(m)
TypeError: unhashable type: 'MyClass'
Does this answer your question? I'm suspecting not because you were explicitly looking for a hash function.
As far as the RuntimeError you're receiving, it's because of the following line:
if self == other:
    return True
That gets you into an infinite recursion loop. Try the following instead:
if self is other:
    return True
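You don't need to override __hash__ to compare two objects (you'll need to if you want custom hashing, i.e. to improve performance when inserting into sets or dictionaries). Note, however, that in Python 3 a class that defines __eq__ without defining __hash__ gets its __hash__ set to None, so its instances become unhashable.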
Also, you have infinite recursion here:
def __eq__(self, other):
    if (other == self):
        return True
    if isinstance(other, RegionalCustomerCollection):
        return self.region_accountids == other.region_accountids
    return False
If both objects are of type RegionalCustomerCollection then you'll have infinite recursion since == calls __eq__.
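Putting both answers together, a trimmed-down version of the class might look like this sketch (Python 3 assumed). Two choices here are mine rather than the answers': returning NotImplemented for foreign types, and spelling out __hash__ = None to make the unhashability explicit.

```python
from collections import defaultdict

class RegionalCustomerCollection:
    def __init__(self):
        self.region_accountids = defaultdict(set)

    def set_region_accountid(self, region_name, accountid):
        self.region_accountids[region_name].add(accountid)

    def __eq__(self, other):
        if self is other:  # identity test, does not call __eq__ again
            return True
        if isinstance(other, RegionalCustomerCollection):
            return self.region_accountids == other.region_accountids
        return NotImplemented  # let Python try the other operand

    __hash__ = None  # mutable object: explicitly unhashable

a = RegionalCustomerCollection()
b = RegionalCustomerCollection()
for coll in (a, b):
    coll.set_region_accountid('northeast', 1)
    coll.set_region_accountid('southwest', 4)
print(a == b)  # True
b.set_region_accountid('southwest', 5)
print(a == b)  # False
```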

how to tell a variable is iterable but not a string

I have a function that take an argument which can be either a single item or a double item:
def iterable(arg):
    if #arg is an iterable:
        print "yes"
    else:
        print "no"
so that:
>>> iterable( ("f","f") )
yes
>>> iterable( ["f","f"] )
yes
>>> iterable("ff")
no
The problem is that string is technically iterable, so I can't just catch the ValueError when trying arg[1]. I don't want to use isinstance(), because that's not good practice (or so I'm told).
Use isinstance (I don't see why it's bad practice)
import types
if not isinstance(arg, types.StringTypes):
Note the use of StringTypes. It ensures that we don't forget about some obscure type of string.
On the upside, this also works for derived string classes.
class MyString(str):
    pass
isinstance(MyString(" "), types.StringTypes) # true
Also, you might want to have a look at this previous question.
Cheers.
NB: behavior changed in Python 3 as StringTypes and basestring are no longer defined. Depending on your needs, you can replace them in isinstance by str, or a subset tuple of (str, bytes, unicode), e.g. for Cython users.
As @Theron Luhn mentioned, you can also use six.
As of 2017, here is a portable solution that works with all versions of Python:
#!/usr/bin/env python
import six
try:
    from collections.abc import Iterable  # Python 3.3+
except ImportError:
    from collections import Iterable  # Python 2
def iterable(arg):
    return (
        isinstance(arg, Iterable)
        and not isinstance(arg, six.string_types)
    )
# non-string iterables
assert iterable(("f", "f")) # tuple
assert iterable(["f", "f"]) # list
assert iterable(iter("ff")) # iterator
assert iterable(range(44)) # range object (a list on Python 2)
assert iterable(b"ff") # bytes (on Python 2 this is a str, so this assertion only holds on Python 3)
# strings or non-iterables
assert not iterable(u"ff") # string
assert not iterable(44) # integer
assert not iterable(iterable) # function
Since Python 2.6, with the introduction of abstract base classes, isinstance (used on ABCs, not concrete classes) is now considered perfectly acceptable. Specifically:
from abc import ABCMeta, abstractmethod
class NonStringIterable:
    __metaclass__ = ABCMeta
    @abstractmethod
    def __iter__(self):
        while False:
            yield None
    @classmethod
    def __subclasshook__(cls, C):
        if cls is NonStringIterable:
            if any("__iter__" in B.__dict__ for B in C.__mro__):
                return True
        return NotImplemented
This is an exact copy (changing only the class name) of Iterable as defined in _abcoll.py (an implementation detail of collections.py)... the reason this works as you wish, while collections.Iterable doesn't, is that the latter goes the extra mile to ensure strings are considered iterable, by calling Iterable.register(str) explicitly just after this class statement.
Of course it's easy to augment __subclasshook__ by returning False before the any call for other classes you want to specifically exclude from your definition.
In any case, after you have imported this new module as myiter, isinstance('ciao', myiter.NonStringIterable) will be False, and isinstance([1,2,3], myiter.NonStringIterable) will be True, just as you request -- and in Python 2.6 and later this is considered the proper way to embody such checks... define an abstract base class and check isinstance on it.
By combining previous replies, I'm using:
import types
import collections
#[...]
if isinstance(var, types.StringTypes ) \
    or not isinstance(var, collections.Iterable):
    #[Do stuff...]
Not 100% foolproof, but if an object is not an iterable you still can let it pass and fall back to duck typing.
Edit: Python3
types.StringTypes == (str, unicode). The Python 3 equivalent is:
if isinstance(var, str ) \
    or not isinstance(var, collections.Iterable):
Edit: Python3.3
From Python 3.3 on, Iterable lives in collections.abc:
if isinstance(var, str ) \
    or not isinstance(var, collections.abc.Iterable):
I realise this is an old post but thought it was worth adding my approach for Internet posterity. The function below seems to work for me under most circumstances with both Python 2 and 3:
def is_collection(obj):
    """ Returns true for any iterable which is not a string or byte sequence.
    """
    try:
        if isinstance(obj, unicode):
            return False
    except NameError:
        pass
    if isinstance(obj, bytes):
        return False
    try:
        iter(obj)
    except TypeError:
        return False
    try:
        hasattr(None, obj)
    except TypeError:
        return True
    return False
This checks for a non-string iterable by (mis)using the built-in hasattr which will raise a TypeError when its second argument is not a string or unicode string.
2.x
I would have suggested:
hasattr(x, '__iter__')
or in view of David Charles' comment tweaking this for Python3, what about:
hasattr(x, '__iter__') and not isinstance(x, (str, bytes))
3.x
the builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have functionality enough in common to warrant a shared base class.
To explicitly expand on Alex Martelli's excellent hack of collections.py and address some of the questions around it: The current working solution in python 3.6+ is
import collections
import _collections_abc as cabc
import abc
class NonStringIterable(metaclass=abc.ABCMeta):
    __slots__ = ()
    @abc.abstractmethod
    def __iter__(self):
        while False:
            yield None
    @classmethod
    def __subclasshook__(cls, c):
        if cls is NonStringIterable:
            if issubclass(c, str):
                return False
            return cabc._check_methods(c, "__iter__")
        return NotImplemented
and demonstrated
>>> typs = ['string', iter(''), list(), dict(), tuple(), set()]
>>> [isinstance(o, NonStringIterable) for o in typs]
[False, True, True, True, True, True]
If you want to add iter('') into the exclusions, for example, modify the line
if issubclass(c, str):
    return False
to be
# `str_iterator` is just a shortcut for `type(iter(''))`*
if issubclass(c, (str, cabc.str_iterator)):
    return False
to get
[False, False, True, True, True, True]
If you'd like to test whether a variable is an iterable object and not a "string-like" object (str, bytes, ...), you can use the fact that the __mod__() function exists in such "string-like" objects for formatting purposes. So you can do a check like this:
>>> def is_not_iterable(item):
...     return hasattr(item, '__trunc__') or hasattr(item, '__mod__')
>>> is_not_iterable('')
True
>>> is_not_iterable(b'')
True
>>> is_not_iterable(())
False
>>> is_not_iterable([])
False
>>> is_not_iterable(1)
True
>>> is_not_iterable({})
False
>>> is_not_iterable(set())
False
>>> is_not_iterable(range(19)) #considers also Generators or Iterators
False
As you point out correctly, a single string is a character sequence.
So the thing you really want to do is to find out what kind of sequence arg is by using isinstance or type(a)==str.
If you want to realize a function that takes a variable amount of parameters, you should do it like this:
def function(*args):
    # args is a tuple
    for arg in args:
        do_something(arg)
function("ff") and function("ff", "ff") will work.
I can't see a scenario where an isiterable() function like yours is needed. It isn't isinstance() that is bad style but situations where you need to use isinstance().
Adding another answer here that doesn't require extra imports and is maybe more "pythonic", relying on duck typing and the fact that str has had a Unicode casefold method since Python 3.3.
def iterable_not_string(x):
    '''
    Check if input has an __iter__ method and then determine if it's a
    string by checking for a casefold method.
    '''
    try:
        assert x.__iter__
        try:
            assert x.casefold
            # could do the following instead for python 2.7 because
            # str and unicode types both had a splitlines method
            # assert x.splitlines
            return False
        except AttributeError:
            return True
    except AttributeError:
        return False
Python 3.X
Notes:
You need to implement an "isListable" method.
In my case dict is not iterable because iter(obj_dict) returns an iterator of just the keys.
Sequences are iterables, but not all iterables are sequences (immutable, mutable).
set, dict are iterables but not sequence.
list is iterable and sequence.
str is an iterable and immutable sequence.
Sources:
https://docs.python.org/3/library/stdtypes.html
https://opensource.com/article/18/3/loop-better-deeper-look-iteration-python
See this example:
from typing import Iterable, Sequence, MutableSequence, Mapping, Text
class Custom():
    pass
def isListable(obj):
    if(isinstance(obj, type)): return isListable(obj.__new__(obj))
    return isinstance(obj, MutableSequence)
try:
    # Listable
    #o = [Custom()]
    #o = ["a","b"]
    #o = [{"a":"va"},{"b":"vb"}]
    #o = list # class type
    # Not listable
    #o = {"a" : "Value"}
    o = "Only string"
    #o = 1
    #o = False
    #o = 2.4
    #o = None
    #o = Custom()
    #o = {1, 2, 3} #type set
    #o = (n**2 for n in {1, 2, 3})
    #o = bytes("Only string", 'utf-8')
    #o = Custom # class type
    if isListable(o):
        print("Is Listable[%s]: %s" % (o.__class__, str(o)))
    else:
        print("Not Listable[%s]: %s" % (o.__class__, str(o)))
except Exception as exc:
    raise exc
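The distinctions listed in the notes (iterable vs. sequence, mutable vs. immutable) can also be checked with the ABCs in collections.abc. This is only a sketch; classify is a made-up helper and the labels are arbitrary:

```python
from collections.abc import Iterable, Sequence, MutableSequence

def classify(obj):
    # Most specific ABC first: every MutableSequence is also a Sequence,
    # and every Sequence is also an Iterable.
    if isinstance(obj, MutableSequence):
        return "mutable sequence"
    if isinstance(obj, Sequence):
        return "immutable sequence"
    if isinstance(obj, Iterable):
        return "iterable, not a sequence"
    return "not iterable"

for value in ([1, 2], (1, 2), "abc", {1, 2}, {"a": 1}, 42):
    print(repr(value), "->", classify(value))
```

Note that str lands in the "immutable sequence" bucket, so a string check is still needed if strings should be treated as scalars.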

Python hashable dicts

As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is that I'd like to have a better idea about how hygienic macros would work in an Algol-like language (as opposed to the syntax-free Lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid, unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine)
The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
I guess it has to be tuples all the way down. Now the Python standard library provides approximately what I'd need: collections.namedtuple has a very different syntax, but can be used as a key. Continuing from the above session:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
tl;dr: How do I get dicts that can be used as keys to other dicts?
Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value) but we're all adults here.
class hashdict(dict):
    """
    hashable dict implementation, suitable for use as a key into
    other dicts.
    >>> h1 = hashdict({"apples": 1, "bananas":2})
    >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
    >>> h1+h2
    hashdict(apples=1, bananas=3, mangoes=5)
    >>> d1 = {}
    >>> d1[h1] = "salad"
    >>> d1[h1]
    'salad'
    >>> d1[h2]
    Traceback (most recent call last):
    ...
    KeyError: hashdict(bananas=3, mangoes=5)
    based on answers from
    http://stackoverflow.com/questions/1151658/python-hashable-dicts
    """
    def __key(self):
        return tuple(sorted(self.items()))
    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__,
            ", ".join("{0}={1}".format(
                str(i[0]),repr(i[1])) for i in self.__key()))
    def __hash__(self):
        return hash(self.__key())
    def __setitem__(self, key, value):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def __delitem__(self, key):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def clear(self):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def pop(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def popitem(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def setdefault(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    def update(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                        .format(self.__class__.__name__))
    # update is not ok because it mutates the object
    # __add__ is ok because it creates a new object
    # while the new object is under construction, it's ok to mutate it
    def __add__(self, right):
        result = hashdict(self)
        dict.update(result, right)
        return result
if __name__ == "__main__":
    import doctest
    doctest.testmod()
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.
class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:
class hashabledict(dict):
    def __key(self):
        return tuple((k,self[k]) for k in sorted(self))
    def __hash__(self):
        return hash(self.__key())
    def __eq__(self, other):
        return self.__key() == other.__key()
If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes hundredfolds -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))
Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:
class Hashabledict(dict):
    def __hash__(self):
        return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues will access the values directly and avoid the many memory allocator calls used by items to form many new key/value tuples in memory every time you do a lookup.
The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
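The timings above use Python 2's iteritems(). On Python 3 the same frozenset idea can be written with items(), roughly as in this sketch. HashableDict is a made-up name, and as with the other subclass-based answers, the dict must not be mutated after it has been used as a key.

```python
class HashableDict(dict):
    """dict subclass hashed by its (key, value) pairs; values must be hashable."""
    def __hash__(self):
        # frozenset ignores order, so insertion order does not affect the hash
        # and the keys never need to be sorted.
        return hash(frozenset(self.items()))

h1 = HashableDict(a=1, b=2)
h2 = HashableDict(b=2, a=1)
print(h1 == h2, hash(h1) == hash(h2))  # True True

cache = {(HashableDict(foo="bar"), "baz"): "quux"}
print(cache[(HashableDict(foo="bar"), "baz")])  # quux

# Keys of mixed, non-comparable types are fine because nothing is sorted.
mixed = HashableDict({'a': 'a', 1: 1})
print(isinstance(hash(mixed), int))  # True
```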
A reasonably clean, straightforward implementation is
import collections
class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""
    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)
    def __getitem__(self, key):
        return self._d[key]
    def __hash__(self):
        return hash(tuple(sorted(self._d.iteritems())))
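The class above is Python 2 code (collections.Mapping, iteritems()). A Python 3 rendering might look like the sketch below. Caching the hash and switching from tuple(sorted(...)) to frozenset are my own variations, not part of the answer.

```python
from collections.abc import Mapping

class FrozenDict(Mapping):
    """Read-only mapping that can be used as a dict key (values must be hashable)."""
    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
        self._hash = None  # computed lazily on first hash() call

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(frozenset(self._d.items()))
        return self._hash

fd = FrozenDict(a=1, b=2)
lookup = {fd: "cached"}
print(lookup[FrozenDict(b=2, a=1)])  # cached
print(fd == {"a": 1, "b": 2})        # True
```

Because Mapping defines no __setitem__, an assignment such as fd["c"] = 3 raises TypeError.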
I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; there's virtually no escape from the problem that dicts are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. Although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.
What if you jam key, value pairs into a frozenset? What would that require, how would it work?
Part 1, you need a way of encoding the 'item's in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
    def __hash__(self):
        return hash((self.key, None))
    def __eq__(self, other):
        if type(self) != type(other):
            return NotImplemented
        return self.key == other.key
    def __repr__(self):
        return repr((self.key, self.value))
That alone puts you in spitting distance of an immutable mapping:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object, which one ends up in the return value is undefined; we'll have to go through some more gyrations.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:
class Thief(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(pair(self.key, None))
    def __eq__(self, other):
        self.value = other.value
        return pair(self.key, None) == other
Which results in the less problematic:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:
class FrozenDict(frozenset, collections.Mapping):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
    def __getitem__(self, key):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        raise KeyError(key)
    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return dict(self.iteritems()) == other
        if len(self) != len(other):
            return False
        for key, value in self.iteritems():
            try:
                if value != other[key]:
                    return False
            except KeyError:
                return False
        return True
    def __hash__(self):
        return hash(frozenset(self.iteritems()))
    def get(self, key, default=None):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        return default
    def __iter__(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def iteritems(self):
        for item in frozenset.__iter__(self):
            yield (item.key, item.value)
    def iterkeys(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def itervalues(self):
        for item in frozenset.__iter__(self):
            yield item.value
    def __contains__(self, key):
        return frozenset.__contains__(self, pair(key, None))
    has_key = __contains__
    def __repr__(self):
        return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
    @classmethod
    def fromkeys(cls, keys, value=None):
        return cls((key, value) for key in keys)
which, ultimately, does answer my own question:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
The accepted answer by @Unknown, as well as the answer by @AlexMartelli, work perfectly fine, but only under the following constraints:
1. The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
2. Keys must support comparison operations. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
3. The comparison operator on keys must impose a total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Therefore, sorting the items of a dictionary with such keys can result in an arbitrary order, and therefore will violate the rule that equal objects must have the same hash value.
The much faster answer by @ObenSonne lifts constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).
The faster yet answer by @RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if:
4. Most of the (non-equal) dictionaries that need to be hashed do not have identical .keys().
If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
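The first three constraints are easy to reproduce on Python 3. In this sketch, SortedHashDict is a hypothetical stand-in for the tuple(sorted(...)) answers and hash_error is a made-up helper that reports whether hashing failed:

```python
class SortedHashDict(dict):
    """Stand-in for the tuple(sorted(...)) approaches discussed above."""
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

def hash_error(d):
    """Return the name of the exception raised by hash(d), or None."""
    try:
        hash(d)
    except TypeError as exc:
        return type(exc).__name__
    return None

# Unhashable value (a list) inside the dict.
print(hash_error(SortedHashDict({'a': [1, 2]})))     # TypeError
# Keys of types that cannot be ordered against each other (str vs int).
print(hash_error(SortedHashDict({'a': 'a', 1: 1})))  # TypeError
# Hashable values and comparable keys work.
print(hash_error(SortedHashDict({'a': 1, 'b': 2})))  # None

# Disjoint frozensets are only partially ordered: neither is "less than" the other.
a, b = frozenset((1, 2, 3)), frozenset((4, 5, 6))
print(a < b, b < a)  # False False
```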
I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.
I'd much appreciate any feedback on this, since I only tested this lightly so far.
# python 3.4
import collections
import collections.abc
import operator
import sys
import itertools
import reprlib

# a wrapper to make an object hashable, while preserving equality
class AutoHash:
    # for each known container type, we can optionally provide a tuple
    # specifying: type, transform, aggregator
    # even immutable types need to be included, since their items
    # may make them unhashable
    # transformation may be used to enforce the desired iteration
    # the result of a transformation must be an iterable
    # default: no change; for dictionaries, we use .items() to see values
    # usually transformation choice only affects efficiency, not correctness
    # aggregator is the function that combines all items into one object
    # default: frozenset; for ordered containers, we can use tuple
    # aggregator choice affects both efficiency and correctness
    # e.g., using a tuple aggregator for a set is incorrect,
    # since identical sets may end up with different hash values
    # frozenset is safe since at worst it just causes more collisions
    # unfortunately, no collections.abc class is available that helps
    # distinguish ordered from unordered containers
    # so we need to just list them out manually as needed
    type_info = collections.namedtuple(
        'type_info',
        'type transformation aggregator')
    ident = lambda x: x
    # order matters; first match is used to handle a datatype
    known_types = (
        # dict also handles defaultdict
        type_info(dict, lambda d: d.items(), frozenset),
        # no need to include set and frozenset, since they are fine with defaults
        type_info(collections.OrderedDict, ident, tuple),
        type_info(list, ident, tuple),
        type_info(tuple, ident, tuple),
        type_info(collections.deque, ident, tuple),
        # collections.abc.Iterable: the collections.Iterable alias was removed in Python 3.10
        type_info(collections.abc.Iterable, ident, frozenset)  # other iterables
    )

    # hash_func can be set to replace the built-in hash function
    # cache can be turned on; if it is, cycles will be detected,
    # otherwise cycles in a data structure will cause failure
    def __init__(self, data, hash_func=hash, cache=False, verbose=False):
        self._data = data
        self.hash_func = hash_func
        self.verbose = verbose
        self.cache = cache
        # cache objects' hashes for performance and to deal with cycles
        if self.cache:
            self.seen = {}

    def hash_ex(self, o):
        # note: isinstance(o, Hashable) won't check inner types
        try:
            if self.verbose:
                print(type(o),
                      reprlib.repr(o),
                      self.hash_func(o),
                      file=sys.stderr)
            return self.hash_func(o)
        except TypeError:
            pass
        # we let built-in hash decide if the hash value is worth caching
        # so we don't cache the built-in hash results
        if self.cache and id(o) in self.seen:
            return self.seen[id(o)][0]  # found in cache
        # check if o can be handled by decomposing it into components
        for typ, transformation, aggregator in AutoHash.known_types:
            if isinstance(o, typ):
                # another option is:
                # result = reduce(operator.xor, map(self.hash_ex, transformation(o)))
                # but collisions are more likely with xor than with frozenset
                # e.g. hash_ex([1,2,3])==0 with xor
                try:
                    # try to frozenset the actual components, it's faster
                    h = self.hash_func(aggregator(transformation(o)))
                except TypeError:
                    # components not hashable with built-in;
                    # apply our extended hash function to them
                    h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                if self.cache:
                    # storing the object too, otherwise memory location will be reused
                    self.seen[id(o)] = (h, o)
                if self.verbose:
                    print(type(o), reprlib.repr(o), h, file=sys.stderr)
                return h
        raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

    def __hash__(self):
        return self.hash_ex(self._data)

    def __eq__(self, other):
        # short circuit to save time
        if self is other:
            return True
        # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
        # 2) any other situation => lhs.__eq__ will be called first
        # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
        # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
        # case 2. neither side is a subclass of the other; self is lhs
        # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
        # case 3. neither side is a subclass of the other; self is rhs
        # => we can't compare to another type, and the other side already tried and failed;
        # we should return False, but NotImplemented will have the same effect
        # any other case: we won't reach the __eq__ code in this class, no need to worry about it
        if isinstance(self, type(other)):  # identifies case 1
            return self._data == other._data
        else:  # identifies cases 2 and 3
            return NotImplemented

d1 = {'a': [1, 2], 2: {3: 4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))
d = AutoHash(dict(a=1, b=2, c=3, d=[4, 5, 6, 7], e='a string of chars'), cache=True, verbose=True)
print(hash(d))
You might also want to add these two methods to get the v2 pickling protocol to work with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.
def __setstate__(self, objstate):
    # bypass the overridden __setitem__ and write through the plain dict
    for k, v in objstate.items():
        dict.__setitem__(self, k, v)
def __reduce__(self):
    return (hashdict, (), dict(self),)
Serialize the dict as a string with the json package:
import json
d = {'a': 1, 'b': 2}
s = json.dumps(d)
Restore the dict when you need it:
d2 = json.loads(s)
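If the serialized string is meant to serve as a cache key, one option is to pass sort_keys=True so that equal dictionaries produce the same string regardless of insertion order. This is only a sketch; it assumes string keys and JSON-serializable values, and the cache dictionary here is illustrative:

```python
import json

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}  # same content, different insertion order

# sort_keys=True makes the output independent of insertion order.
k1 = json.dumps(d1, sort_keys=True)
k2 = json.dumps(d2, sort_keys=True)

cache = {k1: 'whatever'}
print(k1 == k2)              # True
print(cache[k2])             # whatever
print(json.loads(k1) == d1)  # True
```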
If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:
cache[id(rule)] = "whatever"
since id() is unique for every dictionary for as long as that dictionary is alive
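A quick sketch of what this approach does and does not give you. It keys the cache on object identity, not on content, so an equal dictionary stored in another variable will not be found:

```python
rule = {'foo': 'bar'}
same_content = {'foo': 'bar'}  # equal to rule, but a separate object

cache = {}
cache[id(rule)] = "whatever"

print(rule == same_content)       # True
print(id(rule) in cache)          # True
print(id(same_content) in cache)  # False
```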
EDIT:
Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like
cache[ 'foo:bar' ] = 'baz'
If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like
cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )
I guess the advantage of this is that you wouldn't have to write as much code.
