How can I make as "perfect" a subclass of dict as possible? The end goal is to have a simple dict in which the keys are lowercase.
It would seem that there should be some tiny set of primitives I can override to make this work, but according to all my research and attempts it seems like this isn't the case:
If I override __getitem__/__setitem__, then get/set don't work. How can I make them work? Surely I don't need to implement them individually?
Am I preventing pickling from working, and do I need to implement __setstate__ etc?
Do I need repr, update and __init__?
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
Here is my first go at it, get() doesn't work and no doubt there are many other minor problems:
class arbitrary_dict(dict):
    """A dictionary that applies an arbitrary key-altering function
       before accessing the keys."""
    def __keytransform__(self, key):
        return key
    # Overridden methods. List from
    # https://stackoverflow.com/questions/2390827/how-to-properly-subclass-dict
    def __init__(self, *args, **kwargs):
        self.update(*args, **kwargs)
    # Note: I'm using dict directly, since super(dict, self) doesn't work.
    # I'm not sure why, perhaps dict is not a new-style class.
    def __getitem__(self, key):
        return dict.__getitem__(self, self.__keytransform__(key))
    def __setitem__(self, key, value):
        return dict.__setitem__(self, self.__keytransform__(key), value)
    def __delitem__(self, key):
        return dict.__delitem__(self, self.__keytransform__(key))
    def __contains__(self, key):
        return dict.__contains__(self, self.__keytransform__(key))
class lcdict(arbitrary_dict):
    def __keytransform__(self, key):
        return str(key).lower()
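The symptom described above can be reproduced in a few lines. This is only a minimal sketch (the class name NaiveLowerDict is made up for illustration), and it reflects CPython behaviour: dict.get and dict.update do not route through an overridden __getitem__ or __setitem__.

```python
class NaiveLowerDict(dict):
    """Hypothetical example: only the item-access special methods are overridden."""

    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())

    def __setitem__(self, key, value):
        dict.__setitem__(self, key.lower(), value)


d = NaiveLowerDict()
d['Foo'] = 1                      # stored under 'foo' via __setitem__
found_by_index = d['FOO']         # __getitem__ lowercases the key
found_by_get = d.get('FOO')       # dict.get does not call __getitem__
d.update({'Bar': 2})              # dict.update does not call __setitem__
raw_keys = sorted(dict.keys(d))   # the un-normalized 'Bar' slipped in
```

So every method that accepts a key has to be considered separately when subclassing dict directly.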
You can write an object that behaves like a dict quite easily with ABCs (Abstract Base Classes) from the collections.abc module. It even tells you if you missed a method, so below is the minimal version that shuts the ABC up.
from collections.abc import MutableMapping
class TransformedDict(MutableMapping):
    """A dictionary that applies an arbitrary key-altering
       function before accessing the keys"""
    def __init__(self, *args, **kwargs):
        self.store = dict()
        self.update(dict(*args, **kwargs))  # use the free update to set keys
    def __getitem__(self, key):
        return self.store[self._keytransform(key)]
    def __setitem__(self, key, value):
        self.store[self._keytransform(key)] = value
    def __delitem__(self, key):
        del self.store[self._keytransform(key)]
    def __iter__(self):
        return iter(self.store)
    def __len__(self):
        return len(self.store)
    def _keytransform(self, key):
        return key
You get a few free methods from the ABC:
class MyTransformedDict(TransformedDict):
    def _keytransform(self, key):
        return key.lower()
s = MyTransformedDict([('Test', 'test')])
assert s.get('TEST') is s['test']   # free get
assert 'TeSt' in s                  # free __contains__
                                    # free setdefault, __eq__, and so on
import pickle
# works too since we just use a normal dict
assert pickle.loads(pickle.dumps(s)) == s
I wouldn't subclass dict (or other builtins) directly. It often makes no sense, because what you actually want to do is implement the interface of a dict. And that is exactly what ABCs are for.
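The remark that the ABC "tells you if you missed a method" can be checked with a deliberately incomplete class. This is a small sketch; the class name Incomplete is hypothetical, and the exact wording of the error message varies between Python versions.

```python
from collections.abc import MutableMapping


class Incomplete(MutableMapping):
    """Hypothetical class that forgets __delitem__, __iter__ and __len__."""

    def __init__(self):
        self.store = {}

    def __getitem__(self, key):
        return self.store[key]

    def __setitem__(self, key, value):
        self.store[key] = value


try:
    Incomplete()
except TypeError as exc:
    message = str(exc)   # names the abstract methods that are still missing
else:
    message = None
```

Instantiation fails up front rather than at the first call to a missing method.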
How can I make as "perfect" a subclass of dict as possible?
The end goal is to have a simple dict in which the keys are lowercase.
If I override __getitem__/__setitem__, then get/set don't work. How
do I make them work? Surely I don't need to implement them
individually?
Am I preventing pickling from working, and do I need to implement
__setstate__ etc?
Do I need repr, update and __init__?
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
The accepted answer would be my first approach, but since it has some issues,
and since no one has addressed the alternative, actually subclassing a dict, I'm going to do that here.
What's wrong with the accepted answer?
This seems like a rather simple request to me:
How can I make as "perfect" a subclass of dict as possible?
The end goal is to have a simple dict in which the keys are lowercase.
The accepted answer doesn't actually subclass dict, and a test for this fails:
>>> isinstance(MyTransformedDict([('Test', 'test')]), dict)
False
Ideally, any type-checking code would be testing for the interface we expect, or an abstract base class, but if our data objects are being passed into functions that are testing for dict - and we can't "fix" those functions, this code will fail.
Other quibbles one might make:
The accepted answer is also missing the classmethod: fromkeys.
The accepted answer also has a redundant __dict__ - therefore taking up more space in memory:
>>> s.foo = 'bar'
>>> s.__dict__
{'foo': 'bar', 'store': {'test': 'test'}}
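If the extra __dict__ matters, one possible tweak is to declare __slots__ on the wrapper itself. This is my own assumption about how one might address the quibble, not something the accepted answer does; the class name SlottedLowerDict is hypothetical. It relies on the collections.abc base classes declaring empty __slots__.

```python
from collections.abc import MutableMapping


class SlottedLowerDict(MutableMapping):
    """Hypothetical variant of the wrapper that keeps only a 'store' slot."""

    __slots__ = ('store',)

    def __init__(self, *args, **kwargs):
        self.store = {}
        self.update(dict(*args, **kwargs))

    def __getitem__(self, key):
        return self.store[key.lower()]

    def __setitem__(self, key, value):
        self.store[key.lower()] = value

    def __delitem__(self, key):
        del self.store[key.lower()]

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)


s = SlottedLowerDict([('Test', 'test')])
has_instance_dict = hasattr(s, '__dict__')   # no per-instance __dict__
try:
    s.foo = 'bar'                            # arbitrary attributes are rejected
except AttributeError:
    rejected = True
else:
    rejected = False
```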
Actually subclassing dict
We can reuse the dict methods through inheritance. All we need to do is create an interface layer that ensures keys are passed into the dict in lowercase form if they are strings.
If I override __getitem__/__setitem__, then get/set don't work. How do I make them work? Surely I don't need to implement them individually?
Well, implementing them each individually is the downside to this approach and the upside to using MutableMapping (see the accepted answer), but it's really not that much more work.
First, let's factor out the difference between Python 2 and 3, create a singleton (_RaiseKeyError) to make sure we know if we actually get an argument to dict.pop, and create a function to ensure our string keys are lowercase:
from itertools import chain
try:              # Python 2
    str_base = basestring
    items = 'iteritems'
except NameError: # Python 3
    str_base = str, bytes, bytearray
    items = 'items'
_RaiseKeyError = object() # singleton for no-default behavior
def ensure_lower(maybe_str):
    """dict keys can be any hashable object - only call lower if str"""
    return maybe_str.lower() if isinstance(maybe_str, str_base) else maybe_str
Now we implement - I'm using super with the full arguments so that this code works for Python 2 and 3:
class LowerDict(dict):  # dicts take a mapping or iterable as their optional first argument
    __slots__ = () # no __dict__ - that would be redundant
    @staticmethod # because this doesn't make sense as a global function.
    def _process_args(mapping=(), **kwargs):
        if hasattr(mapping, items):
            mapping = getattr(mapping, items)()
        return ((ensure_lower(k), v) for k, v in chain(mapping, getattr(kwargs, items)()))
    def __init__(self, mapping=(), **kwargs):
        super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
    def __getitem__(self, k):
        return super(LowerDict, self).__getitem__(ensure_lower(k))
    def __setitem__(self, k, v):
        return super(LowerDict, self).__setitem__(ensure_lower(k), v)
    def __delitem__(self, k):
        return super(LowerDict, self).__delitem__(ensure_lower(k))
    def get(self, k, default=None):
        return super(LowerDict, self).get(ensure_lower(k), default)
    def setdefault(self, k, default=None):
        return super(LowerDict, self).setdefault(ensure_lower(k), default)
    def pop(self, k, v=_RaiseKeyError):
        if v is _RaiseKeyError:
            return super(LowerDict, self).pop(ensure_lower(k))
        return super(LowerDict, self).pop(ensure_lower(k), v)
    def update(self, mapping=(), **kwargs):
        super(LowerDict, self).update(self._process_args(mapping, **kwargs))
    def __contains__(self, k):
        return super(LowerDict, self).__contains__(ensure_lower(k))
    def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
        return type(self)(self)
    @classmethod
    def fromkeys(cls, keys, v=None):
        return super(LowerDict, cls).fromkeys((ensure_lower(k) for k in keys), v)
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__, super(LowerDict, self).__repr__())
We use an almost boiler-plate approach for any method or special method that references a key, but otherwise, by inheritance, we get methods: len, clear, items, keys, popitem, and values for free. While this required some careful thought to get right, it is trivial to see that this works.
(Note that has_key was deprecated in Python 2 and removed in Python 3.)
Here's some usage:
>>> ld = LowerDict(dict(foo='bar'))
>>> ld['FOO']
'bar'
>>> ld['foo']
'bar'
>>> ld.pop('FoO')
'bar'
>>> ld.setdefault('Foo')
>>> ld
{'foo': None}
>>> ld.get('Bar')
>>> ld.setdefault('Bar')
>>> ld
{'bar': None, 'foo': None}
>>> ld.popitem()
('bar', None)
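For readers on Python 3 only, the less obvious entry points (keyword arguments, update, fromkeys, copy) can be exercised with a trimmed-down sketch. This is not the full class above: the name Py3LowerDict and the reduced method set are my own simplifications for illustration.

```python
def ensure_lower(key):
    """Lowercase str keys; leave every other hashable key alone."""
    return key.lower() if isinstance(key, str) else key


class Py3LowerDict(dict):
    """Hypothetical Python 3-only reduction of LowerDict (fewer methods)."""

    __slots__ = ()

    @staticmethod
    def _process_args(mapping=(), **kwargs):
        if hasattr(mapping, 'items'):
            mapping = mapping.items()
        pairs = list(mapping) + list(kwargs.items())
        return ((ensure_lower(k), v) for k, v in pairs)

    def __init__(self, mapping=(), **kwargs):
        super().__init__(self._process_args(mapping, **kwargs))

    def __getitem__(self, k):
        return super().__getitem__(ensure_lower(k))

    def __setitem__(self, k, v):
        super().__setitem__(ensure_lower(k), v)

    def update(self, mapping=(), **kwargs):
        super().update(self._process_args(mapping, **kwargs))

    @classmethod
    def fromkeys(cls, keys, v=None):
        return super().fromkeys((ensure_lower(k) for k in keys), v)

    def copy(self):
        return type(self)(self)


ld = Py3LowerDict({'Foo': 1}, BAR=2)      # positional mapping plus keyword
ld.update([('BAZ', 3)])                   # iterable of pairs
ld[42] = 'non-string keys pass through'
str_keys = sorted(k for k in ld if isinstance(k, str))
from_keys = Py3LowerDict.fromkeys(['A', 'b'])
clone = ld.copy()
```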
Am I preventing pickling from working, and do I need to implement
__setstate__ etc?
pickling
And the dict subclass pickles just fine:
>>> import pickle
>>> pickle.dumps(ld)
b'\x80\x03c__main__\nLowerDict\nq\x00)\x81q\x01X\x03\x00\x00\x00fooq\x02Ns.'
>>> pickle.loads(pickle.dumps(ld))
{'foo': None}
>>> type(pickle.loads(pickle.dumps(ld)))
<class '__main__.LowerDict'>
__repr__
Do I need repr, update and __init__?
We defined update and __init__, but you have a beautiful __repr__ by default:
>>> ld # without __repr__ defined for the class, we get this
{'foo': None}
However, it's good to write a __repr__ to improve the debuggability of your code. The ideal test is eval(repr(obj)) == obj. If it's easy to do for your code, I strongly recommend it:
>>> ld = LowerDict({})
>>> eval(repr(ld)) == ld
True
>>> ld = LowerDict(dict(a=1, b=2, c=3))
>>> eval(repr(ld)) == ld
True
You see, it's exactly what we need to recreate an equivalent object - this is something that might show up in our logs or in backtraces:
>>> ld
LowerDict({'a': 1, 'c': 3, 'b': 2})
Conclusion
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
Yeah, these are a few more lines of code, but they're intended to be comprehensive. My first inclination would be to use the accepted answer,
and if there were issues with it, I'd then look at my answer - as it's a little more complicated, and there's no ABC to help me get my interface right.
Premature optimization is going for greater complexity in search of performance.
MutableMapping is simpler - so it gets an immediate edge, all else being equal. Nevertheless, to lay out all the differences, let's compare and contrast.
I should add that there was a push to put a similar dictionary into the collections module, but it was rejected. You should probably just do this instead:
my_dict[transform(key)]
It should be far more easily debuggable.
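Spelled out, the explicit approach might look like the following sketch. The transform function and the header-style keys are hypothetical, chosen only for illustration.

```python
def transform(key):
    """Hypothetical normalizer: lowercase str keys, pass others through."""
    return key.lower() if isinstance(key, str) else key


my_dict = {}
my_dict[transform('Content-Type')] = 'text/html'   # normalize on write
value = my_dict[transform('CONTENT-TYPE')]         # normalize on read
present = transform('content-type') in my_dict     # normalize on membership
```

The data stays in a plain dict, so nothing is hidden from whoever reads or debugs the calling code.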
Compare and contrast
There are 6 interface functions implemented with the MutableMapping (which is missing fromkeys) and 11 with the dict subclass. I don't need to implement __iter__ or __len__, but instead I have to implement get, setdefault, pop, update, copy, __contains__, and fromkeys - but these are fairly trivial, since I can use inheritance for most of those implementations.
The MutableMapping implements some things in Python that dict implements in C - so I would expect a dict subclass to be more performant in some cases.
We get a free __eq__ in both approaches - both of which assume equality only if another dict is all lowercase - but again, I think the dict subclass will compare more quickly.
Summary:
subclassing MutableMapping is simpler with fewer opportunities for bugs, but slower, takes more memory (see redundant dict), and fails isinstance(x, dict)
subclassing dict is faster, uses less memory, and passes isinstance(x, dict), but it has greater complexity to implement.
Which is more perfect? That depends on your definition of perfect.
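The speed claim is easy to measure rather than take on faith. Below is a rough sketch using timeit; the class names, the data size and the choice of items() iteration as the workload are my assumptions, and the numbers will differ between machines and Python versions, so no result is quoted here.

```python
import timeit
from collections.abc import MutableMapping


class WrappedLower(MutableMapping):
    """Hypothetical MutableMapping wrapper with lowercased keys."""

    def __init__(self):
        self.store = {}

    def __getitem__(self, key):
        return self.store[key.lower()]

    def __setitem__(self, key, value):
        self.store[key.lower()] = value

    def __delitem__(self, key):
        del self.store[key.lower()]

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)


class SubclassedLower(dict):
    """Hypothetical dict subclass; only __setitem__ is overridden here."""

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)


def time_items(mapping, number=2000):
    """Seconds taken to materialize mapping.items() `number` times."""
    return timeit.timeit(lambda: list(mapping.items()), number=number)


wrapped, subclassed = WrappedLower(), SubclassedLower()
for i in range(100):
    wrapped['K%d' % i] = i
    subclassed['K%d' % i] = i

timings = {
    'MutableMapping wrapper': time_items(wrapped),
    'dict subclass': time_items(subclassed),
}
```

Iterating items() is used because the ABC provides it in Python, while the dict subclass inherits the built-in implementation.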
After trying out both of the top two suggestions, I've settled on a shady-looking middle route for Python 2.7. Maybe 3 is saner, but for me:
class MyDict(MutableMapping):
    # ... the few __methods__ that mutablemapping requires
    # and then this monstrosity
    @property
    def __class__(self):
        return dict
which I really hate, but seems to fit my needs, which are:
can override **my_dict
    if you inherit from dict, this bypasses your code. try it out.
    this makes #2 unacceptable for me at all times, as this is quite common in python code
masquerades as isinstance(my_dict, dict)
    rules out MutableMapping alone, so #1 is not enough
    I heartily recommend #1 if you don't need this, it's simple and predictable
fully controllable behavior
    so I cannot inherit from dict
If you need to tell yourself apart from others, personally I use something like this (though I'd recommend better names):
def __am_i_me(self):
    return True
@classmethod
def __is_it_me(cls, other):
    try:
        return other.__am_i_me()
    except Exception:
        return False
As long as you only need to recognize yourself internally, this way it's harder to accidentally call __am_i_me due to Python's name mangling (the method is renamed to _MyDict__am_i_me for anything calling from outside this class). Slightly more private than _methods, both in practice and culturally.
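A standalone sketch of the name-mangling trick follows. The stand-in class omits the mapping methods, and the classmethod is given a public name (is_it_me) purely so it can be called from outside the class in this example.

```python
class MyDict(object):
    """Hypothetical stand-in class; only the recognition helpers are shown."""

    def __am_i_me(self):
        return True

    @classmethod
    def is_it_me(cls, other):
        # Inside the class body, other.__am_i_me is compiled as
        # other._MyDict__am_i_me, so only MyDict instances answer.
        try:
            return other.__am_i_me()
        except Exception:
            return False


recognised = MyDict.is_it_me(MyDict())             # a real MyDict
stranger = MyDict.is_it_me({})                     # a plain dict
mangled = hasattr(MyDict(), '_MyDict__am_i_me')    # the stored name
plain = hasattr(MyDict(), '__am_i_me')             # not reachable as written
```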
So far I have no complaints, aside from the seriously-shady-looking __class__ override. I'd be thrilled to hear of any problems that others encounter with this though, I don't fully understand the consequences. But so far I've had no problems whatsoever, and this allowed me to migrate a lot of middling-quality code in lots of locations without needing any changes.
As evidence: https://repl.it/repls/TraumaticToughCockatoo
Basically: copy the current #2 option, add print 'method_name' lines to every method, and then try this and watch the output:
d = LowerDict() # prints "init", or whatever your print statement said
print '------'
splatted = dict(**d) # note that there are no prints here
You'll see similar behavior for other scenarios. Say your fake-dict is a wrapper around some other datatype, so there's no reasonable way to store the data in the backing-dict; **your_dict will be empty, regardless of what every other method does.
This works correctly for MutableMapping, but as soon as you inherit from dict it becomes uncontrollable.
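The same effect can be shown on Python 3 without print statements. This is a sketch of CPython behaviour only; the two class names are hypothetical, and the value-uppercasing __getitem__ exists only to make the bypass visible.

```python
from collections.abc import MutableMapping


class ShoutingDict(dict):
    """Hypothetical dict subclass whose __getitem__ upper-cases values."""

    def __getitem__(self, key):
        return super().__getitem__(key).upper()


class ShoutingMapping(MutableMapping):
    """The same behaviour built on MutableMapping with a backing dict."""

    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def __getitem__(self, key):
        return self._data[key].upper()

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


from_subclass = dict(**ShoutingDict(a='x'))     # override is bypassed
from_mapping = dict(**ShoutingMapping(a='x'))   # override is honoured
```

For the dict subclass, unpacking copies the underlying storage directly; for the MutableMapping, it goes through keys() and __getitem__.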
Edit: as an update, this has been running without a single issue for almost two years now, on several hundred thousand (eh, might be a couple million) lines of complicated, legacy-ridden python. So I'm pretty happy with it :)
Edit 2: apparently I mis-copied this or something long ago. @classmethod __class__ does not work for isinstance checks - @property __class__ does: https://repl.it/repls/UnitedScientificSequence
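A self-contained sketch of the @property __class__ approach on Python 3 follows. The class name MasqueradingDict is hypothetical, and this only shows the isinstance behaviour, not every consequence of overriding __class__.

```python
from collections.abc import MutableMapping


class MasqueradingDict(MutableMapping):
    """Hypothetical MutableMapping that reports dict as its __class__."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    @property
    def __class__(self):
        return dict


m = MasqueradingDict()
m['a'] = 1
passes_isinstance = isinstance(m, dict)   # isinstance falls back to m.__class__
real_type_is_dict = type(m) is dict       # type() is not fooled
```

Code that checks type(x) is dict, or C code that checks the real type, will still see the difference.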
My requirements were a bit stricter:
I had to retain case info (the strings are paths to files displayed to the user, but it's a windows app so internally all operations must be case insensitive)
I needed keys to be as small as possible (it did make a difference in memory performance, chopped off 110 mb out of 370). This meant that caching lowercase version of keys is not an option.
I needed creation of the data structures to be as fast as possible (again made a difference in performance, speed this time). I had to go with a builtin
My initial thought was to substitute our clunky Path class for a case insensitive unicode subclass - but:
proved hard to get that right - see: A case insensitive string class in python
turns out that explicit dict keys handling makes code verbose and messy - and error prone (structures are passed hither and thither, and it is not clear if they have CIStr instances as keys/elements, easy to forget plus some_dict[CIstr(path)] is ugly)
So I finally had to write that case insensitive dict. Thanks to the code by @AaronHall, that was made 10 times easier.
class CIstr(unicode):
    """See https://stackoverflow.com/a/43122305/281545, especially for inlines"""
    __slots__ = () # does make a difference in memory performance
    #--Hash/Compare
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        if isinstance(other, CIstr):
            return self.lower() == other.lower()
        return NotImplemented
    def __ne__(self, other):
        if isinstance(other, CIstr):
            return self.lower() != other.lower()
        return NotImplemented
    def __lt__(self, other):
        if isinstance(other, CIstr):
            return self.lower() < other.lower()
        return NotImplemented
    def __ge__(self, other):
        if isinstance(other, CIstr):
            return self.lower() >= other.lower()
        return NotImplemented
    def __gt__(self, other):
        if isinstance(other, CIstr):
            return self.lower() > other.lower()
        return NotImplemented
    def __le__(self, other):
        if isinstance(other, CIstr):
            return self.lower() <= other.lower()
        return NotImplemented
    #--repr
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__,
                                 super(CIstr, self).__repr__())
def _ci_str(maybe_str):
    """dict keys can be any hashable object - only call CIstr if str"""
    return CIstr(maybe_str) if isinstance(maybe_str, basestring) else maybe_str
from itertools import chain  # needed by _process_args below
class LowerDict(dict):
    """Dictionary that transforms its keys to CIstr instances.
    Adapted from: https://stackoverflow.com/a/39375731/281545
    """
    __slots__ = () # no __dict__ - that would be redundant
    @staticmethod # because this doesn't make sense as a global function.
    def _process_args(mapping=(), **kwargs):
        if hasattr(mapping, 'iteritems'):
            mapping = getattr(mapping, 'iteritems')()
        return ((_ci_str(k), v) for k, v in
                chain(mapping, getattr(kwargs, 'iteritems')()))
    def __init__(self, mapping=(), **kwargs):
        # dicts take a mapping or iterable as their optional first argument
        super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
    def __getitem__(self, k):
        return super(LowerDict, self).__getitem__(_ci_str(k))
    def __setitem__(self, k, v):
        return super(LowerDict, self).__setitem__(_ci_str(k), v)
    def __delitem__(self, k):
        return super(LowerDict, self).__delitem__(_ci_str(k))
    def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
        return type(self)(self)
    def get(self, k, default=None):
        return super(LowerDict, self).get(_ci_str(k), default)
    def setdefault(self, k, default=None):
        return super(LowerDict, self).setdefault(_ci_str(k), default)
    __no_default = object()
    def pop(self, k, v=__no_default):
        if v is LowerDict.__no_default:
            # super will raise KeyError if no default and key does not exist
            return super(LowerDict, self).pop(_ci_str(k))
        return super(LowerDict, self).pop(_ci_str(k), v)
    def update(self, mapping=(), **kwargs):
        super(LowerDict, self).update(self._process_args(mapping, **kwargs))
    def __contains__(self, k):
        return super(LowerDict, self).__contains__(_ci_str(k))
    @classmethod
    def fromkeys(cls, keys, v=None):
        return super(LowerDict, cls).fromkeys((_ci_str(k) for k in keys), v)
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__,
                                 super(LowerDict, self).__repr__())
Implicit vs explicit is still a problem, but once the dust settles, I think renaming attributes/variables to start with ci (plus a big fat doc comment explaining that ci stands for case insensitive) is a perfect solution, as readers of the code must be fully aware that we are dealing with case insensitive underlying data structures.
This will hopefully fix some hard to reproduce bugs, which I suspect boil down to case sensitivity.
Comments/corrections welcome :)
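The code above is Python 2 (unicode, basestring, iteritems). For Python 3 readers, the core idea of keys that keep their original spelling but hash and compare case-insensitively can be sketched as follows. The class name CIStr, the reduced set of comparison methods and the sample path are my own assumptions, not a port of the author's code.

```python
class CIStr(str):
    """Hypothetical Python 3 sketch: a str that hashes and compares
    case-insensitively against other CIStr instances."""

    __slots__ = ()

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, CIStr):
            return self.lower() == other.lower()
        return NotImplemented

    def __ne__(self, other):
        # str defines its own __ne__, so it must be overridden as well.
        if isinstance(other, CIStr):
            return self.lower() != other.lower()
        return NotImplemented


paths = {CIStr('Data/Mod.ESP'): 'plugin'}
hit = paths[CIStr('data/mod.esp')]        # lookup ignores case
stored_key = str(next(iter(paths)))       # original spelling is retained
```

No lowercase copy of the key is cached, which matches the memory requirement stated above, at the cost of calling lower() on every hash and comparison.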
All you will have to do is
class BatchCollection(dict):
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)  # self must be passed explicitly
OR
class BatchCollection(dict):
    def __init__(self, inpt=()):  # immutable default instead of {}
        super(BatchCollection, self).__init__(inpt)
A sample usage for my personal use
### EXAMPLE
import collections.abc
class BatchCollection(dict):
    def __init__(self, inpt=()):
        # dict.__init__ does not call __setitem__, so inpt is not validated
        super(BatchCollection, self).__init__(inpt)
    def __setitem__(self, key, item):
        if (isinstance(key, tuple) and len(key) == 2
                and isinstance(item, collections.abc.Iterable)):
            # self.__dict__[key] = item
            super(BatchCollection, self).__setitem__(key, item)
        else:
            raise Exception(
                "Valid key should be a tuple (database_name, table_name) "
                "and value should be iterable")
Note: tested only in python3
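One caveat with this pattern is that dict.__init__ stores the initial items without calling __setitem__, so they skip the check. A possible way around that is sketched below; the class name ValidatedBatches and the switch to ValueError are my own choices, and update() and setdefault() would still need the same treatment.

```python
from collections.abc import Iterable


class ValidatedBatches(dict):
    """Hypothetical variant that validates initial items as well as later ones."""

    def __init__(self, inpt=()):
        super().__init__()
        # Route every initial item through __setitem__ so it is checked too.
        for key, item in dict(inpt).items():
            self[key] = item

    def __setitem__(self, key, item):
        if not (isinstance(key, tuple) and len(key) == 2
                and isinstance(item, Iterable)):
            raise ValueError(
                "Valid key should be a tuple (database_name, table_name) "
                "and value should be iterable")
        super().__setitem__(key, item)


batches = ValidatedBatches({('db', 'users'): [1, 2, 3]})
try:
    ValidatedBatches({'not-a-tuple': [1]})
except ValueError:
    rejected_at_init = True
else:
    rejected_at_init = False
```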
collections.UserDict is often the simplest option when you need a custom dict.
As shown in the other answer, it's very tricky to override dict's methods correctly, while UserDict makes it easy. To answer the original question, you can get a dict with lowercase keys:
import collections
class LowercaseDict(collections.UserDict):
    def __getitem__(self, key):
        return super().__getitem__(key.lower())
    def __setitem__(self, key, value):
        return super().__setitem__(key.lower(), value)
    def __delitem__(self, key):
        return super().__delitem__(key.lower())
    # Unfortunately, __contains__ is required currently due to
    # https://github.com/python/cpython/issues/91784
    def __contains__(self, key):
        return key.lower() in self.data
d = LowercaseDict(MY_KEY=0)  # Keys normalized in .__init__
d.update({'OTHER_KEY': 1})   # Keys normalized in .update
d['Hello'] = d['other_KEY']
assert 'HELLO' in d
print(d)  # All keys normalized {'my_key': 0, 'other_key': 1, 'hello': 1}
And unlike with collections.abc.MutableMapping, you don't need to define __iter__, __len__, __init__, and so on.
Subclassing UserDict is much easier.
However UserDict is a MutableMapping, not a dict, so:
assert not isinstance(collections.UserDict(), dict)
assert isinstance(collections.UserDict(), collections.abc.MutableMapping)
Related
Well, the question is in the title: how do I define a python dictionary with immutable keys but mutable values? I came up with this (in python 2.x):
class FixedDict(dict):
    """
    A dictionary with a fixed set of keys
    """
    def __init__(self, dictionary):
        dict.__init__(self)
        for key in dictionary.keys():
            dict.__setitem__(self, key, dictionary[key])
    def __setitem__(self, key, item):
        if key not in self:
            raise KeyError("The key '" +key+"' is not defined")
        dict.__setitem__(self, key, item)
but it looks to me (unsurprisingly) rather sloppy. In particular, is this safe or is there the risk of actually changing/adding some keys, since I'm inheriting from dict?
Thanks.
Consider proxying dict instead of subclassing it. That means that only the methods that you define will be allowed, instead of falling back to dict's implementations.
class FixedDict(object):
    def __init__(self, dictionary):
        self._dictionary = dictionary
    def __setitem__(self, key, item):
        if key not in self._dictionary:
            raise KeyError("The key {} is not defined.".format(key))
        self._dictionary[key] = item
    def __getitem__(self, key):
        return self._dictionary[key]
Also, you should use string formatting instead of + to generate the error message, since otherwise it will crash for any value that's not a string.
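A quick illustration of that last point, using a hypothetical integer key:

```python
key = 42   # a non-string key

try:
    message = "The key '" + key + "' is not defined"
except TypeError:
    message = None   # concatenating str and int raises TypeError

formatted = "The key {!r} is not defined.".format(key)
```

With concatenation, the TypeError would replace the KeyError the caller expects; format() works for any key type.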
The problem with direct inheritance from dict is that it's quite hard to comply with the full dict's contract (e.g. in your case, update method won't behave in a consistent way).
What you want is to extend MutableMapping (collections.MutableMapping on Python 2, collections.abc.MutableMapping on Python 3):
try:                 # Python 3.3+
    from collections.abc import MutableMapping
except ImportError:  # Python 2
    from collections import MutableMapping
class FixedDict(MutableMapping):
    def __init__(self, data):
        self.__data = data
    def __len__(self):
        return len(self.__data)
    def __iter__(self):
        return iter(self.__data)
    def __setitem__(self, k, v):
        if k not in self.__data:
            raise KeyError(k)
        self.__data[k] = v
    def __delitem__(self, k):
        raise NotImplementedError
    def __getitem__(self, k):
        return self.__data[k]
    def __contains__(self, k):
        return k in self.__data
Note that the original (wrapped) dict will be modified, if you don't want that to happen, use copy or deepcopy.
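The difference between sharing, copying and deep-copying the wrapped dict can be seen in a short sketch. The Wrapper class here is a hypothetical stand-in that keeps only the parts needed for the demonstration.

```python
import copy


class Wrapper(object):
    """Hypothetical minimal proxy that stores the dict it is given."""

    def __init__(self, data):
        self._data = data

    def __setitem__(self, k, v):
        if k not in self._data:
            raise KeyError(k)
        self._data[k] = v


source = {'a': [1], 'b': [2]}

shared = Wrapper(source)
shared['a'] = 'changed'                   # also visible through source

shallow = Wrapper(copy.copy(source))
shallow['b'] = 'changed'                  # source['b'] is untouched

deep = Wrapper(copy.deepcopy(source))
deep._data['b'].append(3)                 # nested list in source is untouched
```

A shallow copy protects the set of keys and their top-level values; only a deep copy also protects mutable values nested inside.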
How you prevent someone from adding new keys depends entirely on why someone might try to add new keys. As the comments state, most dictionary methods that modify the keys don't go through __setitem__, so a .update() call will add new keys just fine.
If you only expect someone to use d[new_key] = v, then your __setitem__ is fine. If they might use other ways to add keys, then you have to put in more work. And of course, they can always use this to do it anyway:
dict.__setitem__(d, new_key, v)
You can't make things truly immutable in Python, you can only stop particular changes.
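The points above about methods that skip __setitem__ can be checked directly. This sketch reduces the dict-subclass version to its essentials and reflects CPython behaviour.

```python
class FixedDict(dict):
    """The dict-subclass approach from the question, reduced to the essentials."""

    def __init__(self, dictionary):
        dict.__init__(self, dictionary)

    def __setitem__(self, key, item):
        if key not in self:
            raise KeyError("The key {!r} is not defined".format(key))
        dict.__setitem__(self, key, item)


d = FixedDict({'a': 1})
try:
    d['b'] = 2               # blocked by the overridden __setitem__
except KeyError:
    blocked = True
else:
    blocked = False

d.update(b=2)                # dict.update does not call __setitem__
d.setdefault('c', 3)         # neither does dict.setdefault
keys_after = sorted(d)
```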
How can I make as "perfect" a subclass of dict as possible? The end goal is to have a simple dict in which the keys are lowercase.
It would seem that there should be some tiny set of primitives I can override to make this work, but according to all my research and attempts it seem like this isn't the case:
If I override __getitem__/__setitem__, then get/set don't work. How can I make them work? Surely I don't need to implement them individually?
Am I preventing pickling from working, and do I need to implement __setstate__ etc?
Do I need repr, update and __init__?
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
Here is my first go at it, get() doesn't work and no doubt there are many other minor problems:
class arbitrary_dict(dict):
"""A dictionary that applies an arbitrary key-altering function
before accessing the keys."""
def __keytransform__(self, key):
return key
# Overridden methods. List from
# https://stackoverflow.com/questions/2390827/how-to-properly-subclass-dict
def __init__(self, *args, **kwargs):
self.update(*args, **kwargs)
# Note: I'm using dict directly, since super(dict, self) doesn't work.
# I'm not sure why, perhaps dict is not a new-style class.
def __getitem__(self, key):
return dict.__getitem__(self, self.__keytransform__(key))
def __setitem__(self, key, value):
return dict.__setitem__(self, self.__keytransform__(key), value)
def __delitem__(self, key):
return dict.__delitem__(self, self.__keytransform__(key))
def __contains__(self, key):
return dict.__contains__(self, self.__keytransform__(key))
class lcdict(arbitrary_dict):
def __keytransform__(self, key):
return str(key).lower()
You can write an object that behaves like a dict quite easily with ABCs (Abstract Base Classes) from the collections.abc module. It even tells you if you missed a method, so below is the minimal version that shuts the ABC up.
from collections.abc import MutableMapping
class TransformedDict(MutableMapping):
"""A dictionary that applies an arbitrary key-altering
function before accessing the keys"""
def __init__(self, *args, **kwargs):
self.store = dict()
self.update(dict(*args, **kwargs)) # use the free update to set keys
def __getitem__(self, key):
return self.store[self._keytransform(key)]
def __setitem__(self, key, value):
self.store[self._keytransform(key)] = value
def __delitem__(self, key):
del self.store[self._keytransform(key)]
def __iter__(self):
return iter(self.store)
def __len__(self):
return len(self.store)
def _keytransform(self, key):
return key
You get a few free methods from the ABC:
class MyTransformedDict(TransformedDict):
def _keytransform(self, key):
return key.lower()
s = MyTransformedDict([('Test', 'test')])
assert s.get('TEST') is s['test'] # free get
assert 'TeSt' in s # free __contains__
# free setdefault, __eq__, and so on
import pickle
# works too since we just use a normal dict
assert pickle.loads(pickle.dumps(s)) == s
I wouldn't subclass dict (or other builtins) directly. It often makes no sense, because what you actually want to do is implement the interface of a dict. And that is exactly what ABCs are for.
How can I make as "perfect" a subclass of dict as possible?
The end goal is to have a simple dict in which the keys are lowercase.
If I override __getitem__/__setitem__, then get/set don't work. How
do I make them work? Surely I don't need to implement them
individually?
Am I preventing pickling from working, and do I need to implement
__setstate__ etc?
Do I need repr, update and __init__?
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
The accepted answer would be my first approach, but since it has some issues,
and since no one has addressed the alternative, actually subclassing a dict, I'm going to do that here.
What's wrong with the accepted answer?
This seems like a rather simple request to me:
How can I make as "perfect" a subclass of dict as possible?
The end goal is to have a simple dict in which the keys are lowercase.
The accepted answer doesn't actually subclass dict, and a test for this fails:
>>> isinstance(MyTransformedDict([('Test', 'test')]), dict)
False
Ideally, any type-checking code would be testing for the interface we expect, or an abstract base class, but if our data objects are being passed into functions that are testing for dict - and we can't "fix" those functions, this code will fail.
Other quibbles one might make:
The accepted answer is also missing the classmethod: fromkeys.
The accepted answer also has a redundant __dict__ - therefore taking up more space in memory:
>>> s.foo = 'bar'
>>> s.__dict__
{'foo': 'bar', 'store': {'test': 'test'}}
Actually subclassing dict
We can reuse the dict methods through inheritance. All we need to do is create an interface layer that ensures keys are passed into the dict in lowercase form if they are strings.
If I override __getitem__/__setitem__, then get/set don't work. How do I make them work? Surely I don't need to implement them individually?
Well, implementing them each individually is the downside to this approach and the upside to using MutableMapping (see the accepted answer), but it's really not that much more work.
First, let's factor out the difference between Python 2 and 3, create a singleton (_RaiseKeyError) to make sure we know if we actually get an argument to dict.pop, and create a function to ensure our string keys are lowercase:
from itertools import chain
try: # Python 2
str_base = basestring
items = 'iteritems'
except NameError: # Python 3
str_base = str, bytes, bytearray
items = 'items'
_RaiseKeyError = object() # singleton for no-default behavior
def ensure_lower(maybe_str):
"""dict keys can be any hashable object - only call lower if str"""
return maybe_str.lower() if isinstance(maybe_str, str_base) else maybe_str
Now we implement - I'm using super with the full arguments so that this code works for Python 2 and 3:
class LowerDict(dict): # dicts take a mapping or iterable as their optional first argument
__slots__ = () # no __dict__ - that would be redundant
#staticmethod # because this doesn't make sense as a global function.
def _process_args(mapping=(), **kwargs):
if hasattr(mapping, items):
mapping = getattr(mapping, items)()
return ((ensure_lower(k), v) for k, v in chain(mapping, getattr(kwargs, items)()))
def __init__(self, mapping=(), **kwargs):
super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
def __getitem__(self, k):
return super(LowerDict, self).__getitem__(ensure_lower(k))
def __setitem__(self, k, v):
return super(LowerDict, self).__setitem__(ensure_lower(k), v)
def __delitem__(self, k):
return super(LowerDict, self).__delitem__(ensure_lower(k))
def get(self, k, default=None):
return super(LowerDict, self).get(ensure_lower(k), default)
def setdefault(self, k, default=None):
return super(LowerDict, self).setdefault(ensure_lower(k), default)
def pop(self, k, v=_RaiseKeyError):
if v is _RaiseKeyError:
return super(LowerDict, self).pop(ensure_lower(k))
return super(LowerDict, self).pop(ensure_lower(k), v)
def update(self, mapping=(), **kwargs):
super(LowerDict, self).update(self._process_args(mapping, **kwargs))
def __contains__(self, k):
return super(LowerDict, self).__contains__(ensure_lower(k))
def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
return type(self)(self)
#classmethod
def fromkeys(cls, keys, v=None):
return super(LowerDict, cls).fromkeys((ensure_lower(k) for k in keys), v)
def __repr__(self):
return '{0}({1})'.format(type(self).__name__, super(LowerDict, self).__repr__())
We use an almost boiler-plate approach for any method or special method that references a key, but otherwise, by inheritance, we get methods: len, clear, items, keys, popitem, and values for free. While this required some careful thought to get right, it is trivial to see that this works.
(Note that haskey was deprecated in Python 2, removed in Python 3.)
Here's some usage:
>>> ld = LowerDict(dict(foo='bar'))
>>> ld['FOO']
'bar'
>>> ld['foo']
'bar'
>>> ld.pop('FoO')
'bar'
>>> ld.setdefault('Foo')
>>> ld
{'foo': None}
>>> ld.get('Bar')
>>> ld.setdefault('Bar')
>>> ld
{'bar': None, 'foo': None}
>>> ld.popitem()
('bar', None)
Am I preventing pickling from working, and do I need to implement
__setstate__ etc?
pickling
And the dict subclass pickles just fine:
>>> import pickle
>>> pickle.dumps(ld)
b'\x80\x03c__main__\nLowerDict\nq\x00)\x81q\x01X\x03\x00\x00\x00fooq\x02Ns.'
>>> pickle.loads(pickle.dumps(ld))
{'foo': None}
>>> type(pickle.loads(pickle.dumps(ld)))
<class '__main__.LowerDict'>
__repr__
Do I need repr, update and __init__?
We defined update and __init__, but you have a beautiful __repr__ by default:
>>> ld # without __repr__ defined for the class, we get this
{'foo': None}
However, it's good to write a __repr__ to improve the debuggability of your code. The ideal test is eval(repr(obj)) == obj. If it's easy to do for your code, I strongly recommend it:
>>> ld = LowerDict({})
>>> eval(repr(ld)) == ld
True
>>> ld = LowerDict(dict(a=1, b=2, c=3))
>>> eval(repr(ld)) == ld
True
You see, it's exactly what we need to recreate an equivalent object - this is something that might show up in our logs or in backtraces:
>>> ld
LowerDict({'a': 1, 'c': 3, 'b': 2})
Conclusion
Should I just use mutablemapping (it seems one shouldn't use UserDict
or DictMixin)? If so, how? The docs aren't exactly enlightening.
Yeah, these are a few more lines of code, but they're intended to be comprehensive. My first inclination would be to use the accepted answer, and if there were issues with it, I'd then look at my answer - as it's a little more complicated, and there's no ABC to help me get my interface right.
Premature optimization is going for greater complexity in search of performance.
MutableMapping is simpler - so it gets an immediate edge, all else being equal. Nevertheless, to lay out all the differences, let's compare and contrast.
I should add that there was a push to put a similar dictionary into the collections module, but it was rejected. You should probably just do this instead:
my_dict[transform(key)]
It should be far easier to debug.
Compare and contrast
There are 6 interface functions implemented with the MutableMapping (which is missing fromkeys) and 11 with the dict subclass. I don't need to implement __iter__ or __len__, but instead I have to implement get, setdefault, pop, update, copy, __contains__, and fromkeys - but these are fairly trivial, since I can use inheritance for most of those implementations.
The MutableMapping implements some things in Python that dict implements in C - so I would expect a dict subclass to be more performant in some cases.
We get a free __eq__ in both approaches - both of which assume equality only if another dict is all lowercase - but again, I think the dict subclass will compare more quickly.
Summary:
subclassing MutableMapping is simpler with fewer opportunities for bugs, but slower, takes more memory (see redundant dict), and fails isinstance(x, dict)
subclassing dict is faster, uses less memory, and passes isinstance(x, dict), but it has greater complexity to implement.
Which is more perfect? That depends on your definition of perfect.
After trying out both of the top two suggestions, I've settled on a shady-looking middle route for Python 2.7. Maybe 3 is saner, but for me:
class MyDict(MutableMapping):
# ... the few __methods__ that mutablemapping requires
# and then this monstrosity
@property
def __class__(self):
return dict
which I really hate, but seems to fit my needs, which are:
can override **my_dict
if you inherit from dict, this bypasses your code. try it out.
this makes #2 unacceptable for me at all times, as this is quite common in python code
masquerades as isinstance(my_dict, dict)
rules out MutableMapping alone, so #1 is not enough
I heartily recommend #1 if you don't need this, it's simple and predictable
fully controllable behavior
so I cannot inherit from dict
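For anyone on Python 3 wondering what the __class__ trick looks like end to end, here is a minimal sketch. FakeDict and its _data attribute are names I made up for illustration, not the code from this answer. It leans on isinstance() consulting the instance's __class__ attribute when the real type doesn't match, the same mechanism unittest.mock uses for spec'd mocks.

```python
from collections.abc import MutableMapping


class FakeDict(MutableMapping):
    """Not a dict subclass, but reports dict as its __class__."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)  # hypothetical backing store

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    @property
    def __class__(self):
        # isinstance() falls back to this when type(obj) is not a match
        return dict


fake = FakeDict(a=1)
print(isinstance(fake, dict))      # the masquerade
print(type(fake) is FakeDict)      # the real type is unchanged
print(issubclass(FakeDict, dict))  # the class itself is still not a dict subclass
```

Code that checks type(x) is dict, or C code that uses the concrete dict API directly, will not be fooled by this, so treat it as a compatibility shim rather than a real dict.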
If you need to tell yourself apart from others, personally I use something like this (though I'd recommend better names):
def __am_i_me(self):
return True
@classmethod
def __is_it_me(cls, other):
try:
return other.__am_i_me()
except Exception:
return False
As long as you only need to recognize yourself internally, this way it's harder to accidentally call __am_i_me, thanks to Python's name mangling (the method is renamed to _MyDict__am_i_me for anything calling from outside this class). It's slightly more private than _methods, both in practice and culturally.
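A small Python 3 sketch of how the mangling plays out. The public wrapper is_it_me is something I added so the check can be called from outside the class, and I narrowed the except clause to AttributeError; both are assumptions for the demo, not part of the snippet above.

```python
class MyDict:
    def __am_i_me(self):
        # Stored on the class under the mangled name _MyDict__am_i_me
        return True

    @classmethod
    def is_it_me(cls, other):
        try:
            # Inside the class body this compiles to other._MyDict__am_i_me()
            return other.__am_i_me()
        except AttributeError:
            return False


print(MyDict.is_it_me(MyDict()))               # recognizes its own instances
print(MyDict.is_it_me({}))                     # a plain dict has no such method
print(hasattr(MyDict(), "__am_i_me"))          # unmangled name is not there
print(hasattr(MyDict(), "_MyDict__am_i_me"))   # mangled name is
```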
So far I have no complaints, aside from the seriously-shady-looking __class__ override. I'd be thrilled to hear of any problems that others encounter with this though, I don't fully understand the consequences. But so far I've had no problems whatsoever, and this allowed me to migrate a lot of middling-quality code in lots of locations without needing any changes.
As evidence: https://repl.it/repls/TraumaticToughCockatoo
Basically: copy the current #2 option, add print 'method_name' lines to every method, and then try this and watch the output:
d = LowerDict() # prints "init", or whatever your print statement said
print '------'
splatted = dict(**d) # note that there are no prints here
You'll see similar behavior for other scenarios. Say your fake-dict is a wrapper around some other datatype, so there's no reasonable way to store the data in the backing-dict; **your_dict will be empty, regardless of what every other method does.
This works correctly for MutableMapping, but as soon as you inherit from dict it becomes uncontrollable.
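If you'd rather not instrument a whole class with prints, a smaller Python 3 sketch shows the same effect. ShoutingDict and ShoutingMapping are made-up names, and the behaviour shown for the dict subclass is what CPython does (it copies the underlying storage directly when the subclass has not overridden iteration); other implementations may differ.

```python
from collections.abc import MutableMapping


class ShoutingDict(dict):
    """dict subclass whose lookups upper-case the stored value."""

    def __getitem__(self, key):
        return super().__getitem__(key).upper()


class ShoutingMapping(MutableMapping):
    """Same idea, built on MutableMapping with a backing dict."""

    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def __getitem__(self, key):
        return self._data[key].upper()

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


sub = ShoutingDict(a="x")
mapping = ShoutingMapping(a="x")

print(sub["a"], mapping["a"])   # both go through the override
print(dict(**sub))              # CPython: override bypassed, raw value copied
print(dict(**mapping))          # keys() + __getitem__ are used, override honoured
```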
Edit: as an update, this has been running without a single issue for almost two years now, on several hundred thousand (eh, might be a couple million) lines of complicated, legacy-ridden python. So I'm pretty happy with it :)
Edit 2: apparently I mis-copied this or something long ago. @classmethod __class__ does not work for isinstance checks - @property __class__ does: https://repl.it/repls/UnitedScientificSequence
My requirements were a bit stricter:
I had to retain case info (the strings are paths to files displayed to the user, but it's a windows app so internally all operations must be case insensitive)
I needed keys to be as small as possible (it did make a difference in memory performance, chopped off 110 mb out of 370). This meant that caching lowercase version of keys is not an option.
I needed creation of the data structures to be as fast as possible (again made a difference in performance, speed this time). I had to go with a builtin
My initial thought was to substitute our clunky Path class for a case insensitive unicode subclass - but:
proved hard to get that right - see: A case insensitive string class in python
turns out that explicit dict keys handling makes code verbose and messy - and error prone (structures are passed hither and thither, and it is not clear if they have CIStr instances as keys/elements, easy to forget plus some_dict[CIstr(path)] is ugly)
So I finally had to write that case insensitive dict. Thanks to code by @AaronHall, that was made 10 times easier.
from itertools import chain  # needed by LowerDict._process_args below
class CIstr(unicode):
"""See https://stackoverflow.com/a/43122305/281545, especially for inlines"""
__slots__ = () # does make a difference in memory performance
#--Hash/Compare
def __hash__(self):
return hash(self.lower())
def __eq__(self, other):
if isinstance(other, CIstr):
return self.lower() == other.lower()
return NotImplemented
def __ne__(self, other):
if isinstance(other, CIstr):
return self.lower() != other.lower()
return NotImplemented
def __lt__(self, other):
if isinstance(other, CIstr):
return self.lower() < other.lower()
return NotImplemented
def __ge__(self, other):
if isinstance(other, CIstr):
return self.lower() >= other.lower()
return NotImplemented
def __gt__(self, other):
if isinstance(other, CIstr):
return self.lower() > other.lower()
return NotImplemented
def __le__(self, other):
if isinstance(other, CIstr):
return self.lower() <= other.lower()
return NotImplemented
#--repr
def __repr__(self):
return '{0}({1})'.format(type(self).__name__,
super(CIstr, self).__repr__())
def _ci_str(maybe_str):
"""dict keys can be any hashable object - only call CIstr if str"""
return CIstr(maybe_str) if isinstance(maybe_str, basestring) else maybe_str
class LowerDict(dict):
"""Dictionary that transforms its keys to CIstr instances.
Adapted from: https://stackoverflow.com/a/39375731/281545
"""
__slots__ = () # no __dict__ - that would be redundant
@staticmethod # because this doesn't make sense as a global function.
def _process_args(mapping=(), **kwargs):
if hasattr(mapping, 'iteritems'):
mapping = getattr(mapping, 'iteritems')()
return ((_ci_str(k), v) for k, v in
chain(mapping, getattr(kwargs, 'iteritems')()))
def __init__(self, mapping=(), **kwargs):
# dicts take a mapping or iterable as their optional first argument
super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
def __getitem__(self, k):
return super(LowerDict, self).__getitem__(_ci_str(k))
def __setitem__(self, k, v):
return super(LowerDict, self).__setitem__(_ci_str(k), v)
def __delitem__(self, k):
return super(LowerDict, self).__delitem__(_ci_str(k))
def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
return type(self)(self)
def get(self, k, default=None):
return super(LowerDict, self).get(_ci_str(k), default)
def setdefault(self, k, default=None):
return super(LowerDict, self).setdefault(_ci_str(k), default)
__no_default = object()
def pop(self, k, v=__no_default):
if v is LowerDict.__no_default:
# super will raise KeyError if no default and key does not exist
return super(LowerDict, self).pop(_ci_str(k))
return super(LowerDict, self).pop(_ci_str(k), v)
def update(self, mapping=(), **kwargs):
super(LowerDict, self).update(self._process_args(mapping, **kwargs))
def __contains__(self, k):
return super(LowerDict, self).__contains__(_ci_str(k))
@classmethod
def fromkeys(cls, keys, v=None):
return super(LowerDict, cls).fromkeys((_ci_str(k) for k in keys), v)
def __repr__(self):
return '{0}({1})'.format(type(self).__name__,
super(LowerDict, self).__repr__())
Implicit vs explicit is still a problem, but once the dust settles, I think renaming attributes/variables to start with ci (plus a big fat doc comment explaining that ci stands for case insensitive) is a perfect solution - as readers of the code must be fully aware that we are dealing with case insensitive underlying data structures.
This will hopefully fix some hard to reproduce bugs, which I suspect boil down to case sensitivity.
Comments/corrections welcome :)
All you will have to do is
class BatchCollection(dict):
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
OR
class BatchCollection(dict):
    def __init__(self, inpt={}):
        super(BatchCollection, self).__init__(inpt)
A sample usage for my personal use
### EXAMPLE
import collections.abc
class BatchCollection(dict):
    def __init__(self, inpt={}):
        super(BatchCollection, self).__init__(inpt)
    def __setitem__(self, key, item):
        # only accept (database_name, table_name) keys with iterable values
        if (isinstance(key, tuple) and len(key) == 2
                and isinstance(item, collections.abc.Iterable)):
            # self.__dict__[key] = item
            super(BatchCollection, self).__setitem__(key, item)
        else:
            raise Exception(
                "Valid key should be a tuple (database_name, table_name) "
                "and value should be iterable")
Note: tested only in python3
collections.UserDict is often the simplest option when you need a custom dict.
As shown in the other answer, it's very tricky to subclass dict correctly, while UserDict makes it easy. To answer the original question, here is a dict with lowercase keys:
import collections
class LowercaseDict(collections.UserDict):
def __getitem__(self, key):
return super().__getitem__(key.lower())
def __setitem__(self, key, value):
return super().__setitem__(key.lower(), value)
def __delitem__(self, key):
return super().__delitem__(key.lower())
# Unfortunately, __contains__ is required currently due to
# https://github.com/python/cpython/issues/91784
def __contains__(self, key):
return key.lower() in self.data
d = LowercaseDict(MY_KEY=0) # Keys normalized in .__init__
d.update({'OTHER_KEY': 1}) # Keys normalized in .update
d['Hello'] = d['other_KEY']
assert 'HELLO' in d
print(d) # All keys normalized {'my_key': 0, 'other_key': 1, 'hello': 1}
And contrary to collections.abc.MutableMapping, you don't need __iter__, __len__, __init__,...
Subclassing UserDict is much easier.
However UserDict is a MutableMapping, not a dict, so:
assert not isinstance(collections.UserDict(), dict)
assert isinstance(collections.UserDict(), collections.abc.MutableMapping)
I have a class that looks like this
class myClass:
    def __init__(self, key, value):
        self.key = key
        self.value = value
where key is a string and value is always a list of elements of myClass, possibly empty.
I want to define my own iter method that returns value.key for each value in values. I tried
def __iter__(self):
return self
def __next__(self):
try:
self.value.next().key
except:
raise StopIteration
but it's looping forever. What am I doing wrong?
Also if I want to have Python 2 / 3 compatibility, should I add the method
def next(self):
return self.__next__()
There's no reason for you to implement __next__. You can use __iter__ to return a generator which will do what you want.
class Pair(object):
def __init__(self, key, value):
self.key = key
self.value = value
def __iter__(self):
return (v.key for v in self.value)
# alternative iter function, that allows more complicated logic
def __iter__(self):
for v in self.value:
yield v.key
p = Pair("parent", [Pair("child0", "value0"), Pair("child1", "value1")])
assert list(p) == ["child0", "child1"]
This way of doing things is compatible with both python2 and python3 as the returned generator will have the required next function in python2, and __next__ in python3.
You need to extract and preserve an iterator on list self.value -- you can't just call next on a list, you need an iterator on such a list.
So, you need an auxiliary iterator class:
class myClassIter(object):
def __init__(self, theiter):
self.theiter = theiter
def __next__(self):
return next(self.theiter).key
next = __next__
which I've also made Py 2/3 compatible with the object base and appropriate aliasing.
Here, I'm assuming every item in the list has a key attribute (so the only expected exception is StopIteration, which you can just propagate). If that is not the case, and you want to just stop the iteration when an item is met without the attribute, the try/except is needed, but keep it tight! -- a crucial design aspect of good exception handling. I.e., if these are indeed your specs:
def __next__(self):
try: return next(self.theiter).key
except AttributeError: raise StopIteration
don't catch all exceptions -- only the ones you specifically expect!
Now, in myClass, you'll want:
def __iter__(self):
return myClassIter(iter(self.value))
This means that myClass is an iterable, not an iterator, so you can e.g properly have more than one loop on a myClass instance:
mc = myClass(somekey, funkylist)
for ka in mc:
for kb in mc:
whatever(ka, kb)
If mc was itself an iterator, the inner loop would exhaust it and the semantics of the nested loops would therefore be completely different.
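To make the iterable-vs-iterator difference concrete, here is a short Python 3 sketch. Node is a stand-in name for the asker's class, and it uses a generator expression in __iter__ rather than the auxiliary iterator class, purely to keep the demo small.

```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value  # list of child Node objects

    def __iter__(self):
        # A fresh, independent iterator on every call
        return (child.key for child in self.value)


node = Node("parent", [Node("a", []), Node("b", [])])

# Iterable: each loop gets its own iterator, so we see the full cross product
pairs_from_iterable = [(x, y) for x in node for y in node]

# Iterator: both loops share one iterator, and the inner loop drains it
shared = iter(node)
pairs_from_iterator = [(x, y) for x in shared for y in shared]

print(pairs_from_iterable)
print(pairs_from_iterator)
```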
If you do indeed want such completely different semantics (i.e you want mc to be itself an iterator, not just an iterable) then you must dispense with the auxiliary class (but still need to store the iterator on self.value as an instance attribute for myClass) -- that would be a strange, uncomfortable arrangement, but it is (just barely) possible that it is indeed the arrangement your application needs...
I want to use str.format() and pass it a custom lazy dictionary.
str.format() should only access the key in the lazy dict it needs.
Is this possible?
Which interface needs to be implemented by the lazy_dict?
Update
This is not what I want:
'{0[a]}'.format(d)
I need something like this:
'{a}'.format(**d)
Need to run on Python2.7
For doing '{a}'.format(**d), especially the **d part, the "lazy" dict is transformed into a regular one. Here happens the access to all keys, and format() can't do anything about it.
You could craft some proxy objects which are put in place of the elements, and on string access they do the "real" work.
Something like
class LazyProxy(object):
def __init__(self, prx):
self.prx = prx
def __format__(self, fmtspec):
return format(self.prx(), fmtspec)
def __repr__(self):
return repr(self.prx())
def __str__(self):
return str(self.prx())
You can put these elements into a dict, such as
interd = {k: LazyProxy(lambda k=k: lazydict[k]) for k in lazydict.iterkeys()}
I didn't test this, but I think this fulfills your needs.
After the last edit, it now works with !r and !s as well.
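As a side note for readers who are not stuck on 2.7: on Python 3.2+ str.format_map() uses the mapping directly instead of copying it, so a dict subclass with __missing__ can compute values on demand. This does not help with the Python 2.7 requirement in the question. The class name, the factories argument and the computed list below are all made up for this sketch.

```python
class LazyDict(dict):
    """Maps names to zero-argument callables and calls one only on lookup."""

    def __init__(self, factories):
        super().__init__()
        self._factories = factories  # hypothetical: name -> callable
        self.computed = []           # records which keys were actually needed

    def __missing__(self, key):
        value = self._factories[key]()  # raises KeyError for unknown names
        self.computed.append(key)
        self[key] = value               # cache so the work happens once
        return value


lazy = LazyDict({"a": lambda: 19, "b": lambda: 20})
result = "{a}".format_map(lazy)

print(result)
print(lazy.computed)   # only the key the format string asked for
```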
You can use the __format__ method (Python 3 only). See the doc here.
If I understand your question correctly, you want to pass a custom dictionary, that would compute values only when needed. First, we're looking for implementation of __getitem__():
>>> class LazyDict(object):
... def __init__(self, d):
... self.d = d
... def __getitem__(self, k):
... print k # <-- tracks the needed keys
... return self.d[k]
...
>>> d = LazyDict({'a': 19, 'b': 20})
>>> '{0[a]}'.format(d)
a
'19'
This shows that only key 'a' is accessed; 'b' is not, so you already have your lazy access.
But also, any object attribute is usable for str.format this way, and using the @property decorator, you can access function results:
class MyObject(object):
def __init__(self):
self.a = 19
self.b = 20
def __getitem__(self, var):
return getattr(self, var)
# this lets you look up any attribute of your instance,
# or even the result of a function if it is decorated with @property:
@property
def c(self):
return 21
Example of usage:
>>> m = MyObject()
>>> '{0[c]}'.format(m)
'21'
But note that this also works; it makes the format string a little more specific, but avoids the need to implement __getitem__():
>>> '{0.c}'.format(m)
'21'
As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is that I'd like to have a better idea of how hygienic macros would work in an ALGOL-like language (as opposed to the syntax-free Lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine.)
The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
I guess it has to be tuples all the way down. Now the Python standard library provides approximately what I'd need: collections.namedtuple has a very different syntax, but can be used as a key. Continuing from the above session:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
tl;dr: How do I get dicts that can be used as keys to other dicts?
Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value) but we're all adults here.
class hashdict(dict):
"""
hashable dict implementation, suitable for use as a key into
other dicts.
>>> h1 = hashdict({"apples": 1, "bananas":2})
>>> h2 = hashdict({"bananas": 3, "mangoes": 5})
>>> h1+h2
hashdict(apples=1, bananas=3, mangoes=5)
>>> d1 = {}
>>> d1[h1] = "salad"
>>> d1[h1]
'salad'
>>> d1[h2]
Traceback (most recent call last):
...
KeyError: hashdict(bananas=3, mangoes=5)
based on answers from
http://stackoverflow.com/questions/1151658/python-hashable-dicts
"""
def __key(self):
return tuple(sorted(self.items()))
def __repr__(self):
return "{0}({1})".format(self.__class__.__name__,
", ".join("{0}={1}".format(
str(i[0]),repr(i[1])) for i in self.__key()))
def __hash__(self):
return hash(self.__key())
def __setitem__(self, key, value):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def __delitem__(self, key):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def clear(self):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def pop(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def popitem(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def setdefault(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def update(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
# update is not ok because it mutates the object
# __add__ is ok because it creates a new object
# while the new object is under construction, it's ok to mutate it
def __add__(self, right):
result = hashdict(self)
dict.update(result, right)
return result
if __name__ == "__main__":
import doctest
doctest.testmod()
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.
class hashabledict(dict):
def __hash__(self):
return hash(tuple(sorted(self.items())))
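In case the "obvious reasons" are not obvious, here is a rough Python 3 sketch of what goes wrong. The outer dict remembers the hash the key had when it was inserted, so after a mutation an equal copy of the old contents hashes to the right slot but no longer compares equal. The variable names are just for the demo.

```python
class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))


key = hashabledict(a=1)
cache = {key: "salad"}

# Before mutation an equal copy finds the entry
found_before = hashabledict(a=1) in cache

key["b"] = 2  # mutating a dict that is already in use as a key

# Same hash as at insertion time, but the stored key no longer equals {'a': 1}
found_old_contents = hashabledict(a=1) in cache

print(found_before, found_old_contents, len(cache))
```

The entry is still sitting in cache (its length is still 1), and a copy of the new contents will normally miss it as well because it hashes differently from the remembered value.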
Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:
class hashabledict(dict):
def __key(self):
return tuple((k,self[k]) for k in sorted(self))
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
return self.__key() == other.__key()
If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes a hundredfold -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
class Hashabledict(dict):
def __hash__(self):
return hash(frozenset(self))
Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:
class Hashabledict(dict):
def __hash__(self):
return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues will access the values directly and avoid the many memory allocator calls used by items to form many new key/value tuples in memory every time you do a lookup.
The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
A reasonably clean, straightforward implementation is
import collections
class FrozenDict(collections.Mapping):
"""Don't forget the docstrings!!"""
def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)
def __iter__(self):
return iter(self._d)
def __len__(self):
return len(self._d)
def __getitem__(self, key):
return self._d[key]
def __hash__(self):
return hash(tuple(sorted(self._d.iteritems())))
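On current Python 3 the ABC lives in collections.abc (the collections.Mapping alias was removed in 3.10) and iteritems() is gone, so a port might look like the sketch below. I swapped tuple(sorted(...)) for frozenset(...) in __hash__ so the keys don't have to be sortable; that is my choice here, not part of the answer above.

```python
from collections.abc import Mapping


class FrozenDict(Mapping):
    """Read-only mapping that can be used as a dict key (values must be hashable)."""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # Order-independent, so equal mappings hash equally
        return hash(frozenset(self._d.items()))


cache = {FrozenDict(a=1, b=2): "quux"}
print(cache[FrozenDict(b=2, a=1)])   # __eq__ comes from the Mapping mixin

try:
    FrozenDict(a=1)["a"] = 2         # Mapping provides no __setitem__
except TypeError as exc:
    print("rejected:", exc)
```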
I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; there's virtually no escape from the problem that dicts are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. Although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.
What if you jam key, value pairs into a frozenset? What would that require, how would it work?
Part 1, you need a way of encoding the 'item's in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
def __hash__(self):
return hash((self.key, None))
def __eq__(self, other):
if type(self) != type(other):
return NotImplemented
return self.key == other.key
def __repr__(self):
return repr((self.key, self.value))
That alone puts you in spitting distance of an immutable mapping:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object, which one ends up in the return value is undefined, so we'll have to go through some more gyrations.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:
class Thief(object):
def __init__(self, key):
self.key = key
def __hash__(self):
return hash(pair(self.key, None))
def __eq__(self, other):
self.value = other.value
return pair(self.key, None) == other
Which results in the less problematic:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:
class FrozenDict(frozenset, collections.Mapping):
def __new__(cls, seq=()):
return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
def __getitem__(self, key):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
raise KeyError(key)
def __eq__(self, other):
if not isinstance(other, FrozenDict):
return dict(self.iteritems()) == other
if len(self) != len(other):
return False
for key, value in self.iteritems():
try:
if value != other[key]:
return False
except KeyError:
return False
return True
def __hash__(self):
return hash(frozenset(self.iteritems()))
def get(self, key, default=None):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
return default
def __iter__(self):
for item in frozenset.__iter__(self):
yield item.key
def iteritems(self):
for item in frozenset.__iter__(self):
yield (item.key, item.value)
def iterkeys(self):
for item in frozenset.__iter__(self):
yield item.key
def itervalues(self):
for item in frozenset.__iter__(self):
yield item.value
def __contains__(self, key):
return frozenset.__contains__(self, pair(key, None))
has_key = __contains__
def __repr__(self):
return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
@classmethod
def fromkeys(cls, keys, value=None):
return cls((key, value) for key in keys)
which, ultimately, does answer my own question:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
The accepted answer by @Unknown, as well as the answer by @AlexMartelli, work perfectly fine, but only under the following constraints:
The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
Keys must support comparison operation. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
The comparison operator on keys imposes total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Therefore, sorting the items of a dictionary with such keys can result in an arbitrary order, and therefore will violate the rule that equal objects must have the same hash value.
The much faster answer by @ObenSonne lifts constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).
The faster yet answer by @RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if:
Most of the (non-equal) dictionaries that need to be hashed do not have identical .keys().
If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
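The first two constraints are easy to check on Python 3 with a quick sketch. The two class names and the raises_type_error helper are made up for the demo; the hash bodies mirror the sorted-tuple and frozenset-of-items approaches from the other answers.

```python
class SortedHashDict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))


class FrozensetHashDict(dict):
    def __hash__(self):
        return hash(frozenset(self.items()))


def raises_type_error(obj):
    """Return True if hash(obj) raises TypeError, False otherwise."""
    try:
        hash(obj)
    except TypeError:
        return True
    return False


mixed_keys = {"a": "a", 1: 1}   # keys that cannot be ordered against each other
list_value = {"a": [1, 2]}      # an unhashable value

print(raises_type_error(SortedHashDict(mixed_keys)))     # sorting str vs int fails
print(raises_type_error(FrozensetHashDict(mixed_keys)))  # no sorting needed
print(raises_type_error(SortedHashDict(list_value)))     # value must be hashable
print(raises_type_error(FrozensetHashDict(list_value)))  # same limitation
```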
I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.
I'd much appreciate any feedback on this, since I only tested this lightly so far.
# python 3.4
import collections
import collections.abc
import operator
import sys
import itertools
import reprlib

# a wrapper to make an object hashable, while preserving equality
class AutoHash:
    # for each known container type, we can optionally provide a tuple
    # specifying: type, transform, aggregator
    # even immutable types need to be included, since their items
    # may make them unhashable
    # transformation may be used to enforce the desired iteration
    # the result of a transformation must be an iterable
    # default: no change; for dictionaries, we use .items() to see values
    # usually transformation choice only affects efficiency, not correctness
    # aggregator is the function that combines all items into one object
    # default: frozenset; for ordered containers, we can use tuple
    # aggregator choice affects both efficiency and correctness
    # e.g., using a tuple aggregator for a set is incorrect,
    # since identical sets may end up with different hash values
    # frozenset is safe since at worst it just causes more collisions
    # unfortunately, no collections.abc class is available that helps
    # distinguish ordered from unordered containers
    # so we need to just list them out manually as needed
    type_info = collections.namedtuple(
        'type_info',
        'type transformation aggregator')
    ident = lambda x: x
    # order matters; first match is used to handle a datatype
    known_types = (
        # dict also handles defaultdict
        type_info(dict, lambda d: d.items(), frozenset),
        # no need to include set and frozenset, since they are fine with defaults
        # note: OrderedDict is a dict subclass, so the dict entry above matches it first
        type_info(collections.OrderedDict, ident, tuple),
        type_info(list, ident, tuple),
        type_info(tuple, ident, tuple),
        type_info(collections.deque, ident, tuple),
        # collections.Iterable was removed in Python 3.10; use collections.abc
        type_info(collections.abc.Iterable, ident, frozenset)  # other iterables
    )

    # hash_func can be set to replace the built-in hash function
    # cache can be turned on; if it is, cycles will be detected,
    # otherwise cycles in a data structure will cause failure
    def __init__(self, data, hash_func=hash, cache=False, verbose=False):
        self._data = data
        self.hash_func = hash_func
        self.verbose = verbose
        self.cache = cache
        # cache objects' hashes for performance and to deal with cycles
        if self.cache:
            self.seen = {}

    def hash_ex(self, o):
        # note: isinstance(o, Hashable) won't check inner types
        try:
            if self.verbose:
                print(type(o),
                      reprlib.repr(o),
                      self.hash_func(o),
                      file=sys.stderr)
            return self.hash_func(o)
        except TypeError:
            pass
        # we let built-in hash decide if the hash value is worth caching
        # so we don't cache the built-in hash results
        if self.cache and id(o) in self.seen:
            return self.seen[id(o)][0]  # found in cache
        # check if o can be handled by decomposing it into components
        for typ, transformation, aggregator in AutoHash.known_types:
            if isinstance(o, typ):
                # another option is:
                # result = reduce(operator.xor, map(self.hash_ex, transformation(o)))
                # but collisions are more likely with xor than with frozenset
                # e.g. hash_ex([1,2,3,4])==0 with xor
                try:
                    # try to frozenset the actual components, it's faster
                    h = self.hash_func(aggregator(transformation(o)))
                except TypeError:
                    # components not hashable with built-in;
                    # apply our extended hash function to them
                    h = self.hash_func(
                        aggregator(map(self.hash_ex, transformation(o))))
                if self.cache:
                    # storing the object too, otherwise memory location will be reused
                    self.seen[id(o)] = (h, o)
                if self.verbose:
                    print(type(o), reprlib.repr(o), h, file=sys.stderr)
                return h
        raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

    def __hash__(self):
        return self.hash_ex(self._data)

    def __eq__(self, other):
        # short circuit to save time
        if self is other:
            return True
        # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
        # 2) any other situation => lhs.__eq__ will be called first
        # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
        # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
        # case 2. neither side is a subclass of the other; self is lhs
        # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
        # case 3. neither side is a subclass of the other; self is rhs
        # => we can't compare to another type, and the other side already tried and failed;
        # we should return False, but NotImplemented will have the same effect
        # any other case: we won't reach the __eq__ code in this class, no need to worry about it
        if isinstance(self, type(other)):  # identifies case 1
            return self._data == other._data
        else:  # identifies cases 2 and 3
            return NotImplemented

d1 = {'a': [1, 2], 2: {3: 4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))
d = AutoHash(dict(a=1, b=2, c=3, d=[4, 5, 6, 7], e='a string of chars'), cache=True, verbose=True)
print(hash(d))
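The aggregator comments in the code above (frozenset for unordered containers, tuple for ordered ones) can be illustrated with a much smaller stand-alone sketch. `deep_hash` is a hypothetical helper written for this illustration, not part of AutoHash: it has no caching, no cycle detection, and it picks the aggregator from the `collections.abc` category instead of a type table.

```python
from collections.abc import Mapping, Sequence, Set

def deep_hash(obj):
    """Hypothetical helper: hash obj, decomposing unhashable containers."""
    try:
        return hash(obj)  # already hashable, nothing to decompose
    except TypeError:
        pass
    if isinstance(obj, Mapping):
        # unordered: frozenset makes the result independent of insertion order
        return hash(frozenset((k, deep_hash(v)) for k, v in obj.items()))
    if isinstance(obj, Set):
        return hash(frozenset(deep_hash(x) for x in obj))
    if isinstance(obj, Sequence):
        # ordered: tuple keeps the position of every item
        return hash(tuple(deep_hash(x) for x in obj))
    raise TypeError('Object {!r} of type {} not hashable'.format(obj, type(obj)))

d1 = {'a': [1, 2], 'b': {3: 4}}
d2 = {'b': {3: 4}, 'a': [1, 2]}  # same content, different insertion order

print(d1 == d2)                                # True
print(deep_hash(d1) == deep_hash(d2))          # True
print(tuple(d1.items()) == tuple(d2.items()))  # False
```

The last line shows why a tuple aggregator would be wrong for a dict: the two equal dictionaries produce different tuples, so they would generally hash differently.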
You might also want to add these two methods to get the v2 pickling protocol to work with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.
def __setstate__(self, objstate):
    for k, v in objstate.items():
        dict.__setitem__(self, k, v)
def __reduce__(self):
    return (hashdict, (), dict(self),)
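A round trip through pickle might look like the sketch below. The `hashdict` class here is only a stand-in with an assumed shape (a dict subclass that is hashable and blocks item assignment); the real class from the other answer may differ. It uses Python 3's `pickle` module in place of cPickle.

```python
import pickle

class hashdict(dict):
    """Stand-in (assumed shape): hashable dict that refuses item assignment."""

    def __hash__(self):
        return hash(frozenset(self.items()))

    def __setitem__(self, key, value):
        raise TypeError('hashdict does not support item assignment')

    def __setstate__(self, objstate):
        # bypass the blocked __setitem__ when restoring the contents
        for k, v in objstate.items():
            dict.__setitem__(self, k, v)

    def __reduce__(self):
        # (callable, args, state): build an empty hashdict, then call __setstate__
        return (hashdict, (), dict(self))

original = hashdict(a=1, b=2)
restored = pickle.loads(pickle.dumps(original, protocol=2))

print(type(restored) is hashdict)        # True
print(restored == original)              # True
print(hash(restored) == hash(original))  # True
```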
Serialize the dict as a string with the json package:
import json
d = {'a': 1, 'b': 2}
s = json.dumps(d)
Restore the dict when you need it:
d2 = json.loads(s)
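If the string is going to be used as a dictionary key, it helps to make it independent of insertion order. A minimal sketch, assuming the dict contains only JSON-serializable values and string keys (JSON turns other key types into strings); `dict_key` is a hypothetical helper name.

```python
import json

def dict_key(d):
    # sort_keys=True gives the same string regardless of insertion order
    return json.dumps(d, sort_keys=True)

cache = {}
cache[dict_key({'a': 1, 'b': 2})] = 'whatever'

# An equal dict built in a different order finds the same entry.
print(cache[dict_key({'b': 2, 'a': 1})])  # whatever

# The key can be turned back into a dict when needed.
print(json.loads(dict_key({'a': 1, 'b': 2})) == {'a': 1, 'b': 2})  # True
```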
If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:
cache[id(rule)] = "whatever"
since id() is unique among all dictionaries that are alive at the same time
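A short sketch of what this does and does not give you (the variable names are only illustrative): the lookup works through the same object, but an equal copy has a different id and misses the cache.

```python
cache = {}
rule = {'foo': 'bar'}
cache[id(rule)] = 'whatever'

same_object = rule       # another name for the same dictionary
equal_copy = dict(rule)  # equal content, but a distinct object

print(id(same_object) in cache)  # True
print(equal_copy == rule)        # True
print(id(equal_copy) in cache)   # False
```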
EDIT:
Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like
cache[ 'foo:bar' ] = 'baz'
If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like
cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )
I guess the advantage of this is that you wouldn't have to write as much code.
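The string-key idea could be sketched like this. `make_key` is a hypothetical helper, and it assumes keys and values are strings that contain neither ':' nor ','; anything richer would need a real serializer.

```python
def make_key(d):
    # Hypothetical helper: join sorted "key:value" pairs into one string.
    return ','.join('{}:{}'.format(k, v) for k, v in sorted(d.items()))

cache = {}
rule = {'foo': 'bar'}

# Store the dictionary next to the value so it can be recovered from the key.
cache[make_key(rule)] = (rule, 'baz')

original, value = cache['foo:bar']
print(original)  # {'foo': 'bar'}
print(value)     # baz
```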