I have a bit of code I was hoping to clean up/shrink down.
I have a function which receives a key and returns the corresponding value from either of two dictionaries, or a default value if the key is present in neither.
Here is a verbose (but explicit) version of the problem:
def lookup_function(key):
    if key.lower() in Dictionary_One: return Dictionary_One[key.lower()]
    if key.lower() in Dictionary_Two: return Dictionary_Two[key.lower()]
    return Globally_Available_Default_Value
Not horrifying to look at. Just seems a bit voluminous to me.
So, assuming that both dictionaries and the default value are available from the global scope, and that the key must be a string in lowercase, what is the cleanest, shortest, most graceful, and most pythonic way of achieving this?
Have fun!
You can shorten that to:
def lookup_function(key):
    key = key.lower()
    return Dictionary_One.get(key, Dictionary_Two.get(key, Globally_Available_Default_Value))
Alternative answer using an "on the fly" mixed dictionary:
dict(Dictionary_Two, **Dictionary_One).get(key.lower(), Globally_Available_Default_Value)
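On Python 3.9 and later, the same one-off merge can be written with the dict union operator (a sketch; the right-hand operand wins on duplicate keys, so Dictionary_One takes priority, matching the version above):

# Python 3.9+: entries from the right-hand operand win on collisions
(Dictionary_Two | Dictionary_One).get(key.lower(), Globally_Available_Default_Value)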
Some good answers. Here's mine!
return {**{key.lower(): default_value}, **dictionary_one, **dictionary_two}[key.lower()]
Just requires changing the names a bit to be a little shorter. This method would NOT be recommended for long dictionaries, but is perfectly workable for short ones. dictionary_one would override the default value if it contains the key, with dictionary_two then overriding any keys in dictionary_one.
Not perfect in all cases, but for general use this is what I'm going to go with.
Lesser/shorter code is not necessarily better as there are other considerations, such as generality, reusability, and extensibility.
Raymond Hettinger contributed a recipe to the Python Cookbook™, Second Edition for a Chainmap class that automates chained dictionary lookups which I think would be a very elegant way of doing what you want (and the class is reusable).
Update: Just discovered that the Python 3 collections module contains a ChainMap, so it ought to be easier to use it instead of writing your own as shown below.
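A minimal sketch with the stdlib version, using the same names as the sample usage below:

from collections import ChainMap

chmap = ChainMap(dictionary_one, dictionary_two)

def lookup_function(key):
    # ChainMap searches its maps left to right; .get supplies the fallback
    return chmap.get(key.lower(), globally_available_default_value)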
class Chainmap(object):
    def __init__(self, *mappings):
        # record the sequence of mappings into which we must look
        self._mappings = mappings

    def __getitem__(self, key):
        # try looking up into each mapping in sequence
        for mapping in self._mappings:
            try:
                return mapping[key]
            except KeyError:
                pass
        # 'key' not found in any mapping, so raise KeyError exception
        raise KeyError(key)

    def get(self, key, default=None):
        # return self[key] if present, otherwise 'default'
        try:
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        # return True if 'key' is present in self, otherwise False
        try:
            self[key]
            return True
        except KeyError:
            return False
Sample usage:
dictionary_one = {'key1': 1, 'key2': 2}
dictionary_two = {'key2': 3, 'key4': 4}
globally_available_default_value = 42

chmap = Chainmap(dictionary_one, dictionary_two)

def lookup_function(key):
    try:
        return chmap[key.lower()]
    except KeyError:
        return globally_available_default_value

print(lookup_function('Key1'))  # -> 1
print(lookup_function('Key2'))  # -> 2
print(lookup_function('Key3'))  # -> 42
print(lookup_function('Key4'))  # -> 4
I'm looking for a fast, clean and pythonic way of slicing custom-made objects while preserving their type after the operation.
To give you some context: I have to deal with a lot of semi-unstructured data, and to handle it I work with lists of dictionaries. To streamline some operations I have created an "ld" object that inherits from "list". Amongst its many capabilities it checks that the data was provided in the correct format. Let's simplify it by saying it ensures that all entries of the list are dictionaries containing some key "a", as shown below:
class ld(list):
    def __init__(self, x):
        list.__init__(self, x)
        self.__init_check()

    def __init_check(self):
        for record in self:
            if isinstance(record, dict) and "a" in record:
                pass
            else:
                raise TypeError("not all entries are dictionaries or have the key 'a'")
        return
This behaves correctly when the data is as desired and initialises ld:
tt = ld([{"a": 1, "b": 2}, {"a": 4}, {"a": 6, "c": 67}])
type(tt)
It also does the right thing when the data is incorrect:
ld([{"w": 1}])
ld([1, 2, 3])
However, the problem comes when I proceed to slice the object:
type(tt[:2])
tt[:2] is a list and no longer has all the methods and attributes that I created in the full-fledged ld object. I could convert the slice back into an ld, but that means it would have to go through the entire initial data-checking process again, slowing down computations a lot.
Here is the solution I came up with to speed things up:
class ld(list):
    def __init__(self, x, safe=True):
        list.__init__(self, x)
        self.__init_check(safe)

    def __init_check(self, is_safe):
        if not is_safe:
            return
        for record in self:
            if isinstance(record, dict) and "a" in record:
                pass
            else:
                raise TypeError("not all entries are dictionaries or have the key 'a'")
        return

    def __getslice__(self, i, j):
        return ld(list.__getslice__(self, i, j), safe=False)
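(Note that __getslice__ exists only on Python 2; it was removed in Python 3, where slicing goes through __getitem__ with a slice object. A rough Python 3 equivalent of the hook above might be:)

def __getitem__(self, index):
    result = list.__getitem__(self, index)
    # slices come back as plain lists; rewrap without re-running the check
    if isinstance(index, slice):
        return ld(result, safe=False)
    return result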
Is there a cleaner and more pythonic way of going about it?
Thanks in advance for your help.
I don't think subclassing list to verify the shape or type of its contents in general is the right approach. The list pointedly doesn't care about its contents, and implementing a class whose constructor behavior varies based on flags passed to it is messy. If you need a constructor that verifies inputs, just do your check logic in a function that returns a list.
class InvalidItemError(Exception):
    """Raised when an item is not a dict containing the key 'a'."""

def make_verified_list(items):
    """
    :type items: list[object]
    :rtype: list[dict]
    """
    new_list = []
    for item in items:
        if not verify_item(item):
            raise InvalidItemError(item)
        new_list.append(item)
    return new_list
def verify_item(item):
    """
    :type item: object
    :rtype: bool
    """
    return isinstance(item, dict) and "a" in item
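Usage would look something like this (a quick sketch with the question's example data):

items = make_verified_list([{"a": 1, "b": 2}, {"a": 4}, {"a": 6, "c": 67}])
print(type(items[:2]))  # a plain list; slicing behaves normally

make_verified_list([{"w": 1}])  # raises InvalidItemError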
Take this approach and you won't find yourself struggling with the behavior of core data structures.
I'm running Python 2.7 and I'm trying to create a custom FloatEncoder subclass of JSONEncoder. I've followed many examples such as this but none seem to work. Here is my FloatEncoder class:
class FloatEncoder(JSONEncoder):
    def _iterencode(self, obj, markers=None):
        if isinstance(obj, float):
            return (str(obj) for obj in [obj])
        return super(FloatEncoder, self)._iterencode(obj, markers)
And here is where I call json.dumps:
with patch("utils.fileio.FloatEncoder") as float_patch:
    for val, res in ((.00123456, '0.0012'),
                     (.00009, '0.0001'),
                     (0.99999, '1.0000'),
                     ({'hello': 1.00001, 'world': [True, 1.00009]},
                      '{"world": [true, 1.0001], "hello": 1.0000}')):
        untrusted = dumps(val, cls=FloatEncoder)
        self.assertTrue(float_patch._iterencode.called)
        self.assertEqual(untrusted, res)
The first assertion fails, meaning that _iterencode is not being executed. After reading the JSON documentation, I tried overriding the default() method, but that was not being called either.
You seem to be trying to round float values down to 4 decimal places while generating JSON (based on your test examples).
The JSONEncoder shipping with Python 2.7 does not have an _iterencode method, so that's why it's not getting called. Also, a quick glance at json/encoder.py suggests that this class is written in a way that makes it difficult to change the float encoding behavior. Perhaps it would be better to separate concerns and round the floats before doing JSON serialization.
EDIT: Alex Martelli also supplies a monkey-patch solution in a related answer. The problem with that approach is that it introduces a global modification to the json library's behavior that may unwittingly affect some other piece of code in your application that was written with the assumption that floats were encoded without rounding.
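For reference, the monkey-patch looked roughly like this on Python 2 (a sketch; json.encoder.FLOAT_REPR is consulted only by the pure-Python encoder, so on 2.7 you may also have to disable the C speedups as shown, and the hook is gone in Python 3):

import json

# Python 2 only: force the pure-Python encoder, which honors FLOAT_REPR
json.encoder.c_make_encoder = None
json.encoder.FLOAT_REPR = lambda f: format(f, '.4f')

print(json.dumps(1.00001))  # 1.0000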
Try this:
from collections import Mapping, Sequence
from unittest import TestCase, main
from json import dumps

def round_floats(o):
    if isinstance(o, float):
        return round(o, 4)
    elif isinstance(o, basestring):
        return o
    elif isinstance(o, Sequence):
        return [round_floats(item) for item in o]
    elif isinstance(o, Mapping):
        return dict((key, round_floats(value)) for key, value in o.iteritems())
    else:
        return o
class TestFoo(TestCase):
    def test_it(self):
        for val, res in ((.00123456, '0.0012'),
                         (.00009, '0.0001'),
                         (0.99999, '1.0'),
                         ({'hello': 1.00001, 'world': [True, 1.00009]},
                          '{"world": [true, 1.0001], "hello": 1.0}')):
            untrusted = dumps(round_floats(val))
            self.assertEqual(untrusted, res)

if __name__ == '__main__':
    main()
Don't define _iterencode, define default, as shown in the third answer on that page.
For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
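Pinned down, that might look like this (a sketch; stable_hash is just an illustrative name):

import hashlib
import json

def stable_hash(d):
    # canonical form: sorted keys, fixed separators, ASCII-only output
    canonical = json.dumps(d, sort_keys=True, separators=(',', ':'), ensure_ascii=True)
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()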
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below on why this approach might not produce a stable result.
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that contains
    only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass

foo = Foo()
print(hash(foo))  # 1209812346789
foo.a = 1
print(hash(foo))  # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass

foo = Foo()
print(make_hash(foo.__dict__))  # 1209812346789
foo.a = 1
print(make_hash(foo.__dict__))  # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print(make_hash(Foo.__dict__))  # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print(type(Foo.__dict__))  # <type 'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

      make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

      make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
print(make_hash(func.__code__))
# -7666086133114527897

print(make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539)

print(make_hash([func.__code__, func.__dict__, func.__name__]))
# (-7666086133114527897, 3527539, -509551383349783210)
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64
def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o
o = dict(x=1, b=2, c=[3, 4, 5], d={6, 7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())

    if isinstance(o, list):
        return tuple([freeze(v) for v in o])

    return o

def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
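For example, two equal nested structures built in different key orders hash the same:

nested = {'a': [1, 2], 'b': {'c': 3}}
print(make_hash(nested) == make_hash({'b': {'c': 3}, 'a': [1, 2]}))  # True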
MD5 HASH
The method which resulted in the most stable results for me was using md5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json
def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
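Key order then no longer matters:

print(dict_hash({'a': 1, 'b': 2}) == dict_hash({'b': 2, 'a': 1}))  # True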
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor
class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python will bitwise truncate the results.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
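With a __hash__ in place, the dict can sit in a set or act as a dict key (a quick sketch):

d1 = hashable({'a': 1, 'b': 2})
d2 = hashable({'b': 2, 'a': 1})

print(hash(d1) == hash(d2))  # True: xor over items is order-independent
print(d2 in {d1})            # True: equal dicts with equal hashes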
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')  # encode: hashlib wants bytes on Python 3
    ).hexdigest()
    return result
Use DeepHash from DeepDiff Module
from deepdiff import DeepHash
obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)) I would prefer this quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like datetime and others that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc
def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
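For instance (assuming the third-party frozendict package is installed, as above):

d = {'a': [1, 2], 'b': {'c': {3, 4}}}
# lists become tuples, sets become frozensets, dicts become frozendicts
print(hash(make_hashable(d)))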
If you want to support more types, use functools.singledispatch (the annotation-based register requires Python 3.7+):
import functools

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (it only works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share.
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys at the top-level dict, you can use pickle (protocol 5) and hash the resulting bytes object. If you need safety, you can use a safe serializer.
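A sketch of that idea (note that pickle bytes are not guaranteed to be identical across Python versions, so only reuse such hashes within one environment; pickle_hash is an illustrative name):

import hashlib
import pickle

def pickle_hash(obj):
    # protocol 5 requires Python 3.8+
    return hashlib.sha256(pickle.dumps(obj, protocol=5)).hexdigest()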
I do it like this:
hash(str(my_dict))
import datetime, json
x = {'alpha': {datetime.date.today(): 'abcde'}}
print json.dumps(x)
The above code fails with a TypeError since keys of JSON objects need to be strings. The json.dumps function has a parameter called default that is called when the value of a JSON object raises a TypeError, but there seems to be no way to do this for the key. What is the most elegant way to work around this?
You can extend json.JSONEncoder to create your own encoder which will be able to deal with datetime.datetime objects (or objects of any type you desire) in such a way that a string is created which can be reproduced as a new datetime.datetime instance. I believe it should be as simple as having json.JSONEncoder call repr() on your datetime.datetime instances.
The procedure on how to do so is described in the json module docs.
The json module checks the type of each value it needs to encode, and by default it only knows how to handle dicts, lists, tuples, strs, unicode objects, ints, longs, floats, booleans and None :-)
Also of importance for you might be the skipkeys argument to the JSONEncoder.
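With skipkeys=True, non-string keys are silently dropped instead of raising, e.g.:

import datetime, json

x = {'alpha': {datetime.date.today(): 'abcde'}, 'beta': 1}
print(json.dumps(x, skipkeys=True))  # {"alpha": {}, "beta": 1}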
After reading your comments I have concluded that there is no easy solution to have JSONEncoder encode the keys of dictionaries with a custom function. If you are interested, you can look at the source: the method iterencode() calls _iterencode(), which calls _iterencode_dict(), which is where the TypeError gets raised.
Easiest for you would be to create a new dict with isoformatted keys like this:
import datetime, json

D = {datetime.datetime.now(): 'foo',
     datetime.datetime.now(): 'bar'}

new_D = {}
for k, v in D.iteritems():
    new_D[k.isoformat()] = v

json.dumps(new_D)
Which returns '{"2010-09-15T23:24:36.169710": "foo", "2010-09-15T23:24:36.169723": "bar"}'. For niceties, wrap it in a function :-)
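Something along these lines (a sketch, written Python 2 style to match the above; dumps_with_date_keys is just an illustrative name):

def dumps_with_date_keys(d):
    new_d = {}
    for k, v in d.iteritems():
        # turn datetime/date keys into ISO strings, leave other keys alone
        new_d[k.isoformat() if hasattr(k, 'isoformat') else k] = v
    return json.dumps(new_d)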
http://jsonpickle.github.io/ might be what you want. When facing a similar issue, I ended up doing:
to_save = jsonpickle.encode(THE_THING, unpicklable=False, max_depth=4, make_refs=False)
You can do:
x = {'alpha': {datetime.date.today().strftime('%d-%m-%Y'): 'abcde'}}
If you really need to do it, you can monkeypatch json.encoder:
import json

from _json import encode_basestring_ascii  # used when ensure_ascii=True (the default), when you want everything to be ASCII
from _json import encode_basestring        # used in any other case

def _patched_encode_basestring(o):
    """
    Monkey-patch Python's json serializer so it can serialize keys that are not strings!
    You can monkey-patch the ASCII one the same way.
    """
    if isinstance(o, MyClass):
        return my_serialize(o)
    return encode_basestring(o)

json.encoder.encode_basestring = _patched_encode_basestring
JSON only accepts a limited set of data types for encoding. As @supakeen mentioned, you can extend the JSONEncoder class in order to encode any values inside a dictionary, but not keys! If you want to encode keys, you have to do it on your own.
I used a recursive function in order to encode tuple-keys as strings and recover them later.
Here an example:
import copy
from typing import Any

def _tuple_to_string(obj: Any) -> Any:
    """Serialize tuple-keys to string representation. A tuple will obtain a leading '__tuple__' prefix and be decomposed into a list representation.

    Args:
        obj (Any): Typically a dict, tuple, list, int, or string.

    Returns:
        Any: Input object with serialized tuples.
    """
    # deep copy object to avoid manipulation during iteration
    obj_copy = copy.deepcopy(obj)
    # if the object is a dictionary
    if isinstance(obj, dict):
        # iterate over every key
        for key in obj:
            # set for later to avoid modification in later iterations when this var does not get overwritten
            serialized_key = None
            # if key is a tuple
            if isinstance(key, tuple):
                # stringify the key
                serialized_key = f"__tuple__{list(key)}"
                # replace old key with encoded key
                obj_copy[serialized_key] = obj_copy.pop(key)
            # if the key was modified
            if serialized_key is not None:
                # do it again for the next nested dictionary
                obj_copy[serialized_key] = _tuple_to_string(obj[key])
            # else, just do it for the next dictionary
            else:
                obj_copy[key] = _tuple_to_string(obj[key])
    return obj_copy
This will turn a tuple of the form ("blah", "blub") into "__tuple__['blah', 'blub']" so that you can dump it using json.dumps() or json.dump(). You can use the leading "__tuple__" to detect tuples during decoding. Therefore, I used this function:
import copy
from typing import Any

def _string_to_tuple(obj: Any) -> Any:
    """Convert serialized tuples back to original representation. Tuples need to have a leading "__tuple__" string.

    Args:
        obj (Any): Typically a dict, tuple, list, int, or string.

    Returns:
        Any: Input object with recovered tuples.
    """
    # deep copy object to avoid manipulation during iteration
    obj_copy = copy.deepcopy(obj)
    # if the object is a dictionary
    if isinstance(obj, dict):
        # iterate over every key
        for key in obj:
            # set for later to avoid modification in later iterations when this var does not get overwritten
            serialized_key = None
            # if key is a serialized tuple starting with the "__tuple__" prefix
            if isinstance(key, str) and key.startswith("__tuple__"):
                # decode it to a tuple
                serialized_key = tuple(key.split("__tuple__")[1].strip("[]").replace("'", "").split(", "))
                # if the key is a number in string representation
                if all(entry.isdigit() for entry in serialized_key):
                    # convert to integers
                    serialized_key = tuple(map(int, serialized_key))
                # replace old key with decoded key
                obj_copy[serialized_key] = obj_copy.pop(key)
            # if the key was modified
            if serialized_key is not None:
                # do it again for the next nested dictionary
                obj_copy[serialized_key] = _string_to_tuple(obj[key])
            # else, just do it for the next dictionary
            else:
                obj_copy[key] = _string_to_tuple(obj[key])
    # if a list was found instead, recurse into its items
    elif isinstance(obj, list):
        obj_copy = [_string_to_tuple(item) for item in obj]
    return obj_copy
Insert your custom logic for en-/decoding your instances by changing the
if isinstance(key, tuple):
    # stringify the key
    serialized_key = f"__tuple__{list(key)}"
block in the _tuple_to_string function, or the corresponding block from the _string_to_tuple function, respectively:
if isinstance(key, str) and key.startswith("__tuple__"):
    # decode it to a tuple
    serialized_key = tuple(key.split("__tuple__")[1].strip("[]").replace("'", "").split(", "))
    # if the key is a number in string representation
    if all(entry.isdigit() for entry in serialized_key):
        # convert to integers
        serialized_key = tuple(map(int, serialized_key))
Then, you can use it as usual:
>>> dct = {("L1", "L1"): {("L2", "L2"): "foo"}}
>>> json.dumps(_tuple_to_string(dct))
... {"__tuple__['L1', 'L1']": {"__tuple__['L2', 'L2']": "foo"}}
Hope I could help!