In my script I work with a large and complex object (a multi-dimensional list that contains strings, dictionaries, and instances of custom classes). I need to copy, pickle (cache) and unpickle it, as well as send it between child processes through an MPI interface. At some points I start to doubt that the data transfer is error-free, i.e. whether I end up with the same object.
Therefore, I want to calculate its hash sum or some other type of fingerprint. I know that there is, for example, the hashlib library; however, it is limited in the object types it accepts:
>>> import hashlib
>>> a = "123"
>>> hashlib.sha224(a.encode()).hexdigest()
'78d8045d684abd2eece923758f3cd781489df3a48e1278982466017f'
>>> a = [1, 2, 3]
>>> hashlib.sha224(a).hexdigest()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required
Thus, the question: is there some analog of this function that works with objects of any type?
One option would be to recursively convert all elements of the structure into hashable counterparts, i.e. lists into tuples, dicts and objects into frozensets, and then simply apply hash() to the whole thing. An illustration:
def to_hashable(s):
    if isinstance(s, dict):
        return frozenset((x, to_hashable(y)) for x, y in s.items())
    if isinstance(s, list):
        return tuple(to_hashable(x) for x in s)
    if isinstance(s, set):
        return frozenset(s)
    if isinstance(s, MyObject):
        d = {'__class__': s.__class__.__name__}
        d.update(s.__dict__)
        return to_hashable(d)
    return s

class MyObject:
    pass

class X(MyObject):
    def __init__(self, zzz):
        self.zzz = zzz
my_list = [
1,
{'a': [1,2,3], 'b': [4,5,6]},
{1,2,3,4,5},
X({1:2,3:4}),
X({5:6,7:8})
]
print hash(to_hashable(my_list))
my_list2 = [
1,
{'b': [4,5,6], 'a': [1,2,3]},
{5,4,3,2,1},
X({3:4,1:2}),
X({7:8,5:6})
]
print hash(to_hashable(my_list2)) # the same as above
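One caveat with hash(): in Python 3 the hashes of strings are randomized per interpreter process by default, so the value may differ between MPI ranks or between runs. A possible workaround, sketched below, is to build a deterministic text form of the structure and feed it to hashlib. The names canonical and fingerprint are made up for this illustration, and the sketch assumes that every leaf value has a stable repr().

```python
import hashlib

def canonical(obj):
    """Build a deterministic text form of obj (hypothetical helper)."""
    if isinstance(obj, dict):
        # Sort entries so that insertion order does not matter.
        items = sorted((canonical(k), canonical(v)) for k, v in obj.items())
        return "dict(" + ",".join(k + "=" + v for k, v in items) + ")"
    if isinstance(obj, (list, tuple)):
        # Sequences keep their order; the type name separates list from tuple.
        return type(obj).__name__ + "(" + ",".join(canonical(x) for x in obj) + ")"
    if isinstance(obj, (set, frozenset)):
        return "set(" + ",".join(sorted(canonical(x) for x in obj)) + ")"
    if hasattr(obj, "__dict__"):
        # Custom objects: class name plus their attribute dict.
        return type(obj).__name__ + canonical(vars(obj))
    # Leaves: assumes repr() is stable across processes.
    return type(obj).__name__ + ":" + repr(obj)

def fingerprint(obj):
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonical(obj).encode("utf-8")).hexdigest()

class X(object):
    def __init__(self, zzz):
        self.zzz = zzz

first = [1, {'a': [1, 2, 3], 'b': [4, 5, 6]}, {1, 2, 3}, X({1: 2, 3: 4})]
second = [1, {'b': [4, 5, 6], 'a': [1, 2, 3]}, {3, 2, 1}, X({3: 4, 1: 2})]
print(fingerprint(first) == fingerprint(second))  # True
```

The digest can be sent along with the object and recomputed on the receiving side; a mismatch signals that the structure changed in transit.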
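pickle.dumps(...) returns a string (bytes in Python 3), which is a hashable object. Note that equal objects do not always pickle to identical bytes (for example, dicts built in a different order), so this checks the serialized form rather than value equality. You can do it as follows: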
import pickle
a=[1,2,3,4]
h=pickle.dumps(a)
print hash(h)
# or like this
from hashlib import sha512
print sha512(h).hexdigest()
c=pickle.loads(h)
assert c==a
There is a problem with packing and unpacking tuples. As far as I know, msgpack does not distinguish between list and tuple, and there is no hook to force a list or tuple to be packed as an ExtType. This causes frustrating problems.
Assume that I want a generic solution for all types of objects, not only for Period. It would be simple to require that key has a fixed type for Period, but that is not what I want to do.
Here is a simple example class with __hash__, nothing special:
import msgpack
class Period(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(self.key)
    def __eq__(self, other):
        return self.key == other.key

def encode(o):
    if type(o) is Period:
        return msgpack.ExtType(0, msgpack.dumps(o.__dict__))

def decode_ext(code, data):
    if code == 0:
        o = Period.__new__(Period)
        o.__dict__ = msgpack.loads(data)
        return o
o = {Period((2016, 7)): 112, Period((2016, 8)): 231}
print o
s = msgpack.dumps(o, default=encode)
print s
o2 = msgpack.loads(s, ext_hook=decode_ext)
print o2
Unpacking fails in a way that I think cannot be solved easily:
C:\root\Python27-64\python.exe "C:/Users/Cezary Wagner/PycharmProjects/msgpack_learn/src/02_tuple_wrong_pack.py"
{<__main__.Period object at 0x0000000002941668>: 231, <__main__.Period object at 0x0000000002941AC8>: 112}
Traceback (most recent call last):
  File "C:/Users/Cezary Wagner/PycharmProjects/msgpack_learn/src/02_tuple_wrong_pack.py", line 28, in <module>
    o2 = msgpack.loads(s, ext_hook=decode_ext)
  File "msgpack/_unpacker.pyx", line 139, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:139)
  File "C:/Users/Cezary Wagner/PycharmProjects/msgpack_learn/src/02_tuple_wrong_pack.py", line 8, in __hash__
    return hash(self.key)
TypeError: unhashable type: 'list'
Process finished with exit code 1
Do you have any idea how to restore tuples as tuples and lists as lists using msgpack, if that is possible at all?
For your objects, you would have to write hooks for dict as well. Your keys, Period((2016, 7)) etc., are hashable in the original object because key is a tuple; after a round trip the tuple comes back as a list, which is unhashable.
In your custom hooks, you can store the dict as tuples of key-value pairs, i.e. {Period((2016, 7)): 112, Period((2016, 8)): 231} should first be converted to [(Period((2016, 7)), 112), (Period((2016, 8)), 231)], and then converted back to a dict while unpacking. That way the unhashable nature of lists will not come into the picture.
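The pair-list idea can be sketched without msgpack. The example below uses json from the standard library purely as a stand-in serializer, because it also turns tuples into lists; it is not the msgpack hook API. The helpers to_pairs and from_pairs are hypothetical names, and the sketch assumes that a Period key is always a flat tuple, so it can be restored with tuple() before the object is hashed.

```python
import json

class Period(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(self.key)
    def __eq__(self, other):
        return isinstance(other, Period) and self.key == other.key

def to_pairs(d):
    """Flatten {Period: value} into a list of [key, value] pairs."""
    return [[period.key, value] for period, value in d.items()]

def from_pairs(pairs):
    """Rebuild the dict, turning each key back into a tuple first."""
    return {Period(tuple(key)): value for key, value in pairs}

o = {Period((2016, 7)): 112, Period((2016, 8)): 231}

wire = json.dumps(to_pairs(o))       # tuples are written as lists here
o2 = from_pairs(json.loads(wire))    # keys are re-tupled before hashing

print(o2 == o)  # True
```

With msgpack the same two conversions would presumably live in the packing and unpacking hooks; the important part is that the dict is only built after the keys are hashable again.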
The Dictionary __getitem__ method does not seem to work the same way as it does for List, and it is causing me headaches. Here is what I mean:
If I subclass list, I can overload __getitem__ as:
class myList(list):
    def __getitem__(self, index):
        if isinstance(index, int):
            pass  # do one thing
        if isinstance(index, slice):
            pass  # do another thing
If I subclass dict, however, the __getitem__ does not expose index, but key instead as in:
class myDict(dict):
    def __getitem__(self, key):
        pass  # Here I want to inspect the INDEX, but only have access to key!
So, my question is how can I intercept the index of a dict, instead of just the key?
Example use case:
a = myDict()
a['scalar'] = 1 # Create dictionary entry called 'scalar', and assign 1
a['vector_1'] = [1,2,3,4,5] # I want all subsequent vectors to be 5 long
a['vector_2'][[0,1,2]] = [1,2,3] # I want to intercept this and force vector_2 to be 5 long
print(a['vector_2'])
[1,2,3,0,0]
a['test'] # This should throw a KeyError
a['test'][[0,2,3]] # So should this
Dictionaries are not indexed by position; there is no index to pass in, only a key. This is why Python can use the same syntax ([..]) and the same magic method (__getitem__) for both lists and dictionaries.
When you index a dictionary on an integer like 0, the dictionary treats that like any other key:
>>> d = {'foo': 'bar', 0: 42}
>>> d.keys()
[0, 'foo']
>>> d[0]
42
>>> d['foo']
'bar'
Chained indexing applies to return values; the expression:
a['vector_2'][0, 1, 2]
is executed as:
_result = a['vector_2'] # via a.__getitem__('vector_2')
_result[0, 1, 2] # via _result.__getitem__((0, 1, 2))
so if you want values in your dictionary to behave in a certain way, you must return objects that support those operations.
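A minimal sketch of that last point: the value stored in the dictionary is a list subclass that accepts a list or tuple of positions as its index. FixedVector is a hypothetical name, and the fixed length of 5 is taken from the example use case in the question.

```python
class FixedVector(list):
    """Hypothetical list subclass that accepts several positions at once."""

    def __getitem__(self, index):
        if isinstance(index, (list, tuple)):
            # vec[[0, 2]] or vec[0, 2]: return the selected elements.
            return [list.__getitem__(self, i) for i in index]
        return list.__getitem__(self, index)

    def __setitem__(self, index, value):
        if isinstance(index, (list, tuple)):
            # vec[[0, 1, 2]] = [1, 2, 3]: assign position by position.
            for i, v in zip(index, value):
                list.__setitem__(self, i, v)
        else:
            list.__setitem__(self, index, value)

a = {'vector_2': FixedVector([0] * 5)}

# The dict lookup returns the FixedVector; the second [] goes to it.
a['vector_2'][[0, 1, 2]] = [1, 2, 3]

print(a['vector_2'])        # [1, 2, 3, 0, 0]
print(a['vector_2'][0, 2])  # [1, 3]
```

The dictionary itself stays an ordinary dict, so a['test'] still raises KeyError for a missing key.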
In Python 3, I need a function to dynamically return a value from a nested key.
nesteddict = {'a':'a1','b':'b1','c':{'cn':'cn1'}}
print(nesteddict['c']['cn']) #gives cn1
def nestedvalueget(keys):
print(nesteddict[keys])
nestedvalueget(['n']['cn'])
How should nestedvalueget be written?
I'm not sure the title is properly phrased, but I'm not sure how else to best describe this.
If you want to traverse dictionaries, use a loop:
def nestedvalueget(*keys):
    ob = nesteddict
    for key in keys:
        ob = ob[key]
    return ob
or use functools.reduce():
from functools import reduce
from operator import getitem
def nestedvalueget(*keys):
    return reduce(getitem, keys, nesteddict)
then use either version as:
nestedvalueget('c', 'cn')
Note that either version takes a variable number of arguments to let you pass 0 or more keys as positional arguments.
Demos:
>>> nesteddict = {'a':'a1','b':'b1','c':{'cn':'cn1'}}
>>> def nestedvalueget(*keys):
...     ob = nesteddict
...     for key in keys:
...         ob = ob[key]
...     return ob
...
>>> nestedvalueget('c', 'cn')
'cn1'
>>> from functools import reduce
>>> from operator import getitem
>>> def nestedvalueget(*keys):
...     return reduce(getitem, keys, nesteddict)
...
>>> nestedvalueget('c', 'cn')
'cn1'
And to clarify your error message: You passed the expression ['n']['cn'] to your function call, which defines a list with one element (['n']), which you then try to index with 'cn', a string. List indices can only be integers:
>>> ['n']['cn']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
>>> ['n'][0]
'n'
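If the function should not depend on a global dictionary, or should survive a missing key, one possible variation looks like the sketch below. The name nested_get and the default keyword are assumptions made for this illustration; they are not part of the answer above.

```python
from functools import reduce
from operator import getitem

def nested_get(mapping, *keys, default=None):
    """Follow keys into mapping; return default if any step fails."""
    try:
        return reduce(getitem, keys, mapping)
    except (KeyError, IndexError, TypeError):
        # KeyError: missing dict key; IndexError: list position out of range;
        # TypeError: the value at this level cannot be indexed that way.
        return default

nesteddict = {'a': 'a1', 'b': 'b1', 'c': {'cn': 'cn1'}}

print(nested_get(nesteddict, 'c', 'cn'))              # cn1
print(nested_get(nesteddict, 'c', 'nope'))            # None
print(nested_get(nesteddict, 'a', 'x', default='?'))  # ?
```

Swallowing the exception is a design choice; when a missing key is a real error, the versions above that let KeyError propagate are the better fit.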
I'm coding an Nth-order Markov chain.
It goes something like this:
class Chain:
    def __init__(self, order):
        self.order = order
        self.state_table = {}
    def train(self, next_state, *prev_states):
        if len(prev_states) != self.order: raise ValueError("prev_states does not match chain order")
        if prev_states in self.state_table:
            if next_state in self.state_table[prev_states]:
                self.state_table[prev_states][next_state] += 1
            else:
                self.state_table[prev_states][next_state] = 0
        else:
            self.state_table[prev_states] = {next_state: 0}
Unfortunately, lists and tuples are unhashable, and I cannot use them as keys in dicts...
I have hopefully explained my problem well enough for you to understand what I am trying to achieve.
Any good ideas how I can use multiple values for a dictionary key?
Followup question:
I did not know that tuples are hashable.
But the entropy of the hashes seems low. Are hash collisions possible for tuples?!
Tuples are hashable when their contents are.
>>> a = {}
>>> a[(1,2)] = 'foo'
>>> a[(1,[])]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
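Applied to the chain from the question, this means the *prev_states tuple can serve as the dictionary key directly, provided the states themselves are hashable. The sketch below is one possible shape, using collections.defaultdict and collections.Counter to drop the nested membership checks. Unlike the code in the question, it counts the first observation as 1 rather than 0; that is an assumption about the intended behaviour.

```python
from collections import Counter, defaultdict

class Chain(object):
    def __init__(self, order):
        self.order = order
        # Maps a tuple of previous states to a Counter of next states.
        self.state_table = defaultdict(Counter)

    def train(self, next_state, *prev_states):
        if len(prev_states) != self.order:
            raise ValueError("prev_states does not match chain order")
        # prev_states is already a tuple, so it works as a dict key.
        self.state_table[prev_states][next_state] += 1

chain = Chain(2)
chain.train('c', 'a', 'b')
chain.train('c', 'a', 'b')
chain.train('d', 'a', 'b')

print(chain.state_table[('a', 'b')]['c'])  # 2
print(chain.state_table[('a', 'b')]['d'])  # 1
```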
As for collisions, when I try a bunch of very similar tuples, I see them being mapped widely apart:
>>> hash((1,2))
3713081631934410656
>>> hash((1,3))
3713081631933328131
>>> hash((2,2))
3713082714462658231
>>> abs(hash((1,2)) - hash((1,3)))
1082525
>>> abs(hash((1,2)) - hash((2,2)))
1082528247575
You can use tuples as dictionary keys; they are hashable as long as their content is hashable (as #larsman said).
Don't worry about collisions; Python's dict takes care of them.
>>> hash('a')
12416037344
>>> hash(12416037344)
12416037344
>>> hash('a') == hash(12416037344)
True
>>> {'a': 'one', 12416037344: 'two'}
{'a': 'one', 12416037344: 'two'}
In this example I took a string and an integer, but it works the same with tuples. I just didn't have any idea how to find two tuples with identical hashes.
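One way to get two tuples with identical hashes on purpose is to put elements into them whose hashes are forced to be equal. The sketch below uses a hypothetical SameHash class for that; it relies on the tuple hash being computed from the hashes of the elements, which is how CPython does it.

```python
class SameHash(object):
    """Hypothetical key type whose instances all share one hash value."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, SameHash) and self.name == other.name

k1 = (SameHash('a'), 1)
k2 = (SameHash('b'), 1)

print(hash(k1) == hash(k2))  # True: the tuples collide
print(k1 == k2)              # False: but they are not equal

# The dict falls back to == when hashes match, so both keys survive.
table = {k1: 'one', k2: 'two'}
print(len(table))            # 2
print(table[k1], table[k2])  # one two
```

Collisions only cost extra comparisons during lookup; they do not cause entries to overwrite each other.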
Folks,
Relative n00b to Python, trying to find the diff of two lists of dictionaries.
If these were just regular lists, I could create sets and then do a '-'/intersect operation.
However, set operations do not work on lists of dictionaries:
>>> l = []
>>> pool1 = {}
>>> l.append(pool1)
>>> s = set(l)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
You need a "hashable" dictionary.
The items() method returns the dictionary's (key, value) pairs as tuples. Sort them and wrap them in tuple() and you have a hashable version of the dictionary, as long as the values are hashable too.
tuple( sorted( some_dict.items() ) )
You can define your own dict wrapper that defines a __hash__ method:
class HashableDict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
This wrapper is safe as long as you do not modify the dictionary while finding the intersection.
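A short usage sketch for such a wrapper, applied to the diff of two lists of dictionaries. The pool data is invented for the illustration, and the sketch assumes that all keys are mutually sortable and all values are hashable.

```python
class HashableDict(dict):
    def __hash__(self):
        # Sorted items give the same hash regardless of insertion order.
        return hash(tuple(sorted(self.items())))

pool_a = [{'name': 'p1', 'size': 1}, {'name': 'p2', 'size': 2}]
pool_b = [{'size': 2, 'name': 'p2'}, {'name': 'p3', 'size': 3}]

set_a = set(HashableDict(d) for d in pool_a)
set_b = set(HashableDict(d) for d in pool_b)

only_in_a = [dict(d) for d in set_a - set_b]
in_both = [dict(d) for d in set_a & set_b]

print(only_in_a)  # [{'name': 'p1', 'size': 1}]
print(in_both)    # [{'name': 'p2', 'size': 2}]
```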
Python won't allow you to use a dictionary as a key in either a set or dictionary because it has no default __hash__ method defined. Unfortunately, collections.OrderedDict is also not hashable. There also isn't a built-in dictionary analogue to frozenset. You can either create a subclass of dict with your own hash method, or do something like this:
>>> def dict_item_set(dict_list):
...     return set(tuple(sorted(d.items())) for d in dict_list)
...
>>> a = [{1:2}, {3:4}]
>>> b = [{3:4}, {5:6}]
>>> [dict(t) for t in dict_item_set(a) - dict_item_set(b)]
[{1: 2}]
>>> [dict(t) for t in dict_item_set(a) & dict_item_set(b)]
[{3: 4}]
Of course, this is neither efficient nor pretty.