Python, checksum of a dict

Python, checksum of a dict - python

I'm thinking to create a checksum of a dict to know if it was modified or not
For the moment i have that:
>>> import hashlib
>>> import pickle
>>> d = {'k': 'v', 'k2': 'v2'}
>>> z = pickle.dumps(d)
>>> hashlib.md5(z).hexdigest()
'8521955ed8c63c554744058c9888dc30'
Perhaps a better solution exists?
Note: I want to create an unique id of a dict to create a good Etag.
EDIT: I can have abstract data in the dict.

Something like this:
reduce(lambda x,y : x^y, [hash(item) for item in d.items()])
Take the hash of each (key, value) tuple in the dict and XOR them alltogether.
#katrielalex
If the dict contains unhashable items you could do this:
hash(str(d))
or maybe even better
hash(repr(d))

In Python 3, the hash function is initialized with a random number, which is different for each python session. If that is not acceptable for the intended application, use e.g. zlib.adler32 to build the checksum for a dict:
import zlib
d={'key1':'value1','key2':'value2'}
checksum=0
for item in d.items():
c1 = 1
for t in item:
c1 = zlib.adler32(bytes(repr(t),'utf-8'), c1)
checksum=checksum ^ c1
print(checksum)

I would recommend an approach very similar to the one your propose, but with some extra guarantees:
import hashlib, json
hashlib.md5(json.dumps(d, sort_keys=True, ensure_ascii=True).encode('utf-8')).hexdigest()
sort_keys=True: keep the same hash if the order of your keys changes
ensure_ascii=True: in case you have some non-ascii characters, to make sure the representation does not change
We use this for our ETag.

I don't know whether pickle guarantees you that the hash is serialized the same way every time.
If you only have dictionaries, I would go for o combination of calls to keys(), sorted(), build a string based on the sorted key/value pairs and compute the checksum on that

I think you may not realise some of the subtleties that go into this. The first problem is that the order that items appear in a dict is not defined by the implementation. This means that simply asking for str of a dict doesn't work, because you could have
str(d1) == "{'a':1, 'b':2}"
str(d2) == "{'b':2, 'a':1}"
and these will hash to different values. If you have only hashable items in the dict, you can hash them and then join up their hashes, as #Bart does or simply
hash(tuple(sorted(hash(x) for x in d.items())))
Note the sorted, because you have to ensure that the hashed tuple comes out in the same order irrespective of which order the items appear in the dict. If you have dicts in the dict, you could recurse this, but it will be complicated.
BUT it would be easy to break any implementation like this if you allow arbitrary data in the dictionary, since you can simply write an object with a broken __hash__ implementation and use that. And you can't use id, because then you might have equal items which compare different.
The moral of the story is that hashing dicts isn't supported in Python for a reason.

As you said, you wanted to generate an Etag based on the dictionary content, OrderedDict which preserves the order of the dictionary may be better candidate here. Just iterator through the key,value pairs and construct your Etag string.

Related

How to check if an object is either on the keys or in the values of a dictionary?

I am trying to understand what is the simplest way to check if an object is in either the keys of a dictionary or in the values of a dictionary. I've tried using .items() but with no results.
Now I am using this solution but I wonder if there is a better solution:
zdict = { 'a':1,'b':2,'c':3}
print(list(zdict.values()) + list(zdict.keys()))
'b' in list(zdict.values()) + list(zdict.keys())

Stop unecessarily making list objects out of the views returned by .keys and .values. To check if an object is a dictionary key, you simply use some_object in some_dict, to check if it is in the values, you use some_object in some_dict.values(), so combining both:
some_object in some_dict or some_obect in some_dict.values()
This is fundamentally going to be a linear operation altogether, but checking if it is in the keys is constant-time, it's a hash-lookup, so you should check that first to take advantage of short-circuiting behavior. Note, if you make a list out of the keys then you force a linear search.

I wouldn't say this is simpler but maybe:
any('b'==k or 'b'==v for k,v in zdict.items())

"b" in sum(zdict.items(), ())
We turn the "tuples of list" into a tuple of all keys and values together, making use of the sum where we supply an empty tuple for its initial value.
Edit: A commenter said above is a quadratic operation. A linear version might be:
from itertools import chain
"b" in chain(*zdict.items())
Linearity is thanks to the lazy evaluation of chain.

This is not much different than others but quite different syntax:
any([ k for k, v in zdict.items() if k=='b' or v=='b'])

Python loading sorted json Dict is inconsistent in Windows and Linux

I am using Python 3.6 to process the data I receive in a text file containing a Dict having sorted keys. An example of such file can be:
{"0.1":"A","0.2":"B","0.3":"C","0.4":"D","0.5":"E","0.6":"F","0.7":"G","0.8":"H","0.9":"I","1.0":"J"}
My data load and transform is simple - I load the file, and then transform the dict into a list of tuples. Simplified it looks like this:
import json
import decimal
with open('test.json') as fp:
o = json.loads(fp.read())
l = [(decimal.Decimal(key), val) for key, val in o.items()]
is_sorted = all(l[i][0] <= l[i+1][0] for i in range(len(l)-1))
print(l)
print('Sorted:', is_sorted)
The list is always sorted in Windows and never in Linux. I know that I can sort the list manually, but since the data file is always sorted already and rather big, I'm looking for a different approach. Is there a way to somehow force json package to load the data to my dict sorted in both Windows and Linux?
For the clarification: I have no control over the structure of data I receive. My goal is to find the most efficient method to load the data into the list of tuples for further processing from what I get.

A dictionary is just a mapping between its keys and corresponding values. It doesn't have any order. It doesn't make sense to say you always find them sorted. In addition, any dictionary member access is O(1) so it's probably fast enough for your need. In case you think you still need some order, ordered dictionary may be useful.

Dicts are unordered objects in Python, so the problem you're running into is actually by design.
If you want to get a sorted list of tuples, you can do something like:
sorted_tuples = sorted(my_dict.items(),key=lambda x:x[0])
or
import operator
sorted_tuples = sorted(my_dict.items(),key=operator.itemgetter(0)
The dict method .items() converts the dict to a list of tuples and sorted() sorts that list. The key parameter to sorted explains how to sort the list. Both lambda x: x[0] and operator.itemgetter(0) select the first element of the tuple (the key from the original dict), and sort on that element.

Python: how to dump the values of a dict to a list respecting the original visualization? [duplicate]

I have a dictionary that I declared in a particular order and want to keep it in that order all the time. The keys/values can't really be kept in order based on their value, I just want it in the order that I declared it.
So if I have the dictionary:
d = {'ac': 33, 'gw': 20, 'ap': 102, 'za': 321, 'bs': 10}
It isn't in that order if I view it or iterate through it. Is there any way to make sure Python will keep the explicit order that I declared the keys/values in?

From Python 3.6 onwards, the standard dict type maintains insertion order by default.
Defining
d = {'ac':33, 'gw':20, 'ap':102, 'za':321, 'bs':10}
will result in a dictionary with the keys in the order listed in the source code.
This was achieved by using a simple array with integers for the sparse hash table, where those integers index into another array that stores the key-value pairs (plus the calculated hash). That latter array just happens to store the items in insertion order, and the whole combination actually uses less memory than the implementation used in Python 3.5 and before. See the original idea post by Raymond Hettinger for details.
In 3.6 this was still considered an implementation detail; see the What's New in Python 3.6 documentation:
The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5).
Python 3.7 elevates this implementation detail to a language specification, so it is now mandatory that dict preserves order in all Python implementations compatible with that version or newer. See the pronouncement by the BDFL. As of Python 3.8, dictionaries also support iteration in reverse.
You may still want to use the collections.OrderedDict() class in certain cases, as it offers some additional functionality on top of the standard dict type. Such as as being reversible (this extends to the view objects), and supporting reordering (via the move_to_end() method).

from collections import OrderedDict
OrderedDict((word, True) for word in words)
contains
OrderedDict([('He', True), ('will', True), ('be', True), ('the', True), ('winner', True)])
If the values are True (or any other immutable object), you can also use:
OrderedDict.fromkeys(words, True)

Rather than explaining the theoretical part, I'll give a simple example.
>>> from collections import OrderedDict
>>> my_dictionary=OrderedDict()
>>> my_dictionary['foo']=3
>>> my_dictionary['aol']=1
>>> my_dictionary
OrderedDict([('foo', 3), ('aol', 1)])
>>> dict(my_dictionary)
{'foo': 3, 'aol': 1}

Note that this answer applies to python versions prior to python3.7. CPython 3.6 maintains insertion order under most circumstances as an implementation detail. Starting from Python3.7 onward, it has been declared that implementations MUST maintain insertion order to be compliant.
python dictionaries are unordered. If you want an ordered dictionary, try collections.OrderedDict.
Note that OrderedDict was introduced into the standard library in python 2.7. If you have an older version of python, you can find recipes for ordered dictionaries on ActiveState.

Dictionaries will use an order that makes searching efficient, and you cant change that,
You could just use a list of objects (a 2 element tuple in a simple case, or even a class), and append items to the end. You can then use linear search to find items in it.
Alternatively you could create or use a different data structure created with the intention of maintaining order.

I came across this post while trying to figure out how to get OrderedDict to work. PyDev for Eclipse couldn't find OrderedDict at all, so I ended up deciding to make a tuple of my dictionary's key values as I would like them to be ordered. When I needed to output my list, I just iterated through the tuple's values and plugged the iterated 'key' from the tuple into the dictionary to retrieve my values in the order I needed them.
example:
test_dict = dict( val1 = "hi", val2 = "bye", val3 = "huh?", val4 = "what....")
test_tuple = ( 'val1', 'val2', 'val3', 'val4')
for key in test_tuple: print(test_dict[key])
It's a tad cumbersome, but I'm pressed for time and it's the workaround I came up with.
note: the list of lists approach that somebody else suggested does not really make sense to me, because lists are ordered and indexed (and are also a different structure than dictionaries).

You can't really do what you want with a dictionary. You already have the dictionary d = {'ac':33, 'gw':20, 'ap':102, 'za':321, 'bs':10}created. I found there was no way to keep in order once it is already created. What I did was make a json file instead with the object:
{"ac":33,"gw":20,"ap":102,"za":321,"bs":10}
I used:
r = json.load(open('file.json'), object_pairs_hook=OrderedDict)
then used:
print json.dumps(r)
to verify.

from collections import OrderedDict
list1 = ['k1', 'k2']
list2 = ['v1', 'v2']
new_ordered_dict = OrderedDict(zip(list1, list2))
print new_ordered_dict
# OrderedDict([('k1', 'v1'), ('k2', 'v2')])

Another alternative is to use Pandas dataframe as it guarantees the order and the index locations of the items in a dict-like structure.

I had a similar problem when developing a Django project. I couldn't use OrderedDict, because I was running an old version of python, so the solution was to use Django's SortedDict class:
https://code.djangoproject.com/wiki/SortedDict
e.g.,
from django.utils.datastructures import SortedDict
d2 = SortedDict()
d2['b'] = 1
d2['a'] = 2
d2['c'] = 3
Note: This answer is originally from 2011. If you have access to Python version 2.7 or higher, then you should have access to the now standard collections.OrderedDict, of which many examples have been provided by others in this thread.

Generally, you can design a class that behaves like a dictionary, mainly be implementing the methods __contains__, __getitem__, __delitem__, __setitem__ and some more. That class can have any behaviour you like, for example prividing a sorted iterator over the keys ...

if you would like to have a dictionary in a specific order, you can also create a list of lists, where the first item will be the key, and the second item will be the value
and will look like this
example
>>> list =[[1,2],[2,3]]
>>> for i in list:
... print i[0]
... print i[1]
1
2
2
3

You can do the same thing which i did for dictionary.
Create a list and empty dictionary:
dictionary_items = {}
fields = [['Name', 'Himanshu Kanojiya'], ['email id', 'hima#gmail.com']]
l = fields[0][0]
m = fields[0][1]
n = fields[1][0]
q = fields[1][1]
dictionary_items[l] = m
dictionary_items[n] = q
print dictionary_items

How To Create a Unique Key For A Dictionary In Python

What is the best way to generate a unique key for the contents of a dictionary. My intention is to store each dictionary in a document store along with a unique id or hash so that I don't have to load the whole dictionary from the store to check if it exists already or not. Dictionaries with the same keys and values should generate the same id or hash.
I have the following code:
import hashlib
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
print str(a)
print hashlib.sha1(str(a)).hexdigest()
print hashlib.sha1(str(b)).hexdigest()
The last two print statements generate the same string. Is this is a good implementation? or are there any pitfalls with this approach? Is there a better way to do this?
Update
Combining suggestions from the answers below, the following might be a good implementation
import hashlib
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
def get_id_for_dict(dict):
unique_str = ''.join(["'%s':'%s';"%(key, val) for (key, val) in sorted(dict.items())])
return hashlib.sha1(unique_str).hexdigest()
print get_id_for_dict(a)
print get_id_for_dict(b)

I prefer serializing the dict as JSON and hashing that:
import hashlib
import json
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
# Python 2
print hashlib.sha1(json.dumps(a, sort_keys=True)).hexdigest()
print hashlib.sha1(json.dumps(b, sort_keys=True)).hexdigest()
# Python 3
print(hashlib.sha1(json.dumps(a, sort_keys=True).encode()).hexdigest())
print(hashlib.sha1(json.dumps(b, sort_keys=True).encode()).hexdigest())
Returns:
71083588011445f0e65e11c80524640668d3797d
71083588011445f0e65e11c80524640668d3797d

No - you can't rely on particular order of elements when converting dictionary to a string.
You can, however, convert it to sorted list of (key,value) tuples, convert it to a string and compute a hash like this:
a_sorted_list = [(key, a[key]) for key in sorted(a.keys())]
print hashlib.sha1( str(a_sorted_list) ).hexdigest()
It's not fool-proof, as a formating of a list converted to a string or formatting of a tuple can change in some future major python version, sort order depends on locale etc. but I think it can be good enough.

A possible option would be using a serialized representation of the list that preserves order. I am not sure whether the default list to string mechanism imposes any kind of order, but it wouldn't surprise me if it were interpreter-dependent. So, I'd basically build something akin to urlencode that sorts the keys beforehand.
Not that I believe that you method would fail, but I'd rather play with predictable things and avoid undocumented and/or unpredictable behavior. It's true that despite "unordered", dictionaries end up having an order that may even be consistent, but the point is that you shouldn't take that for granted.

Python data structure design

The data structure should meet the following purpose:
each object is unique with certain key-value pairs
the keys and values are not predetermined and can contain any string value
querying for objects should be fast
Example:
object_123({'stupid':True, 'foo':'bar', ...})
structure.get({'stupid':True, 'foo':'bar', ...}) should return object_123
Optimally this structure is implemented with the standard python data structures available through the standard library.
How would you implement this?

The simplest solution I can think of is to use sorted tuple keys:
def key(d): return tuple(sorted(d.items()))
x = {}
x[key({'stupid':True, 'foo':'bar', ...})] = object_123
x.get(key({'stupid':True, 'foo':'bar', ...})) => object_123
Another option would be to come up with your own hashing scheme for your keys (either by wrapping them in a class or just using numeric keys in the dictionary), but depending on your access pattern this may be slower.

I think SQLite or is what you need. It may not be implemented with standard python structures but it's available through the standard library.

Say object_123 is a dict, which it pretty much looks like. Your structure seems to be a standard dict with keys like (('foo', 'bar'), ('stupid', True)); in other words, tuple(sorted(object_123.items())) so that they're always listed in a defined order.
The reason for the defined ordering is because dict.items() isn't guaranteed to return a list in a given ordering. If your dictionary key is (('foo', 'bar'), ('stupid', True)), you don't want a false negative just because you're searching for (('stupid', True),('foo', 'bar')). Sorting the values is probably the quickest way to protect against that.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.