Two identical dictionaries differ (by using diff) after being pickled - python

I have a dictionary whose keys are tuples like (int, str, int, str, int), and the corresponding values are lists of floats of the same size.
I pickled the dictionary twice by the same script:
import pickle
with open(name, 'wb') as source:
    pickle.dump(the_dict, source)
For the two resulting binary files test_1 and test_2, I run
diff test_1 test_2
in a terminal (I'm using macOS) to see whether I can use diff to tell the difference. However, I received
Binary files test_1 and test_2 differ
Why? Was the same dictionary being pickled in different ways? Does it mean I cannot use diff to tell whether two dictionaries are identical?

Which behaviour you see depends on your Python version: versions before 3.6 do not preserve the insertion order of dict keys at all. CPython 3.6 made insertion ordering an implementation detail, and Python 3.7 made it a language feature.
For backwards compatibility, you shouldn't depend on the dictionary remembering the order of inserted keys. If you need ordering on older versions, use OrderedDict from the collections module.
Also, running diff on pickled dict data may report differences even though the actual dictionaries are equivalent, since dicts, unlike lists, generally make no guarantees about key order (see above for when that is not the case).
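For example (a minimal sketch, assuming Python 3.7+ where insertion order is preserved): two dicts that compare equal but were built in a different insertion order pickle to different byte strings.
import pickle

a = {'x': 1.0, 'y': 2.0}
b = {'y': 2.0, 'x': 1.0}

print(a == b)                              # True: same key-value pairs
print(pickle.dumps(a) == pickle.dumps(b))  # False: pairs are serialised in insertion order
So diff only tells you the byte streams differ; the reliable check is to unpickle both files and compare the objects with ==.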

Related

Saving a generic python object externally

I moved over from matlab to python about a year ago and am still getting used to the differences. In matlab you could save a structure (the slightly less nice equivalent of an object from a generic class) to an external file. As a result - if you were working with something that generated structures that you didn't understand, you could save them to a file and then compare them.
This file would look something like
structure 1
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
structure 2
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
structure 3
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
and so on.
Is there a Python way of doing this? Right now I just care about printing them externally, so I can read and compare them. Matlab lets you read them back in, and I don't need that kind of functionality in this case.
There is no really generic and readable serialisation scheme.
pickle works with everything but it's a bad fit: it's unreadable (in fact it's explicitly specified as not human-readable; even protocol 0 is at best "textual") and it's very unsafe.
json can be pretty readable, but it won't work out of the box with most types (though it will work with simple structures of builtin types e.g. if you're just creating lists and dicts it's fine).
You could use dataclasses or attrs; they make structures printable by default. They use the standard repr of
TypeName(field1=value1, field2=value2, ...)
which is somewhat less conducive to easy comparison via standard textual diffs, but both provide an asdict function that converts instances to dicts, which you can then easily serialise as JSON (with linebreaks and indentation) or some other generic scheme.
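As a sketch of that route, using dataclasses (attrs works analogously via attr.asdict); the Structure class and its fields are invented for illustration:
import json
from dataclasses import dataclass, asdict

@dataclass
class Structure:
    property1: int
    property2: str
    property3: list

s = Structure(1, 'two', [3.0, 4.0])
print(s)                                # Structure(property1=1, property2='two', property3=[3.0, 4.0])
print(json.dumps(asdict(s), indent=2))  # readable, diff-friendly text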

Pass Dictionary from LabVIEW to python script via a Python Node

TLDR: I am making a python wrapper around something for LabVIEW to use and I want to pass a dict (or even kwargs) [i.e. key/value pairs] to a python script so I can have more dynamic function arguments.
LabVIEW 2018 implemented a Python Node which allows LabVIEW to interact with python scripts by calling, passing, and getting returned variables.
The issue is it doesn't appear to have native support for the dict type:
Python Node Details: Supported Data Types
The Python Node supports a large number of data types. You can use this node to call the following data types:
Numerics
Arrays, including multi-dimensional arrays
Strings
Clusters
Calling Conventions
This node converts integers and strings to the corresponding data
types in Python, converts arrays to lists, and converts clusters to
tuples.
Of course Python is built around dictionaries, but it appears LabVIEW does not support any way to pass a dictionary object.
Does anyone know of a way I can pass a cluster of named elements (or any other dictionary type) to a python script as a dict object?
There is no direct way to do it.
The simplest way on both sides would be to use JSON strings.
From LabVIEW to Python
LabVIEW Clusters can be flattened to JSON (Strings > Flatten/unflatten):
The resulting string can be converted to a dict in just one line of Python (plus an import):
>>> import json
>>> myDict=json.loads('{"MyString":"FooBar","MySubCluster":{"MyInt":42,"MyFloat":3.1410000000000000142},"myIntArray":[1,2,3]}')
>>> myDict
{u'MyString': u'FooBar', u'MySubCluster': {u'MyInt': 42, u'MyFloat': 3.141}, u'myIntArray': [1, 2, 3]}
>>> myDict['MySubCluster']['MyFloat']
3.141
From Python to LabVIEW
The Python side is easy again:
>>> MyJson = json.dumps(myDict)
In LabVIEW, unflatten JSON from string, and wire a cluster of the expected structure with default values:
This of course requires that the structure of the dict is fixed.
If it is not, you can still access single elements by giving the path to them as array:
Limitations:
While this works like a charm (did you even notice that my locale uses a comma as the decimal sign?), not all datatypes are supported. For example, JSON itself has neither a time datatype nor a dedicated path datatype, so the JSON VIs refuse to handle them. Use a numerical or string datatype instead, and convert it within LabVIEW.
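On the Python side, a possible workaround for the time limitation (a sketch; the field name acquired_at is made up) is to encode timestamps as ISO-8601 strings and let LabVIEW parse them:
import json
from datetime import datetime, timezone

# JSON has no time datatype, so send the timestamp as a string.
payload = {'acquired_at': datetime.now(timezone.utc).isoformat(), 'value': 3.141}
print(json.dumps(payload))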
Excursus: a dict-ish datatype in LabVIEW
If you ever need a dynamic datatype in LabVIEW, have a look at the attributes of variants.
These are pairs of keys (strings) and values (any datatype!), which can be added and read about as simply as in Python. But there is no (builtin, simple) way to use them to interchange data with Python.

Pickling dict in Python

Can I expect the string representation of the same pickled dict to be consistent across different machines/runs for the same Python version?
In the scope of one run on the same machine?
e.g.
# Python 2.7
import pickle
initial = pickle.dumps({'a': 1, 'b': 2})
for _ in xrange(1000**2):
    assert pickle.dumps({'a': 1, 'b': 2}) == initial
Does it depend on the actual structure of my dict object (nested values etc.)?
UPD:
The thing is, I can't actually make the code above fail in the scope of one run (Python 2.7), no matter what my dict object looks like (whatever keys/values etc.).
You can't in the general case, for the same reasons you can't rely on the dictionary order in other scenarios; pickling is not special here. The string representation of a dictionary is a function of the current dictionary iteration order, regardless of how you loaded it.
Your own small test is too limited, because it doesn't do any mutation of the test dictionary and doesn't use keys that would cause collisions. You create dictionaries with the exact same Python source code, so those will produce the same output order because the editing history of the dictionaries is exactly the same, and two single-character keys that use consecutive letters from the ASCII character set are not likely to cause a collision.
Note that you don't actually test string representations being equal; you only test if their contents are the same (two dictionaries that differ in string representation can still be equal, because the same key-value pairs, subjected to a different insertion order, can produce a different dictionary output order).
Next, the most important factor in the dictionary iteration order before CPython 3.6 is the hash key generation function, which must be stable during a single Python executable lifetime (or otherwise you'd break all dictionaries), so a single-process test would never see dictionary order change on the basis of different hash function results.
Currently, all pickling protocol revisions store the data for a dictionary as a stream of key-value pairs; on loading the stream is decoded and key-value pairs are assigned back to the dictionary in the on-disk order, so the insertion order is at least stable from that perspective. BUT between different Python versions, machine architectures and local configuration, the hash function results absolutely will differ:
The PYTHONHASHSEED environment variable is used in the generation of hashes for str, bytes and datetime keys. The setting is available as of Python 2.6.8 and 3.2.3, and is enabled and set to random by default as of Python 3.3. So the setting varies from Python version to Python version, and can be set to something different locally.
The hash function produces a ssize_t integer, a platform-dependent signed integer type, so different architectures can produce different hashes just because they use a larger or smaller ssize_t type definition.
With different hash function output from machine to machine and from Python run to Python run, you will see different string representations of a dictionary.
And finally, as of CPython 3.6, the implementation of the dict type changed to a more compact format that also happens to preserve insertion order. As of Python 3.7, the language specification changed to make this behaviour mandatory, so other Python implementations have to implement the same semantics. Pickling and unpickling between different Python implementations, or between versions predating Python 3.7, can therefore also result in a different dictionary output order, even with all other factors equal.
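A small sketch of that hash-seed effect (the keys and seed values are made up for illustration): dump the same dict in two fresh interpreters with different PYTHONHASHSEED values and compare the bytes.
import os
import subprocess
import sys

code = ("import pickle, sys; "
        "sys.stdout.buffer.write(pickle.dumps("
        "{'first_string_key': 1, 'second_key_string': 2}))")

def dump_with_seed(seed):
    # A fresh interpreter is needed for PYTHONHASHSEED to take effect.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    return subprocess.run([sys.executable, '-c', code],
                          env=env, stdout=subprocess.PIPE).stdout

# Usually False on hash-randomised Python < 3.6; True on 3.7+,
# where insertion order, not hash order, drives iteration.
print(dump_with_seed(0) == dump_with_seed(42))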
No, you cannot. This depends on a lot of things, including key values, interpreter state and Python version.
If you need a consistent representation, consider using JSON in a canonical form.
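For instance, a minimal sketch of such a canonical form (the helper name is made up): sorted keys and fixed separators make the output deterministic for equal dicts.
import json

def canonical(obj):
    # Equal dicts always serialise to byte-for-byte identical text.
    return json.dumps(obj, sort_keys=True, separators=(',', ':'))

print(canonical({'b': 2, 'a': 1}) == canonical({'a': 1, 'b': 2}))  # True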
EDIT
I'm not quite sure why people are downvoting this without any comments, but I'll clarify.
pickle is not meant to produce reliable representations; it's a purely machine- (not human-) readable serializer.
Python version backward/forward compatibility is a thing, but it applies only to the ability to deserialize an identical object inside the interpreter, i.e. when you dump in one version and load in another, you're guaranteed the same behaviour of the same public interfaces. Neither the serialized text representation nor the internal memory structure is claimed to be the same (and IIRC, it never was).
The easiest way to check this is to dump the same data in versions with a significant difference in structure handling and/or seed handling, while keeping your keys out of the cached range (no short integers or strings):
Python 3.5.6 (default, Oct 26 2018, 11:00:52)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x11\x00\x00\x00second_key_stringq\x01K\x02X\x10\x00\x00\x00first_string_keyq\x02K\x01u.'
Python 3.6.7 (default, Oct 26 2018, 11:02:59)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x10\x00\x00\x00first_string_keyq\x01K\x01X\x11\x00\x00\x00second_key_stringq\x02K\x02u.'
Python 2 dictionaries are unordered; the order depends on the hash values of the keys, as explained in this great answer by Martijn Pieters. I don't think you can use a dict here, but you could use an OrderedDict (requires Python 2.7 or higher), which maintains the order of the keys. For example,
from collections import OrderedDict
data = [('b', 0), ('a', 0)]
d = dict(data)
od = OrderedDict(data)
print(d)
print(od)
#{'a': 0, 'b': 0}
#OrderedDict([('b', 0), ('a', 0)])
You can pickle an OrderedDict just like you would pickle a dict, but order will be preserved, and the resulting string will be the same when pickling the same object.
from collections import OrderedDict
import pickle
data = [('a', 1), ('b', 2)]
od = OrderedDict(data)
s = pickle.dumps(od)
print(s)
Note that you shouldn't pass a dict to OrderedDict's constructor, as the keys would already have been placed in the dict's arbitrary order by then. If you have a dictionary, first convert it to a sequence of tuples in the desired order. OrderedDict is a subclass of dict and has all the dict methods, so you could also create an empty object and assign keys one by one.
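A quick sketch of that conversion (the alphabetical ordering here is just one reproducible choice):
from collections import OrderedDict

d = {'b': 2, 'a': 1}
od = OrderedDict(sorted(d.items()))  # impose an explicit, reproducible order
print(od)  # OrderedDict([('a', 1), ('b', 2)])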
Your test doesn't fail because you're using the same Python version and the same conditions; the order of the dictionary will not change randomly between loop iterations. But we can demonstrate how your code fails to produce different strings when we change the order of keys in the dictionary.
import pickle
initial = pickle.dumps({'a': 1, 'b': 2})
assert pickle.dumps({'b': 2, 'a': 1}) != initial
The resulting string should be different when we put key 'b' first (and it would be different in Python >= 3.6), but in Python 2 it's the same, because key 'a' ends up placed before key 'b' regardless.
To answer your main question: Python 2 dictionaries are unordered, but a dictionary is likely to have the same order when using the same code and Python version. However, that order may not be the same as the order in which you placed the items in the dictionary. If the order is important, it's best to use an OrderedDict or update your Python version.
As with a frustratingly large number of things in Python, the answer is "sort of". Straight from the docs,
The pickle serialization format is guaranteed to be backwards compatible across Python releases.
That's potentially ever so subtly different from what you're asking. If it's a valid pickled dictionary now, it'll always be a valid pickled dictionary, and it'll always deserialize to the correct dictionary. That leaves unspoken a few properties which you might expect and which don't have to hold:
Pickling doesn't have to be deterministic, even for the same object in the same Python instance on the same platform. The same dictionary could have infinitely many possible pickled representations (not that we would expect the format to ever be inefficient enough to support arbitrarily large degrees of extra padding). As the other answers point out, dictionaries don't have a defined sort order, and this can give at least n! string representations of a dictionary with n elements.
Going further with the last point, it isn't guaranteed that pickle is consistent even in a single Python instance. In practice those changes don't currently happen, but that behavior isn't guaranteed to remain in future versions of Python.
Future versions of Python don't need to serialize dictionaries in a way which is compatible with current versions. The only promise we have is that they will be able to correctly deserialize our dictionaries. Currently dictionaries are supported the same in all Pickle formats, but that need not remain the case forever (not that I suspect it would ever change).
If you don't modify the dict its string representation won't change during a given run of the program, and its .keys method will return the keys in the same order. However, the order can change from run to run (before Python 3.6).
Also, two different dict objects that have identical key-value pairs are not guaranteed to use the same order (pre Python 3.6).
BTW, it's not a good idea to shadow a module name with your own variables, like you do with that lambda. It makes the code harder to read, and will lead to confusing error messages if you forget that you shadowed the module & try to access some other name from it later in the program.
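The pitfall in miniature (this lambda is just an illustration, not the asker's code):
import pickle
pickle = lambda obj: obj  # the variable now hides the module
# pickle.dumps({'a': 1})  # AttributeError: 'function' object has no attribute 'dumps'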

Trade-off in Python dictionary key types

Say I'm going to construct a probably large dictionary in Python 3 for in-memory operations. The dictionary keys are integers, but I'm going to read them from a file as strings at first.
As far as storage and retrieval are concerned, I wonder if it matters whether I store the dictionary keys as integers themselves, or as strings.
In other words, would leaving them as integers help with hashing?
Dicts are fast but can be heavy on memory.
Normally it shouldn't be a problem, but you will only know when you test.
I would advise first testing 1,000 lines, 10,000 lines and so on, and having a look at the memory footprint.
If you run out of memory and your data structure allows it, maybe try using named tuples.
from collections import namedtuple
import csv

EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')

with open('employees.csv', newline='') as f:  # text mode: csv.reader in Python 3 needs it
    for emp in map(EmployeeRecord._make, csv.reader(f)):
        print(emp.name, emp.title)
(Example adapted from the link; the original opened the file in Python 2 style 'rb' mode.)
If you have ascending integers you could also try to get more fancy by using the array module.
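A sketch of that idea, assuming the keys are dense ascending integers starting at 0, so the key can double as the index:
from array import array

values = array('d', [0.0] * 1000)  # 'd' = C double; far lighter than a dict
values[42] = 3.14                  # the integer "key" is just the index
print(values[42])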
Actually, string hashing is rather efficient in Python 3. I expected this to have the opposite outcome:
>>> from timeit import timeit
>>> timeit('d["1"];d["4"]', setup='d = {"1": 1, "4": 4}')
0.05167865302064456
>>> timeit('d[1];d[4]', setup='d = {1: 1, 4: 4}')
0.06110116100171581
You don't seem to have bothered benchmarking the alternatives. It turns out that the difference is quite slight, and I also find the differences inconsistent. Besides, how the lookup works is an implementation detail; since both integers and strings are immutable, they could conceivably even be compared as pointers.
What you should consider is which one is the natural choice of key. For example, if you don't interpret the key as a number anywhere else, there's little reason to convert it to an integer.
Additionally, you should consider whether keys should be equal when their numeric values are the same, or only when they are lexically identical. For example, if you would consider 00 the same key as 0, you need to interpret the keys as integers, and then integer is the proper key; if, on the other hand, you want to consider them different, it would be outright wrong to convert them to integers (as they would become the same key).
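The collision in miniature (values made up):
d = {}
for raw in ['0', '00']:
    d[int(raw)] = raw  # both strings convert to the integer key 0
print(d)  # {0: '00'}: the second entry silently overwrote the first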

YAML output format of python

I use PyYAML to output a YAML file, but it reorders my items, like the following:
>>> yaml.dump({'3':5, '1':3})
"{'1': 3, '3': 5}\n"
I want to get "{'3': 5, '1': 3}\n". Can I do that?
PS. I have tried collections.OrderedDict. Its output is not good, like the following:
>>> a= collections.OrderedDict()
>>> a['3']=1
>>> a['1']=2
>>> a['5']=2
>>> yaml.dump(a)
"!!python/object/apply:collections.OrderedDict\n- - ['3', 1]\n - ['1', 2]\n - ['5', 2]\n"
TL;DR: The solution is in the two lines commented "LOOK HERE!" It is possible to deal with YAML as dicts within your program and with an ordering in the stored file/text if you accept that the output will be lists of lists.
If you don't mind horribly ugly explicit types like !!python/ordered_dict or !!omap littering your file, then you can go that route as well. My vote goes to !!omap, but I'm unsure how many tools/libs support it (I'm pretty sure fewer tools support !!python/ordered_dict, though). Ultimately you are dealing with two independent sets of data: the dict itself, and metadata that defines the ordering of the keys.
(There are semi-magical ways of coercing an ordered dict in YAML without the !!python/ordered_dict or !!omap mess everywhere, but they are fragile, contradict the very definition of dictionaries, and will likely break as the underlying YAML library evolves. This situation is identical for JSON, by the way, as YAML is a superset of JSON and neither guarantees the order of keys, which means the workarounds break the first time a standard-compliant tool/user messes with the file.)
The rest of this post is example/verification code and an explanation of why things are this way.
from __future__ import print_function
import yaml

# Setting up some example data
d = {'name': 'A Project',
     'version': {'major': 1, 'minor': 4, 'patch': 2},
     'add-ons': ['foo', 'bar', 'baz']}

# LOOK HERE!
ordering = ['name', 'version', 'add-ons', 'papayas']
ordered_set = [[x, d[x]] for x in ordering if x in d.keys()]
# In the event you only care about a few keys,
# you can tack the unspecified ones onto the end.
# Note that 'papayas' isn't a key. You can establish an ordering that
# includes optional keys by using 'if' as a guard in the list comprehension.

# Demonstration
things = {'unordered.yaml': d, 'ordered.yaml': ordered_set}
for k in things:
    with open(k, 'w') as f:
        f.write(yaml.dump(things[k], default_flow_style=False, allow_unicode=True))

# Let's check the result
output = []
for k in things:
    with open(k) as f:
        # safe_load: modern PyYAML requires an explicit loader choice
        output.append(dict(yaml.safe_load(f.read())))

# Should print 'OK'
if output[0] == output[1]:
    print('OK')
else:
    print('Something is wrong')
The files created look like this:
ordered.yaml:
- - name
  - A Project
- - version
  - major: 1
    minor: 4
    patch: 2
- - add-ons
  - - foo
    - bar
    - baz
unordered.yaml:
add-ons:
- foo
- bar
- baz
name: A Project
version:
  major: 1
  minor: 4
  patch: 2
This doesn't produce as pretty a YAML document as you might hope. That said, it can take pretty YAML as initial input (yay!), and scripting the conversion from un-pretty, ordered YAML to pretty, still-ordered, dict-style YAML is straightforward (which I leave as an exercise for you).
If you have an ordering of keys you want preserved, write that into an ordered list/tuple. Use that list to generate an ordered list of lists (not a list of tuples, though, because you'd get the !!python/tuple type in YAML, and that sucks). Dump that to YAML. To read it back in, read it as normal, then pass that structure to dict() and you're back to the original dictionary you started with. You may have to recursively descend the structure if you have a nested structure that requires its order preserved (this is easier to do in code than to explain in prose; a sketch follows below).
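A recursive sketch of that descent, assuming two-element inner lists always encode key/value pairs (true for the format produced above, but not for arbitrary data):
def to_dicts(node):
    # Turn the ordered list-of-pairs form back into plain dicts, recursively.
    if isinstance(node, list) and node and all(
            isinstance(i, list) and len(i) == 2 for i in node):
        return {k: to_dicts(v) for k, v in node}
    if isinstance(node, list):
        return [to_dicts(i) for i in node]
    return node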
In this example I want the project 'name' to come first in the file, then the 'version' number elements, then 'add-ons'. Normally PyYAML sorts dictionary keys alphanumerically when you call dump(), but that isn't reliable: it might change in the future, and nothing in the YAML standard requires it, so I have no guarantee that a different YAML utility will do things this way. 'add-ons' comes before 'name', so I have an ordering problem. So I define my order, pack an ordered list of lists, and dump that.
You are asking for order out of something that is inherently unordered. A dictionary is a hash table, internally ordered exclusively for search speed. That order is something you're not supposed to be able to mess with because if a faster way of implementing dictionaries is discovered tomorrow the runtime needs to implement it without breaking everyone's code that depended on dictionaries being a helpful abstraction of a hash table.
In the same way, YAML is not a markup language (the name stands for "YAML Ain't Markup Language"); it is a data format. The difference is important. Some data is ordered, like tuples and lists; some isn't, like bags of key-value pairs (slightly different from a hash table, but conceptually similar).
I use a recursive version of this sort of solution to guarantee YAML output across different YAML implementations, not for human readability, but because I do a lot of data passing in YAML and each record has to be signed with a key, and indefinite order prevents uniform signatures whenever dicts/hashes are in use.
YAML mappings are unordered, and so are Python dicts. The official way to read in a file
and keep the ordering is to use !!omap, but those get converted to tuples in PyYAML and are not as easy to update as dict/ordereddict/OrderedDict.
If you already have a YAML file that you read in and update, you can use my ruamel.yaml library, which reads in mappings as ordereddict when used in round-trip mode and writes them out as normal mappings (it also preserves comments).
An example of usage was given as an answer to another question.
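A minimal round-trip sketch with ruamel.yaml (assuming the package is installed; the input document is made up):
import sys
from ruamel.yaml import YAML

yaml = YAML()  # round-trip mode is the default
data = yaml.load("'3': 5\n'1': 3\n")
data['1'] = 4  # update a value; key order (and any comments) survive
yaml.dump(data, sys.stdout)
# '3': 5
# '1': 4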
I might be a little late to the party, but using the yaml package's add_representer function seems to resolve the problem. I just added yaml.add_representer(collections.OrderedDict, Representer.represent_dict) before my yaml.dump, and my YAML no longer has the ugly !!python/object/apply format shown above:
import collections
import yaml

l = collections.OrderedDict()
l['hax'] = 45
l['ko'] = [4, 5]
l['ax'] = "less"
yaml.dump(l)
# output: '!!python/object/apply:collections.OrderedDict\n- - - hax\n - 45\n - - ko\n - - 4\n - 5\n - - ax\n - less\n'

# Adding a representer for ordered dictionaries
from yaml.representer import Representer
yaml.add_representer(collections.OrderedDict, Representer.represent_dict)
yaml.dump(l)
# output: 'ax: less\nhax: 45\nko:\n- 4\n- 5\n'
Please let me know if this helps.
Another solution might be to use oyaml instead of PyYAML, as suggested in this post.
