Why do python dictionaries change order? - python

The order of objects stored in dictionaries in python3.5 changes over different executions of the interpreter, but it seems to stay the same for the same interpreter instance.
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
I always thought the order was based off of the hash of the key. Why is the order different between different executions of python?

Dictionaries use a hash function, and the order is indeed based on the hash of the key.
But, as stated somewhere in this Q&A, starting from Python 3.3 the seed of the hash is chosen randomly at startup (not to mention that the result also depends on the Python version):
Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary is then also dependent on the random hash seed for the current Python invocation.
So each time you execute your program, you may get a different order.
Since the order of dictionaries is not guaranteed (not before Python 3.7, anyway; CPython 3.6 preserves insertion order only as an implementation detail), this is an implementation detail that you shouldn't rely on.
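To see the effect of the random seed directly, here is a minimal sketch that starts fresh interpreters with a fixed PYTHONHASHSEED (a real CPython environment variable) and compares the hash of the same string. The helper name hash_of_a is made up for this illustration.

```python
import os
import subprocess
import sys


def hash_of_a(seed):
    """Return hash('a') as computed by a fresh interpreter with the given seed."""
    # Copy the current environment and pin the hash seed for the child process.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('a'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same seed: same hash. Different seed: (almost certainly) a different hash.
    print(hash_of_a(1), hash_of_a(1), hash_of_a(2))
```

Without PYTHONHASHSEED set, each interpreter picks its own seed, which is why the order in the question changes from run to run on Python 3.5.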

Dictionaries are inherently unordered. Expecting any standardized behavior of the "order" is not realistic.
To keep a fixed ordering, make an ordered list of .keys().
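A minimal sketch of that suggestion, assuming "ordered" means sorted by key (the variable names are only for illustration):

```python
d = {"b": 2, "a": 1, "c": 3}

# sorted(d) returns the keys as a sorted list, independent of the dict's
# internal order, so iteration is the same on every run.
ordered_keys = sorted(d)
pairs = [(key, d[key]) for key in ordered_keys]

print(pairs)
```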

Related

Test if python Counter is contained in another Counter

How can I test whether a Python Counter is contained in another one, using the following definition:
A Counter a is contained in a Counter b if, and only if, for every key k in a, the value a[k] is less than or equal to the value b[k]. The Counter({'a': 1, 'b': 1}) is contained in Counter({'a': 2, 'b': 2}) but it is not contained in Counter({'a': 2, 'c': 2}).
I think it is a poor design choice but in python 2.x the comparison operators (<, <=, >=, >) do not use the previous definition, so the third Counter is considered greater-than the first. In python 3.x, instead, Counter is an unorderable type.
The best I came up with is to convert the definition I gave into code:
def contains(container, contained):
    return all(container[x] >= contained[x] for x in contained)
But it feels strange that Python doesn't have an out-of-the-box solution and I have to write a function for every operator (or make a generic one and pass the comparison function).
While Counter instances are not comparable with the < and > operators, you can find their difference with the - operator. The difference never returns negative counts, so if A - B is empty, you know that B contains all the items in A.
def contains(larger, smaller):
    return not smaller - larger
For all the keys in smaller Counter make sure that no value is greater than its counterpart in the bigger Counter:
def containment(big, small):
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    return not any(v > big[k] for (k, v) in small.items())
>>> containment(Counter({'a': 2, 'b': 2}), Counter({'a': 1, 'b': 1}))
True
>>> containment(Counter({'a': 2, 'c': 2, 'b': 3}), Counter({'a': 2, 'b': 2}))
True
>>> containment(Counter({'a': 2, 'b': 2}), Counter({'a': 2, 'b': 2, 'c':1}))
False
>>> containment(Counter({'a': 2, 'c': 2}), Counter({'a': 1, 'b': 1}))
False
Another, fairly succinct, way to express this:
"Counter A is a subset of Counter B" is equivalent to (A & B) == A.
That's because the intersection (&) of two Counters has the counts of elements common to both. That'll be the same as A if every element of A (counting multiplicity) is also in B; otherwise it will be smaller.
Performance-wise, this seems to be about the same as the not A - B method proposed by Blckknght. Checking each key as in the answer of enrico.bacis is considerably faster.
As a variation, you can also check that the union is equal to the larger Counter (so nothing was added): (A | B) == B. This is noticeably slower for some largish multisets I tested (1,000,000 elements).
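For comparison, the approaches from the answers above can be put side by side. This is only a sketch: the function names are made up here, and it assumes all counts are positive.

```python
from collections import Counter


def contains_by_subtraction(big, small):
    # Counter subtraction drops non-positive counts, so an empty result
    # means nothing in `small` exceeds its count in `big`.
    return not (small - big)


def contains_by_intersection(big, small):
    # The intersection keeps the minimum of each count.
    return (small & big) == small


def contains_by_union(big, small):
    # The union keeps the maximum of each count.
    return (small | big) == big


def contains_by_keys(big, small):
    # Direct translation of the definition in the question.
    return all(big[key] >= count for key, count in small.items())


a = Counter({'a': 1, 'b': 1})
b = Counter({'a': 2, 'b': 2})
c = Counter({'a': 2, 'c': 2})

checks = (contains_by_subtraction, contains_by_intersection,
          contains_by_union, contains_by_keys)
print([check(b, a) for check in checks])  # a is contained in b
print([check(c, a) for check in checks])  # a is not contained in c
```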
While of historical interest, all these answers are obsolete.
Counter class objects are in fact comparable
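As a sketch of what this refers to: Python 3.10 added rich comparison operators to Counter, so <= can express containment directly. The helper name is_contained and the fallback for older interpreters are assumptions made for this example.

```python
import sys
from collections import Counter


def is_contained(small, big):
    """True if every count in `small` is <= the matching count in `big`."""
    if sys.version_info >= (3, 10):
        # Counter supports <=, <, >= and > as multiset comparisons here.
        return small <= big
    # Fallback for older Python 3 versions, where <= raises TypeError.
    return not (small - big)


print(is_contained(Counter({'a': 1, 'b': 1}), Counter({'a': 2, 'b': 2})))
print(is_contained(Counter({'a': 1, 'b': 1}), Counter({'a': 2, 'c': 2})))
```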

Joblib - how to parallelize modifying a variable in memory

I have a question regarding joblib. I am working with networkX graphs, and wanted to parallelize the modification of edges, since iterating over the edge list is indeed an embarrassingly parallel problem. In doing so, I thought of running a simplified version of the code.
I have a variable x. It is a list of lists, akin to an edge list, though I understand that networkX returns a list of tuples for the edge list, and has primarily a dictionary-based implementation. Please bear with this simple example for the moment.
x = [[0, 1, {'a': 1}],
     [1, 3, {'a': 3}]]
I have two functions that modify the dictionary's 'a' value to be either the sum or the difference of the first two values. They are defined as such:
def additive_a(edge):
    edge[2]['a'] = edge[0] + edge[1]
def subtractive_a(edge):
    edge[2]['a'] = edge[0] - edge[1]
If I do a regular for loop, the variable x can be modified properly:
for edge in x:
    subtractive_a(edge)  # or additive_a(edge) works as well.
Result:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
However, when I try doing it with joblib, I cannot get the desired result:
Parallel(n_jobs=8)(delayed(subtractive_a)(edge) for edge in x)
# I understand that n_jobs=8 for a two-item list is overkill.
The desired result is:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
When I check x, it is unchanged:
[[0, 1, {'a': 1}], [1, 3, {'a': 3}]]
I am unsure as to what is going on here. I can understand the example provided in the joblib documentation - which specifically showed computing an array of numbers using a single, simple function. However, that did not involve modifying an existing object in memory, which is what I think I'm trying to do. Is there a solution to this? How would I modify this code to parallelize the modification of a single object in memory?
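One likely explanation, assuming joblib's default process-based backend: each worker receives a pickled copy of its arguments, so in-place changes never reach the parent's x. The sketch below simulates that copy with the standard pickle module instead of calling joblib. The helper subtractive_value is hypothetical and returns the result instead of mutating its argument.

```python
import pickle


def subtractive_a(edge):
    edge[2]['a'] = edge[0] - edge[1]


def subtractive_value(edge):
    # Hypothetical variant: return the new value instead of mutating.
    return edge[0] - edge[1]


x = [[0, 1, {'a': 1}],
     [1, 3, {'a': 3}]]

# Simulate what a worker process sees: a pickled copy of each edge.
for edge in x:
    worker_copy = pickle.loads(pickle.dumps(edge))
    subtractive_a(worker_copy)  # mutates only the copy

after_mutation_in_workers = [edge[2]['a'] for edge in x]  # still the old values

# Alternative: compute in the "worker", then assign in the parent.
# With joblib this list would come from something like
#   Parallel(n_jobs=2)(delayed(subtractive_value)(edge) for edge in x)
# which returns results in input order.
results = [subtractive_value(pickle.loads(pickle.dumps(edge))) for edge in x]
for edge, value in zip(x, results):
    edge[2]['a'] = value

print(after_mutation_in_workers, x)
```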

Python Attributes Performance

I have a general question about the performance of Python.
If I "give" a method a variable, class or similar, does the size (data, methods...) of this object affect the speed of the program?
def function(foo):
    pass
function(superHeavyObject)
function(superLightObject)
And: is a big dictionary/list slower than a small one?
dict1 = {"a": 1, "b": 2, "c": 3}
dict1["c"]
dict2 = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, ...}
dict2["c"]
Thanks!
No, the size of the object doesn't affect how fast you can pass it around. References are just references, whatever the size of the object they point to.
Dictionaries normally have an average O(1) lookup, so it doesn't matter how big they are, the key lookup takes constant time. See Time Complexity in the Python Wiki.
Only if you were to produce massively colliding keys (keys that all hash to the same slot) would you run into the theoretical worst case of O(n) lookups. In practice this doesn't happen. Security researchers did make use of this theoretical possibility to show how you could DoS a Python program, but recent Python releases have mitigated this by adding a random component to hashes.
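The point about references can be checked with a small sketch (the names are illustrative). The function receives the very same object, whatever its size, so nothing is copied on the call.

```python
def received_same_object(arg, original):
    # `is` compares identity: True means no copy was made for the call.
    return arg is original


super_heavy_object = list(range(1000000))
super_light_object = [0]

print(received_same_object(super_heavy_object, super_heavy_object))
print(received_same_object(super_light_object, super_light_object))
```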

In what order does a dictionary in python store data? [duplicate]

I was wondering in what order a Python dictionary stores its key : value pairs. I wrote the following in my Python shell, but I can't figure out the reason for the order in which it stores the key : value pairs.
>>> d = {}
>>> d['a'] = 8
>>> d['b'] = 8
>>> d
{'a': 8, 'b': 8}
>>> d['c'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8}
>>> d['z'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8, 'z': 8}
>>> d['w'] = 8
>>> d
{'a': 8, 'c': 8, 'b': 8, 'z': 8, 'w': 8}
I also tried the same thing with different values for the same keys, but the order remained the same. Adding one more key : value pair gives another result that I just can't make out. Here it is:
>>> d[1] = 8
>>> d
{'a': 8, 1: 8, 'c': 8, 'b': 8, 'w': 8, 'z': 8}
The short answer is: in an implementation-defined order. You can't rely on, and shouldn't expect, any particular order, and it can change after modifying the dictionary in a seemingly irrelevant way.
Although not directly, it's somewhat explained in Dictionary view objects:
Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.
Elements are stored based on the hash of their key. The documentation states that a key must be a hashable type.
Dictionaries do not have a predictable order as their keys are stored by a hash. If you need order, use a list or collections.OrderedDict.
It's a hash table. The keys are partially ordered by their hash value hash(key), but the actual traversal order of the dictionary can depend on the order that elements were inserted, the number of elements in the dictionary, and possibly other factors. You should never count on it being anything in particular.
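The answers above describe the older, arbitrary ordering. As a later answer on this page notes, insertion order is guaranteed from Python 3.7. The sketch below assumes such an interpreter and shows that iteration order still depends on the dictionary's history of insertions and deletions, not on the keys themselves.

```python
d = {'a': 1, 'b': 2, 'c': 3}
before = list(d)

# Deleting a key and inserting it again moves it to the end.
del d['a']
d['a'] = 1
after = list(d)

# Two dicts with the same items still compare equal, whatever their order.
same_items = d == {'a': 1, 'b': 2, 'c': 3}

print(before, after, same_items)
```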

Why python does not include a ordered dict (by default)?

Python has some great structures to model data.
Here are some:
              +-------------------------+-----------------------------------+
              | indexed by int          | no-indexed by int                 |
+-------------+-------------------------+-----------------------------------+
| no-indexed  | [1, 2, 3]               | {1, 2, 3}                         |
| by key      | or                      | or                                |
|             | [x+1 for x in range(3)] | {x+1 for x in range(3)}           |
+-------------+-------------------------+-----------------------------------+
| indexed     |                         | {'a': 97, 'c': 99, 'b': 98}       |
| by key      |                         | or                                |
|             |                         | {chr(x):x for x in range(97,100)} |
+-------------+-------------------------+-----------------------------------+
Why does Python not include by default a structure indexed by key+int (like a PHP array)? I know there is a library that emulates this object (http://docs.python.org/3/library/collections.html#ordereddict-objects). But here is the representation of an "OrderedDict" taken from the documentation:
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])
Wouldn't it be better to have a native type that would logically be written like this:
['a': 97, 'b': 98, 'c': 99]
And same logic for orderedDict comprehension :
[chr(x):x for x in range(97,100)]
Does it make sense to fill the table cell like this in the Python design?
Is there any particular reason why this has not been implemented yet?
Python's dictionaries are implemented as hash tables. Those are inherently unordered data structures. While it is possible to add extra logic to keep track of the order (as is done in collections.OrderedDict in Python 2.7 and 3.1+), there's a non-trivial overhead involved.
For instance, the recipe that the collections documentation suggests for use in Python 2.4-2.6 requires more than twice as much work to complete many basic dictionary operations (such as adding and removing values). This is because it must maintain a doubly-linked list to use for ordered iteration, and it needs an extra dictionary to help maintain the list. While its operations are still O(1), the constant terms are larger.
Since Python uses dict instances everywhere (for all variable lookups, for instance), they need to be very fast or every part of every program will suffer. Since ordered iteration is not needed very often, it makes sense to avoid the overhead it requires in the general case. If you need an ordered dictionary, use the one in the standard library (or the recipe it suggests, if you're using an earlier version of Python).
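To make the overhead concrete, here is a toy version of the approach described above: a plain dict for the values, a doubly-linked list for the order, and an extra dict that maps each key to its list node. The class name LinkedOrderedDict is made up, and this is a simplified sketch, not the actual recipe or the standard library implementation.

```python
class LinkedOrderedDict:
    """Toy ordered mapping: value dict + circular doubly-linked list of keys."""

    def __init__(self):
        self._values = {}
        # Sentinel node of the form [prev, next, key]; it links to itself.
        self._root = root = []
        root[:] = [root, root, None]
        self._links = {}  # extra dict: key -> its node in the linked list

    def __setitem__(self, key, value):
        if key not in self._values:
            # New key: append a node at the end of the linked list.
            root = self._root
            last = root[0]
            node = [last, root, key]
            last[1] = root[0] = self._links[key] = node
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def __delitem__(self, key):
        del self._values[key]  # raises KeyError if the key is missing
        prev_node, next_node, _ = self._links.pop(key)
        # Unlink the node by joining its neighbours.
        prev_node[1] = next_node
        next_node[0] = prev_node

    def __iter__(self):
        node = self._root[1]
        while node is not self._root:
            yield node[2]
            node = node[1]


d = LinkedOrderedDict()
d['b'] = 1
d['a'] = 2
d['c'] = 3
del d['a']
d['a'] = 4
d['b'] = 9  # updating an existing key keeps its position
print(list(d))
```

Every insertion and deletion touches two dicts and several list cells instead of one dict, which is where the larger constant factor comes from.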
Your question appears to be "why does Python not have native PHP-style arrays with ordered keys?"
Python has three core non-scalar datatypes: list, dict, and tuple. Dicts and tuples are absolutely essential for implementing the language itself: they are used for assignment, argument unpacking, attribute lookup, etc. Although not really used for the core language semantics, lists are pretty essential for data and programs in Python. All three must be extremely lightweight, have very well-understood semantics, and be as fast as possible.
PHP-style arrays are none of these things. They are not fast or lightweight, have poorly defined runtime complexity, and they have confused semantics since they can be used for so many different things--look at the array functions. They are actually a terrible datatype for almost every use case except the very narrow one for which they were created: representing x-www-form-encoded data. Even for this use case a failing is that later keys overwrite the value of earlier keys: in PHP ?a=1&a=2 results in array('a'=>2). (A common structure for dealing with this in Python is the MultiDict, which has ordered keys and values, and each key can have multiple values.)
PHP has one datatype that must be used for pretty much every use case without being great for any of them. Python has many different datatypes (some core, many more in external libraries) which excel at much more narrow use cases.
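The repeated-key problem can be sketched with the standard library alone. This uses urllib.parse rather than a MultiDict class, which comes from third-party libraries, so it shows the same idea in a different form.

```python
from urllib.parse import parse_qs, parse_qsl

query = "a=1&a=2&b=3"

# parse_qs keeps every value for a repeated key, in order.
as_lists = parse_qs(query)

# parse_qsl keeps the original sequence of (key, value) pairs.
as_pairs = parse_qsl(query)

# Collapsing the pairs into a plain dict loses the first value of "a",
# which mirrors the PHP behaviour described above.
last_wins = dict(as_pairs)

print(as_lists, as_pairs, last_wins)
```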
Adding a new answer with updated information: as of CPython 3.6, dicts preserve insertion order, though they are still not index-accessible, most likely because integer-based item lookup would be ambiguous since dict keys can be ints. (Some custom use cases exist.)
Unfortunately, the documentation for dict hasn't been updated to reflect this (yet) and still says "Keys and values are iterated over in an arbitrary order which is non-random". Ironically, the collections.OrderedDict docs mention the new behaviour:
Changed in version 3.6: With the acceptance of PEP 468, order is retained for keyword arguments passed to the OrderedDict constructor and its update() method.
And here's an article mentioning some more details about it:
A minor but useful internal improvement: Python 3.6 preserves the order of elements for more structures. Keyword arguments passed to a function, attribute definitions in a class, and dictionaries all preserve the order of elements as they were defined.
So if you're only writing code for Py36 onwards, you shouldn't need collections.OrderedDict unless you're using popitem, move_to_end or order-based equality.
Example, in Python 2.7:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 0: None, 'c': 3, 'b': 2, 'd': 4}
And in Python 3.6:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d['new'] = 'really?'
>>> d[None]= None
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>> d['a'] = 'aaa'
>>> d
{'a': 'aaa', 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>>
>>> # equality is not order-based
>>> d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d2 = {'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d2
{'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d1 == d2
True
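The cases where collections.OrderedDict still differs from a plain dict, as listed above (order-based equality, move_to_end and popitem), can be sketched like this on Python 3.7 or later:

```python
from collections import OrderedDict

pairs_ab = [('a', 1), ('b', 2)]
pairs_ba = [('b', 2), ('a', 1)]

# Plain dicts ignore order when compared; OrderedDicts do not.
plain_equal = dict(pairs_ab) == dict(pairs_ba)
ordered_equal = OrderedDict(pairs_ab) == OrderedDict(pairs_ba)

od = OrderedDict(pairs_ab)
od.move_to_end('a')  # reorder in place; plain dict has no such method
order_after_move = list(od)

# OrderedDict.popitem can pop from the front; dict.popitem is always LIFO.
first_item = od.popitem(last=False)

print(plain_equal, ordered_equal, order_after_move, first_item)
```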
As of Python 3.7 this is the default behavior for dictionaries; it was an implementation detail in 3.6 that was officially adopted as of June 2018 :')
the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.
https://docs.python.org/3/whatsnew/3.7.html
