I have a general question about the performance of Python.
If I pass a variable, class, or similar object to a method, does the size (data, methods, ...) of that object affect the speed of the program?
def function(foo):
    pass
function(superHeavyObject)
function(superLightObject)
And: is a big dictionary/list slower than a small one?
dict1 = {"a": 1, "b": 2, "c": 3}
dict1["c"]
dict2 = {"a": 1, "b":2, "c": 3, "d": 4, "e": 5 "f": 6, "g": 7, "h": 8, ...}
dict2["c"]
Thanks!
No, the size of the object doesn't affect how fast you can pass it around. References are just references, whatever the size of the object they point to.
Dictionaries normally have an average O(1) lookup, so it doesn't matter how big they are, the key lookup takes constant time. See Time Complexity in the Python Wiki.
Only if you were to produce massively colliding keys (keys that all hash to the same slot) would you run into the theoretical worst case of O(n) lookups. In practice this doesn't happen. Security researchers did make use of this theoretical possibility to show how you could DoS a Python program, but recent Python releases have mitigated this by adding a random component to hashes.
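A quick way to convince yourself of both points (a minimal, illustrative sketch using timeit; absolute numbers will differ on your machine):
import timeit

small = {"a": 1, "b": 2, "c": 3}
big = {str(i): i for i in range(1000000)}
big["c"] = 3

def function(obj):
    pass

# Passing a huge object costs the same as passing a tiny one: only a reference moves.
print(timeit.timeit(lambda: function(small), number=1000000))
print(timeit.timeit(lambda: function(big), number=1000000))

# Key lookup takes roughly constant time regardless of dict size.
print(timeit.timeit(lambda: small["c"], number=1000000))
print(timeit.timeit(lambda: big["c"], number=1000000))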
Related
I'm taking a data structures course in Python, and a suggestion for a solution includes this code which I don't understand.
This is a sample of a dictionary:
vc_metro = {
    'Richmond-Brighouse': set(['Lansdowne']),
    'Lansdowne': set(['Richmond-Brighouse', 'Aberdeen'])
}
It is suggested that to remove some of the elements in the value, we use this code:
vc_metro['Lansdowne'] -= set(['Richmond-Brighouse'])
I have never seen such a structure, and using it in a basic situation such as:
my_list = [1, 2, 3, 4, 5, 6]
other_list = [1, 2]
my_list -= other_list
doesn't work. Where can I learn more about this recommended strategy?
You can't subtract lists, but you can subtract set objects meaningfully. Sets are hash tables, somewhat similar to dict.keys(), which allow only one instance of each object.
The -= operator is equivalent to the difference method, except that it is in-place. It removes all the elements that are present in both operands from the left one.
Your simple example with sets would look like this:
>>> my_set = {1, 2, 3, 4, 5, 6}
>>> other_set = {1, 2}
>>> my_set -= other_set
>>> my_set
{3, 4, 5, 6}
Curly braces with commas but no colons are interpreted as a set object. So the direct constructor call
set(['Richmond-Brighouse'])
is equivalent to
{'Richmond-Brighouse'}
Notice that you can't do set('Richmond-Brighouse'): that would add all the individual characters of the string to the set, since strings are iterable.
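For example, with a shorter string for brevity (shown via sorted(), since the display order of a set is arbitrary):
>>> sorted(set('abc'))
['a', 'b', 'c']
>>> {'abc'}
{'abc'}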
The reason to use -=/difference instead of remove is that differencing only removes existing elements, and silently ignores others. The discard method does this for a single element. Differencing allows removing multiple elements at once.
The original line vc_metro['Lansdowne'] -= set(['Richmond-Brighouse']) could be rewritten as
vc_metro['Lansdowne'].discard('Richmond-Brighouse')
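A small sketch of the difference: remove raises KeyError for a missing element, while discard and -= do not.
>>> s = {'Lansdowne', 'Aberdeen'}
>>> s.remove('Richmond-Brighouse')
Traceback (most recent call last):
  ...
KeyError: 'Richmond-Brighouse'
>>> s.discard('Richmond-Brighouse')          # no error, s is unchanged
>>> s -= {'Richmond-Brighouse', 'Aberdeen'}  # removes what it finds, ignores the rest
>>> s
{'Lansdowne'}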
While helping a co-worker troubleshoot a problem, I saw something I wasn't aware Python could do. Compared to other ways of doing this, I'm curious how its performance and time complexity stack up, and which approach is best for the sake of performance.
What my co-worker did that prompted this question:
list_of_keys = []
test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}
list_of_keys.extend(test_dict)
print(list_of_keys)
['foo', 'bar']
vs other examples I have seen:
list_of_keys = []
test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}
for i in test_dict.keys():
    list_of_keys.append(i)
and
keys = list(test_dict)
Which one of these is the most beneficial and the most Pythonic for the sake of simply appending keys? Which one yields the best performance?
As the docs explain, s.extend(t):
extends s with the contents of t (for the most part the same as s[len(s):len(s)] = t)
OK, so that isn't very clear as to whether it should be faster or slower than calling append in a loop. But it is a little faster—the looping is happening in C rather than in Python, and it can use some special optimized code for adding onto the list because it knows you're not touching the list at the same time.
More importantly, it's a lot simpler, more readable, and harder to get wrong.
As for starting with an empty list and then extending it (or appending to it), there's no good reason to do that. If you already have a list with some values in it, and want to add the dict keys, then use extend. But if you just want to create a list of the keys, just do list(d).
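A small sketch of both situations:
test_dict = {'foo': 1, 'bar': [1, 2, 3, 4, 5]}

# Just need a list of the keys? This is the simplest spelling.
keys = list(test_dict)            # ['foo', 'bar']

# Already have a non-empty list and want the keys appended to it? Use extend.
list_of_keys = ['existing']
list_of_keys.extend(test_dict)    # ['existing', 'foo', 'bar']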
As for d.keys() vs. d, there's really no difference at all. Whether you iterate over a dict or its dict_keys view, you get the exact same values iterated, even using the exact same dict_keyiterator. The extra call to keys() does make things a tiny bit slower, but that's a fixed cost, not once per element, so unless your dicts are tiny, you won't see any noticeable difference.
So, do whichever one seems more readable in the circumstances. Generally speaking, the only reason you want to loop over d.keys() is when you want to make it clear that you're iterating over a dict's keys, but it isn't obvious from the surrounding code that d is a dict.
Among other things, you also asked about complexity.
All of these solutions have the same (linear) complexity, because they all do the same thing under the covers: for every key in the dictionary, append it to the end of a list. That's one step per key, and the complexity of each step is amortized constant (because Python lists expand exponentially), so the total time is O(N), where N is the length of the dict.
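You can see the over-allocation that makes each append amortized O(1) by watching the list's size in memory: it grows in occasional jumps rather than on every append (exact numbers are a CPython implementation detail):
import sys

lst = []
last_size = None
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # The footprint only changes now and then because CPython over-allocates,
        # so most appends don't have to resize the underlying array.
        print(len(lst), size)
        last_size = size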
After #thebjorn mentioned the timeit module, it seems that calling extend is the fastest.
It seems that list() is the most Pythonic for the sake of readability and cleanliness.
The most beneficial option seems dependent on the use case, but doing this at all is more or less redundant, as mentioned in a comment. This was discovered from a mistake and I got curious.
timeit.timeit("for i in {'foo': 1, 'bar': [1, 2, 3, 4, 5]}.keys():[].append(i)", number=1000000)
0.6147394659928977
timeit.timeit("[].extend({'foo': 1, 'bar': [1, 2, 3, 4, 5]})", number=1000000)
0.36140396299015265
timeit.timeit("list({'foo': 1, 'bar': [1, 2, 3, 4, 5]})", number=1000000)
0.4726199270080542
The order of objects stored in dictionaries in python3.5 changes over different executions of the interpreter, but it seems to stay the same for the same interpreter instance.
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
{'b': 2, 'a': 1}
$ python3 <(printf 'print({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})\nprint({"a": 1, "b": 2})')
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
{'a': 1, 'b': 2}
I always thought the order was based off of the hash of the key. Why is the order different between different executions of python?
Dictionaries use a hash function, and the order is indeed based on the hash of the key.
But, as stated elsewhere in this Q&A, starting from Python 3.3 the seed of the hash is randomly chosen at execution time (not to mention that it also depends on the Python version).
Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary is then also dependent on the random hash seed for the current Python invocation.
So each time you execute your program, you may get a different order.
Since the order of dictionaries is not guaranteed (not before Python 3.6, anyway), this is an implementation detail that you shouldn't rely on.
Dictionaries are inherently unordered; expecting any standardized behavior of the "order" is not realistic.
To keep a consistent ordering, make a sorted list of .keys().
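If you actually need a reproducible order, sort the keys yourself instead of relying on the dict's internal layout (a minimal sketch; for debugging you can also set the PYTHONHASHSEED environment variable to disable hash randomization):
d = {"a": 1, "b": 2, "c": 3}

# Deterministic across runs, Python versions, and hash seeds.
for key in sorted(d):
    print(key, d[key])

# For debugging only: running as  PYTHONHASHSEED=0 python3 script.py  disables hash randomization.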
I have a question regarding joblib. I am working with networkX graphs, and wanted to parallelize the modification of edges, since iterating over the edge list is indeed an embarrassingly parallel problem. In doing so, I thought of running a simplified version of the code.
I have a variable x. It is a list of lists, akin to an edge list, though I understand that networkX returns a list of tuples for the edge list, and has primarily a dictionary-based implementation. Please bear with this simple example for the moment.
x = [[0, 1, {'a': 1}],
     [1, 3, {'a': 3}]]
I have two functions that modify the dictionary's 'a' value to be either the sum or the difference of the first two values. They are defined as such:
def additive_a(edge):
    edge[2]['a'] = edge[0] + edge[1]

def subtractive_a(edge):
    edge[2]['a'] = edge[0] - edge[1]
If I do a regular for loop, the variable x can be modified properly:
for edge in x:
    subtractive_a(edge)  # or additive_a(edge) works as well
Result:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
However, when I try doing it with joblib, I cannot get the desired result:
Parallel(n_jobs=8)(delayed(subtractive_a)(edge) for edge in x)
# I understand that n_jobs=8 for a two-item list is overkill.
The desired result is:
[[0, 1, {'a': -1}], [1, 3, {'a': -2}]]
When I check x, it is unchanged:
[[0, 1, {'a': 1}], [1, 3, {'a': 3}]]
I am unsure as to what is going on here. I can understand the example provided in the joblib documentation - which specifically showed computing an array of numbers using a single, simple function. However, that did not involve modifying an existing object in memory, which is what I think I'm trying to do. Is there a solution to this? How would I modify this code to parallelize the modification of a single object in memory?
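Sketching the usual workaround (not from the original post): with joblib's default process-based backend each worker receives a pickled copy of edge, so in-place mutations never reach the parent's x. Returning the modified items and rebuilding the list avoids that:
from joblib import Parallel, delayed

def subtractive_a(edge):
    edge[2]['a'] = edge[0] - edge[1]
    return edge  # return the (worker-local) modified edge instead of relying on mutation

x = [[0, 1, {'a': 1}],
     [1, 3, {'a': 3}]]

# Collect the returned values; the original x is not changed in place.
x = Parallel(n_jobs=2)(delayed(subtractive_a)(edge) for edge in x)
print(x)  # [[0, 1, {'a': -1}], [1, 3, {'a': -2}]]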
Python has some great structures for modeling data.
Here are some:
+-------------+-------------------------+-----------------------------------+
|             | indexed by int          | not indexed by int                |
+-------------+-------------------------+-----------------------------------+
| not indexed | [1, 2, 3]               | {1, 2, 3}                         |
| by key      | or                      | or                                |
|             | [x+1 for x in range(3)] | {x+1 for x in range(3)}           |
+-------------+-------------------------+-----------------------------------+
| indexed     |                         | {'a': 97, 'c': 99, 'b': 98}       |
| by key      |                         | or                                |
|             |                         | {chr(x):x for x in range(97,100)} |
+-------------+-------------------------+-----------------------------------+
Why does Python not include, by default, a structure indexed by both key and int (like a PHP array)? I know there is a class that emulates this object (http://docs.python.org/3/library/collections.html#ordereddict-objects). But here is the representation of an OrderedDict taken from the documentation:
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])
Wouldn't it be better to have a native type that could logically be written like this:
['a': 97, 'b': 98, 'c': 99]
And the same logic for an OrderedDict comprehension:
[chr(x):x for x in range(97,100)]
Does it make sense to fill that table cell like this in Python's design?
Is there any particular reason for this not to have been implemented yet?
Python's dictionaries are implemented as hash tables. Those are inherently unordered data structures. While it is possible to add extra logic to keep track of the order (as is done in collections.OrderedDict in Python 2.7 and 3.1+), there's a non-trivial overhead involved.
For instance, the recipe that the collections documentation suggests for use in Python 2.4-2.6 requires more than twice as much work to complete many basic dictionary operations (such as adding and removing values). This is because it must maintain a doubly-linked list to use for ordered iteration, and it needs an extra dictionary to help maintain that list. While its operations are still O(1), the constant factors are larger.
Since Python uses dict instances everywhere (for all variable lookups, for instance), they need to be very fast or every part of every program will suffer. Since ordered iteration is not needed very often, it makes sense to avoid the overhead it requires in the general case. If you need an ordered dictionary, use the one in the standard library (or the recipe it suggests, if you're using an earlier version of Python).
Your question appears to be "why does Python not have native PHP-style arrays with ordered keys?"
Python has three core non-scalar datatypes: list, dict, and tuple. Dicts and tuples are absolutely essential for implementing the language itself: they are used for assignment, argument unpacking, attribute lookup, etc. Although not really used for the core language semantics, lists are pretty essential for data and programs in Python. All three must be extremely lightweight, have very well-understood semantics, and be as fast as possible.
PHP-style arrays are none of these things. They are not fast or lightweight, have poorly defined runtime complexity, and they have confused semantics since they can be used for so many different things--look at the array functions. They are actually a terrible datatype for almost every use case except the very narrow one for which they were created: representing x-www-form-encoded data. Even for this use case a failing is that later keys overwrite the value of earlier keys: in PHP ?a=1&a=2 results in array('a'=>2). (A common structure for dealing with this in Python is the MultiDict, which has ordered keys and values, and each key can have multiple values.)
PHP has one datatype that must be used for pretty much every use case without being great for any of them. Python has many different datatypes (some core, many more in external libraries) which excel at much more narrow use cases.
Adding a new answer with updated information: as of CPython 3.6, dicts preserve insertion order. They are still not index-accessible, though, most likely because integer-based item lookup is ambiguous, since dict keys can themselves be ints. (Some custom use cases exist.)
Unfortunately, the documentation for dict hasn't been updated to reflect this (yet) and still says "Keys and values are iterated over in an arbitrary order which is non-random". Ironically, the collections.OrderedDict docs mention the new behaviour:
Changed in version 3.6: With the acceptance of PEP 468, order is retained for keyword arguments passed to the OrderedDict constructor and its update() method.
And here's an article mentioning some more details about it:
A minor but useful internal improvement: Python 3.6 preserves the order of elements for more structures. Keyword arguments passed to a function, attribute definitions in a class, and dictionaries all preserve the order of elements as they were defined.
So if you're only writing code for Py36 onwards, you shouldn't need collections.OrderedDict unless you're using popitem, move_to_end or order-based equality.
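For instance, move_to_end and popitem(last=False) have no plain-dict equivalent (a quick sketch):
>>> from collections import OrderedDict
>>> od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
>>> od.move_to_end('a')      # push 'a' to the end
>>> list(od)
['b', 'c', 'a']
>>> od.popitem(last=False)   # pop from the front instead of the back
('b', 2)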
Example, in Python 2.7:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 0: None, 'c': 3, 'b': 2, 'd': 4}
And in Python 3.6:
>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d['new'] = 'really?'
>>> d[None]= None
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>> d['a'] = 'aaa'
>>> d
{'a': 'aaa', 'b': 2, 'c': 3, 'd': 4, 0: None, 'new': 'really?', None: None}
>>>
>>> # equality is not order-based
>>> d1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 0: None}
>>> d2 = {'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d2
{'b': 2, 'a': 1, 'd': 4, 'c': 3, 0: None}
>>> d1 == d2
True
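By contrast, OrderedDict comparisons are order-sensitive, which is one of the remaining reasons to reach for it (continuing the session above; assumes CPython 3.6+ so the literals keep insertion order):
>>> from collections import OrderedDict
>>> OrderedDict(d1) == OrderedDict(d2)
False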
As of Python 3.7 this is now the default behavior for dictionaries; it was an implementation detail in 3.6 that was officially adopted as of June 2018 :')
the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.
https://docs.python.org/3/whatsnew/3.7.html