I have this piece of code:
import time

d = dict()
for i in range(200000):
    d[i] = "DUMMY"

start_time = time.time()
for i in range(200000):
    for key in d:
        if len(d) > 1 or -1 not in d:
            break
    del d[i]
print("--- {} seconds ---".format(time.time() - start_time))
Why does this take ~15 seconds to run?
But, if I comment out del d[i] or the inner loop, it runs in ~0.1 seconds.
The issue you have is caused by iterating over even one element (e.g. next(iter(d))) of a dictionary that was once large but has been shrunk a great deal. This can be nearly as slow as iterating over all of the dictionary's items if you get unlucky with your hash values. And this code is very "unlucky" (predictably so, due to Python's hash design).
The reason for the issue is that Python does not rebuild a dictionary's hash table when you remove items. So the hash table for a dictionary that used to have 200000 items in it, but which now has only 1 left, still has more than 200000 slots in it (probably more, since it was likely not entirely full at its peak).
When the dictionary is full and you iterate over it, finding the first entry is quick: it will be in one of the first few table slots. But as you empty out the table, more and more blank slots accumulate at the start of the table, and the search for the first surviving value takes longer and longer.
This is made even worse by the fact that you're using integer keys, which (mostly) hash to themselves (only -1 hashes to something else). This means that the first key in the "full" dictionary will usually be 0, the next 1, and so on. As you delete the values in increasing order, you remove precisely the earliest keys in the table first, making the searches maximally worse.
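A quick way to see that (a sketch; hash values for small ints are a CPython implementation detail, but stable in practice):

# In CPython, small integers hash to themselves; -1 is the exception,
# because -1 is reserved as an error code at the C level.
for n in (0, 1, 2, 200000, -1):
    print("{} {}".format(n, hash(n)))
# 0 0
# 1 1
# 2 2
# 200000 200000
# -1 -2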
It's because this
for key in d:
    if len(d) > 1 or -1 not in d:
        break
will break on the first iteration, so your inner loop is basically a no-op.
Adding del d[i] makes it do some real work, which takes time.
Update: well, the above is obviously way too simplistic :-)
The following version of your code shows the same characteristic:
import time
import gc
import dis

n = 140000

def main(d):
    for i in range(n):
        del d[i]        # A
        for key in d:   # B
            break       # B

d = dict()
for i in range(n):
    d[i] = "DUMMY"

print dis.dis(main)
start_time = time.time()
main(d)
print("--- {} seconds ---".format(time.time() - start_time))
Using iterkeys doesn't make a difference.
If we plot the run time for different values of n (n on the x-axis, seconds on the y-axis), the growth is clearly superlinear (it looks roughly quadratic).
Deleting line (A) or lines (B) removes the superlinear component, although I'm not sure why.
Update 2: Based on @Blckknght's answer, we can regain some of the speed by infrequently rehashing the items:
def main(d):
    for i in range(n):
        del d[i]
        if i % 5000 == 0:
            d = {k: v for k, v in d.items()}
        for key in d:
            break
or this:
def main(d):
    for i in range(n):
        del d[i]
        if i % 6000 == 0:
            d = {k: v for k, v in d.items()}
        try:
            iter(d).next()
        except StopIteration:
            pass
Either version takes under half the time of the original on large n (the bump at 130000 is consistent over 4 runs).
There seems to be some performance cost to accessing the keys as a whole after deleting an item. This cost is not incurred when you do direct accesses, so my guess is that the dictionary flags its key list as dirty when an item is removed and waits for a reference to the key list before updating/rebuilding it.
This explains why you don't get a performance hit when you remove the inner loop (you're not causing the key list to be rebuilt). It also explains why the loop is fast when you remove the del d[i] line (you're not flagging the key list for rebuilding).
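For what it's worth, here is a small experiment (my own sketch; exact timings depend on the CPython version) that isolates the effect Blckknght describes: grabbing the first key of a shrunk dict is slow, while a freshly rebuilt copy of the same dict is fast.

import time

d = {i: "DUMMY" for i in range(200000)}
for i in range(199999):
    del d[i]                      # shrink to one item, but the big hash table remains

start = time.time()
for _ in range(1000):
    next(iter(d))                 # each call must skip the ~200000 now-empty slots
print("shrunk dict:  {}".format(time.time() - start))

d2 = dict(d)                      # rebuilding allocates a small, dense table
start = time.time()
for _ in range(1000):
    next(iter(d2))
print("rebuilt dict: {}".format(time.time() - start))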
Related
I have two lists of dictionaries:
dict_list1 = [{'k1':1, 'k2':2}, {'k1':3, 'k2':4}]
dict_list2 = [{'k1':1, 'k2':2, 'k3':10}, {'k1':3, 'k2':4, 'k3':10}]
And now, for each dict_x in dict_list1, I want to know if there is a dict_y in dict_list2 that contains every key/value pair from dict_x.
I cannot think of another way of doing this other than:
for dict_x in dict_list1:
    for dict_y in dict_list2:
        count = len(dict_x)
        for key, val in dict_x.items():
            if key in dict_y and dict_y[key] == val:
                count -= 1
                if count == 0:
                    print('YAY')
                    break
dict views support quick "is subset" testing via the comparison operators (<= tests "is a subset of"). So:
if dict_x.items() <= dict_y.items(): # Use .viewitems() instead of .items() on Python 2.7
will only be True if every key/value pair in dict_x also appears in dict_y.
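A quick check of that behaviour with the example data from the question (Python 3 items views; on 2.7 you'd use .viewitems() as noted above):

dict_x = {'k1': 1, 'k2': 2}
dict_y = {'k1': 1, 'k2': 2, 'k3': 10}

print(dict_x.items() <= dict_y.items())   # True: every pair of dict_x is in dict_y
print(dict_y.items() <= dict_x.items())   # False: dict_x has no 'k3'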
This won't change anything in terms of big-O performance, but it does make the code somewhat cleaner:
for dict_x in dict_list1:
    for dict_y in dict_list2:
        if dict_x.items() <= dict_y.items():
            print('YAY')
            break
Note that creating the views costs something (it's just a fixed cost, not dependent on dict size), so if performance matters, it may be worth caching the views; doing so for dict_list1 is free:
for dict_x in dict_list1:
    dict_x_view = dict_x.items()
    for dict_y in dict_list2:
        if dict_x_view <= dict_y.items():
            print('YAY')
            break
but some eager conversions would be needed to cache both:
# Convert all of dict_list2 to views up front; costs a little if
# not all views end up being tested (we always break before finishing)
# but usually saves some work at the cost of a tiny amount of memory
dict_list2_views = [x.items() for x in dict_list2]
for dict_x in dict_list1:
    dict_x_view = dict_x.items()
    for dict_y_view in dict_list2_views:
        if dict_x_view <= dict_y_view:
            print('YAY')
            break
You could also collapse the loop using any (which removes the need to break since any short-circuits), so the first (simplest) check could become:
for dict_x in dict_list1:
    if any(dict_x.items() <= dict_y.items() for dict_y in dict_list2):
        print('YAY')
This could be further collapsed using a generator expression that yields the matching dicts, but at that point the code is getting pretty cramped/ugly:
for _ in (dict_x for dict_x in dict_list1
          if any(dict_x.items() <= dict_y.items() for dict_y in dict_list2)):
    print('YAY')
though without knowing what you'd really do (as opposed to just printing YAY) that's getting a little pointless.
Below, I use the fact that the dict.items view implements set operations to check, for each d1.items(), whether there exists a d2.items() such that d1.items() is a subset of d2.items():
[any(d1.items() <= d2.items() for d2 in dict_list2) for d1 in dict_list1]
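With the example lists from the question, this evaluates as expected (Python 3 items views):

dict_list1 = [{'k1': 1, 'k2': 2}, {'k1': 3, 'k2': 4}]
dict_list2 = [{'k1': 1, 'k2': 2, 'k3': 10}, {'k1': 3, 'k2': 4, 'k3': 10}]

print([any(d1.items() <= d2.items() for d2 in dict_list2) for d1 in dict_list1])
# [True, True]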
You can use any and all:
dict_list1 = [{'k1': 1, 'k2': 2}, {'k1': 3, 'k2': 4}]
dict_list2 = [{'k1': 1, 'k2': 2, 'k3': 10}, {'k1': 3, 'k2': 4, 'k3': 10}]
v = [any(all(c in i and i[c] == k for c, k in b.items()) for i in dict_list2)
     for b in dict_list1]
Output:
[True, True]
I need to determine a security level "d" by counting hash strings whose first d characters are 0, using SHA-256 (I defined c_hash(mystring, sha256) to compute the hash).
I use a generate_name function that generates a random first name with 3-6 letters and a family name with 4-8 letters. Here is my code:
def d_security(d):
    t0 = time.clock()
    cnt = 0
    while cnt != d:
        obj = generate_name()
        hash_obj = c_hash(obj, sha256)
        if hash_obj[:d] == d*"0":
            cnt += 1
    t1 = time.clock()
    print(t1-t0)
    return None
For d = 5, it takes over 2 minutes to find strings that match the security level. Any suggestion on how to make this run faster, perhaps with a different loop?
d*"0" is being recalculated everytime despite being constant. Maybe calculate it and assign it to another variable BEFORE the while loop will help.
If your code is running slowly, it's not mainly because of the loop. Generating names and hashes should take the majority of the run time. So either:
(a) generate shorter names, or
(b) if the hash function allows, redesign your hash generator c_hash to return only the first d digits of the hash value. You only need those digits, after all, and calculating the whole hash value takes time.
Evaluate d*"0" only once by putting it in the top of the function.
Also, use startswith instead of using a slice. Effectively giving you:
def d_security(d):
    expected = d*"0"
    t0 = time.clock()
    cnt = 0
    while cnt != d:
        obj = generate_name()
        hash_obj = c_hash(obj, sha256)
        if hash_obj.startswith(expected):
            cnt += 1
    t1 = time.clock()
    print(t1-t0)
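For context, the dominant cost here is the number of attempts rather than the string comparison. Assuming c_hash returns a hex digest (an assumption on my part), each leading character is '0' with probability 1/16, so a d-character zero prefix takes about 16**d tries on average:

# Expected number of random names per "hit", assuming a uniform hex digest.
for d in range(1, 6):
    print("{}: ~{} attempts".format(d, 16 ** d))
# 1: ~16 attempts
# 2: ~256 attempts
# 3: ~4096 attempts
# 4: ~65536 attempts
# 5: ~1048576 attempts

Since the loop waits for d such hits, d = 5 needs on the order of five million hashes, which is why micro-optimizations to the comparison only shave off a small fraction of the run time.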
When working on an AoC puzzle, I found I wanted to subtract lists (preserving ordering):
def bag_sub(list_big, sublist):
    result = list_big[:]
    for n in sublist:
        result.remove(n)
    return result
I didn't like the way the list.remove call (which is itself O(n)) is contained within the loop; that seems needlessly inefficient. So I tried to rewrite it to avoid that:
from collections import Counter

def bag_sub(list_big, sublist):
    c = Counter(sublist)
    result = []
    for k in list_big:
        if k in c:
            c -= Counter({k: 1})
        else:
            result.append(k)
    return result
Is this now O(n), or does the Counter.__isub__ usage still screw things up?
This approach requires that elements must be hashable, a restriction which the original didn't have. Is there an O(n) solution which avoids creating this additional restriction? Does Python have any better "bag" datatype than collections.Counter?
You can assume sublist is half the length of list_big.
I'd use a Counter, but I'd probably do it slightly differently, and I'd probably do this iteratively...
from collections import Counter

def bag_sub(big_list, sublist):
    sublist_counts = Counter(sublist)
    result = []
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1
        else:
            result.append(item)
    return result
This is very similar to your solution, but it's probably not efficient to create an entire new Counter every time you want to decrement the count on something.[1]
Also, if you don't need to return a list, then consider a generator function...
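For example, a minimal sketch of that generator variant (bag_sub_iter is just an illustrative name, not from the original answer):

from collections import Counter

def bag_sub_iter(big_list, sublist):
    """Yield the elements of big_list left over after removing sublist's elements."""
    sublist_counts = Counter(sublist)
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1   # consume one occurrence from the "bag"
        else:
            yield item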
This works as long as all of the elements in list_big and sublist can be hashed. This solution is O(N + M) where N and M are the lengths of list_big and sublist respectively.
If the elements cannot be hashed, you are out of luck unless you have other constraints (e.g. the inputs are sorted using the same criterion). If your inputs are sorted, you could do something similar to the merge stage of merge sort to determine which elements from list_big are in sublist.
[1] Note that Counters also behave a lot like a defaultdict(int), so it's perfectly fine to look up an item in a Counter that isn't there already.
Is this now O(n), or does the Counter.__isub__ usage still screw things up?
This would be expected-case O(n), except that when Counter.__isub__ discards nonpositive values, it goes through every key to do so. You're better off just subtracting 1 from the key value the "usual" way and checking c[k] instead of k in c. (c[k] is 0 for k not in c, so you don't need an in check.)
if c[k]:
    c[k] -= 1
else:
    result.append(k)
Is there an O(n) solution which avoids creating this additional restriction?
Only if the inputs are sorted, in which case a standard variant of a mergesort merge can do it.
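A rough sketch of that merge-style approach, assuming both inputs are sorted by the same criterion (the function name and details are mine, not from the answer):

def bag_sub_sorted(list_big, sublist):
    """Bag subtraction for pre-sorted inputs: O(N + M), no hashing needed."""
    result = []
    j = 0
    for item in list_big:
        # Skip sublist elements smaller than item; they have no counterpart left.
        while j < len(sublist) and sublist[j] < item:
            j += 1
        if j < len(sublist) and sublist[j] == item:
            j += 1                      # consume one matching occurrence
        else:
            result.append(item)
    return result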
Does Python have any better "bag" datatype than collections.Counter?
collections.Counter is Python's bag.
Removing an item from a list of length N is O(N) if the list is unordered, because you have to find it.
Removing k items from a list of length N, therefore, is O(kN) if we focus on "reasonable" cases where k << N.
So I don't see how you could get it down to O(N).
A concise way to write this:
new_list = [x for x in list_big if x not in sublist]
But that's still O(kN).
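If exact bag semantics aren't needed, i.e. you want to drop every occurrence of anything that appears in sublist, then a set makes the membership test O(1) and the whole thing O(N + M). Note this is a sketch of a slightly different operation than the Counter versions above:

def drop_all(list_big, sublist):
    """Remove every occurrence of each sublist element (not true bag subtraction)."""
    to_drop = set(sublist)              # O(M) to build, O(1) lookups
    return [x for x in list_big if x not in to_drop]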
I have a dictionary with over 11 million keys (each value is a list). Each key is a unique integer.
e.g.
Dict1 = {11:"a",12:"b",22:"c",56:"d"}
Then, separately, I have a list of ranges, e.g.
[10-20,30-40,50-60]
And I want to say, for each range in my list of ranges, go through the dictionary and return the value, if the key is within the range.
So it would return:
10-20: "a","b"
50-60: "d"
The actual code that I used is:
for each_key in sorted(dictionary):
    if each_key in range(start, end):
        print str(dictionary[each_key])
The problem is that this technique is prohibitively slow because it goes through all 11 million keys and checks whether each one is within the range.
Is there a way that I can say "skip through all of the dictionary keys until one is found that is higher than the start number" and then "stop once the key is higher than the end number"? Basically, some way to zoom in very quickly on the portion of the dictionary within a certain range?
Thanks
Just use Python's EAFP principle. It's Easier to Ask Forgiveness than Permission.
Assume that all keys are valid, and catch the error if they're not:
for key in xrange(start, end):
    try:
        print str(dictionary[key])
    except KeyError:
        pass
This will just try to get each number as a key, and if there's a KeyError from a nonexistent key, it will move on to the next iteration.
Note that if you expect a lot of the keys will be missing, it might be faster to test first:
for key in xrange(start, end):
    if key in dictionary:
        print str(dictionary[key])
Note that xrange is just a slightly different function to range. It will produce the values one by one instead of creating the whole list in advance. It's useful to use in for loops and has no drawbacks in this case.
My thought for this problem is to find the correct keys first. The reason your solution takes too much time is that it uses an O(n) scan to find the matching keys. If we implement a binary search instead, the complexity drops to O(log(n)), which helps a lot.
Following is my sample code. It works for the example, but I cannot promise it is free of small bugs; take the idea and adapt it to your own implementation.
def binarySearch(alist, target):
    left = 0
    right = len(alist) - 1
    if target > alist[-1]:
        return len(alist)
    while left < right:
        m = (left + right) // 2   # integer division (same result in Python 2 and 3)
        if alist[m] == target:
            return m
        if alist[m] < target:
            left = m + 1
        else:
            right = m
    return left

def work(dictionary, start, end):
    keys = sorted(dictionary.keys())
    start_pos = binarySearch(keys, start)
    end_pos = binarySearch(keys, end)
    print [dictionary[keys[pos]] for pos in range(start_pos, end_pos)]

dictionary = {11: "a", 12: "b", 22: "c", 56: "d"}
work(dictionary, 10, 20)
work(dictionary, 20, 40)
work(dictionary, 10, 60)
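For reference, the standard library's bisect module already implements this binary search. A sketch of the same idea (sort the keys once, then reuse them for every range; the function name is mine) might look like:

from bisect import bisect_left

def values_in_range(dictionary, sorted_keys, start, end):
    """Return values whose keys fall in [start, end); sorted_keys is sorted(dictionary), computed once."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return [dictionary[k] for k in sorted_keys[lo:hi]]

dictionary = {11: "a", 12: "b", 22: "c", 56: "d"}
sorted_keys = sorted(dictionary)
print(values_in_range(dictionary, sorted_keys, 10, 20))   # ['a', 'b']
print(values_in_range(dictionary, sorted_keys, 50, 60))   # ['d']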
This solution (using OrderedDict and filter) can help you a bit.
from collections import OrderedDict

d = {2: 3, 10: 89, 4: 5, 23: 0}
od = OrderedDict(sorted(d.items()))

lst = ["1-10", "11-20", "21-30"]
lower_lst = map(int, [i.split("-")[0] for i in lst])
upper_lst = map(int, [i.split("-")[1] for i in lst])

for low, up in zip(lower_lst, upper_lst):
    print "In range {0}-{1}".format(low, up), filter(lambda a: low <= a[0] <= up, od.iteritems())
I am curious why removing a line in my code results in a significant increase in performance. The function itself takes a dictionary and removes all keys which are substrings of other keys.
The line which slows my code down is:
if sub in reduced_dict and sub2 in reduced_dict:
Here's my function:
import time
from collections import defaultdict

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict:  # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
The function checks if sub is in sub2 many times. I assumed that by checking whether this comparison had already been made, I would save myself time. That doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?
I am a beginner, so I'm interested in the concepts.
When I tested whether the line in question ever evaluates to False, it appears that it does. I tested this with the following:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present'  # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
For 14,805 keys in the function's input dictionary:
19.6360001564 sec. without the line
33.1449999809 sec. with the line
Here are 3 sample dictionaries: the biggest with 14,805 keys, a medium one, and a smaller one.
I have graphed time in seconds (Y) vs. input size in number of keys (X) for the first 14,000 keys of the biggest example dictionary. It appears all these functions scale much worse than linearly.
The lines plotted are:
John Zwinck: answer to this question
Matt: my algorithm for this question, without the dictionary comparison
Matt exponential: my first attempt at this problem (this took 76 s)
Matt compare: the algorithm in this question, with the dict comparison line
tdelaney: solutions for this question (algorithms 1 and 2, in order)
georg: solution from a related question I asked
The accepted answer executes in apparently linear time.
I'm surprised to find that a magic input-size ratio exists at which the run time of a dict lookup equals that of a string search.
For the sample corpus, or any corpus in which most keys are small, it's much faster to test all possible subkeys:
def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)
    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}
This takes about 0.2s on my system (i7-3720QM 2.6GHz).
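A quick sanity check on a toy corpus (my own made-up example) shows the intended behaviour:

sample = {'cat': 1, 'catalog': 2, 'dog': 3, 'at': 4}
# 'at' and 'cat' are substrings of 'catalog', so they get dropped.
print(reduced(sample))   # leaves {'catalog': 2, 'dog': 3} (key order may vary)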
I would do it a bit differently. Here's a generator function which gives you the "good" keys only. This avoids creating a dict which may be largely destroyed key-by-key. I also have just two levels of "for" loops and some simple optimizations to try to find matches more quickly and avoid searching for impossible matches.
def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True)        # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1):      # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1
If you want to make an actual dict using this, you can:
{ key: d[key] for key in reduced_keys(d) }
You create len_dict, but even though it groups keys of equal size, you still have to traverse everything multiple times to compare. Your basic plan is right: sort by size and only compare what's the same size or bigger, but there are other ways to do that. Below, I just create a regular list sorted by key size and then iterate backwards so that I can trim the list as I go. I'm curious how its execution time compares to yours. It did your little dict example in 0.049 seconds.
(I hope it actually worked!)
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)
EDIT
A significant speed increase from not unpacking k_fwd, v_fwd (after running both a few times, this wasn't really a speed-up; something else must have been eating time on my PC for a while).
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)