I have two lists of dictionaries:
dict_list1 = [{'k1':1, 'k2':2}, {'k1':3, 'k2':4}]
dict_list2 = [{'k1':1, 'k2':2, 'k3':10}, {'k1':3, 'k2':4, 'k3':10}]
And now, for each dict_x in dict_list1, I want to know if there is a dict_y in dict_list2 that contains every key/value pair from dict_x.
I cannot think of another way of doing this other than:
for dict_x in dict_list1:
    for dict_y in dict_list2:
        count = len(dict_x)
        for key, val in dict_x.items():
            if key in dict_y and dict_y[key] == val:
                count -= 1
        if count == 0:
            print('YAY')
            break
dict views can perform quick "is subset" testing via the comparison operators (<=, <, >=, >). So:
if dict_x.items() <= dict_y.items():  # Use .viewitems() instead of .items() on Python 2.7
will only return True if every key/value pair in dict_x also appears in dict_y.
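For example, with the sample dicts from the question (the second check uses a deliberately mismatched value):
>>> {'k1': 1, 'k2': 2}.items() <= {'k1': 1, 'k2': 2, 'k3': 10}.items()
True
>>> {'k1': 1, 'k2': 5}.items() <= {'k1': 1, 'k2': 2, 'k3': 10}.items()
False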
This won't change anything in terms of big-O performance, but it does make the code somewhat cleaner:
for dict_x in dict_list1:
    for dict_y in dict_list2:
        if dict_x.items() <= dict_y.items():
            print('YAY')
            break
Note that creating the views costs something (it's just a fixed cost, not dependent on dict size), so if performance matters, it may be worth caching the views; doing so for dict_list1 is free:
for dict_x in dict_list1:
    dict_x_view = dict_x.items()
    for dict_y in dict_list2:
        if dict_x_view <= dict_y.items():
            print('YAY')
            break
but some eager conversions would be needed to cache both:
# Convert all of dict_list2 to views up front; costs a little if
# not all views end up being tested (we always break before finishing)
# but usually saves some work at the cost of a tiny amount of memory
dict_list2_views = [x.items() for x in dict_list2]
for dict_x in dict_list1:
    dict_x_view = dict_x.items()
    for dict_y_view in dict_list2_views:
        if dict_x_view <= dict_y_view:
            print('YAY')
            break
You could also collapse the loop using any (which removes the need to break since any short-circuits), so the first (simplest) check could become:
for dict_x in dict_list1:
    if any(dict_x.items() <= dict_y.items() for dict_y in dict_list2):
        print('YAY')
This could be further collapsed by driving the loop with a single generator expression that produces only the matching dicts, but at that point the code is going to be pretty cramped/ugly:
for _ in (dict_x for dict_x in dict_list1 if any(dict_x.items() <= dict_y.items() for dict_y in dict_list2)):
    print('YAY')
though without knowing what you'd really do (as opposed to just printing YAY) that's getting a little pointless.
Below, I use the fact that the dict.items view implements set operations to check, for each d1 in dict_list1, whether there exists a d2 in dict_list2 such that d1.items() is a subset of d2.items():
[any(d1.items() <= d2.items() for d2 in dict_list2) for d1 in dict_list1]
You can use any and all:
dict_list1 = [{'k1':1, 'k2':2}, {'k1':3, 'k2':4}]
dict_list2 = [{'k1':1, 'k2':2, 'k3':10}, {'k1':3, 'k2':4, 'k3':10}]
v = [any(all(c in i and i[c] == k for c, k in b.items()) for i in dict_list2)
     for b in dict_list1]
Output:
[True, True]
I have this piece of code:
import time

d = dict()
for i in range(200000):
    d[i] = "DUMMY"

start_time = time.time()
for i in range(200000):
    for key in d:
        if len(d) > 1 or -1 not in d:
            break
    del d[i]
print("--- {} seconds ---".format(time.time() - start_time))
Why does this take ~15 seconds to run?
But, if I comment out del d[i] or the inner loop, it runs in ~0.1 seconds.
The issue you have is caused by iterating over even one element (e.g. next(iter(d))) of a dictionary that was once large, but has been shrunk a great deal. This can be nearly as slow as iterating over all of the dictionary items if you get unlucky with your hash values. And this code is very "unlucky" (predictably so, due to Python hash design).
The reason for the issue is that Python does not rebuild a dictionary's hash table when you remove items. So the hash table for a dictionary that used to have 200,000 items in it, but which now has only 1 left, still has more than 200,000 slots in it (and probably quite a few more, since it was probably not entirely full at its peak).
When you're iterating the dictionary when it has all its values in it, finding the first one is pretty simple. The first one will be in one of the first few table entries. But as you empty out the table, more and more blank spaces will be at the start of the table and the search for the first value that still exists will take longer and longer.
This might be even worse given that you're using integer keys, which (mostly) hash to themselves (only -1 hashes to something else). This means that the first key in the "full" dictionary will usually be 0, the next 1, and so on. As you delete the values in increasing order, you'll be very precisely removing the earliest keys in the table first, and so making the searches maximally worse.
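A minimal sketch (Python 3, CPython) of the effect described above; exact numbers will vary by version and machine, but finding the first key of the shrunken dict should be dramatically slower than doing the same on a freshly rebuilt copy:
import timeit

d = {i: "DUMMY" for i in range(200000)}
for i in range(199999):
    del d[i]              # shrink the dict without rebuilding its table

rebuilt = dict(d)         # copying into a new dict compacts the storage

print(timeit.timeit(lambda: next(iter(d)), number=1000))        # slow: skips ~200,000 vacated entries
print(timeit.timeit(lambda: next(iter(rebuilt)), number=1000))  # fast: the first entry is live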
It's because this
for key in d:
    if len(d) > 1 or -1 not in d:
        break
will break on the first iteration, so your inner loop is basically a no-op.
Adding del d[i] makes it do some real work, which takes time.
Update: well, the above is obviously way too simplistic :-)
The following version of your code shows the same characteristic:
import time
import gc

n = 140000

def main(d):
    for i in range(n):
        del d[i]       # A
        for key in d:  # B
            break      # B

import dis
d = dict()
for i in range(n):
    d[i] = "DUMMY"
print dis.dis(main)
start_time = time.time()
main(d)
print("--- {} seconds ---".format(time.time() - start_time))
Using iterkeys doesn't make a difference.
If we plot the run time on different sizes of n we get (n on the x-axis, seconds on the y-axis):
so clearly something exponential is going on.
Deleting line (A) or lines (B) removes the exponential component, although I'm not sure why.
Update 2: Based on @Blckknght's answer, we can regain some of the speed by infrequently rehashing the items:
def main(d):
    for i in range(n):
        del d[i]
        if i % 5000 == 0:
            d = {k: v for k, v in d.items()}
        for key in d:
            break
or this:
def main(d):
    for i in range(n):
        del d[i]
        if i % 6000 == 0:
            d = {k: v for k, v in d.items()}
        try:
            iter(d).next()
        except StopIteration:
            pass
takes under half the time of the original on large n (the bump at 130,000 is consistent over 4 runs).
There seems to be some performance cost to accessing the keys as a whole after deleting an item. This cost is not incurred when you do direct accesses, so my guess is that the dictionary flags its key list as dirty when an item is removed and waits for a reference to the key list before updating/rebuilding it.
This explains why you don't get a performance hit when you remove the inner loop (you're not causing the key list to be rebuilt). It also explains why the loop is fast when you remove the del d[i] line (you're not flagging the key list for rebuilding).
I have 2 dictionaries:
the values in each dictionary should all be equal.
BUT I don't know what that number will be...
dict1 = {'xx':A, 'yy':A, 'zz':A}
dict2 = {'xx':B, 'yy':B, 'zz':B}
N.B. A does not equal B
N.B. Both A and B are actually strings of decimal numbers (e.g. '-2.304998') as they have been extracted from a text file
I want to create another dictionary - that effectively summarises this data - but only if all the values in each dictionary are the same.
i.e.
summary = {}

if dict1['xx'] == dict1['yy'] == dict1['zz']:
    summary['s'] = dict1['xx']

if dict2['xx'] == dict2['yy'] == dict2['zz']:
    summary['hf'] = dict2['xx']
Is there a neat way of doing this in one line?
I know it is possible to create a dictionary using comprehensions
summary = {k:v for (k,v) in zip(iterable1, iterable2)}
but am struggling with both the underlying for loop and the if statement...
Some advice would be appreciated.
I have seen this question, but the answers all seem to rely on already knowing the value being tested (i.e. are all the entries in the dictionary equal to a known number) - unless I am missing something.
sets are a solid way to go here, but just for code golf purposes here's a version that can handle non-hashable dict values:
expected_value = next(iter(dict1.values())) # check for an empty dictionary first if that's possible
all_equal = all(value == expected_value for value in dict1.values())
all terminates early on a mismatch, but the set constructor is well enough optimized that I wouldn't say that matters without profiling on real test data. Handling non-hashable values is the main advantage to this version.
One way to do this would be to leverage set. A set built from an iterable has a length of 1 if the iterable contains only one distinct value:
if len(set(dct.values())) == 1:
    summary[k] = next(iter(dct.values()))
This of course, only works if the values of your dictionary are hashable.
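Putting that together for the question's summary dict, a minimal sketch: the 's'/'hf' labels and the '-2.304998' value come from the question, while the dict2 values here are made up (and deliberately unequal) to show the filtering:
dict1 = {'xx': '-2.304998', 'yy': '-2.304998', 'zz': '-2.304998'}
dict2 = {'xx': '0.5', 'yy': '0.5', 'zz': '0.7'}

summary = {
    label: next(iter(d.values()))
    for label, d in (('s', dict1), ('hf', dict2))
    if len(set(d.values())) == 1
}
print(summary)   # {'s': '-2.304998'} -- dict2 is skipped because its values differ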
While we can use set for this, doing so has a number of inefficiencies when the input is large. It can take memory proportional to the size of the input, and it always scans the whole input, even when two distinct values are found early. Also, the input has to be hashable.
For 3-key dicts, this doesn't matter much, but for bigger ones, instead of using set, we can use itertools.groupby and see if it produces multiple groups:
import itertools
groups = itertools.groupby(dict1.values())
# Consume one group if there is one, then see if there's another.
next(groups, None)
if next(groups, None) is None:
    # All values are equal.
    do_something()
else:
    # Unequal values detected.
    do_something_else()
Readability aside, I don't care for the answers involving set or .values. All of them are always O(N) in time and memory. In practice they can be faster, although it depends on the distribution of values.
Also, because set relies on hashing operations, you may pay a hefty constant multiplier on the time cost. And your values have to be hashable, when a test for equality is all that's needed.
It is theoretically better to take the first value from the dictionary and search the remaining values for the first one that is not equal to it.
set might still be quicker than the solution below, because its workings largely reduce to C implementations.
def all_values_equal(d):
    if len(d) <= 1: return True  # Treat len 0 and len 1 as all equal
    i = d.itervalues()
    firstval = i.next()
    try:
        # Incrementally generate all values not equal to firstval.
        # .next raises StopIteration if empty.
        (j for j in i if j != firstval).next()
        return False
    except StopIteration:
        return True

print all_values_equal({1:0, 2:1, 3:0, 4:0, 5:0})  # False
print all_values_equal({1:0, 2:0, 3:0, 4:0, 5:0})  # True
print all_values_equal({1:"A", 2:"B", 3:"A", 4:"A", 5:"A"})  # False
print all_values_equal({1:"A", 2:"A", 3:"A", 4:"A", 5:"A"})  # True
In the above:
(j for j in i if j!=firstval)
is equivalent to:
def gen_neq(i, val):
    """
    Give me the values of iterator i that are not equal to val
    """
    for j in i:
        if j != val:
            yield j
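For reference, a minimal Python 3 sketch of the same early-exit idea (not part of the original answer), using next() and all():
def all_values_equal(d):
    it = iter(d.values())
    try:
        first = next(it)
    except StopIteration:
        return True                        # empty dict: treat as all equal, as above
    return all(v == first for v in it)     # stops at the first mismatch

print(all_values_equal({1: 0, 2: 1, 3: 0, 4: 0, 5: 0}))   # False
print(all_values_equal({1: 0, 2: 0, 3: 0, 4: 0, 5: 0}))   # True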
I found this solution, which I like quite a bit; it combines ideas from another answer I came across:
user_min = {'test':1,'test2':2}
all(value == list(user_min.values())[0] for value in user_min.values())
>>> user_min = {'test':1,'test2':2}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
False
>>> user_min = {'test':2,'test2':2}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
True
>>> user_min = {'test':'A','test2':'B'}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
False
>>> user_min = {'test':'A','test2':'A'}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
True
Good for a small dictionary, but I'm not sure about a large one, since we build the full list of values just to pick the first one.
When working on an AoC puzzle, I found I wanted to subtract lists (preserving ordering):
def bag_sub(list_big, sublist):
    result = list_big[:]
    for n in sublist:
        result.remove(n)
    return result
I didn't like the way the list.remove call (which is itself O(n)) is contained within the loop, that seems needlessly inefficient. So I tried to rewrite it to avoid that:
def bag_sub(list_big, sublist):
    c = Counter(sublist)
    result = []
    for k in list_big:
        if k in c:
            c -= Counter({k: 1})
        else:
            result.append(k)
    return result
Is this now O(n), or does the Counter.__isub__ usage still screw things up?
This approach requires that elements must be hashable, a restriction which the original didn't have. Is there an O(n) solution which avoids creating this additional restriction? Does Python have any better "bag" datatype than collections.Counter?
You can assume sublist is half the length of list_big.
I'd use a Counter, but I'd probably do it slightly differently, and I'd probably do this iteratively...
from collections import Counter

def bag_sub(big_list, sublist):
    sublist_counts = Counter(sublist)
    result = []
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1   # consume one occurrence from sublist
        else:
            result.append(item)
    return result
This is very similar to your solution, but it's probably not efficient to create an entire new Counter every time you want to decrement the count on something.[1]
Also, if you don't need to return a list, then consider a generator function...
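For instance, a sketch of that generator variant (the name iter_bag_sub is made up here):
from collections import Counter

def iter_bag_sub(big_list, sublist):
    sublist_counts = Counter(sublist)
    for item in big_list:
        if sublist_counts[item] > 0:
            sublist_counts[item] -= 1   # consume one occurrence from sublist
        else:
            yield item

list(iter_bag_sub([1, 2, 2, 3, 3, 3], [2, 3]))   # [1, 2, 3, 3]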
This works as long as all of the elements in list_big and sublist can be hashed. This solution is O(N + M) where N and M are the lengths of list_big and sublist respectively.
If the elements cannot be hashed, you are out of luck unless you have other constraints (e.g. the inputs are sorted using the same criterion). If your inputs are sorted, you could do something similar to the merge stage of merge-sort to determine which elements from list_big are in sublist.
[1] Note that Counters also behave a lot like a defaultdict(int), so it's perfectly fine to look up an item in a Counter that isn't there already.
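A quick illustration of that behaviour:
>>> from collections import Counter
>>> c = Counter(['a'])
>>> c['missing']     # no KeyError; absent items simply count as 0
0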
Is this now O(n), or does the Counter.__isub__ usage still screw things up?
This would be expected-case O(n), except that when Counter.__isub__ discards nonpositive values, it goes through every key to do so. You're better off just subtracting 1 from the key value the "usual" way and checking c[k] instead of k in c. (c[k] is 0 for k not in c, so you don't need an in check.)
if c[k]:
    c[k] -= 1
else:
    result.append(k)
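To illustrate the difference (in-place Counter subtraction prunes nonpositive counts across the whole counter, while plain item assignment just updates the one entry):
>>> from collections import Counter
>>> c = Counter({'a': 1, 'b': 2})
>>> c -= Counter({'a': 1})   # __isub__ re-checks every key and drops nonpositive counts
>>> c
Counter({'b': 2})
>>> c['b'] -= 2              # direct assignment leaves the zero entry in place
>>> c
Counter({'b': 0})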
Is there an O(n) solution which avoids creating this additional restriction?
Only if the inputs are sorted, in which case a standard variant of a mergesort merge can do it.
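A minimal sketch of that merge-style pass, assuming both inputs are already sorted with the same ordering and, as in the original remove-based version, sublist is a sub-bag of list_big:
def sorted_bag_sub(list_big, sublist):
    result = []
    j = 0
    for x in list_big:
        if j < len(sublist) and sublist[j] == x:
            j += 1                  # matched: consume one element of sublist
        else:
            result.append(x)
    return result

sorted_bag_sub([1, 2, 2, 3, 3, 3], [2, 3])   # [1, 2, 3, 3]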
Does Python have any better "bag" datatype than collections.Counter?
collections.Counter is Python's bag.
Removing an item from a list of length N is O(N) if the list is unordered, because you have to find it.
Removing k items from a list of length N, therefore, is O(kN) if we focus on "reasonable" cases where k << N.
So I don't see how you could get it down to O(N).
A concise way to write this:
new_list = [x for x in list_big if x not in sublist]
But that's still O(kN).
I am curious why removing a line in my code results in a significant increase in performance. The function itself takes a dictionary and removes all keys which are substrings of other keys.
The line which slows my code down is:
if sub in reduced_dict and sub2 in reduced_dict:
Here's my function:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict:  # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
The function checks whether sub is in sub2 many times. I assumed that by checking whether this comparison had already been made, I would be saving myself time. This doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?
I am a beginner, so I'm interested in the concepts.
When I tested whether the line in question ever evaluates to False, it appears that it does. I've tested this with the following:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present'  # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
For 14,805 keys in the function's input dictionary:
19.6360001564 sec. without the line
33.1449999809 sec. with the line
Here are 3 example dictionaries: the biggest sample dictionary with 14,805 keys, a medium sample dictionary, and a smaller sample dictionary.
I have graphed time in seconds (Y) vs input size in # of keys (X) for the first 14,000 keys in the biggest example dictionary. It appears all these functions have exponential complexity.
The plotted series are:
John Zwinck - his answer for this question
Matt my algorithm - my algorithm for this question, without the dictionary comparison
Matt exponential - my first attempt at this problem; this took 76 s
Matt compare - the algorithm in this question, with the dict comparison line
tdelaney - tdelaney's solution for this question (algorithms 1 & 2, in order)
georg - georg's solution from a related question I asked
The accepted answer executes in apparently linear time.
I'm surprised to find that a magic input-size ratio exists where the run time of a dict lookup equals that of a string search.
For the sample corpus, or any corpus in which most keys are small, it's much faster to test all possible subkeys:
def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)
    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}
This takes about 0.2s on my system (i7-3720QM 2.6GHz).
I would do it a bit differently. Here's a generator function which gives you the "good" keys only. This avoids creating a dict which may be largely destroyed key-by-key. I also have just two levels of "for" loops and some simple optimizations to try to find matches more quickly and avoid searching for impossible matches.
def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True)  # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1):  # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1
If you want to make an actual dict using this, you can:
{ key: d[key] for key in reduced_keys(d) }
You create len_dict, but even though it groups keys of equal size, you still have to traverse everything multiple times to compare. Your basic plan is right - sort by size and only compare what's the same size or bigger, but there are other ways to do that. Below, I just created a regular list sorted by key size and then iterated backwards so that I could trim the dict as I went. I'm curious how its execution time compares to yours. It did your little dict example in .049 seconds.
(I hope it actually worked!)
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:
            if k in k_fwd:
                del items[i]
                break
    return dict(items)
EDIT
A significant speed increase by not unpacking k_fwd, v_fwd (after running both a few times, this wasn't really a speed-up; something else must have been eating time on my PC for a while).
def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2, -1, -1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
This code should find the mode of a list in O(n) linear time. I want to turn this into a list comprehension because I'm teaching myself Python, and am trying to improve my list comprehension skills.
These were informative but don't really answer my question:
Convert nested loops and conditions to a list comprehension
`elif` in list comprehension conditionals
Nested list comprehension equivalent
The problem that I'm running into is nesting the ifs and the try/except. I'm sure this is a simple question, so a junior Python programmer might have the answer quickly.
def mode(L):
    # your code here
    d = {}; mode = 0; freq = 0
    for j in L:
        try:
            d[j] += 1
            if d[j] > freq:
                mode = j; freq = d[j]
        except KeyError: d[j] = 1
    return mode
Note that the L parameter is a list of ints like this:
L = [3,4,1,20,102,3,5,67,39,10,1,4,34,1,6,107,99]
I was thinking something like:
[try (d[j] += 1) if d[j] > freq (mode = j; freq = d[j]) except(KeyError): d[j] = 1 for j in L]
But I don't have enough duct tape to fix how badly the syntax is off with that thing.
I know you're learning comprehensions, but you can do this with a default dictionary, or a Counter too.
import collections

def mode(L):
    # your code here
    d = collections.defaultdict(lambda: 1); mode = 0; freq = 0
    for j in L:
        d[j] += 1
        if d[j] > freq:
            mode = j; freq = d[j]
    return mode
Better still, when you are not trying to learn comprehensions:
import collections

def mode(L):
    return collections.Counter(L).most_common(1)[0][0]
While it might not be possible to do this directly within a list comprehension, there's also no reason to. You only really want to be checking for errors when you're actually retrieving the results. As such, you really want to use a generator instead of a list comprehension.
The syntax is largely the same, just using parens instead of brackets, so you would do something like this:
generator = (do_something(x) for x in iterable)
try:
    for thing in generator:
        ...
except KeyError:
    ...
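A concrete, made-up illustration of that pattern: because the generator is lazy, the KeyError only surfaces while the results are being consumed, which is where the try/except lives:
d = {'a': 1, 'b': 2}
lookups = (d[key] for key in ['a', 'b', 'c'])   # nothing is evaluated yet
try:
    for value in lookups:
        print(value)
except KeyError as missing:
    print('missing key:', missing)              # prints: missing key: 'c'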
That said, you really don't want to do this for your particular application. You want to use a Counter:
from collections import Counter

d = Counter(L)
mode = d.most_common(1)[0][0]
You can't incorporate try: except: in a list comprehension. However, you can get around it by refactoring into a dict comprehension:
d = {i: L.count(i) for i in L}
You can then determine the maximum and corresponding key in a separate test. However, this would be O(n**2).
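For example, the separate test could just take the max over the counts (this follow-up line is mine, not part of the original answer); it is still O(n**2) overall because of the repeated L.count calls:
L = [3, 4, 1, 20, 102, 3, 5, 67, 39, 10, 1, 4, 34, 1, 6, 107, 99]
d = {i: L.count(i) for i in L}
mode = max(d, key=d.get)   # 1, which appears three times in this sample list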
It's not possible to use try/except expressions in a list comprehension.
Quoting this answer:
It is not possible to handle exceptions in a list comprehension, for a list comprehension is an expression containing other expressions, nothing more (i.e., no statements, and only statements can catch/ignore/handle exceptions).
Edit 1:
What you could do instead of using the try-except clause, is use the get method from the dictionary:
def mode(L):
    d = {}
    mode = 0
    freq = 0
    for j in L:
        d[j] = d.get(j, 0) + 1
        if d[j] > freq:
            mode = j
            freq = d[j]
    return mode
From Python docs:
get(key[, default]): Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
Edit 2:
This is my list comprehension approach, not very efficient, just for fun:
r2 = max(zip(L, [L.count(e) for e in L]), key = lambda x: x[1])[0]
Since you're trying to find the value that appears most often, an easy way to do that is with max:
def mode(L):
    return max(L, key=L.count)
This is a bit less efficient than the other answers that suggest using collections.Counter (it is O(N^2) rather than O(N)), but for a modest sized list it will probably be fast enough.