Pythonic iterable difference

I've written some code to find all the items that are in one iterable and not another and vice versa. I was originally using the built in set difference, but the computation was rather slow as there were millions of items being stored in each set. Since I know there will be at most a few thousand differences I wrote the below version:
def differences(a_iter, b_iter):
    a_items, b_items = set(), set()
    def remove_or_add_if_none(a_item, b_item, a_set, b_set):
        if a_item is None:
            if b_item in a_set:
                a_set.remove(b_item)
            else:
                b_set.add(b)
    def remove_or_add(a_item, b_item, a_set, b_set):
        if a in b_set:
            b_set.remove(a)
            if b in a_set:
                a_set.remove(b)
            else:
                b_set.add(b)
            return True
        return False
    for a, b in itertools.izip_longest(a_iter, b_iter):
        if a is None or b is None:
            remove_or_add_if_none(a, b, a_items, b_items)
            remove_or_add_if_none(b, a, b_items, a_items)
            continue
        if a != b:
            if remove_or_add(a, b, a_items, b_items) or \
               remove_or_add(b, a, b_items, a_items):
                continue
            a_items.add(a)
            b_items.add(b)
    return a_items, b_items
However, the above code doesn't seem very pythonic so I'm looking for alternatives or suggestions for improvement.

Here is a more pythonic solution:
a, b = set(a_iter), set(b_iter)
return a - b, b - a
Pythonic does not mean fast, but rather elegant and readable.
Here is a solution that might be faster:
a, b = set(a_iter), set(b_iter)
# Get all the candidate return values
symdif = a.symmetric_difference(b)
# Since symdif has much fewer elements, these might be faster
return symdif - b, symdif - a
Now, about writing custom “fast” algorithms in Python instead of using the built-in operations: it's a very bad idea.
The set operators are heavily optimized, and written in C, which is generally much, much faster than Python.
You could write an algorithm in C (or Cython), but then keep in mind that Python's set algorithms were written and optimized by world-class geniuses.
Unless you're extremely good at optimization, it's probably not worth the effort. On the other hand, if you do manage to speed things up substantially, please share your code; I bet it'd have a chance of getting into Python itself.
For a more realistic approach, try eliminating calls to Python code. For instance, if your objects have a custom equality operator, figure out a way to remove it.
But don't get your hopes up. Working with millions of pieces of data will always take a long time. I don't know where you're using this, but maybe it's better to make the computer busy for a minute than to spend the time optimizing set algorithms?
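As a rough illustration of the two variants above, here is a minimal self-contained sketch. The function names differences_plain and differences_symdif are made up for this example, and whether the second one is actually faster depends on your data, so time both before choosing.

```python
def differences_plain(a_iter, b_iter):
    # Build both sets once, then take the two one-sided differences.
    a, b = set(a_iter), set(b_iter)
    return a - b, b - a


def differences_symdif(a_iter, b_iter):
    # Same result, but the final subtractions run on the (hopefully small)
    # symmetric difference instead of on the full sets.
    a, b = set(a_iter), set(b_iter)
    symdif = a ^ b
    return symdif - b, symdif - a


only_a, only_b = differences_plain(range(0, 10), range(3, 12))
print(only_a, only_b)
```

Both functions return the same pair of sets; only the amount of work done in the last step differs.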

I think your code is broken. Try it with [1,1] and [1,2]: it reports 1 as a difference even though 1 is in both inputs.
> print differences([1,1],[1,2])
(set([1]), set([2]))
You can trace this back to the if a != b test, which assumes the two iterables are aligned position by position; plain set differences make no such assumption.
Without that test, which probably discards many values, I don't think your method will be any faster than the built-in sets. The argument goes something like this: you really do need to build one set in memory that holds all the data (your bug comes from not doing that). A naive set approach builds two sets. So the best you can do is save half the time, and you also have to do in Python the work that is probably done by efficient C code.

I would have thought python set operations would be the best performance you could get out of the standard library.
Perhaps it's the particular implementation you chose that's the problem, rather than the data structures and attendant operations themselves. Here's an alternate implementation that should give you better performance.
For sequence comparison tasks in which the sequences are large, avoid, if at all possible, putting the objects that comprise the sequences into the containers used for the comparison--better to work with indices instead. If the objects in your sequences are unordered, then sort them.
So for instance, I use NumPy, the numerical Python library, for these sorts of tasks:
# a, b are 'fake' index arrays of type boolean
import numpy as NP
a, b = NP.random.randint(0, 2, 10), NP.random.randint(0, 2, 10)
a, b = NP.array(a, dtype=bool), NP.array(b, dtype=bool)
# number of items a and b have in common:
NP.sum(NP.logical_and(a, b))
# the converse (the number of positions where a and b differ):
NP.sum(NP.logical_xor(a, b))
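If the items themselves are numeric, one way to apply this idea is NumPy's set routines. This is only a sketch with made-up data: np.setdiff1d returns the sorted unique values of its first argument that do not appear in the second.

```python
import numpy as np

# Hypothetical sample data standing in for the two large iterables.
a = np.array([1, 2, 3, 4, 4])
b = np.array([3, 4, 5])

only_in_a = np.setdiff1d(a, b)  # values of a that are missing from b
only_in_b = np.setdiff1d(b, a)  # values of b that are missing from a
print(only_in_a, only_in_b)
```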

Related

How can i check that a list is in my array in python

for example if i have:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
and i want to check if the following list is the same as one of the lists that the array consist of:
B = [2,3,4]
I tried
B in A #which returns True
But the following also returns True, which should be false:
B = [2,2,2]
B in A

Try this generator comprehension. The builtin any() short-circuits so that you don't have extra evaluations that you don't need.
any(np.array_equal(row, B) for row in A)
For now, np.array_equal doesn't implement internal short-circuiting. In a different question the performance impact of different ways of accomplishing this is discussed.
As @Dan mentions below, broadcasting is another valid way to solve this problem, and it's often (though not always) a better way. For some rough heuristics, here's how you might want to choose between the two approaches. As with any other micro-optimization, benchmark your results.
Generator Comprehension
Reduced memory footprint (not creating the array B==A)
Short-circuiting (if the first row of A is B, we don't have to look at the rest)
When rows are large (definition depends on your system, but could be ~100 - 100,000), broadcasting isn't noticeably faster.
Uses builtin language features. You have numpy installed anyway, but I'm partial to using the core language when there isn't a reason to do otherwise.
Broadcasting
Fastest way to solve an extremely broad range of problems using numpy. Using it here is good practice.
If we do have to search through every row in A (i.e. if more often than not we expect B to not be in A), broadcasting will almost always be faster (not always a lot faster necessarily, see next point)
When rows are smallish, the generator expression won't be able to vectorize the computations efficiently, so broadcasting will be substantially faster (unless of course you have enough rows that short-circuiting outweighs that concern).
In a broader context where you have more numpy code, the use of broadcasting here can help to have more consistent patterns in your code base. Coworkers and future you will appreciate not having a mix of coding styles and patterns.
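To make the comparison concrete, here is a small sketch of both approaches side by side. The function names are made up for this example. It also shows why the plain B in A test from the question misfires: membership on an array behaves like an elementwise comparison followed by any().

```python
import numpy as np


def row_in_array_any(A, B):
    # Generator expression: stops at the first matching row.
    return any(np.array_equal(row, B) for row in A)


def row_in_array_broadcast(A, B):
    # Broadcasting: compare every row at once, then reduce.
    return bool((A == np.asarray(B)).all(axis=1).any())


A = np.array([[2, 3, 4], [5, 6, 7]])

# A single shared element is enough to make this True.
elementwise_hit = bool((A == [2, 2, 2]).any())

print(row_in_array_any(A, [2, 3, 4]), row_in_array_broadcast(A, [2, 2, 2]))
```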

You can do it by using broadcasting like this:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,3,4]) # Or [2,3,4], a list will work fine here too
(B==A).all(axis=1).any()

Using the built-in any: as soon as a matching row is found, it stops iterating and returns True.
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = [3,2,4]
if any(np.array_equal(B, x) for x in A):
    print(f'{B} inside {A}')
else:
    print(f'{B} NOT inside {A}')

You need to use .all() for comparing all the elements of list.
A = np.array([[2,3,4],[5,6,7]])
B = [2,3,4]
for i in A:
    if (i==B).all():
        print ("Yes, B is present in A")
        break
EDIT: I added break to leave the loop as soon as the first occurrence is found. This matters for inputs such as A = np.array([[2,3,4],[2,3,4]]), where the message would otherwise be printed twice.
# print ("Yes, B is present in A")
Alternative solution using any:
any((i==B).all() for i in A)
# True

list((A[[i], :]==B).all() for i in range(A.shape[0]))
[True, False]
This will tell you what row of A is equal to B
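If you want the matching row numbers rather than a list of booleans, a sketch along these lines may help. The sample array is made up; np.flatnonzero returns the indices of the True entries.

```python
import numpy as np

A = np.array([[2, 3, 4], [5, 6, 7], [2, 3, 4]])
B = [2, 3, 4]

# One boolean per row, then the indices of the rows that matched.
matching_rows = np.flatnonzero((A == B).all(axis=1))
print(matching_rows)
```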

Straight forward, you could use any() to go through a generator comparing the arrays with array_equal.
from numpy import array_equal
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,2,4])
in_A = lambda x, A : any((array_equal(a,x) for a in A))
print(in_A(B, A))
False

From for loops to matrix computation

I've got a piece of code taking an input and checking if the input meets requirements. The input is composed of a list of objects called S.
class S:
    def __init__(self, f, t, tf, timeline):
        self.f = f
        self.t = t
        self.tf = tf
        self.timeline = timeline
To know if a combination of objects meets the requirement, I have functions taking a list of size N of objects and returning True or False.
input1 = [S_1, ..., S_N]
def c1(input1):
    if condition_c1_valid:
        return True
    else:
        return False
Now let's consider this example:
import itertools
possible_objects = [S(f, t, tf, timeline) for f in [...] for t in [..] ...]
inputs_to_check = list(itertools.combinations_with_replacement(possible_objects, 5))
results = list()
for inp in inputs_to_check:
    if c1(inp):
        results.append(inp)
Right now, my solution is using a for loop on the N condition I'm checking every time.
The code keeps the inputs which meets the condition.
Could this be computed at once in a matrix fashion? (Vectorized)
I was thinking of something like this: (pseudo code)
Data[input, c1, ..., cN]
return where(all(c1, ..., cN) is True)
Can anyone tell me if it is achievable, and could point me towards examples? In the end, my list of inputs to check is very large. Thus it would be interesting to send the computation to the GPU. I thought that maybe this could be achieved through Tensorflow...
Thanks for the tips :)
EDIT: The example above is far from the reality. I'm using nested for loops on a large set, with a complexity of the 6th or 7th degree. The current solution is optimized with generators, but I would like to push this further.

In the most general sense, you won't be able to vectorize this. CPython is notoriously bad at parallel processing due to the GIL, and its primary matrix vectorization library (numpy) is for dealing with primitive types (integers, floats, etc.), not Python objects such as S.
There are a few things that could help:
If f, t, tf, timeline are numbers (which they look like they may be), then you could form four numpy arrays of these values and pass those through a vectorized version of c1 which returns a boolean array. You could then do np.asarray(input1)[c1_vec(f_vec, t_vec, tf_vec, timeline_vec)]
You said you've used generators instead of lists, but just to be especially sure your example should read as:
possible_objects = (S(f, t, tf, timeline) for f in (...) for t in (...) ...)
inputs_to_check = itertools.combinations_with_replacement(possible_objects, 5)
results = [inp for inp in inputs_to_check if c1(inp)]
This saves a lot of time of writing objects to memory that can be avoided.
Use PyPy. It uses a JIT compiler to massively speed up python for loops. For very large loops this will get up to near C speed.
You mention using a GPU. CPython doesn't even run Python code on more than one CPU core at a time, so running this on a GPU would be pointless unless you use another implementation.
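To illustrate the first suggestion, here is a minimal sketch of a vectorized check. Everything specific in it is an assumption made for the example: a single numeric attribute f, groups of size 2 instead of 5, and a made-up condition (the sum of f over the group must not exceed a limit). The real c1 would need to be rewritten in the same array style.

```python
import itertools

import numpy as np

f_values = np.array([1.0, 2.0, 3.0])  # hypothetical attribute, one entry per object
group_size = 2                        # kept small for the example
limit = 4.0                           # hypothetical threshold

# Enumerate combinations of object indices instead of the objects themselves.
index_combos = np.array(
    list(itertools.combinations_with_replacement(range(len(f_values)), group_size))
)


def c1_vec(f_groups):
    # Hypothetical vectorized condition: one boolean per combination.
    return f_groups.sum(axis=1) <= limit


mask = c1_vec(f_values[index_combos])
results = index_combos[mask]  # index tuples of the combinations that pass
print(results)
```

Several conditions can be combined with & on their boolean arrays before indexing, which matches the where(all(c1, ..., cN)) idea from the question.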

How to avoid using for-loops with numpy?

I have already written the following piece of code, which does exactly what I want, but it runs way too slowly. I am certain that there is a way to make it faster, but I can't seem to find how it should be done. The first part of the code just shows the shape of each array.
two images of measurements (VV1 and HH1)
precomputed values, VV simulated and HH simulated, which both depend on 3 parameters (precomputed for (101, 31, 11) values)
the index 2 is just to put the VV and HH images in the same ndarray, instead of making two 3darrays
VV1 = numpy.ndarray((54, 43)).flatten()
HH1 = numpy.ndarray((54, 43)).flatten()
precomp = numpy.ndarray((101, 31, 11, 2))
two of the three parameters we let vary
comp = numpy.zeros((len(parameter1), len(parameter2)))
for i,(vv,hh) in enumerate(zip(VV1,HH1)):
    comp0 = numpy.zeros((len(parameter1),len(parameter2)))
    for j in range(len(parameter1)):
        for jj in range(len(parameter2)):
            comp0[j,jj] = numpy.min((vv-precomp[j,jj,:,0])**2+(hh-precomp[j,jj,:,1])**2)
    comp+=comp0
The obvious thing I know I should do is get rid of as many for-loops as I can, but I don't know how to make numpy.min behave properly when working with more dimensions.
A second thing (less important if the code can be vectorized, but still interesting): I noticed that the run is limited mostly by CPU time, not RAM. I have searched for a long time but can't find a way to write something like MATLAB's "parfor" instead of "for". (Is it possible to make a @parallel decorator if I just put the for-loop in a separate function?)
edit: in reply to Janne Karila: yeah, that definitely improves it a lot,
for (vv,hh) in zip(VV1,HH1):
    comp+= numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
is definitely a lot faster, but is there any possibility of removing the outer for-loop too? And is there a way to make a for-loop parallel, with a @parallel decorator or something similar?

This can replace the inner loops, j and jj
comp0 = numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
This may be a replacement for the whole loop, though all this indexing is stretching my mind a bit. (this creates a large intermediate array though)
comp = numpy.sum(
    # the new leading axis shifts the third parameter to axis 3
    numpy.min((VV1.reshape(-1,1,1,1) - precomp[numpy.newaxis,...,0])**2
              +(HH1.reshape(-1,1,1,1) - precomp[numpy.newaxis,...,1])**2,
              axis=3),
    axis=0)
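A quick way to convince yourself that a fully vectorized form matches the loop is to compare both on small random arrays. The sketch below uses made-up shapes. Note that once the leading pixel axis is added, the minimum over the third parameter has to be taken along axis 3 of the 4-d intermediate.

```python
import numpy as np

rng = np.random.default_rng(0)
VV1 = rng.random(6)                 # stand-in for the flattened VV image
HH1 = rng.random(6)                 # stand-in for the flattened HH image
precomp = rng.random((5, 4, 3, 2))  # small stand-in for the (101, 31, 11, 2) table

# Reference: the loop over pixels from the question's edit.
comp_loop = np.zeros(precomp.shape[:2])
for vv, hh in zip(VV1, HH1):
    comp_loop += np.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

# Vectorized: the intermediate has shape (pixels, p1, p2, p3).
sq_dist = ((VV1.reshape(-1, 1, 1, 1) - precomp[np.newaxis, ..., 0])**2
           + (HH1.reshape(-1, 1, 1, 1) - precomp[np.newaxis, ..., 1])**2)
comp_vec = sq_dist.min(axis=3).sum(axis=0)

print(np.allclose(comp_loop, comp_vec))
```

For the real shapes the intermediate array holds 54*43*101*31*11 floats, so processing the pixels in chunks may be needed to keep memory in check.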

One way to parallelize the loop is to construct it in such a way as to use map. In that case, you can then use multiprocessing.Pool to use a parallel map.
I would change this:
for (vv,hh) in zip(VV1,HH1):
    comp+= numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
To something like this:
import numpy as np
def buildcomp(vvhh):
    vv, hh = vvhh
    return np.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
if __name__=='__main__':
    from multiprocessing import Pool
    nthreads = 2  # Pool starts this many worker processes
    p = Pool(nthreads)
    complist = p.map(buildcomp, np.column_stack((VV1,HH1)))
    comp = np.dstack(complist).sum(-1)
Note that the dstack assumes that each comp.ndim is 2, because it will add a third axis, and sum along it. This will slow it down a bit because you have to build the list, stack it, then sum it, but these are all either parallel or numpy operations.
I also changed the zip to a numpy operation np.column_stack, since zip is much slower for long arrays, assuming they're already 1d arrays (which they are in your example).
I can't easily test this so if there's a problem, feel free to let me know.
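Here is a variant of the same map idea that can be run as-is on small made-up arrays. Note that it swaps in concurrent.futures.ThreadPoolExecutor for multiprocessing.Pool, which avoids pickling precomp for each worker; whether threads give any speedup here depends on how much of the time is spent inside NumPy, so treat it as a sketch to benchmark, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def build_comp(vvhh, precomp):
    # Contribution of a single pixel: minimum squared distance over the third parameter.
    vv, hh = vvhh
    return np.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)


def parallel_comp(VV1, HH1, precomp, workers=2):
    # Map the per-pixel function over all (vv, hh) pairs and add up the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda pair: build_comp(pair, precomp), zip(VV1, HH1))
        return sum(parts)


rng = np.random.default_rng(1)
VV1, HH1 = rng.random(8), rng.random(8)  # made-up measurements
precomp = rng.random((5, 4, 3, 2))       # made-up lookup table
comp = parallel_comp(VV1, HH1, precomp)
print(comp.shape)
```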

In computer science, there is the concept of Big O notation, used for getting an approximation of how much work is required to do something. To make a program fast, do as little as possible.
This is why Janne's answer is so much faster, you do fewer calculations. Taking this principle farther, we can apply the concept of memoization, because you are CPU bound instead of RAM bound. You can use the memory library, if it needs to be more complex than the following example.
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
memo = AutoVivification()
def memoize(n, arr, end):
    # Key on n and end only: NumPy arrays are unhashable, and arr is always precomp here.
    if end not in memo[n]:
        memo[n][end] = (n-arr[...,end])**2
    return memo[n][end]
for (vv,hh) in zip(VV1,HH1):
    first = memoize(vv, precomp, 0)
    second = memoize(hh, precomp, 1)
    comp+= numpy.min(first+second, axis=2)
Anything that has already been computed gets saved to memory in the dictionary, and we can look it up later instead of recomputing it. You can even break down the math being done into smaller steps that are each memoized if necessary.
The AutoVivification dictionary is just to make it easier to save the results inside of memoize, because I'm lazy. Again, you can memoize any of the math you do, so if numpy.min is slow, memoize it too.
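The same caching idea can also be sketched with functools.lru_cache from the standard library. The table below is made up, and the function name squared_term is hypothetical. The cache only pays off if the same measured value occurs more than once, and the arguments must be hashable, so pass plain floats rather than arrays.

```python
from functools import lru_cache

import numpy as np

precomp = np.arange(24, dtype=float).reshape(2, 3, 2, 2)  # made-up lookup table


@lru_cache(maxsize=None)
def squared_term(value, channel):
    # Cached per (value, channel); precomp is read from the enclosing scope.
    return (value - precomp[..., channel])**2


first = squared_term(1.5, 0)
again = squared_term(1.5, 0)  # served from the cache, not recomputed
print(squared_term.cache_info())
```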

Efficient strings containing each other

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a,b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:, check whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in xrange(len(b)-10+1):
        j = i+10
        prefix_table[b[i:j]].add(k)
for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in b
            # (a shared 10-mer is necessary, but not sufficient)
            if a in B[k]:
                print (a, B[k])
    else:
        for k in xrange(len(B)):
            # a is too small to use the table; check if
            # a is in any b
            if a in B[k]:
                print (a, B[k])

Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections
with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]
def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result
def g():
    return [(a, b) for a in A for b in B if a in b]
def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]
def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]
print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)

Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):
Initialize a hashtable or trie
For each string b in B:
    For every substring of that string
        Add the substring to the hashtable/trie (this is no worse than 55*O(B)=O(B)), with metadata of which string it belonged to
For each string a in A:
    Do an O(1) query to your hashtable/trie to find all B-strings it is in, yield those
(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite-automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on if you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
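The hashtable steps above can be sketched as follows. The function name substring_pairs and the sample strings are made up, and the sketch assumes B is a list of short strings, since the index stores every substring of every b.

```python
from collections import defaultdict


def substring_pairs(A, B):
    # Map every substring of every b to the positions in B where it occurs.
    index = defaultdict(set)
    for k, b in enumerate(B):
        for i in range(len(b)):
            for j in range(i + 1, len(b) + 1):
                index[b[i:j]].add(k)
    # One dictionary lookup per a yields all strings in B that contain it.
    return [(a, B[k]) for a in A for k in sorted(index.get(a, ()))]


pairs = substring_pairs(["ab", "c"], ["abc", "xab", "d"])
print(pairs)
```

Using index.get instead of index[a] avoids adding an empty entry to the table for every a that matches nothing.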

A very fast way to search for a lot of strings is to make use of a finite automaton (so you were not that far with the guess of regexp), namely the Aho Corasick string matching machine, which is used in tools like grep, virus scanners and the like.
First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in details). This automaton then reads the input string(s) and outputs all found search strings (you probably want to modify it a bit, so that it also outputs the string in which the search string was found).
This method has the advantage that it searches all search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!
There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about their performance, usability or correctness.
EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:
import pyahocorasick
def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s,b) for b in B for (i,res) in t.iter(b) for s in res]
One thing I observed, though: when testing this implementation with @SvenMarnach's script, it yielded slightly fewer results than the other methods, and I am not sure why. I wrote an email to the author; maybe he will figure it out.
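To show what the automaton does internally, here is a small pure-Python sketch of the Aho-Corasick idea. It is written for illustration only and is not the library used above; the function names are made up, it assumes non-empty search strings, and a C implementation will be far faster.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]  # goto[state][char] -> next state (a trie)
    fail = [0]   # failure link of each state
    out = [[]]   # patterns recognized on reaching each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_all(text, automaton):
    # Single pass over text; returns the set of patterns that occur in it.
    goto, fail, out = automaton
    s, found = 0, set()
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        found.update(out[s])
    return found


def substring_pairs(A, B):
    automaton = build_automaton(A)
    return [(a, b) for b in B for a in find_all(b, automaton)]


print(find_all("ushers", build_automaton(["he", "she", "his", "hers"])))
```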

There are specialized index structures for this, see for example
http://en.wikipedia.org/wiki/Suffix_tree
You'd build a suffix-tree or something similar for B, then use A to query it.

Recursive generation + filtering. Better non-recursive?

I have the following need (in python):
generate all possible tuples of length 12 (could be more) containing either 0, 1 or 2 (basically, a ternary number with 12 digits)
filter these tuples according to specific criteria, culling those not good, and keeping the ones I need.
As I had to deal with small lengths until now, the functional approach was neat and simple: a recursive function generates all possible tuples, then I cull them with a filter function. Now that I have a larger set, the generation step is taking too much time, much longer than needed as most of the paths in the solution tree will be culled later on, so I could skip their creation.
I have two solutions to solve this:
derecurse the generation into a loop, and apply the filter criteria on each new 12-digits entity
integrate the filtering in the recursive algorithm, so to prevent it stepping into paths that are already doomed.
My preference goes to 1 (seems easier) but I would like to hear your opinion, in particular with an eye towards how a functional programming style deals with such cases.

How about
import itertools
results = []
for x in itertools.product(range(3), repeat=12):
    if myfilter(x):
        results.append(x)
where myfilter does the selection. Here, for example, only allowing result with 10 or more 1's,
def myfilter(x): # example filter, only take lists with 10 or more 1s
    return x.count(1)>=10
That is, my suggestion is your option 1. For some cases it may be slower because (depending on your criteria) you may generate many lists that you don't need, but it's much more general and very easy to code.
Edit: This approach also has a one-liner form, as suggested in the comments by hughdbrown:
results = [x for x in itertools.product(range(3), repeat=12) if myfilter(x)]
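For comparison, here is a sketch of your option 2, where the filter is folded into the generation so that doomed prefixes are never extended. It only works when the criterion can be checked on a partial tuple. The criterion used here (at most max_non_ones digits other than 1, equivalent to the example filter above) and the function name are assumptions for the illustration.

```python
import itertools


def pruned_tuples(length, max_non_ones):
    # Yield ternary tuples with at most max_non_ones digits that are not 1.
    def extend(prefix, non_ones):
        if len(prefix) == length:
            yield tuple(prefix)
            return
        for digit in (0, 1, 2):
            extra = 0 if digit == 1 else 1
            if non_ones + extra > max_non_ones:
                continue  # prune: no completion of this prefix can pass
            prefix.append(digit)
            yield from extend(prefix, non_ones + extra)
            prefix.pop()

    yield from extend([], 0)


results = list(pruned_tuples(4, 1))
print(len(results))
```

With length 12 and the "10 or more 1s" filter this visits only a few hundred tuples instead of all 3**12.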

itertools has functionality for dealing with this. However, here is a (hardcoded) way of handling with a generator:
T = (0,1,2)
GEN = ((a,b,c,d,e,f,g,h,i,j,k,l) for a in T for b in T for c in T for d in T for e in T for f in T for g in T for h in T for i in T for j in T for k in T for l in T)
for VAL in GEN:
    # Filter VAL
    print VAL

I'd implement an iterative binary adder or hamming code and run that way.
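One possible reading of that suggestion, for base 3, is a counter that increments a list of digits with carry. This is only a sketch under that interpretation, and the function name is made up.

```python
import itertools


def ternary_tuples(length):
    # Count from 00...0 to 22...2 in base 3, yielding each state as a tuple.
    digits = [0] * length
    while True:
        yield tuple(digits)
        pos = length - 1
        # Propagate the carry: trailing 2s roll over to 0.
        while pos >= 0 and digits[pos] == 2:
            digits[pos] = 0
            pos -= 1
        if pos < 0:
            return  # every digit rolled over, so the count is complete
        digits[pos] += 1


all_pairs = list(ternary_tuples(2))
print(all_pairs)
```

It produces the tuples in the same order as itertools.product(range(3), repeat=length), so the filter can be applied to each one as it is generated.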
