Recursive generation + filtering. Better non-recursive? - python

I have the following need (in python):
generate all possible tuples of length 12 (could be more) containing either 0, 1 or 2 (basically, a ternary number with 12 digits)
filter these tuples according to specific criteria, culling those not good, and keeping the ones I need.
As I had to deal with small lengths until now, the functional approach was neat and simple: a recursive function generates all possible tuples, then I cull them with a filter function. Now that I have a larger set, the generation step is taking too much time, much longer than needed as most of the paths in the solution tree will be culled later on, so I could skip their creation.
I have two solutions to solve this:
1. derecurse the generation into a loop, and apply the filter criteria to each new 12-digit entity
2. integrate the filtering into the recursive algorithm, so as to prevent it from stepping into paths that are already doomed.
My preference goes to 1 (seems easier) but I would like to hear your opinion, in particular with an eye towards how a functional programming style deals with such cases.
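For comparison, the second option (filtering inside the recursion) can be sketched as a recursive generator that checks each partial tuple before going deeper. This is only an illustration: `pruned_tuples` and `prefix_ok` are hypothetical names, and the example predicate assumes the criterion is "at least 10 ones", so a prefix can be rejected as soon as it holds more than two non-1 digits. Real criteria need their own "can this prefix still succeed?" test for the pruning to be valid.

```python
def pruned_tuples(length, prefix_ok, digits=(0, 1, 2), prefix=()):
    """Yield every tuple of `length` digits whose every prefix passes prefix_ok."""
    if len(prefix) == length:
        yield prefix
        return
    for d in digits:
        candidate = prefix + (d,)
        # Only recurse into branches that can still lead to a kept tuple.
        if prefix_ok(candidate):
            yield from pruned_tuples(length, prefix_ok, digits, candidate)


def at_most_two_non_ones(prefix):
    # Hypothetical criterion: a 12-tuple with >= 10 ones has <= 2 other digits,
    # so any prefix with more than two non-1 digits is already doomed.
    return len(prefix) - prefix.count(1) <= 2


results = list(pruned_tuples(12, at_most_two_non_ones))
```

With this predicate the recursion only descends into prefixes that can still succeed, instead of building all 3**12 tuples and discarding most of them afterwards.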

How about
import itertools
results = []
for x in itertools.product(range(3), repeat=12):
    if myfilter(x):
        results.append(x)
where myfilter does the selection. Here, for example, only allowing result with 10 or more 1's,
def myfilter(x): # example filter, only take lists with 10 or more 1s
    return x.count(1)>=10
That is, my suggestion is your option 1. In some cases it may be slower because (depending on your criteria) you may generate many tuples that you don't need, but it's much more general and very easy to code.
Edit: This approach also has a one-liner form, as suggested in the comments by hughdbrown:
results = [x for x in itertools.product(range(3), repeat=12) if myfilter(x)]

itertools has functionality for dealing with this. However, here is a (hardcoded) way of handling with a generator:
T = (0,1,2)
GEN = ((a,b,c,d,e,f,g,h,i,j,k,l) for a in T for b in T for c in T for d in T for e in T for f in T for g in T for h in T for i in T for j in T for k in T for l in T)
for VAL in GEN:
    # Filter VAL
    print(VAL)

I'd implement an iterative binary adder or hamming code and run that way.
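One possible reading of this suggestion, adapted from binary to base 3, is an odometer-style counter that adds one with carry on each step. A minimal sketch (the name `ternary_counter` is made up for illustration):

```python
def ternary_counter(length, base=3):
    """Yield all digit tuples of the given length by repeatedly adding one."""
    digits = [0] * length
    while True:
        yield tuple(digits)
        # Add one to the last digit and propagate the carry to the left.
        pos = length - 1
        while pos >= 0 and digits[pos] == base - 1:
            digits[pos] = 0
            pos -= 1
        if pos < 0:
            return  # carried out of the leftmost digit: every value was produced
        digits[pos] += 1


for value in ternary_counter(12):
    pass  # apply the filter to `value` here
```

It produces the tuples in the same order as itertools.product(range(3), repeat=12), with no recursion involved.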

Related

Iterating through a generator of itertools.combinations object takes forever

Edit:
After all the discussion with juanpa and fusion in the comments, and with Kevin on Python chat, I have come to the conclusion that iterating through a generator takes as long as iterating through any other object, because the generator produces the combinations on the fly. Moreover, fusion's approach worked great for len(arr) up to 1000 (maybe up to 5k, but there it times out on an online judge. Please note that this is not because of finding min_variance_sub itself; I also have to compute the sum of absolute differences of all possible pairs in min_variance_sub). I am going to accept fusion's approach as the answer, because it answered the question.
But I will also create a new question for that problem statement (more like a Q&A, where I will also answer the question for future visitors. I got the answer from submissions by other candidates, an editorial by the problem setter, and code by the problem setter himself, though I do not understand the approach they used). I will link to the other question once I create it :)
It's HERE
The original question starts below
I'm using itertools.combinations on an array so first up I tried something like
aList = [list(x) for x in list(cmb(arr, k))]
where cmb = itertools.combinations, arr is the list, and k is an int.
This works fine for len(arr) < 20 or so, but it raised a MemoryError when len(arr) became 50 or more.
On a suggestion by Kevin on Python chat, I used a generator, and it was amazingly fast at generating those combinations, like this:
aGen = (list(x) for x in cmb(arr, k))
But it's very slow to iterate through this generator object.
I tried something like
for p in aGen:
    continue
and even this code seems to take forever.
Kevin also suggested an answer talking about kth combination which was nice but in my case I actually want to test all the possible combinations and select the one with minimum variance.
So what would be the memory efficient way of checking all the possible combinations of an array (a list) to have minimum variance (to be precise, I only need to consider sub arrays having exactly k number of elements)
Thank You For Any Help.
You can sort the list of n elements first,
then slide a window of length k along the sorted list
and compute the variance of each of the n-k+1 windows.
The smallest of these is the minimum over all combinations, because a minimum-variance subset of size k always consists of k consecutive elements of the sorted list.
def myvar(arr):
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l
input_list = [.......]
sorted_list = sorted(input_list)
variance = None
min_variance_sub = None
for i in range(len(sorted_list) - k + 1):
    sub = sorted_list[i:i+k]
    var = myvar(sub)
    if variance is None or var<variance:
        variance = var
        min_variance_sub=sub
print(min_variance_sub)
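To convince yourself that the window really finds the same minimum as trying every combination, the two can be compared on a small input. This is a self-contained sketch in Python 3; the function names and the sample data are made up for the check and are not part of the answer above.

```python
from itertools import combinations


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def min_variance_window(values, k):
    # Sort once, then look only at the n-k+1 runs of k neighbouring elements.
    ordered = sorted(values)
    windows = (ordered[i:i + k] for i in range(len(ordered) - k + 1))
    return min(windows, key=variance)


def min_variance_brute_force(values, k):
    # Reference implementation: every k-combination, feasible for small inputs only.
    return min((sorted(c) for c in combinations(values, k)), key=variance)


data = [7, 1, 12, 3, 8, 20, 9]  # arbitrary sample values
best = min_variance_window(data, 3)
reference = min_variance_brute_force(data, 3)
```

The window version does O(n log n) work for the sort plus n-k+1 variance computations, while the brute-force version has to visit every k-combination.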

Memoized to DP solution - Making Change

Recently I read a problem to practice DP. I wasn't able to come up with one, so I tried a recursive solution which I later modified to use memoization. The problem statement is as follows :-
Making Change. You are given n types of coin denominations of values
v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can
always make change for any amount of money C. Give an algorithm which
makes change for an amount of money C with as few coins as possible.
[on problem set 4]
I got the question from here
My solution was as follows :-
def memoized_make_change(L, index, cost, d):
    if index == 0:
        return cost
    if (index, cost) in d:
        return d[(index, cost)]
    count = cost // L[index]  # integer division, so the count stays an int
    val1 = memoized_make_change(L, index-1, cost%L[index], d) + count
    val2 = memoized_make_change(L, index-1, cost, d)
    x = min(val1, val2)
    d[(index, cost)] = x
    return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best(lowest count) to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive, which is why a bottom-up (iterative) approach might be preferred. Is that possible for this problem?
Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of non-negative integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
    If x in cache: return cache[x]
    Handle base cases, i.e. where one or more components of x is zero
    Otherwise:
        Make one or more recursive calls
        Combine those results into `result`
        cache[x] = result
        return result
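As a concrete instance of this template, here is how the question's function might look with lru_cache doing the caching. It is a sketch that keeps the question's recurrence unchanged; the wrapper name `make_change` and the inner `solve` are my own naming.

```python
from functools import lru_cache


def make_change(denominations, amount):
    coins = tuple(denominations)  # ascending order, coins[0] == 1 as in the question

    @lru_cache(maxsize=None)
    def solve(index, cost):
        # Base case: only the 1-unit coin is left, so `cost` coins are needed.
        if index == 0:
            return cost
        # Same two branches as the question: take as many of coins[index]
        # as fit, or skip this denomination entirely.
        use = cost // coins[index] + solve(index - 1, cost % coins[index])
        skip = solve(index - 1, cost)
        return min(use, skip)

    return solve(len(coins) - 1, amount)


coins_needed = make_change([1, 5, 10, 25], 63)
```

Only (index, cost) are arguments of the cached function; the constant list of denominations is captured by the closure, which matches the idea of leaving constants out of x.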
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
    For y starting at (0, 0, ...) and increasing towards x:
        Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
    cache = multidimensional array sized according to x
    for i in range(first component of x):
        for j in ...:
            (as many loops as needed; better yet use `itertools.product`)
            If this is a base case, write the appropriate value to cache
            Otherwise, compute "recursive" index values to use, look up
            the values, perform the computation and store the result
    return the appropriate ("last") value from cache
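Applied to the question, the outline above might look like the following sketch. It fills a table indexed by (index, cost) with the question's recurrence exactly as written; the function name is hypothetical.

```python
def make_change_bottom_up(coins, amount):
    # table[index][cost] plays the role of d[(index, cost)] in the question.
    table = [[0] * (amount + 1) for _ in coins]
    for cost in range(amount + 1):
        table[0][cost] = cost  # base case: only 1-unit coins available
    for index in range(1, len(coins)):
        for cost in range(amount + 1):
            count, rest = divmod(cost, coins[index])
            use = table[index - 1][rest] + count   # was a recursive call
            skip = table[index - 1][cost]          # was a recursive call
            table[index][cost] = min(use, skip)
    return table[-1][amount]


coins_needed = make_change_bottom_up([1, 5, 10, 25], 63)
```

Both lookups only touch row index - 1, which is already complete when row index is being filled, so no recursion is needed. Note that this reproduces the question's recurrence, which tries either zero or the maximum number of copies of each coin; that is not always optimal. For coins [1, 9, 10] and amount 28 it yields 4, while 10 + 9 + 9 needs only 3 coins.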
I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index 0 upwards:
    for each choice of cost:
        compute value corresponding to index,cost
In practice, I find that the iterative approach can be significantly faster (perhaps around 4x) for simple problems, as it avoids the overhead of function calls and of checking the cache for preexisting values.

Fastest way to apply function on different lists in python

I have a complex algorithm to build in order to select the best combination of elements in my list.
I have a list of 20 elements. I build all the combinations of this list using the algorithm below; the result is a list of 2^20-1 elements (without duplicates):
from itertools import combinations
def get_all_combinations(input_list):
    for i in xrange(len(input_list)):
        for item in combinations(input_list, r = i + 1):
            yield list(item)
input_list = [1,4,6,8,11,13,5,98,45,10,21,34,46,85,311,133,35,938,345,310]
# a generator has no len(), so count the items instead
print(sum(1 for _ in get_all_combinations(input_list))) # 1048575
I have another algorithm that is applied on every list, then calculate the max.
# this is just an example
from math import sqrt
def calcul_factor(item):
    return max(item) * min(item) / sqrt(min(item))
I tried to do it this way, but it takes a long time:
factorsList = []
l = []
columnsList = get_all_combinations(input_list)
for x in columnsList:
    i = calcul_factor(x)
    factorsList.append(i)
    l.append(x)
print "max", max(factorsList)
print "Best combinations:", l[factorsList.index(max(factorsList))]
Would using map/lambda expressions help, by making the calculation of the maximum parallel?
Any hints on how to do that?
In case you can't find a better algorithm (which might be needed here) you can avoid creating those big lists by using generators.
With the help of itertools.chain you can combine the itertools.combinations-generators. Furthermore the max-function can take a function as a key.
Your code can be reduced to:
from itertools import chain, combinations
all_combinations = chain.from_iterable(combinations(input_list, i) for i in range(1, len(input_list) + 1))
max(all_combinations, key=calcul_factor)
Since this code relies solely on generators it might be faster (doesn't mean fast enough).
Edit: I generally agree with Hugh Bothwell, that you should be trying to find a better algorithm before going with an implementation like this. Especially if your lists are going to contain more than 20 elements.
If you can easily calculate calcul_factor(item + [k]) given calcul_factor(item), you might greatly benefit from a dynamic-programming approach.
If you can eliminate some bad solutions early, it will also greatly reduce the total number of combinations to consider (branch-and-bound).
If the calculation is reasonably well-behaved, you might even be able to use, e.g., the simplex method or a linear solver and walk directly to a solution (something like O(n**2 log n) runtime instead of O(2**n)).
Could you show us the actual calcul_factor code and an actual input_list?

creating a hash-based sorting algorithm

For experimental and learning purposes, I am trying to create a sorting algorithm from a hash function that gives a value based on the alphabetical sequence of the string; ideally it would then place the string in the right position directly from that hash. I tried looking for a hash-based sorting function, but only found one for integers, and it would be a memory hog if adapted for my purposes.
The reasoning is that theoretically if done right this algorithm can achieve O(n) speeds or nearly so.
So here is what I have worked out in Python so far:
letters = {'a':0,'b':1,'c':2,'d':3,'e':4,'f':5,'g':6,'h':7,'i':8,'j':9,
'k':10,'l':11,'m':12,'n':13,'o':14,'p':15,'q':16,'r':17,
's':18,'t':19,'u':20,'v':21,'w':22,'x':23,'y':24,'z':25,
'A':0,'B':1,'C':2,'D':3,'E':4,'F':5,'G':6,'H':7,'I':8,'J':9,
'K':10,'L':11,'M':12,'N':13,'O':14,'P':15,'Q':16,'R':17,
'S':18,'T':19,'U':20,'V':21,'W':22,'X':23,'Y':24,'Z':25}
def sortlist(listToSort):
    listLen = len(listToSort)
    newlist = []
    for i in listToSort:
        k = letters[i[0]]
        for j in i[1:]:
            k = (k*26) + letters[j]
        norm = k/pow(26,len(i)) # get a float hash that is normalized (I think that's what it is called)
        # 2nd part
        idx = int(norm*len(newlist)) # get a general idea of where it should go
        if newlist: # find the right place from idx
            if norm < newlist[idx][1]:
                while norm < newlist[idx][1] and idx > 0: idx -= 1
                if norm > newlist[idx][1]: idx += 1
            else:
                while norm > newlist[idx][1] and idx < (len(newlist)-1): idx += 1
                if norm > newlist[idx][1]: idx += 1
        newlist.insert(idx,[i,norm]) # put it in the right place with the "norm" to ref later when sorting
    return newlist
I think that the 1st part is good, but the 2nd part needs help. So the questions are: what would be the best way to do something like this, and is it even possible to get O(n) time (or near that) out of it?
In the testing I did, an 88,000-word list took about 5 minutes and a 10,000-word list took about 30 seconds; it got a lot worse as the list size went up.
If this idea actually works out, I would recode it in C to get some real speed and optimizations.
The 2nd part is there only because it works, even if slowly, and I can't think of a better way to do it for the life of me. I would like to replace it with something that does not need the extra loops, if at all possible.
Thanks for any advice or ideas that you could give.
On sorting in O(n): you can't do it generally for all inputs, period. It is simply, fundamentally, mathematically impossible.
Here's the nice, short information-theoretic proof of impossibility: to sort by comparisons, you have to be able to distinguish among the n! possible orderings of the input; to do so, you have to get log2(n!) bits of data; each comparison yields at most one bit, so you need at least log2(n!) comparisons, which is on the order of n log n. Any sorting algorithm that claims to run in O(n) is either running on specialized data (e.g. data with a fixed number of bits), or is not correct.
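The step from log2(n!) to n log n can be spelled out. For the lower bound it is enough to keep only the largest half of the factors of n!, and for the upper bound to replace every factor by n:

```latex
\log_2(n!) \;=\; \sum_{k=1}^{n} \log_2 k
\;\ge\; \sum_{k=\lceil n/2 \rceil}^{n} \log_2 k
\;\ge\; \frac{n}{2}\,\log_2\frac{n}{2},
\qquad
\log_2(n!) \;\le\; n \log_2 n .
```

Together these give log2(n!) = Θ(n log n), so no comparison-based sort can get by with a linear number of comparisons in the worst case.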
Implementing a sorting algorithm is a good learning exercise, but you may want to stick to existing algorithms until you are comfortable with the concepts and methods commonly employed. It might be rather frustrating otherwise if the algorithm doesn't work.
Have fun learning!
P.S. Python's built-in timsort algorithm is really good on a lot of real-world data. So, if you need a general sorting algorithm for production code, you can usually rely on .sort/sorted to be fast enough for your needs. (And, if you can understand timsort, you'll do better than 90% of the Python-wielding population :)
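If you still want to experiment with the "jump straight to the right place" idea, a bucket sort is the usual way to do it without the repeated list.insert calls, each of which is O(n) on a Python list. The sketch below is an illustration under stated assumptions, not a recommendation over sorted: it assumes the words consist of ASCII letters only, compares them case-insensitively like the letters table in the question, and uses a made-up function name and an arbitrary prefix_len.

```python
def bucket_sort_words(words, prefix_len=4):
    n = len(words)
    if n == 0:
        return []
    buckets = [[] for _ in range(n)]
    span = 26 ** prefix_len
    for word in words:
        lowered = word.lower()
        # Base-26 value of the first prefix_len letters, padded with 'a' (0).
        # Integer arithmetic keeps the bucket order consistent with word order.
        k = 0
        for pos in range(prefix_len):
            digit = ord(lowered[pos]) - ord('a') if pos < len(lowered) else 0
            k = k * 26 + digit
        buckets[k * n // span].append(word)
    result = []
    for bucket in buckets:
        # Words sharing a bucket still need a real sort among themselves.
        result.extend(sorted(bucket, key=str.lower))
    return result


words = ["pear", "Apple", "fig", "banana", "apricot", "kiwi", "a", "zebra", "fog"]
ordered = bucket_sort_words(words)
```

This is close to linear only when the words spread evenly over the buckets. Real word lists cluster heavily on common prefixes, and in the worst case everything lands in one bucket and the cost is that of a single ordinary sort, which is consistent with the lower bound above.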

Efficient strings containing each other

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a,b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:, check whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
B = list(B)  # B is indexed by position below
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in xrange(len(b) - 9):
        j = i + 10
        prefix_table[b[i:j]].add(k)
for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in b
            # (a hit in the prefix table is necessary, but not sufficient)
            if a in B[k]:
                print((a, B[k]))
    else:
        for k in xrange(len(B)):
            # a is too small to use the table; check if
            # a is in any b
            if a in B[k]:
                print((a, B[k]))
Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections
with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]
def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result
def g():
    return [(a, b) for a in A for b in B if a in b]
def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]
def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]
print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute-force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):
Initialize a hashtable or trie
For each string b in B:
    For every substring of that string:
        Add the substring to the hashtable/trie (this is no worse than 55*O(B)=O(B)), with metadata of which string it belonged to
For each string a in A:
    Do an O(1) query to your hashtable/trie to find all B-strings it is in, yield those
(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite-automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on if you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
A very fast way to search for a lot of strings is to make use of a finite automaton (so you were not far off with your regexp guess), namely the Aho-Corasick string matching machine, which is used in tools like grep, virus scanners and the like.
First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in the details). This automaton then reads the input string(s) and outputs all search strings it finds (you probably want to modify it a bit, so that it also outputs the string in which the search string was found).
This method has the advantage that it searches all search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!
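To make the idea concrete, here is a compact pure-Python sketch of such an automaton: a trie over the words in A, failure links computed breadth-first, and a single left-to-right pass over each b. It is written for illustration only, assumes all patterns are non-empty, and is not the code of any particular library; the function names are my own.

```python
from collections import deque


def build_automaton(patterns):
    # goto[node]: char -> child node, fail[node]: failure link,
    # out[node]: patterns that end when this node is reached.
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)
    queue = deque(goto[0].values())  # depth-1 nodes fail back to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_pairs(A, B):
    goto, fail, out = build_automaton(A)
    pairs = set()
    for b in B:
        node = 0
        for ch in b:
            while node and ch not in goto[node]:
                node = fail[node]
            node = goto[node].get(ch, 0)
            for a in out[node]:
                pairs.add((a, b))
    return pairs


pairs = find_pairs(["he", "she", "his", "hers"], ["ushers", "history", "xyz"])
```

Each character of each b is consumed once by the scan (plus the time to report matches), no matter how many words A contains.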
There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about their performance, usability or correctness.
EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:
import pyahocorasick
def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s,b) for b in B for (i,res) in t.iter(b) for s in res]
One thing I observed, though, was that when testing this implementation with @SvenMarnach's script it yielded slightly fewer results than the other methods, and I am not sure why. I wrote a mail to the creator; maybe he will figure it out.
There are specialized index structures for this, see for example
http://en.wikipedia.org/wiki/Suffix_tree
You'd build a suffix-tree or something similar for B, then use A to query it.
