Limiting the number of combinations/permutations in Python - python

I was going to generate some combinations using itertools when I realized that as the number of elements increases, the time taken increases exponentially. Can I limit or indicate the maximum number of permutations to be produced, so that itertools stops once that limit is reached?
What I mean is:
Currently I have
#big_list is a list of lists
permutation_list = list(itertools.product(*big_list))
Currently this permutation list has over 6 million permutations. I am pretty sure that if I add another list, this number will hit the billion mark.
What I really need is a significant number of permutations (let's say 5000). Is there a way to limit the size of the permutation_list that is produced?

You need to use itertools.islice, like this
itertools.islice(itertools.product(*big_list), 5000)
It doesn't create the entire list in memory, but it returns an iterator which consumes the actual iterable lazily. You can convert that to a list like this
list(itertools.islice(itertools.product(*big_list), 5000))
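A minimal sketch of the islice approach on a small input, so the result is easy to inspect. The contents of big_list and the limit of 5 are made up for illustration; the second call shows the optional start, stop and step arguments.

```python
import itertools

# Hypothetical stand-in for the real big_list (3 * 2 * 2 = 12 products in total).
big_list = [[1, 2, 3], ['a', 'b'], [True, False]]
limit = 5

# Only the first `limit` products are ever generated.
first_five = list(itertools.islice(itertools.product(*big_list), limit))

# islice(iterable, start, stop, step): products at indices 2, 5 and 8.
every_third = list(itertools.islice(itertools.product(*big_list), 2, 10, 3))

print(first_five)
print(every_third)
```

If the product has fewer items than the limit, islice simply stops early.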

itertools.islice has many benefits, such as the ability to set start and step. The solutions below aren't that flexible, and you should use them only if start is 0 and step is 1. On the other hand, they don't require any imports.
You could create a tiny wrapper around itertools.product
it = itertools.product(*big_list)
pg = (next(it) for _ in range(5000)) # generator expression
(next(it) for _ in range(5000)) returns a generator not capable of producing more than 5000 values. Convert it to list by using the list constructor
pl = list(pg)
or by wrapping the generator expression with square brackets (instead of round ones)
pl = [next(it) for _ in range(5000)] # list comprehension
Another solution, which is just as efficient as the first one, is
pg = (p for p, _ in zip(itertools.product(*big_list), range(5000)))
Works in Python 3+, where zip returns an iterator that stops when the shortest iterable is exhausted. Conversion to list is done as in the first solution.
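One difference between these variants shows up when the product has fewer items than the limit. The sketch below assumes Python 3.7 or later, where a StopIteration raised inside a generator is turned into a RuntimeError; the input `small` and the limit are made up for illustration.

```python
import itertools

small = [[1, 2], [3, 4]]   # hypothetical input: only 4 products exist
limit = 10                 # deliberately larger than the number of products

# islice and zip both stop quietly when the product runs out.
via_islice = list(itertools.islice(itertools.product(*small), limit))
via_zip = [p for p, _ in zip(itertools.product(*small), range(limit))]

# The next()-based generator expression does not: next(it) raises
# StopIteration inside the generator, which surfaces as RuntimeError.
it = itertools.product(*small)
try:
    via_next = list(next(it) for _ in range(limit))
except RuntimeError:
    via_next = None

print(via_islice, via_zip, via_next)
```

So the next()-based wrapper is only safe when you know the product has at least as many items as the limit.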

You can try this method to get a particular number of permutations. The number of results a permutation produces is n!, where n is the number of elements in the list. For example, if you want only 2 results, you can try the following:
Use a temporary counter variable and stop once it passes the limit
from itertools import permutations
m = ['a', 'b', 'c', 'd']
per = permutations(m)
temp = 1
for i in per:  # iterate lazily; list(per) would build all n! permutations first
    if temp <= 2:  # 2 is the limit set
        print(i)
        temp = temp + 1
    else:
        break

Related

How can I partition `itertools.combinations` such that I can process the results in parallel?

I have a massive quantity of combinations (86 choose 10, which yields 3.5 trillion results) and I have written an algorithm which is capable of processing 500,000 combinations per second. I would not like to wait 81 days to see the final results, so naturally I am inclined to separate this into many processes to be handled by my many cores.
Consider this naive approach:
import itertools
from concurrent.futures import ProcessPoolExecutor
def algorithm(combination):
    # returns a boolean in roughly 1/500000th of a second on average
    ...
def process(combinations):
    for combination in combinations:
        if algorithm(combination):
            # will be very rare (a few hundred times out of trillions) if that matters
            print("Found matching combination!", combination)
combination_generator = itertools.combinations(eighty_six_elements, 10)
# My system will have 64 cores and 128 GiB of memory
with ProcessPoolExecutor(max_workers=63) as executor:
    # assign 1,000,000 combinations to each process
    # it may be more performant to use larger batches (to avoid process startup overhead)
    # but eventually I need to start worrying about running out of memory
    group = []
    for combination in combination_generator:
        group.append(combination)
        if len(group) >= 1_000_000:
            executor.submit(process, group)
            group = []
    if group:
        # submit the final, partially filled batch
        executor.submit(process, group)
This code "works", but it has virtually no performance gain over a single-process approach, since it is bottlenecked by the generation of the combinations in the for combination in combination_generator loop, which still runs in the parent process.
How can I pass this computation off to the child-processes so that it can be parallelized? How can each process generate a specific subset of itertools.combinations?
p.s. I found this answer, but it only deals with generating single specified elements, whereas I need to efficiently generate millions of specified elements.
I'm the author of one answer to the question you already found for generating the combination at a given index. I'd start with that: Compute the total number of combinations, divide that by the number of equally sized subsets you want, then compute the cut-over for each of them. Then do your subprocess tasks with these combinations as bounds. Within each subprocess you'd do the iteration yourself, not using itertools. It's not hard:
def next_combination(n: int, c: list[int]):
    """Compute next combination, in lexicographical order.
    Args:
      n: the number of items to choose from.
      c: a list of integers in strictly ascending order,
         each of them between 0 (inclusive) and n (exclusive).
         It will get modified by the call.
    Returns: the list c after modification,
      or None if this was the last combination.
    """
    i = len(c)
    while i > 0:
        i -= 1
        n -= 1
        if c[i] == n:
            continue
        c[i] += 1
        for j in range(i + 1, len(c)):
            c[j] = c[j - 1] + 1
        return c
    return None
Note that both the code above and the one from my other answer assume that you are looking for combinations of elements from range(n). If you want to combine other elements, do combinations for this range then use the elements of the found combinations as indices into your sequence of actual things you want to combine.
The main advantage of the approach above is that it ensures equal batch size, which might be useful if processing time is expected to be mostly determined by batch size. If processing time still varies greatly even for batches of the same size, that might be too much effort. I'll post an alternative answer addressing that.
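To make the batching idea concrete, here is a rough sketch of how the pieces could fit together. The helpers combination_at_index, batch_bounds and iter_batch are hypothetical names written for this illustration (they are not the code from the linked answer), and next_combination is repeated only so the block runs on its own.

```python
import math


def next_combination(n, c):
    """Advance c to the next combination of range(n); None after the last one."""
    i = len(c)
    while i > 0:
        i -= 1
        n -= 1
        if c[i] == n:
            continue
        c[i] += 1
        for j in range(i + 1, len(c)):
            c[j] = c[j - 1] + 1
        return c
    return None


def combination_at_index(n, r, index):
    """Hypothetical helper: the index-th r-combination of range(n), lexicographic."""
    if not 0 <= index < math.comb(n, r):
        raise IndexError("combination index out of range")
    c = []
    start = 0
    for remaining in range(r, 0, -1):
        for v in range(start, n):
            # Number of combinations that have v in the current position.
            count = math.comb(n - v - 1, remaining - 1)
            if index < count:
                c.append(v)
                start = v + 1
                break
            index -= count
    return c


def batch_bounds(n, r, batches):
    """Split range(comb(n, r)) into `batches` nearly equal (start, stop) pairs."""
    total = math.comb(n, r)
    return [(total * b // batches, total * (b + 1) // batches)
            for b in range(batches)]


def iter_batch(n, r, start, stop):
    """Yield the combinations with indices start..stop-1; meant to run in a worker."""
    if start >= stop:
        return
    c = combination_at_index(n, r, start)
    for _ in range(stop - start):
        yield tuple(c)
        c = next_combination(n, c)
```

Each worker would then receive only the small tuple (n, r, start, stop), for example through executor.submit, and generate its own combinations locally instead of having them pickled by the parent.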
You can do a recursive divide-and-conquer approach, where you make a decision based on the expected number of combinations. If it is small, use itertools. If it is large, handle the case of the first element being included and of it being excluded both in recursive calls.
The result does not ensure batches of equal size, but it does give you an upper bound on the size of each batch. If processing time of each batch is somewhat varied anyway, that might be good enough.
import collections.abc
import itertools
import math
import typing
T = typing.TypeVar('T')
def combination_batches(
    seq: collections.abc.Sequence[T],
    r: int,
    max_batch_size: int,
    prefix: tuple[T, ...] = ()
) -> collections.abc.Iterator[collections.abc.Iterator[tuple[T, ...]]]:
    """Compute batches of combinations.
    Each yielded value is itself a generator over some of the combinations.
    Taken together they produce all the combinations.
    Args:
      seq: The sequence of elements to choose from.
      r: The number of elements to include in each combination.
      max_batch_size: How many elements each returned iterator
        is allowed to iterate over.
      prefix: Used during recursive calls, prepended to each returned tuple.
    Yields: generators which each generate a subset of all the combinations,
      in a way that generators together yield every combination exactly once.
    """
    if math.comb(len(seq), r) > max_batch_size:
        # One option: first element taken.
        yield from combination_batches(
            seq[1:], r - 1, max_batch_size, prefix + (seq[0],))
        # Other option: first element not taken.
        yield from combination_batches(
            seq[1:], r, max_batch_size, prefix)
        return
    yield (prefix + i for i in itertools.combinations(seq, r))
See https://ideone.com/GD6WYl for a more complete demonstration.
Note that I don't know how well the process pool executor deals with generators as arguments, whether it is able to just forward a short description of each. There is a chance that in order to ship the generator to a subprocess it will actually generate all the values. So instead of yielding the generator expression the way I did, you might want to yield some object which pickles more nicely but still offers iteration over the same values.
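One possible shape for such an object, as a sketch only: CombinationBatch and combination_batch_objects are hypothetical names, the recursion mirrors combination_batches, and max_batch_size is assumed to be at least 1. The class stores just the sequence, r and the prefix, so that small description is all that needs to be pickled.

```python
import itertools
import math


class CombinationBatch:
    """Hypothetical picklable stand-in for one generator expression."""

    def __init__(self, seq, r, prefix=()):
        self.seq = tuple(seq)
        self.r = r
        self.prefix = tuple(prefix)

    def __len__(self):
        # Number of combinations in this batch, without generating them.
        return math.comb(len(self.seq), self.r)

    def __iter__(self):
        # The combinations are generated wherever the object is iterated.
        for tail in itertools.combinations(self.seq, self.r):
            yield self.prefix + tail


def combination_batch_objects(seq, r, max_batch_size, prefix=()):
    """Same recursion as combination_batches, yielding CombinationBatch objects."""
    seq = tuple(seq)
    if math.comb(len(seq), r) > max_batch_size:
        # First element taken.
        yield from combination_batch_objects(
            seq[1:], r - 1, max_batch_size, prefix + (seq[0],))
        # First element not taken.
        yield from combination_batch_objects(
            seq[1:], r, max_batch_size, prefix)
        return
    yield CombinationBatch(seq, r, prefix)
```

A worker function such as process from the question could take one of these objects directly, since iterating over it yields the same tuples as the generator expression would.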

What's the overhead of using built-in python functions like zip() and join() on the performance of my function?

Below I have provided the function to calculate the LCP (longest common prefix). I want to know its Big O time complexity and space complexity. Can I say it is O(n)? Or do zip() and join() affect the time complexity? I am wondering whether the space complexity is O(1). Please correct me if I am wrong. The input to the function is a list of strings, e.g. ["flower","flow","flight"].
def longestCommonPrefix(self, strs):
    res = []
    for x in zip(*strs):
        if len(set(x)) == 1:
            res.append(x[0])
        else:
            break
    return "".join(res)
Iterating to get a single tuple value from zip(*strs) takes O(len(strs)) time and space. That's just the time it takes to allocate and fill a tuple of that length.
Iterating to consume the whole iterator takes O(len(strs) * min(len(s) for s in strs)) time, but shouldn't take any additional space over a single iteration.
Your iteration code is a bit trickier, because you may stop iterating early, when you find the first place within your strings where some characters don't match. In the worst case, all the strings are identical (up to the length of the shortest one) and so you'd use the time complexity above. And in the best case there is no common prefix, so you can use the single-value iteration as your best case.
But there's no good way to describe "average case" performance because it depends a lot on the distributions of the different inputs. If your inputs were random strings, you could do some statistics and predict an average number of iterations, but if your input strings are words, or even more likely, specific words expected to have common prefixes, then it's very likely that all bets are off.
Perhaps the best way to describe that part of the function's performance is actually in terms of its own output. It takes O(len(strs) * len(self.longestCommonPrefix(strs))) time to run.
As for str.join, running "".join(res) if we know nothing about res takes O(len(res) + len("".join(res))) for both time and space. Because your code only joins individual characters, the two lengths are going to be the same, so we can say that the join in your function takes O(len(self.longestCommonPrefix(strs))) time and space.
Putting things together, we can see that the main loop takes a multiple of the time taken by the join call, so we can ignore the latter and say that the function's time complexity is just O(len(strs) * len(self.longestCommonPrefix(strs))). However, the memory usage complexities for the two parts are independent, and we can't easily predict whether the number of strings or the length of the output will grow faster. So we need to combine them and say that you need O(len(strs) + len(self.longestCommonPrefix(strs))) space.
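To see the output-dependent bound on a concrete input, here is a small instrumented sketch. The function name and the inspected counter are additions for illustration only; the loop is otherwise the same as in the question, written as a plain function.

```python
def longest_common_prefix_with_cost(strs):
    """Return (prefix, number of characters looked at by the main loop)."""
    res = []
    inspected = 0
    for column in zip(*strs):
        # Each step builds one tuple holding len(strs) characters.
        inspected += len(column)
        if len(set(column)) != 1:
            break
        res.append(column[0])
    return "".join(res), inspected


prefix, inspected = longest_common_prefix_with_cost(["flower", "flow", "flight"])
print(prefix, inspected)
```

With three strings and a two-character prefix, the loop looks at 3 * (2 + 1) = 9 characters: one column per prefix character plus the column where the mismatch is found.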
Time:
Your code is O(n * m), where n is the length of the list and m is the length of the shortest string in the list (zip stops at the shortest input).
zip() itself is O(1) in Python 3.x. The function allocates a special iterator (called the zip object) and assigns the parameter array to an internal field. In the case of zip(*x) (as pointed out by @juanpa.arrivillaga), unpacking builds a tuple, so the call is O(n). Either way the total stays O(n): the O(n) iteration over the list (tuple) plus the O(n) zip(*x) call is still O(n).
join() is O(n), where n is the total length of the input.
set() is O(m), where m is the length of its input.
Space:
It is O(n), because in the worst scenario, res will need to append x[0] n times.

How to select the all the permutations

I want to get a list of all permutations, or the ranks of those permutations, in which the ith element is k, where the length n of the permutation is greater than k. The items being permuted are the integers 1..n. How can this be done?
For the first element of the permutation it's trivial. But how does it work for the ith element? Iterating through all n! permutations is not an option.
First of all, notice that this problem can easily be transformed into just ranking/listing permutations. All that you need to do is write a function that takes a permutation of 1..(n-1) and transforms it into a permutation meeting your condition, and vice versa. (Going one way, increment every number in the permutation that is k or bigger and insert k in the ith position. Going the other way, remove the k and decrement everything larger than k.)
But ranking/listing is a well-understood problem. See https://rosettacode.org/wiki/Permutations/Rank_of_a_permutation for solutions in multiple languages, including three in Python.
This idea can be extended to more conditions like the first. You just need to write more general transforms first.
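A minimal sketch of the two transforms, assuming i is a 0-based index. The names embed, strip and constrained_permutations are made up for this illustration. Going forward shifts every value that is at least k up by one to make room for k; going back removes k and closes the gap.

```python
from itertools import permutations


def embed(perm, i, k):
    """Map a permutation of 1..n-1 to a permutation of 1..n with k at index i."""
    shifted = [x + 1 if x >= k else x for x in perm]
    return tuple(shifted[:i] + [k] + shifted[i:])


def strip(perm, k):
    """Inverse of embed: drop k and shift the larger values back down."""
    return tuple(x - 1 if x > k else x for x in perm if x != k)


def constrained_permutations(n, i, k):
    """Yield every permutation of 1..n whose element at index i is k."""
    for perm in permutations(range(1, n)):
        yield embed(perm, i, k)


print(list(constrained_permutations(4, 1, 3)))
```

This visits only (n-1)! permutations rather than n!. To work with ranks instead of full lists, a rank or unrank routine for permutations of 1..(n-1) could be combined with the same two transforms.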
Warning: the resulting list holds (n-1)! permutations, so its total size is O(n*(n-1)!). Note also that this filter still iterates over all n! permutations, and that i is a 0-based index here.
import itertools
i, k, n = 1, 5, 10  # pick these
permutations = [p for p in itertools.permutations(range(1, n + 1)) if p[i] == k]

why is my iterator implementation very inefficient?

I wrote the following python script to count the number of occurrences of a character (a) in the first n characters of an infinite string.
from itertools import cycle
def count_a(str_, n):
    count = 0
    str_ = cycle(str_)
    for i in range(n):
        if next(str_) == 'a':
            count += 1
    return count
My understanding of iterators is that they are supposed to be efficient, but this approach is super slow for very large n. Why is this so?
The cycle iterator might not be as efficient as you think. The documentation says:
Make an iterator returning elements from the iterable and saving a
copy of each.
When the iterable is exhausted, return elements from the saved copy.
Repeats indefinitely
...Note, this member of the toolkit may require significant auxiliary
storage (depending on the length of the iterable).
Why not simplify and not use the iterator at all? It adds unnecessary overhead and gives you no benefit. As long as n does not exceed len(str_), you can count the occurrences with a simple str_[:n].count('a')
The first problem here is that despite using itertools, you're still running an explicit Python-level for loop. To gain the C-level speed boost from itertools, you want to keep all the iteration inside the fast itertools machinery.
So let's do this step by step. First we want the first n characters of the infinite string as a finite sequence. To do this, you can use the itertools.islice function:
str_first_n_chars = islice(cycle(str_), n)
Next you want to count the number of occurrences of the letter (a). You can do this with some variation of either of these (you may want to experiment to see which variant is faster):
count_a = sum(1 for c in str_first_n_chars if c == 'a')
count_a = len(tuple(filter('a'.__eq__, str_first_n_chars)))
This is all well and good, but it is still slow for really large n, because you need to iterate through str_ many, many times, for example when n = 10**10000. In other words, this algorithm is O(n).
There's one last improvement we can make. Notice that the number of (a) in str_ never changes from one pass to the next. Rather than iterating through str_ repeatedly for large n, we can be a bit smarter and use a little math so that we only need to iterate through str_ twice. First we count the number of (a) in a single stretch of str_:
count_a_single = str_.count('a')
Then we find out how many times we would have to iterate through str_ to reach length n by using the divmod function:
iter_count, remainder = divmod(n, len(str_))
Then we can just multiply iter_count by count_a_single and add the number of (a) in the remaining length. We don't need cycle or islice here because remainder < len(str_):
count_a = iter_count * count_a_single + str_[:remainder].count('a')
With this method, the runtime performance of the algorithm grows only on the length of a single cycle of str_ rather than n. In other words, this algorithm is O(len(str_)).
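Put together as one function, the steps above might look like the sketch below. It assumes str_ is a non-empty string, since divmod would otherwise divide by zero.

```python
def count_a(str_, n):
    """Count 'a' in the first n characters of str_ repeated forever."""
    # Full passes over str_, plus the length of the final partial pass.
    iter_count, remainder = divmod(n, len(str_))
    count_a_single = str_.count('a')
    return iter_count * count_a_single + str_[:remainder].count('a')


print(count_a("aba", 10))
```

Here the first 10 characters are "abaabaabaa": three full passes with two (a) each, plus one more from the remainder.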

What is the fastest way of computing powerset of an array in Python?

Given a list of numbers, e.g. x = [1,2,3,4,5] I need to compute its powerset (set of all subsets of that list). Right now, I am using the following code to compute the powerset, however when I have a large array of such lists (e.g. 40K of such arrays), it is extremely slow. So I am wondering if there can be any way to speed this up.
superset = [sorted(x[:i]+x[i+s:]) for i in range(len(x)) for s in range(len(x))]
I also tried the following code, however it is much slower than the code above.
from itertools import chain, combinations
def powerset(x):
    xx = list(x)
    return chain.from_iterable(combinations(xx, i) for i in range(len(xx)+1))
You can represent a powerset more efficiently by having all subsets reference the original set as a list and having each subset consist of a number whose bits indicate inclusion in the set. Thus you can enumerate the power set by computing the number of elements and then iterating through the integers with that many bits. However, as has been noted in the comments, the power set grows extremely fast, so if you can avoid having to compute or iterate through the power set, you should do so if at all possible.
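A minimal sketch of the bitmask idea, with made-up function names. Each integer from 0 to 2**n - 1 stands for one subset, and the actual list is only materialized when it is needed.

```python
def subset_from_mask(items, mask):
    """Build the subset whose members are marked by the set bits of mask."""
    return [items[i] for i in range(len(items)) if (mask >> i) & 1]


def powerset_masks(items):
    """Yield every subset of items, one per integer in range(2 ** len(items))."""
    for mask in range(1 << len(items)):
        yield subset_from_mask(items, mask)


x = [1, 2, 3]
subsets = list(powerset_masks(x))
print(subsets)
```

If the subsets are only being counted, compared or tested for membership, keeping just the integer masks and skipping subset_from_mask avoids building 2**n lists at all.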
