Given a list containing N sublists of varying lengths, find all unique combinations of size k, selecting no more than one element from each sublist.
The order of the elements in the combination is not relevant: (a, b) = (b, a)
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output =
[
('B1', 'T1'),('B1', 'T2'),('B1', 'L1'),('B1', 'L2'),('B1', 'L3'),('B1', 'L4'),
('B2', 'T1'),('B2', 'T2'),('B2', 'L1'),('B2', 'L2'),('B2', 'L3'),('B2', 'L4'),
('B3', 'T1'),('B3', 'T2'),('B3', 'L1'),('B3', 'L2'),('B3', 'L3'),('B3', 'L4'),
('T1', 'L1'),('T1', 'L2'),('T1', 'L3'),('T1', 'L4'),
('T2', 'L1'),('T2', 'L2'),('T2', 'L3'),('T2', 'L4')
]
Extra points for a pythonic way of doing it
Speed/efficiency matters; the idea is to use this on a list with hundreds of sublists ranging from 5 to 50 in length.
What I have been able to accomplish so far:
Using for and while loops to move pointers and build the answer; however, I am having a hard time figuring out how to include the k parameter to set the size of the tuple combinations dynamically. (I'm not really happy with it.)
def build_combinations(lst):
    result = []
    count_of_lst = len(lst)
    for i, sublist in enumerate(lst):
        if i == count_of_lst - 1:
            continue
        for item in sublist:
            nxt = i  # cursor over the later sublists; the original reset i itself, clobbering the loop index
            while nxt < len(lst) - 1:
                j = 0
                while j <= len(lst[nxt + 1]) - 1:
                    comb = (item, lst[nxt + 1][j])
                    result.append(comb)
                    j = j + 1
                nxt = nxt + 1
    return result
I've seen many similar questions on Stack Overflow, but none of them addressed the parameters the way I am trying to (one item from each list, with the size of the combinations being a parameter of the function).
I tried itertools' combinations, product, and permutations, flipping them around without success. Whenever I use itertools, I either have a hard time taking only one item from each list or can't set the size of the tuple I need.
I tried a more math/matrix approach with NumPy arrays but didn't get far. There's definitely a way of solving this with NumPy, which is why I tagged numpy as well.
You need to combine two itertools helpers: combinations to select the k unique sublists to draw from, then product to combine their elements:
from itertools import combinations, product

sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = [pair
                   for lists in combinations(sample_list, sample_k)
                   for pair in product(*lists)]
print(expected_output)
If you want to get really fancy/clever/ugly, you can push all the work down to the C layer with:
from itertools import combinations, product, starmap, chain
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = list(chain.from_iterable(starmap(product, combinations(sample_list, sample_k))))
print(expected_output)
That will almost certainly run meaningfully faster for huge inputs (especially if you can loop the results from chain.from_iterable directly rather than realizing them as a list), but it's probably not worth the ugliness unless you're really tight for cycles (I wouldn't expect much more than a 10% speed-up, but you'd need to benchmark to be sure).
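For illustration (this part is my addition, not the original answer): if you only need to iterate over the results, you can consume the chain lazily and never build the list at all. The helper name combos_of_k and the consumer handle are hypothetical:
from itertools import chain, combinations, product, starmap

def combos_of_k(lst, k):
    # lazily yields size-k tuples, one element from each of k distinct sublists
    return chain.from_iterable(starmap(product, combinations(lst, k)))

for combo in combos_of_k(sample_list, sample_k):
    handle(combo)  # hypothetical consumer; the full result list is never materialized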
I have a certain number of sets, each containing a variable number of unique numbers: unique within the set they belong to, and not found in any of the others.
I'd like an algorithm, implemented preferably in Python (but any other language will do), that finds one combination of numbers, one from each of these sets, that sums to a specified number; knowing, if this helps, that the same set can appear multiple times, and an element from a set can be reused.
Practical example: let's say I have the following sets:
A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
I want to obtain a number combination with a method find_subset_combination(expected_sum, subset_list)
>>> find_subset_combination(41, [A, B, B, C, B])
[1, 8, 10, 12, 10]
A solution to this problem has already been proposed here; however, it is rather a brute-force approach. As the number of sets and their sizes will be much larger in my case, I'd like an algorithm that performs as few iterations as possible.
What approach would you suggest?
Firstly, let's solve this for just two sets; this is known as the 'two sum' problem. You have two sets a and b whose elements add to l. Since a + b = l, we know that l - a = b. This is important, as we can determine whether l - a is in b in O(1) time rather than looping through b in O(b) time. This means we can solve the two sum problem in O(a) time.
Note: for brevity, the provided code only produces one solution; however, changing two_sum into a generator function can return them all.
def two_sum(l, a, b):
    for i in a:
        if l - i in b:
            return i, l - i
    raise ValueError('No solution found')
Next we can solve the 'four sum' problem. This time we have four sets c, d, e and f. By combining c and d into a, and e and f into b we can use two_sum to solve the problem in O(cd + ef) space and time. To combine the sets we can just use a cartesian product, adding the results together.
Note: To get all results perform a cartesian product on all resulting a[i] and b[j].
import itertools

def combine(*sets):
    result = {}
    for keys in itertools.product(*sets):
        # map each attainable sum to the element tuples that produce it
        result.setdefault(sum(keys), []).append(keys)
    return result
def four_sum(l, c, d, e, f):
    a = combine(c, d)
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
It should be apparent that the 'three sum' problem is just a simplified version of the 'four sum' problem. The difference is that we're given a at the start rather than being asked to calculate it. This runs in O(a + ef) time and O(ef) space.
def three_sum(l, a, e, f):
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (i, *b[j][0])
Now we have enough information to solve the 'six sum' problem. The question comes down to how do we divide all these sets?
If we decide to pair them together then we can use the 'three sum' solution to get what we want. But this may not run in the best time, as it runs in O(ab + cdef), or O(n^4) time if they're all the same size.
If we decide to put them in trios then we can use the 'two sum' to get what we want. This runs in O(abc + def), or O(n^3) if they're all the same size.
At this point we should have all the information to make a generic version that runs in O(n^⌈s/2⌉) time and space, where s is the number of sets passed to the function.
def n_sum(l, *sets):
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    b = combine(*sets[midpoint:])
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
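As a quick check against the example in the question (this snippet is my addition; exactly which valid combination comes back depends on set iteration order):
A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
print(n_sum(41, A, B, B, C, B))  # one element per set, e.g. (1, 8, 10, 12, 10)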
You can further optimize the code: the sizes of the two sides of the two sum matter quite a lot.
To exemplify this, imagine 4 sets of 1 number on one side and 4 sets of 1000 numbers on the other. This will run in O(1^4 + 1000^4) time, which is obviously really bad. Instead you can balance both sides of the two sum to make it much smaller: by having 2 sets of 1 number and 2 sets of 1000 numbers on each side of the equation, the performance improves to O(1^2×1000^2 + 1^2×1000^2), or simply O(1000^2), which is far smaller than O(1000^4).
Expanding on the previous point: if you have 3 sets of 1000 numbers and 3 sets of 10 numbers, then the best solution is to put two 1000s on one side and everything else on the other:
1000^2 + 10^3×1000 = 2_000_000
versus interleaving them, sorted and the same size on either side, i.e. (10, 1000, 10) and (1000, 10, 1000):
10^2×1000 + 10×1000^2 = 10_100_000
Additionally, if the same sets are provided an even number of times, you can cut the running time in half by calling combine only once. For example, if the input is n_sum(l, a, b, c, a, b, c), then (without the above optimizations) the second call to combine is just a waste of time and space. A sketch of the balancing idea follows.
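A minimal sketch of the balancing idea, reusing the combine and two_sum helpers above. The exhaustive search over splits and the name balanced_n_sum are my own illustration, not part of the original approach (and the search is itself exponential in the number of sets, so it only makes sense for a handful of them):
from functools import reduce
from itertools import combinations
from operator import mul

def balanced_n_sum(l, *sets):
    # try every split of the sets into two groups and keep the split whose
    # two cartesian products are smallest in total
    n = len(sets)
    best = None
    for r in range(1, n):
        for left in combinations(range(n), r):
            right = [i for i in range(n) if i not in left]
            cost = (reduce(mul, (len(sets[i]) for i in left), 1)
                    + reduce(mul, (len(sets[i]) for i in right), 1))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    _, left, right = best
    a = combine(*(sets[i] for i in left))
    b = combine(*(sets[i] for i in right))
    i, j = two_sum(l, a, b)
    # note: elements come back grouped by the split, not in input order
    return (*a[i][0], *b[j][0])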
Here's my issue:
I have a large integer (anywhere between 0 and 2^32-1). Let's call this number X.
I also have a list of integers, currently unsorted. They are all unique numbers, greater than 0 and less than X. Assume that there is a large number of items in this list, let's say over 100,000.
I need to find up to 3 numbers in this list (let's call them A, B and C) that add up to X.
A, B and C all need to be inside of the list, and they can be repeated (for example, if X is 4, I can have A=1, B=1 and C=2 even though 1 would only appear once in the list).
There can be multiple solutions for A, B and C, but I just need to find one possible solution as quickly as possible.
I've tried creating a for loop structure like this:
for A in itemlist:
    for B in itemlist:
        for C in itemlist:
            if A + B + C == X:
                exit("Done")
But since my list of integers contains over 100,000 items, this uses too much memory and would take far too long.
Is there any way to find a solution for A, B and C without using an insane amount of memory or taking an insane amount of time? Thanks in advance.
You can reduce the running time from n^3 to n^2 by using a set, something like this:
s = set(itemlist)
for A in itemlist:
    for B in itemlist:
        if X - (A + B) in s:
            print(A, B, X - (A + B))
            break
You can also sort the list and use binary search if you want to save memory, as sketched below.
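A minimal sketch of that binary-search variant (my illustration; the name three_sum_bisect is made up):
from bisect import bisect_left

def three_sum_bisect(itemlist, X):
    items = sorted(itemlist)      # sort once, O(n log n)
    n = len(items)
    for a in items:
        for b in items:
            c = X - (a + b)       # the third number we need
            pos = bisect_left(items, c)
            if pos < n and items[pos] == c:
                return a, b, c
    return None                   # no triple sums to X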
import collections

nums = collections.Counter(itemlist)
target = t  # the target sum
for i in range(len(itemlist)):
    if itemlist[i] > target: continue
    for j in range(i+1, len(itemlist)):
        if itemlist[i] + itemlist[j] > target: continue
        # check that the remainder is still available once the two picks are removed
        if target - (itemlist[i] + itemlist[j]) in nums - collections.Counter([itemlist[i], itemlist[j]]):
            print("Found", itemlist[i], itemlist[j], target - (itemlist[i] + itemlist[j]))
Borrowing from @inspectorG4dget's code, this has two modifications:
If C < B then we can short-circuit the loop.
Use bisect_left() instead of collections.Counter().
This seems to run more quickly.
from random import randint
from bisect import bisect_left

X = randint(0, 2**32 - 1)
itemset = set(randint(0, X) for _ in range(100000))
itemlist = sorted(itemset)  # sort the list for binary search
l = len(itemlist)

for i, A in enumerate(itemlist):
    for j in range(i+1, l):  # use numbers above A
        B = itemlist[j]
        C = X - A - B  # calculate C
        if C <= B: continue
        # see https://docs.python.org/2/library/bisect.html#searching-sorted-lists
        pos = bisect_left(itemlist, C)  # renamed from i so the outer loop index isn't clobbered
        if pos != l and itemlist[pos] == C:
            print("Found", A, B, C)
To reduce the number of comparisons, we enforce A < B < C.
My problem is similar to the question asked here. Differing from that question, I need an algorithm which generates r-tuple permutations of a given list with repeated elements.
An example:
list1 = [1,1,1,2,2]
for i in permu(list1, 3):
    print(i)
[1,1,1]
[1,1,2]
[1,2,1]
[2,1,1]
[1,2,2]
[2,1,2]
[2,2,1]
It seems that itertools.permutations would work fine here after adding simple filtering to remove the repeated tuples. However, in my real cases the lists are much longer than this example, and as you already know, the number of permutations grows factorially with the length of the list.
So far, what I have is below. This code does the described job, but it is not efficient.
import itertools

def generate_paths(paths, N=None):
    groupdxs = [i for i, group in enumerate(paths) for _ in range(len(group))]
    oldCombo = ()  # a tuple, so it compares cleanly against the tuples below
    result = []
    for dxCombo in itertools.permutations(groupdxs, N):
        if dxCombo <= oldCombo:  # simple filter to skip duplicates
            continue
        oldCombo = dxCombo
        parNumbers = partialCombinations(dxCombo, len(paths))
        if not parNumbers.count(0) >= len(paths) - 1:  # skip when all nodes come from the same path, same graph
            groupTemps = []
            for groupInd in range(len(parNumbers)):
                groupTemp = list(itertools.combinations(paths[groupInd], parNumbers[groupInd]))
                groupTemps.append(groupTemp)
            for parGroups in itertools.product(*groupTemps):
                iters = [iter(group) for group in parGroups]
                p = [next(iters[i]) for i in dxCombo]
                result.append(p)
    return result

def partialCombinations(combo, numGroups):
    result = [0] * numGroups
    for x in combo:
        result[x] += 1
    return result
In the first for loop, I need to generate all possible r-length tuples, which makes the algorithm slower. There is a good solution for permutations without the r-length parameter in the link above. How can I adapt that algorithm to mine? Or are there any better ways?
I haven't thought this through very well for your case, but here's another approach.
Instead of giving large lists to permutations, you could give a small list that has no duplicates. You can use combinations_with_replacement to generate these smaller lists (you'll need to filter them to match the quantities of duplicates from your original input) and then get the permutations of each combination.
import itertools

possible_values = (1, 2)
n_positions = 3
sorted_combinations = itertools.combinations_with_replacement(possible_values, n_positions)

unique_permutations = set()
for combo in sorted_combinations:
    # TODO: Do filtering for acceptable combinations before passing to permutations.
    for p in itertools.permutations(combo):
        unique_permutations.add(p)

print("len(unique_permutations) = %i. It should be %i^%i = %i.\nPermutations:"
      % (len(unique_permutations), len(possible_values), n_positions,
         pow(len(possible_values), n_positions)))
for p in unique_permutations:
    print(p)
Given two sorted arrays like the following:
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
I would like the output to be:
c = array([1,2,3,4,5,6,7,8,9,10])
or:
c = array([1,2,3,4,4,5,6,7,8,9,10])
I'm aware that I can do the following:
c = unique(concatenate((a,b)))
I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.
Any idea is welcome. Thanks.
Since you use numpy, I doubt that bisect helps you at all... So instead I would suggest two smaller things:
Do not use np.sort; use the c.sort() method instead, which sorts the array in place and avoids the copy.
np.unique must use np.sort, which is not in place. So instead of using np.unique, do the logic by hand: first sort (in place), then do what np.unique does by hand (check its Python code too), i.e. flag = np.concatenate(([True], ar[1:] != ar[:-1])), with which unique = ar[flag] (with ar being sorted). To do a bit better, you should probably make the flag operation itself in place, i.e. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which avoids basically one full copy of flag.
I am not sure about this, but .sort has 3 different algorithms; since your arrays may be almost sorted already, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this, unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
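For example, with the arrays from the question:
a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
print(insort(a, b))  # [ 1  2  3  4  5  6  7  8  9 10]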
Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).
Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.
Inserting all the elements into a hash table (such as a set) gets you partway there: that's O(N) time to uniquify, but then you still need to sort, which takes O(n*log(n)). Still not great.
The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smallest head of your input lists, i.e. a merge. Unfortunately neither numpy nor Python seems to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    # set out[k] to the smaller of in1[i] or in2[j],
    # increment k, then increment whichever of i or j was used
    while (i != len(in1)) and (j != len(in2)):
        if in1[i] <= in2[j]:
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
    while i != len(in1):  # copy any tail left in in1...
        out[k] = in1[i]; i += 1; k += 1
    while j != len(in2):  # ...or in2
        out[k] = in2[j]; j += 1; k += 1
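A hypothetical call site once the snippet above is compiled with Cython (the module name merge_cy is made up):
import numpy as np
from merge_cy import foo  # hypothetical compiled extension

out = np.empty(len(a) + len(b), dtype=np.intc)  # preallocate the merged output
foo(out, a, b)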
When curious about timings, it's always best to just timeit. Below, I've listed a subset of the various methods and their timings:
import numpy as np
import timeit
import heapq

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]: hi = mid
        else: lo = mid + 1
    return lo, np.insert(a, lo, [x])

size = 10000
a = np.array(range(size))
b = np.array(range(size))

def op(a, b):
    return np.unique(np.concatenate((a, b)))

def martijn(a, b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a, b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v
    return c  # added: the original snippet forgot to return c

def larsmans(a, b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a, b):
    # note: np.array over a set gives a 0-d object array; see the correctness caveat below
    return np.array(set.union(set(a), b))

def sebastian(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this, unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
Results:
martijn2 25.1079499722
OP 1.44831800461
larsmans 9.91507601738
larsmans_mod 5.87612199783
sebastian 3.50475311279e-05
My specific contribution here is larsmans_mod which avoids creating 2 sets -- it only creates 1 and in doing so cuts execution time nearly in half.
EDIT: removed martijn from the timings as it was too slow to compete. I also tested with slightly bigger (sorted) input arrays. Note that I have not tested the outputs for correctness...
In addition to the other answer on using bisect.insort, if you are not content with performance, you may try using the blist module with bisect. It should improve the performance.
Traditional list insertion complexity is O(n), while blist's complexity on insertion is O(log(n)).
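A minimal sketch of that idea, assuming the third-party blist package is installed (the example is mine, not from the blist docs):
import bisect
from blist import blist

a = blist([1, 2, 4, 5, 6, 8, 9])
for x in [3, 4, 7, 10]:
    bisect.insort(a, x)  # insertion into a blist is O(log n) rather than list's O(n)
print(list(a))  # [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]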
Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to take advantage of the fact that both arrays are presorted. This approach incurs some overhead because it creates a new array in memory, but it may be an option to consider, as this solution's time complexity is O(n+m), while the solutions with insort have O(n*m) complexity (n elements * m insertions).
import heapq
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
If you want to delete repeating values, you can use groupby:
import heapq
import itertools
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
it = (k for k,v in itertools.groupby(it))
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
You could use the bisect module for such merges, merging the second python list into the first.
The bisect* functions work for numpy arrays but the insort* functions don't. It's easy enough to use the module source code to adapt the algorithm, it's quite basic:
from numpy import array, copy, insert

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]: hi = mid
        else: lo = mid + 1
    return lo, insert(a, lo, [x])

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)
Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:
import bisect

c = copy(a)
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)  # note: insert takes (array, index, value); the original had the arguments swapped
Using this adapted insort is much more efficient than a combine and sort. Because b is sorted as well, we can track the lo insertion point and search for the next point starting there instead of considering the whole array each loop.
If you don't need to preserve a, just operate directly on that array and save yourself the copy.
More efficient still: because both lists are sorted, we can use heapq.merge:
from numpy import zeros
import heapq
c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
c[i] = v
Use the bisect module for this:
import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)  # insert returns a new array; the result must be reassigned
I can't test this right now, but it should work
The sortednp package implements an efficient merge of sorted numpy-arrays, just sorting the values, not making them unique:
import numpy as np
import sortednp
a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)
I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
Seems like no one mentioned np.union1d. Currently it is a shortcut for unique(concatenate((ar1, ar2))), but it's a short name to remember, and it has the potential to be optimized by the numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:
import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this, unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

size = int(1e7)
a = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
b = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)

np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))

import timeit
repetitions = 20
print("insort:  %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions) / repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions) / repetitions,))
Output on my machine:
insort: 1.69962s
union1d: 1.66338s