"Smart" method to solve a subset sum problem with multiple sets - python

I have a number of sets, each containing a variable number of unique numbers - unique within the set they belong to and not found in any of the others.
I'd like an algorithm, implemented preferably in Python - but any other language will do - that finds one combination of numbers, one taken from each of these sets, that sums to a specified number. If it helps: the same set can appear multiple times in the list, so an element of a set may effectively be reused.
Practical example: let's say I have the following sets:
A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
I want to obtain a number combination with a method find_subset_combination(expected_sum, subset_list)
>>> find_subset_combination(41, [A, B, B, C, B])
[1, 8, 10, 12, 10]
A solution to this problem has already been proposed here, but it is a brute-force approach; as the number of sets and their sizes will be much larger in my case, I'd like an algorithm that does as few iterations as possible.
What approach would you suggest?

First, let's solve this for just two sets. This is known as the 'two sum' problem: you have two sets a and b and want one element from each so that they sum to l. Since a + b = l we know that l - a = b. This is important, as we can determine whether l - a is in b in O(1) time, rather than looping through b to find it in O(b) time. This means we can solve the two sum problem in O(a) time.
Note: For brevity the provided code only produces one solution. However changing two_sum to a generator function can return them all.
def two_sum(l, a, b):
    for i in a:
        if l - i in b:
            return i, l - i
    raise ValueError('No solution found')
Next we can solve the 'four sum' problem. This time we have four sets c, d, e and f. By combining c and d into a, and e and f into b we can use two_sum to solve the problem in O(cd + ef) space and time. To combine the sets we can just use a cartesian product, adding the results together.
Note: To get all results perform a cartesian product on all resulting a[i] and b[j].
import itertools
def combine(*sets):
    results = {}
    for keys in itertools.product(*sets):
        results.setdefault(sum(keys), []).append(keys)
    return results
def four_sum(l, c, d, e, f):
    a = combine(c, d)
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
It should be apparent that the 'three sum' problem is just a simplified version of the 'four sum' problem. The difference is that we're given a at the start rather than being asked to calculate it. This runs in O(a + ef) time and O(ef) space.
def three_sum(l, a, e, f):
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (i, *b[j][0])
Now we have enough information to solve the 'six sum' problem. The question comes down to how do we divide all these sets?
If we decide to pair them up then we can use the 'three sum' solution to get what we want. But this may not run in the best time, as it runs in O(ab + cdef), or O(n^4) time if they're all the same size.
If we decide to put them in trios then we can use the 'two sum' to get what we want. This runs in O(abc + def), or O(n^3) if they're all the same size.
At this point we have all the information needed to make a generic version that runs in O(n^⌈s/2⌉) time and space, where s is the number of sets passed to the function.
def n_sum(l, *sets):
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    b = combine(*sets[midpoint:])
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
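For example, applied to the sets from the question (the exact tuple you get depends on set/dict iteration order, but any result sums to 41):
A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
print(n_sum(41, A, B, B, C, B))  # e.g. (1, 8, 10, 12, 10)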
You can further optimize the code: the sizes of the two sides of the two sum matter quite a lot.
To exemplify this, imagine 4 sets of 1 number on one side and 4 sets of 1000 numbers on the other. This runs in O(1^4 + 1000^4) time, which is obviously really bad. Instead you can balance the two sides of the two sum to make the work much smaller: with 2 sets of 1 number and 2 sets of 1000 numbers on each side of the equation, the cost becomes O(1^2×1000^2 + 1^2×1000^2), or simply O(1000^2), which is far smaller than O(1000^4).
Expanding on the previous point: if you have 3 sets of 1000 numbers and 3 sets of 10 numbers, then the best split is to put two of the 1000s on one side and everything else on the other side:
Two 1000s vs (10, 10, 10, 1000): 1000^2 + 10^3×1000 = 2_000_000
Interlaced, sorted and the same size on either side, (10, 1000, 10) vs (1000, 10, 1000): 10^2×1000 + 10×1000^2 = 10_100_000
Additionally, if each set appears an even number of times then you can cut the running time in half by only calling combine once, as sketched below. For example, if the input is n_sum(l, a, b, c, a, b, c) then (without the above optimizations) it should be apparent that the second call to combine is just a waste of time and space.
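A rough sketch of that idea, assuming the arguments are arranged so that the two halves consist of the same sets:
def n_sum_symmetric(l, *sets):
    # both halves are identical, so a single combine() serves both sides of the two sum
    half = combine(*sets[:len(sets) // 2])
    i, j = two_sum(l, half, half)
    return (*half[i][0], *half[j][0])

# e.g. n_sum_symmetric(l, a, b, c, a, b, c) only builds combine(a, b, c) once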


How to implement a for loop inside a for loop inside a for loop?

I have two variables (n, p) in two independent equations, one to find C and the other to find T. For the first iteration I assumed values for n and p and found C and T; now I have to find all the pairs of n and p that make C = T.
Besides being used as a constant, n also defines the range of a list in the first equation, and C is the sum of all the values in that list.
I was able to program the first iteration by defining n and p as constants, thus finding C and T, which were not equal to each other. But I don't know how to program the whole process so that it repeats for a fixed set of p values (a list rather than a constant), finding the n's that make C = T. I have to iterate through p and, for each one, find the n that satisfies the condition.
Therefore I need to make a for loop inside a for loop inside a for loop. It will be something like this:
for all the values of p in range(0, 12, 0.1) do:
    for all the values of n in range(0, 160, 0.001) do:
        "the rest of the operation that also has a for loop in it"
        that operation will result in C and T, then,
        if C = T:
            print(p and n)  # the pair (p; n) that made it possible
As you can see that's an idea, not actual code, I don't know how to write the actual code for it. I saw something about a zip function but still don't get it. Help.
I figured it out by trying to articulate the question properly (and doing some research):
import numpy as np

p = np.arange(0.1, 12.1, 0.1)
for k in range(0, p.size):
    # this will evaluate the next "for loop" (iterating with k) for all the values of p
    n = np.arange(0, 160, 0.002)
    for j in range(0, n.size):
        # this will evaluate the next "for loop" (iterating with j) for all the values of n
        # the next for loop uses "i" to iterate other operations not relevant to
        # this problem right now
        ...  # (the comparison of C and T, shown below, goes at the end of this loop)
Inside the for loop that uses j, at the end, I needed to compare the results of C and T, check whether they were equal (or as close to equal as possible), and then print the n and p that made this possible.
I still need to work on this part of the code, but a simple way to compare them could be:
comparison_parameter = 1
comparison = T - C
if comparison_parameter > comparison > -comparison_parameter:
    print('For p = ', round(p[k], 1), ', n =', round(n[j], 4))
else:
    pass
This is definitely not the most elegant way to do all this; here are some problems with this code that I would recommend improving:
I'm using lists and not generators, so Python will store every calculation in memory.
I'm not using the assertAlmostEqual() function to compare C and T (https://www.geeksforgeeks.org/python-unittest-assertalmostequal-function/).
My way gives me multiple values of n (for a given p) that satisfy the condition for a particular comparison_parameter, but I only want the one corresponding to the comparison closest to 0, meaning that C and T are closest to each other; see the sketch below.
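A minimal sketch of that last point, with hypothetical compute_C(n, p) and compute_T(n, p) helpers standing in for the real equations: instead of printing every n inside a tolerance, keep only the n whose difference is closest to zero for each p.
import numpy as np

def best_n_for_each_p(compute_C, compute_T):
    # compute_C / compute_T are placeholders for the actual equations
    for p in np.arange(0.1, 12.1, 0.1):
        best_n, best_diff = None, float('inf')
        for n in np.arange(0, 160, 0.002):
            diff = abs(compute_T(n, p) - compute_C(n, p))
            if diff < best_diff:
                best_n, best_diff = n, diff
        print('For p =', round(p, 1), ', n =', round(best_n, 4))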

Is it possible to compute median while data is still being generated? Python online median calculator

I've seen a broader version of this question asked, where the individual was looking for more than one summary statistic, but I have not seen a solution presented. I'm only interested in the median, in Python, here.
Let's say I'm generating a million values in a loop. I cannot save the million values into a list and compute median once I'm done, because of memory issues. Is it possible to compute median as I go? For mean, I'd just sum values progressively, and once done divide by a million. For median the answer does not seem as intuitive.
I'm stuck on the "thought experiment" section of this, so I haven't been able to really try anything that I think might work. I'm not sure if this is an algorithm that has been implemented, but I can't find it if it has been.
This won't be possible unless your idea of "values" is restricted in some exploitable way; and/or you can make more than one pass over the data; and/or you're willing to store stuff on disk too. Suppose you know there are 5 values, all distinct integers, and you know the first 3 are 5, 6, 7. The median will be one of those, but at this point you can't know which one, so you have to remember all of them. If 1 and 2 come next, the median is 5; if 4 and 8 next, 6; if 8 and 9 next, it's 7.
This obviously generalizes to any odd number of values range(i, i + 2*N+1), at the point you've seen the first N+1 of them: the median can turn out to be any one of those first N+1, so unless there's something exploitable about the nature of the values, you have to remember all of them at that point.
An example of something exploitable: you know there are at most 100 distinct values. Then you can use a dict to count how many of each appear, and easily calculate the median at the end from that compressed representation of the distribution.
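A minimal sketch of that exploitable case (for an even count this returns one of the two middle values rather than their average):
from collections import Counter

def median_from_counts(stream):
    counts = Counter(stream)   # value -> number of occurrences
    n = sum(counts.values())
    limit = n // 2 + 1         # rank of the selected middle element
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= limit:
            return value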
Approximating
For reasons already mentioned, there is no "shortcut" here to be had in general. But I'll attach Python code for a reasonable one-pass approximation method, as detailed in "The Remedian: A Robust Averaging Method for Large Data Sets". That paper also points to other approximation methods.
The key: pick an odd integer B greater than 1. Then successive elements are stored in a buffer until B of them have been recorded. At that point, the median of those advances to the next level, and the buffer is cleared. Their median remains the only memory of those B elements retained.
The same pattern continues at deeper levels too: after B of those median-of-B medians have been recorded, the median of those advances to the next level, and the second-level buffer is cleared. The median advanced then remains the only memory of the B**2 elements that went into it.
And so on. At worst it can require storing B * log(N, B) values, where N is the total number of elements. In Python it's easy to code it so buffers are created as needed, so N doesn't need to be known in advance.
If B >= N, the method is exact, but then you've also stored every element. If B < N, it's an approximation to the median. See the paper for details - it's quite involved. Here's a case that makes it look very good ;-)
>>> import random
>>> xs = [random.random() for i in range(1000001)]
>>> sorted(xs)[500000] # true median
0.5006315438367565
>>> w = MedianEst(11)
>>> for x in xs:
... w.add(x)
>>> w.get()
0.5008443883489089
Perhaps surprisingly, it does worse if the inputs are added in sorted order:
>>> w.clear()
>>> for x in sorted(xs):
... w.add(x)
>>> w.get()
0.5021045181828147
User beware! Here's the code:
class MedianEst:
    def __init__(self, B):
        assert B > 1 and B & 1
        self.B = B
        self.half = B >> 1
        self.clear()

    def add(self, x):
        for xs in self.a:
            xs.append(x)
            if len(xs) == self.B:
                x = sorted(xs)[self.half]
                xs.clear()
            else:
                break
        else:
            self.a.append([x])

    def get(self):
        total = 0
        weight = 1
        accum = []
        for xs in self.a:
            total += len(xs) * weight
            accum.extend((x, weight) for x in xs)
            weight *= self.B
        # `total` elements in all
        limit = total // 2 + 1
        total = 0
        for x, weight in sorted(accum):
            total += weight
            if total >= limit:
                return x

    def clear(self):
        self.a = []

Random valid permutations of an array in Python [duplicate]

This question already has answers here: Generate a random derangement of a list.
Suppose that we are given arrays A and B of positive integers. A and B contain the same integers (the same number of times), so they are naturally the same length.
Consider a pair of permutations U of A and V of B to be valid if U[i] != V[i] for i = 0, 1, ..., len(U) - 1.
We want to find a valid pair of permutations for A and B. However, we want our algorithm to be such that all pairs of valid permutations are equally likely to be returned.
I've been working on this problem today and cannot seem to come up with a sleek solution. Here is my best solution thus far:
import random
def random_valid_permutation(values):
    A = values[:]
    B = values[:]
    while not is_valid_permutation(A, B):
        random.shuffle(A)
        random.shuffle(B)
    return A, B

def is_valid_permutation(A, B):
    return all([A[i] != B[i] for i in range(len(A))])
Unfortunately, since this method involves a random shuffle of each array, it could in theory take infinite time to produce a valid output. I have come up with a couple of alternatives that do run in finite (and reasonable) time, but their implementation is much longer.
Does anyone know of a sleek way to solve this problem?
First note that every permutation A has the same number of derangements B as any other permutation. Thus it is enough to generate a single A and then generate random permutations B until you get a match. The probability that a random permutation B is a derangement of A is known to be approximately 1/e (a little better than 1 out of 3), essentially independently of the number of items. There is over a 99% probability that you will find a valid B in less than a dozen trials.
Unless your list of values is large, fishing with the built-in random.shuffle will likely be quicker than rolling your own shuffle that checks, with each new placement of an item, whether it has led to a clash. The following is almost instantaneous with a thousand elements and still only takes about a second or so with a million elements:
import random
def random_valid_permutation(values):
    A = values[:]
    B = values[:]
    random.shuffle(A)
    random.shuffle(B)
    while not is_valid_permutation(A, B):
        random.shuffle(B)
    return A, B

def is_valid_permutation(A, B):
    return all(A[i] != B[i] for i in range(len(A)))
As an optimization, I removed the [ and ] from the definition of is_valid_permutation(), since all() can work directly on a generator expression. There is no reason to create the whole list in memory, since any clash will typically be detected long before the end of the list.
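As a sanity check of the "dozen trials" claim above: each shuffle of B is a derangement of A with probability roughly 1/e, so the chance that 12 independent shuffles all fail is about (1 - 1/e)**12:
from math import e
print((1 - 1 / e) ** 12)  # ~0.004, i.e. success well over 99% of the time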

Optimize finding pairs of arrays which can be compared

Definition: array A = (a1, a2, ..., an) is >= B = (b1, b2, ..., bn) if they are the same size and a_i >= b_i for every i from 1 to n.
For example:
[1,2,3] >= [1,2,0]
[1,2,0] not comparable with [1,0,2]
[1,0,2] >= [1,0,0]
I have a list consisting of a large number of such arrays (approx. 10000, but it can be bigger). The arrays' elements are positive integers. I need to remove all arrays from this list that are bigger than at least one of the other arrays. In other words: if there exists a B such that A >= B, then remove A.
Here is my current O(n^2) approach, which is extremely slow: I simply compare every array with all the other arrays and remove it if it's bigger. Are there any ways to speed it up?
import numpy as np
import time
import random
def filter_minimal(lst):
    n = len(lst)
    to_delete = set()
    for i in xrange(n-1):
        if i in to_delete:
            continue
        for j in xrange(i+1, n):
            if j in to_delete:
                continue
            if all(lst[i] >= lst[j]):
                to_delete.add(i)
                break
            elif all(lst[i] <= lst[j]):
                to_delete.add(j)
    return [lst[i] for i in xrange(len(lst)) if i not in to_delete]

def test(number_of_arrays, size):
    x = map(np.array, [[random.randrange(0, 10) for _ in xrange(size)] for i in xrange(number_of_arrays)])
    return filter_minimal(x)

a = time.time()
result = test(400, 10)
print time.time() - a
print len(result)
P.S. I've noticed that using numpy.all instead of the builtin Python all slows the program down dramatically. What could be the reason?
Might not be exactly what you are asking for, but this should get you started.
import numpy as np
import time
import random
def compare(x, y):
    # Reshape x to a higher dimensional array
    compare_array = x.reshape(-1, 1, x.shape[-1])
    # You can now compare every x with every y element-wise simultaneously
    mask = (y >= compare_array)
    # Create a mask that first ensures that all elements of y are greater than or
    # equal to x, and then ensures that this is the case at least once.
    mask = np.any(np.all(mask, axis=-1), axis=-1)
    # Apply this mask to x
    return x[mask]

def test(number_of_arrays, size, maxval):
    # Create arrays of shape (number_of_arrays, size) with maximum value maxval.
    x = np.random.randint(maxval, size=(number_of_arrays, size))
    y = np.random.randint(maxval, size=(number_of_arrays, size))
    return compare(x, y)

print test(50, 10, 20)
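For what it's worth, here is a sketch of how the same broadcasting trick could be turned on a single list to do the filtering from the question. Note it needs O(n^2 * size) memory, and exactly equal arrays would eliminate each other, unlike the question's loop which keeps one of them:
def filter_minimal_np(arrs):
    x = np.asarray(arrs)                                  # shape (n, size)
    # ge[i, j] is True when row i >= row j elementwise
    ge = np.all(x[:, None, :] >= x[None, :, :], axis=-1)
    np.fill_diagonal(ge, False)                           # every row trivially dominates itself; ignore that
    dominates_someone = ge.any(axis=1)                    # row i is >= at least one other row
    return x[~dominates_someone]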
First of all we need to carefully check the objective. Is it true that we delete any array that is > ANY of the other arrays, even the deleted ones? For example, if A > B and C > A and B=C, then do we need to delete only A or both A and C? If we only need to delete INCOMPATIBLE arrays, then it is a much harder problem. This is a very difficult problem because different partitions of the set of arrays may be compatible, so you have the problem of finding the largest valid partition.
Assuming the easy problem, a better way to define the problem is that you want to KEEP all arrays which have at least one element < the corresponding element in ALL the other arrays. (In the hard problem, it is the corresponding element in the other KEPT arrays. We will not consider this.)
Stage 1
To solve this problem what you do is arrange the arrays in columns and then sort each row while maintaining the key to the array and the mapping of each array-row to position (POSITION lists). For example, you might end up with a result in stage 1 like this:
row 1: B C D A E
row 2: C A E B D
row 3: E D B C A
Meaning that for the first element (row 1) array B has a value >= C, C >= D, etc.
Now, sort and iterate the last column of this matrix ({E D A} in the example). For each item, check if the element is less than the previous element in its row. For example, in row 1, you would check if E < A. If this is true you return immediately and keep the result. For example, if E_row1 < A_row1 then you can keep array E. Only if the values in the row are equal do you need to do a stage 2 test (see below).
In the example shown you would keep E, D, A (as long as they passed the test above).
Stage 2
This leaves B and C. Sort the POSITION list for each. For example, this will tell you that the row with B's minimum position is row 2. Now do a direct comparison between B and every array below it in the minimum row, here row 2. Here there is only one such array, D. Do a direct comparison between B and D. This shows that B < D in row 3, therefore B is compatible with D. If the item is compatible with every array below its minimum position, keep it. We keep B.
Now we do the same thing for C. In C's case we need only do one direct comparison, with A. C dominates A so we do not keep C.
Note that in addition to testing items that did not appear in the last column we need to test items that had equality in Stage 1. For example, imagine D=A=E in row 1. In this case we would have to do direct comparisons for every equality involving the array in the last column. So, in this case we direct compare E to A and E to D. This shows that E dominates D, so E is not kept.
The final result is we keep A, B, and D. C and E are discarded.
The overall performance of this algorithm is O(n^2 log n) for Stage 1, plus somewhere between O(n) (lower bound) and O(n log n) (upper bound) for Stage 2. So the maximum running time is O(n^2 log n + n log n) and the minimum running time is O(n^2 log n + n). Note that the running time of your algorithm is cubic, O(n^3): you compare every pair of arrays (n*n pairs) and each comparison takes n element comparisons, i.e. n*n*n.
In general, this will be much faster than the brute force approach. Most of the time will be spent sorting the original matrix, a more or less unavoidable task. Note that you could potentially improve my algorithm by using priority queues instead of sorting, but the resulting algorithm would be much more complicated.

Establishing highest score for a set of combinations

I'm programming in python.
I have the data of the following form:
(A, B, C, D, E, F, G, H, I)
Segments of this data are associated with a score, for example:
scores:
(A, B, C, D) = .99
(A, B, C, E) = .77
(A, B, E) = .66
(G,) = 1
(I,) = .03
(H, I) = .55
(I, H) = .15
(E, F, G) = .79
(B,) = .93
(A, C) = .46
(D,) = .23
(D, F, G) = .6
(F, G, H) = .34
(H,) = .09
(Y, Z) = 1
We can get a score for this data as follows:
A B C E + D F G + H I = .77 * .6 * .55 = 0.2541
another possibility is:
A B C D + E F G + H + I = .99 * .79 * .09 * .03 = 0.00211167
So, the first combination gives the higher score.
I wish to write an algorithm that establishes, for the data above, the highest possible score. Members of the data must not be used more than once. In other words:
A B C E + E F G + D + H I
is not valid. How would you recommend I go about solving this?
Thanks,
Barry
EDIT:
I should clarify that (H, I) != (I, H), and that (I, H) is not a subsegment of ABCDEFGHI, but is a subsegment of ABIHJ.
Another thing I should mention is that scores is a very large set (millions of entries) and the segments on which we're calculating the score have an average length of around 10. Furthermore, the way in which I calculate the score might change in the future. Maybe I'd like to add the subsegments and take an average instead of multiplying, who knows... for that reason it might be better to separate the code which calculates the possible combinations from the actual calculation of the score. At the moment, I'm inclined to think that itertools.combinations might offer a good starting point.
Brute-forcing, by using recursion (for each segment in order, we recursively find the best score using the segment, and the best score not using the segment. A score of 0 is assigned if there is no possible combination of segments for the remaining items):
segment_scores = (('A', 'B', 'C', 'D'), .99), (('A', 'B', 'C', 'E'), .77) #, ...

def best_score_for(items, segments, subtotal=1.0):
    if not items: return subtotal
    if not segments: return 0.0
    segment, score = segments[0]
    best_without = best_score_for(items, segments[1:], subtotal)
    return max(
        best_score_for(items.difference(segment), segments[1:], subtotal * score),
        best_without
    ) if items.issuperset(segment) else best_without

best_score_for(set('ABCDEFGHI'), segment_scores)  # .430155
This sounds like an NP-complete problem in disguise, a derivative of the Knapsack problem. This means you may have to walk through all possibilities to get an exact solution.
Even though... wait. Your values are between 0 and 1, so the results can only get smaller or at most stay equal. Therefore the solution is trivial: take the single group with the highest value, and be done with it. (I'm aware that's probably not what you want, but you might have to add another condition, e.g. that all elements have to be used..?)
A beginning of a brute force approach:
import operator
from functools import reduce  # needed on Python 3; harmless on Python 2

segment_scores = {('A', 'B', 'C', 'D'): .99, ('A', 'B', 'C', 'E'): .77} #...

def isvalid(segments):
    """returns True if there are no duplicates

    for i in range(len(segments)-1):
        for element in segments[i]:
            for j in range(len(segments)-i-1):
                othersegment = segments[j+i+1]
                if element in othersegment:
                    return False
    return True

    better way:
    """
    flattened = [item for sublist in segments for item in sublist]
    # http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
    return len(set(flattened)) == len(flattened)

def getscore(segments):
    """
    p = 1.0
    for segment in segments:
        p *= segment_scores[segment]
    return p

    better way:
    """
    return reduce(operator.mul, [segment_scores[segment] for segment in segments])
Now, create all 2^(num segments) possible combinations of segments, check each for validity, and if it is valid compute its score while keeping track of the current winner and its high score. Just a starting point; a sketch of that loop follows.
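One way to finish that loop off, reusing isvalid() and getscore() from above (feasible only for a small number of segments, and without requiring that every element be covered):
from itertools import combinations

def brute_force_best(segment_scores):
    segments = list(segment_scores)            # the keys: tuples of elements
    best_score, best_combo = 0.0, ()
    for r in range(1, len(segments) + 1):      # all 2^(num segments) subsets, by size
        for combo in combinations(segments, r):
            if isvalid(combo):
                score = getscore(combo)
                if score > best_score:
                    best_score, best_combo = score, combo
    return best_score, best_combo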
OK, just another update: there's lots of room for optimization here, in particular since you're multiplying (I'm assuming now that you have to use each element).
Since your total score never increases, you can drop any exploration path [segment0, segment1] that falls below the current high score, because it will only get worse for any segment2 you add (see the sketch after this list).
If you don't just iterate over all possibilities but start by exploring all segment lists that contain the first segment (by recursively exploring all segment lists that contain, in addition, the second segment, and so on), you can break as soon as, for example, the first and the second segment together are invalid, i.e. there is no need to explore all possibilities of grouping (A,B,C,D) and (A,B,C,D,E).
Since multiplying hurts, trying to minimize the number of segments might be a suitable heuristic, so start with big segments with high scores.
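A minimal sketch of the first point (pruning paths that fall below the current best), assuming every element must be used, all scores lie in (0, 1], and segments is a sequence of (segment, score) pairs such as list(segment_scores.items()):
def best_score_pruned(items, segments):
    best = 0.0

    def explore(remaining, idx, subtotal):
        nonlocal best
        if not remaining:                            # every element covered: candidate solution
            best = max(best, subtotal)
            return
        if idx == len(segments) or subtotal <= best:
            return                                   # prune: scores never increase, so this path cannot win
        segment, score = segments[idx]
        if remaining.issuperset(segment):
            explore(remaining.difference(segment), idx + 1, subtotal * score)
        explore(remaining, idx + 1, subtotal)

    explore(set(items), 0, 1.0)
    return best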
First, I'd suggest assigning a unique symbol to the segments that make sense.
Then you probably want combinations of those symbols (or perhaps permutations, I'm sure you know your problem better than I do), along with a "legal_segment_combination" function you'd use to throw out bad possibilities - based on a matrix of which ones conflict and which don't.
>>> import itertools
>>> itertools.combinations([1,2,3,4], 2)
<itertools.combinations object at 0x7fbac9c709f0>
>>> list(itertools.combinations([1,2,3,4], 2))
[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
>>>
Then take the max over the valid possibilities that make it past legal_segment_combination().
First, you could take the logarithm of each score; then the problem is to maximize the sum of the scores instead of their product. After that, you can solve the problem as an Assignment Problem, where to each data point you assign one sequence.
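For instance, with the dict form of segment_scores from the earlier answer, the log transform is a one-liner (all scores must be strictly positive):
import math

log_scores = {segment: math.log(score) for segment, score in segment_scores.items()}
# maximizing the sum of log_scores over a valid combination is equivalent to
# maximizing the product of the original scores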
