I'm programming in Python.
I have data of the following form:
(A, B, C, D, E, F, G, H, I)
Segments of this data are associated with a score, for example:
scores:
(A, B, C, D) = .99
(A, B, C, E) = .77
(A, B, E) = .66
(G,) = 1
(I,) = .03
(H, I) = .55
(I, H) = .15
(E, F, G) = .79
(B,) = .93
(A, C) = .46
(D,) = .23
(D, F, G) = .6
(F, G, H) = .34
(H,) = .09
(Y, Z) = 1
We can get a score for this data as follows:
A B C E + D F G + H I = .77 * .6 * .55 = 0.2541
Another possibility is:
A B C D + E F G + H + I = .99 * .79 * .09 * .03 = 0.00211167
So, the first combination gives the higher score.
I wish to write an algorithm that establishes the highest possible score for the data above. The members of the data must not be used more than once. In other words:
A B C E + E F G + D + H I
is not valid. How would you recommend I go about solving this?
Thanks,
Barry
EDIT:
I should clarify that (H, I) != (I, H) and that (I, H) is not a subsegment for ABCDEFGHI, but is a subsegment of ABIHJ.
Another thing I should mention is that scores is a very large set (millions of entries) and the segment on which we're calculating the score has an average length of around 10. Furthermore, the way in which I calculate the score might change in the future. Maybe I'd like to add the subsegments and take an average instead of multiplying, who knows... For that reason it might be better to separate the code which calculates the possible combinations from the actual calculation of the score. At the moment, I'm inclined to think that itertools.combinations might offer a good starting point.
Brute-forcing, by using recursion (for each segment in order, we recursively find the best score using the segment, and the best score not using the segment. A score of 0 is assigned if there is no possible combination of segments for the remaining items):
segment_scores = (('A', 'B', 'C', 'D'), .99), (('A', 'B', 'C', 'E'), .77) #, ...

def best_score_for(items, segments, subtotal=1.0):
    if not items: return subtotal
    if not segments: return 0.0
    segment, score = segments[0]
    best_without = best_score_for(items, segments[1:], subtotal)
    return max(
        best_score_for(items.difference(segment), segments[1:], subtotal * score),
        best_without
    ) if items.issuperset(segment) else best_without

best_score_for(set('ABCDEFGHI'), segment_scores)  # .430155
This sounds like an NP-complete problem in disguise, a derivative of the Knapsack problem. This means you may have to walk through all possibilities to get an exact solution.
Even though... wait. Your values are between 0 and 1, so the results can only get smaller or at most stay equal. Therefore the solution is trivial: get the single group with the highest value and be done with it. (I'm aware that's probably not what you want, but then you might have to add another condition, e.g. all elements have to be used..?)
A beginning of a brute force approach:
import operator
from functools import reduce  # reduce is not a builtin in Python 3

segment_scores = {('A', 'B', 'C', 'D'): .99, ('A', 'B', 'C', 'E'): .77} #...

def isvalid(segments):
    """returns True if there are no duplicates

    for i in range(len(segments)-1):
        for element in segments[i]:
            for j in range(len(segments)-i-1):
                othersegment = segments[j+i+1]
                if element in othersegment:
                    return False
    return True

    better way:
    """
    flattened = [item for sublist in segments for item in sublist]
    # http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
    return len(set(flattened)) == len(flattened)

def getscore(segments):
    """
    p = 1.0
    for segment in segments:
        p *= segment_scores[segment]
    return p

    better way:
    """
    return reduce(operator.mul, [segment_scores[segment] for segment in segments])
Now, create all 2^(num segments) possible combinations of segments, check whether each one is valid, and if it is, compute its score while keeping track of the current winner and its high score. Just a starting point...
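A minimal sketch of that enumeration, using the isvalid and getscore helpers above (it does not force every item to be covered, and assumes segment_scores holds the full dict):

from itertools import combinations

def best_combination(segment_scores):
    segments = list(segment_scores)
    best, best_score = None, 0.0
    # enumerate every subset of segments, from single segments up to all of them
    for size in range(1, len(segments) + 1):
        for candidate in combinations(segments, size):
            if isvalid(candidate):
                score = getscore(candidate)
                if score > best_score:
                    best, best_score = candidate, score
    return best, best_score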
OK just another update: There's lots of space for optimizations here, in particular since you're multiplying (I'm assuming now you have to use each element).
Since your total score never increases, you can drop any exploration path [segment0, segment1] whose score drops below the current high score, because it will only get worse for any segment2.
If you don't just iterate over all possibilities, but instead start by exploring all segment lists that contain the first segment (by recursively exploring all segment lists that also contain the second segment, and so on), you can break as soon as, say, the first and second segments conflict, i.e. there is no need to explore all possibilities of grouping (A,B,C,D) and (A,B,C,D,E).
Since multiplying hurts, trying to minimize the number of segments might be a suitable heuristic, so start with big segments with high scores.
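A sketch of the pruning idea, grafted onto the recursive best_score_for from the earlier answer (an assumption here: all scores are between 0 and 1, so a partial product can never grow):

def best_score_pruned(items, segments, subtotal=1.0, best_so_far=0.0):
    if not items:
        return subtotal
    # prune: no segments left, or this branch can no longer beat the best found so far
    if not segments or subtotal <= best_so_far:
        return 0.0
    segment, score = segments[0]
    best = best_score_pruned(items, segments[1:], subtotal, best_so_far)
    if items.issuperset(segment):
        best = max(best, best_score_pruned(items.difference(segment), segments[1:],
                                           subtotal * score, max(best, best_so_far)))
    return best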
First, I'd suggest assigning a unique symbol to the segments that make sense.
Then you probably want combinations of those symbols (or perhaps permutations, I'm sure you know your problem better than I do), along with a "legal_segment_combination" function you'd use to throw out bad possibilities - based on a matrix of which ones conflict and which don't.
>>> import itertools
>>> itertools.combinations([1,2,3,4], 2)
<itertools.combinations object at 0x7fbac9c709f0>
>>> list(itertools.combinations([1,2,3,4], 2))
[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
>>>
Then max the valid possibilities that make it past legal_segment_combination().
First, you could take the logarithm of each score; then the problem is to maximize the sum of the scores instead of the product. Then you can solve the problem as an assignment problem, where each data point is assigned to one sequence.
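A minimal sketch of the log transform, using the segment_scores dict from above (assuming all scores are strictly positive; the combination that maximizes the sum of logs is the same one that maximizes the product):

import math

log_scores = {segment: math.log(score) for segment, score in segment_scores.items()}
# maximize sum(log_scores[s] for s in combo) instead of multiplying the raw scores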
I was told to make a program that solves a simple Python exercise and runs in under 12000 ms. I managed to get a working piece of code; however, it is only fast enough for small values of the n parameter.
def function(n):
    res = [(a, b) for a in range(1, n+1) for b in range(1, n+1)
           if a*b == sum([i for i in range(1, n+1) if i != a and i != b])]
    return res
Is there any way to optimise the code so that it runs under 12000ms for large numbers of n (e.g. n=100000)?
Exercise:
A friend of mine takes the sequence of all numbers from 1 to n (where n > 0).
Within that sequence, he chooses two numbers, a and b.
He says that the product of a and b should be equal to the sum of all numbers in the sequence, excluding a and b.
Given a number n, could you tell me the numbers he excluded from the sequence?
The function takes the parameter: n (n is always strictly greater than 0) and returns an array or a string (depending on the language) of the form:
[(a, b), ...] with all (a, b) which are the possible removed numbers in the sequence 1 to n.
[(a, b), ...] will be sorted in increasing order of the "a".
It happens that there are several possible (a, b). The function returns an empty array (or an empty string) if no possible numbers are found which will prove that my friend has not told the truth! (Go: in this case return nil).
E.g. function(26) should return [(15, 21), (21, 15)]
sum([i for i in range(1, n+1) if i!=a and i!=b])
is pretty easily optimized out. Just put:
basesum = sum(range(1, n+1))
outside the listcomp, then change the test to:
if a*b == basesum - sum({a, b}) # Accounts for possibility of a == b by deduping
or if a==b is not supposed to be allowed, the even simpler:
if a*b == basesum - a - b
That instantly reduces the per element work from O(n) to O(1), which should cut overall work from O(n**3) to O(n**2).
There's other optimizations available, but that's an easy one with a huge impact on big-O runtime.
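A sketch of the rewritten function with both the hoisted basesum and the simpler test applied (assuming a == b is not allowed):

def function(n):
    basesum = sum(range(1, n + 1))  # computed once, O(n)
    return [(a, b)
            for a in range(1, n + 1)
            for b in range(1, n + 1)
            if a * b == basesum - a - b]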
If I'm reading the prompt correctly, your a and b are order-insensitive. So if your results can just show (a, b) and not (b, a) as well, you can replace:
for a in range(1, n+1) for b in range(1, n+1)
with:
for a, b in itertools.combinations(range(1, n+1), 2)
or if a == b is allowed:
for a, b in itertools.combinations_with_replacement(range(1, n+1), 2)
which halves the amount of work to do "for free" (and does more of it at the C layer instead of the Python bytecode layer, which often speeds things up a little more). If you must get the results in both orders, you can post-process to produce the reverse of each non-duplicated pair as well. Or be a lazy programmer and use for a, b in itertools.permutations(range(1, n+1), 2) or for a, b in itertools.product(range(1, n+1), repeat=2) instead of combinations or combinations_with_replacement respectively; that does most or all of the work of your original nested loop, but shoves more of it to the C layer, so the same theoretical work runs a little faster in practice.
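A sketch combining the basesum idea with combinations, plus a post-processing step that adds each reversed pair back, since the expected output lists both orders:

from itertools import combinations

def function(n):
    basesum = sum(range(1, n + 1))
    half = [(a, b) for a, b in combinations(range(1, n + 1), 2)
            if a * b == basesum - a - b]
    # each hit (a, b) with a < b implies the mirrored pair (b, a) as well
    return sorted(half + [(b, a) for a, b in half])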
This is more of a math problem than anything else:
Isolate b:
a*b = sum - (a+b)
a*b + b = sum - a
(a+1)*b = sum - a
b = (sum - a)/(a+1)
Now you can substitute b where needed. With b out of the way, you don't have to iterate over the list for each element in it. You can iterate over the list just once, applying the equation for each element.
In fact, you don't even have to go through the whole list. Verifying its first sqrt(sum) elements is enough, since in any valid pair the smaller of the two numbers is at most sqrt(sum); anything bigger has to be paired with a number smaller than that.
Here is the code:
import math

n = 26
valid = []
sum_n = (n + 1) * n // 2
limit = int(math.sqrt(sum_n) - 0.5)
for a in range(1, limit + 1):
    if (sum_n - a) % (a + 1) == 0:
        b = (sum_n - a) // (a + 1)
        if b <= n:  # b must also come from the sequence 1..n
            valid.append((a, b))
if valid:
    if valid[-1][0] == valid[-1][1]:
        valid += [(x, y) for y, x in reversed(valid[:-1])]
    else:
        valid += [(x, y) for y, x in reversed(valid)]
print(valid)
And the output:
[(15, 21), (21, 15)]
I am having difficulty with a python project involving iterating over several lists at the same time. I am analyzing data collected from a serial device and iterating over it to make derivative calculations, find peak values, write raw data and results to a csv file, and more. I am not brand new to python or programming in general, but new enough that I may not see easy solutions immediately, so please bear with me.
Here is a portion of my code for context:
def processData():
    x = elapTime
    y = adcData
    dx = []
    dy = []
    peakTime = []
    peakData = []
    cutOff = len(adcData) - 6

    for i, (n, m) in enumerate(zip(x, y)):
        dx.append(x[i+5] - x[i])
        dy.append(y[i+5] - y[i])
        if i == cutOff:
            break

    dx = np.asarray(dx)
    dy = np.asarray(dy)
    dydx = dy/dx
    der = dydx.tolist()

    oneZero = dydx
    oneZero = np.where(oneZero >= 0, 1, oneZero)
    oneZero = np.where(oneZero < 0, 0, oneZero)
    oneZero = oneZero.tolist()

    definedPeak = [1,1,1,1,1,1,1,1,1,1,0]

    for j, (a, b, c, d) in enumerate(itertools.zip_longest(x, y, der, oneZero)):
        if c > 3:
            chunk = oneZero[j:j+11]
            if chunk == definedPeak:
                peakTime.append(x[j+9])
                peakData.append(y[j+9])

    with open(fileName, 'w') as testfile:
        wr = csv.writer(testfile, quoting=csv.QUOTE_ALL)
        wr.writerows(itertools.zip_longest(elapTime,
                                           adcData,
                                           der,
                                           oneZero,
                                           peakTime,
                                           peakData))
And here is the part giving me trouble:
for j, (a, b, c, d) in enumerate(itertools.zip_longest(x, y, der, oneZero)):
    if c > 3:
        chunk = oneZero[j:j+11]
        if chunk == definedPeak:
            peakTime.append(x[j+9])
            peakData.append(y[j+9])
When I run this code on my Raspberry Pi 4 I get this error on the if statement asking if 'c' is greater than 3:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
My expectation for this entire block of code is that it returns the raw time and data I collected from my serial device, the derivative values I calculated, the 'oneZero' list where each of the derivative values was changed to a one or a zero, and the specific locations of peaks in the data. I have defined what a peak should look like, and as long as the rolling window I created over the oneZero list matches my predefined peak list, I save the time and data values in their respective lists.
This code was all stable and functional until I added this line:
if c > 3:
I used a print statement within the corresponding for loop to see what could be causing the error and saw that the elapTime and adcData lists (which existed before I called the processData() function) had at some arbitrary point lost their recorded values, which were all replaced by None.
Even though removing the if statement returns my program to a stable and functional state, I need a condition like this since it will filter out any noise in the derivatives I am calculating. I appreciate your time and patience with me if I have missed something elementary.
EDIT:
The lists I am using have up to 2000+ entries, and the shortest list I iterate through is only about 5 entries shorter than the longest one. Until I added the if statement above, all the data I collected was present, but after adding that line I only get a few hundred entries for the same amount of time running my serial device.
As juanpa.arrivillaga mentioned in the comments, at least one of the iterables x, y, oneZero is longer than der, and zip_longest will fill the shorter iterables with None values to match the length of the longest iterable.
For example
list(zip_longest([1, 2, 3, 4], ['a', 'b']))
will return
[(1, 'a'), (2, 'b'), (3, None), (4, None)]
Built-in zip, on the other hand, cuts off the longer iterables:
list(zip([1, 2, 3, 4], ['a', 'b']))
will return
[(1, 'a'), (2, 'b')]
zip will also work for multiple iterables, just like zip_longest.
If you want to skip the c > 3 check for None values you can simply extend it like this:
if c is not None and c > 3:
But as it is, your program will not do anything once the end of der is reached; you will need to define a behaviour for the case where c is None. If you want to handle c is None the same as c > 3, you should use
if c is None or c > 3:
If you want the opposite you should just switch zip_longest to zip.
I have a certain number of sets, each containing a variable number of unique numbers: unique in the set they belong to and not found in the others.
I'd like to make an algorithm, implemented preferably in Python (but it can be any other language), that finds one combination of numbers, one from each of these sets, that sums to a specified number, knowing, if this helps, that the same set can appear multiple times and an element from a set can then be reused.
Practical example: let's say I have the following sets:
A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
I want to obtain a number combination with a method find_subset_combination(expected_sum, subset_list)
>>> find_subset_combination(41, [A, B, B, C, B])
[1, 8, 10, 12, 10]
A solution to this problem has already been proposed here; however, it is rather a brute-force approach. As the number of sets and their sizes will be much larger in my case, I'd like an algorithm that works with the fewest iterations possible.
What approach would you suggest?
First, let's solve this for just two sets. This is known as the 'two sum' problem. You have two values a and b that add to l. Since a + b = l, we know that l - a = b. This is important, as we can determine whether l - a is in b in O(1) time, rather than looping through b to find it in O(b) time. This means we can solve the two sum problem in O(a) time.
Note: For brevity the provided code only produces one solution. However changing two_sum to a generator function can return them all.
def two_sum(l, a, b):
    for i in a:
        if l - i in b:
            return i, l - i
    raise ValueError('No solution found')
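For example (a sketch using the question's sets; whichever pair summing to the target is found first gets returned):

a = {1, 3, 6, 7, 15}
b = {2, 8, 10}
print(two_sum(9, a, b))  # e.g. (1, 8)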
Next we can solve the 'four sum' problem. This time we have four sets c, d, e and f. By combining c and d into a, and e and f into b we can use two_sum to solve the problem in O(cd + ef) space and time. To combine the sets we can just use a cartesian product, adding the results together.
Note: To get all results perform a cartesian product on all resulting a[i] and b[j].
import itertools
def combine(*sets):
    result = {}
    for keys in itertools.product(*sets):
        result.setdefault(sum(keys), []).append(keys)
    return result
def four_sum(l, c, d, e, f):
    a = combine(c, d)
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
It should be apparent that the 'three sum' problem is just a simplified version of the 'four sum' problem. The difference is that we're given a at the start rather than being asked to calculate it. This runs in O(a + ef) time and O(ef) space.
def three_sum(l, a, e, f):
    b = combine(e, f)
    i, j = two_sum(l, a, b)
    return (i, *b[j][0])
Now we have enough information to solve the 'six sum' problem. The question comes down to how do we divide all these sets?
If we decide to pair them together then we can use the 'three sum' solution to get what we want. But this may not run in the best time, as it runs in O(ab + cdef), or O(n^4) time if they're all the same size.
If we decide to put them in trios then we can use the 'two sum' to get what we want. This runs in O(abc + def), or O(n^3) if they're all the same size.
At this point we should have all the information to make a generic version that runs in O(n^⌈s/2⌉) time and space, where s is the number of sets passed into the function.
def n_sum(l, *sets):
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    b = combine(*sets[midpoint:])
    i, j = two_sum(l, a, b)
    return (*a[i][0], *b[j][0])
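For example, with the sets from the question (set iteration order is arbitrary, so any valid combination summing to 41 may come back, not necessarily the one shown in the question):

A = {1, 3, 6, 7, 15}
B = {2, 8, 10}
C = {4, 5, 9, 11, 12}
print(n_sum(41, A, B, B, C, B))  # e.g. (1, 8, 10, 12, 10)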
You can further optimize the code. The sizes of the two sides of the two sum matter quite a lot.
To exemplify this, imagine 4 sets of 1 number on one side and 4 sets of 1000 numbers on the other. This will run in O(1^4 + 1000^4) time, which is obviously really bad. Instead you can balance both sides of the two sum to make it much smaller. By having 2 sets of 1 number and 2 sets of 1000 numbers on each side of the equation, the performance improves: O(1^2×1000^2 + 1^2×1000^2), or simply O(1000^2), which is far smaller than O(1000^4).
Expanding on the previous point: if you have 3 sets of 1000 numbers and 3 sets of 10 numbers, then the best split is to put two of the 1000s on one side and everything else on the other:
Two 1000s on one side, the rest on the other: 1000^2 + 10^3×1000 = 2_000_000
Interlaced, sorted, and the same size on either side, i.e. (10, 1000, 10) and (1000, 10, 1000): 10^2×1000 + 10×1000^2 = 10_100_000
Additionally, if each set is provided an even number of times, you can cut the running time in half by calling combine only once. For example, if the input is n_sum(l, a, b, c, a, b, c) (without the above optimizations), it should be apparent that the second call to combine is only a waste of time and space.
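A sketch of that shortcut, under the assumption that the first half of the sets is exactly the same as the second half:

def n_sum_symmetric(l, *sets):
    # both halves are identical, so one combine() table serves both sides
    midpoint = len(sets) // 2
    a = combine(*sets[:midpoint])
    i, j = two_sum(l, a, a)
    return (*a[i][0], *a[j][0])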
Say I have a range(1, n + 1). I want to get m unique pairs.
What I found is that if the number of pairs is close to n(n-1)/2 (the maximum number of pairs), one can't simply generate random pairs every time, because they will start colliding with each other. I'm looking for a somewhat lazy solution that will be very efficient (in Python's world).
My attempt so far:
import random

def get_input(n, m):
    res = str(n) + "\n" + str(m) + "\n"
    buffet = range(1, n + 1)
    points = set()
    while len(points) < m:
        x, y = random.sample(buffet, 2)
        points.add((x, y)) if x > y else points.add((y, x))  # meeh
    for (x, y) in points:
        res += "%d %d\n" % (x, y)
    return res
You can use combinations to generate all pairs and use sample to choose randomly. Admittedly only lazy in the "not much to type" sense, and not in the use a generator not a list sense :-)
from itertools import combinations
from random import sample
n = 100
sample(list(combinations(range(1,n),2)),5)
If you want to improve performance you can make it lazy by studying this
Python random sample with a generator / iterable / iterator
the generator you want to sample from is this: combinations(range(1, n), 2)
Here is an approach which works by taking a number in the range 0 to n*(n-1)/2 - 1 and decodes it to a unique pair of items in the range 0 to n-1. I used 0-based math for convenience, but you could of course add 1 to all of the returned pairs if you want:
import math
import random

def decode(i):
    k = math.floor((1 + math.sqrt(1 + 8*i)) / 2)
    return k, i - k*(k-1)//2

def rand_pair(n):
    return decode(random.randrange(n*(n-1)//2))

def rand_pairs(n, m):
    return [decode(i) for i in random.sample(range(n*(n-1)//2), m)]
For example:
>>> rand_pairs(5,8)
[(2, 1), (3, 1), (4, 2), (2, 0), (3, 2), (4, 1), (1, 0), (4, 0)]
The math is hard to easily explain, but the k in the definition of decode is obtained by solving a quadratic equation which gives the number of triangular numbers which are <= i, and where i falls in the sequence of triangular numbers tells you how to decode a unique pair from it. The interesting thing about this decode is that it doesn't use n at all but implements a one-to-one correspondence from the set of natural numbers (starting at 0) to the set of all pairs of natural numbers.
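A small sanity check of that correspondence (a sketch; for n = 5, every index below n*(n-1)//2 should decode to a distinct 0-based pair (k, j) with k > j):

n = 5
pairs = [decode(i) for i in range(n * (n - 1) // 2)]
assert len(set(pairs)) == n * (n - 1) // 2   # all pairs are distinct
assert all(0 <= j < k < n for k, j in pairs)  # and each is a valid pair below n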
I don't think anything along the lines of your approach can improve much. After all, as your m gets closer and closer to the limit n(n-1)/2, you have a thinner and thinner chance of finding an unseen pair.
I would suggest splitting into two cases: if m is small, use your random approach; but if m is large enough, try
pairs = list(itertools.combinations(buffet, 2))
points = random.sample(pairs, m)
Now you have to determine the threshold of m that decides which code path to take. You need some math here to find the right trade-off.
The Problem:
You are given an array m of size n, where each value of m is composed of a weight w, and a percentage p.
m = [m0, m1, m2, ... , mn] = [[m0w, m0p], [m1w, m1p], [m2w, m2p], ..., [mnw, mnp] ]
So we'll represent this in python as a list of lists.
We are then trying to find the minimum value of this function:
# chaima is so fuzzy how come?
def minimize_me(m):
    t = 0
    w = 1
    for i in range(len(m)):
        current = m[i]
        t += w * current[0]
        w *= current[1]
    return t
where the only thing we can change about m is its ordering (i.e., we can rearrange the elements of m in any way). Additionally, this needs to complete in better than O(n!) time.
Brute Force Solution:
import itertools
import sys

min_t = sys.maxsize  # sys.maxint no longer exists in Python 3
min_permutation = None
for permutation in itertools.permutations(m):
    t = minimize_me(list(permutation))
    if t < min_t:
        min_t = t
        min_permutation = list(permutation)
Ideas On How To Optimize:
the idea:
Instead of finding the best order, see if we can find a way to compare two given values in m, when we know the state of the problem. (The code might explain this more clearly). If I can build this using a bottom-up approach (so, starting from the end, assuming I have no optimal solution) and I can create an equation that can compare two values in m and say one is definitively better than the other, then I can construct an optimal solution, by using that new value, and comparing the next set of values of m.
the code:
import itertools
import sys

def compare_m(a, b, v):
    a_first = b[0] + b[1] * (a[0] + a[1] * v)
    b_first = a[0] + a[1] * (b[0] + b[1] * v)
    if a_first > b_first:
        return a, a_first
    else:
        return b, b_first

best_ordering = []
v = 0
while len(m) > 1:
    best_pair_t = sys.maxsize
    best_m = None
    for pair in itertools.combinations(m, 2):
        candidate, pair_t = compare_m(pair[0], pair[1], v)  # don't shadow the list m
        if pair_t < best_pair_t:
            best_pair_t = pair_t
            best_m = candidate
    best_ordering.append(best_m)
    m.remove(best_m)
    v = best_m[0] + best_m[1] * v

first = m[0]
best_ordering.append(first)
However, this is not working as intended. The first value is always right, and roughly 60-75% of the time the entire solution is optimal. In some cases, though, it looks like the way I am changing the value v, which then gets passed back into my compare, makes it evaluate much higher than it should. Here's the script I'm using to test against:
import random

m = []
for i in range(0, 5):
    w = random.randint(1, 1023)
    p = random.uniform(0.01, 0.99)
    m.append([w, p])
Here's a particular test case demonstrating the error:
m = [[493, 0.7181996086105675], [971, 0.19915848527349228], [736, 0.5184210526315789], [591, 0.5904761904761905], [467, 0.6161290322580645]]
optimal solution (just the indices) = [1, 4, 3, 2, 0]
my solution (just the indices) = [4, 3, 1, 2, 0]
It feels very close, but I cannot for the life of me figure out what is wrong. Am I looking at this the wrong way? Does this seem like it's on the right track? Any help or feedback would be greatly appreciated!
We don't need any information about the current state of the algorithm to decide which elements of m are better. We can just sort the values using the following key:
def key(x):
    w, p = x
    return w / (1 - p)

m.sort(key=key)
This requires explanation.
Suppose (w1, p1) is directly before (w2, p2) in the array. Then after processing these two items, t will be increased by an increment of w * (w1 + p1*w2) and w will be multiplied by a factor of p1*p2. If we switch the order of these items, t will be increased by an increment of w * (w2 + p2*w1) and w will be multiplied by a factor of p1*p2. Clearly, we should perform the switch if (w1 + p1*w2) > (w2 + p2*w1), or equivalently after a little algebra, if w1/(1-p1) > w2/(1-p2). If w1/(1-p1) <= w2/(1-p2), we can say that these two elements of m are "correctly" ordered.
In the optimal ordering of m, there will be no pair of adjacent items worth switching; for any adjacent pair of (w1, p1) and (w2, p2), we will have w1/(1-p1) <= w2/(1-p2). Since the relation of having w1/(1-p1) <= w2/(1-p2) is the natural total ordering on the w/(1-p) values, the fact that w1/(1-p1) <= w2/(1-p2) holds for any pair of adjacent items means that the list is sorted by the w/(1-p) values.
Your attempted solution fails because it only considers what a pair of elements would do to the value of the tail of the array. It doesn't consider the fact that rather than using a low-p element now, to minimize the value of the tail, it might be better to save it for later, so you can apply that multiplier to more elements of m.
Note that the proof of our algorithm's validity relies on all p values being at least 0 and strictly less than 1. If p is 1, we can't divide by 1-p, and if p is greater than 1, dividing by 1-p reverses the direction of the inequality. These problems can be resolved using a comparator or a more sophisticated sort key. If p is less than 0, then w can switch sign, which reverses the logic of what items should be switched. Then we do need to know about the current state of the algorithm to decide which elements are better, and I'm not sure what to do then.
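As a quick sanity check (a sketch, not part of the original answer), sorting the failing test case from the question by this key reproduces the stated optimal ordering:

m = [[493, 0.7181996086105675], [971, 0.19915848527349228],
     [736, 0.5184210526315789], [591, 0.5904761904761905],
     [467, 0.6161290322580645]]
order = sorted(range(len(m)), key=lambda i: m[i][0] / (1 - m[i][1]))
print(order)  # [1, 4, 3, 2, 0], matching the optimal indices given above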