I have a (large) list of lists of integers, e.g.,
a = [
[1, 2],
[3, 6],
[2, 1],
[3, 5],
[3, 6]
]
Most of the pairs will appear twice, where the order of the integers doesn't matter (i.e., [1, 2] is equivalent to [2, 1]). I'd now like to find the pairs that appear only once, and get a Boolean list indicating that. For the above example,
b = [False, False, False, True, False]
Since a is typically large, I'd like to avoid explicit loops. Mapping to frozensets may be advised, but I'm not sure if that's overkill.
ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]
We can use Counter to get counts of each list (turn list to frozenset to ignore order) and then for each list check if it only appears once.
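Putting that together on the example data from the question, as a quick sanity check:
from collections import Counter

a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]

# Count order-insensitive pairs by mapping each pair to a frozenset.
ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]
print(b)  # [False, False, False, True, False]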
Here's a solution with NumPy that is 10 times faster than the suggested frozenset solution:
a = numpy.array(a)
a.sort(axis=1)
b = numpy.ascontiguousarray(a).view(
numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
)
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
print(ct[inv] == 1)
Sorting is fast and makes sure that the pairs [i, j] and [j, i] in the original array identify with each other. This is much faster than frozensets or tuples.
Row uniquification inspired by https://stackoverflow.com/a/16973510/353337.
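For what it's worth, on newer NumPy versions (1.13 or later, an assumption about your environment) the void-view trick can be replaced by numpy.unique with axis=0, which reads more directly:
import numpy

a = numpy.array([[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]])
a.sort(axis=1)  # make [i, j] and [j, i] identical row-wise
_, inv, ct = numpy.unique(a, axis=0, return_inverse=True, return_counts=True)
# Some NumPy versions return the inverse indices as a 2-D array here; ravel to be safe.
print((ct[inv] == 1).ravel())  # [False False False  True False]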
Speed comparison for different array sizes:
The plot was created with
from collections import Counter
import numpy
import perfplot
def fs(a):
ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]
return b
def with_numpy(a):
a = numpy.array(a)
a.sort(axis=1)
b = numpy.ascontiguousarray(a).view(
numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
)
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
res = ct[inv] == 1
return res
perfplot.save(
"out.png",
setup=lambda n: numpy.random.randint(0, 10, size=(n, 2)),
kernels=[fs, with_numpy],
labels=["frozenset", "numpy"],
n_range=[2 ** k for k in range(15)],
xlabel="len(a)",
)
You could scan the list from start to end, while maintaining a map of encountered pairs to their first position. Whenever you process a pair, you check whether you've encountered it before. If so, both the first encounter's index in b and the current encounter's index must be set to False. Otherwise, we just add the current index to the map of encountered pairs and leave b unchanged. b starts out all True. To treat [1, 2] and [2, 1] as equivalent, I'd first simply sort the pair to obtain a stable representation. The code would look something like this:
def proc(a):
b = [True] * len(a) # Better way to allocate this
filter = {}
idx = 0
for p in a:
m = min(p)
M = max(p)
pp = (m, M)
if pp in filter:
# We've found the element once previously
# Need to mark both it and the current value as "False"
# If we encounter pp multiple times, we'll set the initial
# value to False multiple times, but that's not an issue
b[filter[pp]] = False
b[idx] = False
else:
# This is the first time we encounter pp, so we just add it
# to the filter for possible later encounters, but don't affect
# b at all.
filter[pp] = idx
        idx += 1
return b
The time complexity is O(len(a)) which is good, but the space complexity is also O(len(a)) (for filter), so this might not be so great. Depending on how flexible you are, you can use an approximate filter such as a Bloom filter.
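For illustration only, here is a rough sketch of what such an approximate filter could look like in pure Python (this class and its sizes are made up for the example, not a library API; a real Bloom filter would size the bit array and hash count from the expected number of pairs and the acceptable false-positive rate):
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests of the item.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(("%d:%r" % (seed, item)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
pair = tuple(sorted([2, 1]))
print(pair in bf)  # False: definitely not seen yet
bf.add(pair)
print(pair in bf)  # True: possibly seen (small chance of a false positive)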
# -*- coding: utf-8 -*-
a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
result = filter(lambda el:(a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1),a)
bool_res = [ (a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1) for el in a]
print result
print bool_res
which gives:
[[3, 5]]
[False, False, False, True, False]
Use a dictionary for an O(n) solution.
a = [ [1, 2], [3, 6], [2, 1], [3, 5], [3, 6] ]
dict = {}
boolList = []
# Iterate through a
for i in range (len(a)):
# Assume that this element is not a duplicate
# This 'True' is added to the corresponding index i of boolList
boolList += [True]
# Set elem to the current pair in the list
elem = a[i]
# If elem is in ascending order, it will be entered into the map as is
if elem[0] <= elem[1]:
key = repr(elem)
# If not, change it into ascending order so keys can easily be compared
else:
key = repr( [ elem[1] ] + [ elem[0] ])
# If this pair has not yet been seen, add it as a key to the dictionary
# with the value a list containing its index in a.
if key not in dict:
dict[key] = [i]
    # If this pair is a duplicate, add the new index to the dict. The value
    # of the key will contain a list containing the indices of that pair in a.
else:
# Change the value to contain the new index
dict[key] += [i]
        # Change boolList to False for this index
boolList[i] = False
# If this is the first duplicate for the pair, make the first
# occurrence of the pair into a duplicate as well.
if len(dict[key]) <= 2:
boolList[ dict[key][0] ] = False
print a
print boolList
I was hoping I could receive some help on thinking through a python problem. I have a General Ledger of Data, and would like to delete, or turn to zero, any accrual. All this means is, I want to find one number in a column, and search for it in another column. If i find a match, I want to turn both numbers (the iterable number, and the found number) to zero.
I know I need to use some form of an iterable like the following:
for x in df[column 1]:
if x is in df[column 2]:
x == 0
df[column 2 [index?]] == 0
else:
continue
Could someone assist me in writing the correct code to accomplish this? My goal is essentially to iterate through two columns, find where two values match, and turn those values to 0. Thank you.
You'll want to use enumerate to get the index in order to set the value in the list to zero:
for i, x in enumerate(df[col1]):
    matches = [j for j, y in enumerate(df[col2].isin([x])) if y]
if len(matches) == 0: continue
df[col1][i] = 0
k = matches[0]
df[col2][k] = 0
If an element in df[col1] appears in df[col2] multiple times, this will only set the first occurrence to 0.
If you want to remove all occurrences, you could use this code:
for i, x in enumerate(df[col1]):
    matches = [j for j, y in enumerate(df[col2].isin([x])) if y]
if len(matches) == 0: continue
df[col1][i] = 0
for k in matches:
df[col2][k] = 0
import numpy as np
idx = np.flatnonzero(df['col1'] == df['col2'])
df['col1'][idx] = 0
df['col2'][idx] = 0
Instead of exact equality you can test for element-wise approximate equality using numpy.isclose():
idx = np.flatnonzero(np.isclose(df['col1'], df['col2']))
You can do this with any:
list = ["one", "two", "three"]
foo = ["bannana", "apple", "two", "blue"]
o = []
for x in foo:
if any(n in x for n in list):
o.append(0)
else:
o.append(x)
print(o)
which iterates through foo and appends zero to the output list if a match is found in list, otherwise appends the original value:
['bannana', 'apple', 0, 'blue']
Update:
Actually, this answers your question better than my previous - using list comprehension:
foo = [[1, 2, 3, 5], [9, 7, 2, 4]]
for x in foo[1]:
if x in foo[0]:
foo[0] = [0 if i == x else i for i in foo[0]]
foo[1] = [0 if i == x else i for i in foo[1]]
print(foo)
The foo[0] is the list with the items to be checked for and foo[1] are the items to be checked against. This outputs [[1, 0, 3, 5], [9, 7, 0, 4]].
I think you can use isin
df[column 1] = df[column 1] * (~df[column 1].isin(df[column 2]))
df[column 2] = df[column 2] * (~df[column 2].isin(df[column 1]))
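A quick sanity check of that idea on a made-up DataFrame (the column names are placeholders); computing both masks before overwriting either column keeps the second lookup from seeing the zeros written by the first:
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30, 40], "col2": [5, 30, 10, 7]})

mask1 = ~df["col1"].isin(df["col2"])  # True where col1 has no match in col2
mask2 = ~df["col2"].isin(df["col1"])  # True where col2 has no match in col1

df["col1"] = df["col1"] * mask1
df["col2"] = df["col2"] * mask2
print(df)
#    col1  col2
# 0     0     5
# 1    20     0
# 2     0     0
# 3    40     7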
So, I'm working on a python script that will take a list of integers (S), and output a list of lists based off of the integers in S, and the sum of each list must be the same value.
I'm having a problem with appending values that are the same. Python seems to be aggregating them as the same value, when I want it to create another entry.
I've tried using .extend with the same results. Also, I've read up and seen posts about multipling by a constant to create multiple values. The problem here is that I don't know how many times I will be adding the element. Is there an easy solution to this? Sorry if this has been answered before, but I can't find it.
import itertools
def arrangeBoxes(stacks, arr):
perms = itertools.permutations(arr)
total = sum(arr)
stackSize = total / stacks
if (not(stackSize.is_integer())):
return [False, []]
for i in perms:
tempSum = 0
tempArr = []
stackArr = []
built = False
for j in i:
tempArr.append(j)
if (sum(tempArr) == stackSize):
stackArr.append(tempArr)
tempArr = []
if (j == i[len(i) - 1]):
built = True
break
else:
if (j == i[len(i) - 1]):
break
if (built):
return [True, stackArr]
return [False, []]
# Doesn't Work.
# Output: [True, [[3]]]
# Should be: [True, [[3], [3], [3]]]
print(str(arrangeBoxes(3, [3, 3, 3])))
# Works fine.
# Output: [True, [[2, 1], [2, 1], [3]]]
print(str(arrangeBoxes(3, [2, 1, 2, 1, 3])))
you're explicitly breaking and returning when
j == i[len(i) - 1]
In the first iteration (of (3, 3, 3)), j is obviously 3, and any cell in i is obviously 3 - this will be immediately True and break, then return.
Iterate over indices, not over the elements, if you want to check whether you're at the last index:
for perm in perms:
...
for j in range(len(perm)):
...
if j == len(perm) - 1:
return [True, stackArr]
No need for else or break.
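Put together, one way the corrected function could look (a sketch based on the fix above, assuming Python 3; not necessarily the only way to restructure it):
import itertools

def arrangeBoxes(stacks, arr):
    perms = itertools.permutations(arr)
    total = sum(arr)
    stackSize = total / stacks
    if not stackSize.is_integer():
        return [False, []]
    for perm in perms:
        tempArr = []
        stackArr = []
        for j in range(len(perm)):
            tempArr.append(perm[j])
            if sum(tempArr) == stackSize:
                stackArr.append(tempArr)
                tempArr = []
                if j == len(perm) - 1:   # used every element: this permutation works
                    return [True, stackArr]
            elif j == len(perm) - 1:     # leftovers that don't complete a stack
                break
    return [False, []]

print(arrangeBoxes(3, [3, 3, 3]))        # [True, [[3], [3], [3]]]
print(arrangeBoxes(3, [2, 1, 2, 1, 3]))  # [True, [[2, 1], [2, 1], [3]]]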
I have a list that contains a random number of ints.
I would like to iterate over this list, and if a number and the successive number are within one numeric step of one another, I would like to concatenate them into a sublist.
For example:
input = [1,2,4,6,7,8,10,11]
output = [[1,2],[4],[6,7,8],[10,11]]
The input list will always contain positive ints sorted in increasing order.
I tried some of the code from here.
initerator = iter(inputList)
outputList = [c + next(initerator, "") for c in initerator]
Although I can concat every two entries in the list, I cannot seem to add a suitable if in the list comprehension.
Python version = 3.4
Unless you have to have a one-liner, you could use a simple generator function, combining elements until you hit a non consecutive element:
def consec(lst):
it = iter(lst)
prev = next(it)
tmp = [prev]
for ele in it:
if prev + 1 != ele:
yield tmp
tmp = [ele]
else:
tmp.append(ele)
prev = ele
yield tmp
Output:
In [2]: lst = [1, 2, 4, 6, 7, 8, 10, 11]
In [3]: list(consec(lst))
Out[3]: [[1, 2], [4], [6, 7, 8], [10, 11]]
A nice way: find the "splitting" indices and then slice:
input = [1,2,4,6,7,8,10,11]
idx = [0] + [i+1 for i,(x,y) in enumerate(zip(input,input[1:])) if x+1!=y] + [len(input)]
[ input[u:v] for u,v in zip(idx, idx[1:]) ]
#output:
[[1, 2], [4], [6, 7, 8], [10, 11]]
using enumerate() and zip().
Simplest version I have without any imports:
def mergeAdjNum(l):
r = [[l[0]]]
for e in l[1:]:
if r[-1][-1] == e - 1:
r[-1].append(e)
else:
r.append([e])
return r
About 33% faster than the one-liners.
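For example, with the input from the question:
>>> mergeAdjNum([1, 2, 4, 6, 7, 8, 10, 11])
[[1, 2], [4], [6, 7, 8], [10, 11]]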
This one handles the character prefix grouping mentioned in a comment:
import re

def groupPrefStr(l):
pattern = re.compile(r'([a-z]+)([0-9]+)')
r = [[l[0]]]
pp, vv = re.match(pattern, l[0]).groups()
vv = int(vv)
for e in l[1:]:
p,v = re.match(pattern, e).groups()
v = int(v)
        if p == pp and v == vv + 1:
            r[-1].append(e)
            vv = v  # keep the running value current so runs longer than two keep grouping
else:
pp, vv = p, v
r.append([e])
return r
This is way slower than the numbers-only one. Knowing the exact format of the prefix (only one char?) could help avoid using the re module and speed things up.
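A quick usage example of the prefixed variant (the input strings here are made up; it assumes lowercase-letter prefixes, as the regex expects):
>>> groupPrefStr(['a1', 'a2', 'a3', 'a5', 'b1', 'b2'])
[['a1', 'a2', 'a3'], ['a5'], ['b1', 'b2']]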
Suppose I have the following list of lists:
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
I want to have the average of each n-th element in the arrays. However, when wanting to do this in a simple way, Python generated out-of-bounds errors because of the different lengths. I solved this by giving each array the length of the longest array, and filling the missing values with None.
Unfortunately, doing this made it impossible to compute an average, so I converted the arrays into masked arrays. The code shown below works, but it seems rather cumbersome.
import numpy as np
import numpy.ma as ma
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
# Determine the length of the longest list
lenlist = []
for i in a:
lenlist.append(len(i))
max = np.amax(lenlist)
# Fill each list up with None's until required length is reached
for i in a:
if len(i) <= max:
for j in range(max - len(i)):
i.append(None)
# Fill temp_array up with the n-th element
# and add it to temp_array
temp_list = []
masked_arrays = []
for j in range(max):
for i in range(len(a)):
temp_list.append(a[i][j])
masked_arrays.append(ma.masked_values(temp_list, None))
del temp_list[:]
# Compute the average of each array
avg_array = []
for i in masked_arrays:
avg_array.append(np.ma.average(i))
print avg_array
Is there a way to do this more quickly? The final list of lists will contain 600000 'rows' and up to 100 'columns', so efficiency is quite important :-).
itertools.izip_longest (zip_longest in Python 3) would do all the padding with None's for you, so your code can be reduced to:
import numpy as np
import numpy.ma as ma
from itertools import izip_longest
a = [ [1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6] ]
averages = [np.ma.average(ma.masked_values(temp_list, None)) for temp_list in izip_longest(*a)]
print(averages)
[2.0, 3.0, 4.0, 6.0]
I have no idea what the fastest way is as far as the numpy logic is concerned, but this is definitely going to be a lot more efficient than your own code.
If you wanted a faster pure python solution:
from itertools import izip_longest, imap
a = [[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]]
def avg(x):
x = filter(None, x)
return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
print(list(filt))
[2.0, 3.0, 4.0, 6.0]
If you have 0's in the arrays that won't work, as 0 will be treated as falsey; you will have to use a list comp to filter in that case, but it will still be faster:
def avg(x):
x = [i for i in x if i is not None]
return sum(x, 0.0) / len(x)
filt = imap(avg, izip_longest(*a))
Here's an almost* fully vectorized solution based on np.bincount and np.cumsum -
# Store lengths of each list and their cumulative and entire summations
lens = np.array([len(i) for i in a]) # Only loop to get lengths
C = lens.cumsum()
N = lens.sum()
# Create ID array such that the first element of each list is 0,
# the second element as 1 and so on. This is needed in such a format
# for use with bincount later on.
shifts_arr = np.ones(N,dtype=int)
shifts_arr[C[:-1]] = -lens[:-1]+1
id_arr = shifts_arr.cumsum()-1
# Use bincount to get the summations and thus the
# averages across all lists based on their positions.
avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
*Almost, because we are getting the lengths of the lists with a loop, but with minimal computation involved there, it shouldn't affect the total runtime much.
Sample run -
In [109]: a = [ [1, 2, 3],
...: [2, 3, 4],
...: [3, 4, 5, 6] ]
In [110]: lens = np.array([len(i) for i in a])
...: C = lens.cumsum()
...: N = lens.sum()
...:
...: shifts_arr = np.ones(N,dtype=int)
...: shifts_arr[C[:-1]] = -lens[:-1]+1
...: id_arr = shifts_arr.cumsum()-1
...:
...: avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
...:
In [111]: avg_out
Out[111]: array([ 2., 3., 4., 6.])
You can already clean your code to compute the max length: this single line does the job:
len(max(a,key=len))
Combining with other answer you will get the result like so:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(len(max(a,key=len)))]
You can also avoid the masked array and use np.nan instead:
import numpy as np
from itertools import zip_longest

def replaceNoneTypes(x):
    return tuple(np.nan if isinstance(y, type(None)) else y for y in x)

averages = [np.nanmean(replaceNoneTypes(temp_list)) for temp_list in zip_longest(*a, fillvalue=np.nan)]
On your test array:
[np.mean([x[i] for x in a if len(x) > i]) for i in range(4)]
returns
[2.0, 3.0, 4.0, 6.0]
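For completeness, a self-contained sketch of the np.nan variant; since zip_longest already pads the gaps with np.nan, replacing None values is only needed if the input lists can themselves contain None:
import numpy as np
from itertools import zip_longest

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]

# nanmean ignores the nan padding introduced by zip_longest.
averages = [float(np.nanmean(col)) for col in zip_longest(*a, fillvalue=np.nan)]
print(averages)  # [2.0, 3.0, 4.0, 6.0]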
If you are using Python version >= 3.4, then import the statistics module
from statistics import mean
If using lower versions, create a function to calculate the mean:
def mean(array):
sum = 0
if (not(type(array) == list)):
print("there is some bad format in your input")
else:
for elements in array:
try:
sum = sum + float(elements)
except:
print("non numerical entry found")
average = (sum + 0.0) / len(array)
return average
Create a list of lists, for example
myList = [[1,2,3],[4,5,6,7,8],[9,10],[11,12,13,14],[15,16,17,18,19,20,21,22],[23]]
iterate through myList
for i, lists in enumerate(myList):
print(i, mean(lists))
This will print, line by line, the index n and the average of the nth list.
To find the average of only the nth list in particular, create a function:
def mean_nth(array, n):
if((type(n) == int) and n >= 1 and type(array) == list):
return mean(myList[n-1])
else:
print("there is some bad format of your input")
Note that the index starts from zero, so for instance, if you are looking for the mean of the 5th list, it will be at index 4. This explains the n-1 in the code.
And then call the function, for example
avg_5thList = mean_nth(myList, 5)
print(avg_5thList)
Running the above code on myList yields the following result:
0 2.0
1 6.0
2 9.5
3 12.5
4 18.5
5 23.0
18.5
where the first six lines are generated by the iterative loop and display the index of each list and its average. The last line (18.5) displays the average of the 5th list, as returned by the mean_nth(myList, 5) call.
Further, for a list like yours,
a = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5, 6]
]
Let's say you want the average of the 1st elements, i.e. (1+2+3)/3 = 2, or of the 2nd elements, i.e. (2+3+4)/3 = 3, or of the 4th elements, i.e. 6/1 = 6. You will need to find the length of each list so that you can identify whether the nth element exists in a list or not. For that, you first need to arrange your list of lists in the order of the lengths of the lists.
You can either
1) first sort the main list according to size of constituent lists iteratively, and then go through the sorted list to identify if the constituent lists are of sufficient length
2) or you can iteratively look into the original list for length of constituent lists.
(I can definitely get back with working out a faster recursive algorithm if needed)
Computationally, the second one is more efficient, so, assuming that your 5th element means index 4 (0, 1, 2, 3, 4), i.e. the nth element means the element at index (n-1), let's go with that and create a function:
def find_nth_average(array, n):
if(not(type(n) == int and (int(n) >= 1))):
return "Bad input format for n"
else:
if (not(type(array) == list)):
return "Bad input format for main list"
else:
total = 0
count = 0
for i, elements in enumerate(array):
if(not(type(elements) == list)):
return("non list constituent found at location " + str(i+1))
else:
listLen = len(elements)
if(int(listLen) >= n):
try:
total = total + elements[n-1]
count = count + 1
except:
return ("non numerical entity found in constituent list " + str(i+1))
if(int(count) == 0):
return "No such n-element exists"
else:
average = float(total)/float(count)
return average
Now let's call this function on your list a:
print(find_nth_average(a, 0))
print(find_nth_average(a, 1))
print(find_nth_average(a, 2))
print(find_nth_average(a, 3))
print(find_nth_average(a, 4))
print(find_nth_average(a, 5))
print(find_nth_average(a, 'q'))
print(find_nth_average(a, 2.3))
print(find_nth_average(5, 5))
The corresponding results are:
Bad input format for n
2.0
3.0
4.0
6.0
No such n-element exists
Bad input format for n
Bad input format for n
Bad input format for main list
If you have an erratic list, like
a = [[1, 2, 3], 2, [3, 4, 5, 6]]
that contains a non-list element, you get the output:
non list constituent found at location 2
If your constituent list is erratic, like:
a = [[1, 'p', 3], [2, 3, 4], [3, 4, 5, 6]]
that contains a non-numerical entity in a list, and you find the average of the 2nd elements with print(find_nth_average(a, 2)),
you get an output:
non numerical entity found in constituent list 1
I asked the same thing yesterday but was having a hard time finding the right words to describe my problem, so I deleted it. But here it is again.
Let us say that we have 3 lists:
list1 = [1, 2]
list2 = [2, 3]
list3 = [1]
Let us say I want to find the 3 permutations of these lists which, when their numbers are added together, result in the smallest sums possible. So here, the permutations that we want would be:
1,2,1
2,2,1
1,3,1
Because the sums of the numbers in each of these permutations are the smallest possible.
2,3,1
Will not be a part of the solution since the sum is larger than the other three, thus, not a part of the three smallest.
Of course, using itertools to list all the permutations and adding up the numbers in each permutation would be the most obvious solution, but I was wondering if there is a more efficient algorithm for this, considering it should be able to handle 1000 lists. (A sketch of that brute-force baseline follows the preconditions below.)
NOTE: If the number of lists is N, then I would need to find N permutations. Thus, if there are 3 lists, I find the 3 smallest permutations.
PRECONDITIONS:
- Part of the precondition is that all of these lists are sorted.
- The number of elements in each list is 2N-1, to deal with the case where only one list has more than 1 element.
- All of the lists are sorted from smallest to largest.
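For reference, the brute-force baseline mentioned above could be sketched as follows (picking one element from each list is really a Cartesian product rather than a permutation; this enumerates every combination, so it is only feasible for small inputs):
import heapq
from itertools import product

def smallest_n_sums_bruteforce(lists, n):
    # Keep only the n cheapest of all possible one-element-per-list combinations.
    return heapq.nsmallest(n, product(*lists), key=sum)

lists = [[1, 2], [2, 3], [1]]
for combo in smallest_n_sums_bruteforce(lists, len(lists)):
    print(list(combo), sum(combo))
# [1, 2, 1] 4
# [1, 3, 1] 5
# [2, 2, 1] 5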
Since the lists are sorted, the smallest element in each list is the first one, and the sum of those gives us the "minimal sum permutation". Picking any element other than the first one is going to increase the sum value.
We start off by calculating the difference between element i and the first one for each list. For example, for the lists [1, 3, 4, 8] and [3, 9, 12, 15], these differences would be [2, 3, 7] and [6, 9, 12] respectively. We keep them separate in cost_lists, because they will be needed later on. But in cost_global, we pool them all together and by sorting them in ascending order, we find a solution where for all lists but one we choose the minimal value. To keep track which element from which list will give us the next minimum sum, we group the difference values with both the index of the list it comes from and which element in that list it is.
However, this is not a complete approach. It is possible, for example, that taking the next value from two lists incurs a smaller cost than taking the next value from one list. So, we have to search for the product of the combinations for k = 2, 3, ..., N. Doing that naively would result in N**N complexity, but we can take some really good shortcuts.
From the partial solution above, we have a list of the minimal costs in order. Since we want only the first N minimal sums, we check what the cost value of the Nth permutation is (the threshold). So, when we search for a group of two next values, we can safely ignore their sum if it exceeds our current threshold. And since the difference values within lists are in ascending order, once we cross the threshold, we can instantly exit the loop. Similarly, if we haven't found any new combinations within the threshold for k = 2, it is pointless to look for k > 2. Considering that most likely the smallest sum costs will be the result of a single nonminimal value, or a few small ones (unless most lists have massive differences between sequential values), we are bound to exit these loops rather quickly. The code I came up with to achieve this is fairly ugly, but it effectively does the same as
for k in xrange(2, len(lists)):
for comb in itertools.combinations(cost_lists, k):
for group in itertools.product(*comb):
if sum(g[0] for g in group) <= threshold:
cost_global.append(group)
except that we exit the loops as soon as we can guarantee not to find any results, lest we pointlessly sift through an innumerable number of combinations/products which are over the threshold.
def filter_cost(cost_lists, threshold):
cost = [[i for i in ilist if i[0] <= threshold] for ilist in cost_lists]
# the algorithm requires that we remove any lists that have become empty
return [ilist for ilist in cost if ilist]
def _combi(cost_lists, k, start, depth, subtotal, threshold):
if depth == k:
for i in xrange(start, len(cost_lists)):
for value in cost_lists[i]:
if value[0] + subtotal > threshold:
break
yield (value,)
else:
for i in xrange(start, len(cost_lists)):
for value in cost_lists[i]:
if value[0] + subtotal > threshold:
break
for c in _combi(cost_lists, k, i+1, depth+1,
value[0]+subtotal, threshold):
yield (value,) + c
def combinations_product(cost_lists, k, threshold):
for i in xrange(len(cost_lists)-k+1):
for value in cost_lists[i]:
if value[0] > threshold:
break
for comb in _combi(cost_lists, k, i+1, 2, value[0], threshold):
temp = (value,) + comb
cost, ilists, ith_items = zip(*temp)
yield sum(cost), ilists, ith_items
def find_smallest_sum_permutations(lists):
    N = len(lists)  # number of lists, and also how many permutations we return
    minima = [min(x) for x in lists]
cost_local = []
cost_global = []
for i, ilist in enumerate(lists):
if len(ilist) > 1:
first = ilist[0]
diff = [(num-first, i, j) for j, num in enumerate(ilist[1:], 1)]
cost_local.append(diff)
cost_global.extend(diff)
cost_global.sort()
threshold_index = len(lists) - 2
cost_threshold = cost_global[threshold_index][0]
cost_local = filter_cost(cost_local, cost_threshold)
for k in xrange(2, len(lists)):
group_combinations = tuple(combinations_product(cost_local, k,
cost_threshold))
if group_combinations:
cost_global.extend(group_combinations)
cost_global.sort()
cost_threshold = cost_global[threshold_index][0]
cost_local = filter_cost(cost_local, cost_threshold)
else:
break
permutations = [minima]
for k in xrange(N-1):
_, ilist, ith_item = cost_global[k]
if type(ilist) == int:
permutation = [minima[i]
if i != ilist else lists[ilist][ith_item]
for i in xrange(N)]
else:
# multiple nonminimal values combination
mapping = dict(zip(ilist, ith_item))
permutation = [minima[i]
if i not in mapping else lists[i][mapping[i]]
for i in xrange(N)]
permutations.append(permutation)
return permutations
Examples
Example in the question.
>>> lists = [
[1, 2],
[2, 3],
[1],
]
>>> for p in find_smallest_sum_permutations(lists):
... print p, sum(p)
[1, 2, 1] 4
[2, 2, 1] 5
[1, 3, 1] 5
Example I had generated with random lists.
>>> import random
>>> N = 5
>>> random.seed(1024)
>>> lists = [sorted(random.sample(range(10*N), 2*N-1)) for _ in xrange(N)]
>>> for p in find_smallest_sum_permutations(lists):
... print p, sum(p)
[4, 4, 1, 6, 0] 15
[4, 6, 1, 6, 0] 17
[4, 4, 3, 6, 0] 17
[4, 4, 1, 6, 4] 19
[4, 6, 3, 6, 0] 19
Example by user2357112 which had caught a glaring error in my previous iteration.
>>> lists = [
[1, 2, 30, 40],
[1, 2, 30, 40],
[10, 20, 30, 40],
[10, 20, 30, 40],
]
>>> for p in find_smallest_sum_permutations(lists):
... print p, sum(p)
[1, 1, 10, 10] 22
[2, 1, 10, 10] 23
[1, 2, 10, 10] 23
[2, 2, 10, 10] 24
The trick is to only generate the combinations that might possibly be needed, and store them in a heap. Each one that you pull out is the smallest one you have not yet seen. And the fact that THAT combination has been pulled out tells you that there are new ones which might also be small.
See https://docs.python.org/2/library/heapq.html for how to use a heap. We also need code for generating combinations. And with that, here is working code for getting the first n combinations for any list of lists:
import heapq
# Helper class for storing combinations.
class ListSelector:
def __init__(self, lists, indexes):
self.lists = lists
self.indexes = indexes
def value(self):
answer = 0
for i in range(0, len(self.lists)):
answer = answer + self.lists[i][self.indexes[i]]
return answer
def values(self):
return [self.lists[i][self.indexes[i]] for i in range(0, len(self.lists))]
# These are the next combinations. We are willing to increment any
# leading 0, or the first non-zero value. This will provide one and
# only one path to each possible combination.
def next_selectors(self):
lists = self.lists
indexes = self.indexes
selectors = []
for i in range(0, len(lists)):
if len(lists[i]) <= indexes[i] + 1:
if 0 == indexes[i]:
continue
else:
break
new_indexes = [
indexes[j] + (0 if j != i else 1)
for j in range(0, len(lists))]
selectors.append(ListSelector(lists, new_indexes))
if 0 < indexes[i]:
break
return selectors
# This will just return an iterator over all combinations, from smallest
# to largest. It does NOT generate them until needed.
def combinations(lists):
sel = ListSelector(lists, [0 for _ in range(len(lists))])
upcoming = [(sel.value(), sel)]
while len(upcoming):
value, sel = heapq.heappop(upcoming)
yield sel
for next_sel in sel.next_selectors():
heapq.heappush(upcoming, (next_sel.value(), next_sel))
# This just gets the first n of them. (It will return less if less.)
def smallest_n_combinations(n, lists):
i = 0
for sel in combinations(lists):
yield sel
i = i + 1
if i == n:
break
# Example usage
lists = [
[1, 2, 5],
[2, 3, 4],
[1]]
for sel in smallest_n_combinations(3, lists):
print(sel.value(), sel.values(), sel.indexes)
(This could be made more efficient for a long list of lists with tricks like caching the value inside of ListSelector and calculating it incrementally for new ones.)
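As a rough sketch of that last remark (a hypothetical subclass, not part of the answer above; combinations() would have to seed the heap with it instead of a plain ListSelector):
class CachedListSelector(ListSelector):
    # Carry the running sum along instead of recomputing it in value().
    def __init__(self, lists, indexes, total=None):
        ListSelector.__init__(self, lists, indexes)
        self.total = (sum(l[i] for l, i in zip(lists, indexes))
                      if total is None else total)

    def value(self):
        return self.total

    def next_selectors(self):
        selectors = []
        for sel in ListSelector.next_selectors(self):
            # Exactly one index was bumped by one; adjust the sum by the
            # difference between the new and the old element at that position.
            i = next(j for j in range(len(self.lists))
                     if sel.indexes[j] != self.indexes[j])
            delta = self.lists[i][sel.indexes[i]] - self.lists[i][self.indexes[i]]
            selectors.append(CachedListSelector(self.lists, sel.indexes,
                                                self.total + delta))
        return selectors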