Split list into N sublists with approximately equal sums - python

I have a list of integers, and I need to split it into a given number of sublists (with no restrictions on order or the number of elements in each), in a way that minimizes the average difference in sums of each sublist.
For example:
>>> x = [4, 9, 1, 5]
>>> sublist_creator(x, 2)
[[9], [4, 1, 5]]
because list(map(sum, sublist_creator(x, 2))) yields [9, 10], minimizing the average distance. Alternatively, [[9, 1], [4, 5]] would have been equally correct, and my use case has no preference between the two possibilities.
The only way I can think of to do this is by checking, iteratively, all possible combinations, but I'm working with a list of ~5000 elements and need to split it into ~30 sublists, so that approach is prohibitively expensive.

Here's the outline:
1. create N empty lists
2. sort() your input array in ascending order
3. pop() the last element from the sorted array
4. append() the popped element to the list with the lowest sum() of its elements
5. repeat steps 3 and 4 until the input array is empty
6. profit!!!
With M=5000 elements and N=30 lists this approach takes about O(N*M) time, provided you store the intermediate sums of the sublists instead of recalculating them from scratch every time.
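For reference, a minimal sketch of the outline above (the function name is mine); sorting descending and iterating is equivalent to sorting ascending and popping from the end:
def sublist_creator_greedy(x, n):
    sublists = [[] for _ in range(n)]
    sums = [0] * n                         # running total of each sublist
    for value in sorted(x, reverse=True):  # largest element first
        i = sums.index(min(sums))          # sublist with the lowest total, O(N)
        sublists[i].append(value)
        sums[i] += value
    return sublists
For the question's example, sublist_creator_greedy([4, 9, 1, 5], 2) should give [[9, 1], [5, 4]], with sums [10, 9].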

@lenik's solution has the right idea, but we can use a heap queue that keeps track of each sub-list's total and its index, kept in sorted order, to reduce the cost of finding the sub-list with the minimum total to O(log n), resulting in an overall O(m log n) time complexity:
import heapq

def sublist_creator(lst, n):
    lists = [[] for _ in range(n)]
    totals = [(0, i) for i in range(n)]
    heapq.heapify(totals)
    for value in lst:
        total, index = heapq.heappop(totals)
        lists[index].append(value)
        heapq.heappush(totals, (total + value, index))
    return lists
so that:
sublist_creator(x, 2)
returns:
[[4, 1, 5], [9]]

Implementation of @lenik's idea using Python's underrated priority queue module, heapq. This follows his idea pretty much exactly, except that each list is given a first element that holds its running sum. Since lists compare lexicographically and heapq is a min-heap implementation, all we have to do is strip off the first elements when we finish.
Using heapreplace will help avoid unnecessary resizing operations during the updates.
from heapq import heapreplace

def sublist_creator(x, n, sort=True):
    bins = [[0] for _ in range(n)]
    if sort:
        x = sorted(x, reverse=True)  # largest first, matching the outline above
    for i in x:
        least = bins[0]              # bins[0] is always the bin with the smallest sum
        least[0] += i
        least.append(i)
        heapreplace(bins, least)     # sift the updated bin back into place
    return [x[1:] for x in bins]     # drop the leading sums
Given M = len(x) and N = n, the sort is O(M log M) and the loop does M insertions, which are O(log N) worst case. So for M >= N, we can say that asymptotically the algorithm is O(M log M). If the array is pre-sorted, it's O(M log N).
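With the list from the question, this version should return, for example:
>>> sublist_creator([4, 9, 1, 5], 2)
[[9], [5, 4, 1]]
>>> list(map(sum, _))
[9, 10]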

Related

Python cross multiplication with an arbitrary number of lists

I'm not sure what the correct term is for the multiplication here, but I need to multiply each element from list A by every element in list B and create a new list of the products, so that the total length of the new list is len(A)*len(B).
As an example
A = [1,3,5], B=[4,6,8]
I need to multiply the two together to get
C = [4,6,8,12,18,24,20,30,40]
I have researched this and found that itertools.product has exactly what I need, however it is written for a specific number of lists and I need to generalise it to any number of lists, as requested by the user.
I don't have access to the full code right now but the code asks the user for some lists (can be any number of lists) and the lists can have any number of elements in the lists (but all lists contain the same number of elements). These lists are then stored in one big list.
For example (user input)
A = [2,5,8], B= [4,7,3]
The big list will be
C = [[2,5,8],[4,7,3]]
In this case there are two lists in the big list but in general it can be any number of lists.
Once the code has this I have
print([a*b for a,b in itertools.product(C[0],C[1])])
>> [8,14,6,20,35,15,32,56,24]
The output of this is exactly what I want, however in this case the code is written for exactly two lists and I need it generalised to n lists.
I've been thinking about creating a loop to somehow loop over it n times, but so far I have not been successful. Since C could be of any length, the loop needs a way to know when it has reached the end of the list. I don't need it to compute the product of all n lists at the same time, as in
print([a0*a1*...*a(n-1) for a0,a1,...,a(n-1) in itertools.product(C[0],C[1],C[2],...C[n-1])])
The loop could multiply two lists at a time then use the result from that multiplication against the next list in C and so on until C[n-1].
I would appreciate any advice to see if I'm at least heading in the right direction.
p.s. I am using numpy and the lists are arrays.
You can pass a variable number of arguments to itertools.product with *. * is the unpacking operator: it unpacks the list and passes its values to the function as if they were passed separately.
import itertools
import math
A = [[1, 2], [3, 4], [5, 6]]
result = list(map(math.prod, itertools.product(*A)))
print(result)
Result:
[15, 18, 20, 24, 30, 36, 40, 48]
You can find many explanations on the internet about the * operator. In short, if you call a function like f(*lst), it will be roughly equivalent to f(lst[0], lst[1], ..., lst[len(lst) - 1]). So it saves you from needing to know the length of the list.
Edit: I just realized that math.prod is a 3.8+ feature. If you're running an older version of Python, you can replace it with its numpy equivalent, np.prod.
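For instance, a rough numpy equivalent for older interpreters (the int() cast just converts numpy scalars back to plain Python integers):
import itertools
import numpy as np

A = [[1, 2], [3, 4], [5, 6]]
# multiply each combination's elements together
result = [int(np.prod(combo)) for combo in itertools.product(*A)]
print(result)  # [15, 18, 20, 24, 30, 36, 40, 48]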
You could use a reduce function, which is intended exactly for these kinds of accumulating operations. I am providing an example with a primitive implementation so you can better understand how it works:
lists = [
    [4, 6, 8],
    [1, 3, 5]
]

def reduce(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        value = next(it)
    else:
        value = initializer
    for element in it:
        value = function(value, element)
    return value

def cmp(a, b):
    for x in a:
        for y in b:
            yield x * y

summed = list(reduce(cmp, lists))
# OUTPUT
# [4, 12, 20, 6, 18, 30, 8, 24, 40]
In case you need it sorted just make use of the sort() function.
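For reference, the standard library already provides this as functools.reduce, so the same fold can be written with an ordinary function (a sketch reusing the data above):
from functools import reduce

def pairwise_products(a, b):
    # multiply every element of a by every element of b
    return [x * y for x in a for y in b]

print(reduce(pairwise_products, [[4, 6, 8], [1, 3, 5]]))
# [4, 12, 20, 6, 18, 30, 8, 24, 40]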

Summing each element of two arrays

I have two arrays and want to sum each element of both arrays and find the maximum sum.
I have programmed it like this:
sum = []
for element in arrayOne:
    sum.append(max([item + element for item in arrayTwo]))
print max(sum)
is there any better way to achieve this?
You can use numpy:
import numpy as np

a = np.array(arrayOne)
b = np.array(arrayTwo)
max_sum = max(a + b)  # a + b sums positionally; avoid shadowing the built-in max
print(max_sum)
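Note that a + b here sums the arrays positionally, which assumes equal lengths; a quick illustration with made-up inputs:
import numpy as np

a, b = np.array([3, 1, 7]), np.array([2, 9, 4])
print(a + b)       # [ 5 10 11] -- element i of a plus element i of b
print(max(a + b))  # 11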
Use itertools.product with max:
from itertools import product
print(max(sum(x) for x in product(arrayOne, arrayTwo)))
Or using map:
print(max(map(sum, product(arrayOne, arrayTwo))))
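For example, with some made-up inputs:
>>> from itertools import product
>>> arrayOne, arrayTwo = [3, 1, 7], [2, 9, 4]
>>> max(sum(x) for x in product(arrayOne, arrayTwo))
16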
max_sum = max(map(sum, zip(arrayOne, arrayTwo)))
Upd.
If you need max from sum of all elements in array:
max_sum = max(sum(arrayOne), sum(arrayTwo))
If arrayOne and arrayTwo are nested lists ([[1, 2], [3, 3], [3, 5], [4, 9]]) and you need to find element with max sum:
max_sum = max(map(sum, arrayOne + arrayTwo))
P. S. Next time, please provide examples of input and output so we don't have to guess what you need.
To find the maximum of all pairwise sums of elements of two arrays of lengths n and m respectively, one can just compute
max(arrayOne) + max(arrayTwo)
which performs at worst in O(max(n, m)) instead of the O(n*m) of going over all the combinations.
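A quick sanity check of that identity on made-up data:
from itertools import product

arrayOne, arrayTwo = [5, 2], [8, 1, 6]
# the largest pairwise sum is the sum of the two largest elements
assert max(a + b for a, b in product(arrayOne, arrayTwo)) == max(arrayOne) + max(arrayTwo)  # both are 13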
However, if, for whatever reason, it is necessary to iterate over all the pairs, the solution might be
max(foo(one, two) for one in arrayOne for two in arrayTwo)
Where foo can be any function of two numeric parameters outputting a number (or an object of any class that implements ordering).
By the way, please avoid redefining built-ins like sum in your code.

Python heapq: Split and merge into an ordered heapq

I want to split two heapqs (used as priority queues), then add them together so that the resulting heapq is ordered with respect to both of the previous heapqs.
Is this possible in python?
My current code:
from heapq import heappush, heappop

population = []
for i in range(0, 6):
    heappush(population, i)

new_population = []
for i in range(4, 9):
    heappush(new_population, i)

split_index = len(population) // 2
temp_population = population[:split_index]
population = new_population[:split_index] + temp_population

print(population)
print(heappop(population))
Output:
[4, 5, 6, 0, 1, 2]
4
Wanted output:
[0, 1, 2, 4, 5, 6]
0
Use nsmallest instead of slicing (slicing a heap is not guaranteed to give its smallest items), then re-heapify the combined lists:
from heapq import nsmallest, heapify, heappop

n = len(population) // 2
population = nsmallest(n, population) + nsmallest(n, new_population)
heapify(population)
print(heappop(population))
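With the question's data, this reproduces the wanted output:
>>> from heapq import nsmallest, heapify, heappop
>>> population = [0, 1, 2, 3, 4, 5]
>>> new_population = [4, 5, 6, 7, 8]
>>> n = len(population) // 2
>>> population = nsmallest(n, population) + nsmallest(n, new_population)
>>> heapify(population)
>>> population
[0, 1, 2, 4, 5, 6]
>>> heappop(population)
0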
You may want to benchmark, though, whether sorting the two original lists and then merging the results is faster. Python's sort routine is fast for nearly sorted lists, and this may impose a lot less overhead than the heapq functions. The last heapify step may not be necessary if you don't actually need a priority queue (since you are sorting anyway).
from itertools import islice
from heapq import merge, heapify

n = len(population)  # == len(new_population), presumably
# keep the n smallest items of the merged stream, then heapify in place
population = list(islice(merge(sorted(population), sorted(new_population)), n))
heapify(population)

How to find the position of a character in a list with respect to other characters in the list in O(n) time?

Suppose I have the string PRIME as a list ['P','R','I','M','E']. If we iterate through the list, the first element 'P' has three smaller elements after it, ['I','M','E'], and the second element 'R' also has three smaller elements after it (note that we only look forward in the list, so while looking for elements smaller than 'R', 'P' is not considered since we are done with it). The positional list for this example is therefore [3,3,1,1,0]. I could do this in O(n**2) time using a nested loop, but is there any way to do it in O(n)? I tried something like this but it failed horribly:
for _ in range(int(input())):
    x = list(input())
    y = sorted(x)
    lis = []
    for _ in x:
        res = abs(y.index(_) - x.index(_))
        lis.append(res)
    print(lis)
Here is mine (not O(n), but not O(n^2) either I guess):
>>> from collections import defaultdict
>>> def find_dict_position(s):
...     counter = defaultdict(int)
...     result = []
...     for e in s[::-1]:
...         less_count = sum(counter[c] for c in counter if c < e)
...         result.append(less_count)
...         counter[e] += 1
...     return reversed(result)
...
>>> list(find_dict_position('PRIME'))
[3, 3, 1, 1, 0]
Regardless of whether you can do this in lower complexity, you can use a list comprehension with a generator expression as follows to make your code faster and more Pythonic.
In [6]: lst = list('PRIME')
In [7]: [sum(j > t for t in lst[i+1:]) for i, j in enumerate(lst)]
Out[7]: [3, 3, 1, 1, 0]
Also, note that you cannot do this in O(n) by comparisons alone; after all, you need to compare the elements with each other, which, like sorting, can at best be done in O(n log(n)).
So basically this problem is finding, for each position in an array, the count of smaller elements to its right. First replace the character array with the characters' respective ASCII values; then you can use a balanced BST to solve the problem. The detailed approach for counting smaller elements on the right side of an array has complexity O(n log n).
But here, since the array elements are only characters, it can be done in O(n), as per the algorithm in mshsayem's answer.
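A sketch of that O(n) idea with a fixed-size count array instead of a BST, assuming uppercase letters (the function name is mine):
def smaller_to_the_right(s):
    counts = [0] * 26                   # occurrences seen so far, 'A'..'Z'
    result = []
    for ch in reversed(s):
        k = ord(ch) - ord('A')
        result.append(sum(counts[:k]))  # letters smaller than ch seen to its right
        counts[k] += 1
    result.reverse()
    return result

print(smaller_to_the_right('PRIME'))  # [3, 3, 1, 1, 0]
Each step scans at most 26 counters, so the total work is O(26*n) = O(n) for a fixed alphabet.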

Python: Fast extraction of intersections among all possible 2-combinations in a large number of lists

I have a dataset of ca. 9K lists of variable length (1 to 100K elements). I need to calculate the length of the intersection of all possible 2-list combinations in this dataset. Note that elements in each list are unique so they can be stored as sets in python.
What is the most efficient way to perform this in python?
Edit I forgot to specify that I need to have the ability to match the intersection values to the corresponding pair of lists. Thanks everybody for the prompt response and apologies for the confusion!
If your sets are stored in s, for example:
s = [set([1, 2]), set([1, 3]), set([1, 2, 3]), set([2, 4])]
Then you can use itertools.combinations to take them two by two, and calculate the intersections (note that, as Alex pointed out, combinations is only available since version 2.6). Here with a list comprehension (just for the sake of the example):
from itertools import combinations
[ i[0] & i[1] for i in combinations(s,2) ]
Or, in a loop, which is probably what you need:
for i in combinations(s, 2):
    inter = i[0] & i[1]
    # process the intersection set result "inter"
So, to have the length of each one of them, that "processing" would be:
l = len(inter)
This would be quite efficient, since it uses iterators to compute each combination and does not prepare all of them in advance.
Edit: Note that with this method, each set in the list "s" can actually be something else that returns a set, like a generator. The list itself could simply be a generator if you are short on memory. It could be much slower though, depending on how you generate these elements, but you wouldn't need to have the whole list of sets in memory at the same time (not that it should be a problem in your case).
For example, if each set is made from a function gen:
def gen(parameter):
    while more_sets():
        # ... some code to generate the next set 'x'
        yield x
with open("results", "wt") as f_results:
    for i in combinations(gen("data"), 2):
        inter = i[0] & i[1]
        f_results.write("%d\n" % len(inter))
Edit 2: How to collect indices (following redrat's comment).
Besides the quick solution I answered in comment, a more efficient way to collect the set indices would be to have a list of (index, set) instead of a list of set.
Example with new format:
s = [(0, set([1, 2])), (1, set([1, 3])), (2, set([1, 2, 3]))]
If you are building this list to calculate the combinations anyway, it should be simple to adapt to your new requirements. The main loop becomes:
with open("results", "wt") as f_results:
    for i in combinations(s, 2):
        inter = i[0][1] & i[1][1]
        f_results.write("length of %d & %d: %d\n" % (i[0][0], i[1][0], len(inter)))
In the loop, i[0] and i[1] would be a tuple (index, set), so i[0][1] is the first set, i[0][0] its index.
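If the sets already live in a plain list, that indexed form is a one-liner with enumerate:
s = list(enumerate([{1, 2}, {1, 3}, {1, 2, 3}, {2, 4}]))
# [(0, {1, 2}), (1, {1, 3}), (2, {1, 2, 3}), (3, {2, 4})]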
As you need to produce N*(N-1)/2 results, i.e., O(N squared) outputs, no approach can be less than O(N squared) -- in any language, of course (N is "about 9K" in your question). So, I see nothing intrinsically faster than (a) making the N sets you need, and (b) iterating over them to produce the output -- i.e., the simplest approach. IOW:
def lotsofintersections(manylists):
    manysets = [set(x) for x in manylists]
    moresets = list(manysets)
    for s in reversed(manysets):
        moresets.pop()
        for z in moresets:
            yield s & z
This code's already trying to add some minor optimization (e.g. by avoiding slicing or popping off the front of lists, which might add other O(N squared) factors).
If you have many cores and/or nodes available and are looking for parallel algorithms, it's a different case of course -- if that's your case, can you mention the kind of cluster you have, its size, how nodes and cores can best communicate, and so forth?
Edit: as the OP has casually mentioned in a comment (!) that they actually need the numbers of the sets being intersected (really, why omit such crucial parts of the specs?! at least edit the question to clarify them...), this would only require changing this to:
L = len(manysets)
for i, s in enumerate(reversed(manysets)):
    moresets.pop()
    for j, z in enumerate(moresets):
        yield L - i, j + 1, s & z
(if you need to "count from 1" for the progressive identifiers -- otherwise obvious change).
But if that's part of the specs you might as well use simpler code -- forget moresets, and:
L = len(manysets)
for i in range(L):
    s = manysets[i]
    for j in range(i + 1, L):
        yield i, j, s & manysets[j]
this time assuming you want to "count from 0" instead, just for variety;-)
Try this:
from functools import reduce  # not needed on Python 2

_lists = [[1, 2, 3, 7], [1, 3], [1, 2, 3], [1, 3, 4, 7]]
_sets = map(set, _lists)
_intersection = reduce(set.intersection, _sets)
And to obtain the indexes:
_idxs = [map(_i.index, _intersection) for _i in _lists]
Cheers,
José María García
PS: Sorry I misunderstood the question
