Optimizing generating a list of sums in Python

I am attempting to use Python for the following task: given a set of integers S, produce S + S, the set of integers expressible as s1 + s2 for s1, s2 members of S (not necessarily distinct).
I am using the following code:
def sumList(l):
    # generates a list of numbers which are sums of two elements of l
    sumL = []
    howlong = len(l)
    for i in range(howlong):
        for j in range(i+1):
            if not l[i]+l[j] in sumL:
                sumL.append(l[i]+l[j])
    return sumL
This works fine for short enough lists, but when handed a longer list (say, 5000 elements between 0 and 20000), it becomes incredibly slow (20+ minutes).
Question: what is making this slow? My guess is that asking whether the sum is already a member of the list is taking some time, but I am a relative newcomer to both Python and programming, so I am not sure. I am also looking for suggestions on how to perform the task of producing S + S in a quick fashion.

Python has a built-in type set that has very fast lookups. You can't store duplicates or unhashable objects in a set, but since you want a set of integers, it's perfect for your needs. In the code below, I also use itertools.product to generate the pairs.
from itertools import product

def sums(l):
    return {x+y for x, y in product(l, repeat=2)}

print(sums([1, 2, 3, 4]))
# {2, 3, 4, 5, 6, 7, 8}
As to why your existing solution is so slow, you might want to look up the term "algorithmic complexity". Basically, it's a way of categorizing algorithms into general groups based on how well they scale to many inputs. Your algorithm is an O(n^3) algorithm: it does about n^3 comparisons, because each membership test (in) on a list is itself an O(n) scan. In comparison, the set solution is O(n^2). It accomplishes this by discarding the need to check whether a particular sum is already in the set.
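If you'd rather keep the original loop structure, swapping the list accumulator for a set is already enough to get the O(n^2) behavior; a minimal sketch (the function name is mine):
def sumListFast(l):
    # same double loop as before, but adding to a set is O(1)
    # on average, and the set discards duplicates for us
    sums = set()
    for i in range(len(l)):
        for j in range(i + 1):
            sums.add(l[i] + l[j])
    return sums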

Related

How to speed up obstacle checking algorithm?

I've been given the following problem: "You want to build an algorithm that allows you to build blocks along a number line, and also to check if a given range is block-free. Specifically, we must allow two types of operations:
[1, x] builds a block at position x.
[2, x, size] checks whether it's possible to build a block of size size that begins at position x (inclusive). Returns 1 if possible else 0.
Given a stream of operations of types 1 and 2, return a list of outputs for each type 2 operation."
I tried to create a set of blocks so we can look up in O(1) time; that way, for a given operation of type 2, I loop over range(x, x + size) and see if any of those points are in the set. This algorithm runs too slowly and I'm looking for faster alternative approaches. I also tried searching the entire set of blocks when the size specified in the type 2 call is greater than len(blocks), but this also times out. Can anyone think of a faster way to do this? I've been stuck on this for a while.
Store the blocks in a red-black tree (or any self-balancing tree), and when you're given a query, find the smallest element in the tree greater than or equal to x and return 1 if it is at least x + size (or if no such element exists). This is O((n + m) log n), where n is the number of blocks and m is the number of queries.
If you use a simple binary search tree (rather than a self-balancing one), a large test case with blocks at (1, 2, 3, ..., n) will cause your search tree to be very deep and queries will run in linear (rather than logarithmic) time.
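Python's standard library has no self-balancing search tree, so a sketch of this idea would typically reach for the third-party sortedcontainers package; its SortedList is not literally a red-black tree, but it offers the same O(log n) adds and lookups. The operation encoding follows the question; the function name is mine:
from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

def process(operations):
    blocks = SortedList()
    results = []
    for op in operations:
        if op[0] == 1:                 # [1, x]: build a block at position x
            blocks.add(op[1])
        else:                          # [2, x, size]: is [x, x + size - 1] block-free?
            _, x, size = op
            i = blocks.bisect_left(x)  # index of the smallest block >= x
            free = i == len(blocks) or blocks[i] >= x + size
            results.append(1 if free else 0)
    return results

print(process([[1, 5], [2, 2, 3], [2, 3, 3]]))  # [1, 0]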

Can this (find + sort) problem be solved in O(n)?

I went through this problem on geeksforgeeks.com and while my solution managed to pass all test cases, I actually used .sort(), so I know it doesn't fit the expected time complexity of O(n): I mean, we all know no comparison-based sorting algorithm runs in O(n), not even the best implementation of Timsort (which is what Python uses). So I went to check the website's Answer/Solution and found this:
def printRepeating(arr, n):
    # First, take each value modulo n as an index and
    # increment the cell at that index by n, so each cell
    # ends up holding its original value plus n times the
    # number of occurrences of its index as a value
    for i in range(0, n):
        index = arr[i] % n
        arr[index] += n
    # Now check which values occur more than once
    # by dividing by the size of the array
    for i in range(0, n):
        if (arr[i]/n) >= 2:
            print(i, end=" ")
I tried to follow the logic behind that algorithm but honestly couldn't, so I tested different datasets until I found that it failed for some. For instance:
arr = [5, 6, 3, 1, 3, 6, 6, 0, 0, 11, 11, 1, 1, 50, 50]
Output: 0 1 3 5 6 11 13 14
Notice that:
Number 5 IS NOT repeated in the array,
Numbers 13 and 14 are not even present in the array, and
Number 50 is both present and repeated, and the solution won't show it.
I already reported the problem to the website, I just wanted to know if, since these problems are supposed to be curated, there is a solution in O(n). My best guess is there isn't unless you can somehow insert every repeated number in O(1) within the mapping of all keys/values.
The reason the code doesn't work with your example data set is that you're violating one of the constraints that is given in the problem. The input array (of length n) is supposed to only contain values from 0 to n-1. Your values of 50 are too big (since you have 15 elements in your list). That constraint is why adding n to the existing values doesn't break things. You have a less-than-n original value (that can be extracted with arr[i] % n), and the count (that can be extracted with arr[i] // n). The two values are stacked on top of each other, cleverly reusing the existing array with no extra space needed.
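A tiny worked illustration of that stacking (hypothetical numbers, with n = 5):
n = 5
cell = 3          # original value stored at some index, always < n
cell += n         # the value 3 occurs once somewhere in the array
cell += n         # ...and a second time
print(cell % n)   # 3 -> the original value is recovered
print(cell // n)  # 2 -> how many times 3 occurred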
The problem can be solved with dict().
And for Python here: https://docs.python.org/3.10/library/stdtypes.html#mapping-types-dict
It's a hash-based mapping type with amortized O(1) lookups, which, as you've mentioned, is exactly what you need.
Python stdlib also has collections.Counter, which is a specialization of dict that accomplishes 90% of what the problem asks for.
Edit:
Oh, the results have to be sorted too. Looks like they want you to use a list() "as a dict", mapping integers to their number of occurrences via their own value as an index.
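A sketch of that list-as-dict idea (the function name is mine): since every value is guaranteed to be in 0..n-1, a plain list of counts can serve as the mapping, and scanning it by index yields the output already sorted, all in O(n) time (though with O(n) extra space, which the in-place trick above avoids):
def print_repeating(arr):
    n = len(arr)
    counts = [0] * n          # counts[v] = occurrences of value v
    for v in arr:
        counts[v] += 1
    for value, count in enumerate(counts):
        if count >= 2:        # repeated values come out in sorted order
            print(value, end=" ")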

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement like this:
Given n, k and an array (a list) arr, where n = len(arr) and k is an integer in the range [1, n] (inclusive).
For an array (or list) myList, the unfairness sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if myList = [1, 2, 5, 5, 6], then its unfairness sum (written MUS below) works out as follows. Note that elements are considered unique by their index in the list, not by their values:
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
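(A direct transcription of that definition, with a helper name of my choosing; it evaluates to 26 for the list above:)
from itertools import combinations

def unfairness_sum(xs):
    # sum of |a - b| over all pairs of positions
    return sum(abs(a - b) for a, b in combinations(xs, 2))

print(unfairness_sum([1, 2, 5, 5, 6]))  # 26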
If you actually need to look at the problem statement, it's HERE.
My Objective
Given n, k, arr (as described above), find the minimum unfairness sum out of all the unfairness sums of the subarrays possible, with the constraint that each len(subarray) = k [which is a good thing to make our lives easy, I believe :) ]
What I have tried
Well, there is a lot to cover here, so I'll try to be as brief as I can.
My first approach was the code below, where I used itertools.combinations to get all the possible combinations and statistics.variance to check the spread of the data (yeah, I know I'm a mess).
Before you see the code: do you think variance and unfairness sum are perfectly related (I know they are strongly related)? That is, does the subarray with minimum variance have to be the subarray with the MUS?
You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1
    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n < 30, but raises a MemoryError beyond that.
In Python chat, Kevin suggested I try a generator, which is memory efficient (it really is), but since a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n=50, k=8.
I posted the same as a question on SO HERE (you might want to have a look to understand me properly - it has discussions and an answer by fusion which took me to my second approach - a better one (I should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb

def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var < variance:
            variance = var
            min_variance_sub = sub
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)

if __name__ == '__main__':
    MainApp()
This code works perfectly for n up to 1000 (maybe more), but terminates due to a time out (5 seconds is the limit on the online judge :/) for n beyond 10000 (the biggest test case has n = 100000).
=====
How would you approach this problem to handle all the test cases within the given time limit (5 sec)? (The problem was listed under algorithms & dynamic programming.)
For your reference, you can have a look at:
successful submissions (py3, py2, C++, Java) on this problem by other candidates - so that you can explain that approach for me and future visitors
an editorial by the problem setter explaining how to approach the question
a solution code by the problem setter himself (py2, C++)
input data (test cases) and expected output
Edit 1:
For future visitors of this question, the conclusions I have so far are:
Variance and unfairness sum are not perfectly related (they are strongly related), which means that among many lists of integers, the list with minimum variance doesn't always have to be the list with the minimum unfairness sum. If you want to know why, I asked that as a separate question on Math Stack Exchange HERE, where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected).
As far as the question is concerned overall, you can read the answers by archer & Attersson below (I'm still trying to figure out a naive approach to carry this out - it shouldn't be far off now, though).
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because, necessarily, any sublist that includes at least one non-consecutive element will have a higher unfairness sum.
For example, if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20], etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum, because 10 > 7 and therefore all of 10's differences with the rest of the elements will be larger than 7's were.
The same goes for [1,7,10] (non-consecutive on the left side), since 1 < 3.
Given that, you only have to check the consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
import itertools

def myvar(array):
    # despite its name, this computes the unfairness sum of array
    return sum(abs(p[0] - p[1]) for p in itertools.combinations(array, 2))

def minsum(n, k, arr):
    l = sorted(arr)                # the consecutive-window argument needs a sorted list
    res = myvar(l[0:k])            # initialize with the first subarray
    for i in range(1, n - k + 1):  # all n-k+1 consecutive windows of length k
        res = min(res, myvar(l[i:i+k]))
    return res
I see this question still has no complete answer, so I will sketch the track of a correct algorithm that will pass the judge. I will not write the code, in order to respect the purpose of the HackerRank challenge and since we already have working solutions above.
The original array must be sorted. This has a complexity of O(NlogN)
At this point you can check only consecutive subarrays, as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer.
The last passage, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating it at every step, which costs O(k) and brings the complexity of this passage to O(k*N). Instead it can be done in O(1) per step, as the editorial you posted shows, including the mathematical formulae. It requires a prior initialization of a cumulative (prefix-sum) array after step 1 (done in O(N), with O(N) space complexity too).
It works but terminates due to time out for n<=10000.
(from comments on archer's question)
To explain step 3, think about k = 100. As you scroll the N-long array, on the first iteration you must calculate the US for the subarray from element 0 to 99 as usual, requiring 100 passages. The next step needs the same for a subarray that differs from the previous one by only one element: 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scroll. Just work out the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.
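Attersson deliberately leaves the code out; purely to illustrate the recurrence, here is a minimal sketch of steps 1-3 under one common form of the editorial's maths (my derivation, so treat it as an assumption): for a sorted window of length k starting at i, US(i) is the sum of a[i+t]*(2t-(k-1)), and sliding the window one step right changes it by (k-1)*(a[i] + a[i+k]) - 2*(P[i+k] - P[i+1]), where P is the prefix-sum array.
def min_unfairness_sum(arr, k):
    a = sorted(arr)                  # step 1: O(N log N)
    n = len(a)
    P = [0]                          # prefix sums: P[t] = a[0] + ... + a[t-1]
    for v in a:
        P.append(P[-1] + v)
    # unfairness sum of the first window, computed once in O(k):
    # in a sorted window, element t carries coefficient (2t - (k - 1))
    us = sum(a[t] * (2 * t - (k - 1)) for t in range(k))
    best = us
    for i in range(n - k):           # step 3: each window update is O(1)
        us += (k - 1) * (a[i] + a[i + k]) - 2 * (P[i + k] - P[i + 1])
        best = min(best, us)
    return best

print(min_unfairness_sum([1, 2, 5, 5, 6], 3))  # 2, from the window [5, 5, 6]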

Iterating through a generator of itertools.combinations object takes forever

Edit:
After all these discussions with juanpa & fusion here in the comments, and with Kevin on Python chat, I have come to the conclusion that iterating through a generator takes the same time as iterating through any other iterable, because the generator produces those combinations on the fly. Moreover, the approach by fusion worked great for len(arr) up to 1000 (maybe up to 5k - but then it terminates due to a time out, of course on an online judge - please note that this is not because of getting the min_variance_sub; I also have to compute the sum of the absolute differences of all the possible pairs in the min_variance_sub). I am going to accept fusion's approach as an answer for this question, because it answered it.
But I will also create a new question for that problem statement (more like a Q&A, where I will also answer the question for future visitors - I got the answer from submissions by other candidates, an editorial by the problem setter, and code by the problem setter himself - though I do not understand the approach they used). I will link to the other question as I create it :)
It's HERE
The original question starts below
I'm using itertools.combinations on an array, so first up I tried something like
aList = [list(x) for x in list(cmb(arr, k))]
where cmb = itertools.combinations, arr is the list, and k is an int.
This works fine for len(arr) < 20 or so, but it raised a MemoryError when len(arr) became 50 or more.
On a suggestion by Kevin on Python chat, I used a generator instead, which was amazingly fast to create:
aGen = (list(x) for x in cmb(arr, k))
But it's so slow to iterate through this generator object.
I tried something like
for p in aGen:
    continue
and even this code seems to take forever.
Kevin also suggested an answer about computing the kth combination directly, which was nice, but in my case I actually want to test all the possible combinations and select the one with minimum variance.
So what would be a memory-efficient way of checking all the possible combinations of an array (a list) for minimum variance (to be precise, I only need to consider subarrays having exactly k elements)?
Thank You For Any Help.
You can sort the list of n elements first,
then use a moving window of length k along the sorted list,
and find the variance of each of the n-k+1 possible windows.
The minimum of these should be the minimum over all combinations.
def myvar(arr):
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

input_list = [.......]
sorted_list = sorted(input_list)
variance = None
min_variance_sub = None
for i in range(len(sorted_list) - k + 1):
    sub = sorted_list[i:i+k]
    var = myvar(sub)
    if variance is None or var < variance:
        variance = var
        min_variance_sub = sub
print(min_variance_sub)

Efficient way to get family of subsets by its number in Python

I have an n-element set, and I want to consider families of its k-element subsets of fixed size s.
For example, if n = 3, k = 1, s = 2, we have these families:
{{1}, {2}}, {{1}, {3}}, {{2}, {3}}
In my problem n, k, s are not so small, e.g. s = n = 50, k = 20.
Let us say all such families are ordered lexicographically (or really in any clearly stated order). I want an efficient way to get a family by its number.
I thought of using itertools, but I am afraid it won't work with such big numbers. Possibly I need to implement something myself, but I have no clear understanding of how to do it. I have only the following idea: enumerate all k-element subsets of an n-element set (there is an efficient algorithm to get the i-th subset by i). Then enumerate all s-element subsets of a comb(n, k)-element set, using the same operation. Now we need to generate a number in range(0, comb(comb(n, k), s)) and turn it first into the number of an s-element subset and then into a family of k-element sets.
However, such approach looks a bit complicated. Maybe there is an easier one?
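The two-level scheme described above can be built on a standard k-subset unranking routine (the combinatorial number system). A rough sketch, with all names mine, using math.comb so the large binomials stay exact:
from math import comb

def unrank_subset(idx, n, k):
    # the idx-th k-element subset of range(n), in lexicographic order
    subset = []
    x = 0
    while k > 0:
        c = comb(n - x - 1, k - 1)  # number of remaining subsets whose smallest element is x
        if idx < c:
            subset.append(x)
            k -= 1
        else:
            idx -= c
        x += 1
    return subset

def unrank_family(idx, n, k, s):
    # the idx-th family of s distinct k-subsets: first unrank an s-subset
    # of the comb(n, k) possible subset indices, then unrank each index
    indices = unrank_subset(idx, comb(n, k), s)
    return [unrank_subset(i, n, k) for i in indices]

print(unrank_family(0, 3, 1, 2))  # [[0], [1]], i.e. {{1}, {2}} with 0-based elements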
