I wrote this Python code to do a particular computation in a bigger project. It works fine for small values of N, but it doesn't scale well for large values, and even though I ran it for several hours to collect the data, I was wondering whether there is a way to speed it up.
import numpy as np

def FillArray(arr):
    while 0 in arr:
        ind1 = np.random.randint(0, N)
        if arr[ind1] == 0:
            if ind1 == 0:
                arr[ind1] = 1
                arr[ind1+1] = 2
            elif ind1 == len(arr)-1:
                arr[ind1] = 1
                arr[ind1-1] = 2
            else:
                arr[ind1] = 1
                arr[ind1+1] = 2
                arr[ind1-1] = 2
        else:
            continue
    return arr

N = 50000
dist = []
for i in range(1000):
    arr = [0 for x in range(N)]
    dist.append(FillArray(arr).count(2))
For N = 50,000, it currently takes slightly over a minute on my computer for one iteration to fill the array. So if I want to simulate this, say, 1000 times, it takes many hours. Is there something I can do to speed this up?
Edit 1: I forgot to mention what it actually does. I have a list of length N, initialized with zeros in every entry. Then I pick a random index between 0 and N; if that entry of the list is zero, I replace it by 1 and set its neighboring entries to 2 to indicate that they are not filled by 1 but cannot be filled again. I keep doing this until the whole list is populated by 1s and 2s, and then I count how many of the entries contain 2, which is the result of this computation. In other words, I want to find out how many entries remain unfilled when I fill an array randomly under this constraint.
Obviously I do not claim that this is the most efficient way to find this number, so I am hoping that perhaps there is a better alternative if this code can't be sped up.
As @SylvainLeroux noted in the comments, the approach of drawing a random location and hoping it's a zero slows down badly once you start running out of zeros. Simply choosing from the locations you know are still zero will speed it up dramatically. Something like
def faster(N):
    # pad on each side
    arr = np.zeros(N+2)
    arr[0] = arr[-1] = -1  # ignore edges
    while True:
        # zeros left
        zero_locations = np.where(arr == 0)[0]
        if not len(zero_locations):
            break  # we're done
        np.random.shuffle(zero_locations)
        for zloc in zero_locations:
            if arr[zloc] == 0:
                arr[zloc-1:zloc+2] = [2, 1, 2]
    return arr[1:-1]  # remove edges
will be much faster (times on my old notebook):
>>> %timeit faster(50000)
10 loops, best of 3: 105 ms per loop
>>> %time [(faster(50000) == 2).sum() for i in range(1000)]
CPU times: user 1min 46s, sys: 4 ms, total: 1min 46s
Wall time: 1min 46s
We could improve this by vectorizing more of the computation, but depending on your constraints this might already suffice.
First I will reformulate the problem from trivariate to bivariate. What you are doing is splitting the vector of length N into two smaller vectors at a random point k.
Let's assume that you start with a vector of zeros; you put '1' at a randomly selected k, and from there take two smaller vectors of zeros - [0..k-2] and [k+2..N-1]. There is no need for a third state. You repeat the process until exhaustion - when you are left with vectors containing only one element.
Using recursion this is reasonably fast even on my iPad mini with Pythonista.
import numpy as np
from random import randint

def SplitArray(l, r):
    while l < r:
        k = randint(l, r)
        arr[k] = 1
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = [0 for x in xrange(N)]
    SplitArray(0, N-1)
    dist[i] = arr.count(0)
print dist, np.mean(dist), np.std(dist)
However, if you would like to make it really fast, the bivariate problem can be coded very effectively and naturally as bit arrays instead of storing 1s and 0s in arrays of integers or, worse, floats in numpy arrays. The bit manipulation should be quick, and in some cases you could easily get close to machine-level speed.
Something along these lines (this is an idea, not optimal code):
from bitarray import BitArray
from random import randint
import numpy as np

def SplitArray(l, r):
    while l < r:
        k = randint(l, r)
        arr.set_bit(k)
        return SplitArray(l, k-2) + [k] + SplitArray(k+2, r)
    return []

def count0(ba):
    i = 0
    for n in xrange(1, N):
        if ba.get_bit(n) == 0:
            i += 1
    return i

N = 50000
L = 1000
dist = np.zeros(L)
for i in xrange(L):
    arr = BitArray(N, initialize=0)
    SplitArray(1, N)
    dist[i] = count0(arr)
print np.mean(dist), np.std(dist)
using bitarray
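For reference, the PyPI bitarray package exposes a bitarray class with list-style indexing and a count() method rather than BitArray/set_bit/get_bit, so a minimal, hedged sketch of the same idea against that package (split_array is my adaptation of SplitArray above, not part of the original answer) might look like:

from bitarray import bitarray
from random import randint
import numpy as np

def split_array(arr, l, r):
    # recursively split [l, r]: place a 1 at a random k, then recurse on the
    # two remaining zero regions, mirroring SplitArray above
    if l >= r:
        return
    k = randint(l, r)
    arr[k] = 1
    split_array(arr, l, k - 2)
    split_array(arr, k + 2, r)

N, L = 50000, 1000
dist = np.zeros(L)
for i in range(L):
    arr = bitarray(N)
    arr.setall(0)
    split_array(arr, 0, N - 1)
    dist[i] = arr.count(0)  # zero bits left = entries that were never filled
print(np.mean(dist), np.std(dist))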
The solution converges very nicely, so perhaps half an hour spent looking for an analytical solution would make this whole MC exercise unnecessary?
Related
My goal is to speed up the creation of a list of combinations by using my GPU. How can I accomplish this?
By way of example, the following code creates a list of 260 text strings ranging from "aa" through "jz". We then use itertools combinations_with_replacement() to create all possible combinations of R elements of this list. The use of timeit shows that, beyond 3 elements, extracting a list of these combinations slows down exponentially. I suspect this can be done with numba cuda, but I don't know how.
import timeit

timeit.timeit('''
from itertools import combinations_with_replacement

combo_count = 2
alphabet = 'a'
alpha_list = []
item_list = []

for i in range(0, 26):
    alpha_list.append(alphabet)
    alphabet = chr(ord(alphabet) + 1)

for first_letter in alpha_list[0:10]:
    for second_letter in alpha_list:
        item_list.append(first_letter + second_letter)

print("Length of item list:", len(item_list))

combos = combinations_with_replacement(item_list, combo_count)
cmb_lst = [bla for bla in combos]
print("Length of list of all {} combinations: {}".format(combo_count, len(cmb_lst)))
''', number=1)
As mentioned in the comments, there is no way to "vectorize" the combinations_with_replacement() call from the itertools library directly (with Numba CUDA). Numba CUDA doesn't work that way.
However, I believe it should be possible to generate an equivalent result dataset, using Numba CUDA, in a way that seems to run faster than the itertools library function for certain cases. I imagine there are probably a number of ways to accomplish this, and I make no claims that the method I describe is in any way optimal. It certainly is not, and could certainly be made to run faster. However, according to my testing, even this not-very-optimized approach can run a particular test case about 10x faster than Python itertools on a V100 GPU.
As background, I consider this and this (or equivalent material) to be essential reading.
From the above, the formula for the number of combinations of n items with k choices, with replacement, is given by:
(n-1+k)!
-------- (Formula 1)
(n-1)!k!
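As a quick sanity check (this snippet is my addition, not part of the answer's program), Formula 1 is the standard multiset coefficient C(n-1+k, k), which Python's math.comb evaluates directly:

from math import comb

# Formula 1: (n-1+k)! / ((n-1)! k!) == C(n-1+k, k)
n, k = 260, 4
print(comb(n - 1 + k, k))  # 194831715, matching the count printed by the program below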
In the code below, I have encapsulated the above calculation in count_comb_with_repl (device) and host_count_comb_with_repl (host) functions. It turns out we can use this one basic calculation, with a cascading-smaller sequence of choices for n and k, to drive the entire calculation process to compute a combination given only an index into the final result array. To visualize what we are doing, it helps to have a simple example. Let's take the case of 3 items, and 3 choices. Indexing items from zero, the array of possibilities looks like this:
n = 3, k = 3
  index    choices    first digit calculation
    0       0,0,0     -----------------
    1       0,0,1
    2       0,0,2
    3       0,1,1     equivalent to n = 3, k = 2
    4       0,1,2
    5       0,2,2     -----------------
    6       1,1,1     -----------------
    7       1,1,2     equivalent to n = 2, k = 2
    8       1,2,2     -----------------
    9       2,2,2     equivalent to n = 1, k = 2
The length of the above list is given by plugging the values n = 3 and k = 3 into Formula 1. The key observation behind the method I present is that, to compute the first digit of the choices result given only the index, we can compute the dividing points between first digits. For example, among the results whose first choice index is 0, the length of that range is given by plugging n = 3 and k = 2 into Formula 1. Therefore, if our given index is less than this value (6), we know the first digit is 0. If it is greater than or equal to this value, we know the first digit is 1 or 2, and with suitable offsetting we can recompute the next range (corresponding to a first digit of 1) and see whether our index falls within it.
Once we know the first digit, we can repeat the process (with suitable list reduction and offsetting) to find the next digit, and the next digit, etc.
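To make that decoding step concrete before the CUDA code, here is a small, hedged CPU-only sketch (plain Python; decode and multiset_count are names I'm introducing, not taken from the answer's code) that reproduces the n = 3, k = 3 table above from nothing but the flat index:

from math import comb

def multiset_count(n, k):
    # number of combinations of n items taken k at a time, with replacement (Formula 1)
    return comb(n - 1 + k, k)

def decode(index, n, k):
    # recover the k non-decreasing choice indices corresponding to a flat index
    digits = []
    offset = 0          # smallest item index still available for this position
    for pos in range(k):
        remaining = k - pos - 1
        d = 0
        while True:
            block = multiset_count(n - d, remaining)  # size of the range with this digit
            if index < block:
                break
            index -= block   # skip past that whole range and try the next digit
            d += 1
        digits.append(offset + d)
        n -= d
        offset += d
    return digits

# reproduce the n = 3, k = 3 example table
for i in range(multiset_count(3, 3)):
    print(i, decode(i, 3, 3))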
Here is Python code that implements the above method. As I mentioned, for a test case of n=260 and k=4 this runs in less than 3 seconds on my V100.
$ cat t2.py
from numba import cuda,jit
import numpy as np

@cuda.jit(device=True)
def get_next_count_comb_with_repl(n,k,prev):
    return int(round((prev*(n))/(n+k)))

@cuda.jit(device=True)
def count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

#intended to be identical to the previous function
#I just need a version I can call from host code
def host_count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

@cuda.jit(device=True)
def find_first_digit(n,k,i):
    psum = 0
    count = count_comb_with_repl(n, k-1)
    if (i-psum) < count:
        return 0,psum
    psum += count
    for j in range(1,n):
        count = get_next_count_comb_with_repl(n-j,k-1,count)
        if (i-psum) < count:
            return j,psum
        psum += count
    return -1,0 # error

@cuda.jit
def kernel_count_comb_with_repl(n, k, l, r):
    for i in range(cuda.grid(1), l, cuda.gridsize(1)):
        new_ll = n
        new_cc = k
        new_i = i
        new_digit = 0
        for j in range(k):
            digit,psum = find_first_digit(new_ll, new_cc, new_i)
            new_digit += digit
            new_ll -= digit
            new_cc -= 1
            new_i -= psum
            r[i+j*l] = new_digit

combo_count = 4
ll = 260
cl = host_count_comb_with_repl(ll, combo_count)
print(cl)
# bug if cl > 2G
if cl < 2**31:
    my_dtype = np.uint8
    if ll > 255:
        my_dtype = np.uint16
    r = np.empty(cl*combo_count, dtype=my_dtype)
    d_r = cuda.device_array_like(r)
    block = 256
    grid = (cl//block)+1
    #grid = 640
    kernel_count_comb_with_repl[grid,block](ll, combo_count, cl, d_r)
    r = d_r.copy_to_host()
    print(r.reshape(combo_count,cl))
$ time python t2.py
194831715
[[ 0 0 0 ... 258 258 259]
[ 0 0 0 ... 258 259 259]
[ 0 0 0 ... 259 259 259]
[ 0 1 2 ... 259 259 259]]
real 0m2.212s
user 0m1.110s
sys 0m1.077s
$
(The above test case: n=260, k = 4, takes ~30s on my system using OP's code.)
This should be considered to be a sketch of an idea. I make no claims that it is defect free. This type of problem can quickly exhaust the memory on a GPU (for large enough choices of n and/or k), and your only indication of that would probably be a crude out of memory error from numba.
Yes, the above code does not produce concatenations of aa through jz, but this is just an indexing exercise using the result. You would use the result indices to index into your array of items, as needed, to convert a result like 0,0,0,1 into a result like aa,aa,aa,ab.
This isn't a performance win across the board. The Python method is still faster for smaller test cases, and larger test cases (e.g. n = 260, k = 5) will exceed available memory on the GPU.
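To put a rough number on that last point (my own back-of-envelope estimate, not from the original answer): for n = 260, k = 5 the output array alone would need on the order of 100 GB, which is why that case falls over on a typical GPU.

from math import comb

# back-of-envelope memory estimate for n = 260, k = 5
n, k = 260, 5
n_combos = comb(n - 1 + k, k)        # ~1.03e10 combinations
bytes_needed = n_combos * k * 2      # k uint16 values per combination
print(n_combos, bytes_needed / 1e9)  # roughly 103 GB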
In this snippet of Python code, fun iterates through the array arr and, for every pair of sections, counts the number of identical integers in the two array sections (it simulates a matrix). This makes n*(n-1)/2*m comparisons in total, giving a time complexity of O(n^2) for fixed m.
Are there programming solutions or ways of reframing this problem that would yield equivalent results but have reduced time complexity?
# n > 500000, 0 < i < n, m = 100
# dim(arr) = n*m, 0 < arr[x] < 4294967311
arr = mp.RawArray(ctypes.c_uint, n*m)

def fun(i):
    for j in range(i-1, 0, -1):
        count = 0
        for k in range(0, m):
            count += (arr[i*m+k] == arr[j*m+k])
        if count/m > 0.7:
            return (i, j)
    return ()
arr is a shared memory array, therefore it's best kept read-only for simplicity and performance reasons.
arr is implemented as a 1D RawArray from multiprocessing. The reason for this is that it has by far the fastest performance according to my tests. Using a numpy 2D array, for example, like this:
arr = np.ctypeslib.as_array(mp.RawArray(ctypes.c_uint, n*m)).reshape(n,m)
would provide vectorization capabilities, but increases the total runtime by an order of magnitude - 250 s vs. 30 s for n = 1500, an increase of roughly 733%.
Since you can't change the array characteristics at all, I think you're stuck with O(n^2). numpy would gain some vectorization, but would change the access for others sharing the array. Start with the innermost operation:
for k in range(0, m):
    count += (arr[i][k] == arr[j][k])
Change this to a one-line assignment:
count = sum(arr[i][k] == arr[j][k] for k in range(m))
Now, if this is truly an array, rather than a list of lists, use the array package's vectorization to simplify the loops, one at a time:
count = sum(arr[i] == arr[j]) # results in a vector of counts
You can now return the j indices where count[j] / m > 0.7. Note that there's no real need to return i for each one: it's constant within the function, and the calling program already has the value. Your array package likely has a pair of vectorized indexing operations that can return those indices. If you're using numpy, those are easy enough to look up on this site.
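If numpy is available, a hedged sketch of that last step (fun_np and the threshold parameter are illustrative names of mine, not from the original code) compares row i against all earlier rows at once and returns the matching j indices:

import numpy as np

def fun_np(arr, i, threshold=0.7):
    # assumes arr is an (n, m) numpy array of unsigned integers
    m = arr.shape[1]
    # per-row count of positions where row i matches each earlier row
    counts = (arr[:i] == arr[i]).sum(axis=1)
    # indices j < i whose match fraction exceeds the threshold
    return np.nonzero(counts / m > threshold)[0]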
So after fiddling around some more, I was able to cut down the running time greatly with help from NumPy's vectorization and Numba's JIT compiler. Going back to the original code:
arr = mp.RawArray(ctypes.c_uint, n*m)

def fun(i):
    for j in range(i-1, 0, -1):
        count = 0
        for k in range(0, m):
            count += (arr[i*m+k] == arr[j*m+k])
        if count/m > 0.7:
            return (i, j)
    return ()
We can leave out the bottom return statement as well as dismiss the idea of using count entirely, leaving us with:
def fun(i):
    for j in range(i-1, 0, -1):
        if sum(arr[i*m+k] == arr[j*m+k] for k in range(m)) > 0.7*m:
            return (i, j)
Then, we change the array arr to a NumPy format:
np_arr = np.frombuffer(arr, dtype='int32').reshape(n, m)
The important thing to note here is that we do not use a NumPy array as a shared memory array to be written from multiple processes, avoiding the overhead pitfall.
Finally, we apply Numba's decorator and rewrite the sum function in vector form so that it works with the new array:
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def fun(i):
    for j in range(i-1, 0, -1):
        if np.sum(np_arr[i] == np_arr[j]) > 0.7*m:
            return (i, j)
This reduced the running time to 7.9s, which is definitely a victory for me.
I am trying to create a loop in Python with numpy that will give me a variable times with 5 numbers generated randomly between 0 and 20. However, I want there to be one condition: that none of the differences between adjacent elements in that list are less than 1. What is the best way to achieve this? I tried with the last two lines of code below, but they are most likely wrong.
for j in range(1, 6):
    times = np.random.rand(1, 5) * 20
    times.sort()
    print times
    da = np.diff(times)
    if da.sum < 1: break
For instance, for one iteration, this would not be good:
4.25230915 4.36463992 10.35915732 12.39446368 18.46893283
But something like this would be perfect:
1.47166904 6.85610453 10.81431629 12.10176092 15.53569052
Since you are using numpy, you might as well use the built-in functions for uniform random numbers.
def uniform_min_range(a, b, n, min_dist):
    while True:
        x = np.random.uniform(a, b, size=n)
        x.sort()  # sort in place
        if np.all(np.diff(x) >= min_dist):
            return x
It uses the same trial-and-error approach as the previous answer, so depending on the parameters the time to find a solution can be large.
Use a hit and miss approach to guarantee uniform distribution. Here is a straight-Python implementation which should be tweakable for numpy:
import random

def randSpacedPoints(n, a, b, minDist):
    # draws n random numbers in [a,b]
    # with the property that their distance apart is >= minDist
    # uses a hit-miss approach
    while True:
        nums = [a + (b-a)*random.random() for i in range(n)]
        nums.sort()
        if all(nums[i] + minDist < nums[i+1] for i in range(n-1)):
            return nums
For example,
>>> randSpacedPoints(5,0,20,1)
[0.6681336968970486, 6.882374558960349, 9.73325447748434, 11.774594560239493, 16.009157676493903]
If there is no feasible solution this will hang in an infinite loop (so you might want to add a safety parameter which controls the number of trials).
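A hedged sketch of that safety parameter (the function name and max_tries are mine) could look like this:

import random

def rand_spaced_points_safe(n, a, b, min_dist, max_tries=100000):
    # same hit-and-miss idea, but give up after max_tries attempts
    for _ in range(max_tries):
        nums = sorted(a + (b - a) * random.random() for _ in range(n))
        if all(nums[i] + min_dist < nums[i + 1] for i in range(n - 1)):
            return nums
    raise ValueError("no admissible sample found within max_tries attempts")

rand_spaced_points_safe(5, 0, 20, 1) then behaves like randSpacedPoints(5, 0, 20, 1) above, but raises instead of hanging when the spacing is infeasible.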
I have a block of code that I need to optimize as much as possible since I have to run it several thousand times.
What it does is find, for a random float, the closest float in one sub-list of a given array, and store the corresponding float (i.e. the one with the same index) from the other sub-list of that array. It repeats the process until the sum of the stored floats reaches a certain limit.
Here's the MWE to make it clearer:
import numpy as np

# Define array with two sub-lists.
a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]

# Initialize empty final list.
b = []

# Run until the condition is met.
while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[1].
    idx = np.argmin(np.abs(u - a[1]))
    # Store value located in sub-list a[0].
    b.append(a[0][idx])
The code is reasonably simple but I haven't found a way to speed it up. I tried to adapt the great (and very fast) answer given to a similar question I asked some time ago, to no avail.
OK, here's a slightly left-field suggestion. As I understand it, you are just trying to sample uniformly from the elements in a[0] until you have a list whose sum exceeds some limit.
Although it will be more costly memory-wise, I think you'll probably find it's much faster to generate a large random sample from a[0] first, then take the cumsum and find where it first exceeds your limit.
For example:
import numpy as np

# array of reference float values, equivalent to a[0]
refs = np.random.uniform(0, 100, 10000)

def fast_samp_1(refs, lim=10000, blocksize=10000):

    # sample uniformly from refs
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)

    # find where the cumsum first exceeds your limit
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]

    # # if it's ok to be just under lim rather than just over then this might
    # # be quicker
    # return samp[samp_sum <= lim]
Of course, if the sum of the sample of blocksize elements is < lim then this will fail to give you a sample whose sum is >= lim. You could check whether this is the case, and append to your sample in a loop if necessary.
def fast_samp_2(refs, lim=10000, blocksize=10000):

    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)

    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:

        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))

    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Note that concatenating arrays is pretty slow, so it would probably be better to make blocksize large enough to be reasonably sure that the sum of a single block will be >= your limit, without being excessively large.
Update
I've adapted your original function a little bit so that its syntax more closely resembles mine.
def orig_samp(refs, lim=10000):

    # Initialize empty final list.
    b = []

    a1 = np.random.random(10000)

    # Run until the condition is met.
    while (sum(b) < lim):

        # Draw random [0,1) value.
        u = np.random.random()
        # Find closest value in sub-list a[1].
        idx = np.argmin(np.abs(u - a1))
        # Store value located in sub-list a[0].
        b.append(refs[idx])

    return b
Here's some benchmarking data.
%timeit orig_samp(refs, lim=10000)
# 100 loops, best of 3: 11 ms per loop
%timeit fast_samp_2(refs, lim=10000, blocksize=1000)
# 10000 loops, best of 3: 62.9 µs per loop
That's a good 3 orders of magnitude faster. You can do a bit better by reducing the blocksize a fraction - you basically want it to be comfortably larger than the length of the arrays you're getting out. In this case, you know that on average the output will be about 200 elements long, since the mean of a uniform draw between 0 and 100 is 50, and 10000 / 50 = 200.
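For instance, a hedged way to pick that blocksize automatically (the factor of 10 is an arbitrary safety margin of mine), reusing refs and fast_samp_2 from above:

# expected output length is roughly lim / mean(refs) (~ 10000 / 50 = 200 here)
expected_len = 10000 / refs.mean()
# choose a blocksize comfortably above that so the hstack path is rarely taken
blocksize = int(10 * expected_len)
samp = fast_samp_2(refs, lim=10000, blocksize=blocksize)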
Update 2
It's easy to get a weighted sample rather than a uniform sample - you can just pass the p= parameter to np.random.choice:
def weighted_fast_samp(refs, weights=None, lim=10000, blocksize=10000):

    samp = np.random.choice(refs, size=blocksize, replace=True, p=weights)
    samp_sum = np.cumsum(samp)

    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:

        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True,
                                   p=weights)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))

    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Write it in cython. That's going to get you a lot more for a high iteration operation.
http://cython.org/
One obvious optimization - don't recalculate the sum on each iteration; accumulate it instead:
b_sum = 0
while b_sum < 10000:
    ....
    idx = np.argmin(np.abs(u - a[1]))
    add_val = a[0][idx]
    b.append(add_val)
    b_sum += add_val
EDIT:
I think some minor improvement (check it out if you feel like it) may be achieved by pre-referencing sublists before the loop
a_0 = a[0]
a_1 = a[1]
...
while ...:
    ....
    idx = np.argmin(np.abs(u - a_1))
    b.append(a_0[idx])
It may save some on run time - though I don't believe it will matter that much.
Sort your reference array.
That allows log(n) lookups instead of needing to browse the whole list. (using bisect for example to find the closest elements)
For starters, I swap a[0] and a[1] to simplify the sort:
a = np.sort([np.random.random(10000), np.random.uniform(0., 100., 10000)])
Now, a is sorted by order of a[0], meaning if you are looking for the closest value to an arbitrary number, you can start by a bisect:
import bisect

while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[0].
    idx = bisect.bisect(a[0], u)
    # now, the closest value is at either idx or idx-1
    if idx == len(a[0]) or (idx != 0 and np.abs(a[0][idx] - u) > np.abs(a[0][idx - 1] - u)):
        idx = idx - 1
    # Store value located in sub-list a[1].
    b.append(a[1][idx])
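For what it's worth, the nearest-neighbour lookup inside that loop can also be written with numpy's searchsorted rather than bisect (this variant is my addition, not part of the original answer):

# u is the random draw and a[0] is sorted ascending, as in the loop above
idx = np.searchsorted(a[0], u)
# clamp to the valid range and pick whichever neighbour is closer
if idx == len(a[0]) or (idx != 0 and a[0][idx] - u > u - a[0][idx - 1]):
    idx -= 1
b.append(a[1][idx])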
Recently I became interested in the subset-sum problem, which is finding a zero-sum subset of a superset. I found some solutions on SO; in addition, I came across a particular solution that uses a dynamic programming approach. I translated its qualitative description into Python. I'm trying to optimize this for larger lists, which eats up a lot of my memory. Can someone recommend optimizations or other techniques to solve this particular problem? Here's my attempt in Python:
import random
from time import time
from itertools import product

time0 = time()

# create a zero matrix of size a (row), b (col)
def create_zero_matrix(a, b):
    return [[0]*b for x in xrange(a)]

# generate a list of size num with random integers with an upper and lower bound
def random_ints(num, lower=-1000, upper=1000):
    return [random.randrange(lower, upper+1) for i in range(num)]

# split a list up into N and P where N is the sum of the negative values and P the sum of the positive values.
# 0 does not count because of additive identity
def split_sum(A):
    N_list = []
    P_list = []
    for x in A:
        if x < 0:
            N_list.append(x)
        elif x > 0:
            P_list.append(x)
    return [sum(N_list), sum(P_list)]

# since the column indexes are in the range from 0 to P - N
# we would like to retrieve them based on the index in the range N to P
# n := row, m := col
def get_element(table, n, m, N):
    if n < 0:
        return 0
    try:
        return table[n][m - N]
    except:
        return 0

# same definition as above
def set_element(table, n, m, N, value):
    table[n][m - N] = value

# input array
#A = [1, -3, 2, 4]
A = random_ints(200)

[N, P] = split_sum(A)

# create a zero matrix of size m (row) by n (col)
#
# m := the number of elements in A
# n := P - N + 1 (by definition N <= s <= P)
#
# each element in the matrix will be a value of either 0 (false) or 1 (true)
m = len(A)
n = P - N + 1
table = create_zero_matrix(m, n)

# set first element in index (0, A[0]) to be true
# Definition: Q(1,s) := (x1 == s). Note that index starts at 0 instead of 1.
set_element(table, 0, A[0], N, 1)

# iterate through each table element
#for i in xrange(1, m): #row
#    for s in xrange(N, P + 1): #col
for i, s in product(xrange(1, m), xrange(N, P + 1)):
    if get_element(table, i - 1, s, N) or A[i] == s or get_element(table, i - 1, s - A[i], N):
        #set_element(table, i, s, N, 1)
        table[i][s - N] = 1

# find zero-sum subset solution
s = 0
solution = []
for i in reversed(xrange(0, m)):
    if get_element(table, i - 1, s, N) == 0 and get_element(table, i, s, N) == 1:
        s = s - A[i]
        solution.append(A[i])

print "Solution: ", solution

time1 = time()
print "Time execution: ", time1 - time0
I'm not quite sure whether your solution is exact or a PTA (poly-time approximation).
But, as someone pointed out, this problem is indeed NP-complete.
Meaning, every known exact algorithm has exponential time behavior in the size of the input.
Meaning, if you can process one operation in 0.1 nanoseconds (10,000,000,000 operations per second), then for a list of 59 elements it will take roughly:

    2^59 ops / 10,000,000,000 ops per second ≈ 2^26 seconds
    2^26 seconds / (3600 x 24 x 365 seconds per year) ≈ 2 years
You can find heuristics, which give you just a CHANCE of finding an exact solution in polynomial time.
On the other hand, if you restrict the problem (to another one) by using bounds for the values of the numbers in the set, then the problem's complexity reduces to polynomial time. But even then, the memory space consumed will be a polynomial of VERY high order.
The memory consumed will be much larger than the few gigabytes you have in RAM,
and even much larger than the few terabytes on your hard drive.
(That's for small values of the bound on the values of the elements in the set.)
Maybe this is the case with your dynamic programming algorithm.
It seemed to me that you were using a bound of 1000 when building your initialization matrix.
You can try a smaller bound - that is, if your input consistently consists of small values.
Good Luck!
Someone on Hacker News came up with the following solution to the problem, which I quite liked. It just happens to be in python :):
def subset_summing_to_zero (activities):
    subsets = {0: []}
    for (activity, cost) in activities.iteritems():
        old_subsets = subsets
        subsets = {}
        for (prev_sum, subset) in old_subsets.iteritems():
            subsets[prev_sum] = subset
            new_sum = prev_sum + cost
            new_subset = subset + [activity]
            if 0 == new_sum:
                new_subset.sort()
                return new_subset
            else:
                subsets[new_sum] = new_subset
    return []
I spent a few minutes with it and it worked very well.
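A tiny, hedged usage example for the function above (the activities dict is made up for illustration; note that the function is Python 2 because of iteritems()):

# each key is an activity label, each value its signed cost
activities = {'a': 4, 'b': -3, 'c': -1, 'd': 7}
print(subset_summing_to_zero(activities))  # ['a', 'b', 'c'], since 4 - 3 - 1 == 0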
An interesting article on optimizing python code is available here. Basically, the main result is that you should inline your frequent loops; in your case this would mean that, instead of calling get_element twice per loop, you put the actual code of that function inside the loop in order to avoid the function call overhead.
Hope that helps! Cheers
First thing that caught my eye:
def split_sum(A):
    N_list = 0
    P_list = 0
    for x in A:
        if x < 0:
            N_list += x
        elif x > 0:
            P_list += x
    return [N_list, P_list]
Some advice:
Try to use a 1D list, and use bitarray to keep the memory footprint to a minimum (http://pypi.python.org/pypi/bitarray) - you then only need to change the get/set functions. This should reduce your memory footprint by a factor of at least 64 (an integer in a list is a pointer to a typed integer object, so it can be a factor of 3*32). A sketch follows below.
Avoid using try/except; instead, figure out the proper ranges at the beginning - you may find that you gain a lot of speed.
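A minimal sketch of both suggestions together (the helper names and the explicit cols parameter are mine; this assumes the table is stored as a single flat bitarray):

from bitarray import bitarray

def create_zero_table(rows, cols):
    # one bit per cell instead of a Python int per cell
    table = bitarray(rows * cols)
    table.setall(False)
    return table

def get_element(table, n, m, N, cols):
    # explicit range checks instead of try/except
    col = m - N
    if n < 0 or col < 0 or col >= cols:
        return 0
    return table[n * cols + col]

def set_element(table, n, m, N, cols, value):
    table[n * cols + (m - N)] = bool(value)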
The following code works for Python 3.3+. I have used the itertools module in Python, which has some great methods to use.
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

nums = input("Enter the Elements").strip().split()
inputSum = int(input("Enter the Sum You want"))

for i, combo in enumerate(powerset(nums), 1):
    sum = 0
    for num in combo:
        sum += int(num)
    if sum == inputSum:
        print(combo)
The input and output are as follows:
Enter the Elements 1 2 3 4
Enter the Sum You want 5
('1', '4')
('2', '3')
Just change the values in your set w and correspondingly make an array x as big as the length of w, then pass the sum for which you want subsets as the last argument to the subsetsum function, and you will be done (if you want to check it with your own values).
def subsetsum(cs, k, r, x, w, d):
    x[k] = 1
    if cs+w[k] == d:
        for i in range(0, k+1):
            if x[i] == 1:
                print(w[i], end=" ")
        print()
    elif cs+w[k]+w[k+1] <= d:
        subsetsum(cs+w[k], k+1, r-w[k], x, w, d)
    if (cs+r-w[k] >= d) and (cs+w[k] <= d):
        x[k] = 0
        subsetsum(cs, k+1, r-w[k], x, w, d)

# driver for the above code
w = [2, 3, 4, 5, 0]
x = [0, 0, 0, 0, 0]
subsetsum(0, 0, sum(w), x, w, 7)