Python/Numpy - Speeding up Monte Carlo method for radioactive decay

I am trying to optimize the generation of decay times in a Monte Carlo simulation of a radioactive isotope.
That is, given nsims atoms of an isotope with a half-life of t12, when does each atom decay?
I tried to optimize this by generating random numbers for all undecayed atoms at once with a single numpy.random.random call (I call this method parallel), but I hope there is still more performance to be gained. I also show a method that does this calculation for each atom individually (serial).
import numpy as np
import time
import matplotlib.pyplot as plt
import scipy.optimize

t12 = 3.1*60.  # half-life
dt = 0.01      # time step
ln2 = np.log(2)
decay_exp = lambda t,A,tau: A * np.exp(-t/tau)

def serial(nsims):
    sim_start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    decay_time = np.zeros(nsims)
    for i in range(nsims):
        t = dt
        while decay_time[i] == 0:
            if np.random.random() > np.exp(-ln2*dt/t12):
                decay_time[i] = t
            t += dt
    sim_end_time = time.perf_counter()
    return (sim_end_time - sim_start_time,decay_time)

def parallel(nsims):
    sim_start_time = time.perf_counter()
    decay_time = np.zeros(nsims)
    t = dt
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0]
        idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
        decay_time[inot_decayed[np.where(idecay_check==True)[0]]] = t
        t += dt
    sim_end_time = time.perf_counter()
    return (sim_end_time - sim_start_time,decay_time)
I'm interested in any suggestions that performs better than the parallel function that is pure python, i.e. not cython.
This method already improves greatly upon the serial method of calculating this for large nsims.

There are still some speed gains to be had from your original "parallel" implementation (vectorized is the more accurate word).
Note that this is micro-optimization, but it still gives a small performance increase.
import numpy as np

t12 = 3.1*60.
dt = 0.01
ln2 = np.log(2)
s = 98765

def parallel(nsims): # your code, unaltered, except removed inaccurate timing method
    decay_time = np.zeros(nsims)
    t = dt
    np.random.seed(s) # also had to add a seed to get comparable results
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0]
        idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
        decay_time[inot_decayed[np.where(idecay_check==True)[0]]] = t
        t += dt
    return decay_time

def parallel_micro(nsims): # micromanaged code
    decay_time = np.zeros(nsims)
    t = dt
    half_time = np.exp(-ln2*dt/t12) # there was no need to calculate this again in every loop iteration
    np.random.seed(s) # fixed seed to get comparable results
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0] # only here you need the call to np.where
        # to my own surprise, len(some_array) is quicker than some_array.size (function lookup vs attribute lookup)
        idecay_check = np.random.random(len(inot_decayed)) > half_time
        decay_time[inot_decayed[idecay_check]] = t # no need for another np.where and certainly not for another boolean comparison
        t += dt
    return decay_time
You can run timing measurements with the timeit module. Profiling will tell you that the bottleneck here is the call to np.where.
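As a minimal sketch of such a measurement outside IPython, the standard-library timeit module can time a callable directly. The helper name best_time and the stand-in workload draw are illustrative assumptions, not part of the code above:

```python
import timeit

import numpy as np


def best_time(func, *args, repeat=3, number=1):
    """Return the best per-call wall-clock time in seconds over `repeat` runs."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def draw(n):
    # stand-in workload: one vectorized draw of n uniform random numbers
    return np.random.random(n)


elapsed = best_time(draw, 100_000)
print("best of 3: %.6f s" % elapsed)
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.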
Knowing that the bottleneck is np.where, you could get rid of it like this:
def parallel_micro2(nsims):
    decay_time = np.zeros(nsims)
    t = dt
    half_time = np.exp(-ln2*dt/t12)
    np.random.seed(s)
    indices = np.where(decay_time==0)[0]
    u = len(indices)
    while u:
        decayed = np.random.random(u) > half_time
        decay_time[indices[decayed]] = t
        indices = indices[np.logical_not(decayed)]
        u = len(indices)
        t += dt
    return decay_time
And that does give a rather large speed increase:
In [2]: %timeit -n1 -r1 parallel_micro2(1e4)
1 loops, best of 1: 7.81 s per loop
In [3]: %timeit -n1 -r1 parallel_micro(1e4)
1 loops, best of 1: 29 s per loop
In [4]: %timeit -n1 -r1 parallel(1e4)
1 loops, best of 1: 33.5 s per loop
Don't forget to get rid of the call to np.random.seed when you're done optimizing.
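If only the final decay times matter, the time-stepping loop can possibly be dropped altogether. In the simulation above each atom survives a step with probability exp(-ln2*dt/t12), so the number of steps until decay follows a geometric distribution, which NumPy can sample directly. This is a different technique from the loops above, sketched under that assumption; the function name decay_times_geometric is hypothetical, and the result will not reproduce the loop's random stream even with the same seed.

```python
import numpy as np

t12 = 3.1 * 60.0
dt = 0.01
ln2 = np.log(2)


def decay_times_geometric(nsims, seed=None):
    """Sample decay times for nsims atoms without a time-stepping loop."""
    rng = np.random.default_rng(seed)
    p_decay = 1.0 - np.exp(-ln2 * dt / t12)  # decay probability per step
    # geometric() returns the number of trials up to and including the
    # first success, so the smallest possible value is 1 (decay at t = dt)
    steps = rng.geometric(p_decay, size=nsims)
    return steps * dt


times = decay_times_geometric(100_000, seed=0)
# the sample mean should be close to the mean lifetime t12 / ln2
print(times.mean(), t12 / ln2)
```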

Related

Converting `for` loop that can't be vectorized to sparse matrix

There are 2 boxes and a small gap that allows 1 particle per second from one box to enter the other box. Whether a particle will go from A to B, or B to A depends on the ratio Pa/Ptot (Pa: number of particles in box A, Ptot: total particles in both boxes).
To make it faster, I need to get rid of the for loops, however I can't find a way to either vectorize them or turn them into a sparse matrix that represents my for loop:
What about for loops you can't vectorize? The ones where the result at iteration n depends on what you calculated in iteration n-1, n-2, etc. You can define a sparse matrix that represents your for loop and then do a sparse matrix solve.
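But I can't figure out how to define a sparse matrix out of this. The simulation boils down to calculating:
$P_a(t_n) = P_a(t_{n-1}) + 2\,\left( r_n \, P_{tot} > P_a(t_{n-1}) \right) - 1$
where
$\left( r_n \, P_{tot} > P_a(t_{n-1}) \right)$
is the piece that gives me trouble when trying to express my problem as described here, with r_n the uniform random number drawn at step n. (Note: the contents of the parentheses are a boolean operation.)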
Questions:
Can I vectorize the for loop?
If not, how can I define a sparse matrix?
(bonus question) Why is the execution 27x faster in Python (0.027s) than in Octave (0.75s)?
Note: I implemented the simulation in both Python and Octave and will soon do it in Matlab, therefore the tags are correct.
Octave code
1; % starting with `function` causes errors
function arr = Px_simulation (Pa_init, Ptot, t_arr)
  t_size = size(t_arr);
  arr = zeros(t_size); % fixed size array is better than arr = []
  rand_arr = rand(t_size); % create all rand values at once
  _Pa = Pa_init;
  for _j=t_arr()
    if (rand_arr(_j) * Ptot > _Pa)
      _Pa += 1;
    else
      _Pa -= 1;
    endif
    arr(_j) = _Pa;
  endfor
endfunction

t = 1:10^5;
for _i=1:3
  Ptot = 100*10^_i;
  tic()
  Pa_simulation = Px_simulation(Ptot, Ptot, t);
  toc()
  subplot(2,2,_i);
  plot(t, Pa_simulation, "-2;simulation;")
  title(strcat("{P}_{a0}=", num2str(Ptot), ',P=', num2str(Ptot)))
endfor
Python
import numpy
import matplotlib.pyplot as plt
import timeit
import cpuinfo
from random import random

print('\nCPU: {}'.format(cpuinfo.get_cpu_info()['brand']))

PARTICLES_COUNT_LST = [1000, 10000, 100000]
DURATION = 10**5
t_vals = numpy.linspace(0, DURATION, DURATION)

def simulation(na_initial, ntotal, tvals):
    shape = numpy.shape(tvals)
    arr = numpy.zeros(shape)
    na_current = na_initial
    for i in range(len(tvals)):
        if random() > (na_current/ntotal):
            na_current += 1
        else:
            na_current -= 1
        arr[i] = na_current
    return arr

plot_lst = []
for i in PARTICLES_COUNT_LST:
    start_t = timeit.default_timer()
    n_a_simulation = simulation(na_initial=i, ntotal=i, tvals=t_vals)
    execution_time = (timeit.default_timer() - start_t)
    print('Execution time: {:.6}'.format(execution_time))
    plot_lst.append(n_a_simulation)

for i in range(len(PARTICLES_COUNT_LST)):
    plt.subplot(2, 2, i + 1)  # subplot indices start at 1, not 0
    plt.plot(t_vals, plot_lst[i], 'r')
    plt.grid(linestyle='dotted')
    plt.xlabel("time [s]")
    plt.ylabel("Particles in box A")
plt.show()

IIUC you can use cumsum() in both Octave and Numpy:
Octave:
>> p = rand(1, 5);
>> r = rand(1, 5);
>> p
p =
0.43804 0.37906 0.18445 0.88555 0.58913
>> r
r =
0.70735 0.41619 0.37457 0.72841 0.27605
>> cumsum (2*(p<(r+0.03)) - 1)
ans =
1 2 3 2 1
>> (2*(p<(r+0.03)) - 1)
ans =
1 1 1 -1 -1
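For reference, a minimal NumPy sketch of the same computation, using the values copied from the Octave session above. One caveat: a cumulative sum only reproduces the loop when the comparison threshold does not depend on the running total. In the question the threshold is the current particle count, so this should be read as an illustration of cumsum rather than a drop-in replacement for the simulation.

```python
import numpy as np

# values copied from the Octave session above
p = np.array([0.43804, 0.37906, 0.18445, 0.88555, 0.58913])
r = np.array([0.70735, 0.41619, 0.37457, 0.72841, 0.27605])

# the boolean comparison becomes 0/1, which is mapped to -1/+1 steps
steps = 2 * (p < r + 0.03) - 1
walk = np.cumsum(steps)  # running sum of the steps

print(steps.tolist())  # steps: 1, 1, 1, -1, -1
print(walk.tolist())   # walk:  1, 2, 3, 2, 1
```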
Also note that the following function will return values ([-1, 1]):

performance loss after vectorization in numpy

I am writing a time consuming program. To reduce the time, I have tried my best to use numpy.dot instead of for loops.
However, I found the vectorized program to have much worse performance than the for loop version:
import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583
What is happening here?

The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:
def setup(n1=10000, n2=1000, n3=20, seed=None):
    gen = np.random.RandomState(seed)
    # builtin float/complex: the np.float and np.complex aliases were removed in NumPy 1.24
    kpt_list = gen.randn(n1, n3).astype(float)
    rpt_list = gen.randn(n2, n3).astype(float)
    h_r = (gen.randn(n3, n3,n2) + 1j*gen.randn(n3, n3,n2)).astype(complex)
    r_ndegen = gen.randn(n2).astype(float)
    return kpt_list, rpt_list, h_r, r_ndegen

def original_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)
    return kpt_data

def original_loop(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)
    return kpt_data
I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:
In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s

Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
    12                                             def original_vec(*args, **kwargs):
    13
    14         1        86498    86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15
    16         1        69700    69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947  1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19
    20         1            4        4.0      0.0      return kpt_data
You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:
In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop
I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
Here's a vectorized version that incorporates the reshaping operation:
def new_vec(*args, **kwargs):
kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
return kpt_data.reshape(h_r.shape[:2] + (-1,))
The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
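A small sketch of both points, using tiny stand-in shapes (5 and 7 instead of 1000 and 10000) chosen purely for illustration. It checks that dividing by r_ndegen[:, None] gives the same result as dividing by the tiled array, and shows how -1 is resolved in a reshape:

```python
import numpy as np

rng = np.random.default_rng(0)
n_r, n_k = 5, 7  # small stand-ins for 1000 and 10000
phase = rng.standard_normal((n_r, n_k))
r_ndegen = rng.uniform(1.0, 2.0, n_r)

# explicit (n_r, n_k) copy of the divisor, as in the original code
tiled = phase / np.tile(r_ndegen.reshape(n_r, 1), n_k)
# (n_r, 1) divisor, stretched along the second axis by broadcasting
broadcast = phase / r_ndegen[:, None]
print(np.array_equal(tiled, broadcast))

# -1 lets numpy infer the dimension: 24 elements / 4 columns = 6 rows
flat = np.arange(24).reshape(-1, 4)
print(flat.shape)
```

The broadcast version never materializes the full-size divisor, which is where the memory saving comes from.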
By using the same random input data, we can check that the new version gives the same result as the original:
In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0)
In [6]: np.allclose(ans1, ans2)
Out[6]: True
Some performance benchmarks:
In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop
Update:
I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:
#if defined(HAVE_CBLAS)
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
(NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
return cblas_matrixproduct(typenum, ap1, ap2, out);
}
#endif
In other words numpy just checks whether either of the input arrays has > 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...
In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
Update 2:
I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:
In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
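The three formulations discussed here compute the same contraction. A minimal check on small random arrays (the shapes 3, 8 and 5 are arbitrary stand-ins) is sketched below; it only verifies equivalence of the results and says nothing about speed, which depends on the NumPy version and BLAS build and should be measured on your own setup.

```python
import numpy as np

rng = np.random.default_rng(0)
h_r = rng.standard_normal((3, 3, 8)) + 1j * rng.standard_normal((3, 3, 8))
phase = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))

# sum over the last axis of h_r and the first axis of phase, four ways
via_dot = h_r.dot(phase)
via_reshape = h_r.reshape(-1, 8).dot(phase).reshape(3, 3, -1)
via_tensordot = np.tensordot(h_r, phase, axes=1)
via_einsum = np.einsum('ijk,kl->ijl', h_r, phase)

print(via_dot.shape)
print(np.allclose(via_dot, via_reshape),
      np.allclose(via_dot, via_tensordot),
      np.allclose(via_dot, via_einsum))
```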

I suspect the first operation is hitting a resource limit. Maybe you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.

Optimize python for Connected Component Labeling Area of Subsets

I have a binary map on which I do Connected Component Labeling and get something like this for a 64x64 grid - http://pastebin.com/bauas0NJ
Now I want to group them by label, so that I can find their area and their center of mass. This is what I do:
#ccl_np is the computed array from the previous step (see pastebin)
#I discard the label '1' as it's the background
unique, count = np.unique(ccl_np, return_counts = True)
xcm_array = []
ycm_array = []
for i in range(1,len(unique)):
    subarray = np.where(ccl_np == unique[i])
    xcm_array.append("{0:.5f}".format((sum(subarray[0]))/(count[i]*1.)))
    ycm_array.append("{0:.5f}".format((sum(subarray[1]))/(count[i]*1.)))
final_array = zip(xcm_array,ycm_array,count[1:])
I want a fast code (as I will be doing this for grids of size 4096x4096) and was told to check out numba. Here's my naive attempt :
unique, inverse, count = np.unique(ccl_np, return_counts = True, return_inverse = True)
xcm_array = np.zeros(len(count),dtype=np.float32)
ycm_array = np.zeros(len(count),dtype=np.float32)
inverse = inverse.reshape(64,64)

@numba.autojit
def mysolver(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i][j]
            local_count = count[pos]
            xcm_array[pos] += i/(local_count*1.)
            ycm_array[pos] += j/(local_count*1.)

mysolver(xcm_array, ycm_array, inverse, count)
final_array = zip(xcm_array,ycm_array,count)
To my surprise, using numba was slower or at best equal to the speed of the previous way. What am I doing wrong ?
Also, can this be done in Cython and will that be faster ?
I am using the included packages in the latest Anaconda python 2.7 distribution.

I believe the issue might be that you are timing jit'd code incorrectly. The first time you run the code, your timing includes the time it takes numba to compile the code. This is called warming up the jit. If you call it again, that cost is gone.
import numpy as np
import numba as nb

unique, inverse, count = np.unique(ccl_np, return_counts = True, return_inverse = True)
xcm_array = np.zeros(len(count),dtype=np.float32)
ycm_array = np.zeros(len(count),dtype=np.float32)
inverse = inverse.reshape(64,64)

def mysolver(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i][j]
            local_count = count[pos]
            xcm_array[pos] += i/(local_count*1.)
            ycm_array[pos] += j/(local_count*1.)

@nb.jit(nopython=True)
def mysolver_nb(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i,j]
            local_count = count[pos]
            xcm_array[pos] += i/(local_count*1.)
            ycm_array[pos] += j/(local_count*1.)
Then time both versions with timeit, which runs the code multiple times. First the plain python version:
In [4]:%timeit mysolver(xcm_array, ycm_array, inverse, count)
10 loops, best of 3: 25.8 ms per loop
and then with numba:
In [5]: %timeit mysolver_nb(xcm_array, ycm_array, inverse, count)
The slowest run took 3630.44 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 33.1 µs per loop
The numba code is ~1000 times faster.
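For comparison, the same per-label sums can also be written without an explicit loop and without numba, using np.bincount with weights. This is a different technique from the answer above, sketched here on a tiny made-up label image; the function name centroids_bincount is hypothetical.

```python
import numpy as np


def centroids_bincount(labels):
    """Return (unique_labels, counts, row_centroid, col_centroid) for a 2D label image."""
    unique, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()  # one label index per pixel, flattened
    rows, cols = np.indices(labels.shape)  # row and column coordinate of every pixel
    # sum the coordinates of all pixels sharing a label, then divide by the pixel count
    row_cm = np.bincount(inverse, weights=rows.ravel(), minlength=len(unique)) / counts
    col_cm = np.bincount(inverse, weights=cols.ravel(), minlength=len(unique)) / counts
    return unique, counts, row_cm, col_cm


labels = np.array([[1, 1, 2],
                   [1, 3, 2],
                   [3, 3, 2]])
unique, counts, row_cm, col_cm = centroids_bincount(labels)
print(unique, counts)
print(row_cm, col_cm)
```

Whether this beats the jitted loop on a 4096x4096 grid would need to be measured; it mainly avoids the compilation step and the dependency.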

Vectorizing ndimage functions for my code

I want to be able to vectorize this code:
def sobHypot(rec):
    a, b, c = rec.shape
    hype = np.ones((a,b,c))
    for i in xrange(c):
        x=ndimage.sobel(abs(rec[...,i])**2,axis=0, mode='constant')
        y=ndimage.sobel(abs(rec[...,i])**2,axis=1, mode='constant')
        hype[...,i] = np.hypot(x,y)
        hype[...,i] = hype[...,i].mean()
    index = hype.argmax()
    return index
where rec.shape returns (1024, 1024, 20)

Here's how you can avoid the for-loop with the sobel filter:
import numpy as np
from scipy.ndimage import sobel
def sobHypot_vec(rec):
    r = np.abs(rec)
    x = sobel(r, 0, mode='constant')
    y = sobel(r, 1, mode='constant')
    h = np.hypot(x, y)
    h = np.apply_over_axes(np.mean, h, [0,1])
    return h.argmax()
I'm not sure if the sobel filter is particularly necessary in your application, and this is hard to test without your particular 20-layer 'image', but you could try using np.gradient instead of running the sobel twice. The advantage is that gradient runs in three dimensions. You can ignore the component in the third, and take the hypot of just the first two. This seems wasteful but is actually still faster in my tests.
For a variety of randomly generated images, r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j, this gives the same answer as your code, but test it to be sure, and possibly fiddle with the dx, dy arguments of np.gradient
def grad_max(rec):
    g = np.gradient(np.abs(rec))[:2] # ignore derivative in third dimension
    h = np.hypot(*g)
    h = np.apply_over_axes(np.mean, h, [0,1]) # mean along first and second dimension
    return h.argmax()
Using this code for timing:
def sobHypot_clean(rec):
    rs = rec.shape
    hype = np.ones(rs)
    r = np.abs(rec)
    for i in xrange(rs[-1]):
        ri = r[...,i]
        x = sobel(ri, 0, mode='constant')
        y = sobel(ri, 1, mode='constant')
        hype[...,i] = np.hypot(x,y).mean()
    return hype.argmax()
Timing:
In [1]: r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j
# Original Post
In [2]: timeit sobHypot(r)
1 loops, best of 3: 9.85 s per loop
#cleaned up a bit:
In [3]: timeit sobHypot_clean(r)
1 loops, best of 3: 7.64 s per loop
# vectorized:
In [4]: timeit sobHypot_vec(r)
1 loops, best of 3: 5.98 s per loop
# using np.gradient:
In [5]: timeit grad_max(r)
1 loops, best of 3: 4.12 s per loop
Please test any of these functions on your own images to be sure they give the desired output, since different types of arrays could react differently from the simple random tests I did.
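The reduction step shared by these functions (a mean over the first two axes followed by argmax) can be checked in isolation. The sketch below uses a small random array as a stand-in for the (1024, 1024, 20) hypot result and compares np.apply_over_axes with the equivalent mean(axis=(0, 1)) and with an explicit per-slice loop:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.random((6, 5, 4))  # stand-in for the hypot array, 4 "layers"

per_slice = np.apply_over_axes(np.mean, h, [0, 1])  # keeps dims: shape (1, 1, 4)
same = h.mean(axis=(0, 1))                          # drops dims: shape (4,)
loop = np.array([h[..., i].mean() for i in range(h.shape[-1])])

# argmax on the (1, 1, 4) array is a flat index, which here equals the layer index
print(per_slice.argmax(), same.argmax(), loop.argmax())
```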

paralellize loop over iter

I am having performance issues with my code.
The step marked # IIII takes hours. I used to materialize the itertools.product result, but thanks to a user's suggestion I no longer do pro_data = product(array_b,array_a). This helped with the memory issues, but the step is still very time consuming.
I would like to parallelize it with multithreading or multiprocessing; I am grateful for whatever you can suggest.
Explanation: I have two arrays that contain the x and y values of particles. For each particle (defined by two coordinates) I want to calculate a function with another particle. For the combinations I use the itertools.product method and loop over every particle. I run over 50000 particles in total, so I have N*N/2 combinations to calculate.
Thanks in advance
import numpy as np
import matplotlib.pyplot as plt
from itertools import product,combinations_with_replacement
def func(ar1,ar2,ar3,ar4): #example func that takes four arguments
    return (ar1*ar2**22+np.sin(ar3)+ar4)

def newdist(a):
    return func(a[0][0],a[0][1],a[1][0],a[1][1])
x_edges = np.logspace(-3,1, num=25) #prepare x-axis for histogram
x_mean = 10**((np.log10(x_edges[:-1])+np.log10(x_edges[1:]))/2)
x_width=x_edges[1:]-x_edges[:-1]
hist_data=np.zeros([len(x_edges)-1])
array1=np.random.uniform(0.,10.,100)
array2=np.random.uniform(0.,10.,100)
array_a = np.dstack((array1,array1))[0]
array_b = np.dstack((array2,array2))[0]
# IIII
for i in product(array_a,array_b):
    (result,bins) = np.histogram(newdist(i),bins=x_edges)
    hist_data+=result
hist_data = np.array(map(float, hist_data))
plt.bar(x_mean,hist_data,width=x_width,color='r')
plt.show()
-----EDIT-----
I used this code now:
def mp_dist(array_a,array_b, d, bins): #d chunks AND processes
    def worker(array_ab, out_q):
        """ push result in queue """
        outdict = {}
        outdict = vec_chunk(array_ab, bins)
        out_q.put(outdict)

    out_q = mp.Queue()
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    array_size_a=len(array_a)-(len(array_a)%d)
    array_size_b=len(array_b)-(len(array_b)%d)
    a_chunk = array_size_a / d
    b_chunk = array_size_b / d
    procs = []
    #prepare arrays for mp
    array_ab = np.empty((4, a_chunk, b_chunk))
    for j in xrange(d):
        for k in xrange(d):
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            p = mp.Process(target=worker, args=(array_ab, out_q))
            procs.append(p)
            p.start()
    resultarray = np.empty(len(bins)-1)
    for i in range(d):
        resultarray+=out_q.get()
    # Wait for all worker processes to finish
    for pro in procs:
        pro.join()
    print resultarray
    return resultarray
The problem here is that I cannot control the number of processes. How can I use mp.Pool() instead?

First, let's look at a straightforward vectorization of your problem. I have a feeling that you want your array_a and array_b to be exactly the same, i.e. the coordinates of the particles, but I am keeping them separate here.
I have turned your code into a function, to make timing easier:
def IIII(array_a, array_b, bins) :
    hist_data=np.zeros([len(bins)-1])
    for i in product(array_a,array_b):
        (result,bins) = np.histogram(newdist(i), bins=bins)
        hist_data+=result
    hist_data = np.array(map(float, hist_data))
    return hist_data
You can, by the way, generate your sample data in a less convoluted way as follows:
n = 100
array_a = np.random.uniform(0, 10, size=(n, 2))
array_b = np.random.uniform(0, 10, size=(n, 2))
So first we need to vectorize your func. I have done it so it can take any array of shape (4, ...). To spare memory, it is doing the calculation in place, and returning the first plane, i.e. array[0].
def func_vectorized(a) :
    a[1] **= 22
    np.sin(a[2], out=a[2])
    a[0] *= a[1]
    a[0] += a[2]
    a[0] += a[3]
    return a[0]
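Because func_vectorized works in place, it overwrites the array it is given and returns a view into it. The following sketch, on a small random (4, 3, 3) stack chosen only for illustration, compares it against the out-of-place func from the question and shows the aliasing:

```python
import numpy as np


def func(ar1, ar2, ar3, ar4):
    # out-of-place reference, as in the question
    return ar1 * ar2**22 + np.sin(ar3) + ar4


def func_vectorized(a):
    # in-place version from above: planes 0..2 of `a` are overwritten
    a[1] **= 22
    np.sin(a[2], out=a[2])
    a[0] *= a[1]
    a[0] += a[2]
    a[0] += a[3]
    return a[0]


rng = np.random.default_rng(0)
stack = rng.uniform(0.0, 1.0, size=(4, 3, 3))
expected = func(*stack)  # computed first, before anything is overwritten

work = stack.copy()  # pass a copy to keep `stack` intact
result = func_vectorized(work)

print(np.allclose(result, expected))
print(np.shares_memory(result, work))  # the result is a view into `work`
```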
With this function in place, we can write a vectorized version of IIII:
def IIII_vec(array_a, array_b, bins) :
    array_ab = np.empty((4, len(array_a), len(array_b)))
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    array_ab[[0, 1]] = a[:, :, None]
    array_ab[[2, 3]] = b[:, None, :]
    newdist = func_vectorized(array_ab)
    hist, _ = np.histogram(newdist, bins=bins)
    return hist
With n = 100 points, they both return the same:
In [2]: h1 = IIII(array_a, array_b, x_edges)
In [3]: h2 = IIII_vec(array_a, array_b, x_edges)
In [4]: np.testing.assert_almost_equal(h1, h2)
But the timing differences are already very relevant:
In [5]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 654 ms per loop
In [6]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.08 ms per loop
A 300x speedup! If you try it again with longer sample data, n = 1000, you can see that they both scale equally badly, as n**2, so the 300x stays there:
In [10]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 68.2 s per loop
In [11]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 229 ms per loop
So you are still looking at a good 10 min. of processing, which is not really that much when compared to the more than 2 days that your current solution would require.
Of course, for things to be so nice, you will need to fit a (4, 50000, 50000) array of floats into memory, something that my system cannot handle. But you can still keep things relatively fast by processing it in chunks. The following version of IIII_vec divides each array into d chunks. As written, the length of the array should be divisible by d. It wouldn't be too hard to overcome that limitation, but it would obfuscate the true purpose:
def IIII_vec_bis(array_a, array_b, bins, d=1) :
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    a_chunk = len(array_a) // d
    b_chunk = len(array_b) // d
    array_ab = np.empty((4, a_chunk, b_chunk))
    hist_data = np.zeros((len(bins) - 1,))
    for j in xrange(d) :
        for k in xrange(d) :
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            newdist = func_vectorized(array_ab)
            hist, _ = np.histogram(newdist, bins=bins)
            hist_data += hist
    return hist_data
First, let's check that it really works:
In [4]: h1 = IIII_vec(array_a, array_b, x_edges)
In [5]: h2 = IIII_vec_bis(array_a, array_b, x_edges, d=10)
In [6]: np.testing.assert_almost_equal(h1, h2)
And now some timings. With n = 100:
In [7]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.02 ms per loop
In [8]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
100 loops, best of 3: 12 ms per loop
But as you start having to have a larger and larger array in memory, doing it in chunks starts to pay off. With n = 1000:
In [12]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 223 ms per loop
In [13]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 208 ms per loop
With n = 10000 I can no longer call IIII_vec without getting an "array is too big" error, but the chunked version still runs:
In [18]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 21.8 s per loop
And just to show that it can be done, I have run it once with n = 50000:
In [23]: %timeit -n1 -r1 IIII_vec_bis(array_a, array_b, x_edges, d=50)
1 loops, best of 1: 543 s per loop
So a good 9 minutes of number crunching, which is not all that bad given it has computed 2.5 billion interactions.
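One possible way to lift the "length divisible by d" restriction is to split the inputs with np.array_split, which allows blocks whose lengths differ by one. The sketch below is a variant written for illustration, not the code timed above: the name hist_chunked is hypothetical, and it uses the out-of-place func from the question instead of the in-place func_vectorized, so it allocates temporaries per block.

```python
import numpy as np


def func(ar1, ar2, ar3, ar4):
    return ar1 * ar2**22 + np.sin(ar3) + ar4


def hist_chunked(array_a, array_b, bins, d=1):
    """Histogram func over all (a, b) pairs, processing d x d blocks one at a time."""
    hist_data = np.zeros(len(bins) - 1)
    for a_blk in np.array_split(array_a, d):  # block lengths may differ by one
        for b_blk in np.array_split(array_b, d):
            # (na, 1) against (1, nb) broadcasts to every pair in the block
            vals = func(a_blk[:, 0, None], a_blk[:, 1, None],
                        b_blk[None, :, 0], b_blk[None, :, 1])
            hist, _ = np.histogram(vals, bins=bins)
            hist_data += hist
    return hist_data


rng = np.random.default_rng(0)
array_a = rng.uniform(0, 10, size=(103, 2))  # 103 and 97 are deliberately not divisible by 10
array_b = rng.uniform(0, 10, size=(97, 2))
x_edges = np.logspace(-3, 1, num=25)

full = hist_chunked(array_a, array_b, x_edges, d=1)
chunked = hist_chunked(array_a, array_b, x_edges, d=10)
print(np.array_equal(full, chunked))
```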

Use vectorized numpy operations. Replace the for-loop over product() with a single newdist() call by creating arguments using meshgrid().
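A minimal sketch of the meshgrid idea on tiny random inputs (4 and 3 particles, chosen only for illustration): index grids enumerate every (i, j) pair, and the result is compared with the explicit loop over product(). The func definition is copied from the question.

```python
import numpy as np
from itertools import product


def func(ar1, ar2, ar3, ar4):
    return ar1 * ar2**22 + np.sin(ar3) + ar4


rng = np.random.default_rng(0)
array_a = rng.uniform(0, 1, size=(4, 2))
array_b = rng.uniform(0, 1, size=(3, 2))

# index grids: ia[i, j] == i and ib[i, j] == j for every (i, j) pair
ia, ib = np.meshgrid(np.arange(len(array_a)), np.arange(len(array_b)), indexing='ij')
vectorized = func(array_a[ia, 0], array_a[ia, 1], array_b[ib, 0], array_b[ib, 1])

# reference: the same pairs, one at a time, in product() order
looped = np.array([func(a[0], a[1], b[0], b[1]) for a, b in product(array_a, array_b)])

print(vectorized.shape)
print(np.allclose(vectorized.ravel(), looped))
```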
To parallelize the problem, compute newdist() on slices of array_a, array_b that correspond to subblocks of meshgrid(). Here's an example using slices and multiprocessing.
Here's another example to demonstrate the steps: python loop -> vectorized numpy version -> parallel:
#!/usr/bin/env python
from __future__ import division
import math
import multiprocessing as mp

import numpy as np

try:
    from itertools import izip as zip
except ImportError:
    zip = zip # Python 3

def pi_loop(x, y, npoints):
    """Compute pi using Monte-Carlo method."""
    # note: the method converges to pi very slowly.
    return 4 * sum(1 for xx, yy in zip(x, y) if (xx**2 + yy**2) < 1) / npoints

def pi_vectorized(x, y, npoints):
    return 4 * ((x**2 + y**2) < 1).sum() / npoints # or just .mean()

def mp_init(x_shared, y_shared):
    global mp_x, mp_y
    mp_x, mp_y = map(np.frombuffer, [x_shared, y_shared]) # no copy

def mp_pi(args):
    # perform computations on slices of mp_x, mp_y
    start, end = args
    x = mp_x[start:end] # no copy
    y = mp_y[start:end]
    return ((x**2 + y**2) < 1).sum()

def pi_parallel(x, y, npoints):
    # compute pi using multiple processes
    pool = mp.Pool(initializer=mp_init, initargs=[x, y])
    step = 100000
    slices = ((start, start + step) for start in range(0, npoints, step))
    return 4 * sum(pool.imap_unordered(mp_pi, slices)) / npoints

def main():
    npoints = 1000000
    # create shared arrays
    x_sh, y_sh = [mp.RawArray('d', npoints) for _ in range(2)]
    # initialize arrays
    x, y = map(np.frombuffer, [x_sh, y_sh])
    x[:] = np.random.uniform(size=npoints)
    y[:] = np.random.uniform(size=npoints)
    for f, a, b in [(pi_loop, x, y),
                    (pi_vectorized, x, y),
                    (pi_parallel, x_sh, y_sh)]:
        pi = f(a, b, npoints)
        precision = int(math.floor(math.log10(npoints)) / 2 - 1 + 0.5)
        print("%.*f %.1e" % (precision + 1, pi, abs(pi - math.pi)))

if __name__=="__main__":
    main()
Time performance for npoints = 10_000_000:
pi_loop  pi_vectorized  pi_parallel
32.6     0.159          0.069        # seconds
It shows that the main performance benefit is from converting the python loop to its vectorized numpy analog.
