I have a binary map on which I do Connected Component Labeling and get something like this for a 64x64 grid - http://pastebin.com/bauas0NJ
Now I want to group them by label, so that I can find their area and their center of mass. This is what I do:
# ccl_np is the computed array from the previous step (see pastebin)
# I discard the label '1' as it's the background
unique, count = np.unique(ccl_np, return_counts=True)
xcm_array = []
ycm_array = []
for i in range(1, len(unique)):
    subarray = np.where(ccl_np == unique[i])
    xcm_array.append("{0:.5f}".format(sum(subarray[0]) / (count[i] * 1.)))
    ycm_array.append("{0:.5f}".format(sum(subarray[1]) / (count[i] * 1.)))
final_array = zip(xcm_array, ycm_array, count[1:])
I want fast code (as I will be doing this for grids of size 4096x4096) and was told to check out numba. Here's my naive attempt:
unique, inverse, count = np.unique(ccl_np, return_counts=True, return_inverse=True)
xcm_array = np.zeros(len(count), dtype=np.float32)
ycm_array = np.zeros(len(count), dtype=np.float32)
inverse = inverse.reshape(64, 64)

@numba.autojit
def mysolver(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i][j]
            local_count = count[pos]
            xcm_array[pos] += i / (local_count * 1.)
            ycm_array[pos] += j / (local_count * 1.)

mysolver(xcm_array, ycm_array, inverse, count)
final_array = zip(xcm_array, ycm_array, count)
To my surprise, using numba was slower than, or at best equal in speed to, the previous way. What am I doing wrong?
Also, can this be done in Cython, and will that be faster?
I am using the packages included in the latest Anaconda Python 2.7 distribution.
I believe the issue might be that you are timing jit'd code incorrectly. The first time you run the code, your timing includes the time it takes numba to compile the code. This is called warming up the jit. If you call it again, that cost is gone.
import numpy as np
import numba as nb

unique, inverse, count = np.unique(ccl_np, return_counts=True, return_inverse=True)
xcm_array = np.zeros(len(count), dtype=np.float32)
ycm_array = np.zeros(len(count), dtype=np.float32)
inverse = inverse.reshape(64, 64)

def mysolver(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i][j]
            local_count = count[pos]
            xcm_array[pos] += i / (local_count * 1.)
            ycm_array[pos] += j / (local_count * 1.)

@nb.jit(nopython=True)
def mysolver_nb(xcm_array, ycm_array, inverse, count):
    for i in range(64):
        for j in range(64):
            pos = inverse[i, j]
            local_count = count[pos]
            xcm_array[pos] += i / (local_count * 1.)
            ycm_array[pos] += j / (local_count * 1.)
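If you want to exclude compilation from a one-off measurement, you can warm the jit explicitly by calling the compiled function once before timing (a minimal illustration; %timeit below simply runs enough repetitions that the one-time compile only shows up in the "slowest run" warning):

# warm up the jit: the first call compiles, subsequent calls reuse the machine code
mysolver_nb(xcm_array, ycm_array, inverse, count)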
Then the timings with timeit, which runs the code multiple times. First the plain Python version:
In [4]:%timeit mysolver(xcm_array, ycm_array, inverse, count)
10 loops, best of 3: 25.8 ms per loop
and then with numba:
In [5]: %timeit mysolver_nb(xcm_array, ycm_array, inverse, count)
The slowest run took 3630.44 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 33.1 µs per loop
The numba code is ~1000 times faster.
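As an aside (my own sketch, not part of the answer above): the same grouped sums can be written without any explicit loop at all using np.bincount with coordinate weights, which also scales to the 4096x4096 case:

import numpy as np

unique, inverse, count = np.unique(ccl_np, return_counts=True, return_inverse=True)
ii, jj = np.indices(ccl_np.shape)  # row and column coordinate grids

# sum the coordinates of each label's pixels in one pass, then normalize;
# the background label is still included here and can be dropped as before
xcm_array = np.bincount(inverse, weights=ii.ravel()) / count
ycm_array = np.bincount(inverse, weights=jj.ravel()) / count
final_array = zip(xcm_array, ycm_array, count)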
I have come across a problem which requires rewriting a piece of code in vectorized form. The code shown below is a simplified illustration of the original problem:
K = 20
h, w = 15, 20
H, W = 1000-h, 2000-w

q = np.random.randint(0, 20, size=(H, W, K))  # random just for illustration
Q = np.zeros((H+h, W+w, K))

for n in range(H):
    for m in range(W):
        Q[n:n+h, m:m+w, :] += q[n, m, :]
This code takes a long time to execute, and it seems to me rather simple to vectorize.
I am aware of numpy's s_ function, which allows constructing slices and can in turn help with vectorizing code. But because every single element in Q is the result of multiple subsequent additions of elements from q, I found it difficult to proceed in that simple way.
I guess that np.add.at could be useful to cope with the sequential addition, but I have spent much time trying to make these two functions work for me and decided to ask for help, because I constantly get an
IndexError: failed to coerce slice entry of type numpy.ndarray to integer
for any attempt I make.
Maybe there is some other numpy magic which I am unaware of that could help me with my task, but it seems extremely difficult to google for it.
Well, you are basically summing across sliding windows along the first and second axes, which in the signal processing domain is termed convolution. For two axes that would be 2D convolution. Now, Scipy has it implemented as convolve2d, which can be used on each slice along the third axis.
Thus, we would have an implementation with it, like so -
from scipy.signal import convolve2d

kernel = np.ones((h, w), dtype=int)
m, n, r = q.shape[0]+h-1, q.shape[1]+w-1, q.shape[2]
out = np.empty((m, n, r), dtype=q.dtype)
for i in range(r):
    out[..., i] = convolve2d(q[..., i], kernel)
As it turns out, we can use fftconvolve from the same library, which allows us to work with higher-dimensional arrays. This would get us the output in a fully vectorized way, like so -
from scipy.signal import fftconvolve
out = fftconvolve(q,np.ones((h,w,1),dtype=int))
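As a quick sanity check of the window-sum/convolution equivalence (a 1-D toy example of my own):

import numpy as np

x = np.array([1, 2, 3, 4])
manual = np.zeros(6)
for n in range(4):
    manual[n:n+3] += x[n]  # spread each element over a window of length 3
print(np.allclose(manual, np.convolve(x, np.ones(3))))  # True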
Runtime test
Function definitions -
def original_app(q, K, h, w, H, W):
    Q = np.zeros((H+h-1, W+w-1, K))
    for n in range(H):
        for m in range(W):
            Q[n:n+h, m:m+w, :] += q[n, m, :]
    return Q

def convolve2d_app(q, K, h, w, H, W):
    kernel = np.ones((h, w), dtype=int)
    m, n, r = q.shape[0]+h-1, q.shape[1]+w-1, q.shape[2]
    out = np.empty((m, n, r), dtype=q.dtype)
    for i in range(r):
        out[..., i] = convolve2d(q[..., i], kernel)
    return out

def fftconvolve_app(q, K, h, w, H, W):
    return fftconvolve(q, np.ones((h, w, 1), dtype=int))
Timings and verification -
In [128]: # Setup inputs
...: K = 20
...: h, w = 15, 20
...: H, W = 200-h, 400-w
...: q = np.random.randint(0, 20, size=(H, W, K))
...:
In [129]: %timeit original_app(q,K,h,w,H,W)
1 loops, best of 3: 2.05 s per loop
In [130]: %timeit convolve2d_app(q,K,h,w,H,W)
1 loops, best of 3: 2.05 s per loop
In [131]: %timeit fftconvolve_app(q,K,h,w,H,W)
1 loops, best of 3: 233 ms per loop
In [132]: np.allclose(original_app(q,K,h,w,H,W),convolve2d_app(q,K,h,w,H,W))
Out[132]: True
In [133]: np.allclose(original_app(q,K,h,w,H,W),fftconvolve_app(q,K,h,w,H,W))
Out[133]: True
So, it seems the fftconvolve-based approach is doing really well there!
I am writing a time-consuming program. To reduce the runtime, I have tried my best to use numpy.dot instead of for loops.
However, I found the vectorized program to have much worse performance than the for-loop version:
import numpy as np
import datetime

kpt_list = np.zeros((10000, 20), dtype='float')
rpt_list = np.zeros((1000, 20), dtype='float')
h_r = np.zeros((20, 20, 1000), dtype='complex')
r_ndegen = np.zeros(1000, dtype='float')
r_ndegen.fill(1)
# setup completed

# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end - start).total_seconds())
# the result is 19.302483

# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end - start).total_seconds())
# the result is 7.74583
What is happening here?
The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:
def setup(n1=10000, n2=1000, n3=20, seed=None):
    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float)
    rpt_list = gen.randn(n2, n3).astype(np.float)
    h_r = (gen.randn(n3, n3, n2) + 1j * gen.randn(n3, n3, n2)).astype(np.complex)
    r_ndegen = gen.randn(n2).astype(np.float)
    return kpt_list, rpt_list, h_r, r_ndegen

def original_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)
    return kpt_data

def original_loop(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)
    return kpt_data
I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:
In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s
Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12
Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
    12                                            def original_vec(*args, **kwargs):
    13
    14         1        86498   86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15
    16         1        69700   69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947  1331947.0     5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0    93.7      kpt_data = h_r.dot(phase)
    19
    20         1            4        4.0      0.0      return kpt_data
You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
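A small sanity check of that correspondence with toy sizes (my own sketch):

import numpy as np

h_r = np.random.randn(4, 4, 6)
phase = np.random.randn(6, 8)
# for a 3-D times 2-D dot, numpy sums over the last axis of the first
# array and the first axis of the second: exactly 'ijk,kl->ijl'
print(np.allclose(h_r.dot(phase), np.einsum('ijk,kl->ijl', h_r, phase)))  # True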
The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:
In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop
I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
Here's a vectorized version that incorporates the reshaping operation:
def new_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
    return kpt_data.reshape(h_r.shape[:2] + (-1,))
The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
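To illustrate the broadcasting that replaces np.tile (a toy example of my own):

import numpy as np

a = np.arange(6.).reshape(3, 2)
d = np.array([1., 2., 4.])
# d[:, None] has shape (3, 1) and broadcasts across the columns of a,
# matching np.tile(d.reshape(3, 1), 2) without materializing the tiled copy
print(np.allclose(a / d[:, None], a / np.tile(d.reshape(3, 1), 2)))  # True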
By using the same random input data, we can check that the new version gives the same result as the original:
In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0)
In [6]: np.allclose(ans1, ans2)
Out[6]: True
Some performance benchmarks:
In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop
Update:
I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:
#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif
In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...
In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
Update 2:
I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:
In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
I suspect the first operation is hitting a resource limit. Maybe you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.
Part of my Python program contains the following piece of code, where a new grid is calculated based on data found in the old grid.
The grid is a two-dimensional list of floats. The code uses three for-loops:
for t in xrange(0, t, step):
    for h in xrange(1, height-1):
        for w in xrange(1, width-1):
            new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1] - 2 * (gr[h][w-1] + t * gr[h-1][w])
    gr = new_gr
return gr
The code is extremely slow for a large grid and a large time t.
I've tried to use Numpy to speed up this code, by substituting the inner loop
with:
J = np.arange(1, width-1)
new_gr[h][J] = gr[h][J] + gr[h][J-1] ...
But the results produced (the floats in the array) are about 10% smaller than
their list-calculation counterparts.
What loss of accuracy is to be expected when converting lists of floats to Numpy array of floats using np.array(pylist) and then doing a calculation?
How should I go about converting a triple for-loop to pretty and fast Numpy code? (or are there other suggestions for speeding up the code significantly?)
If gr is a list of floats, the first step if you are looking to vectorize with NumPy would be to convert gr to a NumPy array with np.array().
Next up, I am assuming that you have new_gr initialized with zeros of shape (height, width). The calculations being performed in the two innermost loops basically represent 2D convolution. So, you can use signal.convolve2d with an appropriate kernel. To decide on the kernel, we need to look at the scaling factors, make a 3 x 3 kernel out of them, and negate them to simulate the calculations we are doing at each iteration. Thus, you would have a vectorized solution with the two innermost loops removed for better performance, like so -
import numpy as np
from scipy import signal

# Get the scaling factors and negate them to get the kernel
kernel = -np.array([[0, 1-2*t, 0], [-1, 1, 0], [t, 0, 0]])

# Initialize output array, run 2D convolution and set values into it
out = np.zeros((height, width))
out[1:-1, 1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1, :-2]
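To see where those weights come from (my own aside): collecting terms in the original update gives new_gr[h][w] = gr[h][w] - gr[h][w-1] + (1-2*t)*gr[h-1][w] + t*gr[h+1][w-1], i.e. only four neighbours contribute, with weights 1, -1, 1-2*t and t; those are the four nonzero entries of the kernel above. A quick numeric check of that simplification:

import numpy as np

t = 1.0
g = np.random.rand(3, 3)  # a 3x3 neighbourhood centred on (h, w) = (1, 1)
lhs = g[1,1] + g[1,0] + g[0,1] + t*g[2,0] - 2*(g[1,0] + t*g[0,1])
rhs = g[1,1] - g[1,0] + (1 - 2*t)*g[0,1] + t*g[2,0]
print(np.isclose(lhs, rhs))  # True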
Verify output and runtime tests
Define functions:
def org_app(gr, t):
    new_gr = np.zeros((height, width))
    for h in xrange(1, height-1):
        for w in xrange(1, width-1):
            new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1] - 2 * (gr[h][w-1] + t * gr[h-1][w])
    return new_gr

def proposed_app(gr, t):
    kernel = -np.array([[0, 1-2*t, 0], [-1, 1, 0], [t, 0, 0]])
    out = np.zeros((height, width))
    out[1:-1, 1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1, :-2]
    return out
Verify -
In [244]: # Inputs
...: gr = np.random.rand(40,50)
...: height,width = gr.shape
...: t = 1
...:
In [245]: np.allclose(org_app(gr,t),proposed_app(gr,t))
Out[245]: True
Timings -
In [246]: # Inputs
...: gr = np.random.rand(400,500)
...: height,width = gr.shape
...: t = 1
...:
In [247]: %timeit org_app(gr,t)
1 loops, best of 3: 2.13 s per loop
In [248]: %timeit proposed_app(gr,t)
10 loops, best of 3: 19.4 ms per loop
@Divakar, I tried a couple of variations on your org_app. The fully vectorized version is:
def org_app4(gr, t):
    new_gr = np.zeros((height, width))
    I = np.arange(1, height-1)[:, None]
    J = np.arange(1, width-1)
    new_gr[I, J] = gr[I, J] + gr[I, J-1] + gr[I-1, J] + t * gr[I+1, J-1] - 2 * (gr[I, J-1] + t * gr[I-1, J])
    return new_gr
While half the speed of your proposed_app, it is closer in style to the original, and thus may help with understanding how nested loops can be vectorized.
An important step is the conversion of I into a column array, so that together I,J index a block of values.
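A toy example of that indexing (my own, not from the answer above): broadcasting a column of row indices against a row of column indices addresses a whole block at once:

import numpy as np

a = np.arange(20).reshape(4, 5)
I = np.arange(1, 3)[:, None]  # row indices as a column vector, shape (2, 1)
J = np.arange(1, 4)           # column indices, shape (3,)
print(a[I, J])                # the 2x3 interior block, no loops needed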
I am trying to optimize the generation of decay times for a radioactive isotope Monte Carlo.
That is, given nsims atoms of an isotope with a half-life of t12, when does each atom decay?
I tried to optimize this by generating random numbers for all un-decayed atoms at once using a single numpy.random.random call (I call this method parallel), but I hope there is still more performance to be gained. I also show a method that does the calculation for each atom individually (serial).
import numpy as np
import time
import matplotlib.pyplot as plt
import scipy.optimize

t12 = 3.1*60.
dt = 0.01
ln2 = np.log(2)

decay_exp = lambda t, A, tau: A * np.exp(-t/tau)

def serial(nsims):
    sim_start_time = time.clock()
    decay_time = np.zeros(nsims)
    for i in range(nsims):
        t = dt
        while decay_time[i] == 0:
            if np.random.random() > np.exp(-ln2*dt/t12):
                decay_time[i] = t
            t += dt
    sim_end_time = time.clock()
    return (sim_end_time - sim_start_time, decay_time)

def parallel(nsims):
    sim_start_time = time.clock()
    decay_time = np.zeros(nsims)
    t = dt
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0]
        idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
        decay_time[inot_decayed[np.where(idecay_check == True)[0]]] = t
        t += dt
    sim_end_time = time.clock()
    return (sim_end_time - sim_start_time, decay_time)
I'm interested in any suggestions that perform better than the parallel function and are pure Python, i.e. not Cython.
This method already improves greatly upon the serial method for large nsims.
There are still some speed gains to be had from your original "parallel" (vectorized is the correct word) execution.
Remark: this is micro-optimization, but it does still give a small performance increase.
import numpy as np

t12 = 3.1*60.
dt = 0.01
ln2 = np.log(2)
s = 98765

def parallel(nsims):  # your code, unaltered, except removed inaccurate timing method
    decay_time = np.zeros(nsims)
    t = dt
    np.random.seed(s)  # also had to add a seed to get comparable results
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0]
        idecay_check = np.random.random(len(inot_decayed)) > np.exp(-ln2*dt/t12)
        decay_time[inot_decayed[np.where(idecay_check == True)[0]]] = t
        t += dt
    return decay_time

def parallel_micro(nsims):  # micromanaged code
    decay_time = np.zeros(nsims)
    t = dt
    half_time = np.exp(-ln2*dt/t12)  # there was no need to calculate this again in every loop iteration
    np.random.seed(s)  # fixed seed to get comparable results
    while 0 in decay_time:
        inot_decayed = np.where(decay_time == 0)[0]  # only here you need the call to np.where
        # to my own surprise, len(some_array) is quicker than some_array.size
        # (function lookup vs attribute lookup)
        idecay_check = np.random.random(len(inot_decayed)) > half_time
        decay_time[inot_decayed[idecay_check]] = t  # no need for another np.where and certainly not for another boolean comparison
        t += dt
    return decay_time
You can run timing measurements with the timeit module. Profiling will tell you that the bottleneck here is the call to np.where.
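For example, a quick way to see that with the standard library profiler (my own snippet):

import cProfile

# sort by cumulative time; np.where dominates for this workload
cProfile.run('parallel_micro(10000)', sort='cumulative')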
Knowing that the bottleneck is np.where, you could get rid of it like this:
def parallel_micro2(nsims):
    decay_time = np.zeros(nsims)
    t = dt
    half_time = np.exp(-ln2*dt/t12)
    np.random.seed(s)
    indices = np.where(decay_time == 0)[0]
    u = len(indices)
    while u:
        decayed = np.random.random(u) > half_time
        decay_time[indices[decayed]] = t
        indices = indices[np.logical_not(decayed)]
        u = len(indices)
        t += dt
    return decay_time
And that does give a rather large speed increase:
In [2]: %timeit -n1 -r1 parallel_micro2(1e4)
1 loops, best of 1: 7.81 s per loop
In [3]: %timeit -n1 -r1 parallel_micro(1e4)
1 loops, best of 1: 29 s per loop
In [4]: %timeit -n1 -r1 parallel(1e4)
1 loops, best of 1: 33.5 s per loop
Don't forget to get rid of the call to np.random.seed when you're done optimizing.
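An aside that goes beyond micro-optimization (my own sketch, not part of the comparison above): decay times for a half-life t12 are exponentially distributed with mean t12/ln(2), so if the step-by-step Bernoulli structure isn't essential, you can draw all of them in a single vectorized call. Note this returns continuous times rather than times quantized to dt:

import numpy as np

t12 = 3.1 * 60.

def direct_sample(nsims):
    # inverse-transform sampling: one draw per atom, no while-loop at all
    return np.random.exponential(scale=t12 / np.log(2), size=int(nsims))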
I want to be able to vectorize this code:
def sobHypot(rec):
    a, b, c = rec.shape
    hype = np.ones((a, b, c))
    for i in xrange(c):
        x = ndimage.sobel(abs(rec[..., i])**2, axis=0, mode='constant')
        y = ndimage.sobel(abs(rec[..., i])**2, axis=1, mode='constant')
        hype[..., i] = np.hypot(x, y)
        hype[..., i] = hype[..., i].mean()
    index = hype.argmax()
    return index
where rec.shape returns (1024, 1024, 20).
Here's how you can avoid the for-loop with the sobel filter:
import numpy as np
from scipy.ndimage import sobel

def sobHypot_vec(rec):
    r = np.abs(rec)
    x = sobel(r, 0, mode='constant')
    y = sobel(r, 1, mode='constant')
    h = np.hypot(x, y)
    h = np.apply_over_axes(np.mean, h, [0, 1])
    return h.argmax()
I'm not sure if the sobel filter is particularly necessary in your application, and this is hard to test without your particular 20-layer 'image', but you could try using np.gradient instead of running the sobel twice. The advantage is that gradient runs in three dimensions. You can ignore the component in the third, and take the hypot of just the first two. This seems wasteful but is actually still faster in my tests.
For a variety of randomly generated images, r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j, this gives the same answer as your code, but test it to be sure, and possibly fiddle with the dx, dy arguments of np.gradient.
def grad_max(rec):
    g = np.gradient(np.abs(rec))[:2]  # ignore derivative in third dimension
    h = np.hypot(*g)
    h = np.apply_over_axes(np.mean, h, [0, 1])  # mean along first and second dimensions
    return h.argmax()
Using this code for timing:
def sobHypot_clean(rec):
    rs = rec.shape
    hype = np.ones(rs)
    r = np.abs(rec)
    for i in xrange(rs[-1]):
        ri = r[..., i]
        x = sobel(ri, 0, mode='constant')
        y = sobel(ri, 1, mode='constant')
        hype[..., i] = np.hypot(x, y).mean()
    return hype.argmax()
Timing:
In [1]: r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j
# Original Post
In [2]: timeit sobHypot(r)
1 loops, best of 3: 9.85 s per loop
#cleaned up a bit:
In [3]: timeit sobHypot_clean(r)
1 loops, best of 3: 7.64 s per loop
# vectorized:
In [4]: timeit sobHypot_vec(r)
1 loops, best of 3: 5.98 s per loop
# using np.gradient:
In [5]: timeit grad_max(r)
1 loops, best of 3: 4.12 s per loop
Please test any of these functions on your own images to be sure they give the desired output, since different types of arrays could react differently from the simple random tests I did.