Fast sequential addition to rectangular subarray in numpy

Fast sequential addition to rectangular subarray in numpy - python

I have come across a problem which is to rewrite a piece of code in vectorized form. The code shown below is a simplified illustration of initial problem
K = 20
h, w = 15, 20
H, W = 1000-h, 2000-w
q = np.random.randint(0, 20, size=(H, W, K)) # random just for illustration
Q = np.zeros((H+h, W+w, K))
for n in range(H):
for m in range(W):
Q[n:n+h, m:m+w, :] += q[n, m, :]
This code takes long to execute and it seems to me it is rather simple to allow vectorized implementation.
I am aware of numpy's s_ function which allows to construct slices which in turn can help in code vectorizing. But because every single element in Q is the result of multiple subsequent additions of elements from q I found it difficult to proceed in that simple way.
I guess that np.add.at could be useful to cope with sequential addition. But i have spent much time trying to make this two functions work for me and decided to ask for help because I constantly get an
IndexError: failed to coerce slice entry of type numpy.ndarray to integer
for any attempt i make.
Maybe there is some another numpy's magic which I am unaware of and which could help me in my task but it seems extremely difficult to google for it.

Well you are basically summing across sliding windows along the first and second axes, which in signal processing domain is termed as convolution. For two axes that would be 2D convolution. Now, Scipy has it implemented as convolve2d and could be used for each slice along the third axis.
Thus, we would have an implementation with it, like so -
from scipy.signal import convolve2d
kernel = np.ones((h,w),dtype=int)
m,n,r = q.shape[0]+h-1, q.shape[1]+w-1, q.shape[2]
out = np.empty((m,n,r),dtype=q.dtype)
for i in range(r):
out[...,i] = convolve2d(q[...,i],kernel)
As it turns out, we can use fftconvolve from the same repo that allows us to work with higher-dimensional arrays. This would get us the output in a fully vectorized way, like so -
from scipy.signal import fftconvolve
out = fftconvolve(q,np.ones((h,w,1),dtype=int))
Runtime test
Function definitions -
def original_app(q,K,h,w,H,W):
Q = np.zeros((H+h-1, W+w-1, K))
for n in range(H):
for m in range(W):
Q[n:n+h, m:m+w, :] += q[n, m, :]
return Q
def convolve2d_app(q,K,h,w,H,W):
kernel = np.ones((h,w),dtype=int)
m,n,r = q.shape[0]+h-1, q.shape[1]+w-1, q.shape[2]
out = np.empty((m,n,r),dtype=q.dtype)
for i in range(r):
out[...,i] = convolve2d(q[...,i],kernel)
return out
def fftconvolve_app(q,K,h,w,H,W):
return fftconvolve(q,np.ones((h,w,1),dtype=int))
Timings and verification -
In [128]: # Setup inputs
...: K = 20
...: h, w = 15, 20
...: H, W = 200-h, 400-w
...: q = np.random.randint(0, 20, size=(H, W, K))
...:
In [129]: %timeit original_app(q,K,h,w,H,W)
1 loops, best of 3: 2.05 s per loop
In [130]: %timeit convolve2d_app(q,K,h,w,H,W)
1 loops, best of 3: 2.05 s per loop
In [131]: %timeit fftconvolve_app(q,K,h,w,H,W)
1 loops, best of 3: 233 ms per loop
In [132]: np.allclose(original_app(q,K,h,w,H,W),convolve2d_app(q,K,h,w,H,W))
Out[132]: True
In [133]: np.allclose(original_app(q,K,h,w,H,W),fftconvolve_app(q,K,h,w,H,W))
Out[133]: True
So, it seems fftconvolve based approach is doing really well there!

Related

Sliding standard deviation on a 1D NumPy array

Suppose that you have an array and want to create another array, which's values are equal to standard deviation of first array's 10 elements successively. With the help of for loop, it can be written easily like below code. What I want to do is avoid using for loop for faster execution time. Any suggestions?
Code
a = np.arange(20)
b = np.empty(11)
for i in range(11):
b[i] = np.std(a[i:i+10])

You could create a 2D array of sliding windows with np.lib.stride_tricks.as_strided that would be views into the given 1D array and as such won't be occupying any more memory. Then, simply use np.std along the second axis (axis=1) for the final result in a vectorized way, like so -
W = 10 # Window size
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
out = np.std(a2D, axis=1)
Runtime test
Function definitions -
def original_app(a, W):
b = np.empty(a.size-W+1)
for i in range(b.size):
b[i] = np.std(a[i:i+W])
return b
def vectorized_app(a, W):
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
return np.std(a2D,1)
Timings and verification -
In [460]: # Inputs
...: a = np.arange(10000)
...: W = 10
...:
In [461]: np.allclose(original_app(a, W), vectorized_app(a, W))
Out[461]: True
In [462]: %timeit original_app(a, W)
1 loops, best of 3: 522 ms per loop
In [463]: %timeit vectorized_app(a, W)
1000 loops, best of 3: 1.33 ms per loop
So, around 400x speedup there!
For completeness, here's the equivalent pandas version -
import pandas as pd
def pdroll(a, W): # a is 1D ndarray and W is window-size
return pd.Series(a).rolling(W).std(ddof=0).values[W-1:]

Not so fancy, but the code with no loops would be something like this:
a = np.arange(20)
b = [a[i:i+10].std() for i in range(len(a)-10)]

performance loss after vectorization in numpy

I am writing a time consuming program. To reduce the time, I have tried my best to use numpy.dot instead of for loops.
However, I found vectorized program to have much worse performance than the for loop version:
import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is a the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
kpt = kpt_list[i, :]
phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583
What is happening here?

The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:
def setup(n1=10000, n2=1000, n3=20, seed=None):
gen = np.random.RandomState(seed)
kpt_list = gen.randn(n1, n3).astype(np.float)
rpt_list = gen.randn(n2, n3).astype(np.float)
h_r = (gen.randn(n3, n3,n2) + 1j*gen.randn(n3, n3,n2)).astype(np.complex)
r_ndegen = gen.randn(1000).astype(np.float)
return kpt_list, rpt_list, h_r, r_ndegen
def original_vec(*args, **kwargs):
kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
kpt_data = h_r.dot(phase)
return kpt_data
def original_loop(*args, **kwargs):
kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
kpt_data = np.zeros((20, 20, 10000), dtype='complex')
for i in range(10000):
kpt = kpt_list[i, :]
phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
kpt_data[:, :, i] = h_r.dot(phase)
return kpt_data
I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:
In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s
Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12
Line # Hits Time Per Hit % Time Line Contents
==============================================================
12 def original_vec(*args, **kwargs):
13
14 1 86498 86498.0 0.4 kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
15
16 1 69700 69700.0 0.3 r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
17 1 1331947 1331947.0 5.6 phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
18 1 22271637 22271637.0 93.7 kpt_data = h_r.dot(phase)
19
20 1 4 4.0 0.0 return kpt_data
You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:
In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop
I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
Here's a vectorized version that incorporates the reshaping operation:
def new_vec(*args, **kwargs):
kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
return kpt_data.reshape(h_r.shape[:2] + (-1,))
The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
By using the same random input data, we can check that the new version gives the same result as the original:
In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0)
In [6]: np.allclose(ans1, ans2)
Out[6]: True
Some performance benchmarks:
In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop
Update:
I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:
#if defined(HAVE_CBLAS)
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
(NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
return cblas_matrixproduct(typenum, ap1, ap2, out);
}
#endif
In other words numpy just checks whether either of the input arrays has > 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...
In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
Update 2:
I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:
In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop

I suspect the first operation is hitting the the resource limit. May be you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.

Converting a nested loop calculation to Numpy for speedup

Part of my Python program contains the follow piece of code, where a new grid
is calculated based on data found in the old grid.
The grid i a two-dimensional list of floats. The code uses three for-loops:
for t in xrange(0, t, step):
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
gr = new_gr
return gr
The code is extremly slow for a large grid and a large time t.
I've tried to use Numpy to speed up this code, by substituting the inner loop
with:
J = np.arange(1, width-1)
new_gr[h][J] = gr[h][J] + gr[h][J-1] ...
But the results produced (the floats in the array) are about 10% smaller than
their list-calculation counterparts.
What loss of accuracy is to be expected when converting lists of floats to Numpy array of floats using np.array(pylist) and then doing a calculation?
How should I go about converting a triple for-loop to pretty and fast Numpy code? (or are there other suggestions for speeding up the code significantly?)

If gr is a list of floats, the first step if you are looking to vectorize with NumPy would be to convert gr to a NumPy array with np.array().
Next up, I am assuming that you have new_gr initialized with zeros of shape (height,width). The calculations being performed in the two innermost loops basically represent 2D convolution. So, you can use signal.convolve2d with an appropriate kernel. To decide on the kernel, we need to look at the scaling factors and make a 3 x 3 kernel out of them and negate them to simulate the calculations we are doing with each iteration. Thus, you would have a vectorized solution with the two innermost loops being removed for better performance, like so -
import numpy as np
from scipy import signal
# Get the scaling factors and negate them to get kernel
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
# Initialize output array and run 2D convolution and set values into it
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
Verify output and runtime tests
Define functions :
def org_app(gr,t):
new_gr = np.zeros((height,width))
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
return new_gr
def proposed_app(gr,t):
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
return out
Verify -
In [244]: # Inputs
...: gr = np.random.rand(40,50)
...: height,width = gr.shape
...: t = 1
...:
In [245]: np.allclose(org_app(gr,t),proposed_app(gr,t))
Out[245]: True
Timings -
In [246]: # Inputs
...: gr = np.random.rand(400,500)
...: height,width = gr.shape
...: t = 1
...:
In [247]: %timeit org_app(gr,t)
1 loops, best of 3: 2.13 s per loop
In [248]: %timeit proposed_app(gr,t)
10 loops, best of 3: 19.4 ms per loop

#Divakar, I tried a couple of variations on your org_app. The fully vectorized version is:
def org_app4(gr,t):
new_gr = np.zeros((height,width))
I = np.arange(1,height-1)[:,None]
J = np.arange(1,width-1)
new_gr[I,J] = gr[I,J] + gr[I,J-1] + gr[I-1,J] + t * gr[I+1,J-1]-2 * (gr[I,J-1] + t * gr[I-1,J])
return new_gr
While half the speed of your proposed_app, it is closer in style to the original. And thus may help with understanding how nested loops can be vectorized.
An important step is the conversion of I into a column array, so that together I,J index a block of values.

Element-wise maximum of two sparse matrices

Is there an easy/build-in way to get the element-wise maximum of two (or ideally more) sparse matrices? I.e. a sparse equivalent of np.maximum.

This did the trick:
def maximum (A, B):
BisBigger = A-B
BisBigger.data = np.where(BisBigger.data < 0, 1, 0)
return A - A.multiply(BisBigger) + B.multiply(BisBigger)

No, there's no built-in way to do this in scipy.sparse. The easy solution is
np.maximum(X.A, Y.A)
but this is obviously going to be very memory-intensive when the matrices have large dimensions and it might crash your machine. A memory-efficient (but by no means fast) solution is
# convert to COO, if necessary
X = X.tocoo()
Y = Y.tocoo()
Xdict = dict(((i, j), v) for i, j, v in zip(X.row, X.col, X.data))
Ydict = dict(((i, j), v) for i, j, v in zip(Y.row, Y.col, Y.data))
keys = list(set(Xdict.iterkeys()).union(Ydict.iterkeys()))
XmaxY = [max(Xdict.get((i, j), 0), Ydict.get((i, j), 0)) for i, j in keys]
XmaxY = coo_matrix((XmaxY, zip(*keys)))
Note that this uses pure Python instead of vectorized idioms. You can try shaving some of the running time off by vectorizing parts of it.

Here's another memory-efficient solution that should be a bit quicker than larsmans'. It's based on finding the set of unique indices for the nonzero elements in the two arrays using code from Jaime's excellent answer here.
import numpy as np
from scipy import sparse
def sparsemax(X, Y):
# the indices of all non-zero elements in both arrays
idx = np.hstack((X.nonzero(), Y.nonzero()))
# find the set of unique non-zero indices
idx = tuple(unique_rows(idx.T).T)
# take the element-wise max over only these indices
X[idx] = np.maximum(X[idx].A, Y[idx].A)
return X
def unique_rows(a):
void_type = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
b = np.ascontiguousarray(a).view(void_type)
idx = np.unique(b, return_index=True)[1]
return a[idx]
Testing:
def setup(n=1000, fmt='csr'):
return sparse.rand(n, n, format=fmt), sparse.rand(n, n, format=fmt)
X, Y = setup()
Z = sparsemax(X, Y)
print np.all(Z.A == np.maximum(X.A, Y.A))
# True
%%timeit X, Y = setup()
sparsemax(X, Y)
# 100 loops, best of 3: 4.92 ms per loop

The latest scipy (13.0) defines element-wise booleans for sparse matricies. So:
BisBigger = B>A
A - A.multiply(BisBigger) + B.multiply(BisBigger)
np.maximum does not (yet) work because it uses np.where, which is still trying to get the truth value of an array.
Curiously B>A returns a boolean dtype, while B>=A is float64.

Here is a function that returns a sparse matrix that is element-wise maximum of two sparse matrices. It implements the answer by hpaulj:
def sparse_max(A, B):
"""
Return the element-wise maximum of sparse matrices `A` and `B`.
"""
AgtB = (A > B).astype(int)
M = AgtB.multiply(A - B) + B
return M
Testing:
A = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
B = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
M = sparse_max(A, B)
M2 = sparse_max(B, A)
# Test symmetry:
print((M.A == M2.A).all())
# Test that M is larger or equal to A and B, element-wise:
print((M.A >= A.A).all())
print((M.A >= B.A).all())

from scipy import sparse
from numpy import array
I = array([0,3,1,0])
J = array([0,3,1,2])
V = array([4,5,7,9])
A = sparse.coo_matrix((V,(I,J)),shape=(4,4))
A.data.max()
9
If you haven't already, you should try out ipython, you could have saved your self time my making your spare matrix A then simply typing A. then tab, this will print a list of methods that you can call on A. From this you would see A.data gives you the non-zero entries as an array and hence you just want the maximum of this.

Vectorizing ndimage functions for my code

I want to be able to vectorize this code:
def sobHypot(rec):
a, b, c = rec.shape
hype = np.ones((a,b,c))
for i in xrange(c):
x=ndimage.sobel(abs(rec[...,i])**2,axis=0, mode='constant')
y=ndimage.sobel(abs(rec[...,i])**2,axis=1, mode='constant')
hype[...,i] = np.hypot(x,y)
hype[...,i] = hype[...,i].mean()
index = hype.argmax()
return index
where rec,shape returns (1024,1024,20)

Here's how you can avoid the for-loop with the sobel filter:
import numpy as np
from scipy.ndimage import sobel
def sobHypot_vec(rec):
r = np.abs(rec)
x = sobel(r, 0, mode='constant')
y = sobel(r, 1, mode='constant')
h = np.hypot(x, y)
h = np.apply_over_axes(np.mean, h, [0,1])
return h.argmax()
I'm not sure if the sobel filter is particularly necessary in your application, and this is hard to test without your particular 20-layer 'image', but you could try using np.gradient instead of running the sobel twice. The advantage is that gradient runs in three dimensions. You can ignore the component in the third, and take the hypot of just the first two. This seems wasteful but is actually still faster in my tests.
For a variety of randomly generated images, r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j, this gives the same answer as your code, but test it to be sure, and possibly fiddle with the dx, dy arguments of np.gradient
def grad_max(rec):
g = np.gradient(np.abs(rec))[:2] # ignore derivative in third dimension
h = np.hypot(*g)
h = np.apply_over_axes(np.mean, h, [0,1]) # mean along first and second dimension
return h.argmax()
Using this code for timing:
def sobHypot_clean(rec):
rs = rec.shape
hype = np.ones(rs)
r = np.abs(rec)
for i in xrange(rs[-1]):
ri = r[...,i]
x = sobel(ri, 0, mode='constant')
y = sobel(ri, 1, mode='constant')
hype[...,i] = np.hypot(x,y).mean()
return hype.argmax()
Timing:
In [1]: r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j
# Original Post
In [2]: timeit sobHypot(r)
1 loops, best of 3: 9.85 s per loop
#cleaned up a bit:
In [3]: timeit sobHypot_clean(r)
1 loops, best of 3: 7.64 s per loop
# vectorized:
In [4]: timeit sobHypot_vec(r)
1 loops, best of 3: 5.98 s per loop
# using np.gradient:
In [5]: timeit grad_max(r)
1 loops, best of 3: 4.12 s per loop
Please test any of these functions on your own images to be sure they give the desired output, since different types of arrays could react differently from the simple random tests I did.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fast sequential addition to rectangular subarray in numpy - python

Related

Sliding standard deviation on a 1D NumPy array

performance loss after vectorization in numpy

Converting a nested loop calculation to Numpy for speedup

Element-wise maximum of two sparse matrices

Vectorizing ndimage functions for my code

Categories

Resources