I have a 3D numpy array of shape (t, n1, n2):
x = np.random.rand(10, 2, 4)
I need to calculate another 3D array y which is of shape (t, n1, n1) such that:
y[0] = np.cov(x[0,:,:])
...and so on for all slices along the first axis.
So, a loopy implementation would be:
y = np.zeros((10,2,2))
for i in np.arange(x.shape[0]):
    y[i] = np.cov(x[i, :, :])
Is there any way to vectorize this so I can calculate all covariance matrices in one go? I tried doing:
x1 = x.swapaxes(1, 2)
y = np.dot(x, x1)
But it didn't work.
Digging into the numpy.cov source code with its default parameters shows that np.cov(x[i,:,:]) boils down to:
N = x.shape[2]
m = x[i,:,:]
m = m - np.sum(m, axis=1, keepdims=True) / N  # subtract row means without modifying x in place
cov = np.dot(m, m.T) /(N - 1)
So, the task is to vectorize the loop over i and process all of the data from x in one go. The mean subtraction in the third step can be applied to every slice at once with broadcasting. The final step is a sum-reduction along the last axis, performed separately for each slice along the first axis, and np.einsum implements this efficiently in a vectorized manner. Thus, the final implementation is:
N = x.shape[2]
m1 = x - x.sum(2, keepdims=True)/N
y_out = np.einsum('ijk,ilk->ijl',m1,m1) /(N - 1)
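As a side note on why the np.dot attempt in the question did not work: for N-D inputs, np.dot pairs every slice of the first array with every slice of the second, so the result has shape (t, n1, t, n1) rather than (t, n1, n1). The sketch below, which is an illustration and not part of the original answer, shows that shape and a batched alternative using the @ operator (np.matmul), checked against the loop. The variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((10, 2, 4))              # shape (t, n1, n2)
x1 = x.swapaxes(1, 2)                   # shape (t, n2, n1)

# np.dot combines every slice of x with every slice of x1.
dot_shape = np.dot(x, x1).shape
print(dot_shape)                        # (10, 2, 10, 2)

# The @ operator treats the leading axis as a batch dimension instead.
N = x.shape[2]
m1 = x - x.mean(axis=2, keepdims=True)  # subtract per-row means
y_matmul = (m1 @ m1.swapaxes(1, 2)) / (N - 1)

# Reference: the loopy version from the question.
y_loop = np.array([np.cov(x[i]) for i in range(x.shape[0])])
print(np.allclose(y_matmul, y_loop))
```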
Runtime test
In [155]: def original_app(x):
...:     n = x.shape[0]
...:     y = np.zeros((n,2,2))
...:     for i in np.arange(x.shape[0]):
...:         y[i]=np.cov(x[i,:,:])
...:     return y
...:
...: def proposed_app(x):
...:     N = x.shape[2]
...:     m1 = x - x.sum(2,keepdims=1)/N
...:     out = np.einsum('ijk,ilk->ijl',m1,m1) / (N - 1)
...:     return out
...:
In [156]: # Setup inputs
...: n = 10000
...: x = np.random.rand(n,2,4)
...:
In [157]: np.allclose(original_app(x),proposed_app(x))
Out[157]: True # Results verified
In [158]: %timeit original_app(x)
1 loops, best of 3: 610 ms per loop
In [159]: %timeit proposed_app(x)
100 loops, best of 3: 6.32 ms per loop
That is a huge speedup of roughly 96x!
Given two NumPy arrays, say:
import numpy as np
import numpy.random as rand
n = 1000
x = rand.binomial(n=1, p=.5, size=(n, 10))
y = rand.binomial(n=1, p=.5, size=(n, 10))
Is there a more efficient way to compute X in the following:
X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        X[i, j] = 1 * np.all(x[i] == y[j])
Approach #1 : Input arrays with 0s & 1s
For input arrays containing only 0s and 1s, we can reduce each row to a scalar by reading it as a binary number, which turns the inputs into 1D arrays, and then leverage broadcasting for the comparison, like so -
n = x.shape[1]
s = 2**np.arange(n)
x1D = x.dot(s)
y1D = y.dot(s)
Xout = (x1D[:,None] == y1D).astype(float)
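One caveat worth spelling out: the binary encoding only works while each row's code fits in the integer type. The sketch below wraps Approach #1 in a hypothetical helper, rows_equal_binary (a name introduced here, not from the original answer), that assumes int64 arithmetic and refuses rows wider than 63 columns, then checks the result against the nested loop on a small input.

```python
import numpy as np

def rows_equal_binary(x, y):
    """Pairwise row equality for 0/1 arrays via binary encoding (sketch)."""
    k = x.shape[1]
    # Assumption: codes are stored as signed 64-bit integers, so at most
    # 63 columns can be encoded without overflow.
    if k > 63:
        raise ValueError("too many columns for an int64 encoding")
    s = 2 ** np.arange(k, dtype=np.int64)   # weights 1, 2, 4, ...
    x1D = x.dot(s)                          # one integer code per row of x
    y1D = y.dot(s)                          # one integer code per row of y
    return (x1D[:, None] == y1D).astype(float)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(20, 10))
y = rng.integers(0, 2, size=(20, 10))

X_fast = rows_equal_binary(x, y)
# Reference: the nested loop from the question, written as a comprehension.
X_loop = np.array([[float(np.all(xi == yj)) for yj in y] for xi in x])
print(np.array_equal(X_fast, X_loop))
```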
Approach #2 : Generic case
For a generic case, we can use views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
x1D, y1D = view1D(x, y)
Xout = (x1D[:,None] == y1D).astype(float)
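For comparison, here is a plainer alternative that is not part of the original answer: compare all row pairs directly with broadcasting and reduce over the last axis. It is a different technique from the view-based one above, and it builds an intermediate boolean array of shape (n, m, k), so it may use noticeably more memory for large inputs. The helper name rows_equal_broadcast is hypothetical.

```python
import numpy as np

def rows_equal_broadcast(x, y):
    # (n, 1, k) == (1, m, k) broadcasts to (n, m, k);
    # all(axis=-1) is True only where every column matches.
    return (x[:, None, :] == y[None, :, :]).all(axis=-1).astype(float)

rng = np.random.default_rng(1)
x = rng.integers(0, 3, size=(15, 4))   # values beyond 0/1, i.e. the generic case
y = rng.integers(0, 3, size=(12, 4))

X_bc = rows_equal_broadcast(x, y)
# Reference: explicit pairwise comparison.
X_ref = np.array([[float(np.all(xi == yj)) for yj in y] for xi in x])
print(np.array_equal(X_bc, X_ref))
```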
Runtime test
# Setup
In [287]: np.random.seed(0)
...: n = 1000
...: x = rand.binomial(n=1, p=.5, size=(n, 10))
...: y = rand.binomial(n=1, p=.5, size=(n, 10))
# Original approach
In [288]: %%timeit
...: X = np.zeros((n, n))
...: for i in range(n):
...:     for j in range(n):
...:         X[i, j] = 1 * np.all(x[i] == y[j])
1 loop, best of 3: 4.69 s per loop
# Approach #1
In [290]: %%timeit
...: n = x.shape[1]
...: s = 2**np.arange(n)
...: x1D = x.dot(s)
...: y1D = y.dot(s)
...: Xout = (x1D[:,None] == y1D).astype(float)
1000 loops, best of 3: 1.42 ms per loop
# Approach #2
In [291]: %%timeit
...: x1D, y1D = view1D(x, y)
...: Xout = (x1D[:,None] == y1D).astype(float)
100 loops, best of 3: 18.5 ms per loop
I'm writing a script to handle some data from a sensor, represented here by the signal_gen function. As you can see, the testing function is quite loop-centered. Since this function is called many times, it is a bit slow, and I would appreciate a push in the right direction for optimising it.
I have read that it is possible to replace the for loop with vectorized array operations, but I can't get my head around how the I_avg[i] line should be written, since a single element y[i] is multiplied with the whole array x inside np.cos, and all of this is just one iteration of I_avg.
def testing(signal):
    y = np.arange(0.0108, 0.0135, 0.001)  # this one changes over time, set
                                          # to constant for easier reading
    x = np.arange(0, (len(signal)))
    I_avg = np.zeros(len(y))
    Q_avg = np.zeros_like(I_avg)
    for i in range(0, len(y)):
        I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
        Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
    D = np.power(I_avg, 2) + np.power(Q_avg, 2)
    max_index = np.argmax(D)
    phaseOut = np.arctan2(Q_avg[max_index], I_avg[max_index])
#just a test signal
def signal_gen():
    signal = np.random.random(size=251)
    return signal
One vectorized approach replaces the loop with a matrix multiplication: build all of the y[i] * x products at once with NumPy broadcasting, then use numpy.dot against signal to get I_avg and Q_avg, like so -
mult = 2*np.pi*y[:,None]*x
I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
Please note that for the given sample, the loopy version only has to run 3 iterations (y has length 3). As such, we won't see a huge speedup here.
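To make the shapes concrete, here is a self-contained sketch of the same idea wrapped in a hypothetical helper, iq_sums (a name introduced here for illustration), and checked against the loop from the question.

```python
import numpy as np

def iq_sums(signal, y):
    """Return (I_avg, Q_avg) for all frequencies in y at once (sketch)."""
    x = np.arange(len(signal))
    # y[:, None] has shape (len(y), 1), so mult has shape (len(y), len(signal)).
    mult = 2 * np.pi * y[:, None] * x
    # Each row of cos/sin(mult) dotted with signal replaces one loop iteration.
    return np.cos(mult).dot(signal), np.sin(mult).dot(signal)

rng = np.random.default_rng(0)
signal = rng.random(251)
y = np.arange(0.0108, 0.0135, 0.001)

I_avg, Q_avg = iq_sums(signal, y)

# Reference: the loop from the question.
x = np.arange(len(signal))
I_ref = np.array([(signal * np.cos(2 * np.pi * yi * x)).sum() for yi in y])
Q_ref = np.array([(signal * np.sin(2 * np.pi * yi * x)).sum() for yi in y])
print(np.allclose(I_avg, I_ref), np.allclose(Q_avg, Q_ref))

# The remaining steps of the original function work unchanged on the arrays.
max_index = np.argmax(I_avg**2 + Q_avg**2)
phase_out = np.arctan2(Q_avg[max_index], I_avg[max_index])
```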
Runtime test -
In [9]: #just a test signal
...: signal = np.random.random(size=251)
...: y = np.arange(0.0108, 0.0135, 0.001)
...: x = np.arange(0, (len(signal)))
...:
# Original approach
In [10]: %%timeit I_avg = np.zeros(len(y))
...: Q_avg = np.zeros_like(I_avg)
...: for i in range(0, len(y)):
...:     I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
...:     Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
...:
10000 loops, best of 3: 68 µs per loop
# Proposed approach
In [11]: %%timeit mult = 2*np.pi*y[:,None]*x
...: I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
...:
10000 loops, best of 3: 34.8 µs per loop
You can use np.einsum for broadcasting:
yx = 2*np.pi*np.einsum("i,j->ij", y, x)
I_avg = np.cos(yx) @ signal
Q_avg = np.sin(yx) @ signal
I have some code which generates the coordinates of a cylindrically-symmetric surface, with coordinates given as (r, theta, phi). At the moment, I generate the coordinates of one phi slice, and store that in a 2xN array (for N bins), and then in a for loop I copy this array for each value of phi from 0 to 2pi:
import numpy as np
# this is the number of bins that my surface is chopped into
numbins = 50
# these are the values for r
r_vals = np.linspace(0.0001, 50, numbins, endpoint = True)
# these are the values for theta
theta_vals = np.linspace(0, np.pi / 2, numbins, endpoint = True)
# I combine the r and theta values into a 2xnumbins array for one "slice" of phi
phi_slice = np.vstack([r_vals,theta_vals]).T
# this is the array which will store all of the coordinates of the surface
surface = np.zeros((numbins**2,3))
lower_bound = 0
upper_bound = numbins
# this is the size of the bin for phi
dphi = (2 * np.pi) / numbins
# this is the for loop I'd like to eliminate.
# For every value of phi, it puts a copy of the phi_slice array into
# the surface array, so that the surface is cylindrical about phi.
for phi in np.linspace(0, (2 * np.pi) - dphi, numbins):
    surface[lower_bound:upper_bound, :2] = phi_slice
    surface[lower_bound:upper_bound,2] = phi
    lower_bound += numbins
    upper_bound += numbins
I'm calling this routine in a numerical integration of 1e6 or 1e7 steps, and while numbins is 50 in the example above, in practice it'll be in the thousands. This for loop is a choking point, and I'd really like to eliminate it to speed things up. Is there a good NumPythonic way to do the same thing as this for loop?
Timing that loop:
In [9]: %%timeit
...: lower_bound, upper_bound = 0, numbins
...: for phi in np.linspace(0, (2 * np.pi) - dphi, numbins):
...:     surface[lower_bound:upper_bound,:2] = phi_slice
...:     surface[lower_bound:upper_bound,2] = phi
...:     lower_bound += numbins
...:     upper_bound += numbins
10000 loops, best of 3: 176 µs per loop
Offhand that doesn't look bad, though it does matter if the loop is repeated in some larger context. You are looping 50 times to fill an array of 7500 items. For the size of the task the number of loops isn't large.
Daniel's alternative is a little faster, but not drastically so:
In [12]: %%timeit
...: phi_slices = np.tile(phi_slice.T, numbins).T
...: phi_indices = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)
...: surface1 = np.c_[phi_slices, phi_indices]
...:
10000 loops, best of 3: 137 µs per loop
kazemakase's is a bit better still:
In [15]: %%timeit
...: phis = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)[:, np.newaxis]
...: slices = np.repeat(phi_slice[np.newaxis, :, :], numbins, axis=0).reshape(-1, 2)
...: surface2 = np.hstack([slices, phis])
...:
10000 loops, best of 3: 115 µs per loop
And my nomination (thanks to the others for helping me see the pattern); I'm taking advantage of broadcasting in assignment.
surface3 = np.zeros((numbins,numbins,3))
phis = np.linspace(0, (2 * np.pi) - dphi, numbins)
surface3[:,:,2] = phis[:,None]
surface3[:,:,:2] = phi_slice[None,:,:]
surface3.shape = (numbins**2,3)
A bit better:
In [50]: %%timeit
...: surface3 = np.zeros((numbins,numbins,3))
...: phis=np.linspace(0, (2 * np.pi) - dphi, numbins)
...: surface3[:,:,2]=phis[:,None]
...: surface3[:,:,:2]=phi_slice[None,:,:]
...: surface3.shape=(numbins**2,3)
...:
10000 loops, best of 3: 73.3 µs per loop
edit
Replacing:
surface3[:,:,:2]=phi_slice[None,:,:]
with
surface3[:,:,0]=r_vals[None,:]
surface3[:,:,1]=theta_vals[None,:]
squeezes out a bit more of a time improvement, especially if phi_slice was constructed solely for this use.
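Putting the broadcasting assignment and the edit together, a self-contained sketch might look like the following. The helper name build_surface is hypothetical, and np.empty is used instead of np.zeros on the assumption that every element gets overwritten; the result is checked against a loop equivalent to the one in the question.

```python
import numpy as np

def build_surface(r_vals, theta_vals, phis):
    """Stack one (r, theta) slice per phi value using broadcasting (sketch)."""
    n_phi, n_bins = len(phis), len(r_vals)
    surface = np.empty((n_phi, n_bins, 3))
    surface[:, :, 0] = r_vals[None, :]       # same r values in every phi block
    surface[:, :, 1] = theta_vals[None, :]   # same theta values in every phi block
    surface[:, :, 2] = phis[:, None]         # one constant phi per block
    return surface.reshape(-1, 3)

numbins = 50
r_vals = np.linspace(0.0001, 50, numbins)
theta_vals = np.linspace(0, np.pi / 2, numbins)
dphi = (2 * np.pi) / numbins
phis = np.linspace(0, (2 * np.pi) - dphi, numbins)

surface_bc = build_surface(r_vals, theta_vals, phis)

# Reference: block-by-block fill, equivalent to the loop in the question.
surface_loop = np.zeros((numbins**2, 3))
for k, phi in enumerate(phis):
    rows = slice(k * numbins, (k + 1) * numbins)
    surface_loop[rows, 0] = r_vals
    surface_loop[rows, 1] = theta_vals
    surface_loop[rows, 2] = phi
print(np.array_equal(surface_bc, surface_loop))
```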
Building this sort of blockwise array is done with np.repeat and np.tile
phi_slices = np.tile(phi_slice.T, numbins).T
phi_indices = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)
surface = np.c_[phi_slices, phi_indices]
This is a possible way to vectorize the loop:
phis = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)[:, np.newaxis]
slices = np.repeat(phi_slice[np.newaxis, :, :], numbins, axis=0).reshape(-1, 2)
surface2 = np.hstack([slices, phis])
print(np.allclose(surface, surface2))
# True
Here is what is going on in detail:
np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins) takes the array with all values of phi and repeats each element numbins times. [:, np.newaxis] brings the result into 2D shape (numbins**2, 1).
phi_slice[np.newaxis, :, :] brings phi_slice into 3D shape (1, numbins, 2). This array is repeated along the first axis, resulting in shape (numbins, numbins, 2). Finally reshape(-1, 2) brings the first two dimensions back together to final shape (numbins**2, 2).
np.hstack([slices, phis]) combines both parts of the final array.
Is there an elegant, NumPy way to apply the dot product elementwise? Or how can the code below be translated into a nicer version?
m0 # shape (5, 3, 2, 2)
m1 # shape (5, 2, 2)
r = np.empty((5, 3, 2, 2))
for i in range(5):
    for j in range(3):
        r[i, j] = np.dot(m0[i, j], m1[i])
Thanks in advance!
Approach #1
Use np.einsum -
np.einsum('ijkl,ilm->ijkm',m0,m1)
Steps involved :
Keep the first axes from the inputs aligned.
Sum-reduce the last axis of m0 against the second axis of m1, so that both are lost from the output.
Let the remaining axes from m0 and m1 spread out, with elementwise multiplications in an outer-product fashion.
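The steps above can be cross-checked with a short sketch. It also shows an alternative that is not in the original answer: the @ operator (np.matmul) broadcasts over leading axes, so inserting a length-1 axis into m1 lines it up with the second axis of m0. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m0 = rng.random((5, 3, 2, 2))
m1 = rng.random((5, 2, 2))

# einsum version from the answer.
r_einsum = np.einsum('ijkl,ilm->ijkm', m0, m1)

# Alternative: m1[:, None] has shape (5, 1, 2, 2), which broadcasts
# against the (5, 3) leading axes of m0 in a batched matrix product.
r_matmul = m0 @ m1[:, None, :, :]

# Reference: the nested loop from the question.
r_loop = np.empty((5, 3, 2, 2))
for i in range(5):
    for j in range(3):
        r_loop[i, j] = np.dot(m0[i, j], m1[i])

print(np.allclose(r_einsum, r_loop), np.allclose(r_matmul, r_loop))
```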
Approach #2
If you are after performance and the axis of sum-reduction is short, you are better off with a single loop that uses matrix multiplication via np.tensordot, like so -
s0,s1,s2,s3 = m0.shape
s4 = m1.shape[-1]
r = np.empty((s0,s1,s2,s4))
for i in range(s0):
    r[i] = np.tensordot(m0[i],m1[i],axes=([2],[0]))
Approach #3
Now, np.dot is most efficient on 2D inputs, which allows some further performance boost. The modified version is a bit longer, but hopefully the most performant one -
s0,s1,s2,s3 = m0.shape
s4 = m1.shape[-1]
m0.shape = s0,s1*s2,s3 # Get m0 as 3D for temporary usage
r = np.empty((s0,s1*s2,s4))
for i in range(s0):
    r[i] = m0[i].dot(m1[i])
r.shape = s0,s1,s2,s4
m0.shape = s0,s1,s2,s3 # Put m0 back to 4D
Runtime test
Function definitions -
def original_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    r = np.empty((s0,s1,s2,s4))
    for i in range(s0):
        for j in range(s1):
            r[i, j] = np.dot(m0[i, j], m1[i])
    return r
def einsum_app(m0, m1):
    return np.einsum('ijkl,ilm->ijkm',m0,m1)
def tensordot_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    r = np.empty((s0,s1,s2,s4))
    for i in range(s0):
        r[i] = np.tensordot(m0[i],m1[i],axes=([2],[0]))
    return r
def dot_app(m0, m1):
    s0,s1,s2,s3 = m0.shape
    s4 = m1.shape[-1]
    m0.shape = s0,s1*s2,s3 # Get m0 as 3D for temporary usage
    r = np.empty((s0,s1*s2,s4))
    for i in range(s0):
        r[i] = m0[i].dot(m1[i])
    r.shape = s0,s1,s2,s4
    m0.shape = s0,s1,s2,s3 # Put m0 back to 4D
    return r
Timings and verification -
In [291]: # Inputs
...: m0 = np.random.rand(50,30,20,20)
...: m1 = np.random.rand(50,20,20)
...:
In [292]: out1 = original_app(m0, m1)
...: out2 = einsum_app(m0, m1)
...: out3 = tensordot_app(m0, m1)
...: out4 = dot_app(m0, m1)
...:
...: print(np.allclose(out1, out2))
...: print(np.allclose(out1, out3))
...: print(np.allclose(out1, out4))
...:
True
True
True
In [293]: %timeit original_app(m0, m1)
...: %timeit einsum_app(m0, m1)
...: %timeit tensordot_app(m0, m1)
...: %timeit dot_app(m0, m1)
...:
100 loops, best of 3: 10.3 ms per loop
10 loops, best of 3: 31.3 ms per loop
100 loops, best of 3: 5.12 ms per loop
100 loops, best of 3: 4.06 ms per loop
I think numpy.inner() is what you really want?
I'd appreciate some help in finding and understanding a pythonic way to optimize the following array manipulations in nested for loops:
import numpy
from scipy.spatial import distance  # assumed source of distance.euclidean
def _func(a, b, radius):
    "Return 1 if the distance between a and b is less than radius, otherwise return 0"
    if distance.euclidean(a, b) < radius:
        return 1
    else:
        return 0
def _make_mask(volume, roi, radius):
    mask = numpy.zeros(volume.shape)
    for x in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for z in range(volume.shape[2]):
                mask[x, y, z] = _func((x, y, z), roi, radius)
    return mask
Here volume and roi are both ndarrays, with volume.shape being (182, 218, 200) and roi.shape being (3,), and radius is an int.
Approach #1
Here's a vectorized approach -
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2
Possible improvement : We can probably speed up the last step with the numexpr module -
import numexpr as ne
mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
Approach #2
We can also build the three ranges corresponding to the shape parameters and subtract the three elements of roi from them on the fly, without actually creating the meshes as done earlier with np.mgrid. Broadcasting then combines the three 1D results efficiently. The implementation would look like this -
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2
Simplified version : Thanks to @Bi Rico for suggesting an improvement here; we can use np.ogrid to perform those operations in a more concise manner, like so -
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
mask = (x**2+y**2+z**2) < radius**2
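A note on portability: np.ogrid returns three arrays of different shapes, and subtracting roi from that sequence relies on NumPy converting it into a single array first. Recent NumPy releases may refuse that conversion for inhomogeneous shapes. The sketch below, an adaptation rather than the original answer's code, subtracts each roi component from its own open grid instead. The helper name sphere_mask and the small test sizes are chosen here for illustration.

```python
import numpy as np

def sphere_mask(shape, roi, radius):
    """Boolean mask of voxels closer than radius to roi (sketch)."""
    # Open grids with shapes (m, 1, 1), (1, n, 1) and (1, 1, r).
    x, y, z = np.ogrid[0:shape[0], 0:shape[1], 0:shape[2]]
    # Subtract per axis; broadcasting expands the sum to (m, n, r).
    return (x - roi[0])**2 + (y - roi[1])**2 + (z - roi[2])**2 < radius**2

shape = (9, 11, 10)
roi = np.array([4.0, 5.0, 3.0])
radius = 3.4

mask = sphere_mask(shape, roi, radius)

# Reference: brute-force distance check at every voxel index.
ref = np.zeros(shape, dtype=bool)
for idx in np.ndindex(shape):
    ref[idx] = np.linalg.norm(np.array(idx) - roi) < radius
print(np.array_equal(mask, ref))
```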
Runtime test
Function definitions -
def vectorized_app1(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return X**2 + Y**2 + Z**2 < radius**2
def vectorized_app1_improved(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
def vectorized_app2(volume, roi, radius):
    m,n,r = volume.shape
    vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
           ((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
    return vals < radius**2
def vectorized_app2_simplified(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.ogrid[0:m,0:n,0:r]-roi
    return (x**2+y**2+z**2) < radius**2
Timings -
In [106]: # Setup input arrays
...: volume = np.random.rand(90,110,100) # Half of original input sizes
...: roi = np.random.rand(3)
...: radius = 3.4
...:
In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop
In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop
In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop
In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop
In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop
So, as always, broadcasting shows its magic: an almost 10,000x speedup over the original code, and the on-the-fly broadcasted operations are more than 10x faster than creating meshes!
Say you first build an xyz array:
import itertools
xyz = [np.array(p) for p in itertools.product(range(volume.shape[0]), range(volume.shape[1]), range(volume.shape[2]))]
Now, using numpy.linalg.norm,
np.linalg.norm(xyz - roi, axis=1) < radius
checks whether the distance for each tuple from roi is smaller than radius.
Finally, just reshape the result to the dimensions you need.
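The norm-based idea, including the final reshape, can be sketched as follows. As a swapped-in detail, np.indices replaces itertools.product for building the coordinate list; it yields the coordinates in the same row-major order, so the result can be reshaped straight back to the volume shape. The helper name mask_via_norm is hypothetical.

```python
import numpy as np

def mask_via_norm(shape, roi, radius):
    """Distance-based mask using np.linalg.norm on a coordinate list (sketch)."""
    # np.indices(shape) has shape (3, m, n, r); flatten it to (m*n*r, 3)
    # so that each row holds one (x, y, z) index triple in C order.
    coords = np.indices(shape).reshape(len(shape), -1).T
    inside = np.linalg.norm(coords - roi, axis=1) < radius
    # Reshape the flat result back to the dimensions of the volume.
    return inside.reshape(shape)

shape = (9, 11, 10)
roi = np.array([4.0, 5.0, 3.0])
radius = 3.4

mask_norm = mask_via_norm(shape, roi, radius)

# Cross-check against the broadcasting formulation with open grids.
x, y, z = np.ogrid[0:shape[0], 0:shape[1], 0:shape[2]]
mask_ref = (x - roi[0])**2 + (y - roi[1])**2 + (z - roi[2])**2 < radius**2
print(np.array_equal(mask_norm, mask_ref))
```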