Vectorized Implementation for a Convoluted Numpy Script - python

I am working on a personal project which involves predicting weather pattern movements from radar. I have three n by m numpy arrays: one with precipitation intensity values, one with the movement (in pixels) of that precipitation in the X direction, and one with the movement (in pixels) in the Y direction. I want to use the two offset arrays to determine the new location of each precipitation pixel.
xMax = currentReflectivity.shape[0]
yMax = currentReflectivity.shape[1]
for x in xrange(currentReflectivity.shape[0]):
    for y in xrange(currentReflectivity.shape[1]):
        targetPixelX = xOffsetArray[x,y] + x
        targetPixelY = yOffsetArray[x,y] + y
        targetPixelX = int(targetPixelX)
        targetPixelY = int(targetPixelY)
        if targetPixelX < xMax and targetPixelY < yMax:
            interpolatedReflectivity[targetPixelX,targetPixelY] = currentReflectivity[x,y]
I can't think of a way to vectorize this; any ideas?

Here's a vectorized approach making use of broadcasting -
x_arr = np.arange(currentReflectivity.shape[0])[:,None]
y_arr = np.arange(currentReflectivity.shape[1])
targetPixelX_arr = (xOffsetArray[x_arr, y_arr] + x_arr).astype(int)
targetPixelY_arr = (yOffsetArray[x_arr, y_arr] + y_arr).astype(int)
valid_mask = (targetPixelX_arr < xMax) & (targetPixelY_arr < yMax)
R = targetPixelX_arr[valid_mask]
C = targetPixelY_arr[valid_mask]
interpolatedReflectivity[R,C] = currentReflectivity[valid_mask]
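As a side note, since x_arr and y_arr together enumerate every index, xOffsetArray[x_arr, y_arr] is just xOffsetArray itself, so the first two steps can be written with plain broadcasting; an equivalent sketch, assuming the offset arrays are full n by m arrays as in the question -
x_arr = np.arange(currentReflectivity.shape[0])[:, None]   # column vector, broadcasts across rows
y_arr = np.arange(currentReflectivity.shape[1])            # row vector, broadcasts across columns
targetPixelX_arr = (xOffsetArray + x_arr).astype(int)
targetPixelY_arr = (yOffsetArray + y_arr).astype(int)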
Runtime test
Approaches -
def org_app(currentReflectivity, xOffsetArray, yOffsetArray):
    m,n = currentReflectivity.shape
    interpolatedReflectivity = np.zeros((m,n))
    xMax = currentReflectivity.shape[0]
    yMax = currentReflectivity.shape[1]
    for x in xrange(currentReflectivity.shape[0]):
        for y in xrange(currentReflectivity.shape[1]):
            targetPixelX = xOffsetArray[x,y] + x
            targetPixelY = yOffsetArray[x,y] + y
            targetPixelX = int(targetPixelX)
            targetPixelY = int(targetPixelY)
            if targetPixelX < xMax and targetPixelY < yMax:
                interpolatedReflectivity[targetPixelX,targetPixelY] = \
                                                currentReflectivity[x,y]
    return interpolatedReflectivity

def broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray):
    m,n = currentReflectivity.shape
    interpolatedReflectivity = np.zeros((m,n))
    xMax, yMax = m,n
    x_arr = np.arange(currentReflectivity.shape[0])[:,None]
    y_arr = np.arange(currentReflectivity.shape[1])
    targetPixelX_arr = (xOffsetArray[x_arr, y_arr] + x_arr).astype(int)
    targetPixelY_arr = (yOffsetArray[x_arr, y_arr] + y_arr).astype(int)
    valid_mask = (targetPixelX_arr < xMax) & (targetPixelY_arr < yMax)
    R = targetPixelX_arr[valid_mask]
    C = targetPixelY_arr[valid_mask]
    interpolatedReflectivity[R,C] = currentReflectivity[valid_mask]
    return interpolatedReflectivity
Timings and verification -
In [276]: # Setup inputs
...: m,n = 100,110 # currentReflectivity.shape
...: max_r = 120 # xOffsetArray's extent
...: max_c = 130 # yOffsetArray's extent
...:
...: currentReflectivity = np.random.rand(m, n)
...: xOffsetArray = np.random.randint(0,max_r,(m, n))
...: yOffsetArray = np.random.randint(0,max_c,(m, n))
...:
In [277]: out1 = org_app(currentReflectivity, xOffsetArray, yOffsetArray)
...: out2 = broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray)
...: print np.allclose(out1, out2)
...:
True
In [278]: %timeit org_app(currentReflectivity, xOffsetArray, yOffsetArray)
100 loops, best of 3: 6.86 ms per loop
In [279]: %timeit broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray)
1000 loops, best of 3: 212 µs per loop
In [280]: 6860.0/212 # Speedup number
Out[280]: 32.35849056603774

I am pretty sure that you can vectorize this by just taking everything out of the loop:
targetPixelX = (xOffsetArray + np.arange(xMax).reshape(xMax, 1)).astype(int)
targetPixelY = (yOffsetArray + np.arange(yMax)).astype(int)
mask = (targetPixelX < xMax) & (targetPixelY < yMax)
interpolatedReflectivity[targetPixelX[mask], targetPixelY[mask]] = currentReflectivity[mask]
This will be much faster but more memory intensive: targetPixelX and targetPixelY are now full arrays holding, for each pixel, the values that were previously computed one element per iteration.
Only the in-bounds target positions are written into interpolatedReflectivity, mirroring what the if statement was doing in the loop.

Related

A way to speed up the smoothing function

I have a function that is used for smoothing a curve by taking a moving average over a factor of 2. However, in its current form the function is slow because of the loops. I have added numba to increase the speed, but it's still slow. Any suggestions on how I could improve the speed?
import numpy as np
import matplotlib.pyplot as plt
from numba import prange, jit

@jit(nopython=True, parallel=True)
def smoothing_function(x, y, window=2, pad=1):
    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    for i in prange(len_x):
        x0 = x[i]
        xf = window * x[i]
        if xf < max_x:
            e = np.where(x == x.flat[np.abs(x - xf).argmin()])[0][0]
            if e < len(x):
                yout[i] = np.nanmean(y[i:e])
                xoutmid[i] = x[i] + np.log10(0.5) * (x[i] - x[e])
                xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout

# Working example
f = lambda x: x**(-1.7) * 2 * np.random.rand(len(x))
x = np.logspace(np.log10(1e-5), np.log10(1), 1000)
xvals, yvals = x, f(x)
res = smoothing_function(xvals, yvals, window=2, pad=1)
%timeit res = smoothing_function(xvals, yvals, window=2, pad=1)

# plot results
plt.loglog(xvals, yvals)
plt.loglog(res[1], res[2])
The issue is that you are computing the end index e very inefficiently. Since x is in logspace, the distance between two consecutive points is known, so the end index is just a fixed offset of log(window) from the initial point. A working example is as follows:
@jit(nopython=True, parallel=True)
def smoothing_function2(x, y, window=2, pad=1):
    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    f_idx = int(len(x) * np.log10(window) / (np.log10(x[-1]) - np.log10(x[0])))
    for i in prange(len_x):
        if window * x[i] < max_x:
            e = min(i + f_idx, len_x - 1)
            yout[i] = np.nanmean(y[i:e])
            xoutmid[i] = x[i] + np.log10(0.5) * (x[i] - x[e])
            xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout
f = lambda x: x**(-1.7) * 2 * np.random.rand(len(x))
x = np.logspace(np.log10(1e-5), np.log10(1), 1000)
xvals, yvals = x, f(x)
res1 = smoothing_function(xvals, yvals, window=2, pad=1)
res2 = smoothing_function2(xvals, yvals, window=2, pad=1)
print([np.nansum((r1 - r2)**2) for r1, r2 in zip(res1, res2)])  # verify that all the outputs are the same
%timeit res = smoothing_function(xvals, yvals, window=2, pad=1)
%timeit res = smoothing_function2(xvals, yvals, window=2, pad=1)
Output is the following:
[0.0, 0.0, 0.0]
337 µs ± 59.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
49.1 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This verifies that both functions return the same output, and that smoothing_function2 is ~6.8x faster. If x is not restricted to logspace, you can still use the properties of whatever spacing you are using to get a similar improvement. There could be more ways to improve this further; it depends on what your target is. You could also try implementing this in C++ or Cython.
@Abhinav's solution works perfectly. Another very slightly faster solution is this one:
import numpy as np
from numba import prange, jit

@jit(nopython=True, parallel=True)
def smoothing_function(x, y, window=2, pad=1):
    def bisection(array, value):
        '''Given an ``array`` and a ``value``, returns an index j such that ``value`` is
        between array[j] and array[j+1]. ``array`` must be monotonic increasing. j=-1 or
        j=len(array) is returned to indicate that ``value`` is out of range below or
        above, respectively.'''
        n = len(array)
        if value < array[0]:
            return -1
        elif value > array[n-1]:
            return n
        jl = 0      # Initialize lower
        ju = n - 1  # and upper limits.
        while ju - jl > 1:          # If we are not yet done,
            jm = (ju + jl) >> 1     # compute a midpoint with a bitshift
            if value >= array[jm]:
                jl = jm             # and replace either the lower limit
            else:
                ju = jm             # or the upper limit, as appropriate.
            # Repeat until the test condition is satisfied.
        if value == array[0]:       # edge cases at bottom
            return 0
        elif value == array[n-1]:   # and top
            return n - 1
        else:
            return jl

    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    for i in prange(len_x):
        x0 = x[i]
        xf = window * x0
        if xf < max_x:
            # e = np.where(x == x[np.abs(x - xf).argmin()])[0][0]
            e = bisection(x, xf)
            if e < len_x:
                yout[i] = np.nanmean(y[i:e])
                xoutmid[i] = x0 + np.log10(0.5) * (x0 - x[e])
                xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout
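For reference, the hand-rolled bisection is doing what a library binary search does; a minimal non-numba sketch of the same lookup using np.searchsorted (assuming x is sorted ascending, as a logspace grid is; the helper name end_indices is illustrative, and handling of values above the grid differs slightly from the bisection helper) -
import numpy as np

def end_indices(x, window=2):
    # For every x[i], the index of the last grid point <= window * x[i];
    # mirrors bisection(x, window * x[i]) for in-range values.
    return np.searchsorted(x, window * x, side='right') - 1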

Efficiently compute pairwise equal for NumPy arrays

Given two NumPy arrays, say:
import numpy as np
import numpy.random as rand
n = 1000
x = rand.binomial(n=1, p=.5, size=(n, 10))
y = rand.binomial(n=1, p=.5, size=(n, 10))
Is there a more efficient way to compute X in the following:
X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        X[i, j] = 1 * np.all(x[i] == y[j])
Approach #1 : Input arrays with 0s & 1s
For input arrays with only 0s and 1s, we can reduce each row to a scalar by reading its entries as the bits of an integer, which turns the 2D inputs into 1D arrays; broadcasting then does the pairwise comparison, like so -
n = x.shape[1]
s = 2**np.arange(n)
x1D = x.dot(s)
y1D = y.dot(s)
Xout = (x1D[:,None] == y1D).astype(float)
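Two rows match exactly when their packed integers match, and with 10 columns the bit weights 2**np.arange(10) easily fit an integer dtype (for much wider inputs this packing would overflow, and Approach #2 below is safer). A tiny illustration with hypothetical 2-column inputs -
a = np.array([[0, 1], [1, 1]])
b = np.array([[1, 1], [0, 1]])
s = 2**np.arange(a.shape[1])                  # [1, 2] : bit weight per column
print(a.dot(s))                               # [2, 3] : each row packed into one integer
print((a.dot(s)[:, None] == b.dot(s)).astype(float))   # [[0., 1.], [1., 0.]]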
Approach #2 : Generic case
For a generic case, we can use views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
x1D, y1D = view1D(x, y)
Xout = (x1D[:,None] == y1D).astype(float)
Runtime test
# Setup
In [287]: np.random.seed(0)
...: n = 1000
...: x = rand.binomial(n=1, p=.5, size=(n, 10))
...: y = rand.binomial(n=1, p=.5, size=(n, 10))
# Original approach
In [288]: %%timeit
...: X = np.zeros((n, n))
...: for i in range(n):
...:     for j in range(n):
...:         X[i, j] = 1 * np.all(x[i] == y[j])
1 loop, best of 3: 4.69 s per loop
# Approach #1
In [290]: %%timeit
...: n = x.shape[1]
...: s = 2**np.arange(n)
...: x1D = x.dot(s)
...: y1D = y.dot(s)
...: Xout = (x1D[:,None] == y1D).astype(float)
1000 loops, best of 3: 1.42 ms per loop
# Approach #2
In [291]: %%timeit
...: x1D, y1D = view1D(x, y)
...: Xout = (x1D[:,None] == y1D).astype(float)
100 loops, best of 3: 18.5 ms per loop

Vectorization/optimising for loop with numpy in Python

I'm writing a script to handle some data from a sensor, represented by the signal_gen function. As you can see in the testing function, it is quite loop centered. Since this function is called many times, it is a bit slow, and a push in the right direction for optimising it would be lovely.
I have read that it is possible to replace the for loop with vectorized array operations, but I can't get my head around how the I_avg[i] line should be written, since we have a single element y[i] multiplied with the whole array x inside np.cos, and all of this is just one iteration of I_avg.
def testing(signal):
    y = np.arange(0.0108, 0.0135, 0.001)  # this one changes over time;
                                          # set to constant for easier reading
    x = np.arange(0, len(signal))
    I_avg = np.zeros(len(y))
    Q_avg = np.zeros_like(I_avg)
    for i in range(0, len(y)):
        I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
        Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
    D = np.power(I_avg, 2) + np.power(Q_avg, 2)
    max_index = np.argmax(D)
    phaseOut = np.arctan2(Q_avg[max_index], I_avg[max_index])

# just a test signal
def signal_gen():
    signal = np.random.random(size=251)
    return signal
One vectorized approach using matrix-multiplication with numpy.dot to replace the nested loop to give us I_avg, Q_avg and also incorporating NumPy broadcasting and thus achieve a more efficient solution would be like so -
mult = 2*np.pi*y[:,None]*x
I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
Please note that for the given sample, we are competing against a loopy version that only has to iterate for 3 iterations (y being of length 3). As such, we won't be seeing a huge speedup here.
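For completeness, the rest of testing carries over unchanged; a minimal sketch of the full vectorized body, assuming signal, x and y as defined in the question -
mult = 2 * np.pi * y[:, None] * x        # shape (len(y), len(x)) via broadcasting
I_avg = np.cos(mult).dot(signal)         # one matrix-vector product per trig term
Q_avg = np.sin(mult).dot(signal)
D = I_avg**2 + Q_avg**2
max_index = np.argmax(D)
phaseOut = np.arctan2(Q_avg[max_index], I_avg[max_index])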
Runtime test -
In [9]: #just a test signal
...: signal = np.random.random(size=251)
...: y = np.arange(0.0108, 0.0135, 0.001)
...: x = np.arange(0, (len(signal)))
...:
# Original approach
In [10]: %%timeit I_avg = np.zeros(len(y))
...: Q_avg = np.zeros_like(I_avg)
...: for i in range(0, len(y)):
...:     I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
...:     Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
...:
10000 loops, best of 3: 68 µs per loop
# Proposed approach
In [11]: %%timeit mult = 2*np.pi*y[:,None]*x
...: I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
...:
10000 loops, best of 3: 34.8 µs per loop
You can use np.einsum to build the broadcasted argument matrix, then matrix-multiply with the signal:
yx = 2*np.pi*np.einsum("i,j->ij", y, x)
I_avg = np.cos(yx) @ signal
Q_avg = np.sin(yx) @ signal

Vectorizing NumPy covariance for 3D array

I have a 3D numpy array of shape (t, n1, n2):
x = np.random.rand(10, 2, 4)
I need to calculate another 3D array y which is of shape (t, n1, n1) such that:
y[0] = np.cov(x[0,:,:])
...and so on for all slices along the first axis.
So, a loopy implementation would be:
y = np.zeros((10, 2, 2))
for i in np.arange(x.shape[0]):
    y[i] = np.cov(x[i, :, :])
Is there any way to vectorize this so I can calculate all covariance matrices in one go? I tried doing:
x1 = x.swapaxes(1, 2)
y = np.dot(x, x1)
But it didn't work.
Hacking into the numpy.cov source code with the default parameters shows that np.cov(x[i,:,:]) boils down to simply:
N = x.shape[2]
m = x[i, :, :]
m = m - np.sum(m, axis=1, keepdims=True) / N   # note: in-place `m -= ...` would also modify x
cov = np.dot(m, m.T) / (N - 1)
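A quick sanity check of that equivalence on a single slice (a minimal sketch, using the x from the question) -
i = 0
N = x.shape[2]
m = x[i] - x[i].sum(1, keepdims=True) / N      # subtract the per-row mean
assert np.allclose(np.dot(m, m.T) / (N - 1), np.cov(x[i]))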
So, the task was to vectorize this loop over i and process all of the data from x in one go. Broadcasting handles the mean subtraction in the third step. The final step is a sum-reduction along the last axis for every slice along the first axis, which can be implemented efficiently in a vectorized manner with np.einsum. Thus, the final implementation comes to this -
N = x.shape[2]
m1 = x - x.sum(2,keepdims=1)/N
y_out = np.einsum('ijk,ilk->ijl',m1,m1) /(N - 1)
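The subscripts 'ijk,ilk->ijl' contract the trailing axis k for every slice i, i.e. a batched matrix product; an equivalent sketch using np.matmul instead of np.einsum -
N = x.shape[2]
m1 = x - x.sum(2, keepdims=True) / N
y_out = np.matmul(m1, m1.swapaxes(1, 2)) / (N - 1)   # same result as the einsum form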
Runtime test
In [155]: def original_app(x):
     ...:     n = x.shape[0]
     ...:     y = np.zeros((n,2,2))
     ...:     for i in np.arange(x.shape[0]):
     ...:         y[i] = np.cov(x[i,:,:])
     ...:     return y
     ...:
     ...: def proposed_app(x):
     ...:     N = x.shape[2]
     ...:     m1 = x - x.sum(2,keepdims=1)/N
     ...:     out = np.einsum('ijk,ilk->ijl',m1,m1) / (N - 1)
     ...:     return out
     ...:
...:
In [156]: # Setup inputs
...: n = 10000
...: x = np.random.rand(n,2,4)
...:
In [157]: np.allclose(original_app(x),proposed_app(x))
Out[157]: True # Results verified
In [158]: %timeit original_app(x)
1 loops, best of 3: 610 ms per loop
In [159]: %timeit proposed_app(x)
100 loops, best of 3: 6.32 ms per loop
Huge speedup there!

Python vectorizing nested for loops

I'd appreciate some help in finding and understanding a pythonic way to optimize the following array manipulations in nested for loops:
import numpy
from scipy.spatial import distance

def _func(a, b, radius):
    "Return 1 if a is within radius of b, otherwise return 0"
    if distance.euclidean(a, b) < radius:
        return 1
    else:
        return 0

def _make_mask(volume, roi, radius):
    mask = numpy.zeros(volume.shape)
    for x in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for z in range(volume.shape[2]):
                mask[x, y, z] = _func((x, y, z), roi, radius)
    return mask
where volume is an ndarray with volume.shape == (182, 218, 200), roi is an ndarray with roi.shape == (3,), and radius is an int.
Approach #1
Here's a vectorized approach -
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2
Possible improvement: We can probably speed up the last step with the numexpr module -
import numexpr as ne
mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
Approach #2
We can also gradually build the three ranges corresponding to the shape parameters and perform the subtraction against the three elements of roi on the fly, without actually creating the meshes as np.mgrid did earlier. This benefits from broadcasting for efficiency. The implementation would look like this -
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2
Simplified version: Thanks to @Bi Rico for suggesting an improvement here, as we can use np.ogrid to perform those operations a bit more concisely, like so -
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
mask = (x**2+y**2+z**2) < radius**2
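One caveat: the ogrid - roi subtraction relies on NumPy packing the three differently-shaped open grids into an object array, which recent NumPy versions refuse to do implicitly; a sketch of an explicit variant that avoids relying on it -
m, n, r = volume.shape
x, y, z = np.ogrid[0:m, 0:n, 0:r]       # open grids of shape (m,1,1), (1,n,1), (1,1,r)
mask = (x - roi[0])**2 + (y - roi[1])**2 + (z - roi[2])**2 < radius**2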
Runtime test
Function definitions -
def vectorized_app1(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return X**2 + Y**2 + Z**2 < radius**2

def vectorized_app1_improved(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')

def vectorized_app2(volume, roi, radius):
    m,n,r = volume.shape
    vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
           ((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
    return vals < radius**2

def vectorized_app2_simplified(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.ogrid[0:m,0:n,0:r]-roi
    return (x**2+y**2+z**2) < radius**2
Timings -
In [106]: # Setup input arrays
...: volume = np.random.rand(90,110,100) # Half of original input sizes
...: roi = np.random.rand(3)
...: radius = 3.4
...:
In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop
In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop
In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop
In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop
In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop
So, as always, broadcasting shows its magic: an almost 10,000x speedup over the original code, and the on-the-fly broadcasted operations are more than 10x faster than creating full meshes!
Say you first build an xyz array:
import itertools
xyz = [np.array(p) for p in itertools.product(range(volume.shape[0]),
                                              range(volume.shape[1]),
                                              range(volume.shape[2]))]
Now, using numpy.linalg.norm,
np.linalg.norm(xyz - roi, axis=1) < radius
checks whether the distance for each tuple from roi is smaller than radius.
Finally, just reshape the result to the dimensions you need.
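Putting those pieces together, a minimal sketch, assuming volume, roi and radius as defined above (here the coordinate list is built directly as one array) -
import itertools
import numpy as np

xyz = np.array(list(itertools.product(range(volume.shape[0]),
                                      range(volume.shape[1]),
                                      range(volume.shape[2]))))
mask = (np.linalg.norm(xyz - roi, axis=1) < radius).reshape(volume.shape)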
