A way to speed up the smoothing function - python

I have a function that smooths a curve by taking a moving average over a factor of 2 in x. In its current form the function is slow because of the loops. I added numba to speed it up, but it is still slow. Any suggestions on how I could improve the speed?
import numpy as np
import matplotlib.pyplot as plt
from numba import prange, jit

@jit(nopython=True, parallel=True)
def smoothing_function(x,y, window=2, pad = 1):
    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    for i in prange(len_x):
        x0 = x[i]
        xf = window*x[i]
        if xf < max_x:
            # index of the point in x that is closest to xf
            e = np.where(x == x.flat[np.abs(x - xf).argmin()])[0][0]
            if e<len(x):
                yout[i] = np.nanmean(y[i:e])
                xoutmid[i] = x[i] + np.log10(0.5) * (x[i] - x[e])
                xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout

# Working example
f = lambda x: x**(-1.7)*2*np.random.rand(len(x))
x = np.logspace(np.log10(1e-5), np.log10(1), 1000)
xvals, yvals = x, f(x)
%timeit res = smoothing_function(xvals, yvals, window=2, pad = 1)
# plot results (%timeit does not keep res, so compute it once more)
res = smoothing_function(xvals, yvals, window=2, pad = 1)
plt.loglog(xvals, yvals)
plt.loglog(res[1], res[2])

The issue is that you compute the end index (e) very inefficiently: every iteration scans the whole array. If you use the fact that x is in logspace, this can be done much faster. Consecutive points are a constant distance apart in log10(x), so the end index is always a fixed number of samples (log10(window) divided by that spacing) past the starting point. A working example is as follows:
@jit(nopython=True, parallel=True)
def smoothing_function2(x,y, window=2, pad = 1):
    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    # number of samples that span log10(window) on the log-spaced grid
    f_idx = int(len(x)*np.log10(window)/(np.log10(x[-1])-np.log10(x[0])))
    for i in prange(len_x):
        if window*x[i] < max_x:
            e = min(i+f_idx, len_x-1)
            yout[i] = np.nanmean(y[i:e])
            xoutmid[i] = x[i] + np.log10(0.5) * (x[i] - x[e])
            xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout

f = lambda x: x**(-1.7)*2*np.random.rand(len(x))
x = np.logspace(np.log10(1e-5), np.log10(1), 1000)
xvals, yvals = x, f(x)
res1 = smoothing_function(xvals, yvals, window=2, pad = 1)
res2 = smoothing_function2(xvals, yvals, window=2, pad = 1)
print([np.nansum((r1 - r2)**2) for r1, r2 in zip(res1, res2)]) # verify that all the outputs are same
%timeit res = smoothing_function(xvals, yvals, window=2, pad = 1)
%timeit res = smoothing_function2(xvals, yvals, window=2, pad = 1)
Output is the following:
[0.0, 0.0, 0.0]
337 µs ± 59.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
49.1 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This verifies that both functions return the same output, while smoothing_function2 is ~6.9x faster. If x is not in logspace, you can still exploit the structure of whatever spacing you use to get a similar improvement. Further gains are possible depending on your target; you could also try implementing this in C++ or Cython.
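For x that is sorted but not evenly spaced in any known way, here is a minimal sketch of the same idea using np.searchsorted and prefix sums. The function name smoothing_sorted is hypothetical, and the sketch assumes x is strictly increasing and positive and that neither array contains NaNs. It also takes the first sample at or beyond window*x[i] as the end of the window rather than the nearest sample, so results can differ from the functions above by one sample at the window edge.

```python
import numpy as np

def smoothing_sorted(x, y, window=2):
    """Windowed means over [x[i], window * x[i]) for sorted, positive, NaN-free data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    start = np.arange(n)
    # end[i] is the first index whose x value is >= window * x[i]
    end = np.searchsorted(x, window * x, side="left")
    ok = end < n                 # the window must end inside the data
    count = end - start          # number of samples in each window
    # prefix sums let every window sum be read off as a difference
    cx = np.concatenate(([0.0], np.cumsum(x)))
    cy = np.concatenate(([0.0], np.cumsum(y)))
    xmean = np.full(n, np.nan)
    ymean = np.full(n, np.nan)
    xmean[ok] = (cx[end[ok]] - cx[start[ok]]) / count[ok]
    ymean[ok] = (cy[end[ok]] - cy[start[ok]]) / count[ok]
    return xmean, ymean

x = np.linspace(1.0, 10.0, 50)   # sorted, but not log-spaced
y = np.arange(50.0)
xmean, ymean = smoothing_sorted(x, y, window=2)
```

There is no Python-level loop left, so this may not need numba at all. With data spanning many orders of magnitude, the differences of prefix sums can lose precision, which is worth checking against the loop version on real data.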

@Abhinav's solution works perfectly. Another, very slightly faster solution is this one:
from numba import prange, jit

@jit(nopython=True, parallel=True)
def smoothing_function(x,y, window=2, pad = 1):
    def bisection(array,value):
        '''Given an ``array`` , and given a ``value`` , returns an index j such that ``value`` is between array[j]
        and array[j+1]. ``array`` must be monotonic increasing. j=-1 or j=len(array) is returned
        to indicate that ``value`` is out of range below and above respectively.'''
        n = len(array)
        if (value < array[0]):
            return -1
        elif (value > array[n-1]):
            return n
        jl = 0  # Initialize lower
        ju = n-1  # and upper limits.
        while (ju-jl > 1):  # If we are not yet done,
            jm=(ju+jl) >> 1  # compute a midpoint with a bitshift
            if (value >= array[jm]):
                jl=jm  # and replace either the lower limit
            else:
                ju=jm  # or the upper limit, as appropriate.
            # Repeat until the test condition is satisfied.
        if (value == array[0]):  # edge cases at bottom
            return 0
        elif (value == array[n-1]):  # and top
            return n-1
        return jl

    len_x = len(x)
    max_x = np.max(x)
    xoutmid = np.full(len_x, np.nan)
    xoutmean = np.full(len_x, np.nan)
    yout = np.full(len_x, np.nan)
    for i in prange(len_x):
        x0 = x[i]
        xf = window*x0
        if xf < max_x:
            #e = np.where(x == x[np.abs(x - xf).argmin()])[0][0]
            e = bisection(x,xf)
            if e<len_x:
                yout[i] = np.nanmean(y[i:e])
                xoutmid[i] = x0 + np.log10(0.5) * (x0 - x[e])
                xoutmean[i] = np.nanmean(x[i:e])
    return xoutmid, xoutmean, yout
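As a side note, the hand-written bisection can probably be replaced by np.searchsorted, which performs a binary search in compiled code and accepts a whole array of query values. The sketch below is my own check, not part of the answer above; it assumes x is strictly increasing and only compares query values inside the range of x (for values above the maximum, bisection returns n while the searchsorted expression gives n - 1).

```python
import numpy as np

def bisection(array, value):
    """Index j with array[j] <= value < array[j+1]; -1 or len(array) when out of range."""
    n = len(array)
    if value < array[0]:
        return -1
    if value > array[n - 1]:
        return n
    jl, ju = 0, n - 1
    while ju - jl > 1:
        jm = (ju + jl) >> 1
        if value >= array[jm]:
            jl = jm
        else:
            ju = jm
    if value == array[n - 1]:
        return n - 1
    return jl

x = np.logspace(-5, 0, 1000)
targets = 2 * x[2 * x <= x[-1]]          # in-range query values only
by_loop = np.array([bisection(x, v) for v in targets])
# side="right" gives the insertion point after equal values; subtract 1 for the left neighbour
by_numpy = np.searchsorted(x, targets, side="right") - 1
print(np.array_equal(by_loop, by_numpy))
```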


Optimize computation of the "difference function"

My code calls numerous "difference functions" to compute the "Yin algorithm" (a fundamental frequency extractor).
The difference function (eq. 6 in the paper) is defined as:
$$d_t(\tau) = \sum_{j=1}^{W} \left(x_j - x_{j+\tau}\right)^2$$
And this is my implementation of the difference function:
def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])  # Python 3 int is unbounded (Python 2 used long)
            df[tau] += tmp * tmp
    return df
For instance with:
x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)
Is there a way to optimize this double-loop computation (with python only, preferably without other libraries than numpy)?
EDIT: Changed code to avoid IndexError (j loop, @Elliot's answer)
EDIT2: Changed code to use x[0] (j loop, @hynekcer's comment)
EDIT: Improved speed to 220 µs - see edit at the end - direct version
The required calculation can easily be evaluated with the autocorrelation function or, similarly, by convolution. The Wiener–Khinchin theorem allows computing the autocorrelation with two Fast Fourier transforms (FFT), with time complexity O(n log n).
I use the accelerated convolution function fftconvolve from the SciPy package. An advantage is that it is easy to explain here why it works. Everything is vectorized, with no loop at the Python interpreter level.
from scipy.signal import fftconvolve

def difference_by_convol(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    conv = fftconvolve(x, x[::-1])
    df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]
    return df[:tau_max + 1]
Compared with the differenceFunction_1loop function in Elliot's answer, the FFT version is faster: 430 µs compared to the original 1170 µs. It starts to be faster at about tau_max >= 40. The numerical accuracy is great: the highest relative error is less than 1E-14 compared to the exact integer result. (Therefore it could easily be rounded to the exact long integer solution.)
The parameter tau_max is not important for the algorithm. It only restricts the output at the end. A zero element at index 0 is added to the output because indexes should start at 0 in Python.
The parameter W is not important in Python. It is better to introspect the size from the data.
The data are converted to np.float64 initially to prevent repeated conversions. That is about half a percent faster. Any integer type smaller than np.int64 would be unacceptable because of overflow.
The required difference function is the sum of two partial energies minus twice the autocorrelation function. The autocorrelation can be evaluated by convolution: correlate(x, x) = convolve(x, reversed(x)).
"As of Scipy v0.19 normal convolve automatically chooses this method or the direct method based on an estimation of which is faster." That heuristic is not adequate in this case, because the convolution evaluates many more lags than tau_max, and that extra work must be outweighed by the FFT being much faster than a direct method.
It can also be calculated with the NumPy fft module, without SciPy, by rewriting the answer "Calculate autocorrelation using FFT in matlab" in Python (below at the end). I think the solution above is easier to understand.
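To make the overflow remark concrete, here is a tiny sketch with made-up values showing what happens when int16 samples are squared without widening them first. NumPy array arithmetic keeps the input dtype and wraps around silently.

```python
import numpy as np

x16 = np.array([300, 200], dtype=np.int16)
wrapped = x16 * x16                   # stays int16 and wraps modulo 2**16
exact = x16.astype(np.int64) ** 2     # widened first, so the squares are exact
print(wrapped.dtype, wrapped)
print(exact.dtype, exact)
```

The same applies to np.dot and cumsum on narrow integer arrays, which is why the functions in this answer convert to np.float64 (or np.int64) before doing any arithmetic.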
Proof: (for Pythonistas :-)
The original naive implementation can be written as:
df = [sum((x[j] - x[j + t]) ** 2 for j in range(w - t)) for t in range(tau_max + 1)]
where tau_max < w.
Expand using the rule (a - b)**2 == a**2 + b**2 - 2 * a * b:
df = [  sum(x[j] ** 2 for j in range(w - t))
      + sum(x[j] ** 2 for j in range(t, w))
      - 2 * sum(x[j] * x[j + t] for j in range(w - t))
      for t in range(tau_max + 1)]
Substitute the first two sums with the help of x_cumsum = [sum(x[j] ** 2 for j in range(i)) for i in range(w + 1)], which can easily be calculated in linear time. Substitute sum(x[j] * x[j + t] for j in range(w - t)) by the convolution conv = fftconvolve(x, x[::-1], mode='full'), whose output has size len(x) + len(x) - 1.
df = [x_cumsum[w - t] + x_cumsum[w] - x_cumsum[t]
      - 2 * convolve(x, x[::-1])[w - 1 + t]
      for t in range(tau_max + 1)]
Optimize by vector expressions:
df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]
Every step can also be tested and compared numerically on test data.
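One way such a numerical check could look is sketched below. It uses np.convolve (the direct method) instead of fftconvolve so that only NumPy is needed; the array size and tau_max are arbitrary test values chosen here, not taken from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 32000, size=256).astype(np.float64)
w = x.size
tau_max = 20

# step 0: the naive definition
naive = np.array([sum((x[j] - x[j + t]) ** 2 for j in range(w - t))
                  for t in range(tau_max + 1)])

# final step: cumulative energies minus twice the autocorrelation
x_cumsum = np.concatenate(([0.0], (x * x).cumsum()))
conv = np.convolve(x, x[::-1])        # full convolution, length 2 * w - 1
df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]

print(np.allclose(naive, df[:tau_max + 1]))
```

Swapping np.convolve for scipy.signal.fftconvolve in the same script is a simple way to measure the rounding error that the FFT introduces.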
EDIT: Implemented solution directly by Numpy FFT.
def difference_fft(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    tau_max = min(tau_max, w)
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    size = w + tau_max
    p2 = (size // 32).bit_length()
    nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
    size_pad = min(k * 2 ** p2 for k in nice_numbers if k * 2 ** p2 >= size)
    fc = np.fft.rfft(x, size_pad)
    conv = np.fft.irfft(fc * fc.conjugate())[:tau_max]
    return x_cumsum[w:w - tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv
It is more than twice as fast as my previous solution because the length of the convolution is restricted to the nearest "nice" number with small prime factors after W + tau_max, instead of evaluating the full 2 * W. It is also not necessary to transform the same data twice, as it was with fftconvolve(x, reversed(x)).
In [211]: %timeit differenceFunction_1loop(x, W, tau_max)
1.1 ms ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [212]: %timeit difference_by_convol(x, W, tau_max)
431 µs ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [213]: %timeit difference_fft(x, W, tau_max)
218 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The newest solution is faster than Elliot's differenceFunction_1loop for tau_max >= 20. That ratio doesn't depend much on the data size because the overhead costs scale similarly.
First of all, you should consider the boundaries of the array. Your code as originally written would raise an IndexError.
You can get a significant speedup by vectorizing the inner loop:
import numpy as np

# original version
def differenceFunction_2loop(x, W, tau_max):
    df = np.zeros(tau_max, np.int64)
    for tau in range(1, tau_max):
        for j in range(0, W - tau): # -tau eliminates the IndexError
            tmp = np.int64(x[j] -x[j + tau])
            df[tau] += np.square(tmp)
    return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.int64)
    for tau in range(1, tau_max):
        # the int64 operand promotes the whole difference to int64
        tmp = (x[:-tau]) - (x[tau:]).astype(np.int64)
        df[tau] = np.dot(tmp, tmp)
    return df

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
twoloop = differenceFunction_2loop(x, W, tau_max)
oneloop = differenceFunction_1loop(x, W, tau_max)
# confirm that the result comes out the same.
print(np.all(twoloop == oneloop))
# True
Now for some benchmarking. In ipython I get the following
In [103]: %timeit twoloop = differenceFunction_2loop(x, W, tau_max)
1 loop, best of 3: 2.35 s per loop
In [104]: %timeit oneloop = differenceFunction_1loop(x, W, tau_max)
100 loops, best of 3: 8.23 ms per loop
So, about a 300 fold speedup.
Instead of optimizing the algorithm, you can speed up the interpreted loops by compiling them with numba.jit:
import timeit
import numpy as np
from numba import jit

def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df

@jit(nopython=True)  # assumed decorator: the timings below imply the function is JIT-compiled
def differenceFunction2(x, W, tau_max):
    df = np.zeros(tau_max)  # zero-initialised; np.ndarray(shape=...) would leave garbage values
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)
print('old', timeit.timeit('differenceFunction(x, W, tau_max)', 'from __main__ import differenceFunction, x, W, tau_max',
                           number=20) / 20)
print('new', timeit.timeit('differenceFunction2(x, W, tau_max)', 'from __main__ import differenceFunction2, x, W, tau_max',
                           number=20) / 20)
old 0.18265145074453273
new 0.016223197058214667
You can combine optimization of algorithm and numba.jit for a better result.
Here's another approach using a list comprehension. It takes less than a tenth of the time taken by the original function, but does not beat Elliot's answer. Just putting it out there anyway.
import numpy as np
import time

# original version
def differenceFunction_2loop(x, W, tau_max):
    df = np.zeros(tau_max, np.int64)
    for tau in range(1, tau_max):
        for j in range(0, W - tau): # -tau eliminates the IndexError
            tmp = np.int64(x[j] -x[j + tau])
            df[tau] += np.square(tmp)
    return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.int64)
    for tau in range(1, tau_max):
        tmp = (x[:-tau]) - (x[tau:]).astype(np.int64)
        df[tau] = np.dot(tmp, tmp)
    return df

# with list comprehension
def differenceFunction_1loop_listcomp(x, W, tau_max):
    df = [sum(((x[:-tau]) - (x[tau:]).astype(np.int64))**2) for tau in range(1, tau_max)]
    return [0] + df[:]

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
# time.perf_counter replaces time.clock, which was removed in Python 3.8
s = time.perf_counter()
twoloop = differenceFunction_2loop(x, W, tau_max)
print(time.perf_counter() - s)
s = time.perf_counter()
oneloop = differenceFunction_1loop(x, W, tau_max)
print(time.perf_counter() - s)
s = time.perf_counter()
listcomprehension = differenceFunction_1loop_listcomp(x, W, tau_max)
print(time.perf_counter() - s)
# confirm that the result comes out the same.
print(np.all(twoloop == listcomprehension))
# True
Performance results (approximately):
differenceFunction_2loop() = 0.47s
differenceFunction_1loop() = 0.003s
differenceFunction_1loop_listcomp() = 0.033s
I don't know of an alternative to your nested loops, but for the arithmetic you can use the numpy library. It is faster than manual operations.
import numpy as np
tmp = int(np.subtract(x[j], x[j + tau]))
I would do something like this:
>>> x = np.random.randint(0, high=32000, size=2048, dtype='int16')
>>> tau_max = 106
>>> res = np.square((x[tau_max:] - x[:-tau_max]))
However I am convinced this is not the fastest way to do it.
I was trying to make sense of the fastest answer and I just came up with a faster and simpler solution.
def autocorrelation(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

def difference(x):
    return np.dot(x, x) + (x * x)[::-1].cumsum()[::-1] - 2 * autocorrelation(x)
Solution is based on the difference function as defined in the YIN paper.
140 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
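One caveat on my reading of this last answer: the first term of difference is the total energy np.dot(x, x) rather than the energy of x[:w - tau], so for tau > 0 its values are not identical to the double loop in the question. If exact agreement with the loop is wanted, a minimal sketch could look like the following. The name difference_exact and the test sizes are hypothetical; the input is widened to float64 first so that int16 data cannot overflow.

```python
import numpy as np

def difference_exact(x, tau_max):
    """d[tau] = sum over j < len(x) - tau of (x[j] - x[j + tau])**2, for tau < tau_max."""
    x = np.asarray(x, dtype=np.float64)              # widen before any arithmetic
    w = x.size
    sq = np.concatenate(([0.0], np.cumsum(x * x)))   # sq[k] = sum of x[:k]**2
    acf = np.correlate(x, x, mode="full")[w - 1:]    # autocorrelation at lags 0 .. w-1
    tau = np.arange(tau_max)
    head = sq[w - tau]                               # energy of x[:w - tau]
    tail = sq[w] - sq[tau]                           # energy of x[tau:]
    return head + tail - 2 * acf[:tau_max]

rng = np.random.default_rng(0)
x = rng.integers(0, 32000, size=512).astype(np.int16)
naive = [sum((int(x[j]) - int(x[j + t])) ** 2 for j in range(x.size - t))
         for t in range(30)]
fast = difference_exact(x, 30)
print(np.allclose(fast, naive))
```

np.correlate uses the direct method, so for long signals the FFT-based versions above should scale better; this sketch only aims at matching the definition.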

Vectorized Implementation for a Convoluted Numpy Script

I am working on a personal project which involves predicting weather pattern movements from radar. I have three n by m numpy arrays; one with precipitation intensity values, one with the movement (in pixels) in the X direction of that precipitation and one with the movement (in pixels) in the Y direction of that precipitation. I want to use these three arrays to determine the location of the precipitation pixels using the offsets in the other two arrays.
xMax = currentReflectivity.shape[0]
yMax = currentReflectivity.shape[1]
for x in range(currentReflectivity.shape[0]):
    for y in range(currentReflectivity.shape[1]):
        targetPixelX = xOffsetArray[x,y] + x
        targetPixelY = yOffsetArray[x,y] + y
        targetPixelX = int(targetPixelX)
        targetPixelY = int(targetPixelY)
        if targetPixelX < xMax and targetPixelY < yMax:
            interpolatedReflectivity[targetPixelX,targetPixelY] = currentReflectivity[x,y]
I can't think of a way to vectorize this; any ideas?
Here's a vectorized approach making use of broadcasting -
x_arr = np.arange(currentReflectivity.shape[0])[:,None]
y_arr = np.arange(currentReflectivity.shape[1])
targetPixelX_arr = (xOffsetArray[x_arr, y_arr] + x_arr).astype(int)
targetPixelY_arr = (yOffsetArray[x_arr, y_arr] + y_arr).astype(int)
valid_mask = (targetPixelX_arr < xMax) & (targetPixelY_arr < yMax)
R = targetPixelX_arr[valid_mask]
C = targetPixelY_arr[valid_mask]
interpolatedReflectivity[R,C] = currentReflectivity[valid_mask]
Runtime test
Approaches -
def org_app(currentReflectivity, xOffsetArray, yOffsetArray):
    m,n = currentReflectivity.shape
    interpolatedReflectivity = np.zeros((m,n))
    xMax = currentReflectivity.shape[0]
    yMax = currentReflectivity.shape[1]
    for x in range(currentReflectivity.shape[0]):
        for y in range(currentReflectivity.shape[1]):
            targetPixelX = xOffsetArray[x,y] + x
            targetPixelY = yOffsetArray[x,y] + y
            targetPixelX = int(targetPixelX)
            targetPixelY = int(targetPixelY)
            if targetPixelX < xMax and targetPixelY < yMax:
                interpolatedReflectivity[targetPixelX,targetPixelY] = \
                    currentReflectivity[x,y]
    return interpolatedReflectivity

def broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray):
    m,n = currentReflectivity.shape
    interpolatedReflectivity = np.zeros((m,n))
    xMax, yMax = m,n
    x_arr = np.arange(currentReflectivity.shape[0])[:,None]
    y_arr = np.arange(currentReflectivity.shape[1])
    targetPixelX_arr = (xOffsetArray[x_arr, y_arr] + x_arr).astype(int)
    targetPixelY_arr = (yOffsetArray[x_arr, y_arr] + y_arr).astype(int)
    valid_mask = (targetPixelX_arr < xMax) & (targetPixelY_arr < yMax)
    R = targetPixelX_arr[valid_mask]
    C = targetPixelY_arr[valid_mask]
    interpolatedReflectivity[R,C] = currentReflectivity[valid_mask]
    return interpolatedReflectivity
Timings and verification -
In [276]: # Setup inputs
...: m,n = 100,110 # currentReflectivity.shape
...: max_r = 120 # xOffsetArray's extent
...: max_c = 130 # yOffsetArray's extent
...: currentReflectivity = np.random.rand(m, n)
...: xOffsetArray = np.random.randint(0,max_r,(m, n))
...: yOffsetArray = np.random.randint(0,max_c,(m, n))
In [277]: out1 = org_app(currentReflectivity, xOffsetArray, yOffsetArray)
...: out2 = broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray)
     ...: print(np.allclose(out1, out2))
In [278]: %timeit org_app(currentReflectivity, xOffsetArray, yOffsetArray)
100 loops, best of 3: 6.86 ms per loop
In [279]: %timeit broadcasting_app(currentReflectivity, xOffsetArray, yOffsetArray)
1000 loops, best of 3: 212 µs per loop
In [280]: 6860.0/212 # Speedup number
Out[280]: 32.35849056603774
I am pretty sure that you can vectorize this by just taking everything out of the loop:
targetPixelX = (xOffsetArray + np.arange(xMax).reshape(xMax, 1)).astype(int)
targetPixelY = (yOffsetArray + np.arange(yMax)).astype(int)
mask = ((targetPixelX < xMax) & (targetPixelY < yMax))
interpolatedReflectivity[targetPixelX[mask], targetPixelY[mask]] = currentReflectivity[mask]
This will be much faster but more memory intensive. Basically, targetPixelX and targetPixelY are now arrays containing the values for every pixel that were previously computed on a per-iteration basis.
Only source pixels whose target passes the mask are written to interpolatedReflectivity, at their target coordinates, similarly to what the if statement was doing in the loop.
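To see the scatter step on a case small enough to check by hand, here is a sketch with a made-up 3 by 4 array in which every pixel moves one row down and two columns to the right. The function and variable names are illustrative only, and np.indices is used here as one possible way to get the source coordinates.

```python
import numpy as np

def shift_loop(values, dx, dy):
    out = np.zeros_like(values)
    n_rows, n_cols = values.shape
    for r in range(n_rows):
        for c in range(n_cols):
            tr, tc = int(dx[r, c] + r), int(dy[r, c] + c)
            if tr < n_rows and tc < n_cols:
                out[tr, tc] = values[r, c]
    return out

def shift_vectorized(values, dx, dy):
    out = np.zeros_like(values)
    n_rows, n_cols = values.shape
    rows, cols = np.indices(values.shape)     # source coordinates of every pixel
    tr = (dx + rows).astype(int)
    tc = (dy + cols).astype(int)
    keep = (tr < n_rows) & (tc < n_cols)      # same test as the if statement in the loop
    out[tr[keep], tc[keep]] = values[keep]    # write each value at its target coordinates
    return out

values = np.arange(1.0, 13.0).reshape(3, 4)
dx = np.full((3, 4), 1)                       # one row down
dy = np.full((3, 4), 2)                       # two columns to the right
print(shift_vectorized(values, dx, dy))
```

Two things to keep in mind. When several source pixels land on the same target, the loop keeps the last one visited, while NumPy's documentation does not guarantee which value wins in an indexed assignment with repeated indices. Also, negative targets wrap around to the other side of the array in both versions, so a lower-bound check (tr >= 0 and tc >= 0) may be worth adding if offsets can be negative.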

Vectorization/optimising for loop with numpy in Python

I'm writing a script to handle some data from a sensor, represented here by the signal_gen function. As you can see in the testing function, it is quite loop-centered. Since this function is called many times, it is a bit slow, and I would love a push in the right direction for optimising it.
I have read that it is possible to replace the for loop with a vectorized array expression, but I can't get my head around how the I_avg[i] line should be written, since a single element y[i] is multiplied with the whole array x inside np.cos, and all of this is just one iteration of I_avg.
def testing(signal):
    y = np.arange(0.0108, 0.0135, 0.001)  # this one changes over time, set
                                          # to constant for easier reading
    x = np.arange(0, (len(signal)))
    I_avg = np.zeros(len(y))
    Q_avg = np.zeros_like(I_avg)
    for i in range(0, len(y)):
        I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
        Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
    D = np.power(I_avg, 2) + np.power(Q_avg, 2)
    max_index = np.argmax(D)
    phaseOut = np.arctan2(Q_avg[max_index], I_avg[max_index])

#just a test signal
def signal_gen():
    signal = np.random.random(size=251)
    return signal
One vectorized approach is to build all the phase terms at once with NumPy broadcasting and then use matrix multiplication with numpy.dot to replace the loop that computes I_avg and Q_avg, like so -
mult = 2*np.pi*y[:,None]*x
I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
Please note that for the given sample, we are competing against a loopy version that only has to iterate for 3 iterations (y being of length 3). As such, we won't be seeing a huge speedup here.
Runtime test -
In [9]: #just a test signal
...: signal = np.random.random(size=251)
...: y = np.arange(0.0108, 0.0135, 0.001)
...: x = np.arange(0, (len(signal)))
# Original approach
In [10]: %%timeit I_avg = np.zeros(len(y))
...: Q_avg = np.zeros_like(I_avg)
    ...: for i in range(0, len(y)):
    ...:     I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
    ...:     Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
10000 loops, best of 3: 68 µs per loop
# Proposed approach
In [11]: %%timeit mult = 2*np.pi*y[:,None]*x
...: I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
10000 loops, best of 3: 34.8 µs per loop
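For anyone who wants to convince themselves that the two versions agree, here is a small self-contained check using the same sample inputs. The last part is an optional variation of my own: a single complex exponential produces I and Q in one product, since exp(1j*p) = cos(p) + 1j*sin(p).

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.random(251)
y = np.arange(0.0108, 0.0135, 0.001)
x = np.arange(len(signal))

# loop version from the question
I_loop = np.array([(signal * np.cos(2 * np.pi * f * x)).sum() for f in y])
Q_loop = np.array([(signal * np.sin(2 * np.pi * f * x)).sum() for f in y])

# broadcast phases, shape (len(y), len(signal)), then one matrix-vector product each
phase = 2 * np.pi * y[:, None] * x
I_vec = np.cos(phase).dot(signal)
Q_vec = np.sin(phase).dot(signal)

# optional: one complex product gives I + 1j * Q
IQ = np.exp(1j * phase).dot(signal)

print(np.allclose(I_loop, I_vec), np.allclose(Q_loop, Q_vec))
```

Whether the complex variant is actually faster depends on the sizes involved, so it is worth timing on real data before adopting it.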
You can use np.einsum to form the outer product of y and x:
yx = 2*np.pi*np.einsum("i,j->ij", y, x)
I_avg = np.cos(yx) @ signal
Q_avg = np.sin(yx) @ signal

Vectorizing NumPy covariance for 3D array

I have a 3D numpy array of shape (t, n1, n2):
x = np.random.rand(10, 2, 4)
I need to calculate another 3D array y which is of shape (t, n1, n1) such that:
y[0] = np.cov(x[0,:,:])
...and so on for all slices along the first axis.
So, a loopy implementation would be:
y = np.zeros((10,2,2))
for i in np.arange(x.shape[0]):
    y[i] = np.cov(x[i, :, :])
Is there any way to vectorize this so I can calculate all covariance matrices in one go? I tried doing:
x1 = x.swapaxes(1, 2)
y = np.dot(x, x1)
But it didn't work.
I looked into the numpy.cov source code and tried using the default parameters. As it turns out, np.cov(x[i,:,:]) is simply:
N = x.shape[2]
m = x[i,:,:]
m = m - np.sum(m, axis=1, keepdims=True) / N
cov = np.dot(m, m.T) /(N - 1)
So, the task is to vectorize the loop over i and process all of the data in x in one go. For the mean subtraction in the third step, we can use broadcasting. The final step is a sum-reduction along the last axis, performed separately for each slice along the first axis. This can be implemented efficiently in a vectorized manner with np.einsum. Thus, the final implementation is -
N = x.shape[2]
m1 = x - x.sum(2,keepdims=1)/N
y_out = np.einsum('ijk,ilk->ijl',m1,m1) /(N - 1)
Runtime test
In [155]: def original_app(x):
     ...:     n = x.shape[0]
     ...:     y = np.zeros((n,2,2))
     ...:     for i in np.arange(x.shape[0]):
     ...:         y[i]=np.cov(x[i,:,:])
     ...:     return y
     ...: def proposed_app(x):
     ...:     N = x.shape[2]
     ...:     m1 = x - x.sum(2,keepdims=1)/N
     ...:     out = np.einsum('ijk,ilk->ijl',m1,m1) / (N - 1)
     ...:     return out
In [156]: # Setup inputs
...: n = 10000
...: x = np.random.rand(n,2,4)
In [157]: np.allclose(original_app(x),proposed_app(x))
Out[157]: True # Results verified
In [158]: %timeit original_app(x)
1 loops, best of 3: 610 ms per loop
In [159]: %timeit proposed_app(x)
100 loops, best of 3: 6.32 ms per loop
Huge speedup there: roughly 96x!
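As a side note on why the np.dot attempt in the question did not work: np.dot on two 3D arrays pairs every slice of the first array with every slice of the second, giving shape (t, n1, t, n1), whereas the @ operator (np.matmul) treats the first axis as a batch axis. A minimal sketch, assuming the same default np.cov behaviour as above (rows are variables, normalisation by N - 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((10, 2, 4))                      # (t, n1, n2)
N = x.shape[2]

centered = x - x.mean(axis=2, keepdims=True)    # remove each row's mean
y_einsum = np.einsum('ijk,ilk->ijl', centered, centered) / (N - 1)
# same contraction as a stacked matrix product: (t, n1, n2) @ (t, n2, n1)
y_matmul = centered @ centered.swapaxes(1, 2) / (N - 1)
# reference: one np.cov call per slice
y_loop = np.array([np.cov(x[i]) for i in range(x.shape[0])])

print(np.allclose(y_einsum, y_loop), np.allclose(y_matmul, y_loop))
```

Which of the two vectorized forms is faster may depend on the array sizes and the BLAS build, so it is worth timing both.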

Python vectorizing nested for loops

I'd appreciate some help in finding and understanding a pythonic way to optimize the following array manipulations in nested for loops:
import numpy
from scipy.spatial import distance

def _func(a, b, radius):
    "Return 1 if the distance between a and b is less than radius, otherwise return 0"
    if distance.euclidean(a, b) < radius:
        return 1
    return 0

def _make_mask(volume, roi, radius):
    mask = numpy.zeros(volume.shape)
    for x in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for z in range(volume.shape[2]):
                mask[x, y, z] = _func((x, y, z), roi, radius)
    return mask
Here volume is an ndarray of shape (182, 218, 200), roi is an ndarray of shape (3,), and radius is an int.
Approach #1
Here's a vectorized approach -
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2
Possible improvement : We can probably speedup the last step with numexpr module -
import numexpr as ne
mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
Approach #2
We can also build the three ranges corresponding to the shape parameters and subtract the three elements of roi on the fly, without actually creating the full meshes as done earlier with np.mgrid. Broadcasting then combines the three 1D arrays efficiently. The implementation would look like this -
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2
Simplified version: thanks to @Bi Rico for suggesting an improvement here; we can use np.ogrid to perform those operations a bit more concisely, like so -
m,n,r = volume.shape
x,y,z = [g - c for g, c in zip(np.ogrid[0:m,0:n,0:r], roi)]  # subtract roi[k] from the k-th open grid
mask = (x**2+y**2+z**2) < radius**2
Runtime test
Function definitions -
def vectorized_app1(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return X**2 + Y**2 + Z**2 < radius**2

def vectorized_app1_improved(volume, roi, radius):
    m,n,r = volume.shape
    x,y,z = np.mgrid[0:m,0:n,0:r]
    X = x - roi[0]
    Y = y - roi[1]
    Z = z - roi[2]
    return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')

def vectorized_app2(volume, roi, radius):
    m,n,r = volume.shape
    vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
           ((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
    return vals < radius**2

def vectorized_app2_simplified(volume, roi, radius):
    m,n,r = volume.shape
    # subtract roi[k] from the k-th open grid
    x,y,z = [g - c for g, c in zip(np.ogrid[0:m,0:n,0:r], roi)]
    return (x**2+y**2+z**2) < radius**2
Timings -
In [106]: # Setup input arrays
...: volume = np.random.rand(90,110,100) # Half of original input sizes
...: roi = np.random.rand(3)
...: radius = 3.4
In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop
In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop
In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop
In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop
In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop
So, as always, broadcasting shows its magic: an almost 10,000x speedup over the original code, and the on-the-fly broadcasted operations are more than 10x faster than creating the meshes!
Say you first build an xyz array:
import itertools
xyz = [np.array(p) for p in itertools.product(range(volume.shape[0]), range(volume.shape[1]), range(volume.shape[2]))]
Now, using numpy.linalg.norm,
np.linalg.norm(xyz - roi, axis=1) < radius
checks whether the distance for each tuple from roi is smaller than radius.
Finally, just reshape the result to the dimensions you need.
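Put together, the steps of this last answer might look like the sketch below. The small shape, roi and radius are made-up stand-ins so that the result is easy to check. itertools.product varies the last index fastest, which matches the default (C) order that reshape expects.

```python
import itertools
import numpy as np

shape = (4, 5, 6)                      # small stand-in for volume.shape
roi = np.array([1.0, 2.0, 3.0])
radius = 2

# one (x, y, z) row per voxel, in the same order as volume.reshape(-1)
xyz = np.array(list(itertools.product(*(range(s) for s in shape))))
inside = np.linalg.norm(xyz - roi, axis=1) < radius
mask = inside.reshape(shape)           # back to the 3D layout of the volume
print(mask.sum())
```

For the full-size volume in the question this builds a Python list with about 7.9 million tuples, so I would expect it to be much slower and more memory hungry than the broadcasting versions above.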

