I have some code which generates the coordinates of a cylindrically-symmetric surface, with coordinates given as (r, theta, phi). At the moment, I generate the coordinates of one phi slice and store that in an N x 2 array (for N bins), and then in a for loop I copy this array for each value of phi from 0 to 2pi:
import numpy as np
# this is the number of bins that my surface is chopped into
numbins = 50
# these are the values for r
r_vals = np.linspace(0.0001, 50, numbins, endpoint = True)
# these are the values for theta
theta_vals = np.linspace(0, np.pi / 2, numbins, endpoint = True)
# I combine the r and theta values into a (numbins, 2) array for one "slice" of phi
phi_slice = np.vstack([r_vals,theta_vals]).T
# this is the array which will store all of the coordinates of the surface
surface = np.zeros((numbins**2,3))
lower_bound = 0
upper_bound = numbins
# this is the size of the bin for phi
dphi = (2 * np.pi) / numbins
# this is the for loop I'd like to eliminate.
# For every value of phi, it puts a copy of the phi_slice array into
# the surface array, so that the surface is cylindrical about phi.
for phi in np.linspace(0, (2 * np.pi) - dphi, numbins):
    surface[lower_bound:upper_bound, :2] = phi_slice
    surface[lower_bound:upper_bound, 2] = phi
    lower_bound += numbins
    upper_bound += numbins
I'm calling this routine in a numerical integration of 1e6 or 1e7 steps, and while numbins is 50 in the example above, in practice it'll be in the thousands. This for loop is a choking point, and I'd really like to eliminate it to speed things up. Is there a good NumPythonic way to do the same thing as this for loop?
Timing that loop:
In [9]: %%timeit
...: lower_bound, upper_bound = 0, numbins
...: for phi in np.linspace(0, (2 * np.pi) - dphi, numbins):
...:     surface[lower_bound:upper_bound, :2] = phi_slice
...:     surface[lower_bound:upper_bound, 2] = phi
...:     lower_bound += numbins
...:     upper_bound += numbins
10000 loops, best of 3: 176 µs per loop
Offhand that doesn't look bad, though if it's repeated in some larger context it does matter. You are looping 50 times to fill an array of 7500 elements. For the size of the task, the number of loops isn't large.
Daniel's alternative is a little faster, but not drastically so:
In [12]: %%timeit
...: phi_slices = np.tile(phi_slice.T, numbins).T
...: phi_indices = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)
...: surface1 = np.c_[phi_slices, phi_indices]
...:
10000 loops, best of 3: 137 µs per loop
kazemakase's is a bit better still:
In [15]: %%timeit
...: phis = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)[:, np.newaxis]
...: slices = np.repeat(phi_slice[np.newaxis, :, :], numbins, axis=0).reshape(-1, 2)
...: surface2 = np.hstack([slices, phis])
...:
10000 loops, best of 3: 115 µs per loop
And my nomination (thanks to the others for helping me see the pattern): take advantage of broadcasting in the assignment.
surface3 = np.zeros((numbins,numbins,3))
phis = np.linspace(0, (2 * np.pi) - dphi, numbins)
surface3[:,:,2] = phis[:,None]
surface3[:,:,:2] = phi_slice[None,:,:]
surface3.shape = (numbins**2,3)
A bit better:
In [50]: %%timeit
...: surface3 = np.zeros((numbins,numbins,3))
...: phis=np.linspace(0, (2 * np.pi) - dphi, numbins)
...: surface3[:,:,2]=phis[:,None]
...: surface3[:,:,:2]=phi_slice[None,:,:]
...: surface3.shape=(numbins**2,3)
...:
10000 loops, best of 3: 73.3 µs per loop
edit
Replacing:
surface3[:,:,:2]=phi_slice[None,:,:]
with
surface3[:,:,0]=r_vals[None,:]
surface3[:,:,1]=theta_vals[None,:]
squeezes out a bit more of a time improvement, especially if phi_slice was constructed solely for this use.
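For completeness, a minimal sketch of that direct variant (assuming the arrays from the question are still in scope), with a check against the loop-built surface:

surface4 = np.zeros((numbins, numbins, 3))
phis = np.linspace(0, (2 * np.pi) - dphi, numbins)
surface4[:, :, 0] = r_vals[None, :]      # r varies along the slice axis
surface4[:, :, 1] = theta_vals[None, :]  # theta varies along the slice axis
surface4[:, :, 2] = phis[:, None]        # phi is constant within each slice
surface4.shape = (numbins**2, 3)

print(np.allclose(surface, surface4))
# True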
Building this sort of blockwise array is done with np.repeat and np.tile
phi_slices = np.tile(phi_slice.T, numbins).T
phi_indices = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)
surface = np.c_[phi_slices, phi_indices]
This is a possible way to vectorize the loop:
phis = np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins)[:, np.newaxis]
slices = np.repeat(phi_slice[np.newaxis, :, :], numbins, axis=0).reshape(-1, 2)
surface2 = np.hstack([slices, phis])
print(np.allclose(surface, surface2))
# True
Here's what is going on in detail:
np.repeat(np.linspace(0, (2 * np.pi) - dphi, numbins), numbins) takes the array with all values of phi and repeats each element numbins times. [:, np.newaxis] brings the result into 2D shape (numbins**2, 1).
phi_slice[np.newaxis, :, :] brings phi_slice into 3D shape (1, numbins, 2). This array is repeated along the first axis, resulting in shape (numbins, numbins, 2). Finally reshape(-1, 2) brings the first two dimensions back together to the final shape (numbins**2, 2).
np.hstack([slices, phis]) combines both parts of the final array.
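To make the repeat/newaxis steps concrete, here is the same pattern on a tiny made-up array (shapes chosen only for illustration):

import numpy as np

a = np.array([[1, 2],
              [3, 4],
              [5, 6]])  # stands in for phi_slice with numbins = 3

# add a leading axis, repeat along it, then merge the first two axes
stacked = np.repeat(a[np.newaxis, :, :], 3, axis=0)  # shape (3, 3, 2)
flat = stacked.reshape(-1, 2)                        # shape (9, 2)

# each "phi" value is repeated once per row of a
phi = np.repeat(np.array([0.0, 0.1, 0.2]), 3)[:, np.newaxis]  # shape (9, 1)
print(np.hstack([flat, phi]).shape)
# (9, 3)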
This seems more of a direct question. I will generalize it a bit at the end.
I am trying to implement this function in NumPy. I have been successful using nested for loops, but I can't think of a NumPy way to do it.
My implementation:
import numpy as np
from scipy.special import softmax  # assumed source of the softmax used below

bs = 10 # batch_size
nb = 8 # number of bounding boxes
nc = 15 # number of classes
bbox = np.random.random(size=(bs, nb, 4)) # model output bounding boxes
p = np.random.random(size=(bs, nb, nc)) # model output probability
p = softmax(p, axis=-1)
s_rand = np.random.random(size=(nc, nc))
s = (s_rand + s_rand.T)/2 # similarity matrix
pp = np.random.random(size=(bs, nb, nc)) # proposed probability
pp = softmax(pp, axis=-1)
first_term = 0
for b in range(nb):
    for b_1 in range(nb):
        if b_1 == b:
            continue
        for l in range(nc):
            for l_1 in range(nc):
                first_term += (s[l, l_1] * (pp[:, b, l] - pp[:, b_1, l_1])**2)

second_term = 0
for b in range(nb):
    for l in range(nc):
        second_term += (np.linalg.norm(s[l, :], ord=1) * (pp[:, b, l] - p[:, b, l])**2)
second_term *= nb

epsilon = 0.5
output = ((1 - epsilon) * first_term) + (epsilon * second_term)
I have tried hard to remove the loops and use np.tile and np.repeat instead, but I can't think of a way to do it.
I have also searched for exercises that could help me learn this kind of conversion to NumPy, but wasn't successful.
P_hat.shape is (B,L), S.shape is (L,L), P.shape is (B,L).
array_before_sum = S[None,:,None,:]*(P_hat[:,:,None,None]- P_hat[None,None,:,:])**2
array_after_sum = array_before_sum.sum(axis=(1,3))
array_sum_again = (array_after_sum * (1 - np.eye(B))).sum()  # zero out the b == b_1 terms
first_term = (1-epsilon)*array_sum_again
second_term = epsilon*(B*np.abs(S).sum(axis=1)[None,:]*(P_hat - P)**2).sum()
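A quick numerical check of the first-term broadcasting against the nested loops from the question, on a small random example in the same (B, L) convention (the test itself, including the sizes, is mine):

import numpy as np

B, L = 4, 5
np.random.seed(0)
S = np.random.random((L, L))
P_hat = np.random.random((B, L))

# reference: the nested loops, skipping b_1 == b
ref = 0.0
for b in range(B):
    for b_1 in range(B):
        if b_1 == b:
            continue
        for l in range(L):
            for l_1 in range(L):
                ref += S[l, l_1] * (P_hat[b, l] - P_hat[b_1, l_1]) ** 2

# broadcast version from above
array_before_sum = S[None, :, None, :] * (P_hat[:, :, None, None] - P_hat[None, None, :, :]) ** 2
array_after_sum = array_before_sum.sum(axis=(1, 3))
array_sum_again = (array_after_sum * (1 - np.eye(B))).sum()

print(np.allclose(ref, array_sum_again))
# True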
I think you can do both with einsum
first_term = np.einsum('km, ijklm -> i', s, (pp[..., None, None] - pp[:, None, None, ...])**2 )
second_term = np.einsum('k, ijk -> i', np.linalg.norm(s, axis = 1), (pp - p)**2 )
Now there's a problem: that ijklm tensor in first_term is going to get huge if nb and nc get large. You should probably distribute it so that you get 3 smaller tensors:
first_term = nb * np.einsum('km, ijk, ijk -> i', s, pp, pp) +\
             nb * np.einsum('km, ilm, ilm -> i', s, pp, pp) -\
             2 * np.einsum('km, ijk, ilm -> i', s, pp, pp)
# the factor nb accounts for the bounding-box axis that is summed out of each squared term
This takes advantage of the fact that (a - b)**2 = a**2 + b**2 - 2*a*b to break the problem into three parts, each of which can be done in one step with the dot product.
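A small sanity check of the expanded form against the direct five-index version, on tiny sizes so that the ijklm tensor stays manageable (sizes and seed are mine):

import numpy as np

bs, nb, nc = 2, 3, 4
np.random.seed(0)
s = np.random.random((nc, nc))
pp = np.random.random((bs, nb, nc))

direct = np.einsum('km, ijklm -> i', s,
                   (pp[..., None, None] - pp[:, None, None, ...]) ** 2)

expanded = (nb * np.einsum('km, ijk, ijk -> i', s, pp, pp)
            + nb * np.einsum('km, ilm, ilm -> i', s, pp, pp)
            - 2 * np.einsum('km, ijk, ilm -> i', s, pp, pp))

print(np.allclose(direct, expanded))
# True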
Maximally optimized code (removal of the first two loops is inspired by L.Iridium's answer):
squared_diff = (pp[:, :, None, :, None] - pp[:, None, :, None, :]) ** 2
weighted_diff = s * squared_diff
b_eq_b_1_removed = weighted_diff.sum(axis=(3, 4)) * (1 - np.eye(nb))
first_term = b_eq_b_1_removed.sum(axis=(1, 2))

normalized_s = np.linalg.norm(s, ord=1, axis=1)
squared_diff = (pp - p) ** 2
second_term = nb * (normalized_s * squared_diff).sum(axis=(1, 2))

loss = ((1 - epsilon) * first_term) + (epsilon * second_term)
timeit result:
512 µs ± 13 µs per loop
timeit result for the code posted in the question:
62.5 ms ± 197 µs per loop
That's a huge improvement.
My code makes numerous calls to a "difference function" while computing the YIN algorithm (a fundamental frequency extractor).
The difference function (eq. 6 in the paper) is

d_t(τ) = Σ_{j=1..W} (x_j - x_{j+τ})²

and this is my implementation of it (with the inner sum truncated so the indices stay inside the array):
def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = long(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df
For instance with:
x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)
Is there a way to optimize this double-loop computation (in Python only, preferably with no libraries other than NumPy)?
EDIT: Changed code to avoid IndexError (j loop, @Elliot's answer)
EDIT2: Changed code to use x[0] (j loop, @hynekcer's comment)
EDIT: Improved speed to 220 µs - see edit at the end - direct version
The required calculation can be evaluated easily via the autocorrelation function, or equivalently by convolution. The Wiener–Khinchin theorem allows computing the autocorrelation with two fast Fourier transforms (FFT), with time complexity O(n log n).
I use the accelerated convolution function fftconvolve from the SciPy package. An advantage is that it is easy to explain here why it works. Everything is vectorized, with no loop at the Python interpreter level.
import numpy as np
from scipy.signal import fftconvolve

def difference_by_convol(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    conv = fftconvolve(x, x[::-1])
    df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]
    return df[:tau_max + 1]
Compared with the differenceFunction_1loop function in Elliot's answer, the FFT version is faster: 430 µs compared to the original 1170 µs. It starts being faster at roughly tau_max >= 40. The numerical accuracy is great; the highest relative error is less than 1e-14 compared to the exact integer result. (It could therefore be rounded exactly to the long integer solution.)
The parameter tau_max is not important for the algorithm; it only restricts the output at the end. A zero element at index 0 is added to the output because indices should start at 0 in Python.
The parameter W is not needed in Python; the size is better obtained by introspecting the data.
The data are converted to np.float64 initially to prevent repeated conversions; this is about half a percent faster. Any type smaller than np.int64 would be unacceptable because of overflow.
The required difference function is the sum of two energy terms minus twice the autocorrelation. The cross term can be evaluated by convolution: correlate(x, x) == convolve(x, reversed(x)).
"As of Scipy v0.19 normal convolve automatically chooses this method or the direct method based on an estimation of which is faster." That heuristic is not adequate for this case, because the convolution evaluates many more lags than tau_max, so it must be outweighed by an FFT that is much faster than the direct method.
It can also be calculated with NumPy's fft module, without SciPy, by rewriting the answer Calculate autocorrelation using FFT in matlab to Python (see the end of this answer). I think the solution above is easier to understand.
Proof: (for Pythonistas :-)
The original naive implementation can be written as:
df = [sum((x[j] - x[j + t]) ** 2 for j in range(w - t)) for t in range(tau_max + 1)]
where tau_max < w.
Expand using the rule (a - b)**2 == a**2 + b**2 - 2 * a * b:
df = [  sum(x[j] ** 2 for j in range(w - t))
      + sum(x[j] ** 2 for j in range(t, w))
      - 2 * sum(x[j] * x[j + t] for j in range(w - t))
      for t in range(tau_max + 1)]
Substitute the first two sums with the help of x_cumsum = [sum(x[j] ** 2 for j in range(i)) for i in range(w + 1)], which can be calculated in linear time. Substitute sum(x[j] * x[j + t] for j in range(w - t)) by the convolution conv = fftconvolve(x, x[::-1], mode='full'), which has output of size len(x) + len(x) - 1.
df = [x_cumsum[w - t] + x_cumsum[w] - x_cumsum[t]
      - 2 * conv[w - 1 + t]
      for t in range(tau_max + 1)]
Optimize by vector expressions:
df = x_cumsum[w:0:-1] + x_cumsum[w] - x_cumsum[:w] - 2 * conv[w - 1:]
Every step can also be tested and compared against test data numerically.
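For example, a quick consistency check of difference_by_convol against the naive list comprehension on small random data (the test data and sizes are mine; difference_by_convol is the function defined above):

import numpy as np

np.random.seed(0)
x = np.random.randint(0, 32000, size=256).astype(np.int64)
w, tau_max = x.size, 40

naive = [sum((x[j] - x[j + t]) ** 2 for j in range(w - t)) for t in range(tau_max + 1)]
fast = difference_by_convol(x, w, tau_max)

print(np.allclose(naive, fast))
# True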
EDIT: Implemented the solution directly with NumPy's FFT.
def difference_fft(x, W, tau_max):
    x = np.array(x, np.float64)
    w = x.size
    tau_max = min(tau_max, w)
    x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
    size = w + tau_max
    p2 = (size // 32).bit_length()
    nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
    size_pad = min(n * 2 ** p2 for n in nice_numbers if n * 2 ** p2 >= size)
    fc = np.fft.rfft(x, size_pad)
    conv = np.fft.irfft(fc * fc.conjugate())[:tau_max]
    return x_cumsum[w:w - tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv
It is more than twice as fast as my previous solution because the length of the convolution is restricted to the nearest "nice" number with small prime factors above W + tau_max, rather than the full 2 * W. It is also not necessary to transform the same data twice, as it was with fftconvolve(x, reversed(x)).
In [211]: %timeit differenceFunction_1loop(x, W, tau_max)
1.1 ms ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [212]: %timeit difference_by_convol(x, W, tau_max)
431 µs ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [213]: %timeit difference_fft(x, W, tau_max)
218 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The newest solution is faster than difference_by_convol for tau_max >= 20. That ratio doesn't depend much on the data size because of the similar ratio of overhead costs.
First of all, you should consider the boundaries of the array. Your code as originally written would get an IndexError.
You can get a significant speedup by vectorizing the inner loop:
import numpy as np

# original version
def differenceFunction_2loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        for j in range(0, W - tau):  # -tau eliminates the IndexError
            tmp = np.long(x[j] - x[j + tau])
            df[tau] += np.square(tmp)
    return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        tmp = (x[:-tau]) - (x[tau:]).astype(np.long)
        df[tau] = np.dot(tmp, tmp)
    return df

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
twoloop = differenceFunction_2loop(x, W, tau_max)
oneloop = differenceFunction_1loop(x, W, tau_max)

# confirm that the result comes out the same.
print(np.all(twoloop == oneloop))
# True
Now for some benchmarking. In IPython I get the following:
In [103]: %timeit twoloop = differenceFunction_2loop(x, W, tau_max)
1 loop, best of 3: 2.35 s per loop
In [104]: %timeit oneloop = differenceFunction_1loop(x, W, tau_max)
100 loops, best of 3: 8.23 ms per loop
So, about a 300 fold speedup.
As an alternative to optimizing the algorithm, you can optimize the interpreter with numba.jit:
import timeit
import numpy as np
from numba import jit

def differenceFunction(x, W, tau_max):
    df = [0] * tau_max
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df

@jit
def differenceFunction2(x, W, tau_max):
    df = np.zeros(tau_max)  # start from zeros because the loop accumulates with +=
    for tau in range(1, tau_max):
        for j in range(0, W - tau):
            tmp = int(x[j] - x[j + tau])
            df[tau] += tmp * tmp
    return df
x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106
differenceFunction(x, W, tau_max)
print('old',
timeit.timeit('differenceFunction(x, W, tau_max)', 'from __main__ import differenceFunction, x, W, tau_max',
number=20) / 20)
print('new',
timeit.timeit('differenceFunction2(x, W, tau_max)', 'from __main__ import differenceFunction2, x, W, tau_max',
number=20) / 20)
Result:
old 0.18265145074453273
new 0.016223197058214667
You can combine optimization of algorithm and numba.jit for a better result.
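For instance, a minimal sketch of one such combination (my own, untimed): keep the outer loop over tau, replace the inner j loop with whole-array operations, and jit-compile the result.

import numpy as np
from numba import jit

@jit(nopython=True)
def differenceFunction_1loop_jit(x, tau_max):
    w = x.size
    xf = x.astype(np.float64)
    df = np.zeros(tau_max)
    for tau in range(1, tau_max):
        tmp = xf[:w - tau] - xf[tau:]  # vectorized inner loop
        df[tau] = (tmp * tmp).sum()
    return df

As with any numba function, the first call pays the compilation cost, so call it once before timing.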
Here's another approach using a list comprehension. It takes less than a tenth of the time taken by the original function, but does not beat Elliot's answer. Just putting it out there anyway.
import numpy as np
import time

# original version
def differenceFunction_2loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        for j in range(0, W - tau):  # -tau eliminates the IndexError
            tmp = np.long(x[j] - x[j + tau])
            df[tau] += np.square(tmp)
    return df

# vectorized inner loop
def differenceFunction_1loop(x, W, tau_max):
    df = np.zeros(tau_max, np.long)
    for tau in range(1, tau_max):
        tmp = (x[:-tau]) - (x[tau:]).astype(np.long)
        df[tau] = np.dot(tmp, tmp)
    return df

# with list comprehension
def differenceFunction_1loop_listcomp(x, W, tau_max):
    df = [sum(((x[:-tau]) - (x[tau:]).astype(np.long))**2) for tau in range(1, tau_max)]
    return [0] + df[:]

x = np.random.randint(0, high=32000, size=2048, dtype='int16')
W = 2048
tau_max = 106

s = time.clock()
twoloop = differenceFunction_2loop(x, W, tau_max)
print(time.clock() - s)

s = time.clock()
oneloop = differenceFunction_1loop(x, W, tau_max)
print(time.clock() - s)

s = time.clock()
listcomprehension = differenceFunction_1loop_listcomp(x, W, tau_max)
print(time.clock() - s)

# confirm that the result comes out the same.
print(np.all(twoloop == listcomprehension))
# True
Performance results (approximately):
differenceFunction_2loop() = 0.47s
differenceFunction_1loop() = 0.003s
differenceFunction_1loop_listcomp() = 0.033s
I don't know how to avoid your nested-loops problem, but for the arithmetic you can use the numpy library; it is faster than manual element-by-element operations.
import numpy as np
tmp = np.subtract(x[j], x[j + tau])
I would do something like this:
>>> x = np.random.randint(0, high=32000, size=2048, dtype='int16')
>>> tau_max = 106
>>> res = np.square((x[tau_max:] - x[:-tau_max]))
However I am convinced this is not the fastest way to do it.
I was trying to make sense of the fastest answer and I just came up with a faster and simpler solution.
def autocorrelation(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

def difference(x):
    return np.dot(x, x) + (x * x)[::-1].cumsum()[::-1] - 2 * autocorrelation(x)
The solution is based on the difference function as defined in the YIN paper.
%%timeit
difference(frame)
140 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
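frame is not defined in the snippet above; a minimal usage example with stand-in data (the data are mine):

import numpy as np

np.random.seed(0)
frame = np.random.randint(0, 32000, size=2048).astype(np.float64)
d = difference(frame)   # one value per lag tau, with d[0] == 0
print(d.shape)          # (2048,)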
I'm writing a script to handle some data from a sensor, represented by the signal_gen function. As you can see in the testing function, it is quite loop-centered. Since this function is called many times, that makes it a bit slow, and I would love a push in the right direction for optimising it.
I have read that it is possible to replace the for loop with a vectorized operation, but I can't get my head around how the I_avg[i] line should be written, since we have a single element y[i] multiplied with the whole array x inside np.cos, and all of this is just one iteration of I_avg.
def testing(signal):
    y = np.arange(0.0108, 0.0135, 0.001)  # this one changes over time, set
                                          # to constant for easier reading
    x = np.arange(0, (len(signal)))
    I_avg = np.zeros(len(y))
    Q_avg = np.zeros_like(I_avg)

    for i in range(0, len(y)):
        I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
        Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()

    D = np.power(I_avg, 2) + np.power(Q_avg, 2)
    max_index = np.argmax(D)
    phaseOut = np.arctan2(Q_avg[max_index], I_avg[max_index])

# just a test signal
def signal_gen():
    signal = np.random.random(size=251)
    return signal
One vectorized approach is to use matrix multiplication with numpy.dot to replace the nested loop and give us I_avg and Q_avg, incorporating NumPy broadcasting along the way for a more efficient solution, like so -
mult = 2*np.pi*y[:,None]*x
I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
Please note that for the given sample, we are competing against a loopy version that only has to iterate for 3 iterations (y being of length 3). As such, we won't be seeing a huge speedup here.
Runtime test -
In [9]: #just a test signal
...: signal = np.random.random(size=251)
...: y = np.arange(0.0108, 0.0135, 0.001)
...: x = np.arange(0, (len(signal)))
...:
# Original approach
In [10]: %%timeit I_avg = np.zeros(len(y))
...: Q_avg = np.zeros_like(I_avg)
...: for i in range(0, len(y)):
...:     I_avg[i] = np.array(signal * (np.cos(2 * np.pi * y[i] * x))).sum()
...:     Q_avg[i] = np.array(signal * (np.sin(2 * np.pi * y[i] * x))).sum()
...:
10000 loops, best of 3: 68 µs per loop
# Proposed approach
In [11]: %%timeit mult = 2*np.pi*y[:,None]*x
...: I_avg, Q_avg = np.cos(mult).dot(signal), np.sin(mult).dot(signal)
...:
10000 loops, best of 3: 34.8 µs per loop
You can use np.einsum for broadcasting:
yx = 2*np.pi*np.einsum("i,j->ij", y, x)
I_avg = np.cos(yx) @ signal
Q_avg = np.sin(yx) @ signal
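A quick check of this route against the loop from the question, using the test signal from signal_gen (the check itself is mine):

import numpy as np

np.random.seed(0)
signal = np.random.random(size=251)
y = np.arange(0.0108, 0.0135, 0.001)
x = np.arange(0, len(signal))

yx = 2 * np.pi * np.einsum("i,j->ij", y, x)
I_avg = np.cos(yx) @ signal
Q_avg = np.sin(yx) @ signal

# reference: the original loop
I_ref = np.array([(signal * np.cos(2 * np.pi * yi * x)).sum() for yi in y])
Q_ref = np.array([(signal * np.sin(2 * np.pi * yi * x)).sum() for yi in y])
print(np.allclose(I_avg, I_ref), np.allclose(Q_avg, Q_ref))
# True True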
I have a 3D numpy array of shape (t, n1, n2):
x = np.random.rand(10, 2, 4)
I need to calculate another 3D array y which is of shape (t, n1, n1) such that:
y[0] = np.cov(x[0,:,:])
...and so on for all slices along the first axis.
So, a loopy implementation would be:
y = np.zeros((10, 2, 2))
for i in np.arange(x.shape[0]):
    y[i] = np.cov(x[i, :, :])
Is there any way to vectorize this so I can calculate all covariance matrices in one go? I tried doing:
x1 = x.swapaxes(1, 2)
y = np.dot(x, x1)
But it didn't work.
I hacked into the numpy.cov source code and tried using the default parameters. As it turns out, np.cov(x[i,:,:]) is simply:
N = x.shape[2]
m = x[i,:,:]
m -= np.sum(m, axis=1, keepdims=True) / N
cov = np.dot(m, m.T) /(N - 1)
So, the task is to vectorize that computation across i and process all of the data from x in one go. The mean subtraction in the third step can be done with broadcasting. The final step is a sum-reduction (the matrix product) performed for all slices along the first axis, which can be implemented efficiently in a vectorized manner with np.einsum. Thus, the final implementation comes to this -
N = x.shape[2]
m1 = x - x.sum(2,keepdims=1)/N
y_out = np.einsum('ijk,ilk->ijl',m1,m1) /(N - 1)
Runtime test
In [155]: def original_app(x):
...:     n = x.shape[0]
...:     y = np.zeros((n,2,2))
...:     for i in np.arange(x.shape[0]):
...:         y[i] = np.cov(x[i,:,:])
...:     return y
...:
...: def proposed_app(x):
...:     N = x.shape[2]
...:     m1 = x - x.sum(2,keepdims=1)/N
...:     out = np.einsum('ijk,ilk->ijl',m1,m1) / (N - 1)
...:     return out
...:
...:
In [156]: # Setup inputs
...: n = 10000
...: x = np.random.rand(n,2,4)
...:
In [157]: np.allclose(original_app(x),proposed_app(x))
Out[157]: True # Results verified
In [158]: %timeit original_app(x)
1 loops, best of 3: 610 ms per loop
In [159]: %timeit proposed_app(x)
100 loops, best of 3: 6.32 ms per loop
Huge speedup there!
I'd appreciate some help in finding and understanding a pythonic way to optimize the following array manipulations in nested for loops:
import numpy
from scipy.spatial import distance

def _func(a, b, radius):
    "Return 1 if the Euclidean distance between a and b is less than radius, otherwise 0"
    if distance.euclidean(a, b) < radius:
        return 1
    else:
        return 0

def _make_mask(volume, roi, radius):
    mask = numpy.zeros(volume.shape)
    for x in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for z in range(volume.shape[2]):
                mask[x, y, z] = _func((x, y, z), roi, radius)
    return mask
where volume (an ndarray of shape (182, 218, 200)) and roi (an ndarray of shape (3,)) are given, and radius is an int.
Approach #1
Here's a vectorized approach -
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
mask = X**2 + Y**2 + Z**2 < radius**2
Possible improvement: we can probably speed up the last step with the numexpr module -
import numexpr as ne
mask = ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
Approach #2
We can also gradually build the three ranges corresponding to the shape parameters and perform the subtraction against the three elements of roi on the fly, without actually creating the meshes as done earlier with np.mgrid. Broadcasting handles the rest, for efficiency. The implementation would look like this -
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
mask = vals < radius**2
Simplified version: thanks to @Bi Rico for suggesting an improvement here, as we can use np.ogrid to perform those operations a bit more concisely, like so -
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
mask = (x**2+y**2+z**2) < radius**2
Runtime test
Function definitions -
def vectorized_app1(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
return X**2 + Y**2 + Z**2 < radius**2
def vectorized_app1_improved(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.mgrid[0:m,0:n,0:r]
X = x - roi[0]
Y = y - roi[1]
Z = z - roi[2]
return ne.evaluate('X**2 + Y**2 + Z**2 < radius**2')
def vectorized_app2(volume, roi, radius):
m,n,r = volume.shape
vals = ((np.arange(m)-roi[0])**2)[:,None,None] + \
((np.arange(n)-roi[1])**2)[:,None] + ((np.arange(r)-roi[2])**2)
return vals < radius**2
def vectorized_app2_simplified(volume, roi, radius):
m,n,r = volume.shape
x,y,z = np.ogrid[0:m,0:n,0:r]-roi
return (x**2+y**2+z**2) < radius**2
Timings -
In [106]: # Setup input arrays
...: volume = np.random.rand(90,110,100) # Half of original input sizes
...: roi = np.random.rand(3)
...: radius = 3.4
...:
In [107]: %timeit _make_mask(volume, roi, radius)
1 loops, best of 3: 41.4 s per loop
In [108]: %timeit vectorized_app1(volume, roi, radius)
10 loops, best of 3: 62.3 ms per loop
In [109]: %timeit vectorized_app1_improved(volume, roi, radius)
10 loops, best of 3: 47 ms per loop
In [110]: %timeit vectorized_app2(volume, roi, radius)
100 loops, best of 3: 4.26 ms per loop
In [139]: %timeit vectorized_app2_simplified(volume, roi, radius)
100 loops, best of 3: 4.36 ms per loop
So, as always, broadcasting shows its magic: a crazy, almost 10,000x speedup over the original code, and more than 10x better than creating meshes, by using on-the-fly broadcasted operations!
Say you first build an xyz array:
import itertools
xyz = [np.array(p) for p in itertools.product(range(volume.shape[0]), range(volume.shape[1]), range(volume.shape[2]))]
Now, using numpy.linalg.norm,
np.linalg.norm(xyz - roi, axis=1) < radius
checks whether the distance for each tuple from roi is smaller than radius.
Finally, just reshape the result to the dimensions you need.
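Put together, a sketch of the whole approach (the sizes here are mine and smaller than the original (182, 218, 200) volume, since materializing every coordinate takes a lot of memory):

import itertools
import numpy as np

volume = np.zeros((30, 40, 50))      # small stand-in volume
roi = np.array([10.0, 20.0, 25.0])
radius = 8

xyz = np.array(list(itertools.product(range(volume.shape[0]),
                                      range(volume.shape[1]),
                                      range(volume.shape[2]))))
mask = (np.linalg.norm(xyz - roi, axis=1) < radius).reshape(volume.shape)
print(mask.shape)  # (30, 40, 50)

itertools.product iterates with the last axis fastest, which matches NumPy's default C-order reshape, so the flat result maps back onto the volume correctly.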