Suppose that you have an array and want to create another array whose values are the standard deviations of successive 10-element windows of the first array. With a for loop it can be written easily, as in the code below. What I want to do is avoid the for loop for faster execution. Any suggestions?
Code
import numpy as np

a = np.arange(20)
b = np.empty(11)
for i in range(11):
    b[i] = np.std(a[i:i+10])
You could create a 2D array of sliding windows with np.lib.stride_tricks.as_strided; those windows would be views into the given 1D array and as such won't occupy any extra memory. Then simply use np.std along the second axis (axis=1) for the final result in a vectorized way, like so -
W = 10 # Window size
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
out = np.std(a2D, axis=1)
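On newer NumPy versions (1.20+) you can get the same windowed view without manual stride arithmetic via np.lib.stride_tricks.sliding_window_view; a minimal equivalent sketch -

from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

out = np.std(sliding_window_view(a, W), axis=1)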
Runtime test
Function definitions -
def original_app(a, W):
    b = np.empty(a.size - W + 1)
    for i in range(b.size):
        b[i] = np.std(a[i:i+W])
    return b
def vectorized_app(a, W):
    nrows = a.size - W + 1
    n = a.strides[0]
    a2D = np.lib.stride_tricks.as_strided(a, shape=(nrows, W), strides=(n, n))
    return np.std(a2D, 1)
Timings and verification -
In [460]: # Inputs
...: a = np.arange(10000)
...: W = 10
...:
In [461]: np.allclose(original_app(a, W), vectorized_app(a, W))
Out[461]: True
In [462]: %timeit original_app(a, W)
1 loops, best of 3: 522 ms per loop
In [463]: %timeit vectorized_app(a, W)
1000 loops, best of 3: 1.33 ms per loop
So, around 400x speedup there!
For completeness, here's the equivalent pandas version -
import pandas as pd
def pdroll(a, W):  # a is 1D ndarray and W is window-size
    return pd.Series(a).rolling(W).std(ddof=0).values[W-1:]
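It can be checked against the loop version in the same way as before, since pandas' rolling std with ddof=0 matches np.std's default population std:

np.allclose(original_app(a, W), pdroll(a, W))  # should be True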
Not so fancy, but the code with no loops would be something like this:
a = np.arange(20)
b = [a[i:i+10].std() for i in range(len(a) - 10 + 1)]
I need to speed up this loop as it is very slow, but I don't know how to vectorize it since each value depends on the previous one. Any suggestions?
import numpy as np

sig = np.random.randn(44100)
alpha = .9887
beta = .999
out = np.zeros_like(sig)

for n in range(1, len(sig)):
    if np.abs(sig[n]) >= out[n-1]:
        out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
    else:
        out[n] = beta * out[n-1]
Numba's just-in-time compiler should deal with the indexing overhead you're facing pretty well by compiling the function to native code during its first execution:
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:
:sig = np.random.randn(44100)
:alpha = .9887
:beta = .999
:
:def nonvectorized(sig):
:    out = np.zeros_like(sig)
:
:    for n in range(1, len(sig)):
:        if np.abs(sig[n]) >= out[n-1]:
:            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
:        else:
:            out[n] = beta * out[n-1]
:    return out
:--
In [2]: nonvectorized(sig)
Out[2]:
array([ 0. , 0.01862503, 0.04124917, ..., 1.2979579 ,
1.304247 , 1.30294275])
In [3]: %timeit nonvectorized(sig)
10 loops, best of 3: 80.2 ms per loop
In [4]: from numba import jit
In [5]: vectorized = jit(nonvectorized)
In [6]: np.allclose(vectorized(sig), nonvectorized(sig))
Out[6]: True
In [7]: %timeit vectorized(sig)
1000 loops, best of 3: 249 µs per loop
EDIT: as suggested in a comment, adding jit benchmarks. jit(nonvectorized) is creating a lightweight wrapper and thus is a cheap operation.
In [8]: %timeit jit(nonvectorized)
10000 loops, best of 3: 45.3 µs per loop
The function itself is compiled during the first execution (hence just-in-time), which takes a while, but probably not as much as you might expect:
In [9]: %timeit jit(nonvectorized)(sig)
10 loops, best of 3: 169 ms per loop
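For reference, the same thing is often written with the decorator form so the compiled version is used everywhere the function is called. A minimal sketch (the function name and the explicit alpha/beta parameters are just illustrative, not from the original code):

import numpy as np
from numba import jit

@jit(nopython=True)               # compiled to native code on the first call
def envelope(sig, alpha, beta):
    out = np.zeros_like(sig)
    for n in range(1, len(sig)):
        if abs(sig[n]) >= out[n - 1]:
            out[n] = alpha * out[n - 1] + (1 - alpha) * abs(sig[n])
        else:
            out[n] = beta * out[n - 1]
    return out

out = envelope(sig, 0.9887, 0.999)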
Low vectorisation potential on a "forward-dependent-loop" code
The majority of your "vectorisation" parallelism is out of the game once the dependency is analysed. (A JIT-compiler cannot vectorise "against" such a dependence barrier either.)
You may pre-calculate some re-used values in a vectorised manner, but there is no direct python syntax (without an external JIT-compiler workaround) to arrange a forward-shifting-dependence loop computation into your CPU's vector-register-aligned co-parallel computation:
from zmq import Stopwatch       # ok to use pyzmq 2.11 for [usec] .Stopwatch()
aStopWATCH = Stopwatch()        # a performance measurement .Stopwatch() instance

sig    = np.abs(sig)            # self-destructive calc/assign avoids memalloc-OPs
aConst = ( 1 - alpha )          # avoids many repetitive SUB(s) in the loop

for thisPtr in range( 1, len( sig ) ):  # FORWARD-SHIFTING-DEPENDENCE LOOP:
    prevPtr = thisPtr - 1               # prevPtr->"previous" TimeSlice in out[] ( re-used 2 x len(sig) times )
    if sig[thisPtr] < out[prevPtr]:                                        # 1st re-use
        out[thisPtr] = out[prevPtr] * beta                                 # 2nd
    else:
        out[thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )    # 2nd
A good example of vectorised speed-up can be seen in cases where the calculation strategy can be parallelised/broadcast along the 1D, 2D or even 3D structure of a native numpy array. For a speedup of about 100x, see the RGBA-2D matrix accelerated processing in Vectorised code for PNG picture processing (an OpenGL shader pipeline).
Performance still increased by about 3x
Even this simple python code revision has increased the speed by about 2.8x (right now, i.e. without installing an ad-hoc JIT-optimising compiler):
>>> def aForwardShiftingDependenceLOOP():   # proposed code-revision
...     aStopWATCH.start()                  # ||||||||||||||||||.start
...     for thisPtr in range( 1, len( sig ) ):
...         # |vvvvvvv|---------------------# FORWARD-SHIFTING-LOOP DEPENDENCE
...         prevPtr = thisPtr - 1           # |vvvvvvv|--STEP-SHIFTING avoids Numpy syntax
...         if ( sig[ thisPtr] < out[prevPtr] ):
...             out[ thisPtr] = out[prevPtr] * beta
...         else:
...             out[ thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )
...     usec = aStopWATCH.stop()            # ||||||||||||||||||.stop
...     print usec, " [usec]"
>>> aForwardShiftingDependenceLOOP()
57593 [usec]
57879 [usec]
58085 [usec]
>>> def anOriginalForLOOP():
...     aStopWATCH.start()
...     for n in range( 1, len( sig ) ):
...         if ( np.abs( sig[n] ) >= out[n-1] ):
...             out[n] = out[n-1] * alpha + ( 1 - alpha ) * np.abs( sig[n] )
...         else:
...             out[n] = out[n-1] * beta
...     usec = aStopWATCH.stop()
...     print usec, " [usec]"
>>> anOriginalForLOOP()
164907 [usec]
165674 [usec]
165154 [usec]
I am using sympy to generate some functions for numerical calculations. Therefore I lambdify an expression and vectorize it to use it with numpy arrays. Here is an example:
import numpy as np
import sympy as sp
def numpy_function():
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = (1 - np.cos(2*np.pi*x))*(1 - np.cos(2*np.pi*y))*np.sin(np.pi*z)*0.1
    return T

def sympy_function():
    x, y, z = sp.Symbol("x"), sp.Symbol("y"), sp.Symbol("z")
    T = (1 - sp.cos(2*sp.pi*x))*(1 - sp.cos(2*sp.pi*y))*sp.sin(sp.pi*z)*0.1
    lambda_function = np.vectorize(sp.lambdify((x, y, z), T, "numpy"))
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = lambda_function(x, y, z)
    return T
The problem is the speed difference between the sympy version and the pure numpy version, i.e.
In [3]: timeit test.numpy_function()
100 loops, best of 3: 11.9 ms per loop
vs.
In [4]: timeit test.sympy_function()
1 loops, best of 3: 634 ms per loop
So is there any way to get closer to the speed of the numpy version?
I think np.vectorize is pretty slow, but somehow some part of my code does not work without it. Thank you for any suggestions.
EDIT:
So I found the reason why the vectorize function is necessary, i.e:
In [35]: y = np.arange(10)
In [36]: f = sp.lambdify(x, sp.sin(x), "numpy")
In [37]: f(y)
Out[37]:
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
-0.95892427, -0.2794155 , 0.6569866 , 0.98935825, 0.41211849])
this seems to work fine however:
In [38]: y = np.arange(10)
In [39]: f = sp.lambdify(x,1,"numpy")
In [40]: f(y)
Out[40]: 1
So for a simple expression like 1, this function doesn't return an array.
Is there a way to fix this, and isn't this some kind of bug, or at least an inconsistent design?
lambdify returns a single value for constants because no numpy functions are involved. This is because of the way lambdify works (see https://stackoverflow.com/a/25514007/161801).
But this is typically not a problem because a constant will automatically broadcast to the correct shape in any operation that you use it in with an array. On the other hand, if you explicitly worked with an array of the same constant, it would be much less efficient because you would compute the same operations multiple times.
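A small sketch of that broadcasting behaviour (the np.full_like line is just one illustrative way to force an array if you really need one):

import numpy as np
import sympy as sp

x = sp.Symbol("x")
y = np.arange(10)

f = sp.lambdify(x, 1, "numpy")
f(y)                   # returns the scalar 1
y + f(y)               # the scalar broadcasts: array([ 1,  2, ..., 10])
np.full_like(y, f(y))  # explicitly materialize an array of the constant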
Using np.vectorize() in this case amounts to looping in Python over every element of x, y and z, and that's why it becomes slower. You don't need np.vectorize() IF you tell lambdify() to use NumPy's functions, which is exactly what you are doing. Then, using:
def sympy_function():
    x, y, z = sp.Symbol("x"), sp.Symbol("y"), sp.Symbol("z")
    T = (1 - sp.cos(2*sp.pi*x))*(1 - sp.cos(2*sp.pi*y))*sp.sin(sp.pi*z)*0.1
    lambda_function = sp.lambdify((x, y, z), T, "numpy")
    x, y, z = np.mgrid[0:1:40*1j, 0:1:40*1j, 0:1:40*1j]
    T = lambda_function(x, y, z)
    return T
makes the performance comparable:
In [26]: np.allclose(numpy_function(), sympy_function())
Out[26]: True
In [27]: timeit numpy_function()
100 loops, best of 3: 4.08 ms per loop
In [28]: timeit sympy_function()
100 loops, best of 3: 5.52 ms per loop
I want to be able to vectorize this code:
def sobHypot(rec):
    a, b, c = rec.shape
    hype = np.ones((a, b, c))
    for i in xrange(c):
        x = ndimage.sobel(abs(rec[..., i])**2, axis=0, mode='constant')
        y = ndimage.sobel(abs(rec[..., i])**2, axis=1, mode='constant')
        hype[..., i] = np.hypot(x, y)
        hype[..., i] = hype[..., i].mean()
    index = hype.argmax()
    return index
where rec.shape returns (1024, 1024, 20)
Here's how you can avoid the for-loop with the sobel filter:
import numpy as np
from scipy.ndimage import sobel
def sobHypot_vec(rec):
    r = np.abs(rec)
    x = sobel(r, 0, mode='constant')
    y = sobel(r, 1, mode='constant')
    h = np.hypot(x, y)
    h = np.apply_over_axes(np.mean, h, [0, 1])
    return h.argmax()
I'm not sure if the sobel filter is particularly necessary in your application, and this is hard to test without your particular 20-layer 'image', but you could try using np.gradient instead of running the sobel twice. The advantage is that gradient runs in three dimensions. You can ignore the component in the third, and take the hypot of just the first two. This seems wasteful but is actually still faster in my tests.
For a variety of randomly generated images, r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j, this gives the same answer as your code, but test it to be sure, and possibly fiddle with the dx, dy arguments of np.gradient.
def grad_max(rec):
    g = np.gradient(np.abs(rec))[:2]  # ignore derivative in third dimension
    h = np.hypot(*g)
    h = np.apply_over_axes(np.mean, h, [0, 1])  # mean along first and second dimensions
    return h.argmax()
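On NumPy 1.7+ the apply_over_axes call can equivalently be written with a tuple axis, which some find more readable; same result, just a stylistic alternative (the function name is only for illustration):

def grad_max_tuple_axis(rec):
    g = np.gradient(np.abs(rec))[:2]     # ignore the derivative along the third axis
    h = np.hypot(*g)
    return h.mean(axis=(0, 1)).argmax()  # mean over rows/cols, then pick the best layer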
Using this code for timing:
def sobHypot_clean(rec):
    rs = rec.shape
    hype = np.ones(rs)
    r = np.abs(rec)
    for i in xrange(rs[-1]):
        ri = r[..., i]
        x = sobel(ri, 0, mode='constant')
        y = sobel(ri, 1, mode='constant')
        hype[..., i] = np.hypot(x, y).mean()
    return hype.argmax()
Timing:
In [1]: r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j
# Original Post
In [2]: timeit sobHypot(r)
1 loops, best of 3: 9.85 s per loop
#cleaned up a bit:
In [3]: timeit sobHypot_clean(r)
1 loops, best of 3: 7.64 s per loop
# vectorized:
In [4]: timeit sobHypot_vec(r)
1 loops, best of 3: 5.98 s per loop
# using np.gradient:
In [5]: timeit grad_max(r)
1 loops, best of 3: 4.12 s per loop
Please test any of these functions on your own images to be sure they give the desired output, since different types of arrays could react differently from the simple random tests I did.
I am having performance issues with my code.
Step # IIII consumes hours of time. I used to materialize the itertools.product before, but thanks to a user I don't do pro_data = product(array_b,array_a) anymore. This helped me with memory issues, but it is still heavily time consuming.
I would like to parallelize it with multithreading or multiprocessing, whatever you can suggest; I am grateful for anything.
Explanation. I have two arrays that contain the x and y values of particles. For each particle (defined by two coordinates) I want to calculate a function with every other one. For the combinations I use the itertools.product method and loop over every pair. I run over 50000 particles in total, so I have N*N/2 combinations to calculate.
Thanks in advance
import numpy as np
import matplotlib.pyplot as plt
from itertools import product,combinations_with_replacement
def func(ar1, ar2, ar3, ar4):  # example func that takes four arguments
    return (ar1*ar2**22 + np.sin(ar3) + ar4)

def newdist(a):
    return func(a[0][0], a[0][1], a[1][0], a[1][1])
x_edges = np.logspace(-3,1, num=25) #prepare x-axis for histogram
x_mean = 10**((np.log10(x_edges[:-1])+np.log10(x_edges[1:]))/2)
x_width=x_edges[1:]-x_edges[:-1]
hist_data=np.zeros([len(x_edges)-1])
array1=np.random.uniform(0.,10.,100)
array2=np.random.uniform(0.,10.,100)
array_a = np.dstack((array1,array1))[0]
array_b = np.dstack((array2,array2))[0]
# IIII
for i in product(array_a, array_b):
    (result, bins) = np.histogram(newdist(i), bins=x_edges)
    hist_data += result
hist_data = np.array(map(float, hist_data))
plt.bar(x_mean,hist_data,width=x_width,color='r')
plt.show()
-----EDIT-----
I used this code now:
def mp_dist(array_a, array_b, d, bins):  # d chunks AND processes
    def worker(array_ab, out_q):
        """ push result in queue """
        outdict = {}
        outdict = vec_chunk(array_ab, bins)
        out_q.put(outdict)

    out_q = mp.Queue()
    a = np.swapaxes(array_a, 0, 1)
    b = np.swapaxes(array_b, 0, 1)
    array_size_a = len(array_a) - (len(array_a) % d)
    array_size_b = len(array_b) - (len(array_b) % d)
    a_chunk = array_size_a / d
    b_chunk = array_size_b / d
    procs = []
    # prepare arrays for mp
    array_ab = np.empty((4, a_chunk, b_chunk))
    for j in xrange(d):
        for k in xrange(d):
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            p = mp.Process(target=worker, args=(array_ab, out_q))
            procs.append(p)
            p.start()
    resultarray = np.empty(len(bins)-1)
    for i in range(d):
        resultarray += out_q.get()
    # Wait for all worker processes to finish
    for pro in procs:
        pro.join()
    print resultarray
    return resultarray
The problem here is that I cannot control the number of processes. How can I use mp.Pool() instead?
First, let's look at a straightforward vectorization of your problem. I have a feeling that you want your array_a and array_b to be exactly the same, i.e. the coordinates of the particles, but I am keeping them separate here.
I have turned your code into a function, to make timing easier:
def IIII(array_a, array_b, bins):
    hist_data = np.zeros([len(bins)-1])
    for i in product(array_a, array_b):
        (result, bins) = np.histogram(newdist(i), bins=bins)
        hist_data += result
    hist_data = np.array(map(float, hist_data))
    return hist_data
You can, by the way, generate your sample data in a less convoluted way as follows:
n = 100
array_a = np.random.uniform(0, 10, size=(n, 2))
array_b = np.random.uniform(0, 10, size=(n, 2))
So first we need to vectorize your func. I have done it so it can take any array of shape (4, ...). To spare memory, it is doing the calculation in place, and returning the first plane, i.e. array[0].
def func_vectorized(a):
    a[1] **= 22
    np.sin(a[2], out=a[2])
    a[0] *= a[1]
    a[0] += a[2]
    a[0] += a[3]
    return a[0]
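A quick sanity check of the in-place version against the scalar func from the question (just a sketch; func_vectorized overwrites its input, hence the copy):

a_test = np.random.rand(4, 5, 6)
expected = func(a_test[0], a_test[1], a_test[2], a_test[3])
np.testing.assert_allclose(func_vectorized(a_test.copy()), expected)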
With this function in place, we can write a vectorized version of IIII:
def IIII_vec(array_a, array_b, bins):
    array_ab = np.empty((4, len(array_a), len(array_b)))
    a = np.swapaxes(array_a, 0, 1)
    b = np.swapaxes(array_b, 0, 1)
    array_ab[[0, 1]] = a[:, :, None]
    array_ab[[2, 3]] = b[:, None, :]
    newdist = func_vectorized(array_ab)
    hist, _ = np.histogram(newdist, bins=bins)
    return hist
With n = 100 points, they both return the same:
In [2]: h1 = IIII(array_a, array_b, x_edges)
In [3]: h2 = IIII_vec(array_a, array_b, x_edges)
In [4]: np.testing.assert_almost_equal(h1, h2)
But the timing differences are already very relevant:
In [5]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 654 ms per loop
In [6]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.08 ms per loop
A 300x speedup! If you try it again with longer sample data, n = 1000, you can see that they both scale equally badly, as n**2, so the 300x stays there:
In [10]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 68.2 s per loop
In [11]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 229 ms per loop
So you are still looking at a good 10 min. of processing, which is not really that much when compared to the more than 2 days that your current solution would require.
Of course, for things to be so nice, you will need to fit a (4, 50000, 50000) array of floats into memory (4 × 50000² × 8 bytes ≈ 80 GB), something that my system cannot handle. But you can still keep things relatively fast by processing it in chunks. The following version of IIII_vec divides each array into d chunks. As written, the length of the array should be divisible by d. It wouldn't be too hard to overcome that limitation, but it would obfuscate the true purpose:
def IIII_vec_bis(array_a, array_b, bins, d=1):
    a = np.swapaxes(array_a, 0, 1)
    b = np.swapaxes(array_b, 0, 1)
    a_chunk = len(array_a) // d
    b_chunk = len(array_b) // d
    array_ab = np.empty((4, a_chunk, b_chunk))
    hist_data = np.zeros((len(bins) - 1,))
    for j in xrange(d):
        for k in xrange(d):
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            newdist = func_vectorized(array_ab)
            hist, _ = np.histogram(newdist, bins=bins)
            hist_data += hist
    return hist_data
First, let's check that it really works:
In [4]: h1 = IIII_vec(array_a, array_b, x_edges)
In [5]: h2 = IIII_vec_bis(array_a, array_b, x_edges, d=10)
In [6]: np.testing.assert_almost_equal(h1, h2)
And now some timings. With n = 100:
In [7]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.02 ms per loop
In [8]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
100 loops, best of 3: 12 ms per loop
But as you start having to have a larger and larger array in memory, doing it in chunks starts to pay off. With n = 1000:
In [12]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 223 ms per loop
In [13]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 208 ms per loop
With n = 10000 I can no longer call IIII_vec without an "array is too big" error, but the chunky version is still running:
In [18]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 21.8 s per loop
And just to show that it can be done, I have run it once with n = 50000:
In [23]: %timeit -n1 -r1 IIII_vec_bis(array_a, array_b, x_edges, d=50)
1 loops, best of 1: 543 s per loop
So a good 9 minutes of number crunching, which is not all that bad given it has computed 2.5 billion interactions.
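As a final aside, if you want to lift the divisible-by-d restriction of IIII_vec_bis, np.array_split handles uneven chunks; a rough sketch (slightly slower, since it re-allocates the block buffer on every iteration):

def IIII_vec_chunked(array_a, array_b, bins, d=1):
    hist_data = np.zeros(len(bins) - 1)
    for a_blk in np.array_split(array_a, d):
        for b_blk in np.array_split(array_b, d):
            hist_data += IIII_vec(a_blk, b_blk, bins)
    return hist_data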
Use vectorized numpy operations. Replace the for-loop over product() with a single newdist() call by creating its arguments using meshgrid().
To parallelize the problem, compute newdist() on slices of array_a, array_b that correspond to sub-blocks of meshgrid(). Here's an example using slices and multiprocessing.
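For your histogram problem, a rough sketch of what a Pool-based version could look like (the names and the reuse of IIII_vec from the answer above are illustrative, not a tested drop-in solution):

import multiprocessing as mp
import numpy as np

def hist_block(args):
    # worker: vectorized histogram of one (a-block, b-block) pair
    a_blk, b_blk, bins = args
    return IIII_vec(a_blk, b_blk, bins)

def IIII_pool(array_a, array_b, bins, d=10, nprocs=4):
    blocks = [(a_blk, b_blk, bins)
              for a_blk in np.array_split(array_a, d)
              for b_blk in np.array_split(array_b, d)]
    pool = mp.Pool(nprocs)  # nprocs controls the number of worker processes
    try:
        return sum(pool.imap_unordered(hist_block, blocks))
    finally:
        pool.close()
        pool.join()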
Here's another example to demonstrate the steps: python loop -> vectorized numpy version -> parallel:
#!/usr/bin/env python
from __future__ import division
import math
import multiprocessing as mp
import numpy as np

try:
    from itertools import izip as zip
except ImportError:
    zip = zip  # Python 3

def pi_loop(x, y, npoints):
    """Compute pi using Monte-Carlo method."""
    # note: the method converges to pi very slowly.
    return 4 * sum(1 for xx, yy in zip(x, y) if (xx**2 + yy**2) < 1) / npoints

def pi_vectorized(x, y, npoints):
    return 4 * ((x**2 + y**2) < 1).sum() / npoints  # or just .mean()

def mp_init(x_shared, y_shared):
    global mp_x, mp_y
    mp_x, mp_y = map(np.frombuffer, [x_shared, y_shared])  # no copy

def mp_pi(args):
    # perform computations on slices of mp_x, mp_y
    start, end = args
    x = mp_x[start:end]  # no copy
    y = mp_y[start:end]
    return ((x**2 + y**2) < 1).sum()

def pi_parallel(x, y, npoints):
    # compute pi using multiple processes
    pool = mp.Pool(initializer=mp_init, initargs=[x, y])
    step = 100000
    slices = ((start, start + step) for start in range(0, npoints, step))
    return 4 * sum(pool.imap_unordered(mp_pi, slices)) / npoints

def main():
    npoints = 1000000
    # create shared arrays
    x_sh, y_sh = [mp.RawArray('d', npoints) for _ in range(2)]
    # initialize arrays
    x, y = map(np.frombuffer, [x_sh, y_sh])
    x[:] = np.random.uniform(size=npoints)
    y[:] = np.random.uniform(size=npoints)

    for f, a, b in [(pi_loop, x, y),
                    (pi_vectorized, x, y),
                    (pi_parallel, x_sh, y_sh)]:
        pi = f(a, b, npoints)
        precision = int(math.floor(math.log10(npoints)) / 2 - 1 + 0.5)
        print("%.*f %.1e" % (precision + 1, pi, abs(pi - math.pi)))

if __name__ == "__main__":
    main()
Time performance for npoints = 10_000_000:
pi_loop pi_vectorized pi_parallel
32.6 0.159 0.069 # seconds
It shows that the main performance benefit is from converting the python loop to its vectorized numpy analog.