I'm developing code that integrates an ODE using scipy's complex_ode, where the integrand includes a Fourier transform and exponential operator acting on a large array of complex values.
To optimize performance, I've profiled this and found the main bottleneck is (after optimizing FFTs using PyFFTW etc) in the line:
val = np.exp(float_value * arr)
I'm currently using numpy which I understand calls C code - and thus should be quick. But is there any way to further improve performance please?
I've looked into using Numba but since my main loop includes FFTs too, I don't think it can be compiled (nopython=True flag leads to errors) and thus, I suspect it offers no gain.
Here is a test example for the code I'd like to optimize:
arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
float_value = 0.5
%timeit np.exp(float_value * arr)
Any suggestions welcomed thanks.
We could leverage the numexpr module, which works really efficiently on large data involving transcendental operations:
In [91]: arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
...: float_value = 0.5
...:
In [92]: %timeit np.exp(float_value * arr)
1000 loops, best of 3: 739 µs per loop
In [94]: import numexpr as ne
In [95]: %timeit ne.evaluate('exp(float_value*arr)')
1000 loops, best of 3: 241 µs per loop
This seems to be coherent with the expected performance as stated in the docs.
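If adding numexpr as a dependency is not an option, one thing that may be worth trying in plain NumPy is to reuse a preallocated buffer through the `out=` argument of the ufuncs, so that no temporary arrays are allocated on each call inside the ODE right-hand side. The sketch below is only an illustration: the helper name `exp_scaled_inplace` is made up for this example, and whether it gives a measurable gain depends on the array size and machine, so it should be timed before being adopted.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(2**14) + 1j * rng.random(2**14)
float_value = 0.5

# Allocate the output buffer once, outside the hot loop.
buf = np.empty_like(arr)

def exp_scaled_inplace(c, z, out):
    """Write exp(c * z) into `out` without allocating temporaries."""
    np.multiply(z, c, out=out)  # out = c * z
    np.exp(out, out=out)        # out = exp(out), computed in place
    return out

result = exp_scaled_inplace(float_value, arr, buf)
print(np.allclose(result, np.exp(float_value * arr)))
```

The input array is left untouched; only `buf` is overwritten on each call.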
Let A,B be ((day,observation,dim)) arrays. Each array contains for a given day the same number of observations, an observation being a point with dim dimensions (that is dim floats). For every day, I want to compute the spatial distances between all observations in A and B that day.
For example:
import numpy as np
from scipy.spatial.distance import cdist
A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)
output = []
for day in range(50):
    output.append(cdist(A[day],B[day]))
where I use scipy.spatial.distance.cdist.
Is there a faster way to do this? Ideally, I would like to get for output a ((day,observation,observation)) array that contains for every day the pairwise distances between observations in A and B that day, whilst somehow avoid the loop over days.
One way to do it (though it will require a massive amount of memory) is to make clever use of array broadcasting:
output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
Edit
But after some testing, it seems that probably scikit-learn's euclidean_distances is the best option for large arrays. (Note that I've rewritten your loop into a list comprehension.)
This is for 100 data points per day:
# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 8.81 ms per loop
# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
10 loops, best of 3: 46.9 ms per loop
# scikit-learn's algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop
and this is for 2000 data points per day:
In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop
In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 2.94 s per loop
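A middle ground between the memory-hungry broadcasting version and the per-day loop is to expand the squared distance as ||a||² + ||b||² - 2 a·b and let a batched matrix product handle all days at once. This is a sketch under that assumption, not a benchmarked replacement; the function name `batched_euclidean` is hypothetical, and the expansion can lose some precision for points that are very close together.

```python
import numpy as np

def batched_euclidean(A, B):
    """Pairwise Euclidean distances per day for arrays of shape (day, obs, dim)."""
    a2 = np.einsum('dij,dij->di', A, A)[:, :, np.newaxis]  # (day, obs_A, 1)
    b2 = np.einsum('dij,dij->di', B, B)[:, np.newaxis, :]  # (day, 1, obs_B)
    cross = A @ B.transpose(0, 2, 1)                       # (day, obs_A, obs_B)
    sq = a2 + b2 - 2.0 * cross
    np.maximum(sq, 0.0, out=sq)  # clip tiny negatives caused by round-off
    return np.sqrt(sq)

rng = np.random.default_rng(0)
A, B = rng.random((5, 40, 10)), rng.random((5, 30, 10))
dists = batched_euclidean(A, B)

# Reference: the direct broadcasting formula
ref = np.sqrt(np.sum((A[:, :, np.newaxis, :] - B[:, np.newaxis, :, :])**2, axis=-1))
print(dists.shape, np.allclose(dists, ref))
```

Unlike the broadcasting formula, this never materialises the (day, obs, obs, dim) intermediate array.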
Edit: I'm an idiot and forgot that Python's map is evaluated lazily. My "faster" code wasn't actually doing any of the work! Forcing evaluation removed the performance boost.
I think your time is going to be dominated by the time spent inside the scipy function. I'd use map instead of the loop anyway as I think it's a bit neater, but I don't think there's any magic way to get a huge performance boost here. Maybe compiling the code with cython or using numba would help a little.
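To make the lazy-evaluation pitfall concrete, here is a small illustration (using `np.sum` as a stand-in for the real per-day function): in Python 3, `map` only builds an iterator, so timing it measures almost nothing until the results are actually consumed.

```python
import numpy as np

A = np.random.rand(5, 4, 3)

lazy = map(np.sum, A)          # nothing has been computed yet
forced = list(map(np.sum, A))  # consuming the iterator does the work
stacked = np.array(forced)

print(type(lazy).__name__, stacked.shape)
```

Any timing comparison should therefore wrap the `map` call in `list(...)` or `np.array(list(...))`.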
I had written a script using NumPy's fft function, where I was padding my input array to the nearest power of 2 to get a faster FFT.
After profiling the code, I found that the FFT call was taking the longest time, so I fiddled around with the parameters and found that if I didn't pad the input array, the FFT ran several times faster.
Here's a minimal example to illustrate what I'm talking about (I ran this in IPython and used the %timeit magic to time the execution).
x = np.arange(-4.*np.pi, 4.*np.pi, 1000)
dat1 = np.sin(x)
The timing results:
%timeit np.fft.fft(dat1)
100000 loops, best of 3: 12.3 µs per loop
%timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 61.5 µs per loop
Padding the array to a power of 2 leads to a very drastic slowdown.
Even if I create an array with a prime number of elements (hence the theoretically slowest FFT)
x2 = np.arange(-4.*np.pi, 4.*np.pi, 1009)
dat2 = np.sin(x2)
The time it takes to run still doesn't change so drastically!
%timeit np.fft.fft(dat2)
100000 loops, best of 3: 12.2 µs per loop
I would have thought that padding the array will be a one time operation, and then calculating the FFT should be quicker.
Am I missing anything?
EDIT: I was supposed to use np.linspace rather than np.arange. Below are the timing results using linspace
In [2]: import numpy as np
In [3]: x = np.linspace(-4*np.pi, 4*np.pi, 1000)
In [4]: x2 = np.linspace(-4*np.pi, 4*np.pi, 1024)
In [5]: dat1 = np.sin(x)
In [6]: dat2 = np.sin(x2)
In [7]: %timeit np.fft.fft(dat1)
10000 loops, best of 3: 55.1 µs per loop
In [8]: %timeit np.fft.fft(dat2)
10000 loops, best of 3: 49.4 µs per loop
In [9]: %timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 64.9 µs per loop
Padding still causes a slowdown. Could this be a local issue? i.e., due to some quirk in my NumPy setup it's acting this way?
FFT algorithms like NumPy's are fast for array sizes that factorize into a product of small primes, not just powers of two. If you increase the array size by padding, the computational work increases. The speed of FFT algorithms is also critically dependent on cache use: if you pad to an array size that uses the cache less efficiently, performance drops. The really fast FFT implementations, like FFTW and Intel MKL, will actually generate plans for the array size factorization to get the most efficient computation. This includes both heuristics and actual measurements. So no, padding to the nearest power of two is only beneficial in introductory textbooks and not necessarily in practice. As a rule of thumb, you usually benefit from padding if the array size has one or more very large prime factors.
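As a rough illustration of the "small primes" rule of thumb, the sketch below finds the next length whose only prime factors are 2, 3 and 5. The helper names `is_smooth` and `next_smooth_len` are made up for this example, and the choice of (2, 3, 5) is an assumption; the set of radices that a given FFT library handles efficiently varies. SciPy ships a similar helper, `scipy.fft.next_fast_len`, which is probably the better choice when SciPy is available.

```python
def is_smooth(n, primes=(2, 3, 5)):
    """Return True if n has no prime factors other than those in `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_smooth_len(n, primes=(2, 3, 5)):
    """Smallest length >= n that factorizes into the given small primes."""
    while not is_smooth(n, primes):
        n += 1
    return n

for size in (1000, 1009):
    print(size, next_smooth_len(size))
```

A length of 1000 (2³·5³) is already smooth and needs no padding, whereas the prime 1009 is bumped up to 1024.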
You're using np.arange when you want to be using np.linspace
In [2]: x = np.arange(-4.*np.pi, 4.*np.pi, 1000)
In [3]: x
Out[3]: array([-12.56637061])
np.arange takes arguments (start, stop, step), whereas np.linspace is (start, stop, number_of_pts). When you calculate with the data I suspect you think you're using, you get the expected behavior:
In [4]: x = np.linspace(-4.*np.pi, 4.*np.pi, 1000)
In [5]: dat1 = np.sin(x)
In [6]: %timeit np.fft.fft(dat1)
1 loops, best of 3: 28.1 µs per loop
In [7]: %timeit np.fft.fft(dat1, n=1024)
10000 loops, best of 3: 26.7 µs per loop
In [8]: x = np.linspace(-4.*np.pi, 4.*np.pi, 1009)
In [9]: dat2 = np.sin(x)
In [10]: %timeit np.fft.fft(dat2)
10000 loops, best of 3: 53 µs per loop
In [11]: %timeit np.fft.fft(dat2, n=1024)
10000 loops, best of 3: 26.8 µs per loop
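A quick way to catch this kind of mix-up is to check the size of the array before timing anything. The variable names below are only for illustration:

```python
import numpy as np

# Third argument is a step of 1000, so only the start point fits in the range
x_arange = np.arange(-4 * np.pi, 4 * np.pi, 1000)

# Third argument is the number of points
x_linspace = np.linspace(-4 * np.pi, 4 * np.pi, 1000)

print(x_arange.size, x_linspace.size)

# With n=1024 the 1000-point signal is zero-padded before the transform
padded = np.fft.fft(np.sin(x_linspace), n=1024)
print(padded.shape)
```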
I have a large code which takes a bit of time to run. I've tracked down the two lines that take up most of the time and I'd like to know if there's a way to speed them up. Here's a MWE:
import numpy as np
def setup(k=2, m=100, n=300):
    return np.random.randn(k,m), np.random.randn(k,n),np.random.randn(k,m)
# make some random points and weights
a, b, w = setup()
# Weighted euclidean distance between arrays a and b.
wdiff = (a[np.newaxis,...] - b[np.newaxis,...].T) / w[np.newaxis,...]
# This is the set of operations that need a performance boost:
dist_1 = np.exp(-0.5*(wdiff*wdiff)) / w
dist_2 = np.array([i[0]*i[1] for i in dist_1])
BTW, I'm coming from the question "Fast weighted euclidean distance between points in arrays", where ali_m's amazing answer saved me a lot of time by applying broadcasting (of which I know absolutely nothing, yet at least). Could something like that be applied to these lines?
Your dist_2 calculation can be sped up by a factor of 10 or so:
>>> dist_1.shape
(300, 2, 100)
>>> %timeit dist_2 = np.array([i[0]*i[1] for i in dist_1])
1000 loops, best of 3: 1.35 ms per loop
>>> %timeit dist_2 = dist_1.prod(axis=1)
10000 loops, best of 3: 116 µs per loop
>>> np.allclose(np.array([i[0]*i[1] for i in dist_1]), dist_1.prod(axis=1))
True
I couldn't manage to do much with your dist_1 as the majority of time is spent in the exponentiation:
>>> %timeit (-0.5*(wdiff*wdiff)) / w
1000 loops, best of 3: 467 µs per loop
>>> %timeit np.exp((-0.5*(wdiff*wdiff)))/w
100 loops, best of 3: 3.3 ms per loop
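Since the exponentiation dominates, one option that might help when only dist_2 is needed is to use exp(x)·exp(y) = exp(x + y): sum the squared terms over the k axis first and exponentiate once per pair, rather than once per component. This is an untimed sketch; the name `dist_2_fused` is hypothetical, and it skips the intermediate dist_1 entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 2, 100, 300
a = rng.standard_normal((k, m))
b = rng.standard_normal((k, n))
w = rng.standard_normal((k, m))

wdiff = (a[np.newaxis, ...] - b[np.newaxis, ...].T) / w[np.newaxis, ...]  # (n, k, m)

# Original two-step computation
dist_1 = np.exp(-0.5 * (wdiff * wdiff)) / w
dist_2 = dist_1.prod(axis=1)

# Fused version: sum squares over k, then a single exp per (n, m) entry
sq_sum = np.einsum('nkm,nkm->nm', wdiff, wdiff)
dist_2_fused = np.exp(-0.5 * sq_sum) / w.prod(axis=0)

print(np.allclose(dist_2, dist_2_fused))
```

With k=2 this halves the number of `exp` evaluations; the saving grows with k.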
Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np
from numpy.linalg import LinAlgError
A = np.random.rand(1000,3,3)
def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv
def fast_inverse(A):
    # Build the identity once instead of once per matrix
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv
def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])
from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    identity = np.eye(n_eq)
    def lapack_inverse(a):
        # dgesv overwrites b with the solution, so start from a fresh copy
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_routine(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise LinAlgError('Singular matrix')
        return b
    return np.array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with matrix indices being last (in other words, arrays of shape (...,M,N,N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:
import numpy as np
A = np.random.rand(3,3,1000)
def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv
def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, then the result has to be permuted back to shape (N,N,M).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
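The same idea can be written with `np.moveaxis`, which some readers may find easier to follow than explicit transpose tuples. This is just an equivalent sketch; the function name `inverse_last_axis` is made up here, and the diagonal shift is only added to keep the random test matrices well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
# Shape (N, N, M); adding 3*I keeps the random matrices safely invertible
A = rng.random((3, 3, 50)) + 3 * np.eye(3)[:, :, np.newaxis]

def inverse_last_axis(A):
    """Invert a stack of matrices stored with the stack index last, shape (N, N, M)."""
    stacked = np.moveaxis(A, -1, 0)                    # view of shape (M, N, N)
    return np.moveaxis(np.linalg.inv(stacked), 0, -1)  # back to (N, N, M)

Ainv = inverse_last_axis(A)
ok = all(np.allclose(A[..., i] @ Ainv[..., i], np.eye(3)) for i in range(A.shape[-1]))
print(Ainv.shape, ok)
```

If the data can be stored as (M, N, N) from the start, both axis moves disappear and `np.linalg.inv(A)` can be called directly.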
Numpy-Blas calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot-products of small 3x3 matrices or similar cases, numpy-MKL which I use can often be outperformed by quite a margin.
These external BLAS routines are usually made for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at e.g. Intel IPP.
Please keep also in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
@nb.njit(fastmath=True)
def inversion(m):
    minv=np.empty(m.shape,dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
    return minv
#I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    # Flatten each 3x3 matrix to 9 entries, invert, then restore the shape
    m_TMP=m.reshape(m.shape[0],9)
    minv=inversion(m_TMP)
    return minv.reshape(minv.shape[0],3,3)
#Testing
A = np.random.rand(1000000,3,3)
#Warmup to not measure compilation overhead on the first call
#You may also use @nb.njit(fastmath=True,cache=True) but this also has about 0.2s
#overhead on first call
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
For loops are indeed not necessarily much slower than the alternatives and also in this case, it will not help you much. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3) #this makes it
#possible to index
#the matrices as A[i]
# list() forces evaluation; in Python 3, map returns a lazy iterator
Ainv = np.array(list(map(np.linalg.inv, A)))
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since numpy arrays are stored in row-major (C) order by default, so each matrix A[i] occupies a contiguous block of memory and a linear readout of the data is ever so slightly faster this way.
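The effect of the axis ordering on memory layout can be checked directly through the array flags. In NumPy's default C (row-major) layout the last index varies fastest, so a matrix taken along the first axis is a contiguous block while one taken along the last axis is strided. The variable names here are only illustrative:

```python
import numpy as np

A_first = np.random.rand(100, 3, 3)  # matrices indexed as A_first[i]
A_last = np.random.rand(3, 3, 100)   # matrices indexed as A_last[:, :, i]

first_contig = A_first[0].flags['C_CONTIGUOUS']
last_contig = A_last[:, :, 0].flags['C_CONTIGUOUS']

print(first_contig, last_contig)
print(A_first[0].strides, A_last[:, :, 0].strides)
```

How much this matters for 3x3 matrices is likely small compared with the per-call overhead of `np.linalg.inv`, so it is worth measuring rather than assuming.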