High performance array mean - Python

I've got a performance bottleneck. I'm computing the column-wise mean of large arrays (250 rows & 1.3 million columns), and I do so more than a million times in my application.
My test case in Python:
import numpy as np
big_array = np.random.random((250, 1300000))
%timeit mean = big_array.mean(axis = 0) # ~400 milliseconds
Numpy takes around 400 milliseconds on my machine, running on a single core. I've tried several other matrix libraries across different languages (Cython, R, Julia, Torch), but only Julia beat Numpy, at around 250 milliseconds.
Can anyone provide evidence of substantial improvements in performance in this task? Perhaps this is a task suited for the GPU?
Edit: My application is evidently memory-constrained, and its performance is dramatically improved by accessing elements of a large array only once, rather than repeatedly. (See comment below.)
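For the GPU question, a minimal sketch using CuPy (an assumption on my part: it requires a CUDA-capable GPU and the cupy package, and it only pays off if the array can stay resident in GPU memory across the million repetitions, since the host-to-device transfer would otherwise dominate):
import cupy as cp

big_gpu = cp.random.random((250, 1300000))  # generate (or transfer) the array on the device
mean_gpu = big_gpu.mean(axis=0)             # the reduction runs on the GPU
mean = cp.asnumpy(mean_gpu)                 # copy only the 1.3M-element result back to the host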

Julia, if I'm not mistaken, uses Fortran (column-major) ordering in memory, whereas numpy uses C (row-major) layout by default. So if you rearrange things so that the mean runs along contiguous memory, you get better performance:
In [1]: import numpy as np
In [2]: big_array = np.random.random((250, 1300000))
In [4]: big_array_f = np.asfortranarray(big_array)
In [5]: %timeit mean = big_array.mean(axis = 0)
1 loop, best of 3: 319 ms per loop
In [6]: %timeit mean = big_array_f.mean(axis = 0)
1 loop, best of 3: 205 ms per loop
Or you can just swap your dimensions and take the mean over the other axis:
In [10]: big_array = np.random.random((1300000, 250))
In [11]: %timeit mean = big_array.mean(axis = 1)
1 loop, best of 3: 205 ms per loop
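If you control how the array is created in the first place, a minimal sketch (assuming the fill step can produce your real data) is to allocate in Fortran order up front and skip the asfortranarray copy entirely:
import numpy as np

big_array_f = np.empty((250, 1300000), order='F')   # column-major allocation
big_array_f[:] = np.random.random((250, 1300000))   # fill with your actual data
mean = big_array_f.mean(axis=0)                     # reduction runs over contiguous columns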


Why Python loops over slices of numpy arrays are faster than fully vectorized operations

I need to create a boolean mask by thresholding a 3D data array: mask at locations where data are smaller than lower acceptable limit or data are larger than upper acceptable limit must be set to True (otherwise False). Succinctly:
mask = (data < low) | (data > high)
I have two versions of the code for performing this operation: one works directly with the entire 3D arrays in numpy while the other loops over slices of the array. Contrary to my expectations, the second method seems to be faster than the first. Why?
In [1]: import numpy as np
In [2]: import sys
In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.14.0
In [5]: arr = np.random.random((10, 1000, 1000))
In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...:
In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...:
First of all, let's make sure that both methods produce identical results:
In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True
And now some timing tests:
In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is going on here?
EDIT 1: A similar behavior is observed in an older environment:
In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [4]: print(np.__version__)
1.11.3
In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop
In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop
Outperforming both methods
In method one you access the whole array twice. If it doesn't fit in the cache, the data is read from RAM twice, which lowers performance. Additionally, temporary arrays may be created, as mentioned in the comments.
Method two is more cache-friendly, since you access a smaller part of the array twice, which is likely to fit in cache. The downsides are slow looping and more function calls, which are also quite slow.
To get good performance here it is recommended to compile the code, which can be done with Cython or Numba. Since the Cython version is more work (annotation, plus the need for a separate compiler), I will show how to do this using Numba.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out
Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).
This is only a simple example; in more complex code, where you can make use of SIMD and parallel processing (which is also very easy to use), the performance gain can be a lot bigger. In non-compiled code, vectorization is usually, but not always (as shown), the best you can do. However, it leads to bad cache behaviour, which can mean suboptimal performance if the chunks of data you are accessing don't fit at least in the L3 cache. On some other problems there is also a performance hit if the data can't fit in the much smaller L1 or L2 cache. Another advantage is automatic inlining of small njit-ed functions inside an njit-ed function that calls them.
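As a hedged sketch of the parallel variant mentioned above (assuming Numba's threading backend is available on your machine), nb.prange distributes the outer loop across cores:
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def method3_parallel(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in nb.prange(arr.shape[0]):  # parallel outer loop
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out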
In my own tests, the difference in performance was even more noticeable than in your question. The differences remained clearly observable after increasing the second and third dimensions of the arr data, and also after commenting out one of the two comparison functions (greater_equal or logical_or), which means we can rule out some kind of strange interaction between the two.
By changing the implementation of the two methods to the following, I could significantly reduce the observable difference in performance (but not completely eliminate it):
def method1(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    np.greater_equal(arr, high, out)
    np.logical_or(out, arr < low, out)
    return out

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = high[k]
        l = low[k]
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
I suppose that, when supplying high or low as a scalar to those numpy functions, they may internally first create a numpy array of the correct shape filled with that scalar. When we do this manually outside the functions, in both cases only once for the full shape, the performance difference becomes much less noticeable. This implies that, for whatever reason (maybe cache?), creating such a large array filled with the same constant once may be less efficient than creating k smaller arrays with the same constant (as done automatically by the implementation of method2 in the original question).
Note: in addition to reducing the performance gap, it also makes the performance of both methods much worse (affecting the second method more severely than the first). So, while this may give some indication of where the issue may be, it doesn't appear to explain everything.
EDIT
Here is a new version of method2, where we now manually pre-create smaller arrays within the loop every time, like I suspect is happening internally in numpy in the original implementation in the question:
def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = np.full_like(a, high)
        l = np.full_like(a, low)
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out
This version is indeed much faster again than the one I have above (confirming that creating many smaller arrays inside the loop is more efficient than one big one outside the loop), but still slower than the original implementation in the question.
Under the hypothesis that these numpy functions are indeed converting the scalar bounds into these kinds of arrays first, the difference in performance between this last function and the one in the question could be due to the creation of arrays in Python (my implementation) vs. doing so natively (original implementation).
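One hedged way to probe this hypothesis without paying for the big allocations is np.broadcast_to, which returns a read-only view of the scalar expanded to the full shape instead of a filled buffer (method1_view is my own name, not from the question):
def method1_view(arr, low, high):
    out = np.empty(arr.shape, dtype=bool)
    high_b = np.broadcast_to(high, arr.shape)  # view, no large allocation
    low_b = np.broadcast_to(low, arr.shape)
    np.greater_equal(arr, high_b, out)
    np.logical_or(out, arr < low_b, out)
    return out
If allocating large constant arrays is part of the cost, this variant should time close to the original scalar version.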

Fastest way to calculate exponential [exp()] function of large complex array in Python

I'm developing code that integrates an ODE using scipy's complex_ode, where the integrand includes a Fourier transform and exponential operator acting on a large array of complex values.
To optimize performance, I've profiled this and found the main bottleneck is (after optimizing FFTs using PyFFTW etc) in the line:
val = np.exp(float_value * arr)
I'm currently using numpy, which I understand calls C code and thus should be quick. But is there any way to further improve performance?
I've looked into using Numba but since my main loop includes FFTs too, I don't think it can be compiled (nopython=True flag leads to errors) and thus, I suspect it offers no gain.
Here is a test example for the code I'd like to optimize:
arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
float_value = 0.5
%timeit np.exp(float_value * arr)
Any suggestions welcomed thanks.
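A hedged aside on the Numba remark above: even if the full loop (with its FFT calls) can't be compiled in nopython mode, the exp kernel alone can be, and called from ordinary Python between the FFT steps. A minimal sketch (scaled_exp is a hypothetical helper, not from the question):
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def scaled_exp(arr, c):
    # elementwise exp(c * arr) over a complex array, parallelized across cores
    out = np.empty_like(arr)
    for i in nb.prange(arr.shape[0]):
        out[i] = np.exp(c * arr[i])
    return out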
We could leverage the numexpr module, which works really efficiently on large data involving transcendental operations:
In [91]: arr = np.random.rand(2**14) + 1j *np.random.rand(2**14)
...: float_value = 0.5
...:
In [92]: %timeit np.exp(float_value * arr)
1000 loops, best of 3: 739 µs per loop
In [94]: import numexpr as ne
In [95]: %timeit ne.evaluate('exp(float_value*arr)')
1000 loops, best of 3: 241 µs per loop
This seems to be consistent with the expected performance as stated in the docs.
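One more knob worth knowing about: numexpr is multithreaded by default, so part of the gap versus np.exp comes from using all cores rather than from the evaluator itself. A quick sketch to separate the two effects (set_num_threads is part of numexpr's public API):
import numexpr as ne

ne.set_num_threads(1)  # single-threaded baseline for a fair comparison
%timeit ne.evaluate('exp(float_value*arr)')
ne.set_num_threads(8)  # restore multithreading (pick your core count)
%timeit ne.evaluate('exp(float_value*arr)')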

Fast distance calculation in scipy and numpy

Let A and B be (day, observation, dim) arrays. Each array contains, for a given day, the same number of observations, an observation being a point with dim dimensions (that is, dim floats). For every day, I want to compute the spatial distances between all observations in A and B that day.
For example:
import numpy as np
from scipy.spatial.distance import cdist
A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)
output = []
for day in range(50):
    output.append(cdist(A[day], B[day]))
where I use scipy.spatial.distance.cdist.
Is there a faster way to do this? Ideally, output would be a (day, observation, observation) array that contains, for every day, the pairwise distances between observations in A and B, while somehow avoiding the loop over days.
One way to do it (though it will require a massive amount of memory) is to make clever use of array broadcasting:
output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
Edit
But after some testing, it seems that probably scikit-learn's euclidean_distances is the best option for large arrays. (Note that I've rewritten your loop into a list comprehension.)
This is for 100 data points per day:
# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 8.81 ms per loop
# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
10 loops, best of 3: 46.9 ms per loop
# scikit-learn's algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop
and this is for 2000 data points per day:
In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop
In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 2.94 s per loop
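Another option, sketched here on the assumption that a little numerical slack is acceptable, is to expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b and let one batched matmul handle all days at once (roughly what euclidean_distances does internally, minus the per-day Python loop):
def batched_distances(A, B):
    sq_a = np.sum(A**2, axis=-1)[:, :, None]    # (day, obs_A, 1)
    sq_b = np.sum(B**2, axis=-1)[:, None, :]    # (day, 1, obs_B)
    cross = np.matmul(A, B.transpose(0, 2, 1))  # (day, obs_A, obs_B), batched over days
    # clip guards against tiny negative values from floating-point cancellation
    return np.sqrt(np.clip(sq_a + sq_b - 2 * cross, 0, None))
Unlike the broadcasting one-liner, the largest temporary here is the (day, obs, obs) result itself, so memory use stays moderate.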
Edit: I'm an idiot and forgot that Python's map is evaluated lazily. My "faster" code wasn't actually doing any of the work! Forcing evaluation removed the performance boost.
I think your time is going to be dominated by the time spent inside the scipy function. I'd use map instead of the loop anyway, as I think it's a bit neater, but I don't think there's any magic way to get a huge performance boost here. Maybe compiling the code with Cython or using Numba would help a little.

Diagonal sparse matrix obtained from a sparse coo_matrix

I built some sparse matrix M in Python using the coo_matrix format. I would like to find an efficient way to compute:
A = M + M.T - D
where D is the restriction of M to its diagonal (M is potentially very large). I can't find a way to efficiently build D while keeping a coo_matrix format. Any ideas?
Could D = scipy.sparse.spdiags(coo_matrix.diagonal(M),0,M.shape[0],M.shape[0]) be a solution?
I have come up with a faster coo diagonal:
from scipy import sparse

msk = M.row == M.col
D1 = sparse.coo_matrix((M.data[msk], (M.row[msk], M.col[msk])), shape=M.shape)
sparse.tril uses this method with mask = A.row + k >= A.col (sparse/extract.py)
Some timings for a (100,100) M (and M1 = M.tocsr()):
In [303]: timeit msk=M.row==M.col; D1=sparse.coo_matrix((M.data[msk],(M.row[msk],M.col[msk])),shape=M.shape)
10000 loops, best of 3: 115 µs per loop
In [305]: timeit D=sparse.diags(M.diagonal(),0)
1000 loops, best of 3: 358 µs per loop
So the coo way of getting the diagonal is fast, at least for this small and very sparse matrix (with only one entry on the diagonal).
If I start with the csr form, the diags is faster. That's because .diagonal works in the csr format:
In [306]: timeit D=sparse.diags(M1.diagonal(),0)
10000 loops, best of 3: 176 µs per loop
But creating D is a small part of the overall calculation. Again, working with M1 is faster. The sum is done in csr format.
In [307]: timeit M+M.T-D
1000 loops, best of 3: 1.35 ms per loop
In [308]: timeit M1+M1.T-D
1000 loops, best of 3: 1.11 ms per loop
Another way to do the whole thing is to take advantage of the fact that coo allows duplicate i,j values, which are summed when converted to csr format. So you can stack the row, col, and data arrays for M with those for M.T (see M.transpose for how those are constructed), along with masked values for D (or the masked diagonal entries could be removed from M or M.T).
For example:
def MplusMT(M):
    msk = M.row != M.col
    data = np.concatenate([M.data, M.data[msk]])
    rows = np.concatenate([M.row, M.col[msk]])
    cols = np.concatenate([M.col, M.row[msk]])
    MM = sparse.coo_matrix((data, (rows, cols)), shape=M.shape)
    return MM

# alt version with a more explicit D:
# msk = M.row == M.col
# data = np.concatenate([M.data, M.data, -M.data[msk]])
MplusMT as written is very fast because it is just doing array concatenation, not summation. To get the summation we have to convert it to a csr matrix:
MplusMT(M).tocsr()
which takes considerably longer. Still this approach is, in my limited testing, more than 2x faster than M+M.T-D. So it's a potential tool for constructing complex sparse matrices.
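A quick hedged sanity check that the stacking construction really equals M + M.T - D (sparse.random just builds a hypothetical test matrix; this assumes numpy as np and scipy.sparse as sparse are imported):
M = sparse.random(100, 100, density=0.01, format='coo')
D = sparse.diags(M.diagonal(), 0)
reference = (M + M.T - D).toarray()
assert np.allclose(MplusMT(M).tocsr().toarray(), reference)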
You probably want
from scipy.sparse import diags
D = diags(M.diagonal(), 0, format='coo')
This will still build an M-size 1d array as an intermediate step, but that will probably not be so bad.

Is there a way to efficiently invert an array of matrices with numpy?

Normally I would invert an array of 3x3 matrices in a for loop, like in the example below. Unfortunately, for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
Look one level deeper and you find the call to the LAPACK routine, but it's wrapped in several sanity checks. If you strip all these out and just call LAPACK in your for loop (since you already know the dimensions of your matrix, and maybe know that it's real, not complex), things run MUCH faster (note that I've made my array larger):
import numpy as np
from numpy.linalg import lapack_lite, LinAlgError

A = np.random.rand(1000,3,3)

def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv

def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv

def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])

# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these out and calling lapack_lite.dgesv directly, we have:
def faster_inverse(A):
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    identity = np.eye(n_eq, dtype=A.dtype)
    def lapack_inverse(a):
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise LinAlgError('Singular matrix')
        return b
    return np.array([lapack_inverse(a) for a in A])

%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
A few things have changed since this question was asked and answered: numpy.linalg.inv now supports multidimensional arrays, handling them as stacks of matrices with the matrix indices last (in other words, arrays of shape (..., N, N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly, this is by far the best option in terms of performance:
import numpy as np

A = np.random.rand(3,3,1000)

def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv

def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, and then the result has to be permuted back to shape (N,N,M).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy BLAS calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, or dot products of small 3x3 matrices and similar cases, the numpy-MKL build that I use can often be outperformed by quite a margin.
These external BLAS routines are usually made for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at, e.g., Intel IPP.
Please also keep in mind that Numpy uses C-ordered arrays by default (the last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time

@nb.njit(fastmath=True)
def inversion(m):
    minv = np.empty(m.shape, dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] +
                              m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] -
                              m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0] = (m[i,4]*m[i,8] - m[i,5]*m[i,7])*determinant_inv
        minv[i,1] = (m[i,2]*m[i,7] - m[i,1]*m[i,8])*determinant_inv
        minv[i,2] = (m[i,1]*m[i,5] - m[i,2]*m[i,4])*determinant_inv
        minv[i,3] = (m[i,5]*m[i,6] - m[i,3]*m[i,8])*determinant_inv
        minv[i,4] = (m[i,0]*m[i,8] - m[i,2]*m[i,6])*determinant_inv
        minv[i,5] = (m[i,2]*m[i,3] - m[i,0]*m[i,5])*determinant_inv
        minv[i,6] = (m[i,3]*m[i,7] - m[i,4]*m[i,6])*determinant_inv
        minv[i,7] = (m[i,1]*m[i,6] - m[i,0]*m[i,7])*determinant_inv
        minv[i,8] = (m[i,0]*m[i,4] - m[i,1]*m[i,3])*determinant_inv
    return minv

# I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP = m.reshape(m.shape[0], 9)
    minv = inversion(m_TMP)
    return minv.reshape(minv.shape[0], 3, 3)

# Testing
A = np.random.rand(1000000, 3, 3)

# Warm-up call so we don't measure compilation overhead on the first call.
# You may also use @nb.njit(fastmath=True, cache=True), but that still has
# about 0.2 s of overhead on the first call.
Ainv = inversion_3x3(A)

t1 = time.time()
Ainv = inversion_3x3(A)
print(time.time() - t1)

t1 = time.time()
Ainv2 = np.linalg.inv(A)
print(time.time() - t1)

print(np.allclose(Ainv2, Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
For loops are indeed not necessarily much slower than the alternatives, and in this case avoiding the loop will not help you much. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3)  # this shape makes it possible to index the matrices as A[i]
Ainv = np.array(list(map(np.linalg.inv, A)))  # list() forces evaluation on Python 3, where map is lazy
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' in the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since with numpy's default row-major (C) ordering each 3x3 matrix in a (100,3,3) array is contiguous in memory, and therefore a linear readout of the data is ever so slightly faster this way.
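A small sketch of that layout point, checking contiguity flags rather than timing:
import numpy as np

A = np.random.rand(100, 3, 3)
print(A[0].flags['C_CONTIGUOUS'])       # True: each 3x3 matrix is one contiguous block
B = np.random.rand(3, 3, 100)
print(B[..., 0].flags['C_CONTIGUOUS'])  # False: one matrix's elements are strided far apart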
