Let's say I want to do an element-wise sum of a list of numpy arrays:
tosum = [rand(100,100) for n in range(10)]
I've been looking for the best way to do this. It seems like numpy.sum is awful:
timeit.timeit('sum(array(tosum), axis=0)',
setup='from numpy import sum; from __main__ import tosum, array',
number=10000)
75.02289700508118
timeit.timeit('sum(tosum, axis=0)',
setup='from numpy import sum; from __main__ import tosum',
number=10000)
78.99106407165527
Reduce is much faster (to the tune of nearly two orders of magnitude):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=10000)
1.131795883178711
It looks like reduce even has a meaningful lead over the non-numpy sum (note that these are for 1e6 runs rather than 1e4 for the above times):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=1000000)
109.98814797401428
timeit.timeit('sum(tosum)',
setup='from __main__ import tosum',
number=1000000)
125.52461504936218
Are there other methods I should try? Can anyone explain the rankings?
Edit
numpy.sum is definitely faster if the list is turned into a numpy array first:
tosum2 = array(tosum)
timeit.timeit('sum(tosum2, axis=0)',
setup='from numpy import sum; from __main__ import tosum2',
number=10000)
1.1545608043670654
However, I'm only interested in doing the sum once, so converting the list into a numpy array first would still incur a real performance penalty.
The following is competitive with reduce, and is faster if the tosum list is long enough. However, it's not a lot faster, and it is more code. (reduce(add, tosum) sure is pretty.)
def loop_inplace_sum(arrlist):
    # assumes len(arrlist) > 0
    sum = arrlist[0].copy()
    for a in arrlist[1:]:
        sum += a
    return sum
Timing for the original tosum. reduce(add, tosum) is faster:
In [128]: tosum = [rand(100,100) for n in range(10)]
In [129]: %timeit reduce(add, tosum)
10000 loops, best of 3: 73.5 µs per loop
In [130]: %timeit loop_inplace_sum(tosum)
10000 loops, best of 3: 78 µs per loop
Timing for a much longer list of arrays. Now loop_inplace_sum is faster.
In [131]: tosum = [rand(100,100) for n in range(500)]
In [132]: %timeit reduce(add, tosum)
100 loops, best of 3: 5.09 ms per loop
In [133]: %timeit loop_inplace_sum(tosum)
100 loops, best of 3: 4.4 ms per loop
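As a minimal sketch (not a benchmark), the three approaches discussed so far can be put side by side to confirm that they compute the same result. The names sum_reduce, sum_stacked and sum_inplace are made up for this illustration, and the random data is a stand-in for tosum. Note that on Python 3 reduce has to be imported from functools.

```python
from functools import reduce

import numpy as np

# Hypothetical stand-in for the question's `tosum`: ten 100x100 arrays.
rng = np.random.default_rng(0)
tosum = [rng.random((100, 100)) for _ in range(10)]


def sum_reduce(arrays):
    """Pairwise np.add over the list; allocates a new array at each step."""
    return reduce(np.add, arrays)


def sum_stacked(arrays):
    """Copy the list into one (n, rows, cols) array, then reduce over axis 0."""
    return np.sum(np.stack(arrays), axis=0)


def sum_inplace(arrays):
    """Copy the first array once, then accumulate into that copy via out=."""
    total = arrays[0].copy()
    for a in arrays[1:]:
        np.add(total, a, out=total)
    return total


results = [f(tosum) for f in (sum_reduce, sum_stacked, sum_inplace)]
all_agree = all(np.allclose(results[0], r) for r in results[1:])
print(all_agree)
```

Which variant is fastest depends on the list length and array size, as the timings above suggest, so it is worth re-timing on your own data.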
Numpy sum is not awful; you are simply using numpy in the wrong way. You won't be able to make use of numpy's speed advantage if you combine normal Python functions (including reduce!), loops and lists with numpy arrays. If you want your code to be fast, keep the whole computation inside numpy.
Since you did not specify any imports in your code snippet, I am not sure where the function rand comes from, so I assumed that tosum is simply a list of 10 matrices of random numbers. The following code snippet shows that numpy is not as slow as you claim:
import numpy as np
import timeit
def test_np_sum(n=10):
    # n is the number of matrices to sum up element-wise
    tosum = np.random.randint(0, 100, size=(n, 10, 10)) # n 10x10 matrices, shape = (n, 10, 10)
    summed = np.sum(tosum, axis=0) # shape = (10, 10)
And then testing it:
timeit.timeit('test_np_sum()', number=10000, setup='from __main__ import test_np_sum')
0.8418250999999941
I was playing around with benchmarking numpy arrays because I was getting slower than expected results when I tried to replace python arrays with numpy arrays in a script.
I know I'm missing something, and I was hoping someone could clear up my ignorance.
I created two functions and timed them
NUM_ITERATIONS = 1000
def np_array_addition():
    np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] += x
        np_array[1] += x
def py_array_addition():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] += x
        py_array[1] += x
Results:
np_array_addition: 2.556 seconds
py_array_addition: 0.204 seconds
What gives? What's causing the massive slowdown? I figured that if I was using statically sized arrays numpy would be at least the same speed.
Thanks!
Update:
It kept bothering me that numpy array access was slow, and I figured "Hey, they're just arrays in memory right? Cython should solve this!"
And it did. Here's my revised benchmark
import numpy as np
cimport numpy as np
ctypedef np.int_t DTYPE_t
NUM_ITERATIONS = 200000
def np_array_assignment():
    cdef np.ndarray[DTYPE_t, ndim=1] np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] += 1
        np_array[1] += 1
def py_array_assignment():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] += 1
        py_array[1] += 1
I redefined the np_array to cdef np.ndarray[DTYPE_t, ndim=1]
print(timeit(py_array_assignment, number=3))
# 0.03459
print(timeit(np_array_assignment, number=3))
# 0.00755
That's with the python function also being optimized by cython. The timing for the python function in pure python is
print(timeit(py_array_assignment, number=3))
# 0.12510
A 17x speedup. Sure it's a silly example, but I thought it was educational.
It is not (just) the addition that is slow; it is the element access overhead. See for example:
def np_array_assignment():
    np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] = 1
        np_array[1] = 1
def py_array_assignment():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] = 1
        py_array[1] = 1
timeit np_array_assignment()
10000 loops, best of 3: 178 us per loop
timeit py_array_assignment()
10000 loops, best of 3: 72.5 us per loop
Numpy is fast when operating on vectors (matrices), provided the operation is performed on the whole structure at once. Such single element-by-element operations are slow.
Use numpy functions to avoid looping and operate on the whole array at once, e.g.:
def np_array_addition_good():
    np_array = np.array([1, 2])
    np_array += np.sum(np.arange(NUM_ITERATIONS))
The results comparing your functions with the one above are pretty revealing:
timeit np_array_addition()
1000 loops, best of 3: 1.32 ms per loop
timeit py_array_addition()
10000 loops, best of 3: 101 us per loop
timeit np_array_addition_good()
100000 loops, best of 3: 11 us per loop
But actually, you can do just as well with pure Python if you collapse the loops:
def py_array_addition_good():
    py_array = [1, 2]
    rangesum = sum(range(NUM_ITERATIONS))
    py_array = [x + rangesum for x in py_array]
timeit py_array_addition_good()
100000 loops, best of 3: 11 us per loop
All in all, with such simple operations there is really no improvement from using numpy. Optimized code in pure Python works just as well.
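The collapsed versions rely on the fact that the loop adds 0 + 1 + ... + (N - 1) to each element, which has a closed form. A minimal sketch checking that equivalence on Python 3 (the function names looped and collapsed are made up for this illustration):

```python
import numpy as np

NUM_ITERATIONS = 1000


def looped(values):
    """Element-by-element updates, as in np_array_addition."""
    arr = np.array(values)
    for x in range(NUM_ITERATIONS):
        arr[0] += x
        arr[1] += x
    return arr


def collapsed(values):
    """One whole-array update using the closed form of 0 + 1 + ... + (N - 1)."""
    return np.array(values) + NUM_ITERATIONS * (NUM_ITERATIONS - 1) // 2


same = np.array_equal(looped([1, 2]), collapsed([1, 2]))
print(same)
```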
There were a lot of questions about it and I suggest looking at some good answers there:
How do I maximize efficiency with numpy arrays?
numpy float: 10x slower than builtin in arithmetic operations?
You're not actually using numpy's vectorized array addition if you do the loop in Python; there's also the access overhead mentioned by @shashkello.
I took the liberty of increasing the array size a tad, and also adding a vectorized version of the addition:
import numpy as np
from timeit import timeit
NUM_ITERATIONS = 1000
def np_array_addition():
    np_array = np.array(xrange(1000))
    for x in xrange(NUM_ITERATIONS):
        for i in xrange(len(np_array)):
            np_array[i] += x
def np_array_addition2():
    np_array = np.array(xrange(1000))
    for x in xrange(NUM_ITERATIONS):
        np_array += x
def py_array_addition():
    py_array = range(1000)
    for x in xrange(NUM_ITERATIONS):
        for i in xrange(len(py_array)):
            py_array[i] += x
print timeit(np_array_addition, number=3) # 4.216162
print timeit(np_array_addition2, number=3) # 0.117681
print timeit(py_array_addition, number=3) # 0.439957
As you can see, the vectorized numpy version wins pretty handily. The gap will just get larger as array sizes and/or iterations increase.
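The benchmark above is written for Python 2 (xrange, print statements, range returning a list). A rough Python 3 port, with smaller sizes chosen here only to keep the check quick, could look like the sketch below. It verifies that the three variants compute the same values; timings are left out because they depend on the machine.

```python
import numpy as np

SIZE = 200            # smaller than the answer's 1000 to keep the check quick
NUM_ITERATIONS = 50


def np_elementwise():
    arr = np.arange(SIZE)
    for x in range(NUM_ITERATIONS):
        for i in range(len(arr)):
            arr[i] += x  # one Python-level index operation per element
    return arr


def np_vectorized():
    arr = np.arange(SIZE)
    for x in range(NUM_ITERATIONS):
        arr += x  # one whole-array operation per iteration
    return arr


def py_list():
    values = list(range(SIZE))  # range() is not a list in Python 3
    for x in range(NUM_ITERATIONS):
        for i in range(len(values)):
            values[i] += x
    return values


agree = (np.array_equal(np_elementwise(), np_vectorized())
         and np_vectorized().tolist() == py_list())
print(agree)
```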
I have a one-dimensional numpy array, which is quite large in size. For each entry of the array, I need to produce a linearly spaced sub-array up to that entry value. Here is what I have as an example.
import numpy as np
a = np.array([2, 3])
b = np.array([np.linspace(0, i, 4) for i in a])
In this case there is linear space of size 4. The last statement in the above code involves a for loop which is rather slow if a is very large. Is there a trick to implement this in numpy itself?
You can phrase this as an outer product:
In [37]: a = np.arange(100000)
In [38]: %timeit np.array([np.linspace(0, i, 4) for i in a])
1 loop, best of 3: 1.3 s per loop
In [39]: %timeit np.outer(a, np.linspace(0, 1, 4))
1000 loops, best of 3: 1.44 ms per loop
The idea is to take a unit linspace and then scale it separately by each element of a.
As you can see, this gives a ~1000x speedup for n=100000.
For completeness, I'll mention that this code has slightly different roundoff properties than your original version (likely not an issue in practical applications):
In [52]: np.max(np.abs(np.array([np.linspace(0, i, 4) for i in a]) -
...: np.outer(a, np.linspace(0, 1, 4))))
Out[52]: 1.4551915228366852e-11
P. S. An alternative way to express the idea is by using element-wise multiplication with broadcasting (based on a suggestion by @Scott Gigante):
In [55]: %timeit a[:, np.newaxis] * np.linspace(0, 1, 4)
1000 loops, best of 3: 1.48 ms per loop
P. P. S. See the comments below for further ideas on making this faster.
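For a small input, the loop, the outer product and the broadcasting form can be compared directly. This is only an illustrative check, with a three-element array picked arbitrarily:

```python
import numpy as np

a = np.array([2, 3, 10])
num = 4  # points per row, as in the question

# Original approach: one np.linspace call per element of a.
looped = np.array([np.linspace(0, i, num) for i in a])

# Scale a single unit linspace by every element of a.
unit = np.linspace(0, 1, num)
via_outer = np.outer(a, unit)
via_broadcast = a[:, np.newaxis] * unit

print(np.allclose(looped, via_outer), np.allclose(looped, via_broadcast))
```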
Let A,B be ((day,observation,dim)) arrays. Each array contains for a given day the same number of observations, an observation being a point with dim dimensions (that is dim floats). For every day, I want to compute the spatial distances between all observations in A and B that day.
For example:
import numpy as np
from scipy.spatial.distance import cdist
A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)
output = []
for day in range(50):
    output.append(cdist(A[day],B[day]))
where I use scipy.spatial.distance.cdist.
Is there a faster way to do this? Ideally, I would like to get for output a ((day,observation,observation)) array that contains for every day the pairwise distances between observations in A and B that day, whilst somehow avoiding the loop over days.
One way to do it (though it will require a massive amount of memory) is to make clever use of array broadcasting:
output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
Edit
But after some testing, it seems that scikit-learn's euclidean_distances is probably the best option for large arrays. (Note that I've rewritten your loop into a list comprehension.)
This is for 100 data points per day:
# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 8.81 ms per loop
# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
10 loops, best of 3: 46.9 ms per loop
# scikit-learn's algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop
and this is for 2000 data points per day:
In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop
In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 2.94 s per loop
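The memory cost of the pure broadcasting version comes from the (day, n, m, dim) temporary. A NumPy-only sketch that avoids it uses the expansion |x - y|^2 = x.x - 2 x.y + y.y with one batched matrix product, which is the same algebraic identity that scikit-learn documents for euclidean_distances. The function names and the small array sizes below are assumptions made for this illustration, and the expanded form can be slightly less accurate for points that are very close together.

```python
import numpy as np

rng = np.random.default_rng(0)
days, n_obs, dim = 5, 40, 10  # small, hypothetical sizes
A = rng.random((days, n_obs, dim))
B = rng.random((days, n_obs, dim))


def dists_broadcast(A, B):
    """Direct form: builds a (days, n, m, dim) temporary."""
    diff = A[:, :, np.newaxis, :] - B[:, np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def dists_expanded(A, B):
    """Expanded form: only (days, n, m) temporaries, one batched matmul."""
    aa = np.sum(A ** 2, axis=-1)[:, :, np.newaxis]   # (days, n, 1)
    bb = np.sum(B ** 2, axis=-1)[:, np.newaxis, :]   # (days, 1, m)
    ab = A @ B.transpose(0, 2, 1)                    # (days, n, m)
    sq = np.maximum(aa - 2.0 * ab + bb, 0.0)         # clip round-off negatives
    return np.sqrt(sq)


print(np.allclose(dists_broadcast(A, B), dists_expanded(A, B)))
```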
Edit: I'm an idiot and forgot that python's map is evaluated lazily. My "faster" code wasn't actually doing any of the work! Forcing evaluation removed the performance boost.
I think your time is going to be dominated by the time spent inside the scipy function. I'd use map instead of the loop anyway, as I think it's a bit neater, but I don't think there's any magic way to get a huge performance boost here. Maybe compiling the code with cython or using numba would help a little.
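The lazy-evaluation pitfall mentioned in the edit is easy to reproduce on Python 3. In this small sketch, record is a made-up stand-in for an expensive function that logs each call:

```python
calls = []


def record(x):
    """Hypothetical stand-in for an expensive function; logs each call."""
    calls.append(x)
    return x * x


lazy = map(record, [1, 2, 3])   # Python 3: returns an iterator, no calls yet
calls_before = len(calls)

squares = list(lazy)            # consuming the iterator does the work
calls_after = len(calls)

print(calls_before, calls_after, squares)
```

Timing only the map(...) call therefore measures the creation of the iterator, not the computation.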
I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.
Here's an example that gives me what I want with an array of 1000 numbers.
import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.
Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
Neither of the other answers quite answered the question: one was in Cython, and one was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.
Here's some code:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
    dists = np.abs(r - r[:, None])
def option2(r):
    dists = scipy.spatial.distance.pdist(r, 'cityblock')
def option3(r):
    dists = sklearn.metrics.pairwise.manhattan_distances(r)
Timing with IPython:
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
Here is a Cython implementation that gives more than a 3x speed improvement for this example on my computer. This timing should be reviewed for bigger arrays though, because the BLAS routines can probably scale much better than this rather naive code.
I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:
File my_cython.pyx:
import numpy as np
cimport numpy as np
import cython
cdef extern from "math.h":
    # fabs is the floating-point absolute value declared in math.h
    double fabs(double t)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
    cdef int i, j, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    # n * (n + 1) / 2 entries: every pair with j >= i, diagonal included
    size = sum(range(1, r.shape[0]+1))
    ans = np.empty(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i, r.shape[0]):
            c += 1
            ans[c] = fabs(r[i] - r[j])
    return ans
The answer is a 1-D array containing all non-repeated evaluations.
To import into Python:
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
    return np.abs(r - r[:, None])
Timing with IPython:
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
Using half the memory, but 6 times slower than np.abs(r - r[:, None]):
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])
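To make the memory trade-off concrete, here is a small NumPy-only sketch that computes only the i < j pairs and then rebuilds the full symmetric matrix from them. The array size of 200 is arbitrary. As far as I can tell, the row-by-row i < j ordering is also the one scipy documents for pdist's condensed output, but treat that as an assumption to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.integers(1, 1000, size=200)

full = np.abs(r - r[:, None])            # (n, n), every distance stored twice

i, j = np.triu_indices(r.shape[0], k=1)  # index pairs with i < j
condensed = np.abs(r[j] - r[i])          # n * (n - 1) / 2 values

# Rebuild the symmetric matrix from the condensed values.
rebuilt = np.zeros_like(full)
rebuilt[i, j] = condensed
rebuilt[j, i] = condensed

print(condensed.size, np.array_equal(rebuilt, full))
```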
Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np
A = np.random.rand(1000,3,3)
def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv
def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv
def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])
# lapack_lite is an internal NumPy module; dgesv may not be available in every NumPy version
from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    identity = np.eye(n_eq)
    def lapack_inverse(a):
        # dgesv overwrites b with the solution, so start from a fresh copy
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_routine(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise np.linalg.LinAlgError('Singular matrix')
        return b
    return np.array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
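The identity-reuse idea behind fast_inverse can be checked in isolation using only public NumPy functions. In this sketch the diagonal shift added to the random matrices is an assumption of mine, there only to keep the test matrices well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
# Diagonal shift keeps the hypothetical test matrices well conditioned.
A = rng.random((100, 3, 3)) + 3.0 * np.eye(3)

identity = np.identity(3)  # allocated once, outside the loop
via_solve = np.array([np.linalg.solve(a, identity) for a in A])
via_inv = np.array([np.linalg.inv(a) for a in A])

print(np.allclose(via_solve, via_inv))
```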
A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with matrix indices being last (in other words, arrays of shape (...,M,N,N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:
import numpy as np
A = np.random.rand(3,3,1000)
def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv
def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, then the result has to be permuted back to shape (N,N,M).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
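The same axis shuffling can also be written with np.moveaxis, which some readers find easier to follow than explicit transpose tuples. A short sketch follows; the small stack size and the diagonal shift that keeps the matrices well conditioned are assumptions made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3, 50)) + 3.0 * np.eye(3)[:, :, np.newaxis]

stacked = np.moveaxis(A, -1, 0)          # (N, N, M) -> (M, N, N), a view
inv_stacked = np.linalg.inv(stacked)     # one call for all M matrices
Ainv = np.moveaxis(inv_stacked, 0, -1)   # back to (N, N, M)

# Multiply each matrix by its inverse; every product should be the identity.
products = np.einsum('ijm,jkm->ikm', A, Ainv)
print(np.allclose(products, np.eye(3)[:, :, np.newaxis]))
```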
Numpy-BLAS calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot products of small 3x3 matrices or similar cases, numpy-MKL, which I use, can often be outperformed by quite a margin.
These external BLAS routines are usually made for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at e.g. Intel IPP.
Please also keep in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
@nb.njit(fastmath=True)
def inversion(m):
    minv=np.empty(m.shape,dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
    return minv
#I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP=m.reshape(m.shape[0],9)
    minv=inversion(m_TMP)
    return minv.reshape(minv.shape[0],3,3)
#Testing
A = np.random.rand(1000000,3,3)
#Warmup to not measure compilation overhead on the first call
#You may also use @nb.njit(fastmath=True,cache=True) but this has also about 0.2s
#overhead on first call
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
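If numba is not available, a similar hard-coded approach can be sketched in plain NumPy. This is a different formulation from the code above, not a port of it: it uses the fact that the columns of the adjugate of a 3x3 matrix are cross products of its rows. The function name inverse_3x3_stack is made up here, and like the numba version it does no singularity check. I have not timed it against the other options.

```python
import numpy as np


def inverse_3x3_stack(m):
    """Invert a stack of 3x3 matrices, shape (n, 3, 3), via the adjugate."""
    r0, r1, r2 = m[:, 0, :], m[:, 1, :], m[:, 2, :]
    # Columns of the adjugate are cross products of the rows.
    c0 = np.cross(r1, r2)
    c1 = np.cross(r2, r0)
    c2 = np.cross(r0, r1)
    det = np.einsum('ij,ij->i', r0, c0)  # r0 . (r1 x r2)
    return np.stack([c0, c1, c2], axis=-1) / det[:, np.newaxis, np.newaxis]


rng = np.random.default_rng(0)
# Diagonal shift keeps the hypothetical test matrices well conditioned.
A = rng.random((1000, 3, 3)) + 3.0 * np.eye(3)
print(np.allclose(inverse_3x3_stack(A), np.linalg.inv(A)))
```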
For loops are not necessarily much slower than the alternatives, and in this case avoiding one will not help you much. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3) # matrix index first, so that
# the matrices can be indexed
# as A[i]
Ainv = np.array(list(map(np.linalg.inv, A))) # list() is needed on Python 3, where map is lazy
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' in the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change: numpy arrays are stored in row-major (C) order by default, so with the matrix index first each 3x3 matrix is contiguous in memory and a linear readout of the data is ever so slightly faster this way.
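The layout point can be checked directly. NumPy allocates arrays in C (row-major) order by default, so the last index varies fastest in memory. The variable names below are made up for this sketch:

```python
import numpy as np

matrices_first = np.zeros((100, 3, 3))   # index as A[i]
matrices_last = np.zeros((3, 3, 100))    # index as A[:, :, i]

# A[i] is one contiguous block of nine floats.
first_contig = matrices_first[0].flags['C_CONTIGUOUS']
first_strides = matrices_first[0].strides

# A[:, :, i] picks every 100th element, so it is a strided view.
last_contig = matrices_last[:, :, 0].flags['C_CONTIGUOUS']
last_strides = matrices_last[:, :, 0].strides

print(first_contig, first_strides)
print(last_contig, last_strides)
```

Whether this difference matters in practice depends on how much of the run time is spent in memory access rather than in per-call overhead.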