I have an 1D array of numbers, and want to calculate all pairwise euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.
Here's an example that gives me what I want with an array of 1000 numbers.
import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.
Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
Neither of the other answers quite answered the question - 1 was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.
Here's some code:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
dists = np.abs(r - r[:, None])
def option2(r):
dists = scipy.spatial.distance.pdist(r, 'cityblock')
def option3(r):
dists = sklearn.metrics.pairwise.manhattan_distances(r)
Timing with IPython:
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
Here is a Cython implementation that gives more than 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays tough, because the BLAS routines can probably scale much better than this rather naive code.
I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:
File my_cython.pyx:
import numpy as np
cimport numpy as np
import cython
cdef extern from "math.h":
double abs(double t)
#cython.wraparound(False)
#cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
cdef int i, j, c, size
cdef np.ndarray[np.double_t, ndim=1] ans
size = sum(range(1, r.shape[0]+1))
ans = np.empty(size, dtype=r.dtype)
c = -1
for i in range(r.shape[0]):
for j in range(i, r.shape[0]):
c += 1
ans[c] = abs(r[i] - r[j])
return ans
The answer is a 1-D array containing all non-repeated evaluations.
To import into Python:
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
return np.abs(r - r[:, None])
Timing with IPython:
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
Using half the memory, but 6 times slower than np.abs(r - r[:, None]):
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])
Related
I was playing around with benchmarking numpy arrays because I was getting slower than expected results when I tried to replace python arrays with numpy arrays in a script.
I know I'm missing something, and I was hoping someone could clear up my ignorance.
I created two functions and timed them
NUM_ITERATIONS = 1000
def np_array_addition():
np_array = np.array([1, 2])
for x in xrange(NUM_ITERATIONS):
np_array[0] += x
np_array[1] += x
def py_array_addition():
py_array = [1, 2]
for x in xrange(NUM_ITERATIONS):
py_array[0] += x
py_array[1] += x
Results:
np_array_addition: 2.556 seconds
py_array_addition: 0.204 seconds
What gives? What's causing the massive slowdown? I figured that if I was using statically sized arrays numpy would be at least the same speed.
Thanks!
Update:
It kept bothering me that numpy array access was slow, and I figured "Hey, they're just arrays in memory right? Cython should solve this!"
And it did. Here's my revised benchmark
import numpy as np
cimport numpy as np
ctypedef np.int_t DTYPE_t
NUM_ITERATIONS = 200000
def np_array_assignment():
cdef np.ndarray[DTYPE_t, ndim=1] np_array = np.array([1, 2])
for x in xrange(NUM_ITERATIONS):
np_array[0] += 1
np_array[1] += 1
def py_array_assignment():
py_array = [1, 2]
for x in xrange(NUM_ITERATIONS):
py_array[0] += 1
py_array[1] += 1
I redefined the np_array to cdef np.ndarray[DTYPE_t, ndim=1]
print(timeit(py_array_assignment, number=3))
# 0.03459
print(timeit(np_array_assignment, number=3))
# 0.00755
That's with the python function also being optimized by cython. The timing for the python function in pure python is
print(timeit(py_array_assignment, number=3))
# 0.12510
A 17x speedup. Sure it's a silly example, but I thought it was educational.
This is not (just) addition which is slow, it is element access overhead, see for example:
def np_array_assignment():
np_array = np.array([1, 2])
for x in xrange(NUM_ITERATIONS):
np_array[0] = 1
np_array[1] = 1
def py_array_assignment():
py_array = [1, 2]
for x in xrange(NUM_ITERATIONS):
py_array[0] = 1
py_array[1] = 1
timeit np_array_assignment()
10000 loops, best of 3: 178 us per loop
timeit py_array_assignment()
10000 loops, best of 3: 72.5 us per loop
Numpy is fast with operating on vectors (matrices), when performed on the whole structure at once. Such single element-by-element operations are slow.
Use numpy functions to avoid looping, making operations on the whole array at once, i.e.:
def np_array_addition_good():
np_array = np.array([1, 2])
np_array += np.sum(np.arange(NUM_ITERATIONS))
The results comparing your functions with the one above are pretty revealing:
timeit np_array_addition()
1000 loops, best of 3: 1.32 ms per loop
timeit py_array_addition()
10000 loops, best of 3: 101 us per loop
timeit np_array_addition_good()
100000 loops, best of 3: 11 us per loop
But actually, you can do as good with pure python if you collapse the loops:
def py_array_addition_good():
py_array = [1, 2]
rangesum = sum(range(NUM_ITERATIONS))
py_array = [x + rangesum for x in py_array]
timeit py_array_addition_good()
100000 loops, best of 3: 11 us per loop
All in all, with such simple operations there is really no improvement in using numpy. Optimized code in pure python works just as good.
There were a lot of questions about it and I suggest looking at some good answers there:
How do I maximize efficiency with numpy arrays?
numpy float: 10x slower than builtin in arithmetic operations?
You're not actually using numpy's vectorized array addition if you do the loop in python; there's also the access overhead mentioned by #shashkello.
I took the liberty of increasing the array size a tad, and also adding a vectorized version of the addition:
import numpy as np
from timeit import timeit
NUM_ITERATIONS = 1000
def np_array_addition():
np_array = np.array(xrange(1000))
for x in xrange(NUM_ITERATIONS):
for i in xrange(len(np_array)):
np_array[i] += x
def np_array_addition2():
np_array = np.array(xrange(1000))
for x in xrange(NUM_ITERATIONS):
np_array += x
def py_array_addition():
py_array = range(1000)
for x in xrange(NUM_ITERATIONS):
for i in xrange(len(py_array)):
py_array[i] += x
print timeit(np_array_addition, number=3) # 4.216162
print timeit(np_array_addition2, number=3) # 0.117681
print timeit(py_array_addition, number=3) # 0.439957
As you can see, the vectorized numpy version wins pretty handily. The gap will just get larger as array sizes and/or iterations increase.
In my project I need to compute euclidian distance beetween each points stored in an array.
The entry array is a 2D numpy array with 3 columns which are the coordinates(x,y,z) and each rows define a new point.
I'm usualy working with 5000 - 6000 points in my test cases.
My first algorithm use Cython and my second numpy. I find that my numpy algorithm is faster than cython.
edit: with 6000 points :
numpy 1.76 s / cython 4.36 s
Here's my cython code:
cimport cython
from libc.math cimport sqrt
#cython.boundscheck(False)
#cython.wraparound(False)
cdef void calcul1(double[::1] M,double[::1] R):
cdef int i=0
cdef int max = M.shape[0]
cdef int x,y
cdef int start = 1
for x in range(0,max,3):
for y in range(start,max,3):
R[i]= sqrt((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2 + (M[y+2] - M[x+2])**2)
i+=1
start += 1
M is a memory view of the initial entry array but flatten() by numpy before the call of the function calcul1(), R is a memory view of a 1D output array to store all the results.
Here's my Numpy code :
def calcul2(M):
return np.sqrt(((M[:,:,np.newaxis] - M[:,np.newaxis,:])**2).sum(axis=0))
Here M is the initial entry array but transpose() by numpy before the function call to have coordinates(x,y,z) as rows and points as columns.
Moreover this numpy function is quite convinient because the array it returns is well organise. It's a n by n array with n the number of points and each points has a row and a column. So for example the distance AB is stored at the intersection index of row A and column B.
Here's how I call them (cython function):
cpdef test():
cdef double[::1] Mf
cdef double[::1] out = np.empty(17998000,dtype=np.float64) # (6000² - 6000) / 2
M = np.arange(6000*3,dtype=np.float64).reshape(6000,3) # Example array with 6000 points
Mf = M.flatten() #because my cython algorithm need a 1D array
Mt = M.transpose() # because my numpy algorithm need coordinates as rows
calcul2(Mt)
calcul1(Mf,out)
Am I doing something wrong here ? For my project both are not fast enough.
1: Is there a way to improve my cython code in order to beat numpy's speed ?
2: Is there a way to improve my numpy code to compute even faster ?
3: Or any other solutions, but it must be a python/cython (like parallel computing) ?
Thank you.
Not sure where you are getting your timings, but you can use scipy.spatial.distance:
M = np.arange(6000*3, dtype=np.float64).reshape(6000,3)
np_result = calcul2(M)
sp_result = sd.cdist(M.T, M.T) #Scipy usage
np.allclose(np_result, sp_result)
>>> True
Timings:
%timeit calcul2(M)
1000 loops, best of 3: 313 µs per loop
%timeit sd.cdist(M.T, M.T)
10000 loops, best of 3: 86.4 µs per loop
Importantly, its also useful to realize that your output is symmetric:
np.allclose(sp_result, sp_result.T)
>>> True
An alternative is to only compute the upper triangular of this array:
%timeit sd.pdist(M.T)
10000 loops, best of 3: 39.1 µs per loop
Edit: Not sure which index you want to zip, looks like you may be doing it both ways? Zipping the other index for comparison:
%timeit sd.pdist(M)
10 loops, best of 3: 135 ms per loop
Still about 10x faster than your current NumPy implementation.
I have just started experimenting with cython and as a first exercise I created the following (re)implementation of a function computing the sin for each element of an array. So here's my sin.pyx
from numpy cimport ndarray, float64_t
import numpy as np
cdef extern from "math.h":
double sin(double x)
def sin_array(ndarray[float64_t, ndim=1] arr):
cdef int n = len(arr)
cdef ndarray h = np.zeros(n, dtype=np.float64)
for i in range(n):
h[i] = sin(arr[i])
return h
I also created the following setup.py for this
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy
ext = Extension("sin", sources=["sin.pyx"])
setup(ext_modules=[ext],
cmdclass={"build_ext": build_ext},
include_dirs=[numpy.get_include()])
So this creates my *.so file. I import this into python and create 1000 random numbers, e.g.
import sin
import numpy as np
x = np.random.randn(1000)
%timeit sin.sin_array(x)
%timeit np.sin(x)
Numpy wins by a factor of 3. Why is that? I thought that a function making very explicit assumption about the type and the dimension of the input-array can be more competitive here. Of course, I also understand that numpy is incredible clever, but chances are that I am doing something stupid here...
Note that the point of this exercise is to not to rewrite a faster sin function but rather to create some cython wrappers for some of our internal tools but that's another issue for later...
Cython's annotation feature, cython -a filename.pyx is your friend. It generates a html file which you can load in a browser and it highlights lines of code which are not well optimized. You can click on a line to see the generated c code.
In this case the problem appears to be that h is not properly typed. If you simply type an array as ndarray you're telling Cython that it's an array, but you're not giving cython enough information to tell it how to index it efficiently, you must give the type and shape information. You have done this correctly in the function declaration.
I imagine that once this has been fixed the performance will be comparable, but if it isn't annotate will tell you what's wrong. If cython is still slower then numpy probably uses a faster sin function than the standard c one (you can get much faster sin approximations, try googling it if interested).
Here are a couple of variants and the performance on my machine (which may vary) using the cython magic in ipython:
%%cython --compile-args=-O3 -a
import numpy as np
cimport numpy as np
import cython
from libc.math cimport sin
def sin_array(np.ndarray[np.float64_t, ndim=1] arr):
cdef int n = len(arr)
cdef np.ndarray h = np.zeros(n, dtype=np.float64)
for i in range(n):
h[i] = sin(arr[i])
return h
#cython.boundscheck(False)
#cython.wraparound(False)
def sin_array1(np.ndarray[np.float64_t, ndim=1] arr):
cdef int n = arr.shape[0]
cdef unsigned int i
cdef np.ndarray[np.float64_t, ndim=1] h = np.empty_like(arr)
for i in range(n):
h[i] = sin(arr[i])
return h
#cython.boundscheck(False)
#cython.wraparound(False)
def sin_array2(np.float64_t[:] arr):
cdef int n = arr.shape[0]
cdef unsigned int i
cdef np.ndarray[np.float64_t, ndim=1] h = np.empty(n, np.float64)
cdef np.float64_t[::1] _h = h
for i in range(n):
_h[i] = sin(arr[i])
return h
And for kicks, I threw in a Numba jitted method:
import numpy as np
import numba as nb
#nb.jit
def sin_numba(x):
n = x.shape[0]
h = np.empty(n, np.float64)
for k in range(n):
h[k] = np.sin(x[k])
return h
And the timings:
In [25]:
x = np.random.randn(1000)
%timeit np.sin(x)
%timeit sin_array(x)
%timeit sin_array1(x)
%timeit sin_array2(x)
%timeit sin_numba(x)
10000 loops, best of 3: 27 µs per loop
10000 loops, best of 3: 80.3 µs per loop
10000 loops, best of 3: 28.7 µs per loop
10000 loops, best of 3: 32.8 µs per loop
10000 loops, best of 3: 31.4 µs per loop
The numpy built-in is still the fastest (but just by a little bit), and the numba performance is pretty good considering the simplicity of not specifying any type info.
Update:
It's also always good to take a look at a variety of array sizes. Here are the timings for an array of 10000 elements:
In [26]:
x = np.random.randn(10000)
%timeit np.sin(x)
%timeit sin_array(x)
%timeit sin_array1(x)
%timeit sin_array2(x)
%timeit sin_numba(x)
1000 loops, best of 3: 267 µs per loop
1000 loops, best of 3: 783 µs per loop
1000 loops, best of 3: 267 µs per loop
1000 loops, best of 3: 268 µs per loop
1 loops, best of 3: 287 µs per loop
Here you can see almost identical timings between the optimized versions of the original method and the np.sin call, pointing to some overhead in the initialization of the data structures in cython or the return. Numba fares slightly worse under these conditions.
I thought I'd update this using Python 3.6.1 and Cython 0.25.2. As suggested by #blake-walsh I correctly typed all variables and used the -a option to check that the code was translated to C with no extra tests. I also used the newer typed memoryview approach to passing arrays to functions.
The result is that Cython, compiling Python to C and using C libraries for math functions is 45% faster than the Numpy solution. Why? Probably because Numpy has a number of checks and generalizations that I didn't add to the Cython version. I've done a number of Cython vs C tests recently and, if you can use code that can be translated to C the difference isn't significant. Cython really is fast.
The code is:
%%cython -c=-O3 -c=-march=native
import cython
cimport cython
from libc.math cimport sin
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
cpdef double [:] cy_sin(double [:] arr):
cdef unsigned int i, n = arr.shape[0]
for i in range(n):
arr[i] = sin(arr[i])
return arr
import numpy as np
x = np.random.randn(1000)
%timeit np.sin(x)
%timeit cy_sin(x)
and the results were:
15.6 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
10.7 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit:
I added parallel processing by changing the code to:
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force -c=-O3 -c=-march=native
import cython
cimport cython
cimport openmp
from cython.parallel import parallel, prange
from libc.math cimport sin
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
cpdef double [:] cy_sin(double [:] arr):
cdef int i, n = arr.shape[0]
for i in prange(n, nogil=True):
# for i in range(n):
arr[i] = sin(arr[i])
return arr
On this small array, it approximately doubled speed (i5-3470 3.2GHz x4 processor) to complete in 5.75 µs. On larger 1M+ sized arrays it quadrupled speed.
Let's say I want to do an element-wise sum of a list of numpy arrays:
tosum = [rand(100,100) for n in range(10)]
I've been looking for the best way to do this. It seems like numpy.sum is awful:
timeit.timeit('sum(array(tosum), axis=0)',
setup='from numpy import sum; from __main__ import tosum, array',
number=10000)
75.02289700508118
timeit.timeit('sum(tosum, axis=0)',
setup='from numpy import sum; from __main__ import tosum',
number=10000)
78.99106407165527
Reduce is much faster (to the tune of nearly two orders of magnitude):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=10000)
1.131795883178711
It looks like reduce even has a meaningful lead over the non-numpy sum (note that these are for 1e6 runs rather than 1e4 for the above times):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=1000000)
109.98814797401428
timeit.timeit('sum(tosum)',
setup='from __main__ import tosum',
number=1000000)
125.52461504936218
Are there other methods I should try? Can anyone explain the rankings?
Edit
numpy.sum is definitely faster if the list is turned into a numpy array first:
tosum2 = array(tosum)
timeit.timeit('sum(tosum2, axis=0)',
setup='from numpy import sum; from __main__ import tosum2',
number=10000)
1.1545608043670654
However, I'm only interested in doing a sum once, so turning the array into a numpy array would still incur a real performance penalty.
The following is competitive with reduce, and is faster if the tosum list is long enough. However, it's not a lot faster, and it is more code. (reduce(add, tosum) sure is pretty.)
def loop_inplace_sum(arrlist):
# assumes len(arrlist) > 0
sum = arrlist[0].copy()
for a in arrlist[1:]:
sum += a
return sum
Timing for the original tosum. reduce(add, tosum) is faster:
In [128]: tosum = [rand(100,100) for n in range(10)]
In [129]: %timeit reduce(add, tosum)
10000 loops, best of 3: 73.5 µs per loop
In [130]: %timeit loop_inplace_sum(tosum)
10000 loops, best of 3: 78 µs per loop
Timing for a much longer list of arrays. Now loop_inplace_sum is faster.
In [131]: tosum = [rand(100,100) for n in range(500)]
In [132]: %timeit reduce(add, tosum)
100 loops, best of 3: 5.09 ms per loop
In [133]: %timeit loop_inplace_sum(tosum)
100 loops, best of 3: 4.4 ms per loop
Numpy sum is not awful, you are simply using numpy in the wrong way. You won't be able to make use of numpy's speed advantage if you combine normal python, functions (including reduce!), loops and lists with numpy arrays. If you want your code to be fast, you must only use numpy.
Since you did not specify any imports in your code snippet, I am not sure what the function randn is doing or where it comes from, so I just assumed that tosum should just represent a list of 10 matrices of some random numbers. The following code snippet shows that numpy is definitely not as slow as you claim it to be:
import numpy as np
import timeit
def test_np_sum(n=10):
# n represents the numbers of matrices to sum up element wise
tosum = np.random.randint(0, 100, size=(n, 10, 10)) # n 10x10 matrices, shape = (n, 10, 10)
summed = np.sum(tosum, axis=0) # shape = (10, 10)
And then testing it:
timeit.timeit('test_np_sum()', number=10000, setup='from __main__ import test_np_sum')
0.8418250999999941
Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, inv(A.shape[0]). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np
A = np.random.rand(1000,3,3)
def slow_inverse(A):
Ainv = np.zeros_like(A)
for i in range(A.shape[0]):
Ainv[i] = np.linalg.inv(A[i])
return Ainv
def fast_inverse(A):
identity = np.identity(A.shape[2], dtype=A.dtype)
Ainv = np.zeros_like(A)
for i in range(A.shape[0]):
Ainv[i] = np.linalg.solve(A[i], identity)
return Ainv
def fast_inverse2(A):
identity = np.identity(A.shape[2], dtype=A.dtype)
return array([np.linalg.solve(x, identity) for x in A])
from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
b = np.identity(A.shape[2], dtype=A.dtype)
n_eq = A.shape[1]
n_rhs = A.shape[2]
pivots = zeros(n_eq, np.intc)
identity = np.eye(n_eq)
def lapack_inverse(a):
b = np.copy(identity)
pivots = zeros(n_eq, np.intc)
results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
if results['info'] > 0:
raise LinAlgError('Singular matrix')
return b
return array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with matrix indices being last (in other words, arrays of shape (...,M,N,N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:
import numpy as np
A = np.random.rand(3,3,1000)
def slow_inverse(A):
"""Looping solution for comparison"""
Ainv = np.zeros_like(A)
for i in range(A.shape[-1]):
Ainv[...,i] = np.linalg.inv(A[...,i])
return Ainv
def direct_inverse(A):
"""Compute the inverse of matrices in an array of shape (N,N,M)"""
return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, then the result has to be permuted back to shape (M,N,N).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy-Blas calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot-products of small 3x3 matrices or similar cases, numpy-MKL which I use can often be outperformed by quite a margin.
This external Blas routines are usually made for problems with larger matrices, for smaller ones you can write out a standard algorithm or take a look at eg. Intel IPP.
Please keep also in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
#nb.njit(fastmath=True)
def inversion(m):
minv=np.empty(m.shape,dtype=m.dtype)
for i in range(m.shape[0]):
determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
return minv
#I was to lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
m_TMP=m.reshape(m.shape[0],9)
minv=inversion(m_TMP)
return minv.reshape(minv.shape[0],3,3)
#Testing
A = np.random.rand(1000000,3,3)
#Warmup to not measure compilation overhead on the first call
#You may also use #nb.njit(fastmath=True,cache=True) but this has also about 0.2s
#overhead on fist call
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
For loops are indeed not necessarily much slower than the alternatives and also in this case, it will not help you much. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3) #this is to makes it
#possible to index
#the matrices as A[i]
Ainv = np.array(map(np.linalg.inv, A))
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since it utilises the fact that numpy arrays are ordered column-wise and therefor a linear readout of the data is ever so slightly faster this way.