Performance: Python lists vs. Numpy ndarrays [duplicate]

I was playing around with benchmarking numpy arrays because I was getting slower-than-expected results when I tried to replace Python lists with numpy arrays in a script.
I know I'm missing something, and I was hoping someone could clear up my ignorance.
I created two functions and timed them:
import numpy as np

NUM_ITERATIONS = 1000

def np_array_addition():
    np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] += x
        np_array[1] += x

def py_array_addition():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] += x
        py_array[1] += x
Results:
np_array_addition: 2.556 seconds
py_array_addition: 0.204 seconds
What gives? What's causing the massive slowdown? I figured that if I was using statically sized arrays numpy would be at least the same speed.
Thanks!
Update:
It kept bothering me that numpy array access was slow, and I figured, "Hey, they're just arrays in memory, right? Cython should solve this!"
And it did. Here's my revised benchmark:
import numpy as np
cimport numpy as np

ctypedef np.int_t DTYPE_t

NUM_ITERATIONS = 200000

def np_array_assignment():
    cdef np.ndarray[DTYPE_t, ndim=1] np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] += 1
        np_array[1] += 1

def py_array_assignment():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] += 1
        py_array[1] += 1
I redefined the np_array to cdef np.ndarray[DTYPE_t, ndim=1]
print(timeit(py_array_assignment, number=3))
# 0.03459
print(timeit(np_array_assignment, number=3))
# 0.00755
That's with the python function also being optimized by cython. The timing for the python function in pure python is
print(timeit(py_array_assignment, number=3))
# 0.12510
A 17x speedup. Sure it's a silly example, but I thought it was educational.
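For anyone wanting to reproduce this, here is one way to compile and run the .pyx above (a sketch of mine, not part of the original post; the module name np_bench is a placeholder, and the numpy include directory is needed because of the cimport numpy line):
import numpy as np
import pyximport
from timeit import timeit

# Compile .pyx files on import; numpy headers are required for `cimport numpy`.
pyximport.install(setup_args={'include_dirs': np.get_include()})

from np_bench import np_array_assignment, py_array_assignment  # hypothetical module name

print(timeit(py_array_assignment, number=3))
print(timeit(np_array_assignment, number=3))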

It is not (just) the addition that is slow; it is the element-access overhead. See for example:
def np_array_assignment():
    np_array = np.array([1, 2])
    for x in xrange(NUM_ITERATIONS):
        np_array[0] = 1
        np_array[1] = 1

def py_array_assignment():
    py_array = [1, 2]
    for x in xrange(NUM_ITERATIONS):
        py_array[0] = 1
        py_array[1] = 1
timeit np_array_assignment()
10000 loops, best of 3: 178 us per loop
timeit py_array_assignment()
10000 loops, best of 3: 72.5 us per loop
Numpy is fast when operating on whole vectors (matrices) at once; single element-by-element operations like these are slow.
Use numpy functions to avoid looping and operate on the whole array at once, e.g.:
def np_array_addition_good():
    np_array = np.array([1, 2])
    np_array += np.sum(np.arange(NUM_ITERATIONS))
The results comparing your functions with the one above are pretty revealing:
timeit np_array_addition()
1000 loops, best of 3: 1.32 ms per loop
timeit py_array_addition()
10000 loops, best of 3: 101 us per loop
timeit np_array_addition_good()
100000 loops, best of 3: 11 us per loop
But actually, you can do just as well with pure Python if you collapse the loops:
def py_array_addition_good():
    py_array = [1, 2]
    rangesum = sum(range(NUM_ITERATIONS))
    py_array = [x + rangesum for x in py_array]
timeit py_array_addition_good()
100000 loops, best of 3: 11 us per loop
All in all, with such simple operations there is really no improvement in using numpy. Optimized code in pure Python works just as well.
There have been a lot of questions about this, and I suggest looking at some of the good answers there:
How do I maximize efficiency with numpy arrays?
numpy float: 10x slower than builtin in arithmetic operations?

You're not actually using numpy's vectorized array addition if you do the loop in Python; there's also the access overhead mentioned by @shashkello.
I took the liberty of increasing the array size a tad, and also adding a vectorized version of the addition:
import numpy as np
from timeit import timeit
NUM_ITERATIONS = 1000
def np_array_addition():
    np_array = np.array(xrange(1000))
    for x in xrange(NUM_ITERATIONS):
        for i in xrange(len(np_array)):
            np_array[i] += x

def np_array_addition2():
    np_array = np.array(xrange(1000))
    for x in xrange(NUM_ITERATIONS):
        np_array += x

def py_array_addition():
    py_array = range(1000)
    for x in xrange(NUM_ITERATIONS):
        for i in xrange(len(py_array)):
            py_array[i] += x
print timeit(np_array_addition, number=3) # 4.216162
print timeit(np_array_addition2, number=3) # 0.117681
print timeit(py_array_addition, number=3) # 0.439957
As you can see, the vectorized numpy version wins pretty handily. The gap will just get larger as array sizes and/or iterations increase.

Related

Dask Distributed: Reducing Multiple Dimensions into a Distance Matrix

I want to calculate a large distance matrix based on higher-dimensional vectors. For instance, I have 1000 instances, each represented by 20 vectors of length 10. The distance between two instances is given by the mean distance between the 20 vectors associated with each instance. So I want to go from a 1000 by 20 by 10 array to a 1000 by 1000 (lower-triangular) matrix. Because these calculations can get slow, I want to use Dask distributed to block the algorithm and spread it over several CPUs. Below is how far I've gotten:
Preamble
import itertools
import random
import numpy as np
import dask.array
from dask.distributed import Client
The distance function is defined by
def distance(u, v):
    result = np.empty([int((len(u)*(len(u)+1))/2)], dtype=float)
    for i, j in itertools.product(range(len(u)), range(len(v))):
        if j <= i:
            differences = []
            k = int(((i*(i+1))/2 + j - 1) + 1)
            for x, y in itertools.product(u[i], v[j]):
                difference = np.abs(np.array(x) - np.array(y)).sum(axis=1)
                differences.append(difference)
            result[k] = np.mean(differences)
    return result
and returns an array of length n*(n+1)/2 to describe the lower triangular matrix for this block of the distance matrix.
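As a quick sanity check of that layout (my own sketch, not part of the original post): for j <= i the flat index used above reduces to k = i*(i+1)//2 + j, which enumerates exactly n*(n+1)/2 slots without gaps:
n = 5
seen = []
for i in range(n):
    for j in range(i + 1):                       # lower triangle, j <= i
        seen.append(i * (i + 1) // 2 + j)
assert seen == list(range(n * (n + 1) // 2))     # consecutive indices, no gaps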
def distance_matrix(X):
    X = np.asarray(X, dtype=object)
    X = dask.array.from_array(X, (100, 20, 10)).astype(float)
    print("chunksize: ", X.chunksize)
    resulting_length = [int((X.chunksize[0]*(X.chunksize[0])+1)/2)]
    result = dask.array.map_blocks(distance, X, X, chunks=(resulting_length), drop_axis=[1,2], dtype=float)
    return result.compute()
I split up the input array in chunks and use dask.array.map_blocks to apply the distance calculation to all the blocks.
if __name__ == '__main__':
    workers = 6
    X = np.array([[[random.random() for _ in range(10)] for _ in range(20)] for _ in range(1000)])
    client = Client(n_workers=workers)
    results = distance_matrix(X)
    client.close()
    print(results)
Unfortunately, this approach returns the wrong length of array at the end of the process. Would somebody help me out here? I don't have much experience in distributed computing.
I'm a big fan of dask, but this problem is way too small to need it. The runtime issue you're seeing is because you are looping through each element in python rather than using vectorized operations in numpy.
As with many packages in Python, numpy relies on highly efficient compiled code written in other, faster languages such as C to carry out array operations. When you do something like an array operation A + B, numpy calls these fast routines once, and the array operation is carried out within a highly optimized C routine. There is overhead in calling into another language, but it is overwhelmed by the performance gain from the single call to a very fast routine. If instead you loop over every element, adding cell-wise, you have a (slow) Python loop that calls into the C code once per element, paying that overhead for every element of the array. Because of this, you would actually be better off not using numpy at all if you are going to operate one element at a time.
To implement this in a vectorized manner, you can exploit numpy's broadcasting rules so that the first dimensions of your two arrays expand against a new axis. I don't totally understand what's going on in your distance function, but you could extend this simple version to do whatever you want:
In [1]: import numpy as np
In [2]: A = np.random.random((1000, 20))
...: B = np.random.random((1000, 20))
In [3]: distance = np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
In [4]: distance
Out[4]:
array([[7.22985776, 7.76185666, 5.61824886, ..., 7.62092039, 6.35189562,
7.06365986],
[5.73359499, 5.8422105 , 7.2644021 , ..., 5.72230353, 6.79390303,
5.03074007],
[7.27871151, 8.6856818 , 5.97489449, ..., 8.86620029, 7.49875638,
6.57389575],
...,
[7.67783107, 7.24419076, 4.17941596, ..., 8.68674754, 6.65078093,
5.67279811],
[7.1550136 , 6.10590227, 5.75417987, ..., 7.05953998, 5.8306628 ,
6.55112672],
[5.81748615, 6.79246838, 6.95053088, ..., 7.63994705, 6.77720511,
7.5663236 ]])
In [5]: distance.shape
Out[5]: (1000, 1000)
The performance difference can be seen clearly against a looped implementation:
In [6]: %%timeit
...: np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
...:
...:
45 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %%timeit
...: distances = np.empty((1000, 1000))
...: for i in range(1000):
...: for j in range(1000):
...: distances[i, j] = np.abs(A[i, :] - B[j, :]).sum()
...:
2.42 s ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The looped version takes more than 50x as long!
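To connect this back to the (1000, 20, 10) case, here is a sketch of one possible reading of the distance in the question, namely the mean L1 distance over all 20 x 20 vector pairs of two instances. The per-instance definition and the block size are my assumptions; the blocking simply keeps the broadcast intermediate from growing too large in memory.
import numpy as np

def mean_l1_distance_matrix(X, block=10):
    # X has shape (n, n_vec, n_feat); the result is an (n, n) symmetric matrix.
    # Assumed definition: d(a, b) = mean over all vector pairs of |x_a - x_b|_1.
    n = X.shape[0]
    out = np.empty((n, n))
    for start in range(0, n, block):
        stop = min(start + block, n)
        # broadcast intermediate: (block, n, n_vec, n_vec, n_feat)
        diff = np.abs(X[start:stop, None, :, None, :] - X[None, :, None, :, :])
        out[start:stop] = diff.sum(axis=-1).mean(axis=(-2, -1))
    return out

X = np.random.random((200, 20, 10))   # smaller n than the question, for a quick check
D = mean_l1_distance_matrix(X)        # shape (200, 200), symmetric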

Fastest way to compute distance beetween each points in python

In my project I need to compute the Euclidean distance between all points stored in an array.
The input array is a 2D numpy array with 3 columns, which are the coordinates (x, y, z), and each row defines a new point.
I usually work with 5000 - 6000 points in my test cases.
My first algorithm uses Cython and my second uses numpy. I find that my numpy algorithm is faster than the Cython one.
edit: with 6000 points :
numpy 1.76 s / cython 4.36 s
Here's my cython code:
cimport cython
from libc.math cimport sqrt
@cython.boundscheck(False)
@cython.wraparound(False)
cdef void calcul1(double[::1] M, double[::1] R):
    cdef int i = 0
    cdef int max = M.shape[0]
    cdef int x, y
    cdef int start = 1
    for x in range(0, max, 3):
        for y in range(start, max, 3):
            R[i] = sqrt((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2 + (M[y+2] - M[x+2])**2)
            i += 1
        start += 1
M is a memoryview of the initial input array, flattened with numpy's flatten() before the call to calcul1(); R is a memoryview of a 1D output array that stores all the results.
Here's my Numpy code :
def calcul2(M):
    return np.sqrt(((M[:,:,np.newaxis] - M[:,np.newaxis,:])**2).sum(axis=0))
Here M is the initial input array, transposed with numpy before the function call so that the coordinates (x, y, z) are the rows and the points are the columns.
Moreover, this numpy function is quite convenient because the array it returns is well organized: it is an n by n array, with n the number of points, and each point has a row and a column. So, for example, the distance AB is stored at the intersection of row A and column B.
Here's how I call them (cython function):
cpdef test():
    cdef double[::1] Mf
    cdef double[::1] out = np.empty(17998000, dtype=np.float64)  # (6000² - 6000) / 2
    M = np.arange(6000*3, dtype=np.float64).reshape(6000, 3)  # Example array with 6000 points
    Mf = M.flatten()    # because my cython algorithm need a 1D array
    Mt = M.transpose()  # because my numpy algorithm need coordinates as rows
    calcul2(Mt)
    calcul1(Mf, out)
Am I doing something wrong here? For my project both are not fast enough.
1: Is there a way to improve my Cython code in order to beat numpy's speed?
2: Is there a way to improve my numpy code to compute even faster?
3: Or any other solution, but it must be Python/Cython (like parallel computing)?
Thank you.
Not sure where you are getting your timings, but you can use scipy.spatial.distance (imported below as sd):
import numpy as np
import scipy.spatial.distance as sd

M = np.arange(6000*3, dtype=np.float64).reshape(6000,3)
np_result = calcul2(M)
sp_result = sd.cdist(M.T, M.T)  # Scipy usage
np.allclose(np_result, sp_result)
>>> True
Timings:
%timeit calcul2(M)
1000 loops, best of 3: 313 µs per loop
%timeit sd.cdist(M.T, M.T)
10000 loops, best of 3: 86.4 µs per loop
Importantly, it's also useful to realize that your output is symmetric:
np.allclose(sp_result, sp_result.T)
>>> True
An alternative is to compute only the upper triangle of this array:
%timeit sd.pdist(M.T)
10000 loops, best of 3: 39.1 µs per loop
Edit: Not sure which index you want to zip, looks like you may be doing it both ways? Zipping the other index for comparison:
%timeit sd.pdist(M)
10 loops, best of 3: 135 ms per loop
Still about 10x faster than your current NumPy implementation.
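If the square n-by-n layout described in the question is wanted afterwards, the condensed output of pdist can be expanded with scipy.spatial.distance.squareform (a sketch of mine, assuming the points-as-rows orientation used by sd.pdist(M) above; pdist defaults to the Euclidean metric):
import numpy as np
import scipy.spatial.distance as sd

M = np.arange(6000 * 3, dtype=np.float64).reshape(6000, 3)

condensed = sd.pdist(M)            # length 6000*5999/2: each pair stored once
square = sd.squareform(condensed)  # (6000, 6000), symmetric, zero diagonal
# square[a, b] is the Euclidean distance between point a and point b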

quickly summing numpy arrays element-wise

Let's say I want to do an element-wise sum of a list of numpy arrays:
tosum = [rand(100,100) for n in range(10)]
I've been looking for the best way to do this. It seems like numpy.sum is awful:
timeit.timeit('sum(array(tosum), axis=0)',
setup='from numpy import sum; from __main__ import tosum, array',
number=10000)
75.02289700508118
timeit.timeit('sum(tosum, axis=0)',
setup='from numpy import sum; from __main__ import tosum',
number=10000)
78.99106407165527
Reduce is much faster (to the tune of nearly two orders of magnitude):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=10000)
1.131795883178711
It looks like reduce even has a meaningful lead over the non-numpy sum (note that these are for 1e6 runs rather than 1e4 for the above times):
timeit.timeit('reduce(add,tosum)',
setup='from numpy import add; from __main__ import tosum',
number=1000000)
109.98814797401428
timeit.timeit('sum(tosum)',
setup='from __main__ import tosum',
number=1000000)
125.52461504936218
Are there other methods I should try? Can anyone explain the rankings?
Edit
numpy.sum is definitely faster if the list is turned into a numpy array first:
tosum2 = array(tosum)
timeit.timeit('sum(tosum2, axis=0)',
setup='from numpy import sum; from __main__ import tosum2',
number=10000)
1.1545608043670654
However, I'm only interested in doing a sum once, so converting the list into a numpy array first would still incur a real performance penalty.
The following is competitive with reduce, and is faster if the tosum list is long enough. However, it's not a lot faster, and it is more code. (reduce(add, tosum) sure is pretty.)
def loop_inplace_sum(arrlist):
    # assumes len(arrlist) > 0
    sum = arrlist[0].copy()
    for a in arrlist[1:]:
        sum += a
    return sum
Timing for the original tosum. reduce(add, tosum) is faster:
In [128]: tosum = [rand(100,100) for n in range(10)]
In [129]: %timeit reduce(add, tosum)
10000 loops, best of 3: 73.5 µs per loop
In [130]: %timeit loop_inplace_sum(tosum)
10000 loops, best of 3: 78 µs per loop
Timing for a much longer list of arrays. Now loop_inplace_sum is faster.
In [131]: tosum = [rand(100,100) for n in range(500)]
In [132]: %timeit reduce(add, tosum)
100 loops, best of 3: 5.09 ms per loop
In [133]: %timeit loop_inplace_sum(tosum)
100 loops, best of 3: 4.4 ms per loop
Numpy sum is not awful; you are simply using numpy in the wrong way. You won't be able to make use of numpy's speed advantage if you mix normal Python functions (including reduce!), loops, and lists with numpy arrays. If you want your code to be fast, you must only use numpy.
Since you did not specify any imports in your code snippet, I am not sure what the function rand is doing or where it comes from, so I just assumed that tosum should represent a list of 10 matrices of some random numbers. The following code snippet shows that numpy is definitely not as slow as you claim it to be:
import numpy as np
import timeit
def test_np_sum(n=10):
    # n is the number of matrices to sum up element-wise
    tosum = np.random.randint(0, 100, size=(n, 10, 10))  # n 10x10 matrices, shape = (n, 10, 10)
    summed = np.sum(tosum, axis=0)  # shape = (10, 10)
And then testing it:
timeit.timeit('test_np_sum()', number=10000, setup='from __main__ import test_np_sum')
0.8418250999999941
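For the setup in the question, where the ten 100x100 arrays already live in a Python list, the same idea applies after a single conversion (a sketch of mine; np.stack makes one copy of the data and the reduction is then a single numpy call):
import numpy as np
from numpy.random import rand

tosum = [rand(100, 100) for _ in range(10)]

stacked = np.stack(tosum)        # shape (10, 100, 100); one copy of the list's data
total = np.sum(stacked, axis=0)  # element-wise sum, shape (100, 100)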

Fastest pairwise distance metric in python

I have an 1D array of numbers, and want to calculate all pairwise euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.
Here's an example that gives me what I want with an array of 1000 numbers.
import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.
Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
Neither of the other answers quite answered the question - one was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.
Here's some code:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
    dists = np.abs(r - r[:, None])

def option2(r):
    dists = scipy.spatial.distance.pdist(r, 'cityblock')

def option3(r):
    dists = sklearn.metrics.pairwise.manhattan_distances(r)
Timing with IPython:
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
Here is a Cython implementation that gives more than a 3x speed improvement for this example on my computer. This timing should be reviewed for bigger arrays though, because the BLAS routines can probably scale much better than this rather naive code.
I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:
File my_cython.pyx:
import numpy as np
cimport numpy as np
import cython

cdef extern from "math.h":
    double abs(double t)

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
    cdef int i, j, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    size = sum(range(1, r.shape[0]+1))
    ans = np.empty(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i, r.shape[0]):
            c += 1
            ans[c] = abs(r[i] - r[j])
    return ans
The answer is a 1-D array containing all non-repeated evaluations.
To import into Python:
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
    return np.abs(r - r[:, None])
Timing with IPython:
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
Using half the memory, but 6 times slower than np.abs(r - r[:, None]):
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])
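A quick check (my own sketch) that this condensed result lines up with the full broadcast matrix from the question:
import numpy as np
import random

r = np.array([random.randrange(1, 1000) for _ in range(1000)])

full = np.abs(r - r[:, None])
triu = np.triu_indices(r.shape[0], 1)       # indices of the strict upper triangle
dists2 = abs(r[triu[1]] - r[triu[0]])
assert np.allclose(full[triu], dists2)      # same distances, each stored once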

Is there a way to efficiently invert an array of matrices with numpy?

Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np

A = np.random.rand(1000,3,3)

def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv

def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv

def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])

from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv

# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
    b = np.identity(A.shape[2], dtype=A.dtype)
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    pivots = np.zeros(n_eq, np.intc)
    identity = np.eye(n_eq)

    def lapack_inverse(a):
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise np.linalg.LinAlgError('Singular matrix')
        return b

    return np.array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
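One caveat worth adding (my note, not part of the original answer): dgesv also overwrites its a argument in place with the LU factors, so if the original matrices must be preserved, copy them before the call. A sketch for a single n x n float64 matrix, mirroring the call signature used above:
import numpy as np
from numpy.linalg import lapack_lite

def lapack_inverse_preserving(a):
    # dgesv factorizes 'a' in place and writes the solution into 'b',
    # so both are copied/created fresh here.
    n = a.shape[0]
    a = np.copy(a)
    b = np.eye(n, dtype=a.dtype)
    pivots = np.zeros(n, np.intc)
    results = lapack_lite.dgesv(n, n, a, n, pivots, b, n, 0)
    if results['info'] > 0:
        raise np.linalg.LinAlgError('Singular matrix')
    return b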
A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with the matrix indices last (in other words, arrays of shape (..., N, N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly, this is by far the best option in terms of performance:
import numpy as np

A = np.random.rand(3,3,1000)

def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv

def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, and then the result has to be permuted back to shape (N,N,M).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
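The same pair of transposes can also be spelled with np.moveaxis, which some find easier to read (a sketch of mine; it behaves identically to direct_inverse above):
import numpy as np

A = np.random.rand(3, 3, 1000)

def direct_inverse_moveaxis(A):
    # Move the stack axis to the front, invert the (N, N) matrices, move it back.
    return np.moveaxis(np.linalg.inv(np.moveaxis(A, -1, 0)), 0, -1)

# e.g. np.allclose(direct_inverse_moveaxis(A),
#                  np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)) -> True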
Numpy BLAS calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, or dot products of small 3x3 matrices or similar cases, the numpy-MKL build I use can often be outperformed by quite a margin.
These external BLAS routines are usually made for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at, e.g., Intel IPP.
Please also keep in mind that numpy uses C-ordered arrays by default (the last dimension changes fastest).
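A quick way to see what "last dimension changes fastest" means in practice (my own sketch):
import numpy as np

a = np.random.rand(1000, 3, 3)
print(a.flags['C_CONTIGUOUS'])   # True: C order is the numpy default
print(a.strides)                 # (72, 24, 8): 8-byte floats, last axis contiguous
# Each a[i] is therefore a contiguous 3x3 block, which suits an element-wise
# kernel like the Numba routine below.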
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time

@nb.njit(fastmath=True)
def inversion(m):
    minv = np.empty(m.shape, dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
    return minv

# I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP = m.reshape(m.shape[0], 9)
    minv = inversion(m_TMP)
    return minv.reshape(minv.shape[0], 3, 3)

# Testing
A = np.random.rand(1000000,3,3)

# Warmup so the compilation overhead of the first call is not measured.
# You may also use @nb.njit(fastmath=True, cache=True), but this still has
# about 0.2 s of overhead on the first call.
Ainv = inversion_3x3(A)

t1 = time.time()
Ainv = inversion_3x3(A)
print(time.time() - t1)

t1 = time.time()
Ainv2 = np.linalg.inv(A)
print(time.time() - t1)

print(np.allclose(Ainv2, Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
For loops are indeed not necessarily much slower than the alternatives, and in this case avoiding them will not help you much. But here is a suggestion:
import numpy as np

A = np.random.rand(100,3,3)  # this makes it possible to index the matrices as A[i]

Ainv = np.array(list(map(np.linalg.inv, A)))  # list() needed on Python 3
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since it exploits the fact that numpy arrays are C-ordered by default: with the stack axis first, each A[i] is a contiguous block, so a linear readout of the data is ever so slightly faster this way.
