Slower time series simulation with Numba

I would like to use the @njit decorator from Numba on this code which, given matrices A, B, C, D, produces a sample from the state-space model
x_n = A@x_{n-1} + B@v_n
y_n = C@x_n + D@v_n
import numpy as np
from numba import njit
@njit
def generate_Y_state_space(N, A, B, C, D):
    """
    Simulate M-dimensional time series given state space model defined by A,B,C,D.
    """
    M = A.shape[0]
    v = np.random.normal(0,1/np.sqrt(2),(M,N)) + 1j*np.random.normal(0,1/np.sqrt(2),(M,N)) # complex Gaussian random variable
    x = np.zeros((M,N),dtype='c16') # 'c16' is the numba type for complex128
    y = np.zeros((M,N),dtype='c16')
    #initialization
    x[:,0] = v[:,0]
    y[:,0] = C@x[:,0] + D@v[:,0]
    for i in range(1,N):
        x[:,i] = A@x[:,i-1] + B@v[:,i]
        y[:,i] = C@x[:,i] + D@v[:,i]
    return y
However, without the njit decorator, I get the following performance (N=1000, M=100)
%timeit generate_Y_state_space(N, A, B, C, D)
27.9 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
while with the njit decorator, the performance has not really improved:
%timeit generate_Y_state_space(N, A, B, C, D)
24.1 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I wonder if the Numba implementation of the matrix multiplication is in fact no better than the Numpy one... Do you have any idea how I could improve this code?
Edit: I think that Numba should be able to provide a nice performance improvement not on the matrix multiplication (as Numpy is already pretty fast, as pointed out), but on the for loop (which is necessary here, since the whole point of a time series is to generate a new data point as a transformation of the previous one).

One possible reason why Numba gives little or no gain here is that you need at least fastmath=True in the @njit decorator to be as fast as Numpy, which uses it internally.
Another reason is that the @njit decorator compiles the function at run time, on the first call, which is slow (it often takes much more than 28 ms). You should be careful not to include this compilation time in the benchmark. You can specify the types in the decorator so that Numba compiles the function when it is defined, before the first call. Here is an example:
@njit('c16[:,::1](int64, c16[:,::1], c16[:,::1], c16[:,::1], c16[:,::1])')
Moreover, you do not need to zero-initialize the arrays: x and y can be left uninitialized, since every element is written before it is read.
Finally, you can speed up the computation using parallelism. This is not straightforward here, as there is a temporal dependency on x[:,i]. However, B@v[:,i] and D@v[:,i] do not depend on previous steps and can be computed in parallel. Thus, you can use the parameter parallel=True and prange rather than range for that part.
Here is an (untested) example:
from numba import njit, prange
@njit('c16[:,::1](int64, c16[:,::1], c16[:,::1], c16[:,::1], c16[:,::1])', fastmath=True, parallel=True)
def generate_Y_state_space(N, A, B, C, D):
    """
    Simulate M-dimensional time series given state space model defined by A,B,C,D.
    """
    M = A.shape[0]
    v = np.random.normal(0,1/np.sqrt(2),(M,N)) + 1j*np.random.normal(0,1/np.sqrt(2),(M,N)) # complex Gaussian random variable
    x = np.empty((M,N),dtype='c16') # 'c16' is the numba type for complex128
    y = np.empty((M,N),dtype='c16')
    #initialization
    x[:,0] = v[:,0]
    y[:,0] = C@x[:,0] + D@v[:,0]
    # noise terms: independent across time steps, so they can run in parallel
    for i in prange(1,N):
        x[:,i] = B@v[:,i]
        y[:,i] = D@v[:,i]
    # recurrence: each step depends on the previous one, so it stays sequential
    for i in range(1,N):
        x[:,i] += A@x[:,i-1]
        y[:,i] += C@x[:,i]
    return y
Parallelism will not necessarily make the code faster, but it is worth a try on a desktop machine.
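The observation that B@v[:,i] and D@v[:,i] do not depend on earlier time steps can also be exploited without Numba. Here is a minimal NumPy sketch, assuming the noise v is passed in as an (M, N) array; the function name simulate_state_space and its signature are illustrative and not part of the original code. The noise terms are computed with one matrix product each, and only the recurrence on x remains in a Python loop.

```python
import numpy as np

def simulate_state_space(A, B, C, D, v):
    """Same model as above, with the noise terms hoisted out of the loop."""
    M, N = v.shape
    Bv = B @ v                         # every B @ v[:, i] in one matrix product
    x = np.empty((M, N), dtype=np.complex128)
    x[:, 0] = v[:, 0]
    for i in range(1, N):              # the recurrence itself must stay sequential
        x[:, i] = A @ x[:, i - 1] + Bv[:, i]
    return C @ x + D @ v               # y for all time steps at once
```

Whether this beats the Numba version depends on M and N, so it is worth timing both on your own sizes.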

Related

Dask Distributed: Reducing Multiple Dimensions into a Distance Matrix

I want to calculate a large distance matrix, based on a higher dimensional vector. For instance, I have 1000 instances, each represented by 20 vectors of length 10. The distance between two instances is given by the mean distance between each of the 20 vectors associated with one instance and each of the 20 vectors associated with the other. So I want to go from a 1000 by 20 by 10 matrix to a 1000 by 1000 (lower-triangular) matrix. Because these calculations can get slow, I want to use Dask distributed to block the algorithm and spread it over several CPUs. Below is how far I've gotten:
Preamble
import itertools
import random
import numpy as np
import dask.array
from dask.distributed import Client
The distance function is defined by
def distance(u, v):
    result = np.empty([int((len(u)*(len(u)+1))/2)], dtype=float)
    for i, j in itertools.product(range(len(u)),range(len(v))):
        if j <= i:
            differences = []
            k = int(((i*(i+1))/2 +j-1)+1)
            for x,y in itertools.product(u[i], v[j]):
                difference = np.abs(np.array(x) - np.array(y)).sum(axis=1)
                differences.append(difference)
            result[k] = np.mean(differences)
    return result
and returns an array of length n*(n+1)/2 to describe the lower triangular matrix for this block of the distance matrix.
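The index expression in the function above simplifies to k = i*(i+1)/2 + j. The following small sketch (packed_index is a hypothetical helper name, not from the question) checks that this formula enumerates the lower triangle row by row and yields exactly n*(n+1)/2 positions.

```python
import numpy as np

def packed_index(i, j):
    """Position of entry (i, j), with j <= i, in a row-by-row packed lower triangle."""
    return i * (i + 1) // 2 + j

n = 4
rows, cols = np.tril_indices(n)   # lower-triangle entries in row-major order
k = packed_index(rows, cols)
print(k)                          # [0 1 2 3 4 5 6 7 8 9]
```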
def distance_matrix(X):
    X = np.asarray(X, dtype=object)
    X = dask.array.from_array(X, (100, 20, 10)).astype(float)
    print("chunksize: ", X.chunksize)
    resulting_length = [int((X.chunksize[0]*(X.chunksize[0])+1)/2)]
    result = dask.array.map_blocks(distance, X, X, chunks=(resulting_length), drop_axis=[1,2], dtype=float)
    return result.compute()
I split up the input array in chunks and use dask.array.map_blocks to apply the distance calculation to all the blocks.
if __name__ == '__main__':
    workers = 6
    X = np.array([[[random.random() for _ in range(10)] for _ in range(20)] for _ in range(1000)])
    client = Client(n_workers=workers)
    results = distance_matrix(X)
    client.close()
    print(results)
Unfortunately, this approach returns an array of the wrong length at the end of the process. Would somebody be able to help me out here? I don't have much experience in distributed computing.

I'm a big fan of dask, but this problem is way too small to need it. The runtime issue you're seeing is because you are looping through each element in python rather than using vectorized operations in numpy.
As with many packages in python, numpy relies on highly efficient compiled code written in other, faster languages such as C to carry out array operations. When you do something like an array operation A + B numpy calls these fast routines once, and the array operations are carried out within a highly optimized C routine. There is overhead involved with making calls to other languages, but this is overwhelmed by the performance gain due to the single call to a very fast routine. If instead you loop over every element, adding cell-wise, you have a (slow) python process, and on each element, this calls the C code, which adds overhead for each element of the array. Because of this, you actually would be better off not using numpy if you're going to do this once for each element.
To implement this in a vectorized manner, you can exploit numpy's broadcasting rules to ensure the first dimensions of your two arrays expand to a new dimension. I don't totally understand what's going on in your distance function, but you could extend this simple version to do whatever you want:
In [1]: import numpy as np
In [2]: A = np.random.random((1000, 20))
...: B = np.random.random((1000, 20))
In [3]: distance = np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
In [4]: distance
Out[4]:
array([[7.22985776, 7.76185666, 5.61824886, ..., 7.62092039, 6.35189562,
7.06365986],
[5.73359499, 5.8422105 , 7.2644021 , ..., 5.72230353, 6.79390303,
5.03074007],
[7.27871151, 8.6856818 , 5.97489449, ..., 8.86620029, 7.49875638,
6.57389575],
...,
[7.67783107, 7.24419076, 4.17941596, ..., 8.68674754, 6.65078093,
5.67279811],
[7.1550136 , 6.10590227, 5.75417987, ..., 7.05953998, 5.8306628 ,
6.55112672],
[5.81748615, 6.79246838, 6.95053088, ..., 7.63994705, 6.77720511,
7.5663236 ]])
In [5]: distance.shape
Out[5]: (1000, 1000)
The performance difference can be seen clearly against a looped implementation:
In [6]: %%timeit
   ...: np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)
   ...:
45 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %%timeit
   ...: distances = np.empty((1000, 1000))
   ...: for i in range(1000):
   ...:     for j in range(1000):
   ...:         distances[i, j] = np.abs(A[i, :] - B[j, :]).sum()
   ...:
2.42 s ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The looped version takes more than 50x as long!
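To extend the simple version to the shape in the question, here is a hedged sketch. It assumes (this is only one reading of the question) that the distance between two instances is the mean, over all pairs of their vectors, of the L1 distance between the two vectors. The name mean_pairwise_l1 is illustrative. It loops over one axis only, so that the temporary stays at n x 20 x 20 x 10 elements instead of n x n x 20 x 20 x 10.

```python
import numpy as np

def mean_pairwise_l1(X):
    """X has shape (n, m, d): n instances, each with m vectors of length d."""
    n = X.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        # diff[j, p, q, k] = |X[i, p, k] - X[j, q, k]|
        diff = np.abs(X[i][None, :, None, :] - X[:, None, :, :])
        # L1 distance per vector pair, then mean over the m * m pairs
        out[i] = diff.sum(axis=-1).mean(axis=(1, 2))
    return out

X = np.random.random((50, 20, 10))
D = mean_pairwise_l1(X)   # shape (50, 50), symmetric
```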

Applying a mapping function to each member of an ndarray with indices as arguments

I have an ndarray representing an RGB image with the shape of (width,height,3) and I wish to replace each value with the result of some function of itself, its position and the color channel it belongs to. Doing so in three nested for loops is extremely slow, is there a way to express this as a native array operation ?
Edit: looking for an in-place solution - one that does not involve creating another O(width x height) ndarray (unless numpy has some magic that can prevent such ndarray from actually being allocated)

I'm not sure I understood your question correctly. What I understood is that you want to apply a mapping to each channel of your RGB image based on the corresponding indices. If so, the code below MIGHT help, since no details were given in your question.
import numpy as np
bit_depth = 8
patch_size = 32
def lut_generator(constant_multiplier):
    x = np.arange(2 ** bit_depth)
    y = constant_multiplier * x
    return dict(zip(x, y))
rgb = np.random.randint(0, (2**bit_depth), (patch_size, patch_size, 3))
# Considering a simple lookup table without using indices.
lut = lut_generator(5)
# Split the three channels, and then their respective flat indices.
# You can use the indices wherever you need them.
r, g, b = np.dsplit(rgb, rgb.shape[-1])
indexes = np.arange(rgb.size).reshape(rgb.shape)
r_idx, g_idx, b_idx = np.dsplit(indexes, indexes.shape[-1])
# Apply transformation on each channel.
transformed_r = np.vectorize(lut.get)(r)
transformed_g = np.vectorize(lut.get)(g)
transformed_b = np.vectorize(lut.get)(b)
Good luck!
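As an alternative to np.vectorize(lut.get), which calls a Python function once per pixel, the lookup table can be stored as an array and applied with integer indexing. This is a sketch of a different technique than the one above, using the same constant multiplier of 5 as the example; it assumes the pixel values are non-negative integers below 2**bit_depth.

```python
import numpy as np

bit_depth = 8
rng = np.random.default_rng(0)
rgb = rng.integers(0, 2 ** bit_depth, size=(32, 32, 3))

# Lookup table as an array: lut[value] is the transformed value.
lut = 5 * np.arange(2 ** bit_depth)

transformed = lut[rgb]             # same shape as rgb, all channels at once
transformed_r = lut[rgb[..., 0]]   # or one channel at a time
```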

Take note of the qualifications in many of the comments: using numpy arithmetic directly will often be easier and faster.
import numpy as np
def test(item, ix0, ix1, ix2):
    # A function with the required signature. This you customise to suit.
    return item*(ix0+ix1+ix2)//202
def make_function_for(arr, f):
    ''' where arr is a 3D numpy array and f is a function taking four arguments.
        item : the item from the array
        ix0 ... ix2 : the three indices
        it returns the required result from these 4 arguments.
    '''
    def user_f(ix0, ix1, ix2):
        # np.fromfunction requires only the three indices as arguments.
        ix0=ix0.astype(np.int32)
        ix1=ix1.astype(np.int32)
        ix2=ix2.astype(np.int32)
        return f(arr[ix0, ix1, ix2], ix0, ix1, ix2)
    # user_f is a function suitable for calling in np.fromfunction
    return user_f
a=np.arange(100*100*3)
a.shape=100,100,3
a[...]=np.fromfunction(make_function_for(a, test), a.shape)
My test function is pretty simple so I can do it in numpy.
Using fromfunction:
%timeit np.fromfunction(make_function_for(a, test), a.shape)
5.7 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy arithmetic:
def alt_func(arr):
    temp=np.add.outer(np.arange(arr.shape[0]), np.arange(arr.shape[1]))
    temp=np.add.outer(temp,np.arange(arr.shape[2]))
    return arr*temp//202
%timeit alt_func(a)
967 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So numpy arithmetic is almost 6 times as fast on my machine for this case.
Edited to correct my seemingly inevitable typos!
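Regarding the in-place requirement from the question's edit: both versions above build at least one temporary as large as the image. A possible compromise, sketched below under the assumption that the array has an integer dtype wide enough to hold the intermediate products, is to loop over rows only and update each row through a view, so that the temporaries are O(width) rather than O(width x height). The name apply_in_place is illustrative, and the formula is the same as in test above.

```python
import numpy as np

def apply_in_place(arr):
    """Replace arr[i, j, k] with arr[i, j, k] * (i + j + k) // 202, row by row."""
    cols = np.arange(arr.shape[1])[:, None]
    chans = np.arange(arr.shape[2])[None, :]
    col_chan = cols + chans            # j + k, shape (arr.shape[1], arr.shape[2])
    for r in range(arr.shape[0]):
        row = arr[r]                   # a view, so the updates write into arr
        row *= col_chan + r            # temporary holds one row only
        row //= 202
    return arr

a = np.arange(100 * 100 * 3, dtype=np.int64).reshape(100, 100, 3)
apply_in_place(a)
```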

Numpy distance calculations of different shaped arrays

Not sure I titled this well, but basically I have a reference coordinate, in the format of (x,y,z), and a large list/array of coordinates also in that format. I need to get the euclidean distance between each, so with numpy and scipy in theory I should be able to do an operation such as:
import numpy, scipy.spatial.distance
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.euclidean(b, a)
But instead of getting an array back I get an error: ValueError: Input vector should be 1-D.
Not sure how to resolve this error and get what I want without having to resort to loops and such, which sort of defeats the purpose of using Numpy.
Long term I want to use these distances to calculate truth masks for counting distance values in bins.
I'm not sure if I'm just using the function wrong or using the wrong function, I haven't been able to find anything in the documentation that would work better.

The documentation of scipy.spatial.distance.euclidean states that only 1-D vectors are allowed as inputs. Thus you must loop over your array like this:
distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
    distances[i] = scipy.spatial.distance.euclidean(a, b[i])
If you want to have a vectorized implementation, you need to write your own function. Perhaps using np.vectorize with a correct signature will also work, but this is in fact also just a short-hand for a for-loop and will thus have the same performance as a simple for-loop.
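For illustration, this is roughly what the np.vectorize route with a signature could look like. To keep the sketch self-contained it uses a small stand-in function, euclidean_1d (a hypothetical name), in place of scipy.spatial.distance.euclidean. As noted above, this is still a Python-level loop under the hood.

```python
import numpy as np

def euclidean_1d(u, v):
    """Euclidean distance between two 1-D vectors (stand-in for a scalar-only routine)."""
    return np.sqrt(((u - v) ** 2).sum())

# '(n),(n)->()' declares: two length-n vectors in, one scalar out.
euclidean_many = np.vectorize(euclidean_1d, signature='(n),(n)->()')

a = np.array([1.0, 1.0, 1.0])
b = np.random.rand(20, 3)
distances = euclidean_many(b, a)   # shape (20,)
```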
As stated in my comment to hannes wittingham's solution, I'll post a one-liner which is focussing on performance:
distances = ((b - a)**2).sum(axis=1)**0.5
Writing out all the calculations reduces the number of separate function calls and thus the number of intermediate results assigned to new arrays. It is therefore about 22% faster than hannes wittingham's solution for b.shape == (20, 3) and about 5% faster for b.shape == (20000, 3):
a = np.array([1, 1, 1,])
b = np.random.rand(20, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 5.37 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit euclidean_distances(a, b)
# 6.89 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
b = np.random.rand(20000, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 588 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distances(a, b)
# 616 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But you are losing the flexibility of being able to easily change the distance calculation routine. When using the scipy.spatial.distance module, you can change the calculation routine by simply calling another method.
To improve the calculation performance even further, you can use a jit (just in time) compiler like numba for your functions:
import numba as nb
@nb.njit
def euc(a, b):
    return ((b - a)**2).sum(axis=1)**0.5
This reduces the time needed to do the calculations by about 70% for small arrays and by about 60% for large arrays. Unfortunately, the axis keyword for np.linalg.norm is not yet supported by numba.

It's not actually too hard to write your own function to do this - here's mine, which you're welcome to use.
If you are carrying out this operation over a large number of points and speed matters, I would guess this function will beat a for-loop based solution for speed by a long way - numpy is designed to be efficient when carrying out operations on a whole matrix.
import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
def euclidean_distances(ref_point, co_ords_array):
    diffs = co_ords_array - ref_point
    sqrd_diffs = numpy.square(diffs)
    sum_sqrd_diffs = numpy.sum(sqrd_diffs, axis = 1)
    euc_dists = numpy.sqrt(sum_sqrd_diffs)
    return euc_dists

This code will get the euclidean norm which should work in many cases, and is fairly quick, and one line. Other methods are more efficient or flexible depending on the needs, and I would favour some of the other solutions posted depending on the work being done.
import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)
distances = numpy.linalg.norm(a - b, axis = 1)

Note the extra set of [] in the definition of a
import numpy, scipy.spatial.distance
a = numpy.array([[1,1,1]])
b = numpy.random.rand(20,3)
distances = scipy.spatial.distance.cdist(b, a, metric='euclidean')
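For the longer-term goal mentioned in the question (counting distance values in bins), here is a small sketch using fixed points so that the numbers are easy to check by hand. The bin edges are made up for the example.

```python
import numpy as np

a = np.array([0.0, 0.0, 0.0])
b = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 6.0, 8.0]])
distances = np.linalg.norm(b - a, axis=1)         # [ 5.,  1.,  1., 10.]

edges = np.array([0.0, 2.0, 6.0, 12.0])           # illustrative bin edges
counts, _ = np.histogram(distances, bins=edges)   # [2, 1, 1]

# Truth mask for a single bin, here [2, 6):
in_second_bin = (distances >= edges[1]) & (distances < edges[2])
```

np.histogram avoids building one mask per bin; the explicit mask is useful when you also need to know which points fall in a given bin.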

Numpy Pure Functions for performance, caching

I'm writing some moderately performance critical code in numpy.
This code will be in the innermost loop of a computation whose run time is measured in hours.
A quick calculation suggests that this code will be executed something like 10^12 times in some variations of the calculation.
So one function is to calculate sigmoid(X) and another to calculate its derivative (gradient).
Sigmoid has the property that for y=sigmoid(x), dy/dx = y(1-y)
In python with numpy this looks like:
sigmoid = np.vectorize(lambda x: 1.0/(1.0+np.exp(-x)))
grad_sigmoid = np.vectorize(lambda x: sigmoid(x)*(1-sigmoid(x)))
As can be seen, both functions are pure (without side effects),
so they are ideal candidates for memoization,
at least for the short term, I have some worries about caching every single call to sigmoid ever made: Storing 10^12 floats which would take several terabytes of RAM.
Is there a good way to optimise this?
Will python pick up that these are pure functions and cache them for me, as appropriate?
Am I worrying over nothing?

These functions already exist in scipy. The sigmoid function is available as scipy.special.expit.
In [36]: from scipy.special import expit
Compare expit to the vectorized sigmoid function:
In [38]: x = np.linspace(-6, 6, 1001)
In [39]: %timeit y = sigmoid(x)
100 loops, best of 3: 2.4 ms per loop
In [40]: %timeit y = expit(x)
10000 loops, best of 3: 20.6 µs per loop
expit is also faster than implementing the formula yourself:
In [41]: %timeit y = 1.0 / (1.0 + np.exp(-x))
10000 loops, best of 3: 27 µs per loop
The CDF of the logistic distribution is the sigmoid function. It is available as the cdf method of scipy.stats.logistic, but cdf eventually calls expit, so there is no point in using that method. You can use the pdf method to compute the derivative of the sigmoid function, or the _pdf method which has less overhead, but "rolling your own" is faster:
In [44]: def sigmoid_grad(x):
   ....:     ex = np.exp(-x)
   ....:     y = ex / (1 + ex)**2
   ....:     return y
Timing (x has length 1001):
In [45]: from scipy.stats import logistic
In [46]: %timeit y = logistic._pdf(x)
10000 loops, best of 3: 73.8 µs per loop
In [47]: %timeit y = sigmoid_grad(x)
10000 loops, best of 3: 29.7 µs per loop
Be careful with your implementation if you are going to use values that are far into the tails. The exponential function can overflow pretty easily. logistic._pdf is a bit more robust than my quick implementation of sigmoid_grad:
In [60]: sigmoid_grad(-500)
/home/warren/anaconda/bin/ipython:3: RuntimeWarning: overflow encountered in double_scalars
import sys
Out[60]: 0.0
In [61]: logistic._pdf(-500)
Out[61]: 7.1245764067412855e-218
An implementation using sech**2 (1/cosh**2) is a bit slower than the above sigmoid_grad:
In [101]: def sigmoid_grad_sech2(x):
   .....:     y = (0.5 / np.cosh(0.5*x))**2
   .....:     return y
   .....:
In [102]: %timeit y = sigmoid_grad_sech2(x)
10000 loops, best of 3: 34 µs per loop
But it handles the tails better:
In [103]: sigmoid_grad_sech2(-500)
Out[103]: 7.1245764067412855e-218
In [104]: sigmoid_grad_sech2(500)
Out[104]: 7.1245764067412855e-218
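Another way to keep the cheap exponential form while avoiding the overflow is to use the fact that the derivative of the sigmoid is an even function, so the exponential can always be evaluated at -|x|. This is a sketch of that idea (sigmoid_grad_stable is an illustrative name), not something benchmarked here.

```python
import numpy as np

def sigmoid_grad_stable(x):
    """Derivative of the sigmoid, evaluated so that exp never overflows."""
    e = np.exp(-np.abs(x))       # always in [0, 1]
    return e / (1.0 + e) ** 2

x = np.linspace(-6, 6, 1001)
y = sigmoid_grad_stable(x)
tail = sigmoid_grad_stable(np.array([-500.0, 500.0]))   # small but nonzero, no warning
```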

Just expanding on my comment, here is a comparison between your sigmoid through vectorize and using numpy directly:
In [1]: x = np.random.normal(size=10000)
In [2]: sigmoid = np.vectorize(lambda x: 1.0 / (1.0 + np.exp(-x)))
In [3]: %timeit sigmoid(x)
10 loops, best of 3: 63.3 ms per loop
In [4]: %timeit 1.0 / (1.0 + np.exp(-x))
1000 loops, best of 3: 250 us per loop
As you can see, not only does vectorize make it much slower, the fact is that you can calculate 10000 sigmoids in 250 microseconds (that is, 25 nanoseconds for each). A single dictionary look-up in Python is slower than that, let alone all the other code to get the memoization in place.
The only way to optimize this that I can think of is writing a sigmoid ufunc for numpy, which basically will implement the operation in C. That way, you won't have to make a separate pass over the entire array (with a temporary result) for each operation in the sigmoid, even though numpy does each of these passes really fast.
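Short of writing a ufunc, the identity from the question, dy/dx = y(1-y), already removes the need for any caching: compute the sigmoid once for the whole array and reuse the result for the gradient. A minimal sketch, with sigmoid_and_grad as an illustrative name:

```python
import numpy as np

def sigmoid_and_grad(x):
    """Return sigmoid(x) and its derivative, calling exp only once."""
    y = 1.0 / (1.0 + np.exp(-x))
    return y, y * (1.0 - y)

x = np.random.normal(size=10000)
y, dy = sigmoid_and_grad(x)
```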

If you are looking to memoize this process, I'd wrap that code in a function and decorate it with functools.lru_cache(maxsize=n). Experiment with the maxsize value to find the appropriate size for your application. For best results, use a maxsize argument that is a power of two.
from functools import lru_cache
@lru_cache(maxsize=8192)
def sigmoids(x):
    sigmoid = np.vectorize(lambda x: 1.0/(1.0+np.exp(-x)))
    grad_sigmoid = np.vectorize(lambda x: sigmoid(x)*(1-sigmoid(x)))
    return sigmoid, grad_sigmoid
If you're on 2.7 (which I expect you are since you're using numpy), you can take a look at https://pypi.python.org/pypi/repoze.lru/ for a memoization library with identical syntax.
You can install it via pip: pip install repoze.lru
from repoze.lru import lru_cache
@lru_cache(maxsize=8192)
def sigmoids(x):
    sigmoid = np.vectorize(lambda x: 1.0/(1.0+np.exp(-x)))
    grad_sigmoid = np.vectorize(lambda x: sigmoid(x)*(1-sigmoid(x)))
    return sigmoid, grad_sigmoid

Mostly I agree with Warren Weckesser and his answer above.
But for derivative of sigmoid the following can be used:
In [002]: def sg(x):
     ...:     s = scipy.special.expit(x)
     ...:     return s * (1.0 - s)
Timings:
In [003]: %timeit y = logistic._pdf(x)
10000 loops, best of 3: 45 µs per loop
In [004]: %timeit y = sg(x)
10000 loops, best of 3: 20.4 µs per loop
The only problem is accuracy:
In [005]: sg(37)
Out[005]: 0.0
In [006]: logistic._pdf(37)
Out[006]: 8.5330476257440658e-17

Is there a way to efficiently invert an array of matrices with numpy?

Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
Ainv[:,:,i] = np.linalg.inv(A[:,:,i])

It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the result array or building it from a list of results doesn't make much difference (fast_inverse2).
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np
A = np.random.rand(1000,3,3)
def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv
def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv
def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])
# lapack_lite is a private numpy module; dgesv may not be exposed in newer releases.
from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
    b = np.identity(A.shape[2], dtype=A.dtype)
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    pivots = np.zeros(n_eq, np.intc)
    identity = np.eye(n_eq)
    def lapack_inverse(a):
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise np.linalg.LinAlgError('Singular matrix')
        return b
    return np.array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
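The "solve against a pre-allocated identity" idea can also be applied to the whole array in one call, assuming a numpy version whose np.linalg.solve accepts stacks of matrices. This is a sketch under that assumption, not a benchmark; the identity is broadcast explicitly to the shape of A so that it is unambiguously treated as a stack of matrices rather than of vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Adding 3*I makes every matrix strictly diagonally dominant, hence invertible.
A = rng.random((1000, 3, 3)) + 3.0 * np.eye(3)

identity = np.broadcast_to(np.eye(3), A.shape)   # read-only view, no per-matrix copy
Ainv = np.linalg.solve(A, identity)              # one call for all 1000 matrices
```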

A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with matrix indices being last (in other words, arrays of shape (...,M,N,N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:
import numpy as np
A = np.random.rand(3,3,1000)
def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv
def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, then the result has to be permuted back to shape (N,N,M).
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Numpy-Blas calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot products of small 3x3 matrices or similar cases, numpy-MKL (which I use) can often be outperformed by quite a margin.
These external BLAS routines are usually made for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at e.g. Intel IPP.
Please also keep in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
@nb.njit(fastmath=True)
def inversion(m):
    minv=np.empty(m.shape,dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
    return minv
#I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP=m.reshape(m.shape[0],9)
    minv=inversion(m_TMP)
    return minv.reshape(minv.shape[0],3,3)
#Testing
A = np.random.rand(1000000,3,3)
#Warmup to not measure compilation overhead on the first call
#You may also use @nb.njit(fastmath=True,cache=True) but this also has about 0.2s
#overhead on the first call
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
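If numba is not available, a closed-form 3x3 inverse can also be written with plain numpy array operations. The sketch below uses a different formulation than the hard-coded version above: with rows r0, r1, r2, the columns of the inverse are r1 x r2, r2 x r0 and r0 x r1, each divided by the determinant r0 . (r1 x r2). The name inv3x3_stack is illustrative, and no timing claim is made for it.

```python
import numpy as np

def inv3x3_stack(A):
    """Invert a stack of 3x3 matrices of shape (M, 3, 3) via cross products."""
    r0, r1, r2 = A[:, 0], A[:, 1], A[:, 2]
    c0 = np.cross(r1, r2)
    c1 = np.cross(r2, r0)
    c2 = np.cross(r0, r1)
    det = np.einsum('ij,ij->i', r0, c0)      # r0 . (r1 x r2) for each matrix
    adj = np.stack([c0, c1, c2], axis=2)     # cross products become the columns
    return adj / det[:, None, None]

A = np.random.rand(1000, 3, 3) + 3.0 * np.eye(3)   # kept well away from singular
Ainv = inv3x3_stack(A)
```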

For loops are indeed not necessarily much slower than the alternatives, and in this case avoiding them will not help you much. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3) #this makes it
                            #possible to index
                            #the matrices as A[i]
Ainv = np.array(list(map(np.linalg.inv, A)))
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change: numpy arrays are stored in row-major (C) order by default, so each A[i] is then one contiguous block, and a linear readout of the data is ever so slightly faster this way.
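The effect of the array ordering can be checked directly. In this small sketch, an array laid out as in the question, with shape (3, 3, M), is converted to the (M, 3, 3) layout, and the contiguity flags of one matrix are compared.

```python
import numpy as np

A = np.random.rand(3, 3, 100)                     # matrices indexed as A[:, :, i]
B = np.ascontiguousarray(np.moveaxis(A, -1, 0))   # shape (100, 3, 3), B[i] == A[:, :, i]

print(A[:, :, 0].flags['C_CONTIGUOUS'])   # False: neighbouring elements are 100 doubles apart
print(B[0].flags['C_CONTIGUOUS'])         # True: nine consecutive doubles
```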
