I have a large array A of shape (n, n, 3, 3), where n is about 5000. I want to find the inverse and transpose of each 3x3 matrix in A:
import numpy as np
A = np.random.rand(1000, 1000, 3, 3)
identity = np.identity(3, dtype=A.dtype)
Ainv = np.zeros_like(A)
Atrans = np.zeros_like(A)
for i in range(1000):
    for j in range(1000):
        Ainv[i, j] = np.linalg.solve(A[i, j], identity)
        Atrans[i, j] = np.transpose(A[i, j])
Is there a faster, more efficient way to do this?
This is taken from a project of mine, where I also do vectorized linear algebra on many 3x3 matrices.
Note that there is only a loop over 3, not a loop over n, so the code is vectorized in the important dimensions. I don't want to vouch for how this compares, performance-wise, to a C/numba extension that does the same thing. A compiled extension is likely to be substantially faster still, but at least this blows the loops over n out of the water.
import numpy as np
def adjoint(A):
    """compute inverse without division by det; ...xv3xc3 input, or array of matrices assumed"""
    AI = np.empty_like(A)
    for i in range(3):
        # negative indices wrap around, so this picks the other two rows cyclically
        AI[..., i, :] = np.cross(A[..., i-2, :], A[..., i-1, :])
    return AI
def inverse_transpose(A):
    """
    efficiently compute the inverse-transpose for stack of 3x3 matrices
    """
    I = adjoint(A)
    # every row-wise dot product equals the determinant; the mean averages the three copies
    det = dot(I, A).mean(axis=-1)
    return I / det[..., None, None]
def inverse(A):
    """inverse of a stack of 3x3 matrices"""
    return np.swapaxes(inverse_transpose(A), -1, -2)
def dot(A, B):
    """dot arrays of vecs; contract over last indices"""
    return np.einsum('...i,...i->...', A, B)
A = np.random.rand(2, 2, 3, 3)
I = inverse(A)
# each product should be close to the 3x3 identity
print(np.einsum('...ij,...jk', A, I))
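As a sanity check on the cross-product approach above, here is a minimal sketch that compares it with np.linalg.inv on a small stack. The compact helper name inverse_3x3, the array sizes, and the added 3*I term (which keeps the random test matrices well conditioned) are my own assumptions for illustration, not part of the original project.

```python
import numpy as np

def inverse_3x3(A):
    """Inverse of a stack of 3x3 matrices via row cross products (cofactors / det)."""
    cof = np.empty_like(A)
    for i in range(3):
        # row i of the cofactor matrix is the cross product of the other two rows
        cof[..., i, :] = np.cross(A[..., i - 2, :], A[..., i - 1, :])
    # det = row 0 of A dotted with row 0 of the cofactor matrix
    det = np.einsum('...i,...i->...', cof[..., 0, :], A[..., 0, :])
    # inverse = transpose(cofactor matrix) / det
    return np.swapaxes(cof / det[..., None, None], -1, -2)

rng = np.random.default_rng(0)
# hypothetical test data: diagonally dominant, hence safely invertible
A = rng.random((4, 5, 3, 3)) + 3.0 * np.eye(3)

Ainv = inverse_3x3(A)
print(np.allclose(Ainv, np.linalg.inv(A)))
```

The only Python-level loop runs over the three rows, so the cost of the loop itself does not grow with the number of matrices in the stack.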
for the transpose:
testing a bit in ipython showed:
In [1]: import numpy
In [2]: x = numpy.ones((5,6,3,4))
In [3]: numpy.transpose(x,(0,1,3,2)).shape
Out[3]: (5, 6, 4, 3)
so you can just do
Atrans = numpy.transpose(A,(0,1,3,2))
to swap axes 2 and 3 (while leaving axes 0 and 1 unchanged)
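One reason this is so cheap is that numpy.transpose returns a view with permuted strides rather than copying any data. A small sketch to illustrate that, using a made-up array (the shape and values are only for the demonstration):

```python
import numpy as np

# hypothetical small stack of 3x3 matrices
A = np.arange(2 * 2 * 3 * 3, dtype=float).reshape(2, 2, 3, 3)

Atrans = np.transpose(A, (0, 1, 3, 2))

# The result shares its buffer with A: no data was copied
print(np.shares_memory(A, Atrans))
# Each 3x3 block is the transpose of the corresponding block of A
print(np.array_equal(Atrans[1, 0], A[1, 0].T))
```

If a contiguous copy is needed later (for example before passing the data to compiled code), np.ascontiguousarray(Atrans) would make one, at the cost of an actual copy.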
for the inversion:
the last example of http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv
Inverses of several matrices can be computed at once:
>>> from numpy.linalg import inv
>>> a = np.array([[[1., 2.], [3., 4.]], [[1, 3], [3, 5]]])
>>> inv(a)
array([[[-2. , 1. ],
[ 1.5, -0.5]],
[[-5. , 2. ],
[ 3. , -1. ]]])
So I guess that in your case, the inversion can be done with just
Ainv = inv(A)
and it will know that the last two dimensions are the ones it is supposed to invert over, and that the leading dimensions are just how you stacked your data. This should be much faster.
speed difference:
for the transpose: your method needs 3.77557015419 sec, and mine needs 2.86102294922e-06 sec (a speedup of over 1 million times)
for the inversion: I could not try the numpy.linalg.inv approach on an (n, n, 3, 3) array, because my NumPy version (1.6.2) is too old and the docs I based my solution on are for 1.8. It should work on 1.8, if someone else can test that?
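For anyone on a NumPy version that supports stacked matrices in numpy.linalg.inv, a minimal sketch of the check could look like this. The small n and the added 3*I term (which keeps the random matrices well conditioned) are assumptions made for the demonstration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # small n for illustration; the question uses a much larger n
# hypothetical test data: adding 3*I makes every 3x3 block diagonally dominant
A = rng.random((n, n, 3, 3)) + 3.0 * np.eye(3)

# Vectorized: both calls act on the last two axes of the stack
Ainv = np.linalg.inv(A)
Atrans = np.swapaxes(A, -1, -2)

# Reference: the double loop from the question
Ainv_loop = np.empty_like(A)
Atrans_loop = np.empty_like(A)
for i in range(n):
    for j in range(n):
        Ainv_loop[i, j] = np.linalg.solve(A[i, j], np.eye(3))
        Atrans_loop[i, j] = A[i, j].T

print(np.allclose(Ainv, Ainv_loop), np.array_equal(Atrans, Atrans_loop))
```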
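NumPy arrays have the .T property, which is a shortcut for transpose. Note that on an array with more than two dimensions, .T reverses the order of all axes, so for the (n, n, 3, 3) case you need np.swapaxes(A, -1, -2) or an explicit axes tuple instead.
For inversion, use np.linalg.inv(A).
As posted by wim, A.I also works on a numpy.matrix object, e.g.
print (A.I)
Alternatively, for a numpy.matrix object, use matrix.getI.
e.g.
A=numpy.matrix('1 3;5 6')
print (A.getI())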
I'm trying to find a way to perform operations on each element across multiple 2D arrays without having to loop over them, or at least without needing two for loops. My code calculates the standard deviation of each pixel over a series of images (arrays). The number of images is not the problem; it is the size of the arrays that makes the code extremely slow. The following is a working example of what I have.
import numpy as np
# reshape(# of image (arrays),# of rows, # of cols)
a = np.arange(32).reshape(2,4,4)
stddev_arr = np.array([])
for i in range(4):
    for j in range(4):
        pixel = a[0:,i,j]
        stddev = np.std(pixel)
        stddev_arr = np.append(stddev_arr, stddev)
My actual data is 2000x2000, making this code loop 4000000 times. Is there a better way to do this?
Any advice is extremely appreciated.
You're already using numpy. numpy's std() function takes an axis argument that tells it what axis you want it to operate on (in this case the zeroth axis). Because this offloads the calculation to numpy's C-backend (and possibly using SIMD optimizations for your processor that vectorize a lot of operations), it's so much faster than iterating. Another time-consuming operation in your code is when you append to stddev_arr. Appending to numpy arrays is slow because the entire array is copied into new memory before the new element is added. Now you already know how big that array needs to be, so you might as well preallocate it.
a = np.arange(32).reshape(2, 4, 4)
stdev = np.std(a, axis=0)
This gives a 4x4 array
array([[8., 8., 8., 8.],
[8., 8., 8., 8.],
[8., 8., 8., 8.],
[8., 8., 8., 8.]])
To flatten this into a 1D array, do flat_stdev = stdev.flatten().
Comparing the execution times:
# Using only numpy
def fun1(arr):
    return np.std(arr, axis=0).flatten()
# Your function
def fun2(arr):
    stddev_arr = np.array([])
    for i in range(arr.shape[1]):
        for j in range(arr.shape[2]):
            pixel = arr[0:,i,j]
            stddev = np.std(pixel)
            stddev_arr = np.append(stddev_arr, stddev)
    return stddev_arr
# Your function, but pre-allocating stddev_arr
def fun3(arr):
    stddev_arr = np.zeros((arr.shape[1] * arr.shape[2],))
    x = 0
    for i in range(arr.shape[1]):
        for j in range(arr.shape[2]):
            pixel = arr[0:,i,j]
            stddev = np.std(pixel)
            stddev_arr[x] = stddev
            x += 1
    return stddev_arr
First, let's make sure all these functions are equivalent:
a = np.random.random((3, 10, 10))
assert np.all(fun1(a) == fun2(a))
assert np.all(fun1(a) == fun3(a))
Yup, all give the same result. Now, let's try with a bigger array.
import timeit
a = np.random.random((3, 100, 100))
x = timeit.timeit('fun1(a)', setup='from __main__ import fun1, a', number=10)
# x: 0.003302899989648722
y = timeit.timeit('fun2(a)', setup='from __main__ import fun2, a', number=10)
# y: 5.495519500007504
z = timeit.timeit('fun3(a)', setup='from __main__ import fun3, a', number=10)
# z: 3.6250679999939166
Wow! We get a ~1.5x speedup just by preallocating.
Even more wow: using numpy's std() with the axis argument gives a > 1000x speedup, and this is just for the 100x100 array! With bigger arrays, you can expect to see even bigger speedup.
So based on what you have provided, you can reshape your array in another way to vectorize it to replace your two loops. Then you only have to use np.std once on the axis that you want.
a = np.arange(32).reshape(2, 4, 4)
a = a.reshape(2, -1).transpose()
stddev_arr = np.std(a, axis=1)
I am trying to construct a stack of block diagonal matrix in the form of nXMXM in numpy/scipy from a given stacks of matrices (nXmXm), where M=k*m with k the number of stacks of matrices. At the moment, I'm using the scipy.linalg.block_diag function in a for loop to perform this task:
import numpy as np
import scipy.linalg as linalg
a = np.ones((5,2,2))
b = np.ones((5,2,2))
c = np.ones((5,2,2))
result = np.zeros((5,6,6))
for k in range(0,5):
    result[k,:,:] = linalg.block_diag(a[k,:,:],b[k,:,:],c[k,:,:])
However, since n is getting quite large in my case, I'm looking for a more efficient way than a for loop. I found 3D numpy array into block diagonal matrix but this does not really solve my problem. The only thing I could think of is transforming each stack of matrices into block diagonals
import numpy as np
import scipy.linalg as linalg
a = np.ones((5,2,2))
b = np.ones((5,2,2))
c = np.ones((5,2,2))
a = linalg.block_diag(*a)
b = linalg.block_diag(*b)
c = linalg.block_diag(*c)
and constructing the resulting matrix from it by reshaping
result = linalg.block_diag(a,b,c)
result = result.reshape((5,6,6))
but this reshape does not work. I don't even know if this approach would be more efficient, so I'm asking whether I'm on the right track, whether somebody knows a better way of constructing this block diagonal 3D matrix, or whether I have to stick with the for loop solution.
Edit:
Since I'm new to this platform, I don't know where to leave this (Edit or Answer?), but I want to share my final solution: the highlighted solution from panadestein worked very nicely and easily, but I'm now using higher-dimensional arrays, where my matrices reside in the last two dimensions. Additionally, my matrices are no longer of the same dimension (mostly a mixture of 1x1, 2x2, 3x3), so I adapted V. Ayrat's solution with minor changes:
def nd_block_diag(arrs):
    shapes = np.array([i.shape for i in arrs])
    out = np.zeros(np.append(np.amax(shapes[:,:-2],axis=0), [shapes[:,-2].sum(), shapes[:,-1].sum()]))
    r, c = 0, 0
    for i, (rr, cc) in enumerate(shapes[:,-2:]):
        out[..., r:r + rr, c:c + cc] = arrs[i]
        r += rr
        c += cc
    return out
which also works with array broadcasting, if the input arrays are shaped properly (i.e. the dimensions that are to be broadcast are not added automatically). Thanks to pandestein and V. Ayrat for your kind and fast help, I've learned a lot about the possibilities of list comprehensions and array indexing/slicing!
block_diag also just iterates through the shapes. Almost all of the time is spent copying data, so you can do it whichever way you want, for example with a small change to the source code of block_diag:
arrs = a, b, c
shapes = np.array([i.shape for i in arrs])
out = np.zeros([shapes[0, 0], shapes[:, 1].sum(), shapes[:, 2].sum()])
r, c = 0, 0
for i, (_, rr, cc) in enumerate(shapes):
    out[:, r:r + rr, c:c + cc] = arrs[i]
    r += rr
    c += cc
print(np.allclose(result, out))
# True
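To make the slice-assignment idea above easier to reuse, here is a minimal sketch that wraps it in a function and checks it against scipy.linalg.block_diag applied slice by slice. The function name stacked_block_diag and the random test shapes (1x1, 2x2 and 3x3 blocks) are my own choices for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

def stacked_block_diag(*arrs):
    """Block-diagonal stack from arrays of shape (n, r_i, c_i) sharing the same n."""
    n = arrs[0].shape[0]
    rows = sum(a.shape[1] for a in arrs)
    cols = sum(a.shape[2] for a in arrs)
    out = np.zeros((n, rows, cols), dtype=np.result_type(*arrs))
    r = c = 0
    for a in arrs:
        rr, cc = a.shape[1:]
        # one slice assignment per block; there is no Python loop over n
        out[:, r:r + rr, c:c + cc] = a
        r += rr
        c += cc
    return out

rng = np.random.default_rng(0)
# hypothetical inputs with different block sizes
a = rng.random((5, 1, 1))
b = rng.random((5, 2, 2))
c = rng.random((5, 3, 3))

out = stacked_block_diag(a, b, c)
reference = np.array([block_diag(a[k], b[k], c[k]) for k in range(5)])
print(out.shape, np.array_equal(out, reference))
```

The loop runs once per block rather than once per stack element, so its overhead stays constant as n grows.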
I don't think that you can escape all possible loops to solve your problem. One way that I find convenient and perhaps more efficient than your for loop is to use a list comprehension:
import numpy as np
from scipy.linalg import block_diag
# Define input matrices
a = np.ones((5, 2, 2))
b = np.ones((5, 2, 2))
c = np.ones((5, 2, 2))
# Generate block diagonal matrices
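# Stack along a new axis 1 so that mats[k] holds a[k], b[k], c[k];
# reshaping np.array([a, b, c]) to (5, 3, 2, 2) would group the wrong matrices together
mats = np.stack([a, b, c], axis=1)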
result = [block_diag(*bmats) for bmats in mats]
Maybe this can give you some ideas to improve your implementation.
I have the following code snippet
def norm(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum(axis=0))
def call_norm(x1, x2):
    x1 = x1[..., :, np.newaxis]
    x2 = x2[..., np.newaxis, :]
    return norm(x1, x2)
As I understand it, each x represents an array of points in N dimensional space, where N is the size of the final dimension of the array (so for points in 3-space the final dimension is size 3). It inserts extra dimensions and uses broadcasting to generate the cartesian product of these sets of points, and so calculates the distance between all pairs of points.
x = np.array([[1, 2, 3],[1, 2, 3]])
call_norm(x, x)
array([[ 0. , 1.41421356, 2.82842712],
[ 1.41421356, 0. , 1.41421356],
[ 2.82842712, 1.41421356, 0. ]])
(so the distance between [1,1] and [2,2] is 1.41421356, as expected)
I find that for moderate size problems this approach can use huge amounts of memory. I can easily "de-vectorise" the problem and replace it by iteration, but I'd expect that to be slow. Is there a (reasonably) easy compromise solution where I could have most of the speed advantages of vectorisation but without the memory penalty? Some fancy generator trick?
There is no way to do this kind of computation without the memory penalty with numpy vectorization. For the specific case of efficiently computing pairwise distance matrices, packages tend to get around this by implementing things in C (e.g. scipy.spatial.distance) or in Cython (e.g. sklearn.metrics.pairwise).
If you want to do this "by-hand", so to speak, using numpy-style syntax but without incurring the memory penalty, the current best option is probably dask.array, which automates the construction and execution of flexible task graphs for batch execution using a numpy-style syntax.
Here's an example of using dask for this computation:
import dask.array as da
# Create the chunked data. This can be created
# from numpy arrays as well, e.g. x_dask = da.from_array(x_numpy, chunks=5)
x = da.random.random((100, 3), chunks=5)
y = da.random.random((200, 3), chunks=5)
# Compute the task graph (syntax just like numpy!)
diffs = x[:, None, :] - y[None, :, :]
dist = da.sqrt((diffs ** 2).sum(-1))
# Execute the task graph
result = dist.compute()
print(result.shape)
# (100, 200)
You'll find that dask can be much more memory efficient than NumPy for this kind of computation, because it works through the intermediate array chunk by chunk. It is often more computationally efficient as well, and it can also be run in parallel or out-of-core relatively straightforwardly.
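If adding a dependency is not an option, the same chunking idea can be sketched by hand in plain NumPy: loop over blocks of rows so that only a small temporary exists at any time, and compare with scipy.spatial.distance.cdist. This is only a sketch under my own assumptions: the function name pairwise_dist_chunked and the chunk size are hypothetical, and the points are stored as rows of an (n, d) array, as in the dask example, rather than as columns as in the question's code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_dist_chunked(x, y, chunk=1024):
    """Euclidean distances between rows of x (n, d) and rows of y (m, d), built in row blocks."""
    out = np.empty((x.shape[0], y.shape[0]))
    for start in range(0, x.shape[0], chunk):
        stop = start + chunk
        # the temporary is at most (chunk, m, d) instead of (n, m, d)
        diff = x[start:stop, None, :] - y[None, :, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=-1))
    return out

rng = np.random.default_rng(0)
x = rng.random((300, 3))
y = rng.random((200, 3))

dist = pairwise_dist_chunked(x, y, chunk=64)
print(dist.shape, np.allclose(dist, cdist(x, y)))
```

The Python loop runs only n / chunk times, so most of the work still happens inside vectorized NumPy calls, while peak memory is bounded by the chunk size rather than by n.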
I have two large numpy arrays, one holding a list of 3D points and the other a list of transformation matrices. Assuming there is a 1-to-1 correspondence between the two lists, I'm looking for the best way to calculate the result array of each point transformed by its corresponding matrix.
My solution was to use slicing (see "test4" in the example code below), which worked fine with small arrays but fails with large arrays because of how memory-wasteful my method is :)
import numpy as np
COUNT = 100
matrix = np.random.random_sample((3,3,)) # A single matrix
matrices = np.random.random_sample((COUNT,3,3,)) # Many matrices
point = np.random.random_sample((3,)) # A single point
points = np.random.random_sample((COUNT,3,)) # Many points
# Test 1, result of a single point multiplied by a single matrix
# This is as easy as it gets
test1 = np.dot(point,matrix)
print('done')
# Test 2, result of a single point multiplied by many matrices
# This works well and returns a transformed point for each matrix
test2 = np.dot(point,matrices)
print('done')
# Test 3, result of many points multiplied by a single matrix
# This works also just fine
test3 = np.dot(points,matrix)
print('done')
# Test 4, this is the case i'm trying to solve. Assuming there's a 1-1
# correspondence between the point and matrix arrays, the result i want
# is an array of points, where each point has been transformed by its
# corresponding matrix
test4 = np.zeros((COUNT,3))
for i in range(COUNT):
    test4[i] = np.dot(points[i],matrices[i])
print('done')
With a small array, this works fine. With large arrays (COUNT=1000000), Test #4 works but gets rather slow.
Is there a way to make Test #4 faster, presumably without using a loop?
You can use numpy.einsum. Here's an example with 5 matrices and 5 points:
In [49]: matrices.shape
Out[49]: (5, 3, 3)
In [50]: points.shape
Out[50]: (5, 3)
In [51]: p = np.einsum('ijk,ik->ij', matrices, points)
In [52]: p[0]
Out[52]: array([ 1.16532051, 0.95155227, 1.5130032 ])
In [53]: matrices[0].dot(points[0])
Out[53]: array([ 1.16532051, 0.95155227, 1.5130032 ])
In [54]: p[1]
Out[54]: array([ 0.79929572, 0.32048587, 0.81462493])
In [55]: matrices[1].dot(points[1])
Out[55]: array([ 0.79929572, 0.32048587, 0.81462493])
The above computes matrices[i] * points[i] (i.e. with the point on the right), but I just reread the question and noticed that your code uses points[i] * matrices[i]. You can do that by switching the indices and arguments of einsum:
In [76]: lp = np.einsum('ij,ijk->ik', points, matrices)
In [77]: lp[0]
Out[77]: array([ 1.39510822, 1.12011057, 1.05704609])
In [78]: points[0].dot(matrices[0])
Out[78]: array([ 1.39510822, 1.12011057, 1.05704609])
In [79]: lp[1]
Out[79]: array([ 0.49750324, 0.70664634, 0.7142573 ])
In [80]: points[1].dot(matrices[1])
Out[80]: array([ 0.49750324, 0.70664634, 0.7142573 ])
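To confirm that the second einsum reproduces the loop from Test 4 for every i, and to show the equivalent matmul form, here is a minimal sketch. The array sizes and random data are placeholders of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
count = 1000  # hypothetical size for the check
matrices = rng.random((count, 3, 3))
points = rng.random((count, 3))

# points[i] . matrices[i] for every i, as in Test 4
via_einsum = np.einsum('ij,ijk->ik', points, matrices)

# The same product with matmul: treat each point as a 1x3 row vector,
# multiply the stacks, then drop the length-1 axis again
via_matmul = (points[:, None, :] @ matrices)[:, 0, :]

# Reference: the explicit loop from the question
via_loop = np.array([points[i].dot(matrices[i]) for i in range(count)])

print(np.allclose(via_einsum, via_loop), np.allclose(via_matmul, via_loop))
```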
It doesn't make much sense to have multiple transform matrices. You can combine transform matrices as in this question:
If I want to apply matrix A, then B, then C, I will multiply the matrices in reverse order np.dot(C,np.dot(B,A))
So you can save some memory space by precomputing that matrix. Then applying a bunch of vectors to one transform matrix should be easily handled (within reason).
I don't know why you would need one million transformations on one million vectors, but I would suggest buying more RAM.
Edit:
There isn't a way to reduce the operations, no. Unless your transformation matrices hold a specific property such as sparsity, diagonality, etc. you're going to have to run all multiplications and summations. However, the way you process these operations can be optimized across cores and/or using vector operations on GPUs.
Also, Python is notably slow. You can try spreading the NumPy work across your cores using NumExpr, or maybe use a BLAS library from C++ (notably quick ;))
I'm working to implement the following equation:
X =(Y.T * Y + Y.T * C * Y) ^ -1
Y is an (n x f) matrix and C is an (n x n) diagonal matrix; n is about 300k and f will vary between 100 and 200. As part of an optimization process this equation will be used almost 100 million times, so it has to be processed really fast.
Y is initialized randomly and C is a very sparse matrix: only a few of the 300k entries on the diagonal are different from 0. Since NumPy's diagonal functions create dense matrices, I created C as a sparse CSR matrix. But when trying to compute the first part of the equation:
r = dot(C, Y)
The computer crashes due to memory limits. I then decided to convert Y to a csr_matrix and perform the same operation:
r = dot(C, Ysparse)
and this approach took 1.38 ms. But this solution is somewhat "tricky", since I'm using a sparse matrix to store a dense one, and I wonder how efficient it really is.
So my question is whether there is some way of multiplying the sparse C and the dense Y without having to turn Y into a sparse matrix, and whether that would improve performance. If C could somehow be represented as a dense diagonal without consuming tons of memory, maybe this would lead to very efficient performance, but I don't know if this is possible.
I appreciate your help!
The reason the dot product runs into memory issues when computing r = dot(C,Y) is because numpy's dot function does not have native support for handling sparse matrices. What is happening is numpy thinks of the sparse matrix C as a python object, and not a numpy array. If you inspect on small scale you can see the problem first hand:
>>> from numpy import dot, array
>>> from scipy import sparse
>>> Y = array([[1,2],[3,4]])
>>> C = sparse.csr_matrix(array([[1,0], [0,2]]))
>>> dot(C,Y)
array([[ (0, 0) 1
(1, 1) 2, (0, 0) 2
(1, 1) 4],
[ (0, 0) 3
(1, 1) 6, (0, 0) 4
(1, 1) 8]], dtype=object)
Clearly the above is not the result you are interested in. Instead what you want to do is compute using scipy's sparse.csr_matrix.dot function:
r = sparse.csr_matrix.dot(C, Y)
or more compactly
r = C.dot(Y)
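As a small illustration of the sparse-times-dense product, the sketch below builds a diagonal C with scipy.sparse.diags and multiplies it with a dense Y, which stays dense throughout. The sizes, the number of nonzero diagonal entries, and the variable name c_diag are assumptions of mine chosen to keep the example quick.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n, f = 1000, 5  # hypothetical sizes, far smaller than in the question
Y = rng.random((n, f))

# Mostly-zero diagonal with a handful of nonzero entries
c_diag = np.zeros(n)
c_diag[rng.choice(n, size=10, replace=False)] = rng.random(10)

# Only the diagonal is stored, never a dense n x n array
C = sparse.diags([c_diag], [0], format='csr')

r = C.dot(Y)  # sparse @ dense product; Y is never converted to sparse

# Multiplying by a diagonal matrix just scales the rows of Y
print(r.shape, np.allclose(r, c_diag[:, None] * Y))
```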
Try:
import numpy as np
from scipy import sparse
f = 100
n = 300000
Y = np.random.rand(n, f)
Cdiag = np.random.rand(n) # diagonal of C
Cdiag[np.random.rand(n) < 0.99] = 0
# Compute Y.T * C * Y, skipping zero elements
mask = np.flatnonzero(Cdiag)
Cskip = Cdiag[mask]
def ytcy_fast(Y):
Yskip = Y[mask,:]
CY = Cskip[:,None] * Yskip # broadcasting
return Yskip.T.dot(CY)
%timeit ytcy_fast(Y)
# For comparison: all-sparse matrices
C_sparse = sparse.spdiags([Cdiag], [0], n, n)
Y_sparse = sparse.csr_matrix(Y)
%timeit Y_sparse.T.dot(C_sparse * Y_sparse)
My timings:
In [59]: %timeit ytcy_fast(Y)
100 loops, best of 3: 16.1 ms per loop
In [18]: %timeit Y_sparse.T.dot(C_sparse * Y_sparse)
1 loops, best of 3: 282 ms per loop
First, are you really sure you need to perform a full matrix inversion in your problem? Most of the time, one only really needs to compute x = A^-1 y, which is a much easier problem to solve.
If you really do need the inverse, I would consider computing an approximation of the inverse matrix instead of the full matrix inversion, since matrix inversion is really costly. See for example the Lanczos algorithm for an efficient approximation of the inverse matrix. The approximation can be stored sparsely as a bonus. Plus, it requires only matrix-vector operations, so you don't even have to store the full matrix to invert.
As an alternative, using pyoperators, you can also use the .todense method to compute the matrix to invert using efficient matrix-vector operations. There is a special sparse container for diagonal matrices.
For an implementation of the Lanczos algorithm, you can have a look at pyoperators (disclaimer: I am one of the coauthors of this piece of software).
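To illustrate the first point of this answer (computing x = A^-1 y without forming the inverse), here is a minimal sketch using np.linalg.solve. The test matrix is a made-up symmetric positive definite one, and the size f is only a stand-in for the f x f matrix in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
f = 100
M = rng.random((f, f))
# hypothetical well-conditioned test matrix (symmetric positive definite)
A = M.T @ M + f * np.eye(f)
y = rng.random(f)

# Solve A x = y directly; no explicit inverse is formed
x_solve = np.linalg.solve(A, y)

# For comparison: explicit inverse followed by a matrix-vector product
x_inv = np.linalg.inv(A) @ y

print(np.allclose(x_solve, x_inv), np.allclose(A @ x_solve, y))
```

Whether this helps depends on whether the surrounding algorithm needs the matrix X itself or only its action on vectors, which the question does not say.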
I don't know if it was possible when the question was asked; but nowadays, broadcasting is your friend. An n*n diagonal matrix needs only be an array of the diagonal elements to be used in a matrix product:
>>> n, f = 5, 3
>>> Y = np.random.randint(0, 10, (n, f))
>>> C = np.random.randint(0, 10, (n,))
>>> Y.shape
(5, 3)
>>> C.shape
(5,)
>>> np.all(Y.T @ np.diag(C) @ Y == Y.T*C @ Y)
True
Do note that Y.T*C @ Y is non-associative:
>>> Y.T*(C @ Y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,5) (3,)
But Y.T @ (C[:, np.newaxis]*Y) would yield the expected result:
>>> np.all(Y.T*C @ Y == Y.T@(C[:, np.newaxis]*Y))
True
True
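Putting this together for the full expression in the question, X = (Y.T * Y + Y.T * C * Y) ^ -1, one can factor the sum as Y.T (I + C) Y and scale the rows of Y by 1 + C via broadcasting. The following is only a sketch with small made-up sizes; the dense np.diag(C) is built solely to check the result and would not be feasible for n around 300k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f = 2000, 20  # hypothetical sizes, much smaller than in the question
Y = rng.random((n, f))

# C holds only the diagonal entries, most of them zero
C = np.zeros(n)
C[rng.choice(n, size=20, replace=False)] = rng.random(20)

# Y.T @ Y + Y.T @ diag(C) @ Y == Y.T @ diag(1 + C) @ Y,
# and multiplying by diag(1 + C) just scales the rows of Y
gram = Y.T @ ((1.0 + C)[:, None] * Y)
X = np.linalg.inv(gram)

# Reference with an explicit dense diagonal matrix (for the check only)
reference = np.linalg.inv(Y.T @ Y + Y.T @ np.diag(C) @ Y)
print(X.shape, np.allclose(X, reference))
```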