Efficient linear algebra for block-diagonal matrices in compressed format - python

I have a linear system in which all the matrices are block diagonal. They have N blocks identical in shape.
Matrices are stored in compressed format as numpy arrays with shape (N, n, m), while the shape of the vectors is (N, m).
I currently implemented the matrix-vector product as
import numpy as np

def mvdot(m, v):
    return (m * np.expand_dims(v, -2)).sum(-1)
Thanks to broadcasting rules, if all the blocks of a matrix are the same I have to store it only once in an array with shape (1, n, m): the product with a vector (N, m) still gives the correct (N, n) vector.
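For example, this little check (sizes chosen arbitrarily, reusing mvdot from above) confirms the broadcasting behaviour:
N, n, m_ = 4, 3, 5
shared = np.random.rand(1, n, m_)      # one block stored only once
v = np.random.rand(N, m_)
out = mvdot(shared, v)                 # the (1, n, m) block broadcasts against all N vectors
assert out.shape == (N, n)
assert all(np.allclose(out[k], shared[0] @ v[k]) for k in range(N))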
My questions are:
how to implement an efficient matrix-matrix product that yields the matrix with shape (N, n, m) from two matrices with shapes (N, n, p) and (N, p, m)?
is there a way to perform these operations with a numpy built-in (possibly faster) function? Functions like np.linalg.inv make me think that numpy was designed to support this compressed format for block diagonal matrices.

If I understand your question correctly, you have two arrays of shape (N,n,p) and (N,p,m), respectively, and their product should be of shape (N,n,m) where element [i,:,:] is the matrix product of M1[i,:,:] and M2[i,:,:]. This can be achieved using numpy.einsum:
import numpy as np
N = 7
n,p,m = 3,4,5
M1 = np.random.rand(N,n,p)
M2 = np.random.rand(N,p,m)
Mprod = np.einsum('ijk,ikl->ijl',M1,M2)
# check if all the submatrices are what we expect
all([np.allclose(np.dot(M1[k,...],M2[k,...]),Mprod[k,...]) for k in range(N)])
# True
Numpy's einsum is an incredibly versatile construction for complicated linear operations, and it's usually pretty efficient with two operands. The idea is to rewrite your operation in an indexed way: what you need is to multiply M1[i,j,k] with M2[i,k,l] for each i,j,l, and sum over k. This is exactly what the above call to einsum does: it collapses the index k, and performs the necessary products and assignments along the remaining dimensions in the given order.
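For illustration, here is the same contraction spelled out as explicit Python loops (slow, but it shows exactly which indices are multiplied and which one is summed away), continuing the example above:
Mprod_loop = np.zeros((N, n, m))
for i in range(N):
    for j in range(n):
        for l in range(m):
            # multiply along k and sum it away
            Mprod_loop[i, j, l] = np.sum(M1[i, j, :] * M2[i, :, l])
assert np.allclose(Mprod, Mprod_loop)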
The matrix-vector product can be done similarly:
M = np.random.rand(N,n,m)
v = np.random.rand(N,m)
Mvprod = np.einsum('ijk,ik->ij',M,v)
It's possible that numpy.dot can be coerced with the proper transposes and dimension tricks to directly do what you want, but I couldn't make that work.
Both of the above operations can be done in the same function call by allowing an implicit number of dimensions within einsum:
def mvdot(M1, M2):
    return np.einsum('ijk,ik...->ij...', M1, M2)
Mprod = mvdot(M1,M2)
Mvprod = mvdot(M,v)
In case the input argument M2 is a block matrix, a trailing dimension is appended to the result, yielding a block matrix. In case M2 is a "block vector", the result will be a block vector.

Since Python 3.5, the previous example can be simplified using the matrix multiplication operator @ (numpy.matmul), which treats such arrays as stacks of matrices residing in the last two indices and broadcasts accordingly:
import numpy as np
N = 7
n,p,m = 3,4,5
M1 = np.random.rand(N,n,p)
M2 = np.random.rand(N,p,m)
Mprod = M1 @ M2  # same as np.matmul(M1, M2)
all([np.allclose(np.dot(M1[k,...],M2[k,...]),Mprod[k,...]) for k in range(N)])
# True
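Note also that matmul broadcasts the leading (batch) dimensions, so the compressed storage trick from the question carries over: a block stored once as (1, n, p) multiplies the full (N, p, m) stack directly. A small sketch, reusing the arrays above:
shared = np.random.rand(1, n, p)          # one block stored only once
Mbroad = shared @ M2                      # leading dimension broadcasts: shape (N, n, m)
assert all(np.allclose(Mbroad[k], shared[0] @ M2[k]) for k in range(N))
# matrix-vector products: add a trailing axis to v, then drop it
v = np.random.rand(N, m)
Mv = (Mprod @ v[..., None])[..., 0]       # shape (N, n)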

Related

Product and summation with 3d and 1d arrays

Given a 3d array X with dimensions (K,n,m) that can be considered as a stack of K (n,m) matrices and a 1d vector b (dim n), the goal is to obtain the resulting vector r (dim n), each component of which is calculated as

$r_i = \sum_{k=1}^{K} \sum_{l=1}^{n} \sum_{j=1}^{m} X_{kij}\, b_l\, X_{klj}$

It is easy to see that the expression under the k-summation (i.e. the two inner sums) is just the dot product $X_k (b X_k)$ (and, therefore, can easily be calculated using numpy). So, the desired vector r is

$r = \sum_{k=1}^{K} X_k (b X_k)^{\top} = \sum_{k=1}^{K} X_k X_k^{\top} b$

where X_k is the k-th 2d (n,m) 'layer' of the 3d array X.
I.e. the current solution is
r = 0
for k in range(K):
    r += x[k, :, :] @ (b @ x[k, :, :])
Can r be efficiently calculated avoiding a for-loop by k?
Or maybe there is another efficient way to calculate r?
(I tried np.tensordot, but since it is just a pure summation over k I didn't get a correct result yet.)
This looks like a perfect usecase for einsum:
r = np.einsum('kij,l,klj->i', x, b, x)
which vectorizes the operation and is typically much faster than an explicit for loop.
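A quick self-contained check against the loop version (sizes are arbitrary):
import numpy as np

K, n, m = 4, 3, 5
x = np.random.rand(K, n, m)
b = np.random.rand(n)
r_loop = np.zeros(n)
for k in range(K):
    r_loop += x[k] @ (b @ x[k])
r = np.einsum('kij,l,klj->i', x, b, x)
assert np.allclose(r, r_loop)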

Pytorch: Efficiently compute unbiased estimator of mean to the power of four

Let w, x, y, z be torch tensors of shape (m, n) and we wish to compute the following unbiased estimator row-wise efficiently (without for loops), i.e. for every row b = 1, ..., m:

$E_b = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i \neq j \neq k \neq l} w_{bi}\, x_{bj}\, y_{bk}\, z_{bl}$ (indices pairwise distinct)

In the case of only the unbiased estimator of the square of means, i.e., for

$\frac{1}{n(n-1)} \sum_{i \neq j} x_{bi}\, y_{bj},$

this is possible, e.g., using torch.einsum:
batch_outer = torch.einsum('bi, bj -> bij', x, y)  # (m, n, n) outer products per row
zero_diag = 1 - torch.eye(batch_outer.shape[1])    # mask that zeroes the i == j terms
est = (batch_outer * zero_diag).sum(dim=2).sum(dim=1) / (n * (n - 1))
However, for the power-of-four case this is not so easily doable, mostly because these are not square tensors and, in particular, because zeroing out the diagonals becomes very tedious.
My questions:
1.) How can this be implemented efficiently, omitting any for loops?
2.) Which time and memory complexity would that solution have in big O notation?
3.) Can this solution also be used to do it with four 3D tensors of shape (m, k, n), where again we only want to do the computations along the axes of length n (dim=2)?
4.) If I want to do it in log-space for numerical stability, i.e., to use logsumexp for summations and sums for multiplications (because log(xy) = log(x) + log(y)), any solution with einsum wouldn't work anymore. How could that computation then be done in log-space?
This implementation seems to work, if I didn't make a mess with the diagonal dimensions.
import math
import numpy as np
import torch as th

x = np.array([1, 4, 5, 3])
y = np.array([5, 2, 4, 5])[np.newaxis]
z = np.array([5, 7, 4, 5])[np.newaxis][np.newaxis]
w = np.array([3, 9, 5, 1])[np.newaxis][np.newaxis][np.newaxis]
xth = th.Tensor(x)
yth = th.Tensor(y)
zth = th.Tensor(z)
wth = th.Tensor(w)
# move each vector to its own axis so that broadcasting builds the full
# (4, 4, 4, 4) tensor of all products
tensor = xth * th.transpose(yth, 0, 1) * th.transpose(zth, 0, 2) * th.transpose(wth, 0, 3)
# drop the terms where the last two indices coincide
diag = th.diagonal(tensor, dim1=-2, dim2=-1)
result = th.sum(tensor) - th.sum(diag)
result /= math.factorial(len(x))  # 4! = n*(n-1)*(n-2)*(n-3) for n = 4
print(result)
The order is between O(n^2.37...) and O(n^3), depending on the PyTorch implementation of matrix multiplication.
I don't see why not; just choose the dimensions to transpose and take the diagonal of appropriately.
I don't see why this solution wouldn't work in log-space.
P.S.: my knowledge of PyTorch is quite limited, but I'm sure you can define x, y, z, w in a more elegant way.

Combining sparse and einsum to perform large sparse sum

I have a matrix A with shape=(N, N) and a matrix B with the same shape=(N, N).
I am constructing a matrix M using the following einsum (using the opt_einsum library):
M = oe.contract('nm,in,jm,pn,qm->ijpq', A, B, B, B, B)
This evaluates the following sum:

$M_{ijpq} = \sum_{n,m} A_{nm}\, B_{in}\, B_{jm}\, B_{pn}\, B_{qm}$

This yields a matrix M with shape (N, N, N, N). I then reshape this to a 2D array of shape (N**2, N**2):
M = M.reshape((N**2, N**2))
This must be 2D as it is treated as a linear operator.
I want to use the sparse library, as M is sparse, and becomes too large to store for large N. I can make A and B sparse, and insert them into the oe.contract.
The problem is, sparse only supports 2D arrays and so fails to produce the 4D output of shape (N, N, N, N). Is there a way to combine the einsum and reshape steps so that sparse can be used in this way, given that the final shape of M is 2D?
This may not help with your use of opt_einsum, but with a bit of reorganizing I can speed up np.einsum quite a bit, at least for small arrays.
Do a partial product of two B:
c1 = np.einsum('in,jm->ijnm',B,B).reshape(N*N,N,N)
The pq pair is the same, so we don't need to recalculate it:
c2 = np.einsum('nm,onm,pnm->op',A,c1,c1)
I verified that this works for two (3,3) arrays, and the speed up is about 10x.
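As a sanity check, here is a small self-contained comparison of the direct contraction and the two-step version (sizes are illustrative):
import numpy as np

N = 3
A = np.random.rand(N, N)
B = np.random.rand(N, N)
M_direct = np.einsum('nm,in,jm,pn,qm->ijpq', A, B, B, B, B).reshape(N**2, N**2)
c1 = np.einsum('in,jm->ijnm', B, B).reshape(N*N, N, N)
M_fast = np.einsum('nm,onm,pnm->op', A, c1, c1)
assert np.allclose(M_direct, M_fast)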
We can even reshape the nm to 1d, though this doesn't improve speed:
c1 = np.einsum('in,jm->ijnm',B,B).reshape(N*N,N*N)
c3 = np.einsum('n,on,pn->op',A.reshape(N*N),c1,c1)
I did not correctly interpret the error given by opt_einsum.
The problem is not that sparse does not support ND sparse arrays (it does!), but that I was not using a true einsum: the summed indices appear more than twice (n and m). As stated in the opt_einsum documentation, this results in a call to sparse.einsum, which does not exist. Using each index only once or twice works. A different contraction path, such as the one suggested by hpaulj, can be used to solve the problem.

Tensorflow - matmul of input matrix with batch data

I have some data represented by input_x. It is a tensor of unknown size (it is fed in batches) and each item there is of size n. input_x undergoes tf.nn.embedding_lookup, so that embed now has dimensions [?, n, m] where m is the embedding size and ? refers to the unknown batch size.
This is described here:
input_x = tf.placeholder(tf.int32, [None, n], name="input_x")
embed = tf.nn.embedding_lookup(W, input_x)
I'm now trying to multiply each sample in my input data (which is now expanded by embedding dimension) by a matrix variable, U, and I can't seem to get how to do that.
I first tried using tf.matmul but it gives an error due to a mismatch in shapes. I then tried the following, expanding the dimension of U and applying batch_matmul (I also tried the function from tf.nn.math_ops; the result was the same):
U = tf.Variable( ... )
U1 = tf.expand_dims(U,0)
h=tf.batch_matmul(embed, U1)
This passes the initial compilation, but then when actual data is applied, I get the following error:
In[0].dim(0) and In[1].dim(0) must be the same: [64,58,128] vs [1,128,128]
I also know why this is happening - I replicated the dimension of U and it is now 1, but the minibatch size, 64, doesn't fit.
How can I do that matrix multiplication on my tensor-matrix input correctly (for unknown batch size)?
Previous answers are obsolete. Currently tf.matmul() supports tensors with rank > 2:
The inputs must be matrices (or tensors of rank > 2, representing
batches of matrices), with matching inner dimensions, possibly after
transposition.
Also tf.batch_matmul() was removed and tf.matmul() is the right way to do batch multiplication. The main idea can be understood from the following code:
import tensorflow as tf
batch_size, n, m, k = 10, 3, 5, 2
A = tf.Variable(tf.random_normal(shape=(batch_size, n, m)))
B = tf.Variable(tf.random_normal(shape=(batch_size, m, k)))
tf.matmul(A, B)
Now you will receive a tensor of the shape (batch_size, n, k). Here is what is going on. Assume you have batch_size matrices of shape n×m and batch_size matrices of shape m×k. For each pair of them you calculate (n×m)·(m×k), which gives you an n×k matrix. You will have batch_size of them.
Notice that something like this is also valid:
A = tf.Variable(tf.random_normal(shape=(a, b, n, m)))
B = tf.Variable(tf.random_normal(shape=(a, b, m, k)))
tf.matmul(A, B)
and will give you a shape (a, b, n, k)
1. I want to multiply a batch of matrices with a batch of matrices of the same length, pairwise
M = tf.random_normal((batch_size, n, m))
N = tf.random_normal((batch_size, m, p))
# python >= 3.5
MN = M @ N
# or the old way,
MN = tf.matmul(M, N)
# MN has shape (batch_size, n, p)
2. I want to multiply a batch of matrices with a batch of vectors of the same length, pairwise
We fall back to case 1 by adding and removing a dimension to v.
M = tf.random_normal((batch_size, n, m))
v = tf.random_normal((batch_size, m))
Mv = (M @ v[..., None])[..., 0]
# Mv has shape (batch_size, n)
3. I want to multiply a single matrix with a batch of matrices
In this case, we cannot simply add a batch dimension of 1 to the single matrix, because tf.matmul does not broadcast in the batch dimension.
3.1. The single matrix is on the right side
In that case, we can treat the matrix batch as a single large matrix, using a simple reshape.
M = tf.random_normal((batch_size, n, m))
N = tf.random_normal((m, p))
MN = tf.reshape(tf.reshape(M, [-1, m]) @ N, [-1, n, p])
# MN has shape (batch_size, n, p)
3.2. The single matrix is on the left side
This case is more complicated. We can fall back to case 3.1 by transposing the matrices.
MT = tf.matrix_transpose(M)
NT = tf.matrix_transpose(N)
NTMT = tf.reshape(tf.reshape(NT, [-1, m]) @ MT, [-1, p, n])
MN = tf.matrix_transpose(NTMT)
However, transposition can be a costly operation, and here it is done twice on an entire batch of matrices. It may be better to simply duplicate M to match the batch dimension:
MN = tf.tile(M[None], [batch_size, 1, 1]) @ N
Profiling will tell which option works better for a given problem/hardware combination.
4. I want to multiply a single matrix with a batch of vectors
This looks similar to case 3.2 since the single matrix is on the left, but it is actually simpler because transposing a vector is essentially a no-op. We end up with
M = tf.random_normal((n, m))
v = tf.random_normal((batch_size, m))
MT = tf.matrix_transpose(M)
Mv = v @ MT
What about einsum?
All of the previous multiplications could have been written with the tf.einsum swiss army knife. For example the first solution for 3.2 could be written simply as
MN = tf.einsum('nm,bmp->bnp', M, N)
However, note that einsum ultimately relies on transpose and matmul for the computation.
So even though einsum is a very convenient way to write matrix multiplications, it hides the complexity of the operations underneath; for example, it is not straightforward to guess how many times an einsum expression will transpose your data, and therefore how costly the operation will be. Also, it may hide the fact that there could be several alternatives for the same operation (see case 3.2) and might not necessarily choose the better option.
For this reason, I would personally use explicit formulas like those above to better convey their respective complexity. That said, if you know what you are doing and like the simplicity of the einsum syntax, then by all means go for it.
The matmul operation only works on matrices (2D tensors). Here are two main approaches to do this, both assuming that U is a 2D tensor.
Slice embed into 2D tensors and multiply each of them with U individually. This is probably easiest to do using tf.scan() like this:
h = tf.scan(lambda a, x: tf.matmul(x, U), embed)
On the other hand if efficiency is important it may be better to reshape embed to be a 2D tensor so the multiplication can be done with a single matmul like this:
embed = tf.reshape(embed, [-1, m])
h = tf.matmul(embed, U)
h = tf.reshape(h, [-1, n, c])
where c is the number of columns in U. The last reshape will make sure that h is a 3D tensor where the 0th dimension corresponds to the batch just like the original x_input and embed.
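For concreteness, a minimal end-to-end sketch of the reshape approach (all sizes here are made up for illustration):
import tensorflow as tf

batch, n, m, c = 64, 10, 128, 32
embed = tf.random_normal((batch, n, m))   # stands in for the embedded input batch
U = tf.random_normal((m, c))
h = tf.reshape(tf.matmul(tf.reshape(embed, [-1, m]), U), [-1, n, c])
# h has shape (batch, n, c)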
As answered by @Stryke, there are two ways to achieve this: 1. Scanning, and 2. Reshaping
tf.scan requires lambda functions and is generally used for recursive operations. Some examples for the same are here: https://rdipietro.github.io/tensorflow-scan-examples/
I personally prefer reshaping, since it is more intuitive. If you are trying to matrix multiply each matrix in the 3D tensor by the matrix that is the 2D tensor, like Cijl = Aijk * Bkl, you can do it with a simple reshape.
# i, j, k, l denote the corresponding dimension sizes
A2 = tf.reshape(A, [i * j, k])   # collapse the first two axes
C2 = tf.matmul(A2, B)            # B has shape (k, l)
C = tf.reshape(C2, [i, j, l])    # restore the batch structure
It seems that in TensorFlow 1.11.0 the docs for tf.matmul incorrectly say that it works for rank >= 2.
Instead, the best clean alternative I've found is to use tf.tensordot(a, b, (-1, 0)) (docs).
This function computes the dot product of any axis of array a and any axis of array b in its general form tf.tensordot(a, b, axes). Providing axes as (-1, 0) gets the standard dot product of the two arrays.
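For example, applied to the shapes from the question (sizes are illustrative):
import tensorflow as tf

embed = tf.random_normal((64, 10, 128))        # (batch, n, m)
U = tf.random_normal((128, 32))                # (m, c)
h = tf.tensordot(embed, U, axes=[[-1], [0]])   # contract last axis of embed with first axis of U
# h has shape (64, 10, 32)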

Solve broadcasting error without for loop, speed up code

I may be misunderstanding how broadcasting works in Python, but I am still running into errors.
scipy offers a number of "special functions" which take in two arguments, in particular the eval_XX(n, x[,out]) functions.
See http://docs.scipy.org/doc/scipy/reference/special.html
My program uses many orthogonal polynomials, so I must evaluate these polynomials at distinct points. Let's take the concrete example scipy.special.eval_hermite(n, x, out=None).
I would like the x argument to be a matrix of shape (50, 50). Then, I would like to evaluate each entry of this matrix at a number of points. Let's define n to be a numpy array, narr = np.arange(10) (where we have imported numpy as np).
So, calling
scipy.special.eval_hermite(narr, matrix)
should return the Hermite polynomials H_0(matrix), H_1(matrix), H_2(matrix), etc. Each H_n(matrix) has shape (50, 50), the shape of the original input matrix.
Then, I would like to sum these values. So, I call
matrix1 = np.sum([scipy.special.eval_hermite(narr, matrix)], axis=0)
but I get a broadcasting error!
ValueError: operands could not be broadcast together with shapes (10,) (50,50)
I can solve this with a for loop, i.e.
matrix2 = np.sum([scipy.special.eval_hermite(i, matrix) for i in narr], axis=0)
This gives me the correct answer, and the output matrix2.shape = (50,50). But using this for loop slows down my code, big time. Remember, we are working with entries of matrices.
Is there a way to do this without a for loop?
eval_hermite broadcasts n with x, then evaluates Hn(x) at each point. Thus, the output shape will be the result of broadcasting n with x. So, if you want to make this work, you'll have to make n and x have compatible shapes:
import scipy.special as ss
import numpy as np
matrix = np.ones([100,100]) # example
narr = np.arange(10) # example
ss.eval_hermite(narr[:,None,None], matrix).shape # => (10, 100, 100)
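Summing over the leading axis then gives the desired result in a single vectorized call:
matrix1 = ss.eval_hermite(narr[:, None, None], matrix).sum(axis=0)  # shape (100, 100)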
But note that this might actually be faster:
out = np.zeros_like(matrix)
for n in narr:
    out += ss.eval_hermite(n, matrix)
In testing, this appears to be between 5 and 10% faster than the np.sum(...) version above.
The documentation for these functions is skimpy, and a lot of the code is compiled, so this is just based on experimentation:
special.eval_hermite(n, x, out=None)
n apparently is a scalar or array of integers. x can be an array of floats.
special.eval_hermite(np.ones(5,int)[:,None],np.ones(6)) gives me a (5,6) result. This is the same shape as what I'd get from np.ones(5,int)[:,None] * np.ones(6).
The np.ones(5,int)[:,None] is a (5,1) array, np.ones(6) a (6,), which for this purpose is equivalent of (1,6). Both can be expanded to (5,6).
So as best I can tell, broadcasting rules in these special functions is the same as for operators like *.
Since special.eval_hermite(narr[:,None,None], x) produces a (10,50,50), you just apply sum to axis 0 of that to produce the (50,50):
special.eval_hermite(narr[:, None, None], x).sum(axis=0)
Like I wrote before, the same broadcasting (and summing) rules apply for this hermite as they do for a basic operation like *.
