I have two 3D tensors, tensor A which has shape [B,N,S] and tensor B which also has shape [B,N,S]. What I want to get is a third tensor C, which I expect to have shape [B,B,N], where the element C[i,j,k] = np.dot(A[i,k,:], B[j,k,:]). I also want to achieve this in a vectorized way.
Some further info: The two tensors A and B have shape [Batch_size, Num_vectors, Vector_size]. The tensor C, is supposed to represent the dot product between each element in the batch from A and each element in the batch from B, between all of the different vectors.
Hope that it is clear enough and looking forward to your answers!
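For reference, here is a naive loop version of that definition (a sketch with small made-up sizes, only to make the target unambiguous):

import numpy as np

batch, num_vec, vec_size = 3, 4, 5  # toy sizes
A = np.random.rand(batch, num_vec, vec_size)
B = np.random.rand(batch, num_vec, vec_size)

C = np.empty((batch, batch, num_vec))
for i in range(batch):
    for j in range(batch):
        for k in range(num_vec):
            # dot product of the k-th vector of batch element i of A
            # with the k-th vector of batch element j of B
            C[i, j, k] = np.dot(A[i, k, :], B[j, k, :])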
In [331]: A=np.random.rand(100,200,300)
In [332]: B=A
The suggested einsum, working directly from the
C[i,j,k] = np.dot(A[i,k,:], B[j,k,:])
expression:
In [333]: np.einsum( 'ikm, jkm-> ijk', A, B).shape
Out[333]: (100, 100, 200)
In [334]: timeit np.einsum( 'ikm, jkm-> ijk', A, B).shape
800 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
matmul does a dot on the last 2 dimensions, and treats the leading one(s) as batch. In your case 'k' is the batch dimension, and 'm' is the one that should obey the rule of "last axis of A, second-to-last axis of B". So, rewriting ikm,jkm->ijk to fit that, and transposing A and B accordingly:
In [335]: np.einsum('kim,kmj->kij', A.transpose(1,0,2), B.transpose(1,2,0)).shape
Out[335]: (200, 100, 100)
In [336]: timeit np.einsum('kim,kmj->kij',A.transpose(1,0,2), B.transpose(1,2,0)).shape
774 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not much difference in performance. But now use matmul:
In [337]: (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape
Out[337]: (100, 100, 200)
In [338]: timeit (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape
64.4 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
and verify that the values match (though more often than not, if the shapes match, the values do too):
In [339]: np.allclose((A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0), np.einsum('ikm,jkm->ijk', A, B))
Out[339]: True
I won't try to measure memory usage, but the time improvement suggests it too is better.
In some cases einsum is optimized to use matmul. Here that doesn't seem to be the case, though we could play with its parameters. I'm a little surprised that matmul is doing so much better.
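One way to peek at what einsum intends to do, assuming you pass an optimize argument, is np.einsum_path, which reports the planned contraction (whether it can route through BLAS depends on the numpy version):

path, info = np.einsum_path('ikm,jkm->ijk', A, B, optimize='optimal')
print(info)  # describes the contraction order and estimated cost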
===
I vaguely recall another SO question about matmul taking a shortcut when the two arrays are the same thing, A@A. I used B=A in these tests.
In [350]: timeit (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape
60.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [352]: B2=np.random.rand(100,200,300)
In [353]: timeit (A.transpose(1,0,2)@B2.transpose(1,2,0)).transpose(1,2,0).shape
97.4 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But that only made a modest difference.
In [356]: np.__version__
Out[356]: '1.16.4'
My BLAS etc is standard Linux, nothing special.
I think you can use einsum such as:
np.einsum( 'ikm, jkm-> ijk', A, B)
With the subscripts 'ikm,jkm->ijk' you specify, using the Einstein convention, which dimensions are reduced. The third dimension of both arrays A and B, here named 'm', will be reduced, just as the dot operation does on vectors.
Try:
C = np.diagonal( np.tensordot(A,B, axes=(2,2)), axis1=1, axis2=3)
from https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot
Explanation
The solution is a composition of two operations. First comes the tensor product between A and B over their third axes, as you want. This outputs a rank-4 tensor, which you then reduce to a rank-3 tensor by taking equal indices on axes 1 and 3 (your k in your notation; note that tensordot gives a different axis order than your maths). This can be done by taking the diagonal, just as you can reduce a matrix to the vector of its diagonal entries.
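A quick sanity check of that composition on small random arrays (a sketch):

import numpy as np

A = np.random.rand(3, 4, 5)
B = np.random.rand(3, 4, 5)

# tensordot over the last axes gives shape (3, 4, 3, 4); the diagonal
# over axes 1 and 3 keeps only the terms with equal k and moves that
# axis to the end, giving shape (3, 3, 4).
C = np.diagonal(np.tensordot(A, B, axes=(2, 2)), axis1=1, axis2=3)
print(np.allclose(C, np.einsum('ikm,jkm->ijk', A, B)))  # expected: True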
Related
Given two tensors A and B with the same number of dimensions (d >= 2) and shapes [A_{1},...,A_{d-2},A_{d-1},A_{d}] and [A_{1},...,A_{d-2},B_{d-1},B_{d}] (the shapes of the first d-2 dimensions are identical).
Is there a way to calculate the Kronecker product over the last two dimensions?
The shape of my_kron(A,B) should be [A_{1},...,A_{d-2},A_{d-1}*B_{d-1},A_{d}*B_{d}].
For example with d=3,
A.shape=[2,3,3]
B.shape=[2,4,4]
C=my_kron(A,B)
C[0,...] should be the Kronecker product of A[0,...] and B[0,...], and C[1,...] the Kronecker product of A[1,...] and B[1,...].
For d=2 this is simply what the jnp.kron (or np.kron) function does.
For d=3 this can be achieved with jax.vmap:
jax.vmap(jnp.kron)(A, B)
But I was not able to find a solution for general (unknown) dimensions.
Any suggestions?
In numpy terms I think this is what you are doing:
In [104]: A = np.arange(2*3*3).reshape(2,3,3)
In [105]: B = np.arange(2*4*4).reshape(2,4,4)
In [106]: C = np.array([np.kron(a,b) for a,b in zip(A,B)])
In [107]: C.shape
Out[107]: (2, 12, 12)
That treats the initial dimension, the 2, as a batch. One obvious generalization is to reshape the arrays, collapsing the leading dimensions into a single batch axis, e.g. reshape(-1,3,3), etc., and afterwards reshape C back to the desired number of dimensions.
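A sketch of that reshape idea, assuming the leading dimensions can simply be flattened into one batch axis (the function name is made up):

import numpy as np

def my_kron_reshape(A, B):
    # Collapse all leading dims into one batch axis, kron each 2D pair,
    # then restore the leading shape around the kron'd trailing dims.
    lead = A.shape[:-2]
    A2 = A.reshape(-1, *A.shape[-2:])
    B2 = B.reshape(-1, *B.shape[-2:])
    C2 = np.array([np.kron(a, b) for a, b in zip(A2, B2)])
    return C2.reshape(*lead, *C2.shape[-2:])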
np.kron does accept 3d (and higher) inputs, but it does some sort of outer product on the shared size-2 leading dimension:
In [108]: np.kron(A,B).shape
Out[108]: (4, 12, 12)
And visualizing that size-4 dimension as (2,2), I can take its diagonal and get your C:
In [109]: np.allclose(np.kron(A,B)[[0,3]], C)
Out[109]: True
The full kron does more calculations than needed, but is still faster:
In [110]: timeit C = np.array([np.kron(a,b) for a,b in zip(A,B)])
108 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [111]: timeit np.kron(A,B)[[0,3]]
76.4 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I'm sure it's possible to do your calculation in a more direct way, but doing that requires a better understanding of how kron works. A quick glance at the np.kron code suggests that it does an outer(A,B):
In [114]: np.outer(A,B).shape
Out[114]: (18, 32)
which has the same number of elements, and which it then reshapes and concatenates to produce the kron layout.
But following a hunch, I found that this is equivalent to what you want:
In [123]: D = A[:,:,None,:,None]*B[:,None,:,None,:]
In [124]: np.allclose(D.reshape(2,12,12),C)
Out[124]: True
In [125]: timeit np.reshape(A[:,:,None,:,None]*B[:,None,:,None,:],(2,12,12))
14.3 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
That is easily generalized to more leading dimensions.
def my_kron(A, B):
    D = A[..., :, None, :, None] * B[..., None, :, None, :]
    ds = D.shape
    newshape = (*ds[:-4], ds[-4]*ds[-3], ds[-2]*ds[-1])
    return D.reshape(newshape)
In [137]: my_kron(A.reshape(1,2,1,3,3),B.reshape(1,2,1,4,4)).shape
Out[137]: (1, 2, 1, 12, 12)
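Since the question is framed in jax, the same broadcasting trick should carry over to jnp essentially unchanged (a sketch mirroring the numpy version above):

import jax.numpy as jnp

def my_kron_jax(A, B):
    # Same broadcast-then-merge trick, written with jax.numpy.
    D = A[..., :, None, :, None] * B[..., None, :, None, :]
    ds = D.shape
    return D.reshape((*ds[:-4], ds[-4] * ds[-3], ds[-2] * ds[-1]))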
This is a query regarding the internal working of torch.einsum on the GPU. I know how to use einsum. Does it perform all possible matrix multiplications and just pick out the relevant ones, or does it perform only the required computation?
For example, consider two tensors a and b of shape (N,P), where I wish to find the dot product of each corresponding pair of rows, each of shape (1,P).
Using einsum, the code is:
torch.einsum('ij,ij->i',a,b)
Without using einsum, another way to obtain the output is :
torch.diag(a @ b.t())
Now, the second snippet is supposed to perform significantly more computation than the first one (e.g. if N = 2000, it performs 2000 times more). However, when I time the two operations, they take roughly the same amount of time to complete, which raises the question: does einsum perform all combinations (like the second snippet) and then pick out the relevant values?
Sample Code to test:
import time
import torch

for i in range(100):
    a = torch.rand(50000, 256).cuda()
    b = torch.rand(50000, 256).cuda()
    t1 = time.time()
    val = torch.diag(a @ b.t())
    t2 = time.time()
    val2 = torch.einsum('ij,ij->i', a, b)
    t3 = time.time()
    print(t2 - t1, t3 - t2, torch.allclose(val, val2))
It probably has to do with the fact that the GPU can parallelize the computation of a @ b.t(). This means that the GPU doesn't actually have to wait for each row-column multiplication to finish before computing the next one.
If you check on CPU, you will see that torch.diag(a @ b.t()) is significantly slower than torch.einsum('ij,ij->i', a, b) for large a and b.
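A minimal way to check that on CPU, assuming you shrink N so the full N x N intermediate from a @ b.t() still fits comfortably in memory:

import timeit
import torch

a = torch.rand(5000, 256)  # CPU tensors, smaller N than the GPU test
b = torch.rand(5000, 256)

print(timeit.timeit(lambda: torch.diag(a @ b.t()), number=10))
print(timeit.timeit(lambda: torch.einsum('ij,ij->i', a, b), number=10))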
I can't speak for torch, but I worked with np.einsum in some detail years ago. Back then it constructed a custom iterator based on the index string, doing only the necessary calculations. Since then it has been reworked in various ways, and evidently converts the problem to @ where possible, thus taking advantage of BLAS (etc.) library calls.
In [147]: a = np.arange(12).reshape(3,4)
In [148]: b = a
In [149]: np.einsum('ij,ij->i', a,b)
Out[149]: array([ 14, 126, 366])
I can't say for sure which method is used in this case. With the 'j' summation, it could also be done with:
In [150]: (a*b).sum(axis=1)
Out[150]: array([ 14, 126, 366])
As you note, the simplest dot creates a larger array from which we can pull the diagonal:
In [151]: (a@b.T).shape
Out[151]: (3, 3)
But that's not the right way to use @. @ expands on np.dot by providing efficient 'batch' handling. So the i dimension is the batch one, and j the dot one.
In [152]: a[:,None,:]@b[:,:,None]
Out[152]:
array([[[ 14]],
[[126]],
[[366]]])
In [156]: (a[:,None,:]@b[:,:,None])[:,0,0]
Out[156]: array([ 14, 126, 366])
In other words, it is using a (3,1,4) with a (3,4,1) to produce a (3,1,1), doing the sum of products over the shared size-4 dimension.
Some sample times:
In [162]: timeit np.einsum('ij,ij->i', a,b)
7.07 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [163]: timeit (a*b).sum(axis=1)
9.89 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [164]: timeit np.diag(a@b.T)
10.6 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [165]: timeit (a[:,None,:]@b[:,:,None])[:,0,0]
5.18 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I have a likelihood function that I am trying to sample with MCMC. I have used no for loops in the log likelihood itself, but I do call np.einsum() once.
Here's a sample of what my current code looks like:
A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4) # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine); is there any way to speed it up? One useful piece of information is that for each call of the log-likelihood function, while the actual values in the arrays A and B change, the dimensions of each array remain fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.
Well, one axis stays aligned in A (the first one) and B (the last one), stays in the output as well (the last one), and has a very small length of 4. So we could simply loop over that axis, using np.tensordot for the tensor sum-reduction. With such large datasets, the benefit of 4x less memory congestion should outweigh the 4x looping, because the compute per iteration is also 4x smaller.
Thus, a solution with tensordot would be -
def func1(A, B):
    out = np.empty(A.shape[1:3] + B.shape[1:])
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to reiterate the importance of memory congestion and compute requirements, suppose we also want to sum-reduce the last axis of length 4; then we see a noticeable difference in timings for the optimal version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum.
Specific to given problem
Given that the dimensions of A and B stay the same, the array initialization out = np.empty(A.shape[1:3] + B.shape[1:]) could be done as a one-time affair, and each call of the log-likelihood function could use the proposed tensordot loop to update the output out in place.
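A sketch of that one-time initialization, assuming the shapes really do stay fixed across calls (the log-likelihood body itself is left as a placeholder):

import numpy as np

A = np.random.rand(4, 50, 60, 200)
B = np.random.rand(200, 1000, 4)

out = np.empty(A.shape[1:3] + B.shape[1:])  # allocate once, reuse every call

def log_likelihood(A, B, out=out):
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    # ... compute and return the actual log-likelihood from `out` ...
    return out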
Even when used in a small loop, tensordot is more than 10x faster:
timeit(lambda:np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"),number=5)/5
# 3.052245747600682
timeit(lambda:np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1),number=10)/10
# 0.23842503569903784
out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True
I have 2 arrays, A (shape 10x10x36) and B (shape 10x27x36). I would like to multiply the last 2 axes and sum the result along axis 0 so that the result C has shape 10x27. Here is how I currently do it:
C = []
for i in range(A.shape[0]):
    C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't seem to figure out how. I have checked out np.einsum but am not yet sure how to apply it to achieve the result. Any help will be appreciated. Thanks!
Here is the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
However, on my machine the for-loop implementation runs almost 2x faster:
def f(A, B):
    C = []
    for i in range(A.shape[0]):
        C.append(np.matmul(A[i], B[i].T))
    return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
matmul supports stacking. You can simply do:
(A@B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A@B.transpose(0,2,1)).sum(0),C)
# True
timeit(lambda:(A@B.transpose(0,2,1)).sum(0),number=1000)
# 0.03199950899579562
# twice as fast as original loop
You could also try the following, using a list comprehension. It's a bit more concise than what you are currently using.
C = np.array([A[i] @ B.T[:,:,i] for i in range(10)]).sum(0)
Here's a simple piece of code that "batch multiplies" a 4D array a by a 3D array b:
from functools import reduce
import numpy as np
from operator import mul
def einsum(a, b):
    return np.einsum('ijkl,jkl->ikl', a, b)

def original(a, b):
    s0, s1, s2, s3 = a.shape
    c = np.empty((s0, s2, s3))
    for j in range(s3):
        for i in range(s2):
            c[:, j, i] = np.dot(a[:, :, j, i], b[:, j, i])
    return c
sz_a = (16, 4, 512, 512)
sz_b = (4, 512, 512)
a = np.random.random(reduce(mul, sz_a)).reshape(sz_a)
b = np.random.random(reduce(mul, sz_b)).reshape(sz_b)
For timing:
%timeit original(a, b)
395 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit einsum(a, b)
23.1 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd like to test out tensordot's performance to see how it compares, but I'm really having trouble wrapping my head around how to use it here. If anyone is familiar enough to guide me through this, it would be greatly appreciated. Thank you!
My original thought was:
np.tensordot(a, b, axes=((1),(0)))
But that gives me a MemoryError, so I don't think that's right...
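The MemoryError makes sense: tensordot contracts only the j axis and leaves every other axis uncontracted, so the result would carry all four remaining (512, 512) axes. With scaled-down sizes you can see the shape it tries to build (a sketch):

import numpy as np

a_small = np.random.random((16, 4, 8, 8))  # scaled-down a
b_small = np.random.random((4, 8, 8))      # scaled-down b

t = np.tensordot(a_small, b_small, axes=((1,), (0,)))
print(t.shape)  # (16, 8, 8, 8, 8); at full size this would be
                # (16, 512, 512, 512, 512), hence the MemoryError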
Time comparisons of your einsum with a matmul equivalent:
In [910]: timeit (a.transpose(2,3,0,1)@b[:,None].transpose(2,3,0,1)).transpose(2,3,0,1)[:,0]
90.5 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [911]: timeit np.einsum('ijkl,jkl->ikl', a, b)
92.7 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Times are close enough that I suspect the einsum optimization is actually using matmul. Originally einsum used its own compiled sum-of-products iteration, but with recent changes it uses a variety of methods, including dot and matmul when they fit.
matmul was created to handle the case where the leading dimensions represent a stack of matrices. In your problem the last 2 dimensions form that stack, with the dot acting on the initial ones. matmul handles this kind of stacked dot; dot, and its derivative tensordot, don't handle that kind of stacking.
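For completeness, here is the matmul form from the timing above written out and checked against the einsum result (a sketch that reuses the a and b defined earlier):

# Move the stacked (512, 512) axes to the front, do the (16, 4) @ (4, 1)
# matrix product per stack entry, then move the axes back.
c_mm = (a.transpose(2, 3, 0, 1) @ b[:, None].transpose(2, 3, 0, 1)).transpose(2, 3, 0, 1)[:, 0]
c_es = np.einsum('ijkl,jkl->ikl', a, b)
print(np.allclose(c_mm, c_es))  # expected: True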