This is a query regarding the internal working of torch.einsum on the GPU. I know how to use einsum. Does it perform all possible matrix multiplications and just pick out the relevant ones, or does it perform only the required computation?
For example, consider two tensors a and b, both of shape (N,P), where I wish to find the dot product of each corresponding pair of rows, each of shape (1,P).
Using einsum, the code is:
torch.einsum('ij,ij->i',a,b)
Without using einsum, another way to obtain the output is:
torch.diag(a @ b.t())
Now, the second snippet is supposed to perform significantly more computation than the first (e.g. if N = 2000, it performs 2000 times more). However, when I try to time the two operations, they take roughly the same amount of time to complete, which raises the question: does einsum perform all combinations (like the second snippet) and then pick out the relevant values?
Sample Code to test:
import time
import torch
for i in range(100):
    a = torch.rand(50000, 256).cuda()
    b = torch.rand(50000, 256).cuda()
    t1 = time.time()
    val = torch.diag(a @ b.t())
    t2 = time.time()
    val2 = torch.einsum('ij,ij->i', a, b)
    t3 = time.time()
    print(t2 - t1, t3 - t2, torch.allclose(val, val2))
It probably has to do with the fact that the GPU can parallelize the computation of a @ b.t(). This means that the GPU doesn't actually have to wait for each row-column multiplication to finish before computing the next one.
If you check on the CPU, you will see that torch.diag(a @ b.t()) is significantly slower than torch.einsum('ij,ij->i',a,b) for large a and b.
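Here is a minimal sketch of that CPU comparison, with smaller assumed shapes so the full N x N product fits comfortably in memory; the (a*b).sum(dim=1) form is an extra point of comparison, not from the answer above:

import time
import torch

a = torch.rand(5000, 256)
b = torch.rand(5000, 256)

t0 = time.perf_counter()
v1 = torch.diag(a @ b.t())            # builds the full 5000x5000 product first
t1 = time.perf_counter()
v2 = torch.einsum('ij,ij->i', a, b)   # only the required row-wise dot products
t2 = time.perf_counter()
v3 = (a * b).sum(dim=1)               # equivalent elementwise formulation
t3 = time.perf_counter()

print('diag(a @ b.t()):', t1 - t0)
print('einsum:         ', t2 - t1)
print('(a*b).sum(1):   ', t3 - t2)
print(torch.allclose(v1, v2), torch.allclose(v2, v3))

On the CPU the diag version is expected to be clearly the slowest, since it materialises the full N x N matrix before discarding everything off the diagonal.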
I can't speak for torch, but I worked with np.einsum in some detail years ago. Back then it constructed a custom iterator based on the index string, doing only the necessary calculations. Since then it's been reworked in various ways, and evidently converts the problem to @ where possible, thus taking advantage of BLAS (etc.) library calls.
In [147]: a = np.arange(12).reshape(3,4)
In [148]: b = a
In [149]: np.einsum('ij,ij->i', a,b)
Out[149]: array([ 14, 126, 366])
I can't say for sure what method is used in this case. With the 'j' summation, it could also be done with:
In [150]: (a*b).sum(axis=1)
Out[150]: array([ 14, 126, 366])
As you note, the simplest dot creates a larger array from which we can pull the diagonal:
In [151]: (a @ b.T).shape
Out[151]: (3, 3)
But that's not the right way to use @. @ expands on np.dot by providing efficient 'batch' handling. So the i dimension is the batch one, and j the dot one.
In [152]: a[:,None,:] @ b[:,:,None]
Out[152]:
array([[[ 14]],
[[126]],
[[366]]])
In [156]: (a[:,None,:] @ b[:,:,None])[:,0,0]
Out[156]: array([ 14, 126, 366])
In other words it is using a (3,1,4) with (3,4,1) to produce a (3,1,1), doing the sum of products on the shared size 4 dimension.
Some sample times:
In [162]: timeit np.einsum('ij,ij->i', a,b)
7.07 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [163]: timeit (a*b).sum(axis=1)
9.89 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [164]: timeit np.diag(a @ b.T)
10.6 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [165]: timeit (a[:,None,:] @ b[:,:,None])[:,0,0]
5.18 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Related
I have a likelihood function that I am trying to sample with MCMC. I have used no for loops in the log likelihood itself, but I do call np.einsum() once.
Here's a sample of what my current code looks like:
A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4) # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine); is there any way to speed it up? One useful piece of information is that, for each call of the log-likelihood function, the actual values in the arrays A and B change but the dimensions of each array remain fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.
Well, one of the axes stays aligned in A (first one) and B (last one), stays in the output as well (last one), and is a very small looping number of 4. So we could simply loop over that one with np.tensordot for a tensor sum-reduction. The benefit of 4x less memory congestion when working with such large datasets might overcome the 4x looping, because the compute per iteration is also 4x less.
Thus, a solution with tensordot would be -
def func1(A, B):
    out = np.empty(A.shape[1:3] + B.shape[1:])
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to re-iterate the importance of memory congestion and compute requirements: if we also sum-reduce the last axis of length 4, we see a noticeable difference in timings for the optimal version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum.
Specific to the given problem
Given that the dimensions of A and B stay the same, the array initialization out = np.empty(A.shape[1:3] + B.shape[1:]) could be done as a one-time affair; each call of the log-likelihood function could then reuse that buffer, running the proposed tensordot loop to update the output out in place.
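A minimal sketch of that reuse pattern, assuming the shapes from the question; the log_likelihood wrapper and its return expression are hypothetical placeholders, only the tensordot loop comes from the answer above:

import numpy as np

A_shape = (4, 50, 60, 200)
B_shape = (200, 1000, 4)

# One-time allocation: the output shape never changes between likelihood calls.
out = np.empty(A_shape[1:3] + B_shape[1:])

def contract_into(out, A, B):
    # Fill `out` in place with the 'ijkl,lui->jkui' contraction.
    for i in range(A.shape[0]):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    return out

def log_likelihood(A, B):
    contract_into(out, A, B)           # reuses the preallocated buffer
    return -0.5 * np.sum(out ** 2)     # placeholder for the real likelihood

A = np.random.rand(*A_shape)
B = np.random.rand(*B_shape)
print(log_likelihood(A, B))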
Even when used in a small loop, tensordot is more than 10x faster:
timeit(lambda:np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"),number=5)/5
# 3.052245747600682
timeit(lambda:np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1),number=10)/10
# 0.23842503569903784
out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True
I have two arrays, A (shape 10x10x36) and B (shape 10x27x36). I would like to multiply the last 2 axes and sum the result along axis 0 so that the result C is of shape 10x27. Here is how I currently do it:
C = []
for i in range(A.shape[0]):
    C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't seem to find out how. I have checked out np.einsum but am not yet sure how to apply it to achieve this result. Any help will be appreciated. Thanks!
Here the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
However, on my machine the for-loop implementation runs almost 2x faster:
def f(A, B):
    C = []
    for i in range(A.shape[0]):
        C.append(np.matmul(A[i], B[i].T))
    return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
matmul supports stacking. You can simply do:
(A @ B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A @ B.transpose(0,2,1)).sum(0), C)
# True
timeit(lambda: (A @ B.transpose(0,2,1)).sum(0), number=1000)
# 0.03199950899579562
# twice as fast as original loop
You could also try the following, using a list comprehension. It's a bit more concise than what you are currently using.
C = np.array([A[i] @ B.T[:,:,i] for i in range(10)]).sum(0)
I have two 3D tensors: tensor A, which has shape [B,N,S], and tensor B, which also has shape [B,N,S]. What I want to get is a third tensor C, which I expect to have shape [B,B,N], where the element C[i,j,k] = np.dot(A[i,k,:], B[j,k,:]). I also want to achieve this in a vectorized way.
Some further info: The two tensors A and B have shape [Batch_size, Num_vectors, Vector_size]. The tensor C, is supposed to represent the dot product between each element in the batch from A and each element in the batch from B, between all of the different vectors.
Hope that it is clear enough and looking forward to you answers!
In [331]: A=np.random.rand(100,200,300)
In [332]: B=A
The suggested einsum, working directly from the
C[i,j,k] = np.dot(A[i,k,:], B[j,k,:])
expression:
In [333]: np.einsum( 'ikm, jkm-> ijk', A, B).shape
Out[333]: (100, 100, 200)
In [334]: timeit np.einsum( 'ikm, jkm-> ijk', A, B).shape
800 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
matmul does a dot on the last 2 dimensions, and treats the leading one(s) as batch. In your case 'k' is the batch dimension, and 'm' is the one that should obey the "last axis of A, second-to-last of B" rule. So rewriting ikm,jkm... to fit, and transposing A and B accordingly:
In [335]: np.einsum('kim,kmj->kij', A.transpose(1,0,2), B.transpose(1,2,0)).shape
Out[335]: (200, 100, 100)
In [336]: timeit np.einsum('kim,kmj->kij',A.transpose(1,0,2), B.transpose(1,2,0)).shape
774 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not much difference in performance. But now use matmul:
In [337]: (A.transpose(1,0,2) @ B.transpose(1,2,0)).transpose(1,2,0).shape
Out[337]: (100, 100, 200)
In [338]: timeit (A.transpose(1,0,2) @ B.transpose(1,2,0)).transpose(1,2,0).shape
64.4 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
and verify that the values match (though more often than not, if the shapes match, the values do too).
In [339]: np.allclose((A.transpose(1,0,2) @ B.transpose(1,2,0)).transpose(1,2,0),np.einsum( 'ikm, jkm->
...: ijk', A, B))
Out[339]: True
I won't try to measure memory usage, but the time improvement suggests it too is better.
In some cases einsum is optimized to use matmul. Here that doesn't seem to be the case, though we could play with its parameters. I'm a little surprised the matmul is doing so much better.
===
I vaguely recall another SO question about matmul taking a shortcut when the two arrays are the same thing, A@A. I used B=A in these tests.
In [350]: timeit (A.transpose(1,0,2) @ B.transpose(1,2,0)).transpose(1,2,0).shape
60.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [352]: B2=np.random.rand(100,200,300)
In [353]: timeit (A.transpose(1,0,2) @ B2.transpose(1,2,0)).transpose(1,2,0).shape
97.4 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But that only made a modest difference.
In [356]: np.__version__
Out[356]: '1.16.4'
My BLAS etc is standard Linux, nothing special.
I think you can use einsum such as:
np.einsum( 'ikm, jkm-> ijk', A, B)
With the subscripts 'ikm,jkm->ijk' you specify, using the Einstein convention, which dimensions are reduced. The third dimension of both arrays A and B, here named 'm', is reduced just as the dot operation does on vectors.
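A quick spot-check of that definition on small assumed shapes (the shapes and indices below are made up for illustration):

import numpy as np

A = np.random.rand(6, 4, 3)   # assumed [B, N, S]
B = np.random.rand(6, 4, 3)

C = np.einsum('ikm,jkm->ijk', A, B)
print(C.shape)  # (6, 6, 4)

# Check one element against C[i,j,k] = dot(A[i,k,:], B[j,k,:]).
i, j, k = 2, 5, 1
print(np.isclose(C[i, j, k], np.dot(A[i, k], B[j, k])))  # True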
Try:
C = np.diagonal( np.tensordot(A,B, axes=(2,2)), axis1=1, axis2=3)
from https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot
Explanation
The solution is a composition of two operations. First, the tensor product of A and B over their third axes, as you want. This outputs a rank-4 tensor, which you then reduce to a rank-3 tensor by taking equal indices on axes 1 and 3 (your k in your notation; note that tensordot gives a different axis order than your maths). This can be done by taking the diagonal, just as you would when reducing a matrix to the vector of its diagonal entries.
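A short sketch of that composition on small assumed sizes, checked against the einsum form from the other answer:

import numpy as np

Bsz, N, S = 5, 7, 3                      # small assumed sizes
A = np.random.rand(Bsz, N, S)
B = np.random.rand(Bsz, N, S)

# Rank-4 tensor product over the last axes, shape (Bsz, N, Bsz, N) ...
full = np.tensordot(A, B, axes=(2, 2))
# ... reduced to (Bsz, Bsz, N) by taking the diagonal over the two N axes.
C = np.diagonal(full, axis1=1, axis2=3)

print(C.shape)                                           # (5, 5, 7)
print(np.allclose(C, np.einsum('ikm,jkm->ijk', A, B)))   # True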
I have a numpy array and I need to get (without changing the original) the same array rotated by one position, so that the last item ends up at the front. Since I am using this a lot, I am looking for a clean way of getting this.
So, for example, if my original array is [1,2,3,4], I would like to get the array [4,1,2,3] without modifying the original array.
I found one solution:
x = [1,2,3,4]
a = np.append(x[-1], x[:-1])
However, I am looking for a more Pythonic way. Basically something like this:
x = [1,2,3,4]
a = x[(:1,0)]
However, this of course doesn't work. Is there a better way of doing what I want than using the append() function?
np.roll is easy to use, but not the fastest method. It is general purpose, with multiple dimensions and shifts.
Its action can be simplified to:
def simple_roll(x):
    res = np.empty_like(x)
    res[0] = x[-1]
    res[1:] = x[:-1]
    return res
In [90]: np.roll(np.arange(1,5),1)
Out[90]: array([4, 1, 2, 3])
In [91]: simple_roll(np.arange(1,5))
Out[91]: array([4, 1, 2, 3])
time tests:
In [92]: timeit np.roll(np.arange(1001),1)
36.8 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [93]: timeit simple_roll(np.arange(1001))
5.54 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We could also use r_ to construct one index array to do the copy. But it is slower (due to advanced indexing as opposed to slicing):
def simple_roll1(x):
    idx = np.r_[-1, 0:x.shape[0]-1]
    return x[idx]
In [101]: timeit simple_roll1(np.arange(1001))
34.2 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can use np.roll; from the docs:
Roll array elements along a given axis. Elements that roll beyond the last position are re-introduced at the first.
np.roll([1,2,3,4], 1)
# array([4, 1, 2, 3])
To roll in the other direction, use a negative shift:
np.roll([1,2,3,4], -1)
# array([2, 3, 4, 1])
I am just curious!
Is there any lower limit below which we shouldn't use pandas?
Using pandas for large data is good, considering the efficiency and readability.
But is there any lower limit below which we should prefer traditional looping (Python 3) over pandas?
When should I consider using pandas or numpy?
As far as I know, pandas uses numpy (vector operations) under the hood quite extensively. Numpy is faster than pure Python because it is low level and has more memory-friendly behaviour than Python (in many cases). But it depends on what you are doing, of course. For numpy-based operations, pandas should have the same performance as numpy.
For general vector-like operations (e.g. column-wise apply) it will always be faster to use numpy / pandas.
"for" loops in Python, e.g. over pandas DataFrame rows, are slow.
If you need to apply non-vectorized, key-based lookups, pandas is a poor fit; better to go with something like dictionaries (a rough sketch follows below).
Use pandas when you need time series or DataFrame-like structures. Use numpy if you can organise your data as matrices / vectors (arithmetic).
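As a rough illustration of the key-based-lookup point (the column names and sizes are made up for the example; exact numbers will vary by machine):

import pandas as pd
from timeit import timeit

df = pd.DataFrame({'key': range(100000), 'value': range(100000)}).set_index('key')
lookup = df['value'].to_dict()

# Single-key lookups: a plain dict is typically much faster than .at / .loc.
print(timeit(lambda: df.at[50000, 'value'], number=10000))
print(timeit(lambda: lookup[50000], number=10000))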
Edit:
For very small Python objects, native Python might be faster because low-level libraries introduce a small overhead!
Numpy example:
In [21]: a = np.random.rand(10)
In [22]: a
Out[22]:
array([ 0.60555782, 0.14585568, 0.94783553, 0.59123449, 0.07151141,
0.6480999 , 0.28743679, 0.19951774, 0.08312469, 0.16396394])
In [23]: %timeit a.mean()
5.16 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For loop example:
In [24]: b = a.tolist()
In [25]: b
Out[25]:
[0.6055578242263301,
0.14585568245745317,
0.9478355284829876,
0.5912344944487721,
0.07151141037216913,
0.6480999041895205,
0.2874367896457555,
0.19951773879879775,
0.0831246913880146,
0.16396394311100215]
In [26]: def mean(x):
    ...:     s = 0
    ...:     for i in x:
    ...:         s += i
    ...:     return s / len(x)
    ...:
In [27]: mean(b)
Out[27]: 0.37441380071208025
In [28]: a.mean()
Out[28]: 0.37441380071208025
In [29]: %timeit mean(b)
608 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Oops, the Python for loop is faster here! It seems that numpy introduces a small overhead (maybe from interfacing to C) at each timeit iteration.
So let's try longer arrays.
In [34]: a = np.random.rand(int(1e6))
In [35]: %timeit a.mean()
599 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: b = a.tolist()
In [37]: %timeit mean(b)
31.8 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
OK, so my conclusion is that there is some minimum object size above which the use of low-level libraries like numpy and pandas pays off. If someone likes, please feel free to repeat the experiment with pandas.
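A hedged version of the same experiment with pandas (exact timings will differ per machine; the per-call overhead of a Series is expected to dominate for tiny data and to amortise for large data):

import numpy as np
import pandas as pd
from timeit import timeit

small = pd.Series(np.random.rand(10))
large = pd.Series(np.random.rand(int(1e6)))

print(timeit(small.mean, number=10000))   # overhead-dominated
print(timeit(large.mean, number=100))     # compute-dominated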