Here's some simple code that "batch multiplies" a 4D array a by a 3D array b:
from functools import reduce
import numpy as np
from operator import mul
def einsum(a, b):
    return np.einsum('ijkl,jkl->ikl', a, b)

def original(a, b):
    s0, s1, s2, s3 = a.shape
    c = np.empty((s0, s2, s3))
    for j in range(s3):
        for i in range(s2):
            c[:, j, i] = np.dot(a[:, :, j, i], b[:, j, i])
    return c
sz_a = (16, 4, 512, 512)
sz_b = (4, 512, 512)
a = np.random.random(reduce(mul, sz_a)).reshape(sz_a)
b = np.random.random(reduce(mul, sz_b)).reshape(sz_b)
For timing:
%timeit original(a, b)
395 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit einsum(a, b)
23.1 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd like to test out tensordot's performance to see how it compares, but I'm really having some trouble wrapping my head around how to use it here. If anyone is familiar enough to guide me with this, it would be greatly appreciated. Thank you!
My original thought was:
np.tensordot(a, b, axes=((1),(0)))
But that gives me a MemoryError so I don't think that's right...
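For what it's worth, a MemoryError is expected there: tensordot contracts the chosen axes and then forms the full outer product of every remaining axis, so the result would have shape (16, 512, 512, 512, 512), far too large to allocate. A shape-only illustration (added as a sketch, not part of the original question):

# tensordot keeps every non-contracted axis: a's remaining axes, then b's
a_rest = (16, 512, 512)   # axes of a left after contracting axis 1
b_rest = (512, 512)       # axes of b left after contracting axis 0
print(a_rest + b_rest)    # (16, 512, 512, 512, 512) -> roughly 8.8 TB of float64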
Time comparisons of your einsum with a matmul equivalent:
In [910]: timeit (a.transpose(2,3,0,1)@b[:,None].transpose(2,3,0,1)).transpose(2,3,0,1)[:,0]
90.5 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [911]: timeit np.einsum('ijkl,jkl->ikl', a, b)
92.7 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Times are close enough that I suspect einsum optimization is actually using matmul. Originally einsum used its own compiled sum-of-products iteration, but with recent changes it uses a variety of methods, including dot and matmul where they fit.
matmul was created to handle the case where the leading dimensions represent a stack of matrices. In your problem the last two dimensions are that stack, with the dot acting on the first two, which is why the transposes are needed above. dot, and its derivative tensordot, don't handle that kind of stacking.
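For readability, the matmul expression timed above can be wrapped in a function and checked against einsum (a sketch; the function name is ours, not from the original answer):

def batched_matmul(a, b):
    # move the "stack" axes (last two) to the front so matmul broadcasts over them,
    # treat b as a stack of column vectors, then move the axes back
    res = a.transpose(2, 3, 0, 1) @ b[:, None].transpose(2, 3, 0, 1)
    return res.transpose(2, 3, 0, 1)[:, 0]

np.allclose(batched_matmul(a, b), np.einsum('ijkl,jkl->ikl', a, b))  # True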
Related
I'm a looking for a way to compute 2D convolutions/correlations fast on large images, preferably in Python. My filters can be made separable, though I incur some added computational cost. The fastest way to compute a 2D convolution that I have found so far is using OpenCV. However, their separable-filter function, sepFilter2D, is slower than the non-separable function. Here are the timings I get on a 4-core laptop:
import cv2
import numpy as np
import timeit
M = 2048
N = 8192
K = 25
A = np.random.randn(M, N).astype('float32')
b = np.random.randn(1, K).astype('float32')
c = np.random.randn(K, 1).astype('float32')
bc = c*b
X = cv2.filter2D(A, -1, bc, borderType=cv2.BORDER_CONSTANT)
Y = cv2.sepFilter2D(A, -1, b, c, borderType=cv2.BORDER_CONSTANT)
Z = cv2.filter2D(cv2.filter2D(A, -1, b, borderType=cv2.BORDER_CONSTANT), -1, c, borderType=cv2.BORDER_CONSTANT)
# check that all give the same result
assert np.linalg.norm(X-Y)/np.linalg.norm(X) < 1e-6
assert np.linalg.norm(X-Z)/np.linalg.norm(X) < 1e-6
%timeit cv2.filter2D(A, -1, bc, borderType=cv2.BORDER_CONSTANT)
%timeit cv2.sepFilter2D(A, -1, b, c, borderType=cv2.BORDER_CONSTANT)
%timeit cv2.filter2D(cv2.filter2D(A, -1, b, borderType=cv2.BORDER_CONSTANT), -1, c, borderType=cv2.BORDER_CONSTANT)
309 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
445 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
123 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In the last version I tried running two 1D filters sequentially, and this is faster than the other two methods. Why is this? Even then, I was hoping to get a larger benefit from the separable filter. Is there a faster way to compute a 2D convolution with a separable filter?
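For reference only, the same two sequential 1-D passes could also be written with SciPy's ndimage routines; this is a sketch under the assumption that SciPy is an acceptable dependency, and it is not benchmarked here (correlate1d is used because cv2.filter2D performs correlation):

from scipy import ndimage
W = ndimage.correlate1d(A, b.ravel(), axis=1, mode='constant')  # row pass with the 1xK kernel
W = ndimage.correlate1d(W, c.ravel(), axis=0, mode='constant')  # column pass with the Kx1 kernel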
I need to solve a Finite Element Method problem and have to calculate the following C from A and B with a large M (M>1M). For example,
import numpy as np
M=4000000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
C = (A * B).sum(axis = -1) # need to be optimized
Could anyone come up with code that is faster than (A * B).sum(axis = -1)? You can reshape or re-arrange the axes of A, B, and C freely.
You can use np.einsum for a slightly more efficient approach, both in performance and memory usage:
M=40000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
out = (A * B).sum(axis = -1) # need to be optimized
%timeit (A * B).sum(axis = -1) # need to be optimized
# 5.23 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.einsum('ijk,jk->ij', A, B)
# 1.31 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(out, np.einsum('ijk,jk->ij', A, B))
# True
To speed up NumPy multiplication in general, one possible approach is using ctypes. However, as far as I know, this approach will probably give you limited performance improvements (if any).
You could use NumExpr like this for a 3x speedup:
import numpy as np
import numexpr as ne
M=40000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
%timeit out = (A * B).sum(axis = -1)
2.12 ms ± 57.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit me = ne.evaluate('sum(A*B,2)')
662 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
out = (A * B).sum(axis = -1)
me = ne.evaluate('sum(A*B,2)')
np.allclose(out,me)
Out[29]: True
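As an aside (not part of the original answer), NumExpr evaluates the expression in blocks across multiple threads, so the exact speedup is machine-dependent; the thread count can be adjusted if needed:

ne.set_num_threads(4)  # e.g. pin NumExpr to 4 threads to match physical cores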
I have a likelihood function that I am trying to sample with MCMC. I have used no for loops in the log likelihood itself, but I do call np.einsum() once.
Here's a sample of what my current code looks like:
A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4) # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine); is there any way to speed it up? One useful piece of information is that for each call of the log-likelihood function, while the actual values in the arrays A and B are changing, the dimensions of each array remain fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.
Well, one of the axes stays aligned in A (the first one) and B (the last one), stays in the output as well (the last one), and has a very small looping count of 4. So we could simply loop over that axis, using np.tensordot for a tensor sum-reduction at each step. The benefit of 4x less memory congestion when working with such large datasets might outweigh the 4x looping, because the compute per iteration is also 4x less.
Thus, a solution with tensordot would be -
def func1(A, B):
    out = np.empty(A.shape[1:3] + B.shape[1:])
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to re-iterate the importance of memory congestion and compute requirements: say we also want to sum-reduce the last axis of length 4; then we see a noticeable difference in timings for the optimal version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum.
Specific to the given problem
Given that the dimensions of A and B stay the same, the array initialization out = np.empty(A.shape[1:3] + B.shape[1:]) could be done as a one-time affair; each call of the log-likelihood function would then reuse that buffer with the proposed tensordot loop to update the output out.
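A minimal sketch of that reuse, assuming the shapes of A and B never change between calls (the function name and surrounding structure are illustrative, not from the original code):

out = np.empty(A.shape[1:3] + B.shape[1:])   # allocate once, up front

def log_likelihood(A, B, out=out):
    for i in range(len(A)):
        out[..., i] = np.tensordot(A[i], B[..., i], axes=(-1, 0))
    # ... compute and return the actual log-likelihood from `out` ...
    return out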
Even when used in a small loop tensordot is more than 10x faster:
timeit(lambda:np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"),number=5)/5
# 3.052245747600682
timeit(lambda:np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1),number=10)/10
# 0.23842503569903784
out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True
I have 2 arrays A (shape 10x10x36) and B (shape 10x27x36). I would like to multiply the last 2 axes and sum the result along axis 0 so that the result C is of shape 10x27. Here is currently how I do it:
C = []
for i in range(A.shape[0]):
    C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't seem to find out how. I have checked out np.einsum but not yet sure how to apply it to achieve the result. Any help will be appreciated. Thanks!
Here is the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
However, on my machine the for-loop implementation runs almost 2x faster:
def f(A, B):
    C = []
    for i in range(A.shape[0]):
        C.append(np.matmul(A[i], B[i].T))
    return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
matmul supports stacking. You can simply do:
(A@B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A@B.transpose(0,2,1)).sum(0),C)
# True
timeit(lambda:(A#B.transpose(0,2,1)).sum(0),number=1000)
# 0.03199950899579562
# twice as fast as original loop
You could also try the following using list comprehension. It's a bit more concise than what you are currently using.
C = np.array([A[i] @ B.T[:,:,i] for i in range(10)]).sum(0)
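To avoid hard-coding the stack size, the 10 can be replaced with A.shape[0] (a minor tweak, not from the original answer):

C = np.array([A[i] @ B.T[:,:,i] for i in range(A.shape[0])]).sum(0)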
I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each of the N rows, I want to compute the dot product of the (M,1) vector with its transpose, which returns an (M,M) matrix. One way to do it (inefficiently) is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is to use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j, m) matrix whose elements result from the dot product over the last dimension k)
The two methods do return the exact same result, but the runtime of this new version is disappointing (only a bit more than twice as fast):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement from using the vectorized einsum function, since the first method is very inefficient... Do you have an explanation for this? Does there exist a better way to do this calculation?
In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]@Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None]@Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
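As a quick added check (not in the original answer) that the broadcasted multiply produces the same stack of outer products as the einsum call:

np.allclose(Y[:,:,None]*Y[:,None,:], np.einsum('ijk,imk->ijm', Y1, Y1))
# True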
'vectorizing' gives big gains when doing many iterations on a simple operation. For fewer operations on a more complex operation, the gain isn't as great.
This is an old post, yet it covers the subject in detail: efficient outer product.
In particular, if you are interested in adding a numba dependency, that may be your fastest option.
Updating part of the numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List

@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)
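One practical note when timing these (an addition, not from the original answer): the first call to a numba-jitted function includes JIT compilation, so warm it up once before benchmarking:

_ = multi_outer_numba(Y)       # trigger compilation once
%timeit multi_outer_numba(Y)   # now measures only the compiled code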