Looping over np.einsum many times... Is there a faster way?

Looping over np.einsum many times... Is there a faster way? - python

I have a likelihood function that I am trying to sample with MCMC. I have used no for loops in the log likelihood itself, but I do call np.einsum() once.
Here's a sample of what my current code looks like:
A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4) # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine), is there any way to speed it up? One useful piece of information is that for each call of the log-likelihood function, while the actual values in the arrays A and B are changing, the dimensions of each array remains fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.

Well one of the axes stays aligned in A (first one) and B (last one) and stays in output as well (last one) and is a very small looping number of 4. So, we could simply loop over that one with with np.tensordot for a tensor sum-reduction. The benefit of 4x lesser memory congestion when working with such large datasets might overcome the 4x looping because the compute per iteration is also 4x lesser.
Thus, a solution with tensordot would be -
def func1(A, B):
out = np.empty(A.shape[1:3] + B.shape[1:])
for i in range(len(A)):
out[...,i] = np.tensordot(A[i], B[...,i],axes=(-1,0))
return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# #Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to re-iterate the importance of memory-congestion and compute requirement, let's say we want to sum-reduce the last axis of length 4 as well, then we will see a noticeable difference in timings for optimal version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum.
Specific to given problem
Given that dimensions of A and B stay the same, the array-initialization with out = np.empty(A.shape[1:3] + B.shape[1:]) could be done as a one-time affair and loop through each call of the log-likelihood function with the proposed looping over to use tensordot and update output out.

Even when used in a small loop tensordot is more than 10x faster:
timeit(lambda:np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"),number=5)/5
# 3.052245747600682
timeit(lambda:np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1),number=10)/10
# 0.23842503569903784
out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True

Related

Most efficient way to multiply a small matrix with a scalar in numpy

I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:
large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))
c = np.matmul(a, b)
In other words, multiplying matrix b with the scalar a[0].
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this, and they are practically equivalent:
%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there a more efficient way to compute this?
Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.

large_dimension = 1000
a = np.random.random((1,))
B = np.random.random((1, large_dimension))
%timeit np.matmul(a, B)
5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a[0] * B
5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Use just float
%timeit float(a[0]) * B
3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid memory allocation use "buffer"
buffer = np.empty_like(B)
%timeit np.multiply(float(a[0]), B, buffer)
2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
To avoid unnecessary getting attribute use "alias"
mul = np.multiply
%timeit mul(float(a[0]), B, buffer)
2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And I don't recommend using numpy scalars at all,
because if you avoid it, computation will be faster
a_float = float(a[0])
%timeit mul(a_float, B, buffer)
1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, if it's possible then initialize buffer out of loop once (of course, if you have something like loop :)
rng = range(1000)
%%timeit
for i in rng:
pass
24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for i in rng:
mul(a_float, B, buffer)
1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So,
"best_iteration_time" = (1.91 - 0.02) / 1000 => 1.89 (µs)
"speedup" = 5.43 / 1.89 = 2.87

In this case, it is probably faster to work with an element-wise multiplication but the time you see is mostly the overhead of Numpy (calling C functions from the CPython interpreter, wrapping/unwraping types, making checks, doing the operation, array allocations, etc.).
since this operation is repeated millions of times
This is the problem. Indeed, the CPython interpreter is very bad at doing things with a low latency. This is especially true when you work on Numpy types as calling a C code and performing checks for trivial operation is much slower than doing it in pure Python which is also much slower than compiled native C/C++ codes. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you move away from using CPython, or at least not a pure Python code. Instead, you can use Numba or Cython to mitigate the impact doing C calls, wrapping types, etc. If this is not enough, then you will need to write a native C/C++ code (or any similar language) unless you find exactly a dedicated Python package doing exactly that for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you works with a lot of pure Python types and you do not want to rewrite your code, then you can try the PyPy JIT.
Here is a simple example in Numba avoiding the (costly) creation/allocation of a new array (as well as many Numpy internal checks and calls) that is specifically written to solve your specific case:
#nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
val = a[0]
for i in range(b.shape[1]):
out[0,i] = b[0,i] * val
res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
At the time of writing, this solution is faster than all the others. As most of the time is spent in calling Numba and performing some internal checks, using Numba directly for the function containing the iteration loop should result in an even faster code.

import numpy as np
import numba
def matmult_numpy(matrix, c):
return np.matmul(c, matrix)
#numba.jit(nopython=True)
def matmult_numba(matrix, c):
return c*matrix
if __name__ == "__main__":
large_dimension = 1000
a = np.random.random((1, large_dimension))
c = np.random.random((1,))
About a factor of 3 speedup using Numba. Numba cognoscenti may be able to do better by explicitly casting the parameter "c" as a scalar
Check: The result of
%timeit matmult_numpy(a, c) 2.32 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit matmult_numba(a, c)
763 ns ± 6.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Numpy 3D matrix multiplication

I have 2 matrices A(shape 10x10x36) and B(shape 10x27x36). I would like to multiply the last 2 axes and sum the result along axis 0 so that the result C is of the shape 10x27. Here is currently how I do it
C = []
for i in range(A.shape[0]):
C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't seem to find out how. I have checked out np.einsum but not yet sure how to apply it to achieve the result. Any help will be appreciated. Thanks!

Here the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
However in my machine the for loop implementation runs almost 2x faster:
def f(A,B):
C = []
for i in range(A.shape[0]):
C.append(np.matmul(A[i], B[i].T))
return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

matmul supports stacking. You can simply do:
(A#B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A#B.transpose(0,2,1)).sum(0),C)
# True
timeit(lambda:(A#B.transpose(0,2,1)).sum(0),number=1000)
# 0.03199950899579562
# twice as fast as original loop

You could also try the following using list comprehension. It's a bit more concise than what you are currently using.
C=np.array([A[i] # B.T[:,:,i] for i in range(10)]).sum(0)

numpy - einsum vs naive implementation runtime performaned

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each N, I want to compute the dot product of the vector (M,1) with its transpose, which returns a (M,M) matrix. One way to do it inefficiently is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] # Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is the use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,k) matrix where its elements results from the dot product on the last dimension m)
The two methods does returns the exact same result, but the runtime of this new version is disappointing (only a bit more than twice the speed)
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement by using the vectorized einsum function since the first method is very inefficient... Do you have an explanation for this ? Does there exists a better way to do this calculation ?

In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]#Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the # in full 'batch' mode with:
In [65]: timeit Y[:,:,None]#Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'vectorizing' gives big gains when doing many iterations on a simple operation. For fewer operations on a more complex operation, the gain isn't as great.

This is an old post, yet covers the subject in many details: efficient outer product.
In particular if you are interested in adding numba dependency, that may be your fastest option.
Updating part of numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
#jit(nopython=True)
def outer_numba(a, b):
m = a.shape[0]
n = b.shape[0]
result = np.empty((m, n))
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result
#jit(nopython=True)
def multi_outer_numba(Y):
all_result = List()
for k in range(Y.shape[0]):
y = Y[k]
n = y.shape[0]
tmp_res = np.empty((n, n))
for i in range(n):
for j in range(n):
tmp_res[i, j] = y[i]*y[j]
all_result.append(tmp_res)
return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)

Python FOR loop vs. builtin reduce() and mul() functions?

Here I have two functions:
from functools import reduce
from operator import mul
def fn1(arr):
product = 1
for number in arr:
product *= number
return [product//x for x in arr]
def fn2(arr):
product = reduce(mul, arr)
return [product//x for x in arr]
Benchmark:
In [2]: arr = list(range(1,11))
In [3]: %timeit fn1(arr)
1.62 µs ± 23.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit fn2(arr)
1.88 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: arr = list(range(1,101))
In [6]: %timeit fn1(arr)
38.5 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit fn2(arr)
41 µs ± 463 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: arr = list(range(1,1001))
In [9]: %timeit fn1(arr)
4.23 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit fn2(arr)
4.24 ms ± 36.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: arr = list(range(1,10001))
In [12]: %timeit fn1(arr)
605 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: %timeit fn2(arr)
594 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here fn2() is marginally slower with the small lists. My understanding was that reduce() and mul() functions are both builtin functions, therefore they run at C speed and should be faster than the for loop. Probably because I have much more function calls (which also take some time) inside the fn2, it contributes to the end performance? But then the trend shows that fn2() outperforms fn1() with the larger lists. Why?

There could be a lot of reasons for this. First is CPU code execution predictions and compiler optimizations. As for me it shouldn't matter much for you which form of the code is faster if you choose proper algorithm. You need to use one that is suitable for your needs and looks better and leave performance to python compiler. Often faster options cause more problems with memory/readability/support. Also it's not guaranteed that little more complex code inside simple cycles wouldn't change performance, just because some optimizations could be applied.
-- Update --
If you want to increase performance of python on simple operations I would advice you to look at PyPy, Cython, nim which are made to gain performance of C when python type wrappers take too much.

these are always going to be very close, but for somewhat interesting reasons:
if the product is small (e.g. for the range(1,10)) then the functions are doing very little "useful" work, and everything is going into marshalling machine ints between Python objects so the functions can be invoked
if the product is large (e.g. the range(1,10001) case) the numbers become enormous (i.e. thousands of decimal digits) and the majority of time is spent multiplying very large numbers
for example, with Python 3.7.3 (in Linux 5.0.10):
from functools import reduce
from operator import mul
prod = reduce(mul, range(1, 10001))
prod has ~36k digits and consumes ~16KiB --- i.e. check with math.log10(prod) and sys.getsizeof(prod).
working with small (i.e. not bignum) products such as:
reduce(mul, [1] * 10001)
is ~50 times faster on my computer than when we need to use bignums as above. also notice that it's basically the same speed when using floating point numbers as compared to integers, e.g.
reduce(mul, [1.] * 10001)
only takes ~10% more time.
your additional code that makes an additional pass over the array seems to just be complicating issues hence I've ignored it --- microbenchmarks like these are awkward enough to get right!

What is the tensordot of this 4D einsum operation?

Here's a simple code that "batch multiplies" a 4D matrix a by 3D matrix b:
from functools import reduce
import numpy as np
from operator import mul
def einsum(a, b):
return np.einsum('ijkl,jkl->ikl', a, b)
def original(a, b):
s0, s1, s2, s3 = a.shape
c = np.empty((s0, s2, s3))
for j in range(s3):
for i in range(s2):
c[:, j, i] = np.dot(a[:, :, j, i], b[:, j, i])
return c
sz_a = (16, 4, 512, 512)
sz_b = (4, 512, 512)
a = np.random.random(reduce(mul, sz_a)).reshape(sz_a)
b = np.random.random(reduce(mul, sz_b)).reshape(sz_b)
For timing:
%timeit original(a, b)
395 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit einsum(a, b)
23.1 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd like to test out tensordot's performance to see how it compares, but I'm really having some trouble wrapping my ahead around how to use it here. If anyone is familiar enough to guide me with this, it would greatly appreciated. Thank you!
My original thought was:
np.tensordot(a, b, axes=((1),(0)))
But that gives me a MemoryError so I don't think that's right...

Time comparisons of your einsum with a matmul equivalent:
In [910]: timeit (a.transpose(2,3,0,1)#b[:,None].transpose(2,3,0,1)).transpose(2,3,0,1)[:
...: ,0]
90.5 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [911]: timeit np.einsum('ijkl,jkl->ikl', a, b)
92.7 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Times are close enough that I suspect einsum optimization is actually using matmul. Originally einsum used its own compiled sum-of-products iteration, but recent with recent changes it uses a variety of methods, including dot and matmul if they fit.
matmul was created to handle the case where the initial dimensions represent a stack of matrices. In your problem the last 2 dimensions are this stack, with the dot acting on the initial. matmul was created to handle this kind of stacked dots. dot, and its derivative tensordot don't handle that kind of stacking.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Looping over np.einsum many times... Is there a faster way? - python

Related

Most efficient way to multiply a small matrix with a scalar in numpy

Numpy 3D matrix multiplication

numpy - einsum vs naive implementation runtime performaned

Python FOR loop vs. builtin reduce() and mul() functions?

What is the tensordot of this 4D einsum operation?

Categories

Resources