Numpy 3D matrix multiplication

Numpy 3D matrix multiplication - python

I have 2 matrices A(shape 10x10x36) and B(shape 10x27x36). I would like to multiply the last 2 axes and sum the result along axis 0 so that the result C is of the shape 10x27. Here is currently how I do it
C = []
for i in range(A.shape[0]):
C.append(np.matmul(A[i], B[i].T))
C = np.sum(np.array(C), axis=0)
I want to achieve this in a vectorized manner but can't seem to find out how. I have checked out np.einsum but not yet sure how to apply it to achieve the result. Any help will be appreciated. Thanks!

Here the same result using np.einsum:
r1 = np.einsum('ijk,ilk->jl', A, B)
However in my machine the for loop implementation runs almost 2x faster:
def f(A,B):
C = []
for i in range(A.shape[0]):
C.append(np.matmul(A[i], B[i].T))
return np.sum(np.array(C), axis=0)
%timeit np.einsum('ijk,ilk->jl',A,B)
102 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f(A,B)
57.6 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

matmul supports stacking. You can simply do:
(A#B.transpose(0,2,1)).sum(0)
Checks (C is generated using OP's loop):
np.allclose((A#B.transpose(0,2,1)).sum(0),C)
# True
timeit(lambda:(A#B.transpose(0,2,1)).sum(0),number=1000)
# 0.03199950899579562
# twice as fast as original loop

You could also try the following using list comprehension. It's a bit more concise than what you are currently using.
C=np.array([A[i] # B.T[:,:,i] for i in range(10)]).sum(0)

Related

Looping over np.einsum many times... Is there a faster way?

I have a likelihood function that I am trying to sample with MCMC. I have used no for loops in the log likelihood itself, but I do call np.einsum() once.
Here's a sample of what my current code looks like:
A = np.random.rand(4,50,60,200) # Random NDarray
B = np.random.rand(200,1000,4) # Random NDarray
out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
The output out has dimensions (50,60,1000,4). This calculation is a bit too slow to allow for efficient MCMC sampling (~4 seconds on my machine), is there any way to speed it up? One useful piece of information is that for each call of the log-likelihood function, while the actual values in the arrays A and B are changing, the dimensions of each array remains fixed. I'd imagine this could be useful in speeding things up, since the same elements are always being multiplied together.

Well one of the axes stays aligned in A (first one) and B (last one) and stays in output as well (last one) and is a very small looping number of 4. So, we could simply loop over that one with with np.tensordot for a tensor sum-reduction. The benefit of 4x lesser memory congestion when working with such large datasets might overcome the 4x looping because the compute per iteration is also 4x lesser.
Thus, a solution with tensordot would be -
def func1(A, B):
out = np.empty(A.shape[1:3] + B.shape[1:])
for i in range(len(A)):
out[...,i] = np.tensordot(A[i], B[...,i],axes=(-1,0))
return out
Timings -
In [70]: A = np.random.rand(4,50,60,200) # Random NDarray
...: B = np.random.rand(200,1000,4) # Random NDarray
...: out = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
# Einsum solution without optimize
In [71]: %timeit np.einsum('ijkl,lui->jkui', A, B)
2.89 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Einsum solution with optimize
In [72]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.79 s ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# #Paul Panzer's soln
In [74]: %timeit np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
183 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit func1(A,B)
158 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to re-iterate the importance of memory-congestion and compute requirement, let's say we want to sum-reduce the last axis of length 4 as well, then we will see a noticeable difference in timings for optimal version -
In [78]: %timeit np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
2.76 s ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [79]: %timeit np.einsum('ijkl,lui->jku', A, B, optimize="optimal")
93.8 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, in that case, it would be better to go with einsum.
Specific to given problem
Given that dimensions of A and B stay the same, the array-initialization with out = np.empty(A.shape[1:3] + B.shape[1:]) could be done as a one-time affair and loop through each call of the log-likelihood function with the proposed looping over to use tensordot and update output out.

Even when used in a small loop tensordot is more than 10x faster:
timeit(lambda:np.einsum('ijkl,lui->jkui', A, B, optimize="optimal"),number=5)/5
# 3.052245747600682
timeit(lambda:np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1),number=10)/10
# 0.23842503569903784
out_td = np.stack([np.tensordot(a,b,1) for a,b in zip(A,B.transpose(2,0,1))],-1)
out_es = np.einsum('ijkl,lui->jkui', A, B, optimize="optimal")
np.allclose(out_td,out_es)
# True

numpy - einsum vs naive implementation runtime performaned

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each N, I want to compute the dot product of the vector (M,1) with its transpose, which returns a (M,M) matrix. One way to do it inefficiently is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] # Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is the use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,k) matrix where its elements results from the dot product on the last dimension m)
The two methods does returns the exact same result, but the runtime of this new version is disappointing (only a bit more than twice the speed)
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement by using the vectorized einsum function since the first method is very inefficient... Do you have an explanation for this ? Does there exists a better way to do this calculation ?

In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]#Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the # in full 'batch' mode with:
In [65]: timeit Y[:,:,None]#Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
'vectorizing' gives big gains when doing many iterations on a simple operation. For fewer operations on a more complex operation, the gain isn't as great.

This is an old post, yet covers the subject in many details: efficient outer product.
In particular if you are interested in adding numba dependency, that may be your fastest option.
Updating part of numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
#jit(nopython=True)
def outer_numba(a, b):
m = a.shape[0]
n = b.shape[0]
result = np.empty((m, n))
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result
#jit(nopython=True)
def multi_outer_numba(Y):
all_result = List()
for k in range(Y.shape[0]):
y = Y[k]
n = y.shape[0]
tmp_res = np.empty((n, n))
for i in range(n):
for j in range(n):
tmp_res[i, j] = y[i]*y[j]
all_result.append(tmp_res)
return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)

Creating an array of numbers that add up to 1 with given length

I'm trying to use different weights for my model and I need those weights add up to 1 like this;
def func(length):
return ['a list of numbers add up to 1 with given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?

This can be done quite simply using numpy arrays:
def func(length):
linArr = np.arange(1, length+1)
return linArr/sum(x)
First we create an array of length length ranging from 1 to length. Then we normalize the sum.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
linArr = np.arange(1, length+1)
arrSum = length * (length+1) // 2
return linArr/arrSum

For large inputs, you might find that using np.linspace is faster than the accepted answer
def f1(length):
linArr = np.arange(1, length+1)
arrSum = length * (length+1) // 2
return linArr/arrSum
def f2(l):
delta = 2/(l*(l+1))
return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta) which should be even faster, but this does present the risk of rounding errors causing the array to have lengths different from l (as will happen e.g. for l = 10000000).
If speed is more important than code style, it might also possible to squeeze out a bit more by using Numba:
from numba import jit
#jit
def f3(l):
a = np.empty(l, dtype=np.float64)
delta = 2/(l*(l+1))
for n in range(l):
a[n] = (n+1)*delta
return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange
#njit(parallel=True)
def f4(l, num_parallel=4):
a = np.empty(l, dtype=np.float64)
delta = 2/(l*(l+1))
for j in prange(num_parallel):
# The last iteration gets whatever's left from rounding
offset = 0 if j != num_parallel - 1 else l % num_parallel
for n in range(l//num_parallel + offset):
i = j*(l//num_parallel) + n
a[i] = (i+1)*delta
return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

What is the tensordot of this 4D einsum operation?

Here's a simple code that "batch multiplies" a 4D matrix a by 3D matrix b:
from functools import reduce
import numpy as np
from operator import mul
def einsum(a, b):
return np.einsum('ijkl,jkl->ikl', a, b)
def original(a, b):
s0, s1, s2, s3 = a.shape
c = np.empty((s0, s2, s3))
for j in range(s3):
for i in range(s2):
c[:, j, i] = np.dot(a[:, :, j, i], b[:, j, i])
return c
sz_a = (16, 4, 512, 512)
sz_b = (4, 512, 512)
a = np.random.random(reduce(mul, sz_a)).reshape(sz_a)
b = np.random.random(reduce(mul, sz_b)).reshape(sz_b)
For timing:
%timeit original(a, b)
395 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit einsum(a, b)
23.1 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'd like to test out tensordot's performance to see how it compares, but I'm really having some trouble wrapping my ahead around how to use it here. If anyone is familiar enough to guide me with this, it would greatly appreciated. Thank you!
My original thought was:
np.tensordot(a, b, axes=((1),(0)))
But that gives me a MemoryError so I don't think that's right...

Time comparisons of your einsum with a matmul equivalent:
In [910]: timeit (a.transpose(2,3,0,1)#b[:,None].transpose(2,3,0,1)).transpose(2,3,0,1)[:
...: ,0]
90.5 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [911]: timeit np.einsum('ijkl,jkl->ikl', a, b)
92.7 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Times are close enough that I suspect einsum optimization is actually using matmul. Originally einsum used its own compiled sum-of-products iteration, but recent with recent changes it uses a variety of methods, including dot and matmul if they fit.
matmul was created to handle the case where the initial dimensions represent a stack of matrices. In your problem the last 2 dimensions are this stack, with the dot acting on the initial. matmul was created to handle this kind of stacked dots. dot, and its derivative tensordot don't handle that kind of stacking.

Improve performance in lists

I have a problem where I am trying to take a randomly ordered list and I want to know how many elements with a greater index than the current element are smaller in value than the current element.
For example:
[1,2,5,3,7,6,8,4]
should return:
[0,0,2,0,2,1,1,0]
This is the code I have that is currently working.
bribe_array = [0] * len(q)
for i in range(0, len(bribe_array)-1):
bribe_array[i] = sum(j<q[i] for j in q[(i+1):])
This does produce the desired array but it runs slowly. What is the more pythonic way to get this accomplished?

We could fiddle around with the code in the question, but still it would be an O(n^2) algorithm. To truly improve the performance is not a matter of making the implementation more or less pythonic, but to use a different approach with a helper data structure.
Here's an outline for an O(n log n) solution: implement a self-balancing BST (AVL or red-black are good options), and additionally store in each node an attribute with the size of the subtree rooted in it. Now traverse the list from right to left and insert all its elements in the tree as new nodes. We also need an extra output list of the same size of the input list to keep track of the answer.
For every node we insert in the tree, we compare its key with the root. If it's greater than the value in the root, it means that it's greater than all the nodes in the left subtree, hence we need to add the size of the left subtree to the answer list at the position of the element we're trying to insert.
We keep doing this recursively and updating the size attribute in each node we visit, until we find the right place to insert the new node, and proceed to the next element in the input list. In the end the output list will contain the answer.

Another option that's much simpler than implementing a balanced BST is to adapt merge sort to count inversions and accumulate them during the process. Clearly, any single swap is an inversion so the lower-indexed element gets one count. Then during the merge traversal, simply keep track of how many elements from the right group have moved to the left and add that count for elements added to the right group.
Here's a very crude illustration :)
[1,2,5,3,7,6,8,4]
sort 1,2 | 5,3
3,5 -> 5: 1
merge
1,2,3,5
sort 7,6 | 8,4
6,7 -> 7: 1
4,8 -> 8: 1
merge
4 -> 6: 1, 7: 2
4,6,7,8
merge 1,2,3,5 | 4,6,7,8
1,2,3,4 -> 1 moved
5 -> +1 -> 5: 2
6,7,8

There are several ways of speeding up your code without touching the overall computational complexity.
This is so because there are several ways of writing this very algorithm.
Let's start with your code:
def bribe_orig(q):
bribe_array = [0] * len(q)
for i in range(0, len(bribe_array)-1):
bribe_array[i] = sum(j<q[i] for j in q[(i+1):])
return bribe_array
This is of somewhat mixed style: firstly, you generate a list of zeros (which is not really needed, as you can append items on demand; secondly, the outer list uses a range() which is sub-optimal given that you would like to access a specific item multiple times and hence a local name would be faster; thirdly, you write a generator inside sum() which is also sub-optimal since it will be summing up booleans and hence perform implicit conversions all the time.
A cleaner approach would be:
def bribe(items):
result = []
for i, item in enumerate(items):
partial_sum = 0
for x in items[i + 1:]:
if x < item:
partial_sum += 1
result.append(partial_sum)
return result
This is somewhat simpler and since it does a number of things explicitly, and only performing a summation when necessary (thus skipping when you would be adding 0), it may be faster.
Another way of writing your code in a more compact way is:
def bribe_compr(items):
return [sum(x < item for x in items[i + 1:]) for i, item in enumerate(items)]
This involves the use of generators and list comprehensions, but also the outer loop is written with enumerate() following the typical Python style.
But Python is infamously slow in raw looping, therefore when possible, vectorization can be helpful. One way of doing this (only for the inner loop) is with numpy:
import numpy as np
def bribe_np(items):
items = np.array(items)
return [np.sum(items[i + 1:] < item) for i, item in enumerate(items)]
Finally, it is possible to use a JIT compiler to speed up the plain Python loops using Numba:
import numba
bribe_jit = nb.jit(bribe)
As for any JIT it has some costs for the just-in-time compilation, which is eventually offset for large enough loops.
Unfortunately, Numba's JIT does not support all Python code, but when it does, like in this case, it can be pretty rewarding.
Let's look at some numbers.
Consider the input generated with the following:
import numpy as np
np.random.seed(0)
n = 10
q = np.random.randint(1, n, n)
On a small-sized inputs (n = 10):
%timeit bribe_orig(q)
# 228 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe(q)
# 20.3 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_compr(q)
# 216 µs ± 5.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_np(q)
# 133 µs ± 9.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_jit(q)
# 1.11 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
On a medium-sized inputs (n = 100):
%timeit bribe_orig(q)
# 20.5 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe(q)
# 741 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_compr(q)
# 18.9 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_np(q)
# 1.22 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_jit(q)
# 7.54 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
On a larger inputs (n = 10000):
%timeit bribe_orig(q)
# 1.99 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe(q)
# 60.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe_compr(q)
# 1.8 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe_np(q)
# 12.8 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_jit(q)
# 182 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From these results, we observe that we gain the most from substituting sum() with the explicit construct involving only Python loops.
The use of comprehensions does not land you above approx. 10% improvement over your code.
For larger inputs, the use of NumPy can be even faster than the explicit construct involving only Python loops.
However, you will get the real deal when you use the Numba's JITed version of bribe().

You can get better performance by progressively building a sorted list going from last to first in your array. Using a binary search algorithm on the sorted list for each element in the array, you get the index at which the element will be inserted which also happens to be the number of elements that are smaller in the ones already processed.
Collecting these insertion points will give you the expected result (in reverse order).
Here's an example:
a = [1,2,5,3,7,6,8,4]
from bisect import bisect_left
s = []
r = []
for x in reversed(a):
p = bisect_left(s,x)
r.append(p)
s.insert(p,x)
r = r[::-1]
print(r) #[0,0,2,0,2,1,1]
For this example, the progression will be as follows:
step 1: x = 4, p=0 ==> r=[0] s=[4]
step 2: x = 8, p=1 ==> r=[0,1] s=[4,8]
step 3: x = 6, p=1 ==> r=[0,1,1] s=[4,6,8]
step 4: x = 7, p=2 ==> r=[0,1,1,2] s=[4,6,7,8]
step 5: x = 3, p=0 ==> r=[0,1,1,2,0] s=[3,4,6,7,8]
step 6: x = 5, p=2 ==> r=[0,1,1,2,0,2] s=[3,4,5,6,7,8]
step 7: x = 2, p=0 ==> r=[0,1,1,2,0,2,0] s=[2,3,4,5,6,7,8]
step 8: x = 1, p=0 ==> r=[0,1,1,2,0,2,0,0] s=[1,2,3,4,5,6,7,8]
Reverse r, r = r[::-1] r=[0,0,2,0,2,1,1,0]
You will be performing N loops (size of the array) and the binary search performs in log(i) where i is 1 to N. So, smaller than O(N*log(N)). The only caveat is the performance of s.insert(p,x) which will introduce some variability depending on the order of the original list.
Overall the performance profile should be between O(N) and O(N*log(N)) with a worst case of O(n^2) when the array is already sorted.
If you only need to make your code a little faster and more concise, you could use a list comprehension (but that'll still be O(n^2) time):
r = [sum(v<p for v in a[i+1:]) for i,p in enumerate(a)]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Numpy 3D matrix multiplication - python

matmul supports stacking. You can simply do: (A#B.transpose(0,2,1)).sum(0) Checks (C is generated using OP's loop): np.allclose((A#B.transpose(0,2,1)).sum(0),C) # True timeit(lambda:(A#B.transpose(0,2,1)).sum(0),number=1000) # 0.03199950899579562 # twice as fast as original loop

You could also try the following using list comprehension. It's a bit more concise than what you are currently using. C=np.array([A[i] # B.T[:,:,i] for i in range(10)]).sum(0)

Related

Looping over np.einsum many times... Is there a faster way?

numpy - einsum vs naive implementation runtime performaned

Creating an array of numbers that add up to 1 with given length

What is the tensordot of this 4D einsum operation?

Improve performance in lists

Categories

Resources