multiple numpy dot products without a loop - python

Is it possible to compute several dot products without a loop?
say you have the following:
a = randn(100, 3, 3)
b = randn(100, 3, 3)
I want to get an array z of shape (100, 3, 3) such that for all i
z[i, ...] == dot(a[i, ...], b[i, ...])
in other words, which verifies:
for va, vb, vz in zip(a, b, z):
    assert (vz == dot(va, vb)).all()
The straightforward solution would be:
z = array([dot(va, vb) for va, vb in zip(a, b)])
which uses an implicit loop (list comprehension + array).
Is there a more efficient way to compute z?

np.einsum can be useful here. Try running this copy+pasteable code:
import numpy as np
a = np.random.randn(100, 3, 3)
b = np.random.randn(100, 3, 3)
z = np.einsum("ijk, ikl -> ijl", a, b)
z2 = np.array([ai.dot(bi) for ai, bi in zip(a, b)])
assert (z == z2).all()
einsum is compiled code and runs very fast, even compared to np.tensordot (which doesn't quite apply here, but is often applicable). Here are some stats:
In [8]: %timeit z = np.einsum("ijk, ikl -> ijl", a, b)
10000 loops, best of 3: 105 us per loop
In [9]: %timeit z2 = np.array([ai.dot(bi) for ai, bi in zip(a, b)])
1000 loops, best of 3: 1.06 ms per loop
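For what it's worth, on newer NumPy versions the same stacked product can also be written with np.matmul (or the @ operator), which treats the last two axes as matrices and broadcasts over the leading axis; a minimal sketch:
import numpy as np
a = np.random.randn(100, 3, 3)
b = np.random.randn(100, 3, 3)
# np.matmul multiplies the trailing (3, 3) matrices pairwise
# and broadcasts over the leading axis of length 100
z = np.matmul(a, b)            # equivalently: z = a @ b
z2 = np.einsum("ijk, ikl -> ijl", a, b)
assert np.allclose(z, z2)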

Try Einstein summation in numpy:
z = np.einsum('...ij,...jk->...ik', a, b)
It's elegant and does not require you to write a loop, as you requested.
It gives me a factor of 4.8 speed increase on my system:
%timeit z = array([dot(va, vb) for va, vb in zip(a, b)])
1000 loops, best of 3: 454 µs per loop
%timeit z = np.einsum('...ij,...jk->...ik', a, b)
10000 loops, best of 3: 94.6 µs per loop
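As a side note (a minimal sketch), the ellipsis form also broadcasts over any number of leading axes, not just one:
import numpy as np
a = np.random.randn(10, 100, 3, 3)   # an extra leading axis
b = np.random.randn(10, 100, 3, 3)
# the last two axes are treated as matrices, everything in front is broadcast
z = np.einsum('...ij,...jk->...ik', a, b)
assert z.shape == (10, 100, 3, 3)
assert np.allclose(z[2, 5], np.dot(a[2, 5], b[2, 5]))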

This solution still uses a loop, but is faster because it avoids the unnecessary creation of temporary arrays by using the out argument of dot:
def dotloop(a, b):
    res = np.empty(a.shape)
    for ai, bi, resi in zip(a, b, res):
        np.dot(ai, bi, out=resi)
    return res
%timeit dotloop(a,b)
1000 loops, best of 3: 453 us per loop
%timeit array([dot(va, vb) for va, vb in zip(a, b)])
1000 loops, best of 3: 843 us per loop
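A quick sanity check (a minimal sketch, regenerating the example arrays) that dotloop agrees with the list-comprehension version:
import numpy as np
a = np.random.randn(100, 3, 3)
b = np.random.randn(100, 3, 3)
# both should produce the same (100, 3, 3) stack of products
assert np.allclose(dotloop(a, b),
                   np.array([np.dot(va, vb) for va, vb in zip(a, b)]))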

In addition to the other answers, I want to add that:
np.einsum("ijk, ijk -> ij", a, b)
Is suitable for a related case I encountered, where you have two 3D arrays consisting of matching 2D fields of 2D vectors (points or directions). This gives a kind of "element-wise" dot product between those 2D vectors.
For example:
np.einsum("ijk, ijk -> ij", [[[1,2],[3,4]]], [[[5,6],[7,8]]])
# => array([[17, 53]])
Where:
np.dot([1,2],[5,6])
# => 17
np.dot([3,4],[7,8])
# => 53
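For what it's worth, the same "element-wise" dot product can also be written with plain broadcasting and a sum over the last axis (a minimal sketch):
import numpy as np
a = np.array([[[1, 2], [3, 4]]])
b = np.array([[[5, 6], [7, 8]]])
# multiply matching vector components, then sum over the vector axis
z = (a * b).sum(-1)
assert (z == np.einsum("ijk, ijk -> ij", a, b)).all()
# => array([[17, 53]])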

Related

Faster alternative to the (V A V^T).diagonal in python [duplicate]

Imagine having 2 numpy arrays:
> A, A.shape = (n,p)
> B, B.shape = (p,p)
Typically p is a smaller number (p <= 200), while n can be arbitrarily large.
I am doing the following:
result = np.diag(A.dot(B).dot(A.T))
As you can see, I am keeping only the n diagonal entries, however there is an intermediate (n x n) array calculated from which only the diagonal entries are kept.
I wish for a function like diag_dot(), which only calculates the diagonal entries of the result and does not allocate the complete memory.
A result would be:
> result = diag_dot(A.dot(B), A.T)
Is there a premade functionality like this and can this be done efficiently without the need for allocating the intermediate (n x n) array?
I think I got it on my own, but will nevertheless share the solution:
since taking only the diagonal of a matrix multiplication
> Z = np.diag(X.dot(Y))
is equivalent to taking, for each row of X, the scalar product with the corresponding column of Y, the previous statement is equivalent to:
> Z = (X * Y.T).sum(-1)
For the original variables this means:
> result = (A.dot(B) * A).sum(-1)
Please correct me if I am wrong but this should be it ...
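The identity is easy to check numerically; a minimal sketch with small random matrices:
import numpy as np
n, p = 50, 20
A = np.random.rand(n, p)
B = np.random.rand(p, p)
full = np.diag(A.dot(B).dot(A.T))    # builds the (n, n) intermediate
lean = (A.dot(B) * A).sum(-1)        # never forms the (n, n) array
assert np.allclose(full, lean)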
You can get almost anything you ever dreamed of with numpy.einsum. Until you start getting the hang of it, it basically seems like black voodoo...
>>> a = np.arange(15).reshape(5, 3)
>>> b = np.arange(9).reshape(3, 3)
>>> np.diag(np.dot(np.dot(a, b), a.T))
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,ji->i', np.dot(a, b), a.T)
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,ij->i', np.dot(a, b), a)
array([ 60, 672, 1932, 3840, 6396])
EDIT: You can actually get the whole thing in a single shot; it's ridiculous...
>>> np.einsum('ij,jk,ki->i', a, b, a.T)
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,jk,ik->i', a, b, a)
array([ 60, 672, 1932, 3840, 6396])
EDIT: You don't want to let it figure out too much on its own, though. I've also added the OP's answer to their own question for comparison.
n, p = 10000, 200
a = np.random.rand(n, p)
b = np.random.rand(p, p)
In [2]: %timeit np.einsum('ij,jk,ki->i', a, b, a.T)
1 loops, best of 3: 1.3 s per loop
In [3]: %timeit np.einsum('ij,ij->i', np.dot(a, b), a)
10 loops, best of 3: 105 ms per loop
In [4]: %timeit np.diag(np.dot(np.dot(a, b), a.T))
1 loops, best of 3: 5.73 s per loop
In [5]: %timeit (a.dot(b) * a).sum(-1)
10 loops, best of 3: 115 ms per loop
A pedestrian answer, which avoids the construction of large intermediate arrays, is:
result = np.empty([n,], dtype=A.dtype)
for i in xrange(n):
    result[i] = A[i,:].dot(B).dot(A[i,:])

Pairwise vdot using Numpy

I'm trying to compute the pairwise np.vdot of a complex 2D array x with itself. So the behaviour I want is:
X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        X[i, j] = np.vdot(x[i], x[j])
Is there a way to do this without the explicit loops? I tried using pairwise_kernels from sklearn but it assumes the input arrays are real numbers. I also tried broadcasting, but vdot flattens its inputs.
X = np.einsum('ik,jk->ij', np.conj(x), x)
is equivalent to
X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        X[i, j] = np.vdot(x[i], x[j])
np.einsum takes a sum of products. The subscript string 'ik,jk->ij' tells np.einsum that the second argument, np.conj(x), is an array with subscripts ik and the third argument, x, has subscripts jk. Thus, the product np.conj(x)[i,k]*x[j,k] is computed for all i,j,k. The sum is taken over the repeated subscript, k, and since that leaves i and j remaining, they become the subscripts of the resultant array.
For example,
import numpy as np
N, M = 10, 20
a = np.random.random((N,M))
b = np.random.random((N,M))
x = a + b*1j
def orig(x):
    X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            X[i, j] = np.vdot(x[i], x[j])
    return X

def alt(x):
    return np.einsum('ik,jk->ij', np.conj(x), x)
assert np.allclose(orig(x), alt(x))
In [307]: %timeit orig(x)
10000 loops, best of 3: 143 µs per loop
In [308]: %timeit alt(x)
100000 loops, best of 3: 8.63 µs per loop
To extend np.vdot to all rows, you can use np.tensordot; I am borrowing the conjugate idea straight off @unutbu's solution, like so -
np.tensordot(np.conj(x),x,axes=(1,1))
Basically, with np.tensordot we specify the axes to be reduced, which in this case is the last axis of both the conjugated version of x and x itself.
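A quick consistency check (a minimal sketch) that the tensordot form matches the einsum form:
import numpy as np
x = np.random.random((10, 20)) + 1j * np.random.random((10, 20))
X1 = np.einsum('ik,jk->ij', np.conj(x), x)
X2 = np.tensordot(np.conj(x), x, axes=(1, 1))   # contract the last axes
assert np.allclose(X1, X2)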
Runtime test -
Let's time @unutbu's solution with np.einsum and the proposed solution in this post -
In [27]: import numpy as np # From @unutbu's solution again
...:
...: N, M = 1000, 1000
...: a = np.random.random((N,M))
...: b = np.random.random((N,M))
...: x = a + b*1j
...:
In [28]: %timeit np.einsum('ik,jk->ij', np.conj(x), x) # @unutbu's solution
1 loops, best of 3: 4.45 s per loop
In [29]: %timeit np.tensordot(np.conj(x),x,axes=(1,1))
1 loops, best of 3: 3.76 s per loop

Tensor multiplication with numpy tensordot

I have a tensor U composed of n matrices of dimension (d,k) and a matrix V of dimension (k,n).
I would like to multiply them so that the result returns a matrix of dimension (d,n) in which column j is the result of the matrix multiplication between the matrix j of U and the column j of V.
One possible way to obtain this is:
for j in range(n):
    res[:,j] = U[:,:,j].dot(V[:,j])
I am wondering if there is a faster approach using numpy library. In particular I'm thinking of the np.tensordot() function.
This small snippet allows me to multiply a single matrix by a scalar, but the obvious generalization to a vector is not returning what I was hoping for.
a = np.array(range(1, 17))
a.shape = (4,4)
b = np.array((1,2,3,4,5,6,7))
r1 = np.tensordot(b,a, axes=0)
Any suggestion?
There are a couple of ways you could do this. The first thing that comes to mind is np.einsum:
# some fake data
gen = np.random.RandomState(0)
ni, nj, nk = 10, 20, 100
U = gen.randn(ni, nj, nk)
V = gen.randn(nj, nk)
res1 = np.zeros((ni, nk))
for k in range(nk):
    res1[:,k] = U[:,:,k].dot(V[:,k])
res2 = np.einsum('ijk,jk->ik', U, V)
print(np.allclose(res1, res2))
# True
np.einsum uses Einstein notation to express tensor contractions. In the expression 'ijk,jk->ik' above, i,j and k are subscripts that correspond to the different dimensions of U and V. Each comma-separated grouping corresponds to one of the operands passed to np.einsum (in this case U has dimensions ijk and V has dimensions jk). The '->ik' part specifies the dimensions of the output array. Any dimensions with subscripts that aren't present in the output string are summed over.
np.einsum is incredibly useful for performing complex tensor contractions, but it can take a while to fully wrap your head around how it works. You should take a look at the examples in the np.einsum documentation.
Some other options:
Element-wise multiplication with broadcasting, followed by summation:
res3 = (U * V[None, ...]).sum(1)
inner1d with a load of transposing:
from numpy.core.umath_tests import inner1d
res4 = inner1d(U.transpose(0, 2, 1), V.T)
Some benchmarks:
In [1]: ni, nj, nk = 100, 200, 1000
In [2]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: np.einsum('ijk,jk->ik', U, V)
....:
10 loops, best of 3: 23.4 ms per loop
In [3]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: (U * V[None, ...]).sum(1)
....:
10 loops, best of 3: 59.7 ms per loop
In [4]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: inner1d(U.transpose(0, 2, 1), V.T)
....:
10 loops, best of 3: 45.9 ms per loop
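Another option worth noting (a sketch only, not benchmarked above) is np.matmul, which views each U[:,:,k] as a matrix and each V[:,k] as a column vector:
import numpy as np
gen = np.random.RandomState(0)
ni, nj, nk = 10, 20, 100
U = gen.randn(ni, nj, nk)
V = gen.randn(nj, nk)
# move k to the front so matmul sees nk stacked (ni, nj) @ (nj, 1) products
res5 = np.matmul(U.transpose(2, 0, 1), V.T[:, :, None])[:, :, 0].T
assert np.allclose(res5, np.einsum('ijk,jk->ik', U, V))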

Efficient way to fill up a 4d array from entries of a product of two matrices

The title might not be as precise as I had hoped, but here is the problem. Basically I'm filling a 4d numpy array from the entries of the product of two matrices. Right now the code is the following:
M = P.dot(U)
C_arr = np.zeros((b_size,b_size,N,N))
for alpha in xrange(b_size):
    for beta in xrange(b_size):
        for i in xrange(N):
            for j in xrange(N):
                C_arr[alpha,beta,i,j] = np.conjugate(M[i,alpha])*M[j,beta]
It turns out that this function is called quite often and appears to be very time-consuming. I'm just beginning with Python and I suspect that there could be a more efficient way to write this function by avoiding those loops, but I haven't been able to figure it out by myself...
You can use numpy.einsum:
C = np.einsum('ia,jb->abij', M.conj(), M)
Or, since there is no actual sum being computed (i.e. this is a form of an outer product), you can use numpy broadcasting with regular array multiplication after reshaping the input matrix M appropriately:
nrows, ncols = M.shape
C = M.T.reshape(1, ncols, 1, nrows) * M.T.conj().reshape(ncols, 1, nrows, 1)
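A quick check (a minimal sketch) that the broadcasting form agrees with the einsum form, using a small complex M so the conjugation actually matters:
import numpy as np
M = np.random.rand(6, 4) + 1j * np.random.rand(6, 4)
nrows, ncols = M.shape
C1 = np.einsum('ia,jb->abij', M.conj(), M)
C2 = M.T.reshape(1, ncols, 1, nrows) * M.T.conj().reshape(ncols, 1, nrows, 1)
assert np.allclose(C1, C2)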
Apart from the terse np.einsum code listed in the other solution, you can also use an outer product with np.outer, like so -
np.outer(M.conj().ravel(),M.ravel()).reshape(N,b_size,N,b_size).transpose(1,3,0,2)
Runtime tests -
In [54]: # Create input and get shape parameters
...: M = np.random.rand(10,10)
...: N,b_size = M.shape
...:
In [55]: %timeit np.einsum('ia,jb->abij', M.conj(), M)
10000 loops, best of 3: 26 µs per loop
In [56]: %timeit np.outer(M.conj().ravel(),M.ravel()).reshape(N,b_size,N,b_size).transpose(1,3,0,2)
10000 loops, best of 3: 55.6 µs per loop
In [57]: # Create input and get shape parameters
...: M = np.random.rand(40,40)
...: N,b_size = M.shape
...:
In [58]: %timeit np.einsum('ia,jb->abij', M.conj(), M)
10 loops, best of 3: 31 ms per loop
In [59]: %timeit np.outer(M.conj().ravel(),M.ravel()).reshape(N,b_size,N,b_size).transpose(1,3,0,2)
10 loops, best of 3: 24.2 ms per loop
In [60]: # Create input and get shape parameters
...: M = np.random.rand(80,80)
...: N,b_size = M.shape
...:
In [61]: %timeit np.einsum('ia,jb->abij', M.conj(), M)
1 loops, best of 3: 497 ms per loop
In [62]: %timeit np.outer(M.conj().ravel(),M.ravel()).reshape(N,b_size,N,b_size).transpose(1,3,0,2)
1 loops, best of 3: 399 ms per loop
Thus, depending on the shape of the input array, you can go either way.

How to apply numpy.linalg.norm to each row of a matrix?

I have a 2D matrix and I want to take norm of each row. But when I use numpy.linalg.norm(X) directly, it takes the norm of the whole matrix.
I can take norm of each row by using a for loop and then taking norm of each X[i], but it takes a huge time since I have 30k rows.
Any suggestions to find a quicker way? Or is it possible to apply np.linalg.norm to each row of a matrix?
For numpy 1.9+
Note that, as perimosocordiae shows, as of NumPy version 1.9, np.linalg.norm(x, axis=1) is the fastest way to compute the L2-norm.
For numpy < 1.9
If you are computing an L2-norm, you could compute it directly (using the axis=-1 argument to sum along rows):
np.sum(np.abs(x)**2,axis=-1)**(1./2)
Lp-norms can be computed similarly, of course (see the sketch after the timings below).
It is considerably faster than np.apply_along_axis, though perhaps not as convenient:
In [48]: %timeit np.apply_along_axis(np.linalg.norm, 1, x)
1000 loops, best of 3: 208 us per loop
In [49]: %timeit np.sum(np.abs(x)**2,axis=-1)**(1./2)
100000 loops, best of 3: 18.3 us per loop
Other ord forms of norm can be computed directly too (with similar speedups):
In [55]: %timeit np.apply_along_axis(lambda row:np.linalg.norm(row,ord=1), 1, x)
1000 loops, best of 3: 203 us per loop
In [54]: %timeit np.sum(abs(x), axis=-1)
100000 loops, best of 3: 10.9 us per loop
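As mentioned above, a general Lp row-norm can be computed directly as well; a minimal sketch (p is whichever order you need):
import numpy as np
def row_norm(x, p=2):
    # Lp norm of each row: (sum_j |x_ij|**p) ** (1/p)
    return np.sum(np.abs(x) ** p, axis=-1) ** (1.0 / p)
x = np.random.random((5, 4))
assert np.allclose(row_norm(x, 2),
                   np.apply_along_axis(np.linalg.norm, 1, x))
assert np.allclose(row_norm(x, 1),
                   np.apply_along_axis(lambda row: np.linalg.norm(row, ord=1), 1, x))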
Resurrecting an old question due to a numpy update. As of the 1.9 release, numpy.linalg.norm now accepts an axis argument.
This is the new fastest method in town:
In [10]: x = np.random.random((500,500))
In [11]: %timeit np.apply_along_axis(np.linalg.norm, 1, x)
10 loops, best of 3: 21 ms per loop
In [12]: %timeit np.sum(np.abs(x)**2,axis=-1)**(1./2)
100 loops, best of 3: 2.6 ms per loop
In [13]: %timeit np.linalg.norm(x, axis=1)
1000 loops, best of 3: 1.4 ms per loop
And to prove it's calculating the same thing:
In [14]: np.allclose(np.linalg.norm(x, axis=1), np.sum(np.abs(x)**2,axis=-1)**(1./2))
Out[14]: True
Much faster than the accepted answer is using NumPy's einsum,
numpy.sqrt(numpy.einsum('ij,ij->i', a, a))
And even faster than that is arranging the data such that the norms are computed across all columns,
numpy.sqrt(numpy.einsum('ij,ij->j', aT, aT))
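To confirm these einsum forms compute the same row norms as np.linalg.norm (a minimal sketch):
import numpy as np
a = np.random.random((1000, 3))
aT = np.ascontiguousarray(a.T)    # transposed copy: rows of a become columns of aT
n1 = np.linalg.norm(a, axis=1)
n2 = np.sqrt(np.einsum('ij,ij->i', a, a))
n3 = np.sqrt(np.einsum('ij,ij->j', aT, aT))
assert np.allclose(n1, n2) and np.allclose(n1, n3)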
Note the log scale in the benchmark plot (the plot itself is not reproduced here). Code to reproduce the plot:
import numpy as np
import perfplot
rng = np.random.default_rng(0)
def setup(n):
    x = rng.random((n, 3))
    xt = np.ascontiguousarray(x.T)
    return x, xt

def sum_sqrt(a, _):
    return np.sqrt(np.sum(np.abs(a) ** 2, axis=-1))

def apply_norm_along_axis(a, _):
    return np.apply_along_axis(np.linalg.norm, 1, a)

def norm_axis(a, _):
    return np.linalg.norm(a, axis=1)

def einsum_sqrt(a, _):
    return np.sqrt(np.einsum("ij,ij->i", a, a))

def einsum_sqrt_columns(_, aT):
    return np.sqrt(np.einsum("ij,ij->j", aT, aT))

b = perfplot.bench(
    setup=setup,
    kernels=[
        sum_sqrt,
        apply_norm_along_axis,
        norm_axis,
        einsum_sqrt,
        einsum_sqrt_columns,
    ],
    n_range=[2**k for k in range(20)],
    xlabel="len(a)",
)
b.show()
b.save("out.png")
Try the following:
In [16]: numpy.apply_along_axis(numpy.linalg.norm, 1, a)
Out[16]: array([ 5.38516481, 1.41421356, 5.38516481])
where a is your 2D array.
The above computes the L2 norm. For a different norm, you could use something like:
In [22]: numpy.apply_along_axis(lambda row:numpy.linalg.norm(row,ord=1), 1, a)
Out[22]: array([9, 2, 9])
