Tensor multiplication with numpy tensordot - python

I have a tensor U composed of n matrices of dimension (d,k) and a matrix V of dimension (k,n).
I would like to multiply them so that the result returns a matrix of dimension (d,n) in which column j is the result of the matrix multiplication between the matrix j of U and the column j of V.
One possible way to obtain this is:
for j in range(n):
    res[:,j] = U[:,:,j].dot(V[:,j])
I am wondering if there is a faster approach using numpy library. In particular I'm thinking of the np.tensordot() function.
This small snippet allows me to multiply a single matrix by a scalar, but the obvious generalization to a vector is not returning what I was hoping for.
a = np.array(range(1, 17))
a.shape = (4,4)
b = np.array((1,2,3,4,5,6,7))
r1 = np.tensordot(b,a, axes=0)
Any suggestion?

There are a couple of ways you could do this. The first thing that comes to mind is np.einsum:
# some fake data
gen = np.random.RandomState(0)
ni, nj, nk = 10, 20, 100
U = gen.randn(ni, nj, nk)
V = gen.randn(nj, nk)
res1 = np.zeros((ni, nk))
for k in range(nk):
    res1[:,k] = U[:,:,k].dot(V[:,k])
res2 = np.einsum('ijk,jk->ik', U, V)
print(np.allclose(res1, res2))
# True
np.einsum uses Einstein notation to express tensor contractions. In the expression 'ijk,jk->ik' above, i, j and k are subscripts that correspond to the different dimensions of U and V. Each comma-separated grouping corresponds to one of the operands passed to np.einsum (in this case U has dimensions ijk and V has dimensions jk). The '->ik' part specifies the dimensions of the output array. Any dimension whose subscript is not present in the output string is summed over.
np.einsum is incredibly useful for performing complex tensor contractions, but it can take a while to fully wrap your head around how it works. You should take a look at the examples in the np.einsum documentation.
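For instance, ordinary matrix multiplication is just the contraction 'ij,jk->ik'. A minimal sketch (with illustrative shapes, not taken from the question):
import numpy as np
M = np.arange(6.0).reshape(2, 3)   # subscripts i, j
N = np.arange(12.0).reshape(3, 4)  # subscripts j, k
print(np.allclose(np.einsum('ij,jk->ik', M, N), M.dot(N)))
# True: j is the repeated subscript, so it is summed over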
Some other options:
Element-wise multiplication with broadcasting, followed by summation over the j axis (V[None, ...] has shape (1, nj, nk), which broadcasts against U's (ni, nj, nk)):
res3 = (U * V[None, ...]).sum(1)
inner1d with a load of transposing (note that numpy.core.umath_tests is a private module, and inner1d has been deprecated in newer numpy versions):
from numpy.core.umath_tests import inner1d
res4 = inner1d(U.transpose(0, 2, 1), V.T)
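For completeness, the same contraction can also be written as a batched matrix product by treating k as a batch axis (a sketch, assuming a numpy version with np.matmul, i.e. >= 1.10):
# (nk, ni, nj) @ (nk, nj, 1) -> (nk, ni, 1), then squeeze and transpose
res5 = np.matmul(U.transpose(2, 0, 1), V.T[..., None])[..., 0].T
print(np.allclose(res1, res5))
# True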
Some benchmarks:
In [1]: ni, nj, nk = 100, 200, 1000
In [2]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: np.einsum('ijk,jk->ik', U, V)
....:
10 loops, best of 3: 23.4 ms per loop
In [3]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: (U * V[None, ...]).sum(1)
....:
10 loops, best of 3: 59.7 ms per loop
In [4]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
....: inner1d(U.transpose(0, 2, 1), V.T)
....:
10 loops, best of 3: 45.9 ms per loop

Related

3D or 4D Numpy array indexed using 2D mask & entire matrix operation

I am trying to do some operation on an entire 3D or 4D array, but only on subgroups of smaller size (2D array contained in the bigger array).
Example:
input = np.arange(75).reshape((3, 5, 5)) # or any other 3D or 4D matrix.
mask_hor = np.arange(-1, 2)
mask_ver = mask_hor[:, None]
output = np.zeros((3, 3, 3))
for i in range(1, 5):
    for j in range(1, 5):
        output[:, i-1, j-1] = foo(input[:, i+mask_ver, j+mask_hor])
where foo is some sort of manipulation of the input
My question is:
Is there a method or mask I can apply to the input to get rid of the nested for loops? I am mainly looking for a speedup.
Thank you!
This is more of a quick and dirty optimization than anything elegant. For the sake of argument we're going to sum the 9 elements in the window as our foo function.
import numpy as np
from scipy import ndimage
# take the sum of a 3x3 window of a matrix
def foo_lin_mat(mat):
    return mat.sum(axis=(-2, -1)) # sum over the last two axes
# sum up the individual matrices
def foo_lin_nine(m1, m2, m3, m4, m5, m6, m7, m8, m9):
    return m1 + m2 + m3 + m4 + m5 + m6 + m7 + m8 + m9
# compute foo on an input matrix by shifting the mask around
def nestfor(input, foo):
    depth, n, m = input.shape
    output = np.zeros((depth, n - 2, m - 2))
    mask_hor = np.arange(-1, 2)
    mask_ver = mask_hor[:, None]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            output[:, i - 1, j - 1] = foo(input[:, i + mask_ver, j + mask_hor])
    return output
# compute foo on an input matrix by breaking the input matrix into 9 submatrices
def flatargs(input, foo):
    depth, n, m = input.shape
    return foo(input[:, :n-2, :m-2],
               input[:, 1:n-1, :m-2],
               input[:, 2:, :m-2],
               input[:, :n-2, 1:m-1],
               input[:, 1:n-1, 1:m-1],
               input[:, 2:, 1:m-1],
               input[:, :n-2, 2:],
               input[:, 1:n-1, 2:],
               input[:, 2:, 2:])
# compute the sum of a window using ndimage.convolve
def convolve(input, mask):
    out = ndimage.convolve(input, mask)
    # cut off the outer edges of the two spatial axes
    return out[:, 1:-1, 1:-1]
So we've got three functions that'll take a matrix and sum up individual 3x3 windows. I've confirmed that they spit out the same matrix at the end. As for benchmarking:
In [62]: %timeit nestfor(input, foo_lin_mat)
1000 loops, best of 3: 261 µs per loop
In [63]: %timeit flatargs(input, foo_lin_nine)
10000 loops, best of 3: 35.8 µs per loop
In [66]: mask = np.ones((1,3,3))
In [69]: %timeit convolve(input, mask)
The slowest run took 6.12 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 42.2 µs per loop
That is, the flatargs version is about 7x faster than the original nested for-loop version.
If your foo function is a linear function of the input windows, you can also use the ndimage.convolve function to do the windowing, as in the convolve function here. The final code might be a bit easier to read, but you would have to be careful about what array you use for the mask.
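To illustrate that caveat, here is a minimal sketch (with made-up weights, not from the question) showing that ndimage.convolve flips the mask, so a non-symmetric weighted window usually calls for ndimage.correlate instead:
import numpy as np
from scipy import ndimage
x = np.arange(75, dtype=float).reshape((3, 5, 5))
weights = np.arange(9, dtype=float).reshape((1, 3, 3)) # a non-symmetric mask
conv = ndimage.convolve(x, weights)[:, 1:-1, 1:-1]
corr = ndimage.correlate(x, weights)[:, 1:-1, 1:-1]
print(np.allclose(conv, corr))
# False: convolve applies the flipped mask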

Dot product of two sparse matrices affecting zero values only

I'm trying to compute a simple dot product but leave nonzero values from the original matrix unchanged. A toy example:
import numpy as np
A = np.array([[2, 1, 1, 2],
              [0, 2, 1, 0],
              [1, 0, 1, 1],
              [2, 2, 1, 0]])
B = np.array([[ 0.54331039, 0.41018682, 0.1582158 , 0.3486124 ],
              [ 0.68804647, 0.29520239, 0.40654206, 0.20473451],
              [ 0.69857579, 0.38958572, 0.30361365, 0.32256483],
              [ 0.46195299, 0.79863505, 0.22431876, 0.59054473]])
Desired outcome:
C = np.array([[ 2.        , 1.        , 1.        , 2.        ],
              [ 2.07466874, 2.        , 1.        , 0.73203386],
              [ 1.        , 1.5984076 , 1.        , 1.        ],
              [ 2.        , 2.        , 1.        , 1.42925865]])
The actual matrices in question, however, are sparse and look more like this:
A = sparse.rand(250000, 1700, density=0.001, format='csr')
B = sparse.rand(1700, 1700, density=0.02, format='csr')
One simple way to go would be just setting the values using a mask index, like this:
mask = A != 0
C = A.dot(B)
C[mask] = A[mask]
However, my original arrays are sparse and quite large, so changing them via index assignment is painfully slow. Conversion to a lil matrix helps a bit, but again, the conversion itself takes a lot of time.
The other obvious approach, I guess, would be to just resort to iteration and skip masked values, but I'd rather not throw away the benefits of numpy/scipy-optimized array multiplication.
Some clarifications: I'm actually interested in a special case where B is always square, and therefore A and C have the same shape. So if there's a solution that doesn't work on arbitrary arrays but fits my case, that's fine.
UPDATE: Some attempts:
from scipy import sparse
import numpy as np
def naive(A, B):
    mask = A != 0
    out = A.dot(B).tolil()
    out[mask] = A[mask]
    return out.tocsr()

def proposed(A, B):
    Az = A == 0
    R, C = np.where(Az)
    out = A.copy()
    out[Az] = np.einsum('ij,ji->i', A[R], B[:, C])
    return out
%timeit naive(A, B)
1 loops, best of 3: 4.04 s per loop
%timeit proposed(A, B)
/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py:215: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.pyc in __init__(self, arg1, shape, dtype, copy)
173 self.shape = M.shape
174
--> 175 self.row, self.col = M.nonzero()
176 self.data = M[self.row, self.col]
177 self.has_canonical_format = True
MemoryError:
ANOTHER UPDATE:
Couldn't make anything more or less useful out of Cython, at least without going too far away from Python. The idea was to leave the dot product to scipy and just try to set those original values as fast as possible, something like this:
cimport cython
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef coo_replace(int [:] row1, int [:] col1, float [:] data1, int[:] row2, int[:] col2, float[:] data2):
    cdef int N = row1.shape[0]
    cdef int M = row2.shape[0]
    cdef int i, j
    cdef dict d = {}
    for i in range(M):
        d[(row2[i], col2[i])] = data2[i]
    for j in range(N):
        if (row1[j], col1[j]) in d:
            data1[j] = d[(row1[j], col1[j])]
This was a bit better than my earlier "naive" implementation (using .tolil()), but following hpaulj's approach, lil can be thrown out. Maybe replacing the python dict with something like std::map would help.
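For reference, here's a pure-Python sketch of the same replacement logic (same argument order as the Cython version above). It also shows the approach's one limitation: positions where A is nonzero but the product has no stored entry are never inserted.
import numpy as np
from scipy import sparse
def coo_replace_py(row1, col1, data1, row2, col2, data2):
    # build a lookup from the second matrix's (row, col) pairs,
    # then overwrite the first matrix's data in place
    d = {(r, c): v for r, c, v in zip(row2, col2, data2)}
    for j in range(len(row1)):
        key = (row1[j], col1[j])
        if key in d:
            data1[j] = d[key]
A = sparse.rand(200, 50, density=0.01, format='csr')
B = sparse.rand(50, 50, density=0.05, format='csr')
C = (A * B).tocoo()
Ao = A.tocoo()
coo_replace_py(C.row, C.col, C.data, Ao.row, Ao.col, Ao.data)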
A possibly cleaner and faster version of your naive code:
In [57]: r,c=A.nonzero() # this uses A.tocoo()
In [58]: C=A*B
In [59]: Cl=C.tolil()
In [60]: Cl[r,c]=A.tolil()[r,c]
In [61]: Cl.tocsr()
C[r,c]=A[r,c] gives an efficiency warning, but I think that's aimed more at people who do that kind of assignment in a loop.
In [63]: %%timeit C=A*B
...: C[r,c]=A[r,c]
...
The slowest run took 7.32 times longer than the fastest....
1000 loops, best of 3: 334 µs per loop
In [64]: %%timeit C=A*B
...: Cl=C.tolil()
...: Cl[r,c]=A.tolil()[r,c]
...: Cl.tocsr()
...:
100 loops, best of 3: 2.83 ms per loop
My A is small, only (250,100), but it looks like the round trip to lil isn't a time saver, despite the warning.
Masking with A==0 is bound to give problems when A is sparse:
In [66]: Az=A==0
....SparseEfficiencyWarning...
In [67]: r1,c1=Az.nonzero()
Compared to the nonzero r for A, this r1 is much larger - the row index of all zeros in the sparse matrix; everything but the 25 nonzeros.
In [70]: r.shape
Out[70]: (25,)
In [71]: r1.shape
Out[71]: (24975,)
If I index A with that r1 I get a much larger array. In effect, I am repeating each row by the number of zeros in it:
In [72]: A[r1,:]
Out[72]:
<24975x100 sparse matrix of type '<class 'numpy.float64'>'
with 2473 stored elements in Compressed Sparse Row format>
In [73]: A
Out[73]:
<250x100 sparse matrix of type '<class 'numpy.float64'>'
with 25 stored elements in Compressed Sparse Row format>
I've increased the shape and number of nonzero elements by roughly 100 (the number of columns).
Defining foo, and copying Divakar's tests:
def foo(A,B):
    r,c = A.nonzero()
    C = A*B
    C[r,c] = A[r,c]
    return C
In [83]: timeit naive(A,B)
100 loops, best of 3: 2.53 ms per loop
In [84]: timeit proposed(A,B)
/...
SparseEfficiencyWarning)
100 loops, best of 3: 4.48 ms per loop
In [85]: timeit foo(A,B)
...
SparseEfficiencyWarning)
100 loops, best of 3: 2.13 ms per loop
So my version has a modest speed improvement. As Divakar found out, changing sparsity changes the relative advantages. I expect size to change them as well.
The fact that A.nonzero uses the coo format suggests it might be feasible to construct the new array with that format. A lot of sparse code builds a new matrix via the coo values.
In [97]: Co=C.tocoo()
In [98]: Ao=A.tocoo()
In [99]: r=np.concatenate((Co.row,Ao.row))
In [100]: c=np.concatenate((Co.col,Ao.col))
In [101]: d=np.concatenate((Co.data,Ao.data))
In [102]: r.shape
Out[102]: (79,)
In [103]: C1=sparse.csr_matrix((d,(r,c)),shape=A.shape)
In [104]: C1
Out[104]:
<250x100 sparse matrix of type '<class 'numpy.float64'>'
with 78 stored elements in Compressed Sparse Row format>
This C1 has, I think, the same non-zero elements as the C constructed by other means. But I think one value is different, because r is longer. In this particular example, C and A share one nonzero element, and the coo style of input sums duplicate entries, whereas we'd prefer to have A's values overwrite everything.
If you can tolerate this discrepancy, this is a faster way (at least for this test case):
def bar(A,B):
    C = A*B
    Co = C.tocoo()
    Ao = A.tocoo()
    r = np.concatenate((Co.row, Ao.row))
    c = np.concatenate((Co.col, Ao.col))
    d = np.concatenate((Co.data, Ao.data))
    return sparse.csr_matrix((d, (r, c)), shape=A.shape)
In [107]: timeit bar(A,B)
1000 loops, best of 3: 1.03 ms per loop
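If the discrepancy matters, a sketch of one way around it (untested beyond a small case like the above): zero out C's values at A's nonzero positions first, so the coo duplicate-summing simply reinstates A's values.
def bar_exact(A, B):
    C = (A * B).tolil()
    r, c = A.nonzero()
    C[r, c] = 0 # clear the entries that A will overwrite
    Co = C.tocoo()
    Ao = A.tocoo()
    rows = np.concatenate((Co.row, Ao.row))
    cols = np.concatenate((Co.col, Ao.col))
    data = np.concatenate((Co.data, Ao.data))
    return sparse.csr_matrix((data, (rows, cols)), shape=A.shape)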
Cracked it! Well, there's a lot of scipy stuff specific to sparse matrices that I learned along the way. Here's the implementation that I could muster -
# Find the indices in output array that are to be updated
R,C = ((A!=0).dot(B!=0)).nonzero()
mask = np.asarray(A[R,C]==0).ravel()
R,C = R[mask],C[mask]
# Make a copy of A and get the dot product through sliced rows and columns
# off A and B using the definition of matrix-multiplication
out = A.copy()
out[R,C] = (A[R].multiply(B[:,C].T).sum(1)).ravel()
The most expensive part seems to be the element-wise multiplication and summing. On some quick timing tests, it seems this would beat the original dot-mask-based solution on sparse matrices with a high degree of sparsity, which I think comes from its focus on memory efficiency.
Runtime test
Function definitions -
def naive(A, B):
    mask = A != 0
    out = A.dot(B).tolil()
    out[mask] = A[mask]
    return out.tocsr()

def proposed(A, B):
    R,C = ((A!=0).dot(B!=0)).nonzero()
    mask = np.asarray(A[R,C]==0).ravel()
    R,C = R[mask], C[mask]
    out = A.copy()
    out[R,C] = (A[R].multiply(B[:,C].T).sum(1)).ravel()
    return out
Timings -
In [57]: # Input matrices
...: M,N = 25000, 170
...: A = sparse.rand(M, N, density=0.001, format='csr')
...: B = sparse.rand(N, N, density=0.02, format='csr')
...:
In [58]: %timeit naive(A, B)
10 loops, best of 3: 92.2 ms per loop
In [59]: %timeit proposed(A, B)
10 loops, best of 3: 132 ms per loop
In [60]: # Input matrices with increased sparse-ness
...: M,N = 25000, 170
...: A = sparse.rand(M, N, density=0.0001, format='csr')
...: B = sparse.rand(N, N, density=0.002, format='csr')
...:
In [61]: %timeit naive(A, B)
10 loops, best of 3: 78.1 ms per loop
In [62]: %timeit proposed(A, B)
100 loops, best of 3: 8.03 ms per loop
Python isn't my main language, but I thought this was an interesting problem and I wanted to give this a stab :)
Preliminaries:
import numpy
import scipy.sparse
# example matrices and sparse versions
A = numpy.array([[1, 2, 0, 1], [1, 0, 1, 2], [0, 1, 2 ,1], [1, 2, 1, 0]])
B = numpy.array([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]])
A_s = scipy.sparse.lil_matrix(A)
B_s = scipy.sparse.lil_matrix(B)
So you want to convert the original problem of:
C = A.dot(B)
C[A.nonzero()] = A[A.nonzero()]
To something sparse-y.
Just to get this out of the way, the direct "sparse" translation of the above is:
C_s = A_s.dot(B_s)
C_s[A_s.nonzero()] = A_s[A_s.nonzero()]
But it sounds like you're not happy about this, as it calculates all the dot products first, which you worry might be inefficient.
So, your question is, if you find the zeros first, and only evaluate dot products on those elements, will that be faster? I.e. for a dense matrix this could be something like:
Xs, Ys = numpy.nonzero(A==0)
D = A.copy()
D[Xs, Ys] = list(map(lambda x, y: A[x,:].dot(B[:,y]), Xs, Ys))
Let's translate this to a sparse matrix. My main stumbling block here was finding the "Zero" indices; since A_s==0 doesn't make sense for sparse matrices, I found them this way:
Xmax, Ymax = A_s.shape
DenseSize = Xmax * Ymax
Xgrid, Ygrid = numpy.mgrid[0:Xmax, 0:Ymax]
Ygrid = Ygrid.reshape([DenseSize,1])[:,0]
Xgrid = Xgrid.reshape([DenseSize,1])[:,0]
AllIndices = numpy.array([Xgrid, Ygrid])
NonzeroIndices = numpy.array(A_s.nonzero())
ZeroIndices = numpy.array([x for x in AllIndices.T.tolist() if x not in NonzeroIndices.T.tolist()]).T
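One possibly faster alternative for finding them (a sketch, assuming a dense boolean array of A_s's shape fits in memory):
NzMask = numpy.zeros(A_s.shape, dtype=bool)
NzMask[A_s.nonzero()] = True
ZeroIndices = numpy.array(numpy.nonzero(~NzMask))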
If you know of a better / faster way, by all means try it. Once we have the Zero indices, we can do a similar mapping as before:
D_s = A_s.copy()
D_s[ZeroIndices[0], ZeroIndices[1]] = list(map(lambda x, y: A_s[x,:].dot(B[:,y])[0], ZeroIndices[0], ZeroIndices[1]))
which gives you your sparse matrix result.
Now I don't know if this is faster or not. I mostly took a stab because it was an interesting problem, and to see if I could do it in python. In fact I suspect it might not be faster than a direct whole-matrix dot product, because it uses list comprehensions and mapping on a large dataset (like you say, you expect a lot of zeros). But it is an answer to your question of how to calculate dot products only for the zero values, without multiplying the matrices as a whole. I'd be interested to see how it compares in terms of speed on your datasets if you do try it.
EDIT: I'm putting below an example "block processing" version based on the above, which I think should allow you to process your large dataset without problems. Let me know if it works.
import numpy
import scipy.sparse
# example matrices and sparse versions
A = numpy.array([[1, 2, 0, 1], [1, 0, 1, 2], [0, 1, 2 ,1], [1, 2, 1, 0]])
B = numpy.array([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]])
A_s = scipy.sparse.lil_matrix(A)
B_s = scipy.sparse.lil_matrix(B)
# Choose a grid division (i.e. how many processing blocks you want to create)
BlockGrid = numpy.array([2,2])
D_s = A_s.copy() # initialise from A
Xmax, Ymax = A_s.shape
BaseBSiz = numpy.array([Xmax, Ymax]) // BlockGrid # integer division, so the sizes can be used in slices
for BIndX in range(0, Xmax, BlockGrid[0]):
    for BIndY in range(0, Ymax, BlockGrid[1]):
        BSizX, BSizY = D_s[BIndX : BIndX + BaseBSiz[0], BIndY : BIndY + BaseBSiz[1]].shape
        Xgrid, Ygrid = numpy.mgrid[BIndX : BIndX + BSizX, BIndY : BIndY + BSizY]
        Xgrid = Xgrid.reshape([BSizX*BSizY,1])[:,0]
        Ygrid = Ygrid.reshape([BSizX*BSizY,1])[:,0]
        AllInd = numpy.array([Xgrid, Ygrid]).T
        NZeroInd = numpy.array(A_s[Xgrid, Ygrid].reshape((BSizX,BSizY)).nonzero()).T + numpy.array([[BIndX],[BIndY]]).T
        ZeroInd = numpy.array([x for x in AllInd.tolist() if x not in NZeroInd.tolist()]).T
        # Replace zero-values in current block
        D_s[ZeroInd[0], ZeroInd[1]] = list(map(lambda x, y: A_s[x,:].dot(B[:,y])[0], ZeroInd[0], ZeroInd[1]))
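A quick check against the dense reference from the start of this answer (assuming the block loop above behaves as intended):
C = A.dot(B)
C[A.nonzero()] = A[A.nonzero()]
print(numpy.allclose(D_s.todense(), C)) # should print True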

Pairwise vdot using Numpy

I'm trying to compute the pairwise np.vdot of a complex 2D array x with itself. So the behaviour I want is:
X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        X[i, j] = np.vdot(x[i], x[j])
Is there a way to do this without the explicit loops? I tried using pairwise_kernel from sklearn but it assumes the input arrays are real numbers. I also tried broadcasting, but vdot flattens its inputs.
X = np.einsum('ik,jk->ij', np.conj(x), x)
is equivalent to
X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        X[i, j] = np.vdot(x[i], x[j])
np.einsum takes a sum of products. The subscripts 'ik,jk->ij' tell np.einsum that the second argument, np.conj(x), is an array with subscripts ik and the third argument, x, has subscripts jk. Thus, the product np.conj(x)[i,k]*x[j,k] is computed for all i, j, k. The sum is taken over the repeated subscript, k, and since that leaves i and j remaining, they become the subscripts of the resultant array.
For example,
import numpy as np
N, M = 10, 20
a = np.random.random((N,M))
b = np.random.random((N,M))
x = a + b*1j
def orig(x):
    X = np.empty((x.shape[0], x.shape[0]), dtype='complex128')
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            X[i, j] = np.vdot(x[i], x[j])
    return X

def alt(x):
    return np.einsum('ik,jk->ij', np.conj(x), x)
assert np.allclose(orig(x), alt(x))
In [307]: %timeit orig(x)
10000 loops, best of 3: 143 µs per loop
In [308]: %timeit alt(x)
100000 loops, best of 3: 8.63 µs per loop
To extend np.vdot to all rows, you can use np.tensordot. I am borrowing the conjugate idea straight off @unutbu's solution, like so -
np.tensordot(np.conj(x),x,axes=(1,1))
Basically, with np.tensordot we specify the axes to be reduced, which in this case is the last axis of both the conjugated version of x and x itself.
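A quick sanity check (a small sketch of my own) that the tensordot form matches the einsum form:
import numpy as np
x = np.random.random((5, 7)) + 1j*np.random.random((5, 7))
r1 = np.einsum('ik,jk->ij', np.conj(x), x)
r2 = np.tensordot(np.conj(x), x, axes=(1, 1))
print(np.allclose(r1, r2))
# True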
Runtime test -
Let's time @unutbu's solution with np.einsum and the proposed solution in this post -
In [27]: import numpy as np # From @unutbu's solution again
...:
...: N, M = 1000, 1000
...: a = np.random.random((N,M))
...: b = np.random.random((N,M))
...: x = a + b*1j
...:
In [28]: %timeit np.einsum('ik,jk->ij', np.conj(x), x) # @unutbu's solution
1 loops, best of 3: 4.45 s per loop
In [29]: %timeit np.tensordot(np.conj(x),x,axes=(1,1))
1 loops, best of 3: 3.76 s per loop

indexing in numpy (related to max/argmax)

Suppose I have an N-dimensional numpy array x and an (N-1)-dimensional index array m (for example, m = x.argmax(axis=-1)). I'd like to construct (N-1) dimensional array y such that y[i_1, ..., i_N-1] = x[i_1, ..., i_N-1, m[i_1, ..., i_N-1]] (for the argmax example above it would be equivalent to y = x.max(axis=-1)).
For N=3 I could achieve what I want by
y = x[np.arange(x.shape[0])[:, np.newaxis], np.arange(x.shape[1]), m]
The question is, how do I do this for an arbitrary N?
You can use np.indices:
firstdims = np.indices(x.shape[:-1])
And add your own index array:
ind = tuple(firstdims) + (m,)
Then x[ind] is what you want:
In [228]: allclose(x.max(-1),x[ind])
Out[228]: True
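A self-contained sketch of the same approach (an N = 4 example of my own):
import numpy as np
x = np.random.rand(2, 3, 4, 5)
m = x.argmax(axis=-1) # shape (2, 3, 4)
ind = tuple(np.indices(x.shape[:-1])) + (m,)
print(np.allclose(x.max(-1), x[ind]))
# True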
Here's one approach using reshaping and linear indexing to handle multi-dimensional arrays of arbitrary dimensions -
shp = x.shape[:-1]
n_ele = np.prod(shp)
y_out = x.reshape(n_ele,-1)[np.arange(n_ele),m.ravel()].reshape(shp)
Let's take a sample case with an ndarray of 6 dimensions, and let's say we are using m = x.argmax(axis=-1) to index into the last dimension. So, the output would be x.max(-1). Let's verify this for the proposed solution -
In [121]: x = np.random.randint(0,9,(4,5,3,3,2,4))
In [122]: m = x.argmax(axis=-1)
In [123]: shp = x.shape[:-1]
...: n_ele = np.prod(shp)
...: y_out = x.reshape(n_ele,-1)[np.arange(n_ele),m.ravel()].reshape(shp)
...:
In [124]: np.allclose(x.max(-1),y_out)
Out[124]: True
I liked @B. M.'s solution for its elegance. So, here's a runtime test to benchmark these two -
def reshape_based(x,m):
    shp = x.shape[:-1]
    n_ele = np.prod(shp)
    return x.reshape(n_ele,-1)[np.arange(n_ele),m.ravel()].reshape(shp)

def indices_based(x,m): # @B. M.'s solution
    firstdims = np.indices(x.shape[:-1])
    ind = tuple(firstdims) + (m,)
    return x[ind]
Timings -
In [152]: x = np.random.randint(0,9,(4,5,3,3,4,3,6,2,4,2,5))
...: m = x.argmax(axis=-1)
...:
In [153]: %timeit indices_based(x,m)
10 loops, best of 3: 30.2 ms per loop
In [154]: %timeit reshape_based(x,m)
100 loops, best of 3: 5.14 ms per loop
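For what it's worth, newer numpy versions (>= 1.15, an assumption about your environment) expose this operation directly as np.take_along_axis:
y = np.take_along_axis(x, m[..., None], axis=-1)[..., 0]
print(np.allclose(x.max(-1), y))
# True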

multiple numpy dot products without a loop

Is it possible to compute several dot products without a loop?
say you have the following:
a = randn(100, 3, 3)
b = randn(100, 3, 3)
I want to get an array z of shape (100, 3, 3) such that for all i
z[i, ...] == dot(a[i, ...], b[i, ...])
in other words, which verifies:
for va, vb, vz in izip(a, b, z):
    assert (vz == dot(va, vb)).all()
The straightforward solution would be:
z = array([dot(va, vb) for va, vb in zip(a, b)])
which uses an implicit loop (list comprehension + array).
Is there a more efficient way to compute z?
np.einsum can be useful here. Try running this copy+pasteable code:
import numpy as np
a = np.random.randn(100, 3, 3)
b = np.random.randn(100, 3, 3)
z = np.einsum("ijk, ikl -> ijl", a, b)
z2 = np.array([ai.dot(bi) for ai, bi in zip(a, b)])
assert (z == z2).all()
einsum is compiled code and runs very fast, even compared to np.tensordot (which doesn't apply here exactly, but often is applicable). Here are some stats:
In [8]: %timeit z = np.einsum("ijk, ikl -> ijl", a, b)
10000 loops, best of 3: 105 us per loop
In [9]: %timeit z2 = np.array([ai.dot(bi) for ai, bi in zip(a, b)])
1000 loops, best of 3: 1.06 ms per loop
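Also worth noting: batched matrix multiplication is available directly as np.matmul (or the @ operator) on numpy >= 1.10, an assumption about your version:
z3 = np.matmul(a, b) # same contraction as "ijk, ikl -> ijl"
assert np.allclose(z3, z)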
Try Einstein summation in numpy:
z = np.einsum('...ij,...jk->...ik', a, b)
It's elegant and does not require you to write a loop, as you requested.
It gives me a factor of 4.8 speed increase on my system:
%timeit z = array([dot(va, vb) for va, vb in zip(a, b)])
1000 loops, best of 3: 454 µs per loop
%timeit z = np.einsum('...ij,...jk->...ik', a, b)
10000 loops, best of 3: 94.6 µs per loop
This solution still uses a loop, but is faster because it avoids unnecessary creation of temp arrays, by using the out arg of dot:
def dotloop(a, b):
    res = np.empty(a.shape)
    for ai, bi, resi in zip(a, b, res):
        np.dot(ai, bi, out=resi)
    return res
%timeit dotloop(a,b)
1000 loops, best of 3: 453 us per loop
%timeit array([dot(va, vb) for va, vb in zip(a, b)])
1000 loops, best of 3: 843 us per loop
In addition to the other answers, I want to add that:
np.einsum("ijk, ijk -> ij", a, b)
is suitable for a related case I encountered, where you have two 3D arrays consisting of matching 2D fields of 2D vectors (points or directions). This gives a kind of "element-wise" dot product between those 2D vectors.
For example:
np.einsum("ijk, ijk -> ij", [[[1,2],[3,4]]], [[[5,6],[7,8]]])
# => array([[17, 53]])
Where:
np.dot([1,2],[5,6])
# => 17
np.dot([3,4],[7,8])
# => 53
