np.dot product between two 3D matrices along specified axis - python

I have two 3D matrices:
a = np.random.normal(size=[3,2,5])
b = np.random.normal(size=[5,2,3])
I want the dot product of each slice along axes 2 and 0 respectively:
c = np.zeros([3,3,5]) # c.size is 45
c[:,:,0] = a[:,:,0].dot(b[0,:,:])
c[:,:,1] = a[:,:,1].dot(b[1,:,:])
...
I would like to do that using np.tensordot (for efficiency and speed)
I have tried:
c = np.tensordot(a, b, axes=[2,0])
but I get a 4D array with 36 elements (instead of 45): c.shape, c.size = ((3L, 2L, 2L, 3L), 36). I have found a similar question (Numpy tensor: Tensordot over frontal slices of tensor), but it's not exactly what I want, and I was unable to extrapolate that solution to my problem.
To summarise: can I use np.tensordot to compute the c array shown above?
Update #1
The answer by @hpaulj is what I wanted. However, on my system (Python 2.7 and NumPy 1.13.3) those approaches are pretty slow:
n = 3000
a = np.random.normal(size=[n, 20, 5])
b = np.random.normal(size=[5, 20, n])
t = time.clock()
c_slice = a[:,:,0].dot(b[0,:,:])
print('one slice_x_5: {:.3f} seconds'.format( (time.clock()-t)*5 ))
t = time.clock()
c = np.zeros([n, n, 5])
for i in range(5):
    c[:,:,i] = a[:,:,i].dot(b[i,:,:])
print('for loop: {:.3f} seconds'.format(time.clock()-t))
t = time.clock()
d = np.einsum('abi,ibd->adi', a, b)
print('einsum: {:.3f} seconds'.format(time.clock()-t))
t = time.clock()
e = np.tensordot(a,b,[1,1])
e1 = e.transpose(0,3,1,2)[:,:,np.arange(5),np.arange(5)]
print('tensordot: {:.3f} seconds'.format(time.clock()-t))
a = a.transpose(2,0,1)
t = time.clock()
f = np.matmul(a,b)
print('matmul: {:.3f} seconds'.format(time.clock()-t))

It's easier to work with einsum than tensordot. So let's start there:
In [469]: a = np.random.normal(size=[3,2,5])
...: b = np.random.normal(size=[5,2,3])
...:
In [470]: c = np.zeros([3,3,5]) # c.size is 45
In [471]: for i in range(5):
     ...:     c[:,:,i] = a[:,:,i].dot(b[i,:,:])
     ...:
In [472]: d = np.einsum('abi,ibd->iad', a, b)
In [473]: d.shape
Out[473]: (5, 3, 3)
In [474]: d = np.einsum('abi,ibd->adi', a, b)
In [475]: d.shape
Out[475]: (3, 3, 5)
In [476]: np.allclose(c,d)
Out[476]: True
I had to think a bit about how to match up the dimensions. It helped to focus on a[:,:,i] as 2d, and similarly on b[i,:,:]. So the dot sum is over the middle dimension of both arrays (size 2).
In testing ideas it might help if the first 2 dimensions of c were different. There'd be less chance of mixing them up.
It's easy to specify the dot summation axis (axes) in tensordot, but harder to constrain the handling of the other dimensions. That's why you get a 4d array.
I can get it to work with a transpose, followed by taking the diagonal:
In [477]: e = np.tensordot(a,b,[1,1])
In [478]: e.shape
Out[478]: (3, 5, 5, 3)
In [479]: e1 = e.transpose(0,3,1,2)[:,:,np.arange(5),np.arange(5)]
In [480]: e1.shape
Out[480]: (3, 3, 5)
In [481]: np.allclose(c,e1)
Out[481]: True
I've calculated a lot more values than needed, and thrown most of them away.
matmul with some transposing might work better.
In [482]: f = a.transpose(2,0,1) @ b
In [483]: f.shape
Out[483]: (5, 3, 3)
In [484]: np.allclose(c, f.transpose(1,2,0))
Out[484]: True
I think of the size-5 dimension as 'going along for the ride'. That's what your loop does; in einsum, the i index is the same in all parts of the expression.
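Putting the matmul idea together into a single expression gives c directly; a minimal sketch, using the same a and b as above:
# move the size-5 axis to the front, let matmul batch over it,
# then move it back to the end to match c's layout
c2 = np.matmul(a.transpose(2,0,1), b).transpose(1,2,0)
np.allclose(c, c2)   # True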

Related

NumPy broadcasting vs. simple loop

I have been using NumPy for a while but there still are instances in which broadcasting is impenetrable to me. Please see the code below:
import numpy as np
np.random.seed(123)
# Number of measurements for the x variable
nx = 7
# Number of measurements for the y variable
ny = 11
# Number of items for which we run the simulation
nc = 23
# Fake some data
x = np.random.uniform(0, 1, size=(nx, ))
y = np.random.uniform(0, 1, size=(ny, ))
# histogram_2d represents the 2d frequency of the x, y measurements
histogram_2d = np.random.randint(0, 20, size=(nx, ny))
# c is the actual simulation results, size=(nx*ny, nc)
c = np.random.uniform(0, 9, size=(nx*ny, nc))
# Try broadcasting
c_3d = c.reshape((nc, nx, ny))
numpy_sum = (c_3d * histogram_2d).sum()
# Attempt possible replacement with a simple loop
partial_sum = 0.0
for i in range(nc):
    c_2d = np.reshape(c[:, i], (nx, ny))
    partial_sum += (c_2d * histogram_2d).sum()
print('Numpy broadcasting: ', numpy_sum)
print('Actual loop : ', partial_sum)
In my naivete, I was expecting the two approaches to give the same results (up to some multiple of machine precision). But on my system I get this:
Numpy broadcasting: 74331.4423599
Actual loop : 73599.8596346
As my ignorance is showing: given that histogram_2d is a 2D array and c_3d is a 3D array, I was simply thinking that NumPy would magically expand histogram_2d with nc copies of itself in the first axis and do the multiplication. But it appears I am not quite correct.
I would like to know how to replace the condensed, broadcasted multiplication + sum with a proper for loop. I am looking at some Fortran code that is supposed to do the same thing, and this Fortran code:
hist_3d = spread(histogram_2d, 1, nc)
c_3d = reshape(c, [nc, nx, ny])
partial_sum = sum(c_3d*hist_3d)
Does not do what the NumPy broadcasting does... Which means I am doing something fundamentally wrong somewhere and/or my understanding of broadcasting is still very limited.
In [3]: c.shape
Out[3]: (77, 23)
This isn't a good reshape; it runs, but it messes up the layout:
In [5]: c_3d = c.reshape((nc, nx, ny)); c_3d.shape
Out[5]: (23, 7, 11)
This is good - splitting the 77 into 7 and 11:
In [6]: c_3d = c.reshape((nx, ny,nc)); c_3d.shape
Out[6]: (7, 11, 23)
To multiply with:
In [7]: histogram_2d.shape
Out[7]: (7, 11)
Use:
In [8]: (histogram_2d[:,:,None]*c_3d).shape
Out[8]: (7, 11, 23)
In [9]: (histogram_2d[:,:,None]*c_3d).sum()
Out[9]: 73599.85963455029
With this broadcasting:
(7,11,1) and (7,11,23) => (7,11,23)
The 2 key rules are:
add leading dimensions as needed to match the total ndim
change all size-1 dimensions to match
(I say 'change' because the 1 may actually be changed to 0. That's not a common case, but it illustrates the generality of broadcasting.)
New trailing dimensions have to be explicit. This avoids some ambiguities, as when trying to add a (2,) and (3,). One of those can be expanded to (1,2) or (1,3), but which? If one is (3,1), then expanding the other to (1,2) is unambiguous.
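A short sketch of these rules in action, with shapes matching the example above:
import numpy as np

h = np.random.rand(7, 11)
c3 = np.random.rand(7, 11, 23)

# explicit trailing axis: (7,11) -> (7,11,1), and the 1 stretches to 23
out = h[:, :, None] * c3            # shape (7, 11, 23)

# leading dimensions are added automatically:
# (11,23) -> (1,11,23) -> (7,11,23)
out2 = c3 * np.random.rand(11, 23)  # shape (7, 11, 23)

# h * c3 raises a ValueError: (7,11) aligns as (1,7,11), not (7,11,1)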

Broadcasting 3d arrays for elementwise multiplication

Good evening,
I need some help understanding advanced broadcasting with complex numpy arrays.
I have:
array A: 50000x2000
array B: 2000x10x10
Implementation with for loop:
for k in range(50000):
    temp = A[k,:].reshape(2000, 1, 1)
    finalarray[k,:,:] = np.sum(B*temp, axis=0)
I want an element-wise multiplication followed by a summation over the axis with 2000 elements, with end product:
finalarray: 50000x10x10
Is it possible to avoid the for loop?
Thank you!
For something like this I'd use np.einsum, which makes it pretty easy to write down what you want to happen in terms of the index actions you want:
fast = np.einsum('ij,jkl->ikl', A, B)
which gives me the same result (dropping 50000->500 so the loopy one finishes quickly):
A = np.random.random((500, 2000))
B = np.random.random((2000, 10, 10))
finalarray = np.zeros((500, 10, 10))
for k in range(500):
    temp = A[k,:].reshape(2000, 1, 1)
    finalarray[k,:,:] = np.sum(B*temp, axis=0)
fast = np.einsum('ij,jkl->ikl', A, B)
gives me
In [81]: (finalarray == fast).all()
Out[81]: True
and reasonable performance even in the 50000 case:
In [88]: %time fast = np.einsum('ij,jkl->ikl', A, B)
Wall time: 4.93 s
In [89]: fast.shape
Out[89]: (50000, 10, 10)
Alternatively, in this case, you could use tensordot:
faster = np.tensordot(A, B, axes=1)
which will be a few times faster (at the cost of being less general):
In [29]: A = np.random.random((50000, 2000))
In [30]: B = np.random.random((2000, 10, 10))
In [31]: %time fast = np.einsum('ij,jkl->ikl', A, B)
Wall time: 5.08 s
In [32]: %time faster = np.tensordot(A, B, axes=1)
Wall time: 504 ms
In [33]: np.allclose(fast, faster)
Out[33]: True
I had to use allclose here because the values wind up being very slightly different:
In [34]: abs(fast - faster).max()
Out[34]: 2.7853275241795927e-12
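The reason tensordot wins here is that with axes=1 it reduces to one large matrix product; a sketch of the equivalence (same A and B as above):
# contract A's last axis with B's first: (50000, 2000) @ (2000, 100)
flat = (A @ B.reshape(2000, -1)).reshape(-1, 10, 10)
np.allclose(faster, flat)   # True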
This should work:
(A[:, :, None, None] * B[None, :, :]).sum(axis=1)
But it will blow up your memory for the intermediate array created by the product.
The product has shape (50000, 2000, 10, 10), thus contains 10 billion elements, which is 80 GB for 64 bit floating point values.
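If you do want to keep the pure-broadcasting form, chunking over the first axis bounds the intermediate; a sketch (assuming the A and B of the question; the chunk size of 100 is an arbitrary choice):
out = np.empty((A.shape[0], 10, 10))
step = 100   # intermediate is now (100, 2000, 10, 10), about 160 MB
for start in range(0, A.shape[0], step):
    chunk = A[start:start + step]
    out[start:start + step] = (chunk[:, :, None, None] * B[None, :, :]).sum(axis=1)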

Vectorized groupby with NumPy

Pandas has a widely-used groupby facility to split up a DataFrame based on a corresponding mapping, from which you can apply a calculation on each subgroup and recombine the results.
Can this be done flexibly in NumPy without a native Python for-loop? With a Python loop, this would look like:
>>> import numpy as np
>>> X = np.arange(10).reshape(5, 2)
>>> groups = np.array([0, 0, 0, 1, 1])
# Split up elements (rows) of `X` based on their element-wise group
>>> np.array([X[groups==i].sum() for i in np.unique(groups)])
array([15, 30])
Above, 15 is the sum of the first three rows of X, and 30 is the sum of the remaining two.
By "flexibly,” I just mean that we aren't focusing on one particular computation such as sum, count, maximum, etc, but rather passing any computation to the grouped arrays.
If not, is there a faster approach than the above?
How about using a scipy sparse matrix?
import numpy as np
from scipy import sparse
import time
x_len = 500000
g_len = 100
X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)
# original
s = time.time()
a = np.array([X[groups==i].sum() for i in np.unique(groups)])
print(time.time() - s)
# using scipy sparse matrix
s = time.time()
x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-b)).sum())
Result on my PC:
0.15915322303771973
0.012875080108642578
0
More than 10 times faster.
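For intuition: the coo_matrix construction is really a scatter-add of x_sum into per-group bins. A minimal equivalent sketch, reusing X, groups and g_len from above (np.add.at is slower, but it shows what is being computed):
x_sum = X.sum(axis=1)
scatter = np.zeros(g_len)
np.add.at(scatter, groups, x_sum)   # accumulate each row's sum into its group
# scatter matches b from the sparse version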
Update!
Let's benchmark the answers of @Paul Panzer and @Daniel F. It is a summation-only benchmark.
import numpy as np
from scipy import sparse
import time
# by @Daniel F
def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype = X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out
x_len = 500000
g_len = 200
X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)
print("original")
s = time.time()
a = np.array([X[groups==i].sum() for i in np.unique(groups)])
print(time.time() - s)
print("use scipy coo matrix")
s = time.time()
x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-b)).sum())
print("use scipy csr matrix #Daniel F")
s = time.time()
x_sum = X.sum(axis=1)
c = np.array(sparse.csr_matrix(
    (
        x_sum,
        groups,
        np.arange(len(groups)+1)
    ),
    shape=(len(groups), g_len)
).sum(axis=0)).ravel()
print(time.time() - s)
#compare
print(np.abs((a-c)).sum())
print("use bincount #Paul Panzer #Daniel F")
s = time.time()
d = np.bincount(groups, X.sum(axis=1), g_len)
print(time.time() - s)
#compare
print(np.abs((a-d)).sum())
print("use ufunc #Daniel F")
s = time.time()
e = groupby_np(X, groups)
print(time.time() - s)
#compare
print(np.abs((a-e)).sum())
STDOUT
original
0.2882847785949707
use scipy coo matrix
0.012301445007324219
0
use scipy csr matrix @Daniel F
0.01046299934387207
0
use bincount @Paul Panzer @Daniel F
0.007468223571777344
0.0
use ufunc @Daniel F
0.04431319236755371
0
The winner is the bincount solution. But the csr matrix solution is also very interesting.
@klim's sparse matrix solution would at first sight appear to be tied to summation. We can, however, use it in the general case by converting between the csr and csc formats:
Let's look at a small example:
>>> m, n = 3, 8
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>
>>> M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m))
>>>
>>> idx
array([0, 2, 2, 1, 1, 2, 2, 0])
>>>
>>> M = M.tocsc()
>>>
>>> M.indptr, M.indices
(array([0, 2, 4, 8], dtype=int32), array([0, 7, 3, 4, 1, 2, 5, 6], dtype=int32))
As we can see, after conversion the internal representation of the sparse matrix yields the indices grouped and sorted:
>>> groups = np.split(M.indices, M.indptr[1:-1])
>>> groups
[array([0, 7], dtype=int32), array([3, 4], dtype=int32), array([1, 2, 5, 6], dtype=int32)]
>>>
We could have obtained the same using a stable argsort:
>>> np.argsort(idx, kind='mergesort')
array([0, 7, 3, 4, 1, 2, 5, 6])
>>>
But sparse matrices are actually faster, even when we allow argsort to use a faster non-stable algorithm:
>>> m, n = 1000, 100000
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>
>>> timeit('sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds)
2.250748165184632
>>> timeit('np.argsort(idx)', **kwds)
5.783584725111723
If we require argsort to keep groups sorted, the difference is even larger:
>>> timeit('np.argsort(idx, kind="mergesort")', **kwds)
10.507467685034499
If you want a more flexible implementation of groupby that can group using any of numpy's ufuncs:
def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype = X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out
groupby_np(X, groups)
array([15, 30])
groupby_np(X, groups, uf = np.multiply)
array([ 0, 3024])
groupby_np(X, groups, uf = np.maximum)
array([5, 9])
groupby_np(X, groups, uf = np.minimum)
array([0, 6])
There's probably a faster way than this (both of the operands are making copies right now), but:
np.bincount(np.broadcast_to(groups, X.T.shape).ravel(), X.T.ravel())
array([ 15., 30.])
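The same idea can be written with np.repeat, which may read more directly: every element of X gets its row's group label (a sketch, equivalent to the line above):
np.bincount(np.repeat(groups, X.shape[1]), weights=X.ravel())
# array([ 15., 30.])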
If you want to extend the answer to an ndarray and still have a fast computation, you can extend Daniel F's solution:
x_len = 500000
g_len = 200
y_len = 2
X = np.arange(x_len * y_len).reshape(x_len, y_len)
groups = np.random.randint(0, g_len, x_len)
# original
a = np.array([X[groups==i].sum(axis=0) for i in np.unique(groups)])
# alternative
bins = [0] + list(np.bincount(groups, minlength=g_len).cumsum())
Z = np.argsort(groups)
d = np.array([X.take(Z[bins[i]:bins[i+1]],0).sum(axis=0) for i in range(g_len)])
It took about 30 ms (15 ms for creating bins + 15 ms for summing) instead of 280 ms for the original way in this example.
d.shape
>>> (200, 2)

Numpy outer addition of subarrays

Is there a way, in numpy, to perform what amounts to an outer addition of subarrays?
That is to say, I have 2 arrays of the form 2x2xNxM, which may each be considered a stack of 2x2 matrices N high and M wide. I would like to add each of these matrices to each matrix from the other array, to form a 2x2xNxMxNxM array in which the last four indices correspond to the indices in my initial two arrays so that I can index output[:,:,x1,y1,x2,y2] == a1[:,:,x1,y1] + a2[:,:,x2,y2].
If these were arrays of scalars, it would be trivial; all I'd have to do is:
A, B = a1.ravel(), a2.ravel()
four_D = (A[:, np.newaxis] + B).reshape(*a1.shape, *a2.shape)
for (x1, y1, x2, y2), added in np.ndenumerate(four_D):
    assert added == a1[x1, y1] + a2[x2, y2]
However, this doesn't work when a1 and a2 are stacks of matrices rather than scalars. I could, of course, use nested for loops, but my dataset is going to be fairly large, and I'm expecting to run this over multiple datasets.
Is there an efficient way to do this?
Extend arrays to have more dimensions and then leverage broadcasting -
output = a1[...,None,None] + a2[...,None,None,:,:]
Sample run -
In [38]: # Setup input arrays
...: N = 3
...: M = 4
...: a1 = np.random.rand(2,2,N,M)
...: a2 = np.random.rand(2,2,N,M)
...:
...: output = np.zeros((2,2,N,M,N,M))
...: for x1 in range(N):
...:     for x2 in range(N):
...:         for y1 in range(M):
...:             for y2 in range(M):
...:                 output[:,:,x1,y1,x2,y2] = a1[:,:,x1,y1] + a2[:,:,x2,y2]
...:
...: output1 = a1[...,None,None] + a2[...,None,None,:,:]
...:
...: print np.allclose(output, output1)
True
Just as for scalars, inserting additional axes works for higher-dimensional arrays too (this is called broadcasting):
import numpy as np
a1 = np.random.randn(2, 2, 3, 4)
a2 = np.random.randn(2, 2, 3, 4)
added = a1[..., np.newaxis, np.newaxis] + a2[..., np.newaxis, np.newaxis, :, :]
print(added.shape) # (2, 2, 3, 4, 3, 4)

Creating 3rd order tensors with python and numpy

I have two 1-dimensional arrays: a such that np.shape(a) == (n,) and b such that np.shape(b) == (m,).
I want to make a (3rd-order) tensor c such that np.shape(c) == (n, n, m) by doing c = np.outer(np.outer(a,a),b).
But when I do this, I get:
>> np.shape(c)
(n*n,m)
which is just a rectangular matrix. How can I make a 3D tensor like I want?
You could perhaps use np.multiply.outer instead of np.outer to get the required outer product:
>>> a = np.arange(4)
>>> b = np.ones(5)
>>> mo = np.multiply.outer
Then we have:
>>> mo(mo(a, a), b).shape
(4, 4, 5)
A better way could be to use np.einsum (this avoids creating intermediate arrays):
>>> c = np.einsum('i,j,k->ijk', a, a, b)
>>> c.shape
(4, 4, 5)
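Alternatively, since np.outer flattens its inputs, the original np.outer approach can be rescued with a single reshape; a sketch with the same a and b:
>>> c = np.outer(np.outer(a, a), b).reshape(4, 4, 5)
>>> c.shape
(4, 4, 5)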
