Broadcasting 3d arrays for elementwise multiplication

Good evening,
I need some help understanding advanced broadcasting with complex numpy arrays.
I have:
array A: 50000x2000
array B: 2000x10x10
Implementation with for loop:
for k in range(50000):
    temp = A[k,:].reshape(2000,1,1)
    finalarray[k,:,:] = np.sum(B*temp, axis=0)
I want an element-wise multiplication and summation of the axis with 2000 elements, with endproduct:
finalarray: 50000x10x10
Is it possible to avoid the for loop?
Thank you!

For something like this I'd use np.einsum, which makes it pretty easy to write down what you want to happen in terms of the index actions you want:
fast = np.einsum('ij,jkl->ikl', A, B)
which gives me the same result (dropping 50000->500 so the loopy one finishes quickly):
A = np.random.random((500, 2000))
B = np.random.random((2000, 10, 10))
finalarray = np.zeros((500, 10, 10))
for k in range(500):
    temp = A[k,:].reshape(2000,1,1)
    finalarray[k,:,:] = np.sum(B*temp, axis=0)
fast = np.einsum('ij,jkl->ikl', A, B)
gives me
In [81]: (finalarray == fast).all()
Out[81]: True
and reasonable performance even in the 50000 case:
In [88]: %time fast = np.einsum('ij,jkl->ikl', A, B)
Wall time: 4.93 s
In [89]: fast.shape
Out[89]: (50000, 10, 10)
Alternatively, in this case, you could use tensordot:
faster = np.tensordot(A, B, axes=1)
which will be a few times faster (at the cost of being less general):
In [29]: A = np.random.random((50000, 2000))
In [30]: B = np.random.random((2000, 10, 10))
In [31]: %time fast = np.einsum('ij,jkl->ikl', A, B)
Wall time: 5.08 s
In [32]: %time faster = np.tensordot(A, B, axes=1)
Wall time: 504 ms
In [33]: np.allclose(fast, faster)
Out[33]: True
I had to use allclose here because the values wind up being very slightly different:
In [34]: abs(fast - faster).max()
Out[34]: 2.7853275241795927e-12
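As a small sanity check of the two one-liners above, here is a sketch that compares them against the explicit loop on reduced sizes (the sizes and the seed are assumptions, chosen only so it runs quickly). It also shows a third spelling that is not from the answer above: flatten B to 2-D, take a plain matrix product, and restore the shape. It computes the same contraction over the 2000-element axis.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))      # stands in for the 50000x2000 array
B = rng.random((20, 4, 3))    # stands in for the 2000x10x10 array

# Reference: the explicit loop from the question.
ref = np.empty((A.shape[0],) + B.shape[1:])
for k in range(A.shape[0]):
    ref[k] = np.sum(B * A[k, :, None, None], axis=0)

via_einsum = np.einsum('ij,jkl->ikl', A, B)
via_tensordot = np.tensordot(A, B, axes=1)
# Flatten the trailing axes of B, do a 2-D matrix product, restore the shape.
via_matmul = (A @ B.reshape(B.shape[0], -1)).reshape(A.shape[0], *B.shape[1:])

print(np.allclose(ref, via_einsum),
      np.allclose(ref, via_tensordot),
      np.allclose(ref, via_matmul))
```

All three avoid the Python loop; which one is fastest depends on the NumPy build and the array sizes, so it is worth timing on the real data.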

This should work:
(A[:, :, None, None] * B[None, :, :]).sum(axis=1)
But it will blow up your memory for the intermediate array created by the product.
The product has shape (50000, 2000, 10, 10), thus contains 10 billion elements, which is 80 GB for 64 bit floating point values.
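If the broadcasting form is still wanted, one way to keep the intermediate small is to process A in row blocks. The sketch below is an assumption-laden illustration, not part of the answer above: the helper name chunked_contract and the default chunk size are made up, and the chunk size should be tuned to the available memory.

```python
import numpy as np

def chunked_contract(A, B, chunk=64):
    """Compute sum_j A[i, j] * B[j, k, l] one block of rows of A at a time."""
    out = np.empty((A.shape[0],) + B.shape[1:], dtype=np.result_type(A, B))
    for start in range(0, A.shape[0], chunk):
        stop = start + chunk
        # Intermediate has shape (rows_in_block, A.shape[1], B.shape[1], B.shape[2]).
        block = A[start:stop, :, None, None] * B[None]
        out[start:stop] = block.sum(axis=1)
    return out

rng = np.random.default_rng(0)
A_small = rng.random((130, 20))
B_small = rng.random((20, 4, 3))
print(np.allclose(chunked_contract(A_small, B_small),
                  np.einsum('ij,jkl->ikl', A_small, B_small)))
```

With the shapes from the question, a block of 64 rows gives an intermediate of 64 x 2000 x 10 x 10 float64 values, roughly 100 MB instead of 80 GB.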

Related

Faster alternative to the (V A V^T).diagonal in python [duplicate]

Imagine having 2 numpy arrays:
> A, A.shape = (n,p)
> B, B.shape = (p,p)
Typically p is a smaller number (p <= 200), while n can be arbitrarily large.
I am doing the following:
result = np.diag(A.dot(B).dot(A.T))
As you can see, I keep only the n diagonal entries, but an intermediate (n x n) array is computed first and everything off its diagonal is thrown away.
I wish for a function like diag_dot(), which only calculates the diagonal entries of the result and does not allocate the complete memory.
A result would be:
> result = diag_dot(A.dot(B), A.T)
Is there a premade functionality like this and can this be done efficiently without the need for allocating the intermediate (n x n) array?

I think I got it on my own, but will share the solution nevertheless:
since getting only the diagonal of a matrix multiplication
> Z = np.diag(X.dot(Y))
amounts to taking, for each i, the scalar product of row i of X with column i of Y, the previous statement is equivalent to:
> Z = (X * Y.T).sum(-1)
For the original variables this means:
> result = (A.dot(B) * A).sum(-1)
Please correct me if I am wrong but this should be it ...
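The identity behind this is that entry i of the diagonal is the sum over j of (AB)[i, j] * A[i, j]. A quick numerical check on small random inputs (the sizes and seed here are arbitrary assumptions) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
A = rng.random((n, p))
B = rng.random((p, p))

full = np.diag(A.dot(B).dot(A.T))   # builds the (n, n) intermediate
rowwise = (A.dot(B) * A).sum(-1)    # only (n, p) intermediates

print(np.allclose(full, rowwise))
```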

You can get almost anything you ever dreamed of with numpy.einsum. Until you start getting the hang of it, it basically seems like black voodoo...
>>> a = np.arange(15).reshape(5, 3)
>>> b = np.arange(9).reshape(3, 3)
>>> np.diag(np.dot(np.dot(a, b), a.T))
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,ji->i', np.dot(a, b), a.T)
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,ij->i', np.dot(a, b), a)
array([ 60, 672, 1932, 3840, 6396])
EDIT You can actually get the whole thing in a single shot, it's ridiculous...
>>> np.einsum('ij,jk,ki->i', a, b, a.T)
array([ 60, 672, 1932, 3840, 6396])
>>> np.einsum('ij,jk,ik->i', a, b, a)
array([ 60, 672, 1932, 3840, 6396])
EDIT You don't want to let it figure too much on its own though... Added the OP's answer to its own question for comparison also.
n, p = 10000, 200
a = np.random.rand(n, p)
b = np.random.rand(p, p)
In [2]: %timeit np.einsum('ij,jk,ki->i', a, b, a.T)
1 loops, best of 3: 1.3 s per loop
In [3]: %timeit np.einsum('ij,ij->i', np.dot(a, b), a)
10 loops, best of 3: 105 ms per loop
In [4]: %timeit np.diag(np.dot(np.dot(a, b), a.T))
1 loops, best of 3: 5.73 s per loop
In [5]: %timeit (a.dot(b) * a).sum(-1)
10 loops, best of 3: 115 ms per loop

A pedestrian answer, which avoids the construction of large intermediate arrays, is:
result = np.empty([n,], dtype=A.dtype)
for i in range(n):
    result[i] = A[i,:].dot(B).dot(A[i,:])

np.dot product between two 3D matrices along specified axis

I have two 3D matrices:
a = np.random.normal(size=[3,2,5])
b = np.random.normal(size=[5,2,3])
I want the dot product of each slice along 2 and 0 axes respectively:
c = np.zeros([3,3,5]) # c.size is 45
c[:,:,0] = a[:,:,0].dot(b[0,:,:])
c[:,:,1] = a[:,:,1].dot(b[1,:,:])
...
I would like to do that using np.tensordot (for efficiency and speed)
I have tried:
c = np.tensordot(a, b, axes=[2,0])
but I get a 4D array with 36 elements (instead of 45). c.shape, c.size = ((3L, 2L, 2L, 3L), 36). I have found a similar question here (Numpy tensor: Tensordot over frontal slices of tensor) but it's not exactly what I want, and I was unable to extrapolate that solution to my problem.
To summarise, can I use np.tensordot to compute the c array shown above?
Update #1
The answer by @hpaulj is what I wanted; however, on my system (python 2.7 and np 1.13.3) those approaches are pretty slow:
n = 3000
a = np.random.normal(size=[n, 20, 5])
b = np.random.normal(size=[5, 20, n])
t = time.clock()
c_slice = a[:,:,0].dot(b[0,:,:])
print('one slice_x_5: {:.3f} seconds'.format( (time.clock()-t)*5 ))
t = time.clock()
c = np.zeros([n, n, 5])
for i in range(5):
    c[:,:,i] = a[:,:,i].dot(b[i,:,:])
print('for loop: {:.3f} seconds'.format(time.clock()-t))
t = time.clock()
d = np.einsum('abi,ibd->adi', a, b)
print('einsum: {:.3f} seconds'.format(time.clock()-t))
t = time.clock()
e = np.tensordot(a,b,[1,1])
e1 = e.transpose(0,3,1,2)[:,:,np.arange(5),np.arange(5)]
print('tensordot: {:.3f} seconds'.format(time.clock()-t))
a = a.transpose(2,0,1)
t = time.clock()
f = np.matmul(a,b)
print('matmul: {:.3f} seconds'.format(time.clock()-t))

It's easier to work with einsum than tensordot. So let's start there:
In [469]: a = np.random.normal(size=[3,2,5])
...: b = np.random.normal(size=[5,2,3])
...:
In [470]: c = np.zeros([3,3,5]) # c.size is 45
In [471]: for i in range(5):
...:     c[:,:,i] = a[:,:,i].dot(b[i,:,:])
...:
In [472]: d = np.einsum('abi,ibd->iad', a, b)
In [473]: d.shape
Out[473]: (5, 3, 3)
In [474]: d = np.einsum('abi,ibd->adi', a, b)
In [475]: d.shape
Out[475]: (3, 3, 5)
In [476]: np.allclose(c,d)
Out[476]: True
I had to think a bit about how to match up the dimensions. It helped to focus on a[:,:,i] as 2d, and similarly for b[i,:,:]. So the dot sum is over the middle dimension of both arrays (size 2).
In testing ideas it might help if the first 2 dimensions of c were different. There'd be less chance of mixing them up.
It's easy to specify the dot summation axis (axes) in tensordot, but harder to constrain the handling of the other dimensions. That's why you get a 4d array.
I can get it to work with a transpose, followed by taking the diagonal:
In [477]: e = np.tensordot(a,b,[1,1])
In [478]: e.shape
Out[478]: (3, 5, 5, 3)
In [479]: e1 = e.transpose(0,3,1,2)[:,:,np.arange(5),np.arange(5)]
In [480]: e1.shape
Out[480]: (3, 3, 5)
In [481]: np.allclose(c,e1)
Out[481]: True
I've calculated a lot more values than needed, and thrown most of them away.
matmul with some transposing might work better.
In [482]: f = a.transpose(2,0,1)@b
In [483]: f.shape
Out[483]: (5, 3, 3)
In [484]: np.allclose(c, f.transpose(1,2,0))
Out[484]: True
I think of the 5 dimension as 'going-along-for-ride'. That's what your loop does. In einsum the i is the same in all parts.
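Following the suggestion above to make the first two dimensions of c different, here is a small self-contained sketch. Changing the last axis of b to 4 is an assumption made for illustration; with it, the output axes cannot be confused. It checks the einsum and matmul forms against the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 2, 5))
b = rng.normal(size=(5, 2, 4))   # last axis is 4 so the two output axes differ

# Reference: one 2-D dot product per slice, as in the question.
c = np.zeros((3, 4, 5))
for i in range(5):
    c[:, :, i] = a[:, :, i].dot(b[i, :, :])

# einsum: i rides along, the shared middle axis (size 2) is summed.
d = np.einsum('abi,ibd->adi', a, b)

# matmul: move the batch axis to the front, multiply, move it back to the end.
f = np.matmul(a.transpose(2, 0, 1), b).transpose(1, 2, 0)

print(c.shape, np.allclose(c, d), np.allclose(c, f))
```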

What is the fastest way to multiply with extremely sparse matrix?

I have an extremely sparse structured matrix. My matrix has exactly one nonzero entry per column. But it's huge (10k x 1M) and given in the following form (using random values as an example):
rows = np.random.randint(0, 10000, 1000000)
values = np.random.randint(0,10,1000000)
where rows gives the row number of the nonzero entry in each column. I want fast multiplication by this matrix, S. Right now I convert this form to a sparse matrix S and compute S.dot(X) for a matrix X (which can be sparse or dense).
S = scipy.sparse.csr_matrix((values, (rows, np.arange(1000000))), shape=(10000, 1000000))
For X of size 1M x 2500 with nnz(X) = 8M, this takes 178 ms to create S and 255 ms to apply it. So my question is: what is the best way of computing SX (where X could be sparse or dense), given that S is as described? Since creating S is itself time-consuming, I was thinking of something ad hoc. I tried writing something with loops, but it's not even close.
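Simple looping procedure looks something like this
def loop_mult(rows, values, X, nrows):
    # nrows is the number of rows of S (10000 here), not rows.size
    SX = np.zeros((nrows, X.shape[1]))
    for i in range(X.shape[0]):
        SX[rows[i],:] += values[i]*X[i,:]
    return SX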
Can we make this efficient?
Any suggestions are greatly appreciated. Thanks

Approach #1
Given that there's exactly one entry per column in the first input, we could use np.bincount on the inputs rows, values and X, which also avoids creating the sparse matrix S -
def sparse_matrix_mult(rows, values, X):
    nrows = rows.max()+1
    ncols = X.shape[1]
    nelem = nrows * ncols
    ids = rows + nrows*np.arange(ncols)[:,None]
    sums = np.bincount(ids.ravel(), (X.T*values).ravel(), minlength=nelem)
    out = sums.reshape(ncols,-1).T
    return out
Sample run -
In [746]: import numpy as np
...: from scipy.sparse import csr_matrix
...: import scipy as sp
...:
In [747]: np.random.seed(1234)
...: m,n = 3,4
...: rows = np.random.randint(0, m, n)
...: values = np.random.randint(2,10,n)
...: X = np.random.randint(2, 10, (n,5))
...:
In [748]: S = csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))
In [749]: S.dot(X)
Out[749]:
array([[42, 27, 45, 78, 87],
[24, 18, 18, 12, 24],
[18, 6, 8, 16, 10]])
In [750]: sparse_matrix_mult(rows, values, X)
Out[750]:
array([[ 42., 27., 45., 78., 87.],
[ 24., 18., 18., 12., 24.],
[ 18., 6., 8., 16., 10.]])
Approach #2
Using np.add.reduceat to replace np.bincount -
def sparse_matrix_mult_v2(rows, values, X):
    nrows = rows.max()+1
    ncols = X.shape[1]
    scaled_ar = X*values[:,None]
    sidx = rows.argsort()
    rows_s = rows[sidx]
    cut_idx = np.concatenate(([0],np.flatnonzero(rows_s[1:] != rows_s[:-1])+1))
    sums = np.add.reduceat(scaled_ar[sidx],cut_idx,axis=0)
    # zeros, not empty: rows that never occur in `rows` must stay 0
    out = np.zeros((nrows, ncols),dtype=sums.dtype)
    row_idx = rows_s[cut_idx]
    out[row_idx] = sums
    return out
Runtime test
I could not run it on the sizes mentioned in the question, as those were too big for me to handle. So, on reduced datasets, here's what I am getting -
In [149]: m,n = 1000, 100000
...: rows = np.random.randint(0, m, n)
...: values = np.random.randint(2,10,n)
...: X = np.random.randint(2, 10, (n,2500))
...:
In [150]: S = csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))
In [151]: %timeit csr_matrix( (values, (rows, sp.arange(n))), shape = (m,n))
100 loops, best of 3: 16.1 ms per loop
In [152]: %timeit S.dot(X)
1 loop, best of 3: 193 ms per loop
In [153]: %timeit sparse_matrix_mult(rows, values, X)
1 loop, best of 3: 4.4 s per loop
In [154]: %timeit sparse_matrix_mult_v2(rows, values, X)
1 loop, best of 3: 2.81 s per loop
So, the proposed methods don't seem to over-power numpy.dot on performance, but they should be good on memory efficiency.
For sparse X
For sparse X, we need some modifications, as shown in the modified method below -
from scipy.sparse import find
def sparse_matrix_mult_sparseX(rows, values, Xs): # Xs is sparse
    nrows = rows.max()+1
    ncols = Xs.shape[1]
    nelem = nrows * ncols
    scaled_vals = Xs.multiply(values[:,None])
    r,c,v = find(scaled_vals)
    ids = rows[r] + c*nrows
    sums = np.bincount(ids, v, minlength=nelem)
    out = sums.reshape(ncols,-1).T
    return out
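For dense X, the scatter-add that both approaches implement can also be written directly with np.add.at, which accumulates correctly when the same row index occurs several times. This is a sketch of an alternative, not one of the methods timed above: the function name scatter_add_mult and the explicit nrows argument are assumptions, and no performance claim is made for it.

```python
import numpy as np

def scatter_add_mult(rows, values, X, nrows):
    """Compute S.dot(X) where column i of S holds values[i] in row rows[i]."""
    out = np.zeros((nrows, X.shape[1]), dtype=np.result_type(values, X))
    # Unbuffered in-place add: repeated entries in `rows` are all accumulated.
    np.add.at(out, rows, values[:, None] * X)
    return out

rng = np.random.default_rng(1234)
m, n = 3, 4
rows = rng.integers(0, m, n)
values = rng.integers(2, 10, n)
X = rng.integers(2, 10, (n, 5))

# Dense reference matrix S, built only to check the result.
S = np.zeros((m, n), dtype=values.dtype)
S[rows, np.arange(n)] = values
print(np.array_equal(scatter_add_mult(rows, values, X, m), S.dot(X)))
```

Passing nrows explicitly keeps the output shape right even when the highest row indices happen not to occur in rows.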

Inspired by this post, Fastest way to sum over rows of sparse matrix, I have found the best way to do this is to write loops and port things to numba. Here is the code:
from numba import njit
import numpy as np

@njit
def sparse_mul(SX, row, col, data, values, row_map):
    N = len(data)
    for idx in range(N):
        SX[row_map[row[idx]], col[idx]] += data[idx]*values[row[idx]]
    return SX

X_coo = X.tocoo()
s = row_map.max()+1
SX = np.zeros((s, X.shape[1]))
sparse_mul(SX, X_coo.row, X_coo.col, X_coo.data, values, row_map)
Here row_map is the rows in the question. On X of size (1M x 1K), 1% sparsity and with s=10K this is twice as fast as forming a sparse matrix from row_map and doing S.dot(A).

As I recall, Knuth's TAOCP talks about representing a sparse matrix instead as a linked list of (for your app) non-zero values. Maybe something like that? Then traverse the linked list rather than the entire array by each dimension.

Dot product of two sparse matrices affecting zero values only

I'm trying to compute a simple dot product but leave nonzero values from the original matrix unchanged. A toy example:
import numpy as np
A = np.array([[2, 1, 1, 2],
[0, 2, 1, 0],
[1, 0, 1, 1],
[2, 2, 1, 0]])
B = np.array([[ 0.54331039, 0.41018682, 0.1582158 , 0.3486124 ],
[ 0.68804647, 0.29520239, 0.40654206, 0.20473451],
[ 0.69857579, 0.38958572, 0.30361365, 0.32256483],
[ 0.46195299, 0.79863505, 0.22431876, 0.59054473]])
Desired outcome:
C = np.array([[ 2. , 1. , 1. , 2. ],
[ 2.07466874, 2. , 1. , 0.73203386],
[ 1. , 1.5984076 , 1. , 1. ],
[ 2. , 2. , 1. , 1.42925865]])
The actual matrices in question, however, are sparse and look more like this:
A = sparse.rand(250000, 1700, density=0.001, format='csr')
B = sparse.rand(1700, 1700, density=0.02, format='csr')
One simple way to go would be just setting the values using a mask index, like this:
mask = A != 0
C = A.dot(B)
C[mask] = A[mask]
However, my original arrays are sparse and quite large, so changing them via index assignment is painfully slow. Conversion to lil matrix helps a bit, but again, conversion itself takes a lot of time.
The other obvious approach, I guess, would be just resort to iteration and skip masked values, but I'd like not to throw away the benefits of numpy/scipy-optimized array multiplication.
Some clarifications: I'm actually interested in some kind of special case, where B is always square, and therefore, A and C are of the same shape. So if there's a solution that doesn't work on arbitrary arrays but fits in my case, that's fine.
UPDATE: Some attempts:
from scipy import sparse
import numpy as np
def naive(A, B):
    mask = A != 0
    out = A.dot(B).tolil()
    out[mask] = A[mask]
    return out.tocsr()
def proposed(A, B):
    Az = A == 0
    R, C = np.where(Az)
    out = A.copy()
    out[Az] = np.einsum('ij,ji->i', A[R], B[:, C])
    return out
%timeit naive(A, B)
1 loops, best of 3: 4.04 s per loop
%timeit proposed(A, B)
/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py:215: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.pyc in __init__(self, arg1, shape, dtype, copy)
173 self.shape = M.shape
174
--> 175 self.row, self.col = M.nonzero()
176 self.data = M[self.row, self.col]
177 self.has_canonical_format = True
MemoryError:
ANOTHER UPDATE:
Couldn't make anything more or less useful out of Cython, at least without going too far away from Python. The idea was to leave the dot product to scipy and just try to set those original values as fast as possible, something like this:
cimport cython
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef coo_replace(int [:] row1, int [:] col1, float [:] data1, int[:] row2, int[:] col2, float[:] data2):
    cdef int N = row1.shape[0]
    cdef int M = row2.shape[0]
    cdef int i, j
    cdef dict d = {}
    for i in range(M):
        d[(row2[i], col2[i])] = data2[i]
    for j in range(N):
        if (row1[j], col1[j]) in d:
            data1[j] = d[(row1[j], col1[j])]
This was a bit better than my pre-first "naive" implementation (using .tolil()), but following hpaulj's approach, lil can be thrown out. Maybe replacing the Python dict with something like std::map would help.

A possibly cleaner and faster version of your naive code:
In [57]: r,c=A.nonzero() # this uses A.tocoo()
In [58]: C=A*B
In [59]: Cl=C.tolil()
In [60]: Cl[r,c]=A.tolil()[r,c]
In [61]: Cl.tocsr()
C[r,c]=A[r,c] gives an efficiency warning, but I think that's aimed more at people who do that kind of assignment in a loop.
In [63]: %%timeit C=A*B
...: C[r,c]=A[r,c]
...
The slowest run took 7.32 times longer than the fastest....
1000 loops, best of 3: 334 µs per loop
In [64]: %%timeit C=A*B
...: Cl=C.tolil()
...: Cl[r,c]=A.tolil()[r,c]
...: Cl.tocsr()
...:
100 loops, best of 3: 2.83 ms per loop
My A is small, only (250,100), but it looks like the round trip to lil isn't a time saver, despite the warning.
Masking with A==0 is bound to give problems when A is sparse
In [66]: Az=A==0
....SparseEfficiencyWarning...
In [67]: r1,c1=Az.nonzero()
Compared to the nonzero r for A, this r1 is much larger - the row index of all zeros in the sparse matrix; everything but the 25 nonzeros.
In [70]: r.shape
Out[70]: (25,)
In [71]: r1.shape
Out[71]: (24975,)
If I index A with that r1 I get a much larger array. In effect I am repeating each row by the number of zeros in it
In [72]: A[r1,:]
Out[72]:
<24975x100 sparse matrix of type '<class 'numpy.float64'>'
with 2473 stored elements in Compressed Sparse Row format>
In [73]: A
Out[73]:
<250x100 sparse matrix of type '<class 'numpy.float64'>'
with 25 stored elements in Compressed Sparse Row format>
I've increased the shape and number of nonzero elements by a factor of roughly 100 (the number of columns).
Defining foo, and copying Divakar's tests:
def foo(A,B):
    r,c = A.nonzero()
    C = A*B
    C[r,c] = A[r,c]
    return C
In [83]: timeit naive(A,B)
100 loops, best of 3: 2.53 ms per loop
In [84]: timeit proposed(A,B)
/...
SparseEfficiencyWarning)
100 loops, best of 3: 4.48 ms per loop
In [85]: timeit foo(A,B)
...
SparseEfficiencyWarning)
100 loops, best of 3: 2.13 ms per loop
So my version has a modest speed improvement. As Divakar found out, changing sparsity changes the relative advantages. I expect size to also change them.
The fact that A.nonzero uses the coo format, suggests it might be feasible to construct the new array with that format. A lot of sparse code builds a new matrix via the coo values.
In [97]: Co=C.tocoo()
In [98]: Ao=A.tocoo()
In [99]: r=np.concatenate((Co.row,Ao.row))
In [100]: c=np.concatenate((Co.col,Ao.col))
In [101]: d=np.concatenate((Co.data,Ao.data))
In [102]: r.shape
Out[102]: (79,)
In [103]: C1=sparse.csr_matrix((d,(r,c)),shape=A.shape)
In [104]: C1
Out[104]:
<250x100 sparse matrix of type '<class 'numpy.float64'>'
with 78 stored elements in Compressed Sparse Row format>
This C1 has, I think, the same non-zero elements as the C constructed by other means. But I think one value is different because the r is longer. In this particular example, C and A share one nonzero element, and the coo style of input sums those, whereas we'd prefer to have A values overwrite everything.
If you can tolerate this discrepancy, this is a faster way (at least for this test case):
def bar(A,B):
    C=A*B
    Co=C.tocoo()
    Ao=A.tocoo()
    r=np.concatenate((Co.row,Ao.row))
    c=np.concatenate((Co.col,Ao.col))
    d=np.concatenate((Co.data,Ao.data))
    return sparse.csr_matrix((d,(r,c)),shape=A.shape)
In [107]: timeit bar(A,B)
1000 loops, best of 3: 1.03 ms per loop
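If the summed overlap is not acceptable, one possible way to stay within sparse arithmetic is to zero out the entries of C that sit where A is nonzero before adding A. The sketch below is a swapped-in technique, not something timed in this answer: the function name overwrite_with_a is hypothetical and its speed relative to foo and bar is untested.

```python
import numpy as np
from scipy import sparse

def overwrite_with_a(A, B):
    """Return A.dot(B) with every position where A is nonzero replaced by A's value."""
    C = A.dot(B)
    mask = (A != 0)              # sparse boolean matrix, same pattern as A
    C = C - C.multiply(mask)     # cancel C exactly where A has a stored nonzero
    C.eliminate_zeros()          # drop the explicit zeros left by the subtraction
    return (C + A).tocsr()

A = sparse.csr_matrix(np.array([[2., 0., 1.],
                                [0., 0., 3.],
                                [0., 4., 0.]]))
B = sparse.csr_matrix(np.array([[0., 1., 0.],
                                [2., 0., 0.],
                                [0., 0., 5.]]))
print(overwrite_with_a(A, B).toarray())
```

This assumes B is square, as stated in the question, so that A and A.dot(B) have the same shape.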

Cracked it! Well, there's a lot of scipy stuffs specific to sparse matrices that I learnt along the way. Here's the implementation that I could muster -
# Find the indices in output array that are to be updated
R,C = ((A!=0).dot(B!=0)).nonzero()
mask = np.asarray(A[R,C]==0).ravel()
R,C = R[mask],C[mask]
# Make a copy of A and get the dot product through sliced rows and columns
# off A and B using the definition of matrix-multiplication
out = A.copy()
out[R,C] = (A[R].multiply(B[:,C].T).sum(1)).ravel()
The most expensive part seems to be the element-wise multiplication and summing. In some quick timing tests, this approach beats the original dot-and-mask solution only on matrices with a high degree of sparsity, which I think comes from its focus on memory efficiency.
Runtime test
Function definitions -
def naive(A, B):
    mask = A != 0
    out = A.dot(B).tolil()
    out[mask] = A[mask]
    return out.tocsr()
def proposed(A, B):
    R,C = ((A!=0).dot(B!=0)).nonzero()
    mask = np.asarray(A[R,C]==0).ravel()
    R,C = R[mask],C[mask]
    out = A.copy()
    out[R,C] = (A[R].multiply(B[:,C].T).sum(1)).ravel()
    return out
Timings -
In [57]: # Input matrices
...: M,N = 25000, 170
...: A = sparse.rand(M, N, density=0.001, format='csr')
...: B = sparse.rand(N, N, density=0.02, format='csr')
...:
In [58]: %timeit naive(A, B)
10 loops, best of 3: 92.2 ms per loop
In [59]: %timeit proposed(A, B)
10 loops, best of 3: 132 ms per loop
In [60]: # Input matrices with increased sparse-ness
...: M,N = 25000, 170
...: A = sparse.rand(M, N, density=0.0001, format='csr')
...: B = sparse.rand(N, N, density=0.002, format='csr')
...:
In [61]: %timeit naive(A, B)
10 loops, best of 3: 78.1 ms per loop
In [62]: %timeit proposed(A, B)
100 loops, best of 3: 8.03 ms per loop

Python isn't my main language, but I thought this was an interesting problem and I wanted to give this a stab :)
Preliminaries:
import numpy
import scipy.sparse
# example matrices and sparse versions
A = numpy.array([[1, 2, 0, 1], [1, 0, 1, 2], [0, 1, 2 ,1], [1, 2, 1, 0]])
B = numpy.array([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]])
A_s = scipy.sparse.lil_matrix(A)
B_s = scipy.sparse.lil_matrix(B)
So you want to convert the original problem of:
C = A.dot(B)
C[A.nonzero()] = A[A.nonzero()]
To something sparse-y.
Just to get this out of the way, the direct "sparse" translation of the above is:
C_s = A_s.dot(B_s)
C_s[A_s.nonzero()] = A_s[A_s.nonzero()]
But it sounds like you're not happy about this, as it calculates all the dot products first, which you worry might be inefficient.
So, your question is, if you find the zeros first, and only evaluate dot products on those elements, will that be faster? I.e. for a dense matrix this could be something like:
Xs, Ys = numpy.nonzero(A==0)
D = A.copy() # A[:] would only be a view, so writing to D would also change A
D[Xs, Ys] = list(map ( lambda x,y: A[x,:].dot(B[:,y]), Xs, Ys))
Let's translate this to a sparse matrix. My main stumbling block here was finding the "Zero" indices; since A_s==0 doesn't make sense for sparse matrices, I found them this way:
Xmax, Ymax = A_s.shape
DenseSize = Xmax * Ymax
Xgrid, Ygrid = numpy.mgrid[0:Xmax, 0:Ymax]
Ygrid = Ygrid.reshape([DenseSize,1])[:,0]
Xgrid = Xgrid.reshape([DenseSize,1])[:,0]
AllIndices = numpy.array([Xgrid, Ygrid])
NonzeroIndices = numpy.array(A_s.nonzero())
ZeroIndices = numpy.array([x for x in AllIndices.T.tolist() if x not in NonzeroIndices.T.tolist()]).T
If you know of a better / faster way, by all means try it. Once we have the Zero indices, we can do a similar mapping as before:
D_s = A_s[:]
D_s[ZeroIndices[0], ZeroIndices[1]] = list(map ( lambda x, y : A_s[x,:].dot(B[:,y])[0], ZeroIndices[0], ZeroIndices[1] ))
which gives you your sparse matrix result.
Now I don't know if this is faster or not. I mostly took a stab because it was an interesting problem, and to see if I could do it in python. In fact I suspect it might not be faster than the direct whole-matrix dot product, because it uses list comprehensions and mapping on a large dataset (like you say, you expect a lot of zeros). But it is an answer to your question of "how can I only calculate dot products for the zero values without multiplying the matrices as a whole". If you do try this, I'd be interested to see how it compares in terms of speed on your datasets.
EDIT: I'm putting below an example "block processing" version based on the above, which I think should allow you to process your large dataset without problems. Let me know if it works.
import numpy
import scipy.sparse
# example matrices and sparse versions
A = numpy.array([[1, 2, 0, 1], [1, 0, 1, 2], [0, 1, 2 ,1], [1, 2, 1, 0]])
B = numpy.array([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]])
A_s = scipy.sparse.lil_matrix(A)
B_s = scipy.sparse.lil_matrix(B)
# Choose a grid division (i.e. how many processing blocks you want to create)
BlockGrid = numpy.array([2,2])
D_s = A_s[:] # initialise from A
Xmax, Ymax = A_s.shape
BaseBSiz = numpy.array([Xmax, Ymax]) // BlockGrid # integer block size
for BIndX in range(0, Xmax, BaseBSiz[0]):
    for BIndY in range(0, Ymax, BaseBSiz[1]):
        BSizX, BSizY = D_s[ BIndX : BIndX + BaseBSiz[0], BIndY : BIndY + BaseBSiz[1] ].shape
        Xgrid, Ygrid = numpy.mgrid[BIndX : BIndX + BSizX, BIndY : BIndY + BSizY]
        Xgrid = Xgrid.reshape([BSizX*BSizY,1])[:,0]
        Ygrid = Ygrid.reshape([BSizX*BSizY,1])[:,0]
        AllInd = numpy.array([Xgrid, Ygrid]).T
        NZeroInd = numpy.array(A_s[Xgrid, Ygrid].reshape((BSizX,BSizY)).nonzero()).T + numpy.array([[BIndX],[BIndY]]).T
        ZeroInd = numpy.array([x for x in AllInd.tolist() if x not in NZeroInd.tolist()]).T
        #
        # Replace zero-values in current block
        D_s[ZeroInd[0], ZeroInd[1]] = list(map ( lambda x, y : A_s[x,:].dot(B[:,y])[0], ZeroInd[0], ZeroInd[1] ))

Efficiently slice windows from a 1D numpy array, around indices given by second 2D array

I want to extract multiple slices from the same 1D numpy array, where the slice indices are drawn from a random distribution. Basically, I want to achieve the following:
import numpy as np
import numpy.random
# generate some 1D data
data = np.random.randn(500)
# window size (slices are 2*winsize long)
winsize = 60
# number of slices to take from the data
inds_size = (100, 200)
# get random integers that function as indices into the data
inds = np.random.randint(low=winsize, high=len(data)-winsize, size=inds_size)
# now I want to extract slices of data, running from inds[0,0]-60 to inds[0,0]+60
sliced_data = np.zeros( (winsize*2,) + inds_size )
for k in range(inds_size[0]):
    for l in range(inds_size[1]):
        sliced_data[:,k,l] = data[inds[k,l]-winsize:inds[k,l]+winsize]
# sliced_data.shape is now (120, 100, 200)
The above nested loop works fine, but is very slow. In my real code, I will need to do this thousands of times, for data arrays a lot bigger than these. Is there any way to do this more efficiently?
Note that inds will always be 2D in my case, but after getting the slices I will always be summing over one of these two dimensions, so an approach that only accumulates the sum across the one dimension would be fine.
I found this question and this answer which seem almost the same. However, the question is only about a 1D indexing vector (as opposed to my 2D). Also, the answer lacks a bit of context, as I don't really understand how the suggested as_strided works. Since my problem does not seem uncommon, I thought I'd ask again in the hope of a more explanatory answer rather than just code.

Using as_strided in this way appears to be somewhat faster than Divakar's approach (20 ms vs 35 ms here), although memory usage might be an issue.
data_wins = as_strided(data, shape=(data.size - 2*winsize + 1, 2*winsize), strides=(8, 8))
inds = np.random.randint(low=0, high=data.size - 2*winsize, size=inds_size)
sliced = data_wins[inds]
sliced = sliced.transpose((2, 0, 1)) # to use the same index order as before
Strides are the steps in bytes for the index in each dimension. For example, with an array of shape (x, y, z) and a data type of size d (8 for float64), the strides will ordinarily be (y*z*d, z*d, d), so that the second index steps over whole rows of z items. With both strides set to 8, data_wins[i, j] and data_wins[j, i] refer to the same memory location, because each lies (i + j)*8 bytes from the start of the data.
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>> a = np.arange(10, dtype=np.int8)
>>> as_strided(a, shape=(3, 10 - 2), strides=(1, 1))
array([[0, 1, 2, 3, 4, 5, 6, 7],
[1, 2, 3, 4, 5, 6, 7, 8],
[2, 3, 4, 5, 6, 7, 8, 9]], dtype=int8)
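On NumPy 1.20 or later, the same windowed view can be built without hand-written byte strides by using numpy.lib.stride_tricks.sliding_window_view. The sketch below is an alternative to the as_strided call above, under the assumption that inds holds window centres as in the question; the reduced inds shape is only there to keep the example small.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
data = rng.standard_normal(500)
winsize = 60
inds = rng.integers(winsize, len(data) - winsize, size=(10, 20))

# Row s of `windows` is a read-only view of data[s : s + 2*winsize].
windows = sliding_window_view(data, 2 * winsize)
# The window centred on index i starts at i - winsize.
sliced = windows[inds - winsize]       # shape (10, 20, 120)
sliced = sliced.transpose(2, 0, 1)     # shape (120, 10, 20), index order of the question

print(windows.shape, sliced.shape)
```

The view itself costs no extra memory; the fancy-indexing step is what materialises the full result.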

Here's a vectorized approach using broadcasting -
# Get 3D offsetting array and add to inds for all indices
allinds = inds + np.arange(-60,60)[:,None,None]
# Index into data with all indices for desired output
sliced_dataout = data[allinds]
Runtime test -
In [20]: # generate some 1D data
...: data = np.random.randn(500)
...:
...: # window size (slices are 2*winsize long)
...: winsize = 60
...:
...: # number of slices to take from the data
...: inds_size = (100, 200)
...:
...: # get random integers that function as indices into the data
...: inds=np.random.randint(low=winsize,high=len(data)-winsize, size=inds_size)
...:
In [21]: %%timeit
...: sliced_data = np.zeros( (winsize*2,) + inds_size )
...: for k in range(inds_size[0]):
...:     for l in range(inds_size[1]):
...:         sliced_data[:,k,l] = data[inds[k,l]-winsize:inds[k,l]+winsize]
...:
10 loops, best of 3: 66.9 ms per loop
In [22]: %%timeit
...: allinds = inds + np.arange(-60,60)[:,None,None]
...: sliced_dataout = data[allinds]
...:
10 loops, best of 3: 24.1 ms per loop
Memory consumption : Compromise solution
If memory consumption is an issue, here's a compromise solution with one loop -
sliced_dataout = np.zeros( (winsize*2,) + inds_size )
for k in range(sliced_dataout.shape[0]):
    sliced_dataout[k] = data[inds-winsize+k]
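Since the question mentions that the slices are always summed over one of the two inds dimensions afterwards, the same one-loop idea can accumulate that sum directly and never store the full 3D array. This is a sketch under that assumption; the function name windowed_sum is made up, and axis is taken to be a non-negative axis of inds.

```python
import numpy as np

def windowed_sum(data, inds, winsize, axis=0):
    """Sum the windows data[i-winsize:i+winsize] over one axis of `inds`."""
    rest = inds.shape[:axis] + inds.shape[axis + 1:]   # inds shape without `axis`
    out = np.empty((2 * winsize,) + rest, dtype=data.dtype)
    for k in range(2 * winsize):
        # Offset k of every window at once, reduced over the chosen axis.
        out[k] = data[inds - winsize + k].sum(axis=axis)
    return out

rng = np.random.default_rng(0)
data = rng.standard_normal(500)
winsize = 60
inds = rng.integers(winsize, len(data) - winsize, size=(100, 200))
print(windowed_sum(data, inds, winsize, axis=0).shape)
```

The largest temporary here has the shape of inds, instead of 2*winsize times that.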
