I built a sparse matrix M in Python using the coo_matrix format. I would like to find an efficient way to compute:
A = M + M.T - D
where D is the restriction of M to its diagonal (M is potentially very large). I can't find a way to efficiently build D while keeping a coo_matrix format. Any ideas?
Could D = scipy.sparse.spdiags(M.diagonal(), 0, M.shape[0], M.shape[0]) be a solution?
I have come up with a faster coo diagonal:
msk = M.row==M.col
D1 = sparse.coo_matrix((M.data[msk],(M.row[msk],M.col[msk])),shape=M.shape)
sparse.tril uses this method with mask = A.row + k >= A.col (sparse/extract.py)
Some timings for a (100,100) M (and M1 = M.tocsr()):
In [303]: timeit msk=M.row==M.col; D1=sparse.coo_matrix((M.data[msk],(M.row[msk],M.col[msk])),shape=M.shape)
10000 loops, best of 3: 115 µs per loop
In [305]: timeit D=sparse.diags(M.diagonal(),0)
1000 loops, best of 3: 358 µs per loop
So the coo way of getting the diagonal is fast, at least for this small and very sparse matrix (only one non-zero entry on the diagonal).
If I start with the csr form, diags is faster. That's because .diagonal works directly in the csr format:
In [306]: timeit D=sparse.diags(M1.diagonal(),0)
10000 loops, best of 3: 176 µs per loop
But creating D is a small part of the overall calculation. Again, working with M1 is faster. The sum is done in csr format.
In [307]: timeit M+M.T-D
1000 loops, best of 3: 1.35 ms per loop
In [308]: timeit M1+M1.T-D
1000 loops, best of 3: 1.11 ms per loop
Another way to do the whole thing is to take advantage of the fact that coo allows duplicate i,j values, which are summed when the matrix is converted to csr format. So you could stack the row, col, data arrays for M with those for M.T (see M.transpose for how those are constructed), along with masked values for D (or the masked diagonal entries could simply be dropped from M or M.T).
For example:
def MplusMT(M):
    msk = M.row != M.col
    data = np.concatenate([M.data, M.data[msk]])
    rows = np.concatenate([M.row, M.col[msk]])
    cols = np.concatenate([M.col, M.row[msk]])
    MM = sparse.coo_matrix((data, (rows, cols)), shape=M.shape)
    return MM

# alt version with a more explicit D:
# msk = M.row == M.col
# data = np.concatenate([M.data, M.data, -M.data[msk]])
MplusMT as written is very fast because it just does array concatenation, not summation. To actually sum the duplicates we have to convert the result to a csr matrix:
MplusMT(M).tocsr()
which takes considerably longer. Still, this approach is, in my limited testing, more than 2x faster than M+M.T-D, so it's a potential tool for constructing complex sparse matrices.
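For instance, a quick equivalence check (a sketch; sparse.random, the shape and the density here are arbitrary choices):

import numpy as np
from scipy import sparse

M = sparse.random(100, 100, density=0.01, format='coo')
D = sparse.diags(M.diagonal(), 0)
A1 = (M + M.T - D).toarray()
A2 = MplusMT(M).tocsr().toarray()
print(np.allclose(A1, A2))  # expect True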
You probably want
from scipy.sparse import diags
D = diags(M.diagonal(), 0, format='coo')
This still builds a length-M.shape[0] 1d array as an intermediate step, but that will probably not be so bad.
I'd like to correlate the columns of an mxn matrix with a 1xm array. This should give me a 1xn array back. At the moment I am doing this a bit clumsily with:
c = np.corrcoef(X, y)[:-1,-1]
The correlations I want are in the last column, and the last row/column corresponds to the correlation of the array with itself (so r = 1.0). This is fine, but I need to do this on quite big matrices, and that is when it becomes too computationally heavy and my computer gives up.
For example the largest matrix I am doing this for has the size:
48x290400 (= X) and 48x1 (=y), where I want to end up with 290400 r-values
This works fine in Matlab, but not in Python using np.corrcoef. Anyone got a good solution for this?
Cheers
Daniel
We could use corr2_coeff from this post after transposing the input arrays -
corr2_coeff(a.T,b.T).ravel()
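For completeness, corr2_coeff from that post is essentially the following (a sketch from memory; see the linked post for the exact version):

def corr2_coeff(A, B):
    # row-wise mean-center both inputs
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    # correlation matrix: covariances divided by the product of the norms
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))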
Sample run -
In [160]: a = np.random.rand(3, 5)
In [161]: b = np.random.rand(3, 1)
# Proposed in the question
In [162]: np.corrcoef(a.T, b.T)[:-1,-1]
Out[162]: array([-0.0716, 0.1905, 0.9699, 0.7482, -0.1511])
# Proposed in this post
In [163]: corr2_coeff(a.T,b.T).ravel()
Out[163]: array([-0.0716, 0.1905, 0.9699, 0.7482, -0.1511])
Runtime test -
In [171]: a = np.random.rand(48, 10000)
In [172]: b = np.random.rand(48, 1)
In [173]: %timeit np.corrcoef(a.T, b.T)[:-1,-1]
1 loops, best of 3: 619 ms per loop
In [174]: %timeit corr2_coeff(a.T,b.T).ravel()
1000 loops, best of 3: 1.72 ms per loop
In [176]: 619.0/1.72
Out[176]: 359.8837209302326
Massive 360x speedup there!
Scaling it further -
In [239]: a = np.random.rand(48, 29040)
In [240]: b = np.random.rand(48, 1)
In [241]: %timeit np.corrcoef(a.T, b.T)[:-1,-1]
1 loops, best of 3: 5.19 s per loop
In [242]: %timeit corr2_coeff(a.T,b.T).ravel()
100 loops, best of 3: 8.09 ms per loop
In [244]: 5190.0/8.09
Out[244]: 641.5327564894932
640x+ speedup on this bigger dataset and should scale better as we go towards actual dataset sizes!
I am working with scipy's csc sparse matrix and currently a major bottleneck in the code is a line similar to the following
for i in range(multiply_cols.shape[0]):
    F = F - factor*values[i]*mat.getcol(multiply_cols[i])
The matrices that I am working with are extremely large, typically of size more than 10**6 x 10**6, and I don't want to convert them to dense matrices. In fact I have a restriction to always keep the matrix in csc format. My attempts show that converting to coo_matrix or lil_matrix also does not pay off.
Here are my rudimentary attempts using csc, csr, coo and lil:
n=1000
sA = csc_matrix(np.random.rand(n,n))
F = np.random.rand(n,1)
multiply_cols = np.unique(np.random.randint(0,int(0.6*n),size=n))
values = np.random.rand(multiply_cols.shape[0])
def foo1(mat, F, values, multiply_cols):
    factor = 0.75
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo2(mat, F, values, multiply_cols):
    factor = 0.75
    mat = mat.tocsr()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo3(mat, F, values, multiply_cols):
    factor = 0.75
    mat = mat.tocoo()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])

def foo4(mat, F, values, multiply_cols):
    factor = 0.75
    mat = mat.tolil()
    for i in range(multiply_cols.shape[0]):
        F = F - factor*values[i]*mat.getcol(multiply_cols[i])
and timing them I get:
In [41]: %timeit foo1(sA,F,values,multiply_cols)
10 loops, best of 3: 133 ms per loop
In [42]: %timeit foo2(sA,F,values,multiply_cols)
1 loop, best of 3: 999 ms per loop
In [43]: %timeit foo3(sA,F,values,multiply_cols)
1 loop, best of 3: 6.38 s per loop
In [44]: %timeit foo4(sA,F,values,multiply_cols)
1 loop, best of 3: 45.1 s per loop
So certainly coo_matrix and lil_matrix are not good choices here. Does anyone know a faster way of doing this? Is it a good option to retrieve the underlying indptr, indices and data and write a custom Cython solution?
I found in Sparse matrix slicing using list of int that column (or row) indexing for sparse matrices is essentially a matrix multiplication task - construct a sparse matrix with the right mix of 1s and 0s, and multiply. Row (and column) sums are also done with multiplication.
This function implements that idea. M is a 1 column sparse matrix, with values in the multiply_cols slots:
def wghtsum(sA, values, multiply_cols):
    cols = np.zeros_like(multiply_cols)
    M = sparse.csc_matrix((values, (multiply_cols, cols)), shape=(sA.shape[1], 1))
    return (sA*M).A
testing:
In [794]: F1=wghtsum(sA,values,multiply_cols)
In [800]: F2=(sA[:,multiply_cols]*values)[:,None] # Divakar's
In [802]: np.allclose(F1,F2)
Out[802]: True
It has a decent time saving (roughly 3x) over @Divakar's solution:
In [803]: timeit F2=(sA[:,multiply_cols]*values)[:,None]
100 loops, best of 3: 18.3 ms per loop
In [804]: timeit F1=wghtsum(sA,values,multiply_cols)
100 loops, best of 3: 6.57 ms per loop
=======
sA as created is dense - it's a sparse rendition of a dense random array. sparse.rand can be used to create a sparse random matrix with a specified density.
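For example (a sketch; the density here is an arbitrary choice):

sA = sparse.rand(n, n, density=0.01, format='csc')  # ~1% of entries non-zero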
In testing your foo1 I had a problem with getcol:
In [818]: sA.getcol(multiply_cols[0])
...
TypeError: an integer is required
In [819]: sA.getcol(multiply_cols[0].item())
Out[819]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Column format>
In [822]: sA[:,multiply_cols[0]]
Out[822]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Column format>
I suspect that's caused by a scipy version difference.
In [821]: scipy.__version__
Out[821]: '0.17.0'
This issue went away in 0.18, but I can't find the relevant issue/pull request.
Well, you could use a vectorized approach that matrix-multiplies the sliced-out columns of the sparse matrix against values, like so -
F -= (mat[:,multiply_cols]*values*factor)[:,None]
Benchmarking
It seems foo1 is the fastest of the lot listed in the question. So, let's time the proposed approach against that one.
Function definitions -
def foo1(mat, F, values, multiply_cols):
    factor = 0.75
    outF = F.copy()
    for i in range(multiply_cols.shape[0]):
        outF -= factor*values[i]*mat.getcol(multiply_cols[i])
    return outF

def foo_vectorized(mat, F, values, multiply_cols):
    factor = 0.75
    return F - (mat[:,multiply_cols]*values*factor)[:,None]
Timings and verification on a bigger set with some sparsity -
In [242]: # Setup inputs
...: n = 3000
...: mat = csc_matrix(np.random.randint(0,3,(n,n))) #Sparseness with 0s
...: F = np.random.rand(n,1)
...: multiply_cols = np.unique(np.random.randint(0,int(0.6*n),size=n))
...: values = np.random.rand(multiply_cols.shape[0])
...:
In [243]: out1 = foo1(mat,F,values,multiply_cols)
In [244]: out2 = foo_vectorized(mat,F,values,multiply_cols)
In [245]: np.allclose(out1, out2)
Out[245]: True
In [246]: %timeit foo1(mat,F,values,multiply_cols)
1 loops, best of 3: 641 ms per loop
In [247]: %timeit foo_vectorized(mat,F,values,multiply_cols)
10 loops, best of 3: 40.3 ms per loop
In [248]: 641/40.3
Out[248]: 15.905707196029779
There we have a 15x+ speedup!
I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements are zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:
import numpy as np
import scipy as sp
import scipy.sparse
# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')
# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
    inds = np.random.choice(data.shape[0], size=100, replace=False)
    # 60% of time is spent here
    extracted_rows = data[inds]
    # 20% of time is spent here
    row_sum = extracted_rows.sum(axis=0)
The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.
It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
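To illustrate the data.indices idea (a sketch, not benchmarked): since every stored value is 1, a column-wise sum over selected rows is just a count of the column indices stored in those rows.

# for CSR, the column indices of row i live in indices[indptr[i]:indptr[i+1]]
sel = np.concatenate([data.indices[data.indptr[i]:data.indptr[i+1]]
                      for i in inds])
row_sum = np.bincount(sel, minlength=data.shape[1])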
It is well known that indexing of sparse matrices is relatively slow, and there have been SO questions about getting around that by accessing the data attributes directly.
But first some timings. Using data and inds as you show, I get:
In [23]: datad=data.A # times at 3.76 ms per loop
In [24]: timeit row_sumd=datad[inds].sum(axis=0)
1000 loops, best of 3: 529 µs per loop
In [25]: timeit row_sum=data[inds].sum(axis=0)
1000 loops, best of 3: 890 µs per loop
In [26]: timeit d=datad[inds]
10000 loops, best of 3: 55.9 µs per loop
In [27]: timeit d=data[inds]
1000 loops, best of 3: 617 µs per loop
The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.
The sparse sum is done with a matrix product
def sparse.spmatrix.sum(...):   # source excerpt
    ....
    return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self
That suggests a faster way - turn inds into an appropriate array of 1s and multiply.
In [49]: %%timeit
....: b=np.zeros((1,data.shape[0]),'int8')
....: b[:,inds]=1
....: rowmul=b*data
....:
1000 loops, best of 3: 587 µs per loop
That makes the sparse operation about as fast as the equivalent dense one (but converting to dense first is much slower).
==================
The last timing test is missing the np.asmatrix that is present in the sparse sum. But the times are similar, and the results are the same:
In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
1000 loops, best of 3: 661 µs per loop
In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
1000 loops, best of 3: 605 µs per loop
One produces a matrix, the other an array. But both are doing a matrix product, the 2nd dim of b against the 1st of data. Even though b is an array, the task is actually delegated to data and its matrix product - in a not so transparent way.
In [234]: x1
Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
In [235]: x2
Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
b*data.A is element multiplication and raises an error; np.dot(b,data.A) works but is slower.
Newer numpy/Python has the matmul operator (@). I see the same time pattern:
In [280]: timeit b@datad # dense product
100 loops, best of 3: 2.64 ms per loop
In [281]: timeit b@data.A # slower due to `.A` conversion
100 loops, best of 3: 6.44 ms per loop
In [282]: timeit b@data # sparse product
1000 loops, best of 3: 571 µs per loop
np.dot may also delegate action to sparse, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A).
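A safer pattern (a sketch) is to build the selector as a sparse matrix and keep the whole product sparse:

bs = sparse.csr_matrix(b)    # 1 x 1000 selector row built from b above
x3 = (bs * data).toarray()   # sparse-sparse product; densify only the 1 x 2000 result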
Here's a vectorized approach: convert data to a dense array and also get all those inds in a vectorized manner using an argpartition-based method -
# Number of selections as a parameter
n = 100
# Get inds across all iterations in a vectorized manner as a 2D array.
inds2D = np.random.rand(nrand,data.shape[0]).argpartition(n)[:,:n]
# Index into data with those 2D array indices. Then, convert to dense NumPy array,
# reshape and sum reduce to get the final output
out = np.array(data.todense())[inds2D.ravel()].reshape(nrand,n,-1).sum(1)
Runtime test -
1) Function definitions :
def org_app(nrand, n):
    out = np.zeros((nrand, data.shape[1]), dtype=int)
    for k in range(nrand):
        inds = np.random.choice(data.shape[0], size=n, replace=False)
        extracted_rows = data[inds]
        out[k] = extracted_rows.sum(axis=0)
    return out

def vectorized_app(nrand, n):
    inds2D = np.random.rand(nrand, data.shape[0]).argpartition(n)[:,:n]
    return np.array(data.todense())[inds2D.ravel()].reshape(nrand, n, -1).sum(1)
Timings :
In [205]: # create some data with sparsely distributed ones
...: data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
...: data = sp.sparse.csr_matrix(data, dtype='int8')
...:
...: # generate column-wise sums over random subsets of rows
...: nrand = 1000
...: n = 100
...:
In [206]: %timeit org_app(nrand,n)
1 loops, best of 3: 1.38 s per loop
In [207]: %timeit vectorized_app(nrand,n)
1 loops, best of 3: 826 ms per loop
Let A, B be arrays of shape (day, observation, dim). Each array contains, for a given day, the same number of observations, an observation being a point with dim dimensions (that is, dim floats). For every day, I want to compute the spatial distances between all observations in A and B on that day.
For example:
import numpy as np
from scipy.spatial.distance import cdist
A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)
output = []
for day in range(50):
output.append(cdist(A[day],B[day]))
where I use scipy.spatial.distance.cdist.
Is there a faster way to do this? Ideally, I would like the output to be an array of shape (day, observation, observation) that contains, for every day, the pairwise distances between observations in A and B on that day, while somehow avoiding the loop over days.
One way to do it (though it requires a massive amount of memory: with the (50, 1000, 10) arrays above, the broadcast intermediate has shape (50, 1000, 1000, 10), i.e. 5*10**8 floats, about 4 GB in float64) is to make clever use of array broadcasting:
output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
Edit
But after some testing, it seems that scikit-learn's euclidean_distances is probably the best option for large arrays. (Note that I've rewritten your loop into a list comprehension.)
This is for 100 data points per day:
# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 8.81 ms per loop
# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )
10 loops, best of 3: 46.9 ms per loop
# scikit-learn's algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop
and this is for 2000 data points per day:
In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop
In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 2.94 s per loop
Edit: I'm an idiot and forgot that Python's map is evaluated lazily. My "faster" code wasn't actually doing any of the work! Forcing evaluation removed the performance boost.
I think your time is going to be dominated by the time spent inside the scipy function. I'd use map instead of the loop anyway, as I think it's a bit neater, but I don't think there's any magic way to get a huge performance boost here. Maybe compiling the code with Cython or using Numba would help a little.
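For reference, the map version I had in mind looks like this (a sketch; note the list() to force evaluation, per the edit above):

import numpy as np
from scipy.spatial.distance import cdist

dists = np.asarray(list(map(cdist, A, B)))  # list() forces evaluation; shape (50, 1000, 1000)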
I have a large (n=50000) block-diagonal csr_matrix M representing the adjacency matrices of a set of graphs. I have to multiply M by a dense numpy array v several times, hence I use M.dot(v).
Surprisingly, I have discovered that first converting M to numpy.array and then using numpy.dot is much faster.
Any ideas why this is the case?
I don't have enough memory to hold a 50000x50000 dense matrix and multiply it by a 50000-vector, but here are some tests at lower dimensionality.
Setup:
import numpy as np
from scipy.sparse import csr_matrix
def make_csr(n, N):
    rows = np.random.choice(N, n)
    cols = np.random.choice(N, n)
    data = np.ones(n)
    return csr_matrix((data, (rows, cols)), shape=(N, N), dtype=np.float32)
The code above generates a sparse NxN matrix with up to n non-zero entries (duplicate (row, col) pairs are summed, so the stored count can be a bit lower).
Matrices:
N = 5000
# Sparse matrices
A = make_csr(10*10, N) # ~100 non-zero
B = make_csr(100*100, N) # ~10000 non-zero
C = make_csr(1000*1000, N) # ~1000000 non-zero
D = make_csr(5000*5000, N) # ~25000000 non-zero
E = csr_matrix(np.random.randn(N,N), dtype=np.float32) # non-sparse
# Numpy dense arrays
An = A.todense()
Bn = B.todense()
Cn = C.todense()
Dn = D.todense()
En = E.todense()
b = np.random.randn(N)
Timings:
>>> %timeit A.dot(b) # 9.63 µs per loop
>>> %timeit An.dot(b) # 41.6 ms per loop
>>> %timeit B.dot(b) # 41.3 µs per loop
>>> %timeit Bn.dot(b) # 41.2 ms per loop
>>> %timeit C.dot(b) # 3.2 ms per loop
>>> %timeit Cn.dot(b) # 41.2 ms per loop
>>> %timeit D.dot(b) # 35.4 ms per loop
>>> %timeit Dn.dot(b) # 43.2 ms per loop
>>> %timeit E.dot(b) # 55.5 ms per loop
>>> %timeit En.dot(b) # 43.4 ms per loop
For highly sparse matrices (A and B) it is more than 1000x faster.
For a not very sparse matrix (C), it still gets a 10x speedup.
For an almost non-sparse matrix (D still contains zeros because repeated random index pairs collide; with 25M draws into 25M cells, roughly a third of the positions stay empty), it is still faster - not by much, but faster.
For a truly non-sparse matrix (E), the operation is slower, but not much slower.
Conclusion: the speedup you get depends on the sparsity of your matrix, but with N = 5000 sparse matrices are always faster (as long as they have some zero entries).
I can't try it for N = 50000 due to memory issues, but you can run the above code and see what it is like for you with that N.
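As a sketch (the non-zero count here is an arbitrary choice), the sparse side alone does fit at that size, so you can at least time M.dot(v) without building the dense counterpart, which would need about 10 GB in float32:

N = 50000
M = make_csr(10**6, N)                     # ~10**6 non-zeros
v = np.random.randn(N).astype(np.float32)
# %timeit M.dot(v)                         # in IPython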