Constructing Sparse CSR Matrix Directly vs Using COO tocsr() - Scipy - python

My goal here is to build the sparse CSR matrix very fast. It is currently the main bottleneck in my process, and I've already optimized it by constructing the COO matrix relatively quickly and then using tocsr().
However, I would imagine that constructing the CSR matrix directly must be faster?
I have a very specific format of sparse matrix that is also large (on the order of 100000x50000). I've looked online at other answers, but most do not address my question.
Efficiently construct FEM/FVM matrix
Looks at constructing a very specifically formatted sparse matrix vs using COO, which led to a SciPy merge improving the speed of tocsr().
Sparse Matrix Structure:
The sparse matrix, H, is composed of W lists of size N, i.e. it is built from an initial array of size NxW, let's call it A. Along the diagonal, each size-N list is repeated N times: the first N rows of H each contain A[:,0], sliding N steps to the right on each successive row.
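To make the structure concrete, here is a small illustration of my own (with hypothetical N=2, W=3, so A is 2x3); following the code sample below, only the first W-1 columns of A appear, and each column occupies N consecutive rows of H, sliding N columns to the right:
# A = [[a00, a01, a02],      H = [[a00, a10,   0,   0],
#      [a10, a11, a12]]           [  0,   0, a00, a10],
#                                 [a01, a11,   0,   0],
#                                 [  0,   0, a01, a11]]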
Comparison to COO.tocsr()
When I scale up N and W, building the COO matrix first and then running tocsr() is actually faster than building the CSR matrix directly. I'm not sure why this would be the case. I am wondering if I can take advantage of the structure of my sparse matrix H in some way, since it contains many repeating elements.
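For reference, a COO construction of the same H might look like this (a sketch based on the structure described above, not necessarily the exact code I'm timing):
import numpy as np
from scipy.sparse import coo_matrix

N, W = 4, 8
testdata = np.random.randn(N, W)
rows = np.repeat(np.arange(N * (W - 1)), N)        # each row of H holds N entries
cols = np.tile(np.arange(N * N), W - 1)            # column blocks slide by N, wrapping every N rows
vals = np.tile(testdata[:, :W - 1].T, N).ravel()   # each column of testdata repeated N times
H = coo_matrix((vals, (rows, cols)), shape=(N * (W - 1), N * N)).tocsr()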
Code Sample
Here is a code sample to visualize what is going on for a small sample size:
from scipy.sparse import csr_matrix
import numpy as np
import matplotlib.pyplot as plt

def test_csr(testdata):
    # column indices: the pattern 0..N**2-1 repeated for each of the W-1 vectors
    indices = [x for _ in range(W - 1) for x in range(N**2)]
    # row pointers: every row holds N entries
    ptrs = [N * i for i in range(N * (W - 1))]
    ptrs.append(len(indices))
    data = []
    # loop over the columns of testdata
    for i in range(W - 1):
        vec = testdata[:, i].squeeze()
        # repeat the vector N times
        for _ in range(N):
            data.extend(vec)
    Hshape = (N * (W - 1), N**2)
    H = csr_matrix((data, indices, ptrs), shape=Hshape)
    return H

N = 4
W = 8
testdata = np.random.randn(N, W)
print(testdata.shape)
H = test_csr(testdata)
plt.imshow(H.toarray(), cmap='jet')
plt.show()

It looks like your output only uses the first W-1 columns of testdata. I'm not sure if this is intentional. My solution assumes you want to use all of testdata.
When you construct the COO matrix, are you also constructing the indices and data in a similar way?
One thing that might speed up constructing the csr_matrix is to use built-in NumPy functions to generate the data for the csr_matrix rather than Python loops and lists. I would expect this to improve the speed of generating the indices significantly. You can adjust the dtype to a different integer type depending on the size of your matrix.
N = 4
W = 8
testdata = np.random.randn(N, W)
# row pointers: each of the N*W rows holds N entries
ptrs = N * np.arange(W * N + 1, dtype='int')
# column indices: blocks slide by N each row and wrap every N rows
indices = np.tile(np.arange(N * N, dtype='int'), W)
# data: each column testdata[:, w] fills N consecutive rows
data = np.tile(testdata.T, N).flatten()
Hshape = (N * W, N**2)
H = csr_matrix((data, indices, ptrs), shape=Hshape)
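As a quick sanity check of my own (not part of the answer above), the vectorized matrix should agree with the loop-built one on the rows they share, since test_csr only uses the first W-1 columns of testdata:
H_loop = test_csr(testdata)                   # rows for columns 0 .. W-2
assert (H[:N * (W - 1)] != H_loop).nnz == 0   # the shared rows are identical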
Another possibility is to first construct the matrix in LIL format and assign each of the N vertical column blocks at once. This means you don't need to make a ton of copies of the original data before putting it into the sparse matrix. However, the final format conversion may be slow.
from scipy.sparse import lil_matrix

N = 4
W = 8
testdata = np.random.randn(N, W)
Hshape = (N * W, N**2)
H = lil_matrix(Hshape)
for j in range(N):
    # assign the j-th column block for all W repetitions at once
    H[N * np.arange(W) + j, N * j:N * (j + 1)] = testdata.T
H = H.tocsr()

Related

Memory issues with N large matrices

I have to create N large matrices, of size M x M, with M = 100'000, on a cluster. I can create them one by one. Usually I would first define a tensor
mat_all = torch.zeros((N,M,M))
And then I would fill mat_all as follows:
for i in range(N):
    tmp = create_matrix(M, M)
    mat_all[i, :, :] = tmp
where the function create_matrix creates a square matrix of size M.
My problem is: if I do that, I run into memory issues when creating the big tensor mat_all with torch.zeros. I do not have these issues when I create the matrices one by one with create_matrix.
I was wondering if there is a way to have a tensor like mat_all that holds the N matrices of size MxM but in such a way that I do not have memory issues.
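For scale, a rough footprint estimate (my own sketch; N = 10 is an assumption since the question doesn't give N, and torch.zeros defaults to float32 at 4 bytes per element):
N, M = 10, 100_000
bytes_one = M * M * 4      # one M x M float32 matrix: 4e10 bytes, i.e. ~40 GB
bytes_all = N * bytes_one  # the full (N, M, M) tensor: ~400 GB, far beyond typical RAM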

Make a matrix multiplication without a loop when the matrix is stored as vectors

I'm trying to do a matrix/vector multiplication, but my matrix is stored in a way that the @ operator cannot be used.
My matrix Z is actually a list of size N containing the columns of the matrix, which are all PETSc4py Vecs of size NN, where NN ≫ N (e.g. NN=10000 and N=10). As N is small, I can loop over it; for instance, if I want to compute r = Z.T @ u with u a vector of size NN, I do
r = np.zeros(N)
for i, z in enumerate(Z):
    r[i] = u * z  # scalar product
Now I have a vector u of size N and I want to compute the multiplication w = Z @ u. I can't apply the same method because it would involve a loop of size NN, which I'm trying to avoid.
I could convert my "matrix" Z to a NumPy matrix, but I'm also trying to avoid that...
I represented in figure 1 the way the matrix is stored. A red line represents a vector that would be read for the matrix-vector multiplication.
Is there a mathematical way (or a magic trick!) to compute this operation without the big loop?
Thanks
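One observation that may help: w = Z @ u is just the linear combination w = u[0]*Z[0] + ... + u[N-1]*Z[N-1] of the stored columns, so it needs only a loop of length N, not NN. A minimal sketch using petsc4py's Vec.axpy (assuming Z is the list of column Vecs and u is a length-N NumPy array):
w = Z[0].duplicate()   # new Vec with the same size/layout as the columns
w.set(0.0)
for ui, z in zip(u, Z):
    w.axpy(ui, z)      # w += ui * z; each step is a size-NN vector update done by PETSc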

Fastest way to generate and sum arrays

I am generating a series of Gaussian arrays given an x vector of length 1400, and arrays for the sigma, center, and amplitude (amp), all of length 100. I thought the best way to speed this up would be to use NumPy and a list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a gaussian along a vector x, and then I sum the columns into a single array of length x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of a comprehension so the loops are all performed at C speed.
In order to do so, you have to reshape x to be a column vector. For example, you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v = amp * np.exp(-0.5 * (x - center)**2 / sigma**2)
Then you obtain an array of shape (1400,100), which you can sum down to a length-1400 vector with np.sum(v, axis=1).
You should try to vectorize all the operations. IMHO the most efficient approach is to first convert your input data to NumPy arrays (if they were plain Python lists) and then let NumPy process the computations. Reshaping the parameter arrays to columns makes the broadcasting work with a 1-D x:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
g = np.sum(np_amp[:, None] * np.exp(-0.5 * (x - np_center[:, None])**2 / np_sigma[:, None]**2), axis=0)
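A quick self-contained check (with made-up inputs) that the broadcast version matches the original comprehension:
import numpy as np

x = np.random.randn(1400)
amp, center = np.random.randn(100), np.random.randn(100)
sigma = np.random.rand(100) + 0.1   # keep sigma away from zero
g_loop = np.sum([amp[i] * np.exp(-0.5 * (x - center[i])**2 / sigma[i]**2)
                 for i in range(len(center))], axis=0)
g_vec = np.sum(amp[:, None] * np.exp(-0.5 * (x - center[:, None])**2 / sigma[:, None]**2), axis=0)
assert np.allclose(g_loop, g_vec)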

Optimize Scipy Sparse Matrix

I have a sparse matrix where I'm currently enumerating over each row and performing some calculations based on the information from each row. Each row is completely independent of the others. However, for large matrices, this code is extremely slow (takes about 2 hours) and I can't convert the matrix to a dense one either (limited to 8GB RAM).
import scipy.sparse
import numpy as np

def process_row(a, b):
    """
    a - contains the row indices for a sparse matrix
    b - contains the column indices for a sparse matrix
    Returns a new vector of length(a)
    """
    return

def assess(mat):
    mat_csr = mat.tocsr()
    nrows, ncols = mat_csr.shape
    a = np.arange(ncols, dtype=np.int32)
    b = np.empty(ncols, dtype=np.int32)
    result = []
    for i, row in enumerate(mat_csr):
        # process one row at a time
        b.fill(i)
        result.append(process_row(b, a))
    return result

if __name__ == '__main__':
    row = np.array([8, 2, 7, 4])
    col = np.array([1, 3, 2, 1])
    data = np.array([1, 1, 1, 1])
    mat = scipy.sparse.coo_matrix((data, (row, col)))
    print(assess(mat))
I am looking to see if there's any way to design this better so that it performs much faster. Essentially, the process_row function takes (row, col) index pairs (from a, b) and does some math using another sparse matrix and returns a result. I don't have the option to change this function but it can actually process different row/col pairs and is not restricted to processing everything from the same row.
Your problem looks similar to this other recent SO question:
Calculate the euclidean distance in scipy csr matrix
In my answer I sketched a way of iterating over the rows of a sparse matrix. I think it is faster to convert the array to lil, and construct the dense rows directly from its sublists. This avoids the overhead of creating a new sparse matrix for each row. But I haven't done time tests.
https://stackoverflow.com/a/36559702/901925
Maybe this applies to your case.
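To illustrate the idea (my own sketch, not the exact code from the linked answer): a lil_matrix keeps, for each row, a list of column indices (ml.rows) and a matching list of values (ml.data), so a dense row can be built without creating a new sparse matrix per row (reusing mat and np from the question):
ml = mat.tolil()
nrows, ncols = ml.shape
for i, (row_cols, row_vals) in enumerate(zip(ml.rows, ml.data)):
    dense_row = np.zeros(ncols)
    dense_row[row_cols] = row_vals   # dense copy of row i, no per-row sparse object
    # ... process dense_row ...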

Reverse sort and argsort in python

I'm trying to write a function in Python (still a noob!) that returns the indices and scores of documents ordered by the inner products of their tf-idf scores. The procedure is:
Compute vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:
import h5py
import numpy as np

def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner products with document idx
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort scores in descending order
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:, ]
    # sort indices in descending order of score
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:, ]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort(), then argsort()), and the results then have to be reversed.
Can this be done more efficiently?
Can this be done without converting the sparse matrix using toarray()?
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).
The double sort can be skipped by doing
v = v.toarray().ravel()
vi = np.argsort(v)[::-1]
vs = v[vi]
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product with a sparse matrix is
v = tfidf * tfidf[idx, :].transpose()
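Putting the pieces together, a revised get_related might look like this (a sketch; it assumes tfidf is a SciPy sparse CSR matrix):
def get_related(tfidf, idx):
    ''' return the top documents, most similar first '''
    v = (tfidf * tfidf[idx, :].transpose()).toarray().ravel()
    vi = np.argsort(v)[::-1]
    return (v[vi][1:], vi[1:])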
