Linear dependent rows: Huge Sparse Matrix

I have a huge sparse matrix A
<5000x5000 sparse matrix of type '<type 'numpy.float64'>'
with 14979 stored elements in Compressed Sparse Column format>
from which I need to delete linearly dependent rows. I know in advance that j rows will be dependent. I need to:
- find out which sets of rows are linearly dependent
- for each set, keep one arbitrary row and remove the others
I was trying to follow this question, but the documentation for the corresponding sparse method, scipy.sparse.linalg.eigs, says:
k: The number of eigenvalues and eigenvectors desired. k must be smaller than N. It is not possible to compute all eigenvectors of a matrix.
How should I proceed?

scipy.sparse.linalg.eigs uses implicitly restarted Arnoldi iteration. The algorithm is meant for finding a few eigenvectors quickly, and can't find all of them.
5000x5000, however, is not that large. Have you considered converting the matrix to a dense array (A.toarray(), about 200 MB for float64) and just using numpy.linalg.eig or scipy.linalg.eig? It will probably take a few minutes, but it isn't infeasible. You don't gain anything from the sparse format this way, but I'm not sure there is an algorithm for efficiently finding all eigenvectors of a sparse matrix.
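A minimal sketch of the dense route, under stated assumptions: instead of the eigendecomposition mentioned above, it swaps in a column-pivoted QR factorization (scipy.linalg.qr with pivoting=True), because the pivot order directly identifies a maximal set of linearly independent rows. The function name independent_row_indices and the tolerance tol are illustrative choices, not part of the original answer. It keeps one maximal independent subset of rows, which may not match a particular grouping into "sets" if those sets overlap.

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import qr

def independent_row_indices(A, tol=1e-10):
    """Return sorted indices of a maximal set of linearly independent rows.

    A may be sparse or dense; it is converted to a dense array, which is
    affordable for a 5000x5000 matrix. tol is a relative threshold (an
    assumption) used to decide the numerical rank.
    """
    dense = A.toarray() if sp.issparse(A) else np.asarray(A, dtype=float)
    # Rows of A are columns of A.T, so column pivoting ranks the rows of A.
    _, R, piv = qr(dense.T, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    if diag.size == 0 or diag[0] == 0.0:
        return np.array([], dtype=int)
    rank = int(np.sum(diag > tol * diag[0]))
    # The first `rank` pivots index independent rows; the rest are dependent.
    return np.sort(piv[:rank])
```

Usage would look like keep = independent_row_indices(A) followed by A_reduced = A.tocsr()[keep], since row slicing is cheap in CSR format.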

Related

Given a sparse matrix with shape (num_samples, num_features), how do I estimate the co-occurrence matrix?

The sparse matrix has only 0 and 1 at each entry (i,j) (1 stands for sample i has feature j). How can I estimate the co-occurrence matrix for each feature given this sparse representation of data points? Especially, I want to find pairs of features that co-occur in at least 50 samples. I realize it might be hard to produce the exact result, is there any approximated algorithm in data mining that allows me to do that?

This can be solved reasonably easily if you work with the transposed matrix.
For any two features (now rows, originally columns), compute the size of the intersection of their sample sets. If it is at least 50, you have a frequent co-occurrence.
If you use an appropriate sparse encoding (now of rows, but originally of columns, so you probably need not only to transpose the matrix but also to re-encode it), this operation takes O(n+m) time, where n and m are the numbers of nonzero values in the two rows.
If you have an extremely high number of features this may take a while. But 100000 should be feasible.
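As a rough sketch of the same idea in scipy.sparse: for a 0/1 matrix X, the product X.T @ X holds in entry (i, j) the number of samples that contain both feature i and feature j, which is exactly the intersection size described above. The function name frequent_pairs and the min_count parameter are hypothetical, and this computes all pairs in one sparse product rather than intersecting rows pair by pair.

```python
import numpy as np
import scipy.sparse as sp

def frequent_pairs(X, min_count=50):
    """Return (i, j, count) for feature pairs i < j that co-occur in at
    least min_count samples. X is a binary (num_samples, num_features) matrix."""
    # Integer dtype so that counts are exact.
    X = sp.csc_matrix(X, dtype=np.int64)
    # Entry (i, j) counts the samples in which features i and j both appear.
    cooc = X.T @ X
    # Keep the strict upper triangle so each pair is reported once.
    upper = sp.triu(cooc, k=1).tocoo()
    mask = upper.data >= min_count
    return list(zip(upper.row[mask].tolist(),
                    upper.col[mask].tolist(),
                    upper.data[mask].tolist()))
```

The product stays sparse as long as most feature pairs never co-occur; with many dense features it can become large, so this is only a starting point.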

Multiplying a sparse matrix with sparse vector (efficient way)

I have a sparse matrix M of size N*N with Nd non-zero elements and a sparse vector A of size N*1 with Na non-zero elements (N is large).
I want to calculate the matrix product B=MA.
I use the sparse matrix representation in scipy.sparse: P=csr_matrix(M). Then I do B=P.dot(A).
I know the complexity of this operation is O(Nd). It seems that A is treated as a dense vector in the calculation, because the computation time does not change when I change Na. But the vector A is also sparse. Is there a more efficient way to perform this multiplication?
In my simulation, M is fixed. The vectors A differ, but they are all sparse.
Thank you very much.
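One possible approach, sketched here under assumptions rather than taken from an answer: since B is a linear combination of the columns of M selected by the nonzeros of A, storing M in CSC format lets you touch only those Na columns. The helper name matvec_sparse_vector is hypothetical, and the result is returned as a dense 1-D array.

```python
import numpy as np
import scipy.sparse as sp

def matvec_sparse_vector(M_csc, a):
    """Compute M @ a using only the columns of M where a is nonzero.

    M_csc: (N, N) matrix in CSC format (cheap column selection).
    a:     (N, 1) sparse column vector.
    Returns a dense 1-D array of length N.
    """
    a = sp.csc_matrix(a)
    # For an (N, 1) CSC matrix, indices are the row positions of the nonzeros.
    idx, vals = a.indices, a.data
    if idx.size == 0:
        return np.zeros(M_csc.shape[0])
    # Only the selected columns take part in the product.
    return M_csc[:, idx] @ vals
```

Because M is fixed, the conversion to CSC (M.tocsc()) would be done once, outside the loop over the vectors A.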

How to define a (n, 0) sparse matrix in scipy or how to assemble a sparse matrix column wise?

I have a loop that in each iteration gives me a column c of a sparse matrix N.
To assemble/grow/accumulate N column by column I thought of using
N = scipy.sparse.hstack([N, c])
To do this it would be nice to initialize the matrix with rows of length 0. However,
N = scipy.sparse.csc_matrix((4,0))
raises a ValueError: invalid shape.
Any suggestions on how to do this right?

You can't. Sparse matrices are restricted compared to NumPy arrays and in particular don't allow 0 for any axis. All sparse matrix constructors check for this, so if and when you do manage to build such a matrix, you're exploiting a SciPy bug and your script is likely to break when you upgrade SciPy.
That being said, I don't see why you'd need an n × 0 sparse matrix since an n × 0 NumPy array is allowed and takes practically no storage space.
Turns out sparse.hstack cannot handle a NumPy array with a zero axis, so disregard my previous comment. However, what I think you should do is collect all the columns in a list, then hstack them in one call. That's better than your loop, since appending to a list takes amortized constant time, while each hstack call takes time linear in the size of the matrix built so far. So your proposed algorithm takes quadratic time while it could be linear.
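The list-then-stack pattern could look like the following minimal sketch. The loop body that produces each column is made up for illustration; in the real code, c would come from whatever computation yields each column.

```python
import numpy as np
import scipy.sparse as sp

columns = []
for j in range(3):
    # Placeholder for the real computation: a 4x1 sparse column.
    c = sp.csc_matrix(np.array([[float(j)], [0.0], [float(j + 1)], [0.0]]))
    columns.append(c)  # amortized constant time per column

# A single hstack call assembles the whole matrix at the end.
N = sp.hstack(columns, format="csc")
```

No empty (n, 0) starting matrix is needed, because nothing is stacked until all columns exist.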

You must use at least 1 in your shape.
N = scipy.sparse.csc_matrix((4,1))
Which you can stack:
print scipy.sparse.hstack( (N,N) )
#<4x2 sparse matrix of type '<type 'numpy.float64'>'
# with 0 stored elements in COOrdinate format>

Python Sparse matrix inverse and laplacian calculation

I have two sparse matrices, A (affinity matrix) and D (diagonal matrix), each of dimension 100000*100000. I have to compute the Laplacian matrix L = D^(-1/2)*A*D^(-1/2). I am using the SciPy CSR format for the sparse matrices.
I didn't find any method for computing the inverse of a sparse matrix. How can I find L and the inverse of a sparse matrix? Also, is it efficient to do this in Python, or should I call a MATLAB function to calculate L?

In general the inverse of a sparse matrix is not sparse, which is why you won't find sparse matrix inverters in linear algebra libraries. Since D is diagonal, D^(-1/2) is trivial, and the Laplacian matrix calculation is thus trivial to write down. L has the same sparsity pattern as A, but each value A_{ij} is multiplied by (D_i*D_j)^{-1/2}.
Regarding the inverse, the standard approach is to avoid calculating the inverse itself. Instead of calculating L^-1, solve Lx=b for the unknown x each time you need it. All good matrix solvers will allow you to factorize L once (which is expensive) and then back-substitute (which is cheap) repeatedly for each value of b.
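A small sketch of both points in scipy.sparse, with assumptions spelled out: the function name normalized_affinity is hypothetical, D is assumed to have a strictly positive diagonal, and scipy.sparse.linalg.splu is used as one possible factorize-once solver (it requires L to be nonsingular). The 2x2 matrices are toy data.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def normalized_affinity(A, D):
    """Return L = D^(-1/2) A D^(-1/2) for sparse A and diagonal sparse D.

    Assumes every diagonal entry of D is strictly positive.
    """
    inv_sqrt = 1.0 / np.sqrt(D.diagonal())
    S = sp.diags(inv_sqrt)
    # Scales A_ij by (D_i * D_j)^(-1/2); the sparsity pattern of A is kept.
    return (S @ A @ S).tocsr()

# Toy example data (not from the question).
A = sp.csr_matrix(np.array([[2.0, 1.0], [1.0, 2.0]]))
D = sp.diags(np.array([4.0, 9.0])).tocsr()
L = normalized_affinity(A, D)

# Factorize once (expensive), then solve cheaply for each right-hand side.
lu = splu(L.tocsc())
x1 = lu.solve(np.array([1.0, 0.0]))
x2 = lu.solve(np.array([0.0, 1.0]))
```

For a symmetric positive definite L, an iterative solver such as scipy.sparse.linalg.cg is another option that avoids the factorization entirely.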

proximity matrix in python

What is the best way to compute the distance/proximity matrix for very large sparse vectors?
For example, you are given the following design matrix, where each row is a 68771-dimensional sparse vector.
designMatrix
<5830x68771 sparse matrix of type ''
with 1229041 stored elements in Compressed Sparse Row format>

Have you tried the routines in scipy.spatial.distance?
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
If this forces you to go to a dense representation, then you may be better off rolling your own, depending on the density of nonzero elements. You could squeeze out the zeros while retaining a map between the new and original indices, calculate the pairwise distances on the remaining nonzero elements and then use the indexing to map things back.
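One way to "roll your own" without densifying the input, as a hedged sketch for Euclidean distance only: use the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, where all inner products come from the sparse product X @ X.T. The function name pairwise_euclidean is hypothetical. The output is a dense n-by-n array, which for 5830 rows is about 272 MB in float64.

```python
import numpy as np
import scipy.sparse as sp

def pairwise_euclidean(X):
    """Dense matrix of Euclidean distances between the rows of sparse X."""
    X = sp.csr_matrix(X, dtype=np.float64)
    # Squared norm of each row, computed on the stored nonzeros only.
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    # All pairwise inner products via one sparse-sparse product.
    gram = (X @ X.T).toarray()
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Rounding can produce tiny negative values; clip before the square root.
    np.maximum(d2, 0.0, out=d2)
    return np.sqrt(d2)
```

Other metrics would need a different formulation; this trick relies on the distance being expressible through inner products.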
