Median for sparse matrix in numpy - python

What is the best way to obtain the median (along rows and columns) of a scipy.sparse.csr_matrix in Python?
PS: The documentation page doesn't list any median function.

If you are after the median of the column entries of a sparse matrix, sklearn has an implementation for CSC matrices
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs.py#L441
As mentioned in the comments, the median of the non-zero (nnz) elements makes more sense here, since for a sufficiently sparse matrix the row/column median is simply zero.
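The linked helper computes each column's median including the implicit zeros. For the nnz-only variant mentioned above, here is a minimal sketch (assuming plain scipy/numpy rather than the sklearn helper) that takes the median of just the stored entries of each CSC column:
import numpy as np
from scipy import sparse

def nnz_column_medians(X):
    X = sparse.csc_matrix(X)             # per-column data lives in contiguous slices
    medians = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        start, end = X.indptr[j], X.indptr[j + 1]
        if end > start:                  # columns with no stored entries keep median 0
            medians[j] = np.median(X.data[start:end])
    return medians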

Related

The order of columns in a sparse matrix multiplication with a vector

I wish to multiply a huge sparse matrix A with a binary vector y, i.e., A.dot(y). However, in order to calculate y I need to know the order of the columns in A, to make sure the columns of A and the entries of y correspond before the multiplication. I couldn't find out what "the order of columns" would be in the dot() operation, so I need to verify that y is ordered the same way as A's columns. How can I do that efficiently?
A is a CSR matrix constructed as csr_matrix((data, indices, indptr)). Working with a dense A is not possible. I have a solution that loops, but I want to avoid it if possible, since A has 11M rows and 8M columns.
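No answer is recorded here, but the pairing inside dot() is fixed by index: entry i of A.dot(y) is the sum over j of A[i, j] * y[j], so column j of A always multiplies y[j] and dot() itself does no reordering. The only thing to verify is that y was built with the same feature-to-column mapping used when constructing A. A small sketch (the feature_index dict is a hypothetical stand-in for whatever mapping produced A's columns):
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 0, 2],
                         [0, 3, 0]]))
y = np.array([10, 100, 1000])
print(A.dot(y))                               # [2010, 300]: column j pairs with y[j]

feature_index = {"f0": 0, "f1": 1, "f2": 2}   # hypothetical feature -> column map
values = {"f1": 100, "f0": 10, "f2": 1000}
y_aligned = np.zeros(A.shape[1])
for name, v in values.items():
    y_aligned[feature_index[name]] = v        # place each value at its column index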

Given a sparse matrix with shape (num_samples, num_features), how do I estimate the co-occurrence matrix?

The sparse matrix has only 0 and 1 at each entry (i, j) (1 means sample i has feature j). How can I estimate the co-occurrence matrix of the features given this sparse representation of the data points? In particular, I want to find pairs of features that co-occur in at least 50 samples. I realize it might be hard to produce the exact result; is there an approximate algorithm in data mining that would let me do that?
This can be solved reasonably easily if you work with the transposed matrix.
For any two features (now rows, originally columns) you compute the intersection of their sample sets. If it is larger than 50, you have a frequent co-occurrence.
If you use an appropriate sparse encoding (now of the rows, originally of the columns, so you probably need not only to transpose the matrix but also to re-encode it), this operation takes O(n + m) time, where n and m are the numbers of non-zero values in the two rows.
If you have an extremely high number of features this may take a while, but 100000 features should be feasible.
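For 0/1 data the intersection counting can also be written as a sparse matrix product: entry (j, k) of X.T @ X is exactly the number of samples in which features j and k are both 1. A sketch along those lines (exact, not approximate; note the product itself can get large if many feature pairs co-occur):
import numpy as np
from scipy import sparse

def frequent_pairs(X, min_count=50):
    # X: binary 0/1 integer CSR matrix of shape (num_samples, num_features)
    C = (X.T @ X).tocoo()        # C[j, k] = co-occurrence count of features j and k
    keep = (C.data >= min_count) & (C.row < C.col)   # threshold, upper triangle only
    return list(zip(C.row[keep], C.col[keep], C.data[keep]))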

Linear dependent rows: Huge Sparse Matrix

I have a huge sparse matrix A
<5000x5000 sparse matrix of type '<type 'numpy.float64'>'
with 14979 stored elements in Compressed Sparse Column format>
for which I need to delete linearly dependent rows. I have a prior that j rows will be dependent. I need to:
1. find out which sets of rows are linearly dependent
2. for each set, keep one arbitrary row and remove the others
I was trying to follow this question, but the corresponding method for sparse matrices, scipy.sparse.linalg.eigs, says:
k: The number of eigenvalues and eigenvectors desired. k must be smaller than N. It is not possible to compute all eigenvectors of a matrix.
How should I proceed?
scipy.sparse.linalg.eigs uses implicitly restarted Arnoldi iteration. The algorithm is meant for finding a few eigenvectors quickly, and can't find all of them.
5000x5000, however, is not that large. Have you considered just using numpy.linalg.eig or scipy.linalg.eig? It will probably take a few minutes, but it isn't completely infeasible. You don't gain anything by using a sparse matrix, but I'm not sure there's an algorithm for efficiently finding all eigenvectors of a sparse matrix.
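If the goal is specifically to identify a maximal set of independent rows rather than a full eigendecomposition, one standard dense route is QR with column pivoting. This is a different technique from eig, offered as a sketch under the assumption that the 5000x5000 matrix fits in memory as a dense array:
import numpy as np
from scipy.linalg import qr

def independent_row_indices(A, tol=1e-10):
    # Columns of A.T are the rows of A; pivoting orders them so the first
    # `rank` pivots index a linearly independent set.
    Q, R, piv = qr(A.toarray().T, pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag.max())) if diag.size else 0
    return np.sort(piv[:rank])   # rows to keep; the rest are linear combinations of these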

How do you get the mean and std of a column in a csr_matrix?

I have a sparse 988x1 vector (a column slice of a csr_matrix) created through scipy.sparse. Is there a way to get its mean and standard deviation without converting the sparse matrix to a dense one?
numpy.mean seems to only work for dense vectors.
Since you are performing column slicing, it may be better to store the matrix using CSC rather than CSR. But that would depend on what else you are doing with the matrix.
To calculate the mean of a column in a CSC matrix you can use the matrix's mean() method.
Calculating the standard deviation efficiently involves just a bit more effort. First of all, suppose you get your sparse column like this:
col = A.getcol(colindex)
Then calculate the variance, and take its square root for the standard deviation:
import numpy as np

N = col.shape[0]
sqr = col.copy()          # take a copy of the column
sqr.data **= 2            # square only the stored (non-zero) entries
variance = sqr.sum()/N - col.mean()**2
std = np.sqrt(variance)

proximity matrix in python

What is the best way to compute the distance/proximity matrix for very large sparse vectors?
For example, you are given the following design matrix, where each row is a 68771-dimensional sparse vector.
designMatrix
<5830x68771 sparse matrix of type ''
with 1229041 stored elements in Compressed Sparse Row format>
Have you tried the routines in scipy.spatial.distance?
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
If this forces you to go to a dense representation, then you may be better off rolling your own, depending on the density of nonzero elements. You could squeeze out the zeros while retaining a map between the new and original indices, calculate the pairwise distances on the remaining nonzero elements and then use the indexing to map things back.
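One way to roll your own without densifying the 68771-dimensional vectors is to exploit ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, so only the 5830x5830 result is dense (roughly 270 MB as float64). A sketch for Euclidean distances, assuming a CSR design matrix:
import numpy as np

def pairwise_euclidean(X):
    # X: CSR matrix of shape (n_samples, n_features)
    G = (X @ X.T).toarray()                             # dense Gram matrix of dot products
    sq = np.asarray(X.multiply(X).sum(axis=1)).ravel()  # squared norm of each row
    d2 = sq[:, None] + sq[None, :] - 2.0 * G
    np.maximum(d2, 0, out=d2)                           # clamp round-off negatives
    return np.sqrt(d2)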
