What is the best way to compute the distance/proximity matrix for very large sparse vectors?
For example, suppose you are given the following design matrix, where each row is a 68771-dimensional sparse vector.
designMatrix
<5830x68771 sparse matrix of type ''
with 1229041 stored elements in Compressed Sparse Row format>
Have you tried the routines in scipy.spatial.distance?
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
If these routines force you into a dense representation, you may be better off rolling your own, depending on the density of nonzero elements. You could squeeze out the zeros while keeping a map between the new and original indices, calculate the pairwise distances on the remaining nonzero elements, and then use that map to put the results back in their original positions.
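One way to roll your own for the Euclidean case is sketched below. It is only a sketch under stated assumptions: the helper name sparse_euclidean_distances and the demo matrix X_demo are made up for illustration, and it relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y so that the rows themselves never have to be densified. Only the n x n result is dense.

```python
import numpy as np
from scipy import sparse


def sparse_euclidean_distances(X):
    """Pairwise Euclidean distances between the rows of a sparse matrix.

    Hypothetical helper, not part of scipy.
    """
    X = sparse.csr_matrix(X)
    # Squared norm of every row, computed from the stored elements only.
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    # Inner products between all pairs of rows (sparse product, dense result).
    gram = (X @ X.T).toarray()
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Rounding can leave tiny negative values; clip them before the sqrt.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    np.fill_diagonal(sq_dists, 0.0)
    return np.sqrt(sq_dists)


# Small stand-in for the 5830 x 68771 design matrix.
X_demo = sparse.random(6, 50, density=0.1, format="csr", random_state=0)
D = sparse_euclidean_distances(X_demo)
```

For 5830 rows the dense result holds 5830 x 5830 float64 values, roughly 270 MB, so the output still fits in memory even though the dense input would be much larger.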
Related
I wish to multiply a huge sparse matrix A with a binary vector y, i.e., A.dot(y). However, in order to calculate y I need to know the order of the columns in A -- to make sure columns correspond both in A and y before the multiplication. I couldn't find a way to figure out what would be "the order of columns" in the dot() operation, so I need to verify y is ordered the same as A. How can I do that efficiently?
A is a CSR matrix built as csr_matrix((data, indices, indptr)). Working with a dense A is not possible. I have a solution that uses a loop, but I want to avoid it if possible, since A has 11M rows and 8M columns.
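A minimal sketch of how the column order works in CSR, using a made-up 2x4 matrix: the indices array stores the column index of every stored value, so A.dot(y) always pairs column j with y[j], whatever order the entries are stored in. The values below are hypothetical and chosen only to make that visible.

```python
import numpy as np
from scipy import sparse

# Stored entries of row 0 are deliberately listed out of column order.
data = np.array([10.0, 20.0, 30.0])
indices = np.array([3, 0, 2])  # column index of each stored value
indptr = np.array([0, 2, 3])   # row 0 owns data[0:2], row 1 owns data[2:3]
A = sparse.csr_matrix((data, indices, indptr), shape=(2, 4))

y = np.array([1, 0, 1, 1])     # y[j] is multiplied with column j of A
result = A.dot(y)

# Row 0: 10 * y[3] + 20 * y[0] = 30
# Row 1: 30 * y[2] = 30
```

So y only needs to be indexed by the same column numbers that appear in indices; no separate ordering check against the storage layout should be needed.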
What is the best way to obtain the median (along the rows and along the columns) of a sparse.csr_matrix in Python?
PS: The webpage doesn't have any median function.
If you are after the median of the column entries of a sparse matrix, sklearn has an implementation for CSC matrices:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs.py#L441
As mentioned in the comments, the median of the nonzero (nnz) elements makes more sense here, since for a sufficiently sparse matrix the row/column median is simply zero.
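If you would rather not depend on sklearn, here is a rough scipy-only sketch. The helper name column_medians is made up, and the code is written for clarity rather than speed: it rebuilds one column at a time, counting the implicit zeros, so it never holds the whole dense matrix.

```python
import numpy as np
from scipy import sparse


def column_medians(X):
    """Median of each column of a sparse matrix, implicit zeros included.

    Hypothetical helper, not part of scipy or sklearn.
    """
    X = sparse.csc_matrix(X)
    n_rows, n_cols = X.shape
    medians = np.zeros(n_cols)
    for j in range(n_cols):
        # Stored values of column j in CSC layout.
        stored = X.data[X.indptr[j]:X.indptr[j + 1]]
        n_zeros = n_rows - stored.size
        # Rebuild just this column: stored values plus the implicit zeros.
        column = np.concatenate([stored, np.zeros(n_zeros)])
        medians[j] = np.median(column)
    return medians


X_demo = sparse.random(7, 5, density=0.6, format="csr", random_state=1)
col_med = column_medians(X_demo)
row_med = column_medians(X_demo.T)  # row medians via the transpose
```

To get the median of the nonzero elements only, drop the np.zeros(n_zeros) part and handle columns that have no stored values separately.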
I have a sparse matrix M of size N*N with Nd non-zero elements and a sparse vector A of size N*1 with Na non-zero elements. (N is large)
I want to calculate the matrix multiplication B=MA.
I use the sparse matrix representation in scipy.sparse. P=csr_matrix(M). Then I do B=P.dot(A).
I know the complexity of this operation is O(Nd). It seems that A is treated as a dense vector in the calculation, because the computation time of the multiplication does not change when I change Na. But the vector A is also sparse. Is there an efficient way to perform this multiplication in less time?
In my simulation, M is fixed. The vectors A differ from run to run, but they are all sparse.
Thank you very much.
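Two possible approaches are sketched below, under assumptions rather than as a definitive answer. The sizes and random matrices are made up for illustration. The idea is to store M in CSC format, so that its columns are cheap to select, and to touch only the columns that match the nonzero entries of A.

```python
import numpy as np
from scipy import sparse

N = 1000
# Fixed matrix M, stored by column (hypothetical data).
M = sparse.random(N, N, density=0.01, format="csc", random_state=0)
# Sparse column vector A with only a few stored entries (hypothetical data).
a = sparse.random(N, 1, density=0.005, format="csc", random_state=1)

# Option 1: sparse @ sparse. The result is a sparse N x 1 matrix.
B_sparse = M @ a

# Option 2: keep only the columns of M that meet a nonzero of A.
nz_rows = a.indices  # row positions of the stored entries (a has one column)
nz_vals = a.data     # the corresponding values
B_dense = M[:, nz_rows] @ nz_vals  # dense 1-D result of length N
```

With option 2 the work should depend mainly on how many entries the selected columns of M hold, not on Nd as a whole. Whether either option is actually faster than P.dot(A) with a dense vector for your N, Nd and Na is worth timing.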
I have a huge sparse matrix A
<5000x5000 sparse matrix of type '<type 'numpy.float64'>'
with 14979 stored elements in Compressed Sparse Column format>
from which I need to delete linearly dependent rows. I know in advance that j rows will be dependent. I need to:
1. find out which sets of rows are linearly dependent
2. for each set, keep one arbitrary row and remove the others
I was trying to follow this question, but the documentation of the corresponding method for sparse matrices, scipy.sparse.linalg.eigs, says:
k: The number of eigenvalues and eigenvectors desired. k must be smaller than N. It is not possible to compute all eigenvectors of a matrix.
How should I proceed?
scipy.sparse.linalg.eigs uses implicitly restarted Arnoldi iteration. The algorithm is meant for finding a few eigenvectors quickly, and can't find all of them.
5000x5000, however, is not that large. Have you considered just using numpy.linalg.eig or scipy.linalg.eig? It will probably take a few minutes, but it isn't completely infeasible. You don't gain anything by using a sparse matrix, but I'm not sure there's an algorithm for efficiently finding all eigenvectors of a sparse matrix.
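A small sketch of the dense route, using a made-up 4x4 matrix in which rows 0 and 2 are identical. It only shows the conversion and the eig call. A zero eigenvalue tells you the square matrix is singular, so some rows are dependent; working out which rows belong together is a separate step that is not shown here. The 1e-9 tolerance is an arbitrary choice for this example.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse matrix: rows 0 and 2 are both [1, 0, 0, 0].
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 1, 0, 3])
vals = np.array([1.0, 2.0, 1.0, 3.0])
A = sparse.csc_matrix((vals, (rows, cols)), shape=(4, 4))

# Densify, then compute all eigenvalues and eigenvectors at once.
eigenvalues, eigenvectors = np.linalg.eig(A.toarray())

# Count eigenvalues that are numerically zero.
n_zero = int(np.sum(np.abs(eigenvalues) < 1e-9))
```

A dense 5000x5000 float64 array takes about 200 MB, so the conversion itself should fit comfortably in memory.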
I have a numpy ndarray object with the following shape:
(3, 256, 170, 256).
So, basically this represents an array of 3-dimensional vectors. The vector components lie along the first axis, which lets one write something like array[0] for the relevant vector component.
Now, I am trying to use the scipy pdist function, which computes the distances between the entries. So I need to modify this array so that it can be represented as a two-dimensional matrix with 256*170*256 rows and 3 columns. pdist should then return the matrix where each element is the squared distance between the corresponding 3-dimensional vectors (if I have interpreted the documentation correctly).
Can someone tell me how I can get a view into this numpy array so that I can generate this matrix? I do not want to copy the data again (as these matrices can be quite large), so I am looking for an efficient solution.
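A minimal sketch on a much smaller stand-in array (the shape (3, 2, 3, 4) is made up for the demo): flattening the three spatial axes and then transposing gives an (N, 3) array, and for a C-contiguous input both steps return views, which np.shares_memory can confirm.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Small stand-in for the (3, 256, 170, 256) array.
arr = np.arange(3 * 2 * 3 * 4, dtype=float).reshape(3, 2, 3, 4)

# Collapse the spatial axes, then put the 3 components last.
pts = arr.reshape(3, -1).T  # shape (24, 3)

is_view = np.shares_memory(pts, arr)  # True when no data was copied

# pdist takes an (n_points, n_dims) array.
d = pdist(pts)
```

Two things to keep in mind: pdist uses the plain Euclidean metric by default (metric="sqeuclidean" gives squared distances), and it returns a condensed 1-D vector of length n*(n-1)/2 rather than a square matrix. For 256*170*256 points that is on the order of 6e13 values, so computing all pairwise distances at once is unlikely to fit in memory regardless of how the view is built.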