Scipy CSR sparse matrix is actually COO? - python

I've been recently dealing with sparse matrices. My aim is to somehow convert an adjacency list for a graph into the CSR format, defined here: http://devblogs.nvidia.com/parallelforall/wp-content/uploads/2014/07/CSR.png.
One option I see is to first construct a NumPy matrix and convert it using scipy.sparse.csr_matrix. The problem is that the CSR in SciPy seems somewhat different from the one discussed in the link. My question is: is this a real discrepancy, so that I need to write my own parser, or can SciPy in fact convert into the CSR format defined in the link?
A bit more about the problem, let's say I have a matrix:
matrix([[1, 1, 0],
        [0, 0, 1],
        [1, 0, 1]])
The CSR format for this consists of two arrays, column (C) and row (R). What I am aiming for looks like:
C: [0,1,2,0,2]
R: [0,2,3,5]
SciPy returns:
(0, 0) 1
(0, 1) 1
(1, 2) 1
(2, 0) 1
(2, 2) 1
where the second column is the same as my C, yet to my understanding this is the COO format, not CSR. (This was done using the csr_matrix(adjacency_matrix) function.)

There is a difference in what is stored internally and what you see when you simply print the matrix via print(A) (where A is a csr_matrix).
In the documentation the attributes are listed. Among others there are the following three attributes:
data CSR format data array of the matrix
indices CSR format index array of the matrix
indptr CSR format index pointer array of the matrix
You can access (and manipulate) them through A.data, A.indices and A.indptr.
Bottom line: The CSR format in scipy is a "real" CSR format and you do not need to write your own parser (as long as you don't mind the data array, which is unnecessary in your case).
Also note: A matrix in CSR format is always represented by three arrays, not two.
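As a quick check of the attributes described above, the following sketch builds the 3x3 example from the question and reads back the two arrays the asker calls C and R. The second half builds the same matrix straight from an adjacency list; the list layout `adj` (one list of neighbour indices per node) is an assumption for illustration, not something given in the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

adjacency_matrix = np.array([[1, 1, 0],
                             [0, 0, 1],
                             [1, 0, 1]])
A = csr_matrix(adjacency_matrix)

C = A.indices  # column index of every stored entry
R = A.indptr   # row pointers: row i occupies C[R[i]:R[i+1]]
print(C.tolist(), R.tolist())

# Assumed adjacency-list layout: adj[i] lists the neighbours of node i.
adj = [[0, 1], [2], [0, 2]]
indices = np.concatenate([np.asarray(n, dtype=np.int32) for n in adj])
indptr = np.concatenate([[0], np.cumsum([len(n) for n in adj])])
data = np.ones(len(indices), dtype=np.int8)  # placeholder values, all 1

# The (data, indices, indptr) constructor skips the dense intermediate.
B = csr_matrix((data, indices, indptr), shape=(len(adj), len(adj)))
```

Building from (data, indices, indptr) avoids allocating the dense matrix, which matters for large graphs.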

Related

How do you store explicit 0 values in a scipy sparse lil_matrix?

scipy.sparse.lil_matrix objects do not seem to store explicitly-set 0 values. Other sparse matrices, such as the csr_matrix, do.
Consider the following example:
In [1]: from scipy.sparse import lil_matrix
In [2]: import numpy as np
In [3]: x = lil_matrix((5, 5), dtype=np.float32)
In [4]: x[3, 3] = 0
In [5]: x
Out[5]:
<5x5 sparse matrix of type '<class 'numpy.float32'>'
with 0 stored elements in LInked List format>
This is bad because sometimes there will be a 0 distance between elements of a graph (e.g., duplicates of a datapoint). If I pass a lil_matrix to, e.g., scipy.sparse.csgraph.connected_components, it will detect the incorrect number of connected components because the explicit 0 is converted back to "sparsity" and therefore treated as infinite distance.
I cannot use csr_matrix because it is very inefficient to assign elements to it. However, it will store explicitly-set 0 values unlike lil_matrix. Replace lil_matrix with csr_matrix in the above code and the output changes to:
<5x5 sparse matrix of type '<class 'numpy.float32'>'
with 1 stored elements in Compressed Sparse Row format>
Does anyone know how to store explicit 0 values in lil_matrix objects?
Thanks.
lil __setitem__ uses a compiled lil_fancy_set function. Its docstring says:
In [320]: sparse._csparsetools.lil_fancy_set?
Docstring:
Set multiple items to a LIL matrix.
Checks for zero elements and deletes them.
Parameters
----------
M, N, rows, data
LIL matrix data
i_idx, j_idx
Indices of elements to insert to the new LIL matrix.
values
Values of items to set.
Type: builtin_function_or_method
csr matrix has eliminate_zeros method:
Signature: M.eliminate_zeros()
Source:
def eliminate_zeros(self):
    """Remove zero entries from the matrix

    This is an *in place* operation
    """
    M, N = self._swap(self.shape)
    _sparsetools.csr_eliminate_zeros(M, N, self.indptr, self.indices,
                                     self.data)
    self.prune()  # nnz may have changed
File: /usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py
Type: method
It also has sum_duplicates method. This is used when converting coo format to csr, and facilitates creating a matrix from overlapping submatrices.
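A small sketch of the CSR behaviour described above: an explicit zero passed through the (data, (row, col)) constructor stays stored until eliminate_zeros() is called. The variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# One explicitly stored zero at position (3, 3) of a 5x5 matrix.
values = np.array([0.0], dtype=np.float32)
rows = np.array([3])
cols = np.array([3])
x = csr_matrix((values, (rows, cols)), shape=(5, 5))

stored_before = x.nnz   # counts stored entries, including explicit zeros
x.eliminate_zeros()     # in-place removal of the stored zero
stored_after = x.nnz
print(stored_before, stored_after)
```

If the explicit zeros are needed (for example as zero-distance edges), one option under this behaviour is to collect the row, col and data arrays first and construct the CSR matrix once, rather than assigning element by element.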

Overwrite instead of add for duplicate triplets when creating sparse matrix in scipy

In scipy, to create a sparse matrix from triple format data (row, col and data arrays), the default behavior is to sum the data values for all duplicates. Can I change this behavior to overwrite (or do nothing) instead?
For example:
import scipy.sparse as sparse
rows = [0, 0]
cols = [0, 0]
data = [1, 1]
S = sparse.coo_matrix((data, (rows, cols)))
Here, S.todense() is equal to matrix([[2]]) but I would wish it to be matrix([[1]]).
In the documentation of sparse.coo_matrix, it reads
By default when converting to CSR or CSC format, duplicate (i,j)
entries will be summed together. This facilitates efficient
construction of finite element matrices and the like.
It appears from that formulation that there might be other options than the default.
I've seen discussion on the scipy github about giving more control over this summing, but I don't know of any production changes. As the docs indicate, there's a long standing tradition over summing the duplicates.
As created, the coo matrix does not sum; it just assigns your parameters to its attributes:
In [697]: S = sparse.coo_matrix((data, (rows, cols)))
In [698]: S.data
Out[698]: array([1, 1])
In [699]: S.row
Out[699]: array([0, 0], dtype=int32)
In [700]: S.col
Out[700]: array([0, 0], dtype=int32)
Converting to dense (or to csr/csc) does sum - but doesn't change S itself:
In [701]: S.A
Out[701]: array([[2]])
In [702]: S.data
Out[702]: array([1, 1])
You can perform the summing in place with:
In [703]: S.sum_duplicates()
In [704]: S.data
Out[704]: array([2], dtype=int32)
I don't know of a way of either removing the duplicates or bypassing that action. I may look up the relevant issue.
=================
S.todok() does an inplace sum (that is, changes S). Looking at that code I see that it calls self.sum_duplicates. The following replicates that without the sum:
In [727]: dok=sparse.dok_matrix((S.shape),dtype=S.dtype)
In [728]: dok.update(zip(zip(S.row,S.col),S.data))
In [729]: dok
Out[729]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Dictionary Of Keys format>
In [730]: print(dok)
(0, 0) 1
In [731]: S
Out[731]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in COOrdinate format>
In [732]: dok.A
Out[732]: array([[1]])
It's a dictionary update, so the final value is the last of the duplicates. I found elsewhere that dok.update is a pretty fast way of adding values to a sparse matrix.
tocsr inherently does the sum; tolil uses tocsr; so this todok approach may be simplest.
If you want only values of 1:
S.sum_duplicates()
S.data[:]=1
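As an alternative to the dictionary route, here is a hedged sketch that drops all but the last duplicate before the matrix is built, so the later summing has nothing to add. It relies only on np.unique; the names `keys` and `keep` and the sample triplets are made up for this illustration.

```python
import numpy as np
from scipy import sparse

rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
data = np.array([1, 5, 7])
shape = (2, 2)

# Collapse each (row, col) pair into a single integer key.
keys = rows * shape[1] + cols

# np.unique reports the first occurrence of each key; searching the
# reversed array therefore finds the last occurrence in the original order.
_, first_in_reversed = np.unique(keys[::-1], return_index=True)
keep = len(keys) - 1 - first_in_reversed

S_last = sparse.coo_matrix((data[keep], (rows[keep], cols[keep])), shape=shape)
dense = S_last.toarray()
print(dense)
```

To keep the first duplicate instead, call np.unique on `keys` directly and use the returned indices as `keep`.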

scipy sparse matrix: remove the rows whose all elements are zero

I have a sparse matrix produced by sklearn's TfidfVectorizer. I believe that some rows are all-zero rows. I want to remove them. However, as far as I know, the existing built-in functions, e.g. nonzero() and eliminate_zeros(), focus on zero entries, rather than rows.
Is there any easy way to remove all-zero rows from a sparse matrix?
Example:
What I have now (actually in sparse format):
[ [0, 0, 0]
[1, 0, 2]
[0, 0, 1] ]
What I want to get:
[ [1, 0, 2]
[0, 0, 1] ]
Slicing + getnnz() does the trick:
M = M[M.getnnz(1)>0]
Works directly on csr_array.
You can also remove all 0 columns without changing formats:
M = M[:,M.getnnz(0)>0]
However if you want to remove both you need
M = M[M.getnnz(1)>0][:,M.getnnz(0)>0] #GOOD
I am not sure why but
M = M[M.getnnz(1)>0, M.getnnz(0)>0] #BAD
does not work.
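The chained form can be checked on the example from the question. A likely explanation for the failing variant, assuming SciPy follows NumPy's indexing rules here, is that two index arrays given together are paired element by element instead of selecting a row/column grid. The sketch below computes both masks first and applies them one axis at a time.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix(np.array([[0, 0, 0],
                         [1, 0, 2],
                         [0, 0, 1]]))

row_mask = M.getnnz(axis=1) > 0  # rows with at least one stored entry
col_mask = M.getnnz(axis=0) > 0  # columns with at least one stored entry

# Apply the masks one axis at a time.
trimmed = M[row_mask][:, col_mask]
print(trimmed.toarray())
```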
There aren't existing functions for this, but it's not too bad to write your own:
def remove_zero_rows(M):
    M = scipy.sparse.csr_matrix(M)
First, convert the matrix to CSR (compressed sparse row) format. This is important because CSR matrices store their data as a triple of (data, indices, indptr), where data holds the nonzero values, indices stores column indices, and indptr holds row index information. The docs explain better:
the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]].
So, to find rows without any nonzero values, we can just look at successive values of M.indptr. Continuing our function from above:
    num_nonzeros = np.diff(M.indptr)
    return M[num_nonzeros != 0]
The second benefit of CSR format here is that it's relatively cheap to slice rows, which simplifies the creation of the resulting matrix.
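Put together with the imports it needs, the function from this answer can be run on the example matrix from the question:

```python
import numpy as np
import scipy.sparse


def remove_zero_rows(M):
    # CSR stores row boundaries in indptr, so successive differences
    # give the number of stored entries per row.
    M = scipy.sparse.csr_matrix(M)
    num_nonzeros = np.diff(M.indptr)
    return M[num_nonzeros != 0]


M = scipy.sparse.csr_matrix(np.array([[0, 0, 0],
                                      [1, 0, 2],
                                      [0, 0, 1]]))
result = remove_zero_rows(M).toarray()
print(result)
```

Note that np.diff(M.indptr) counts stored entries, so a row holding only explicitly stored zeros would be kept; calling M.eliminate_zeros() first avoids that.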
Thanks for your reply, @perimosocordiae
I just found another solution myself. I am posting it here in case someone needs it in the future.
import numpy

def remove_zero_rows(X):
    # X is a scipy sparse matrix. We want to remove all zero rows from it
    nonzero_row_indice, _ = X.nonzero()
    unique_nonzero_indice = numpy.unique(nonzero_row_indice)
    return X[unique_nonzero_indice]

Matlab cell2mat function in Python Numpy?

Does numpy have the cell2mat function? Here is the link to matlab. I found an implementation of something similar but it only works when we can split it evenly. Here is the link.
In a sense Python has had 'cells' a lot longer than MATLAB: the list. A Python list is a direct substitute for a 1d cell (or rather, a cell with one dimension of size 1). A 2d cell could be represented as a nested list. numpy arrays with dtype object also work. I believe that is what scipy.io.loadmat uses to render cells in .mat files.
np.array() converts a list, or lists of lists, etc, to a ndarray. Sometimes it needs help specifying the dtype. It also tries to render the input to as high a dimensional array as possible.
np.array([1,2,3])
np.array(['1',2,'abc'],dtype=object)
np.array([[1,2,3],[1,2],[3]], dtype=object)  # ragged input; recent NumPy raises ValueError without dtype=object
np.array([[1,2],[3,4]])
And MATLAB structures map onto Python dictionaries or objects.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
loadmat can also represent structures as numpy structured (record) arrays.
There is np.concatenate that takes a list of arrays, and its convenience derivatives vstack, hstack, dstack. Mostly they tweak the dimensions of the arrays, and then concatenate on one axis.
Here's a rough approximation to the MATLAB cell2mat example:
C = {[1], [2 3 4];
[5; 9], [6 7 8; 10 11 12]}
Construct ndarrays with the same shapes:
In [61]: c11=np.array([[1]])
In [62]: c12=np.array([[2,3,4]])
In [63]: c21=np.array([[5],[9]])
In [64]: c22=np.array([[6,7,8],[10,11,12]])
Join them with a combination of hstack and vstack - i.e. concatenate along the matching axes.
In [65]: A=np.vstack([np.hstack([c11,c12]),np.hstack([c21,c22])])
# or A=np.hstack([np.vstack([c11,c21]),np.vstack([c12,c22])])
producing:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Or more generally (and compactly)
In [75]: C=[[c11,c12],[c21,c22]]
In [76]: np.vstack([np.hstack(c) for c in C])
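NumPy also provides np.block, which assembles an array from a nested list of blocks and so covers the same case as the hstack/vstack combination above. A short sketch using the same four pieces:

```python
import numpy as np

c11 = np.array([[1]])
c12 = np.array([[2, 3, 4]])
c21 = np.array([[5], [9]])
c22 = np.array([[6, 7, 8], [10, 11, 12]])

# The inner lists are joined along the last axis, the outer list along the first.
A = np.block([[c11, c12],
              [c21, c22]])
print(A)
```

As with cell2mat, the blocks in each row must agree in height and the assembled rows must agree in width.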
I usually use object arrays as a replacement for Matlab's cell arrays. For example:
cell_array = np.array([[np.arange(10)],
[np.arange(30,40)] ],
dtype='object')
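This is meant as a 2x1 object array containing length-10 numpy vectors (because the inner arrays all have the same length, np.array actually unpacks them into a (2, 1, 10) object array). I can perform the cell2mat functionality by: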
arr = np.concatenate(cell_array).astype('int')
This returns a 2x10 int array. You can change .astype('int') to be whatever data type you need, or you could grab it from one of the objects in your cell_array,
arr = np.concatenate(cell_array).astype(cell_array[0].dtype)
Good luck!

Select specific rows from a 2d Numpy array using a sparse binary 1-d array

I am having issues figuring out how to do this operation.
I have a variable index, which is a 1xN sparse binary array, and a 2-d array samples (NxM). I want to use index to select specific rows of samples and get a 2-d array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape) # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
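A tiny sketch of the short answer, with made-up data standing in for index and samples (a 1x4 sparse binary row vector and a 4x3 dense array):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the asker's variables.
index = csr_matrix(np.array([[0, 1, 0, 1]]))  # selects rows 1 and 3
samples = np.arange(12).reshape(4, 3)

# nonzero() returns (row indices, column indices); for a 1xN row vector
# the column indices are the positions of the selected rows.
selected_rows = index.nonzero()[1]
selected_samples = samples[selected_rows]
print(selected_samples)
```

This avoids converting the sparse vector to dense, which the todense() approach in the question does.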
