Understand the csr format - python

I am trying to understand how scipy CSR works.
https://docs.scipy.org/doc/scipy/reference/sparse.html
For example, for the following matrix from https://en.wikipedia.org/wiki/Sparse_matrix
( 0 0 0 0 )
( 5 8 0 0 )
( 0 0 3 0 )
( 0 6 0 0 )
it says the CSR representation is the following:
V = [ 5 8 3 6 ]
COL_INDEX = [ 0 1 2 1 ]
ROW_INDEX = [ 0 0 2 3 4 ]
Must V list the non-zero elements one row after another, and within each row from left to right?
I can understand that COL_INDEX gives the column index (the first column is indexed as 0) of each element in V.
I don't understand ROW_INDEX. Could anybody show me how ROW_INDEX was created from the original matrix? Thanks.

From the scipy manual:
csr_matrix((data, indices, indptr), [shape=(M, N)]) is the standard
CSR representation where the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not
supplied, the matrix dimensions are inferred from the index arrays.
indptr is the same as ROW_INDEX and indices is the same as COL_INDEX.
Here is an example of a naive way to create the index and value arrays (using the example matrix from the question). Essentially ROW_INDICES[i + 1] is the total number of non-zero entries from row 0 to i inclusive, with the last entry being the total number of non-zero entries.
import numpy as np

# the example matrix from the question
m = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
num_rows, num_cols = m.shape

ROW_INDICES = [0]
COL_INDICES = []
VALS = []
for i in range(num_rows):
    ROW_INDICES.append(ROW_INDICES[i])
    for j in range(num_cols):
        if m[i, j] != 0:
            ROW_INDICES[i + 1] += 1
            COL_INDICES.append(j)
            VALS.append(m[i, j])
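These three lists are exactly the (data, indices, indptr) triple that the csr_matrix constructor quoted above expects. A minimal sketch, hard-coding the values from the Wikipedia example, to show they reconstruct the matrix:

import numpy as np
from scipy.sparse import csr_matrix

V = [5, 8, 3, 6]
COL_INDEX = [0, 1, 2, 1]
ROW_INDEX = [0, 0, 2, 3, 4]

# reconstruct the original 4x4 matrix from the three CSR arrays
M = csr_matrix((V, COL_INDEX, ROW_INDEX), shape=(4, 4))
print(M.toarray())
# [[0 0 0 0]
#  [5 8 0 0]
#  [0 0 3 0]
#  [0 6 0 0]]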

coo format
I think it's best to start with the coo definition. It's easier to understand, and widely used:
In [90]: A = np.array([[0,0,0,0],[5,8,0,0],[0,0,3,0],[0,6,0,0]])
In [91]: M = sparse.coo_matrix(A)
The values are stored in 3 attributes:
In [92]: M.row
Out[92]: array([1, 1, 2, 3], dtype=int32)
In [93]: M.col
Out[93]: array([0, 1, 2, 1], dtype=int32)
In [94]: M.data
Out[94]: array([5, 8, 3, 6])
We can make a new matrix from those 3 arrays:
In [95]: sparse.coo_matrix((_94, (_92, _93))).A
Out[95]:
array([[0, 0, 0],
[5, 8, 0],
[0, 0, 3],
[0, 6, 0]])
oops, I need to add a shape, since one column is all 0s:
In [96]: sparse.coo_matrix((_94, (_92, _93)), shape=(4,4)).A
Out[96]:
array([[0, 0, 0, 0],
[5, 8, 0, 0],
[0, 0, 3, 0],
[0, 6, 0, 0]])
Another way to display this matrix:
In [97]: print(M)
(1, 0) 5
(1, 1) 8
(2, 2) 3
(3, 1) 6
np.where(A) gives the same non-zero coordinates.
In [108]: np.where(A)
Out[108]: (array([1, 1, 2, 3]), array([0, 1, 2, 1]))
conversion to csr
Once we have coo, we can easily convert it to csr. In fact sparse often does that for us:
In [98]: Mr = M.tocsr()
In [99]: Mr.data
Out[99]: array([5, 8, 3, 6], dtype=int64)
In [100]: Mr.indices
Out[100]: array([0, 1, 2, 1], dtype=int32)
In [101]: Mr.indptr
Out[101]: array([0, 0, 2, 3, 4], dtype=int32)
Sparse does several things - it sorts the indices, sums duplicates, and replaces the row array with an indptr array. Here indptr is actually longer than the original row array, but in general it will be shorter, since it has just one value per row (plus 1). But perhaps more important, most of the fast calculation routines, especially matrix multiplication, have been written using the csr format.
I've used this package a lot, and MATLAB as well, where the default definition is in the coo style but the internal storage is csc (though not as exposed to users as in scipy). But I've never had to derive indptr from scratch. I could, but I don't need to.
csr_matrix accepts inputs in the coo format, but also in the indptr etc format. I wouldn't recommend it, unless you already have those inputs calculated (say from another matrix). It's more error prone, and probably not much faster.
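For reference, a small sketch of the two constructor forms side by side, using the arrays from this example; both give the same matrix:

import numpy as np
from scipy import sparse

data = np.array([5, 8, 3, 6])
rows = np.array([1, 1, 2, 3])
cols = np.array([0, 1, 2, 1])

# coo-style input: (data, (row, col))
M1 = sparse.csr_matrix((data, (rows, cols)), shape=(4, 4))
# native csr input: (data, indices, indptr)
M2 = sparse.csr_matrix((data, cols, np.array([0, 0, 2, 3, 4])), shape=(4, 4))

print(np.array_equal(M1.toarray(), M2.toarray()))   # True - the two constructions agree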
Iteration with indptr
However, sometimes it is useful to iterate on indptr, and perform calculations directly on the data. Often this is faster than working with the provided methods.
For example we can list the nonzero values by row:
In [104]: for i in range(Mr.shape[0]):
...: pt = slice(Mr.indptr[i], Mr.indptr[i+1])
...: print(i, Mr.indices[pt], Mr.data[pt])
...:
0 [] []
1 [0 1] [5 8]
2 [2] [3]
3 [1] [6]
Keeping the initial 0 makes this iteration easier. When the matrix is (10000,90000) there's not much incentive to reduce the size of indptr by 1.
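As another small sketch of working with indptr directly: the per-row non-zero counts are just consecutive differences of indptr, and per-row sums can be taken over the same slices:

import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Mr = sparse.csr_matrix(A)

print(np.diff(Mr.indptr))          # [0 2 1 1] - stored values per row

row_sums = np.array([Mr.data[Mr.indptr[i]:Mr.indptr[i + 1]].sum()
                     for i in range(Mr.shape[0])])
print(row_sums)                    # [ 0 13  3  6]
print(Mr.sum(axis=1).A1)           # same result from the built-in method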
lil format
The lil format stores the matrix in a similar manner:
In [105]: Ml = M.tolil()
In [106]: Ml.data
Out[106]: array([list([]), list([5, 8]), list([3]), list([6])], dtype=object)
In [107]: Ml.rows
Out[107]: array([list([]), list([0, 1]), list([2]), list([1])], dtype=object)
In [110]: for i,(r,d) in enumerate(zip(Ml.rows, Ml.data)):
...: print(i, r, d)
...:
0 [] []
1 [0, 1] [5, 8]
2 [2] [3]
3 [1] [6]
Because of how rows are stored, lil actually allows us to fetch a view:
In [167]: Ml.getrowview(2)
Out[167]:
<1x4 sparse matrix of type '<class 'numpy.longlong'>'
with 1 stored elements in List of Lists format>
In [168]: for i in range(Ml.shape[0]):
...: print(Ml.getrowview(i))
...:
(0, 0) 5
(0, 1) 8
(0, 2) 3
(0, 1) 6


Why isn't eliminate_zeros() removing the zero entries?

Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
[0 0 1]
[1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. there is no value stored for that position, where before there was a value stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
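For instance, here's a small constructed sketch (not the matrix from the question) of where an explicitly stored zero comes from and what eliminate_zeros does with it:

import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 1, 2])
col = np.array([0, 1, 2])
dat = np.array([1, 0, 2])        # the middle value is an explicitly stored zero
m = csr_matrix((dat, (row, col)), shape=(3, 3))

print(m.nnz)          # 3 stored entries, one of which is a zero
m.eliminate_zeros()
print(m.nnz)          # 2 - only the true non-zeros remain stored
print(m.toarray())    # the logical contents are unchanged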
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data, and clean up afterwards.
Anyway, this eliminate_zeros is not something a beginning user is likely to need.

Create a sparse matrix from a list of tuples holding the indexes where each column has a 1

Problem:
I have a list of tuples, where each tuple represents a column of a 2D array and each element of the tuple is a row index at which that column is 1; the entries not listed in the tuple are 0.
I want to create a sparse matrix from this list of tuples in an efficient way (trying not to use for loops).
Example:
# init values
list_tuples = [
    (0, 2, 4),
    (0, 2, 3),
    (1, 3, 4)
]
n = len(list_tuples)
m = 5  # arbitrary, however m >= max(max(t) for t in list_tuples) + 1
# what I need is a function which accepts these tuples and the shape of the array
# (at least the row size, because the column size can be inferred from the list of tuples)
A = some_function(list_tuples, array_shape=(m, n))
Then what I expect to have is an array of the form:
[
[1, 1, 0]
[0, 0, 1]
[1, 1, 0]
[0, 1, 1]
[1, 0, 1]
]
Your values are the indices that are required for the compressed sparse column format. You'll also need the indptr array, which for your data is the cumulative sum of the lengths of the tuples (prepended with 0). The data array would be an array of ones with the same length as the sum of the lengths of the tuples, which you can get from the last element of the cumulative sum. Here's how that looks with your example:
In [45]: from scipy.sparse import csc_matrix
In [46]: list_tuples = [
...: (0, 2, 4),
...: (0, 2, 3),
...: (1, 3, 4)
...: ]
In [47]: indices = sum(list_tuples, ()) # Flatten the tuples into one sequence.
In [48]: indptr = np.cumsum([0] + [len(t) for t in list_tuples])
In [49]: a = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr))
In [50]: a
Out[50]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [51]: a.A
Out[51]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])
Note that csc_matrix inferred the number of rows from the maximum that it found in the indices. You can use the shape parameter to override that, e.g.
In [52]: b = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr), shape=(7, len(list_tuples)))
In [53]: b
Out[53]:
<7x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [54]: b.A
Out[54]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1],
[0, 0, 0],
[0, 0, 0]])
You can also generate a coo_matrix pretty easily. The flattened list_tuples gives the row indices, and np.repeat can be used to create the column indices:
In [63]: from scipy.sparse import coo_matrix
In [64]: i = sum(list_tuples, ()) # row indices
In [65]: j = np.repeat(range(len(list_tuples)), [len(t) for t in list_tuples])
In [66]: c = coo_matrix((np.ones(len(i), dtype=int), (i, j)))
In [67]: c
Out[67]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in COOrdinate format>
In [68]: c.A
Out[68]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])

python column major and row major matrix

How do I get the 1-D index of an element of a matrix?
For example:
b=np.array([1, 2, 3, 4, 5, 6])
c = b.reshape(2,3,order='F')#colmaj
d = b.reshape(2,3)#rowmaj
this is c:
([[1, 3, 5],
[2, 4, 6]])
this is d:
([[1, 2, 3],
[4, 5, 6]])
If I do c[1,2] I get the element 6, and I need to get the index in the 1-D array, which would be 5. I can do this mentally, but if I have a large matrix and need to select an element at random I won't be able to. I need to write functions to do this for both col-major and row-major matrices.
def linearize_colmajor(i, j, m, n):
    """
    Returns the linear index for the `(i, j)` entry of
    an `m`-by-`n` matrix stored in column-major order.
    """
Simply scale the row index by the number of columns and add the column index for row-major order. For col-major order, scale the column index by the number of rows and add the row index.
Hence, to get the flattened index for rowmaj version -
i*n+j
To get the flattened index for colmaj version -
i+m*j
where :
i = row index
j = col index
m = number of rows in the matrix
n = number of columns in the matrix
Putting into function format -
def linearize(i, j, m, n, order='C'):
    if order == 'C':    # rowmaj
        return i*n + j
    elif order == 'F':  # colmaj
        return i + m*j
    else:
        raise Exception("Invalid order value")
Sample run -
In [42]: linearize(i=1, j=1, m=2, n=3, order='C')
Out[42]: 4 # element : 5 in rowmaj array, d
In [43]: linearize(i=1, j=1, m=2, n=3, order='F')
Out[43]: 3 # element : 4 in colmaj array, c
np.ravel_multi_index converts n-d index to a flat one, with the option of specifying order:
In [152]: np.ravel_multi_index((0,2),(2,3),order='C')
Out[152]: 2
In [153]: c[0,2], c.flat[2]
Out[153]: (5, 5)
Application to the order='F' case is a bit trickier:
In [154]: np.ravel_multi_index([0,2],[2,3],order='F')
Out[154]: 4
In [155]: d[0,2], d.flat[4], d.ravel(order='F')[4]
Out[155]: (3, 5, 3)
In [156]: d.ravel()
Out[156]: array([1, 2, 3, 4, 5, 6])
In [157]: d.ravel(order='F')
Out[157]: array([1, 4, 2, 5, 3, 6])
The [1,2] element is the same in both orders, the last '6'.
Comparison with #Divakar's example:
In [160]: np.ravel_multi_index([1,1],[2,3],order='C')
Out[160]: 4
In [161]: np.ravel_multi_index([1,1],[2,3],order='F')
Out[161]: 3

Consecutive elements in a Sparse matrix row

I am working on a sparse matrix stored in COO format. What would be the fastest way to get the number of groups of consecutive non-zero elements in each row?
For example consider the following matrix:
a = [[0,1,2,0],[1,0,0,2],[0,0,0,0],[1,0,1,0]]
Its COO representation would be
(0, 1) 1
(0, 2) 2
(1, 0) 1
(1, 3) 2
(3, 0) 1
(3, 2) 1
I need the result to be [1,2,0,2]. The first row contains two non-zero elements that lie next to each other, hence they form one group. In the second row we have two non-zero elements, but they don't lie next to each other, so they form two groups. The third row has no non-zeroes and hence no groups. The fourth row again has two non-zeroes, but separated by a zero, and hence we count two groups. It would be like the number of clusters per row. Iterating through the rows is an option, but only if there is no faster solution. Any help in this regard is appreciated.
Another simple example: consider the following row:
[1,2,3,0,0,0,2,0,0,8,7,6,0,0]
The above row should return [3] since there are three groups of non-zeroes separated by zeroes.
Convert it to a dense array, and apply your logic row by row: you want the number of groups per row, zeros count when defining groups, and row iteration is faster with arrays.
In coo format your matrix looks like:
In [623]: M=sparse.coo_matrix(a)
In [624]: M.data
Out[624]: array([1, 2, 1, 2, 1, 1])
In [625]: M.row
Out[625]: array([0, 0, 1, 1, 3, 3], dtype=int32)
In [626]: M.col
Out[626]: array([1, 2, 0, 3, 0, 2], dtype=int32)
This format does not implement row indexing; csr and lil do
In [627]: M.tolil().data
Out[627]: array([[1, 2], [1, 2], [], [1, 1]], dtype=object)
In [628]: M.tolil().rows
Out[628]: array([[1, 2], [0, 3], [], [0, 2]], dtype=object)
So the sparse information for the 1st row is a list of nonzero data values, [1,2], and list of their column numbers, [1,2]. Compare that with the row of the dense array, [0, 1, 2, 0]. Which is easier to analyze?
Your first task is to write a function that analyzes one row. I haven't studied your logic enough to say whether the dense form is better than the sparse one or not. It is easy to get the column information from the dense form with M.A[0,:].nonzero().
In your last example, I can get the nonzero indices:
In [631]: np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])
Out[631]: (array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32),)
In [632]: idx=np.nonzero([1,2,3,0,0,0,2,0,0,8,7,6,0,0])[0]
In [633]: idx
Out[633]: array([ 0, 1, 2, 6, 9, 10, 11], dtype=int32)
In [634]: np.diff(idx)
Out[634]: array([1, 1, 4, 3, 1, 1], dtype=int32)
We may be able to get the desired count from the number of diff values >1, though I'd have to look at more examples to define the details.
Extension of the analysis to multiple rows depends on first thoroughly understanding the single row case.
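For instance, a sketch of that diff-based idea applied row by row via the csr indptr/indices arrays (the helper name groups_per_row is just for illustration):

import numpy as np
from scipy import sparse

def groups_per_row(M):
    # count maximal runs of adjacent non-zero columns in each row
    Mr = M.tocsr()
    counts = []
    for i in range(Mr.shape[0]):
        cols = np.sort(Mr.indices[Mr.indptr[i]:Mr.indptr[i + 1]])
        if cols.size == 0:
            counts.append(0)
        else:
            # a new group starts wherever the gap between column indices exceeds 1
            counts.append(1 + int(np.count_nonzero(np.diff(cols) > 1)))
    return counts

a = [[0, 1, 2, 0], [1, 0, 0, 2], [0, 0, 0, 0], [1, 0, 1, 0]]
print(groups_per_row(sparse.coo_matrix(a)))   # [1, 2, 0, 2]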
With the help of #hpaulj's comment I came up with the following snippet to do this:
M = m.tolil()           # m is the sparse matrix in coo format
r = []
for i in range(M.shape[0]):
    sumx = 0
    idx = M.rows[i]     # column indices of the non-zeros in row i
    if len(idx) > 2:
        tempidx = np.diff(idx)
        if 1 in tempidx:
            # gaps (> 1) separate groups; runs of 1s collapse into one group
            temp = list(filter(lambda a: a != 1, tempidx))
            sumx = 1
            counts = len(temp)
        else:
            counts = len(idx)   # no adjacent columns: every non-zero is its own group
        r.append(counts + sumx)
    elif len(idx) == 2:
        tempidx = np.diff(idx)
        if tempidx[0] == 1:
            counts = 1
        else:
            counts = 2
        r.append(counts)
    elif len(idx) == 1:
        counts = 1
        r.append(counts)
    else:
        counts = 0
        r.append(counts)
tempcluster = np.sum(r) / float(M.shape[0])
cluster.append(tempcluster)     # cluster is a list defined elsewhere in my code

Sum over rows in scipy.sparse.csr_matrix

I have a big csr_matrix and I want to sum over groups of rows and obtain a new csr_matrix with the same number of columns but a reduced number of rows. (Context: The matrix is a document-term matrix obtained from sklearn CountVectorizer and I want to be able to quickly combine documents according to codes associated with these documents)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print A.toarray()
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices is different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
>>>
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print type(result)
# check that the values are the ones we want
print result.toarray()
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
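If the groups come as lists of 0-based row indices, the S matrix can be built in one go. A sketch (groups here is just an illustrative name for that input; A is the matrix from the question):

import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

groups = [[0, 3], [1, 2, 4]]     # 0-based rows to combine: (1,4) and (2,3,5)

col_s = np.concatenate(groups)                                   # which rows of A each 1 picks up
row_s = np.repeat(range(len(groups)), [len(g) for g in groups])  # which output row it lands in
S = csr_matrix((np.ones(len(col_s), dtype=int), (row_s, col_s)),
               shape=(len(groups), A.shape[0]))

print((S * A).toarray())
# [[5 0 0 0 0]
#  [0 5 5 0 0]]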
The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations involve conversion from sparse matrices - so #Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
