Why isn't eliminate_zeros() removing the zero entries? - python

Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
 [0 0 1]
 [1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. there is no value stored for that position, when before there was a value stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])

The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
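A minimal sketch of that distinction, assuming the same 3x3 array as in the question. A stored value is overwritten with 0 by hand through the data attribute, so the matrix holds an explicit zero until eliminate_zeros drops it. The variable names are only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

mat = csr_matrix(np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]]))

# Overwrite one stored value with 0; the slot stays in the data structure.
mat.data[0] = 0
dense_before = mat.toarray()
stored_before = mat.nnz      # nnz counts stored entries, explicit zeros included

# Drop explicitly stored zeros from data/indices/indptr.
mat.eliminate_zeros()
stored_after = mat.nnz
dense_after = mat.toarray()  # the logical contents are unchanged

print(stored_before, stored_after)
print(np.array_equal(dense_before, dense_after))
```

The number of stored entries shrinks, while the dense view (zeros included) is the same before and after.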

To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data and clean up afterwards.
Anyways, this eliminate_zeros is not something a beginning user is likely to need.

Related

Understand the csr format

I am trying to understand how scipy CSR works.
https://docs.scipy.org/doc/scipy/reference/sparse.html
For example, of the following matrix on https://en.wikipedia.org/wiki/Sparse_matrix
( 0 0 0 0 )
( 5 8 0 0 )
( 0 0 3 0 )
( 0 6 0 0 )
it says the CSR representation is the following.
Must V list one row after another with non-zero elements in a row list from left to right?
I can understand COL_INDEX is the column index (column 1 is indexed as 0) corresponding to elements in V.
I don't understand ROW_INDEX. Could anybody show me how the ROW_INDEX was created from the original matrix? Thanks.
V = [ 5 8 3 6 ]
COL_INDEX = [ 0 1 2 1 ]
ROW_INDEX = [ 0 0 2 3 4 ]
From the scipy manual:
csr_matrix((data, indices, indptr), [shape=(M, N)]) is the standard
CSR representation where the column indices for row i are stored in
indices[indptr[i]:indptr[i+1]] and their corresponding values are
stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not
supplied, the matrix dimensions are inferred from the index arrays.
indptr is the same as ROW_INDEX and indices is the same as COL_INDEX.
Here is an example of a naive way to create the index and value arrays. Essentially, ROW_INDICES[i + 1] is the total number of non-zero entries from row 0 to row i inclusive, so the last entry is the total number of non-zero entries in the matrix.
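# m is assumed to be a 2-D array with shape (num_rows, num_cols)
ROW_INDICES = [0]
COL_INDICES = []
VALS = []
for i in range(num_rows):
    # start row i's running count from the total so far
    ROW_INDICES.append(ROW_INDICES[i])
    for j in range(num_cols):
        if m[i, j] != 0:  # != 0 rather than > 0, so negative entries are kept
            ROW_INDICES[i + 1] += 1
            COL_INDICES.append(j)
            VALS.append(m[i, j])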
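As a rough check of the naive construction, the sketch below wraps the same idea in a function (naive_csr is a hypothetical name, not a scipy function) and compares its output with what csr_matrix stores for the Wikipedia example matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

def naive_csr(m):
    """Build (values, column indices, row pointer) lists by scanning row by row."""
    indptr, indices, vals = [0], [], []
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            if m[i, j] != 0:
                indices.append(j)
                vals.append(m[i, j])
        # after finishing row i, record how many non-zeros were seen so far
        indptr.append(len(indices))
    return vals, indices, indptr

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
vals, indices, indptr = naive_csr(A)
ref = csr_matrix(A)

print(vals, indices, indptr)
print(ref.data, ref.indices, ref.indptr)
```

Both print the same three sequences, with indptr matching ROW_INDEX from the question.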
coo format
I think it's best to start with the coo definition. It's easier to understand, and widely used:
In [90]: A = np.array([[0,0,0,0],[5,8,0,0],[0,0,3,0],[0,6,0,0]])
In [91]: M = sparse.coo_matrix(A)
The values are stored in 3 attributes:
In [92]: M.row
Out[92]: array([1, 1, 2, 3], dtype=int32)
In [93]: M.col
Out[93]: array([0, 1, 2, 1], dtype=int32)
In [94]: M.data
Out[94]: array([5, 8, 3, 6])
We can make a new matrix from those 3 arrays:
In [95]: sparse.coo_matrix((_94, (_92, _93))).A
Out[95]:
array([[0, 0, 0],
[5, 8, 0],
[0, 0, 3],
[0, 6, 0]])
oops, I need to add a shape, since one column is all 0s:
In [96]: sparse.coo_matrix((_94, (_92, _93)), shape=(4,4)).A
Out[96]:
array([[0, 0, 0, 0],
[5, 8, 0, 0],
[0, 0, 3, 0],
[0, 6, 0, 0]])
Another way to display this matrix:
In [97]: print(M)
(1, 0) 5
(1, 1) 8
(2, 2) 3
(3, 1) 6
np.where(A) gives the same non-zero coordinates.
In [108]: np.where(A)
Out[108]: (array([1, 1, 2, 3]), array([0, 1, 2, 1]))
conversion to csr
Once we have coo, we can easily convert it to csr. In fact sparse often does that for us:
In [98]: Mr = M.tocsr()
In [99]: Mr.data
Out[99]: array([5, 8, 3, 6], dtype=int64)
In [100]: Mr.indices
Out[100]: array([0, 1, 2, 1], dtype=int32)
In [101]: Mr.indptr
Out[101]: array([0, 0, 2, 3, 4], dtype=int32)
Sparse does several things - it sorts the indices, sums duplicates, and replaces the row array with an indptr array. Here indptr is actually longer than the row array it replaces (5 entries versus 4), but in general it will be shorter, since it has just one value per row (plus 1). But perhaps more important, most of the fast calculation routines, especially matrix multiplication, have been written using the csr format.
I've used this package a lot. MATLAB as well, where the default definition is in the coo style, but the internal storage is csc (but not as exposed to users as in scipy). But I've never tried to derive indptr from scratch. I could, but I don't need to.
csr_matrix accepts inputs in the coo format, but also in the indptr etc format. I wouldn't recommend it, unless you already have those inputs calculated (say from another matrix). It's more error prone, and probably not much faster.
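For the curious, here is one possible way to derive indptr from the coo row array and feed it to the (data, indices, indptr) form of csr_matrix. It is only a sketch: it assumes the coo entries are already sorted by row and contain no duplicates, which holds for this small example but not for coo matrices in general.

```python
import numpy as np
from scipy import sparse

# coo arrays from the example above (already sorted by row, no duplicates)
row = np.array([1, 1, 2, 3])
col = np.array([0, 1, 2, 1])
data = np.array([5, 8, 3, 6])
n_rows, n_cols = 4, 4

# count the entries in each row, then take a running total with a leading 0
counts = np.bincount(row, minlength=n_rows)
indptr = np.concatenate(([0], np.cumsum(counts)))

Mr = sparse.csr_matrix((data, col, indptr), shape=(n_rows, n_cols))
ref = sparse.coo_matrix((data, (row, col)), shape=(n_rows, n_cols)).tocsr()

print(indptr)
print(np.array_equal(Mr.toarray(), ref.toarray()))
```

In practice letting tocsr() do this is simpler and handles unsorted input and duplicates.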
Iteration with indptr
However sometimes it is useful to iterate on indptr, and perform calculations directly on the data. Often this is faster than working with the provided methods.
For example we can list the nonzero values by row:
In [104]: for i in range(Mr.shape[0]):
...:     pt = slice(Mr.indptr[i], Mr.indptr[i+1])
...:     print(i, Mr.indices[pt], Mr.data[pt])
...:
0 [] []
1 [0 1] [5 8]
2 [2] [3]
3 [1] [6]
Keeping the initial 0 makes this iteration easier. When the matrix is (10000,90000) there's not much incentive to reduce the size of indptr by 1.
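As an illustration of working on the csr attributes directly, this sketch computes the number of stored entries per row and the per-row sums from indptr and data alone. The names nnz_per_row and row_sums are made up for the example.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Mr = sparse.csr_matrix(A)

# consecutive differences of indptr give the stored-entry count of each row
nnz_per_row = np.diff(Mr.indptr)

# slice data with indptr to sum each row without building new sparse matrices
row_sums = np.array([Mr.data[Mr.indptr[i]:Mr.indptr[i + 1]].sum()
                     for i in range(Mr.shape[0])])

print(nnz_per_row)
print(row_sums)
```

The sums agree with what Mr.sum(axis=1) returns, just as a flat array.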
lil format
The lil format stores the matrix in a similar manner:
In [105]: Ml = M.tolil()
In [106]: Ml.data
Out[106]: array([list([]), list([5, 8]), list([3]), list([6])], dtype=object)
In [107]: Ml.rows
Out[107]: array([list([]), list([0, 1]), list([2]), list([1])], dtype=object)
In [110]: for i,(r,d) in enumerate(zip(Ml.rows, Ml.data)):
...:     print(i, r, d)
...:
0 [] []
1 [0, 1] [5, 8]
2 [2] [3]
3 [1] [6]
Because of how rows are stored, lil actually allows us to fetch a view:
In [167]: Ml.getrowview(2)
Out[167]:
<1x4 sparse matrix of type '<class 'numpy.longlong'>'
with 1 stored elements in List of Lists format>
In [168]: for i in range(Ml.shape[0]):
...:     print(Ml.getrowview(i))
...:
(0, 0) 5
(0, 1) 8
(0, 2) 3
(0, 1) 6

Why doesn't scipy.sparse.csc_matrix preserve the indexing order of my np.array?

I am writing code to remove multiple columns from several large, parallel scipy sparse.csc matrices (meaning all matrices have the same dim, and all nnz elements are in the same places) simultaneously and efficiently. I am doing this by indexing to only the columns I want to keep for one matrix and then reusing the indices and indptr lists for the others. However, when I index the csc matrix by a list, it reorders the data list, so I cannot reuse the indices. Is there a way to force scipy to keep the data list in the original order? Why is it reordering only when indexing by a list?
import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5],
[1,0,1,0,0,0],
[0,0,0,4,0,1],
[0,3,0,1,0,4]]))
print(mat[:,3].data)
returns array([4, 1])
print(mat[:,[3]].data)
returns array([1, 4])
In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5],[1,0,1,0,0,0],[0,0,0,4,0,1],[0,3,0,1,0,4]]))
In [44]: mat
Out[44]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Column format>
In [45]: mat.data
Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
In [46]: mat.indices
Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
In [47]: mat.indptr
Out[47]: array([ 0, 2, 3, 4, 6, 7, 10], dtype=int32)
scalar selection:
In [48]: m1 = mat[:,3]
In [49]: m1
Out[49]:
<4x1 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Column format>
In [50]: m1.data
Out[50]: array([4, 1])
In [51]: m1.indices
Out[51]: array([2, 3], dtype=int32)
In [52]: m1.indptr
Out[52]: array([0, 2], dtype=int32)
list indexing:
In [53]: m2 = mat[:,[3]]
In [54]: m2.data
Out[54]: array([1, 4], dtype=int64)
In [55]: m2.indices
Out[55]: array([3, 2], dtype=int32)
In [56]: m2.indptr
Out[56]: array([0, 2], dtype=int32)
sorting:
In [57]: m2.sort_indices()
In [58]: m2.data
Out[58]: array([4, 1], dtype=int64)
In [59]: m2.indices
Out[59]: array([2, 3], dtype=int32)
csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.
csc matrices have a method, sort_indices, to ensure the indices values are ordered (within a column). Applying that might help to ensure the arrays are sorted in the same way.
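A small sketch of that suggestion, assuming two "parallel" matrices with the same sparsity pattern (the second one, other, is made up here by scaling the first). After list indexing, sort_indices() is called on both results so that their indices and indptr arrays line up regardless of how the indexing ordered them.

```python
import numpy as np
from scipy import sparse

mat = sparse.csc_matrix(np.array([[1, 0, 0, 0, 2, 5],
                                  [1, 0, 1, 0, 0, 0],
                                  [0, 0, 0, 4, 0, 1],
                                  [0, 3, 0, 1, 0, 4]]))
other = mat * 10          # hypothetical parallel matrix: same pattern, other values

keep = [3, 5]             # columns to keep
sub = mat[:, keep]
sub_other = other[:, keep]

# put row indices in ascending order within each column (in place)
sub.sort_indices()
sub_other.sort_indices()

print(sub.indices, sub.indptr, sub.data)
print(sub_other.indices, sub_other.indptr, sub_other.data)
```

With both results sorted, the index arrays of one can be compared with, or reused for, the other.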

Zero several columns in csr_matrix

Assume I have a sparse matrix:
>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I want to zero column 0 and 2. Below is what I want to get:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]])
Below is what I tried:
sp_mat = csr_matrix((data, indices, indptr), shape=(3, 3))
zero_cols = np.array([0, 2])
sp_mat[:, zero_cols] = 0
However, I get a warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Since the sp_mat I have is large, converting to lil_matrix is very slow.
What is an efficient way?
In [87]: >>> indptr = np.array([0, 2, 3, 6])
...: >>> indices = np.array([0, 2, 2, 0, 1, 2])
...: >>> data = np.array([1, 2, 3, 4, 5, 6])
...: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [88]: M
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Look at what happens with the csr assignment:
In [89]: M[:, [0, 2]] = 0
/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py:746: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [90]: M
Out[90]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [91]: M.data
Out[91]: array([0, 0, 0, 0, 0, 5, 0])
In [92]: M.indices
Out[92]: array([0, 2, 0, 2, 0, 1, 2], dtype=int32)
Not only does it give a warning, but it actually increases the number of 'sparse' terms, though most now have a 0 value. Those are only removed when we clean up:
In [93]: M.eliminate_zeros()
In [94]: M
Out[94]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In the indexed assignment, csr isn't distinguishing between setting 0s and other values. It treats all the same.
I should note that the efficiency warning is given primarily to keep users from using it repeatedly (as in an iteration). For one-time actions it is overly alarmist.
For indexed assignment, lil format is more efficient (or at least it doesn't warn about efficiency). But converting to/from that format is time consuming.
Another option is to find and set the new 0s directly, followed by an eliminate_zeros().
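One way to do that, sketched under the assumption that the matrix is in csr format (so M.indices holds the column index of every stored value): build a boolean mask over the stored entries with np.isin, zero those entries in data, then clean up.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

zero_cols = np.array([0, 2])

# True for every stored entry whose column is one of zero_cols
mask = np.isin(M.indices, zero_cols)

# only existing entries are touched, so the sparsity structure is not changed yet
M.data[mask] = 0

# now drop the explicit zeros from the structure
M.eliminate_zeros()

print(M.toarray())
```

Because only stored values are overwritten, this should not trigger the SparseEfficiencyWarning.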
Another is to use a matrix multiply. I think a diagonal sparse with 0's in the right columns will do the trick.
In [103]: M
Out[103]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [104]: D = sparse.diags([0,1,0], dtype=M.dtype)
In [105]: D
Out[105]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements (1 diagonals) in DIAgonal format>
In [106]: D.A
Out[106]:
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 0]])
In [107]: M1 = M*D
In [108]: M1
Out[108]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In [110]: M1.A
Out[110]:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]], dtype=int64)
If you multiply the matrix in-place, you don't get the efficiency warning. It's only changing the values of existing non-zero terms, so it isn't changing the sparsity of the matrix (at least not until you eliminate zeros):
In [111]: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [112]: M[:,[0,2]] *= 0
In [113]: M
Out[113]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [114]: M.eliminate_zeros()
In [115]: M
Out[115]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
Matrix multiplication is the way to go.
For my very large CSR matrix (size 2M*2M), direct assignment with sp_mat[:, zero_cols] = 0 results in an out-of-memory error. Suppose the columns to be zeroed are marked as True in the boolean array zero_mask; then multiplying by a diagonal matrix does the job efficiently (within 3 seconds).
import scipy.sparse as sp
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(int))
Here (~zero_mask).astype(int) gives a 1-d array of 0s and 1s that specifies which columns should be kept (1) and which should be zeroed (0).
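A self-contained version of the same idea on the small 3x3 example, as a sketch. The boolean mask (called zero_mask here) marks the columns to zero; the dtype is passed to diags explicitly, on the assumption that you want the result to keep the dtype of the original matrix.

```python
import numpy as np
import scipy.sparse as sp

sp_mat = sp.csr_matrix(np.array([[1, 0, 2],
                                 [0, 0, 3],
                                 [4, 5, 6]]))
zero_mask = np.array([True, False, True])   # True = zero this column

# diagonal matrix with 1 for kept columns and 0 for zeroed columns
keep = sp.diags((~zero_mask).astype(sp_mat.dtype), dtype=sp_mat.dtype)

# right-multiplying scales column j by keep[j, j]
result = (sp_mat @ keep).tocsr()
result.eliminate_zeros()   # drop any explicit zeros the product may have stored

print(result.toarray())
```

Right-multiplication scales columns; multiplying on the left by a diagonal matrix would zero rows instead.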

efficient way to iterate through coo_matrix elements ordered by column?

I have a scipy.sparse.coo_matrix matrix which I want to convert to bitsets per column for further calculation. (for the purpose of the example, I'm testing on 100Kx1M).
I'm currently doing something like this:
bitsets = [ intbitset() for _ in range(matrix.shape[1]) ]
for i, j in zip(matrix.row, matrix.col):  # built-in zip; itertools.izip exists only in Python 2
    bitsets[j].add(i)
That works, but COO matrix iterates the values by row. Ideally, I'd like to iterate by columns and then just build the bitset at once instead of adding to a different bitset every time.
Couldn't find a way to iterate the matrix column-based. Is there?
I don't mind converting to other sparse formats, but couldn't find a way to efficiently iterate the matrix there. (using nonzero() on CSC matrix has been proven to be extremely not efficient...)
Thanks!
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print, as well as the nonzero return the row and col arrays:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col, with the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say conversion to csc orders the columns, but it doesn't look like that:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
Transpose of csr produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
dtype=object)
In general iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation. And many other operations make use of that indirectly (e.g. row sum). Another relatively fast set of operations are ones that can work directly with the data attribute, without paying attention to row or column values.
coo doesn't implement indexing or iteration. csr and lil implement those.
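If the goal is one collection of row indices per column, one possible approach is to convert to csc and slice its indices array with indptr, which groups the stored entries by column. The sketch below uses plain Python sets as a stand-in for intbitset (an assumption, since that package is not used here) and a small hand-made coo matrix with the same pattern as the example above.

```python
import numpy as np
from scipy import sparse

row = np.array([1, 0, 0, 1, 1])
col = np.array([3, 2, 3, 0, 2])
data = np.ones(5)
M = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

Mc = M.tocsc()   # stored entries are now grouped by column

# rows with a stored entry in column j live in indices[indptr[j]:indptr[j+1]]
col_rows = [set(Mc.indices[Mc.indptr[j]:Mc.indptr[j + 1]].tolist())
            for j in range(Mc.shape[1])]

print(col_rows)
```

Each column's rows arrive as one contiguous slice, so a per-column container can be built in a single step instead of element by element.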

Efficient accessing in sparse matrices

I'm working with recommender systems but I'm struggling with the access times of the scipy sparse matrices.
In this case, I'm implementing TrustSVD so I need an efficient structure to operate both in columns and rows (CSR, CSC). I've thought about using both structures, dictionaries,... but either way this is always too slow, especially compared with the numpy matrix operations.
for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = ratings[u, :].nonzero()[1]
    users_who_rated_j = ratings[:, j].nonzero()[0]
    # More code...
Extra:
Each loop takes around 0.033s, so iterating once through 35,000 ratings means to wait 19min per iteration (SGD) and for a minimum of 25 iterations we're talking about 8h. Moreover, here I'm just talking about accessing, if I include the factorization part it would take around 2 days.
When you index a sparse matrix, especially just asking for a row or column, it not only has to select the values, but it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).
I think you could access each row's column indices faster by looking at the rows attribute of the lil format, or of its transpose for the columns:
In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]:
array([[0, 1, 0],
[1, 0, 0],
[0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)
You could also access these values in the csr format, but it's a bit more complicated
In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)
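To make the csr route concrete, here is a sketch that keeps both a csr and a csc copy of the ratings matrix and reads the index arrays directly, avoiding the construction of a new sparse matrix on every lookup. The helper names items_rated_by and users_who_rated are hypothetical, and the tiny matrix is the one from the lil example.

```python
import numpy as np
from scipy import sparse

ratings_csr = sparse.csr_matrix(np.array([[0, 1, 0],
                                          [1, 0, 0],
                                          [0, 1, 1]]))
ratings_csc = ratings_csr.tocsc()   # second copy, organised by column

def items_rated_by(u):
    """Column indices of the stored entries in row u (read from the csr copy)."""
    start, stop = ratings_csr.indptr[u], ratings_csr.indptr[u + 1]
    return ratings_csr.indices[start:stop]

def users_who_rated(j):
    """Row indices of the stored entries in column j (read from the csc copy)."""
    start, stop = ratings_csc.indptr[j], ratings_csc.indptr[j + 1]
    return ratings_csc.indices[start:stop]

print(items_rated_by(2))
print(users_who_rated(1))
```

The slices are views into the existing index arrays, so they should be treated as read-only. Note that they list stored entries, so any explicit zeros would be included unless eliminate_zeros() has been called.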
