In scipy, when I multiply a slice of a sparse matrix with an array containing only zeros, the result is a matrix that is less sparse (or at best equally sparse) than before, even though it should be more sparse (or at least equally sparse). The same holds for setting parts of the matrix to 0 or False:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
This is actually computationally expensive for large matrices, because it iterates over all cells in the slice, not just the nonzero ones.
From a mathematical / logical point of view, when multiplying a sparse matrix or vector, all empty cells are certain to remain empty, since 0*x == 0. The same holds for setting to zero: zero cells do not need to be explicitly set to zero.
What is the best way to deal with this?
I am using scipy version 0.17.0
In working with sparse matrices, changing the sparsity pattern is generally a very expensive operation, and so scipy does not do this silently.
If you want to remove explicitly stored zeros from a sparse matrix, you should use the eliminate_zeros() method; for example:
>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99657 stored elements in Compressed Sparse Row format>
Scipy could call the eliminate_zeros routine automatically after doing this kind of operation, but the developers chose to give the user more flexibility and control when doing something as expensive as changing the sparsity structure.
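If you would rather have the clean-up happen together with the assignment, you can wrap the two steps yourself. zero_column below is a hypothetical helper for illustration, not part of scipy:

import numpy as np
from scipy.sparse import csr_matrix

def zero_column(mat, col):
    """Zero one column of a CSR matrix and prune the explicit zeros."""
    mat[:, col] = 0        # may emit a SparseEfficiencyWarning
    mat.eliminate_zeros()  # drop the zeros that were just stored

M = csr_matrix(np.random.random((1000, 1000)) > 0.9, dtype=float)
zero_column(M, 0)          # column 0 is now empty, not stored as explicit zeros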
To recreate your code (using int type for a more compact display):
In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]:
array([[0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data # show nonzero values by row
Out[19]:
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
Setting a row (or column) explicitly. Note the efficiency warning:
In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]:
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
list([1, 1]), list([]), list([1, 1]), list([1, 1]),
list([1, 1, 1, 1, 1, 1])], dtype=object)
So yes it has set all values in the row to the specified value. And it doesn't attempt to distinguish between setting 0s as opposed to 1s. You can see the code used in M.__setitem__ and M._set_many (this is where the efficiency warning is generated).
As @jakevdp shows, you need to explicitly tell it to eliminate the excess 0s. It does not attempt to do that during the assignment.
In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]:
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
In general setting values of a matrix explicitly is discouraged, especially with csr. It's not even allowed with coo. lil is the recommended format if you need to do that.
In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]:
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
lil does take care to eliminate 0s.
Multiplication of a row by an array of 0s does not change the sparsity. Nor does it act in-place. It produces a new matrix:
In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0., 0., 0., 0., 0., 0., 0., 0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)
Multiplication with a sparse matrix does eliminate 0s (again, not in-place):
In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
(Notice also the difference in format between Out[29] and Out[32].)
As a general rule, multiplication, both element-wise and matrix, is the most efficient operation with csr matrices, especially if the other operand is also sparse. In fact row/column sums are performed with matrix multiplication, as is advanced indexing.
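For example, a row sum on a csr matrix is (in effect) a product with a column vector of ones. A small sketch of that equivalence, not taken from the answer above:

import numpy as np
from scipy import sparse

M = sparse.random(5, 5, density=0.2, format='csr')
ones = np.ones((5, 1))

# the row sums and the matrix-vector product agree
print(np.allclose(M.sum(axis=1), M * ones))   # True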
Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
[0 0 1]
[1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. no value is stored for that position, where before a value was stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
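A quick self-contained check of that claim (my own sketch, not from the answer): create an explicitly stored zero, prune it, and compare the dense views.

import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]])
mat = csr_matrix(arr)

mat.data[0] = 0              # turn a stored 1 into an explicitly stored 0
before = mat.toarray()
print(mat.nnz)               # 3 stored elements, one of them a zero

mat.eliminate_zeros()
print(mat.nnz)               # 2 stored elements
print(np.array_equal(before, mat.toarray()))  # True: logical contents unchanged
print(mat[0, 0])             # still prints 0; it is simply not stored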
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data and clean up afterwards.
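For example, you can zero stored values through the data array directly (a sketch, not from the answer above) and pay the structural cost only once at the end:

import numpy as np
from scipy import sparse

M = sparse.random(1000, 1000, density=0.01, format='csr')

# cheap: overwrite stored values in place; the structure is untouched
M.data[M.data < 0.5] = 0

# one structural clean-up afterwards
M.eliminate_zeros()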
Anyway, this eliminate_zeros is not something a beginning user is likely to need.
Assume I have a sparse matrix:
>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I want to zero columns 0 and 2. Below is what I want to get:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]])
Below is what I tried:
sp_mat = csr_matrix((data, indices, indptr), shape=(3, 3))
zero_cols = np.array([0, 2])
sp_mat[:, zero_cols] = 0
However, I get a warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Since the sp_mat I have is large, converting to lil_matrix is very slow.
What is an efficient way?
In [87]: >>> indptr = np.array([0, 2, 3, 6])
...: >>> indices = np.array([0, 2, 2, 0, 1, 2])
...: >>> data = np.array([1, 2, 3, 4, 5, 6])
...: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [88]: M
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Look at what happens with the csr assignment:
In [89]: M[:, [0, 2]] = 0
/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py:746: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [90]: M
Out[90]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [91]: M.data
Out[91]: array([0, 0, 0, 0, 0, 5, 0])
In [92]: M.indices
Out[92]: array([0, 2, 0, 2, 0, 1, 2], dtype=int32)
Not only does it give a warning, but it actually increases the number of 'sparse' terms, though most now have a 0 value. Those are only removed when we clean up:
In [93]: M.eliminate_zeros()
In [94]: M
Out[94]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In the indexed assignment, csr isn't distinguishing between setting 0s and other values. It treats them all the same.
I should note that the efficiency warning is given primarily to keep users from using such assignment repeatedly (as in an iteration). For one-time actions it is overly alarmist.
For indexed assignment, lil format is more efficient (or at least it doesn't warn about efficiency). But converting to/from that format is time consuming.
Another option is to find and set the new 0s directly, followed by an eliminate_zeros() call.
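A sketch of that approach for this example (my own illustration, using np.isin to find the stored entries that sit in the target columns):

import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, indices, indptr), shape=(3, 3))

zero_cols = np.array([0, 2])
M.data[np.isin(M.indices, zero_cols)] = 0   # touch only the data array
M.eliminate_zeros()                         # then drop the stored zeros

print(M.toarray())
# [[0 0 0]
#  [0 0 0]
#  [0 5 0]]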
Another is to use a matrix multiply. I think a diagonal sparse with 0's in the right columns will do the trick.
In [103]: M
Out[103]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [104]: D = sparse.diags([0,1,0], dtype=M.dtype)
In [105]: D
Out[105]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements (1 diagonals) in DIAgonal format>
In [106]: D.A
Out[106]:
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 0]])
In [107]: M1 = M*D
In [108]: M1
Out[108]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In [110]: M1.A
Out[110]:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]], dtype=int64)
If you multiply the matrix in-place, you don't get the efficiency warning. It's only changing the values of existing non-zero terms, so it isn't changing the sparsity of the matrix (at least not until you eliminate the zeros):
In [111]: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [112]: M[:,[0,2]] *= 0
In [113]: M
Out[113]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [114]: M.eliminate_zeros()
In [115]: M
Out[115]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
Matrix multiplication is the way to go.
For my very large CSR matrix (size 2M*2M), direct assignment with sp_mat[:, zero_cols] = 0 results in an out-of-memory error. Suppose the indices of the zeroed columns are marked as True in the boolean array zero_mask; then multiplying by a diagonal matrix does the job efficiently (within 3 seconds).
import scipy.sparse as sp
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(int))
Here (~zero_mask).astype(int) gives a 1-D array of 0s and 1s that specifies which columns should be kept (1) and which should be zeroed (0).
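A self-contained sketch of that recipe (the matrix and zero_mask here are made up for illustration):

import numpy as np
import scipy.sparse as sp

sp_mat = sp.random(10000, 10000, density=0.001, format='csr')

zero_mask = np.zeros(sp_mat.shape[1], dtype=bool)
zero_mask[[0, 2, 5]] = True                     # columns to zero out

keep = sp.diags((~zero_mask).astype(sp_mat.dtype))
sp_mat = sp_mat @ keep                          # zeroed columns vanish from the product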
I have a scipy.sparse.coo_matrix matrix which I want to convert to bitsets per column for further calculation. (for the purpose of the example, I'm testing on 100Kx1M).
I'm currently doing something like this:
from intbitset import intbitset
import itertools

bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in itertools.izip(matrix.row, matrix.col):
    bitsets[j].add(i)
That works, but the COO matrix iterates the values by row. Ideally, I'd like to iterate by column and then build each bitset at once instead of adding to a different bitset every time.
I couldn't find a way to iterate the matrix column-wise. Is there one?
I don't mind converting to other sparse formats, but I couldn't find a way to efficiently iterate the matrix there either. (Using nonzero() on a CSC matrix has proven to be extremely inefficient...)
Thanks!
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print, as well as nonzero, returns the row and col arrays:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col, with the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say conversion to csc orders the columns, but it doesn't look like that:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
Transpose of csr produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
dtype=object)
In general iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation. And many other operations make use of that indirectly (e.g. row sum). Another relatively fast set of operations are ones that can work directly with the data attribute, without paying attention to row or column values.
coo doesn't implement indexing or iteration. csr and lil implement those.
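If the goal is one bitset per column, the csc attributes give you that grouping directly, without per-element iteration. A sketch of the idea, under the assumption (as in the question) that intbitset accepts a list of integers in its constructor:

import numpy as np
from scipy import sparse
from intbitset import intbitset   # third-party package, as used in the question

# small stand-in for the 100K x 1M matrix in the question
M = sparse.random(1000, 500, density=0.01, format='coo')
Mc = M.tocsc()

# In CSC, the stored row indices of column j are the contiguous slice
# Mc.indices[Mc.indptr[j]:Mc.indptr[j+1]], so each bitset is built in one go.
bitsets = [intbitset(Mc.indices[Mc.indptr[j]:Mc.indptr[j + 1]].tolist())
           for j in range(Mc.shape[1])]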
I have a list of integers called cluster0Rand corresponding to certain indices in a scipy sparse matrix called data.
I want to create a new sparse matrix consisting of only the rows whose index is in the list.
For example,
data = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
cluster0Rand = [0,1]
The desired output would be:
csr_matrix([[1, 2, 0], [0, 0, 3]])
How can I do this efficiently, given that the real list is made up of thousands of indices and the sparse matrix is (10000, 100000)?
Given your example, plain indexing does the job:
In [300]: data = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
In [301]: idx = [0,1]
In [302]: data[idx,:]
Out[302]:
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>
In [303]: _.A
Out[303]:
array([[1, 2, 0],
[0, 0, 3]], dtype=int32)
This kind of indexing is slower with sparse matrices than with dense arrays. But it uses a sparse matrix strength, matrix multiplication: it turns idx into a selector matrix.
In [313]: (sparse.csr_matrix([[1,0,0],[0,1,0]])*data).A
Out[313]:
array([[1, 2, 0],
[0, 0, 3]], dtype=int32)
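For an arbitrary index list, the selector matrix can be built directly from the list. A sketch of the idea (my own illustration, not scipy's internal code):

import numpy as np
from scipy import sparse

data = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
idx = [0, 1]

# selector[k, idx[k]] = 1 picks out row idx[k] of data
selector = sparse.csr_matrix(
    (np.ones(len(idx)), (np.arange(len(idx)), idx)),
    shape=(len(idx), data.shape[0]))

print((selector * data).A)
# [[1. 2. 0.]
#  [0. 0. 3.]]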
I have 3 sparse matrices:
In [39]:
mat1
Out[39]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 878048 stored elements in Compressed Sparse Row format>
In [37]:
mat2
Out[37]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 744315 stored elements in Compressed Sparse Row format>
In [35]:
mat3
Out[35]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 788618 stored elements in Compressed Sparse Row format>
From the documentation, I read that it is possible to hstack, vstack, and concatenate such matrices. So I tried to hstack them:
import numpy as np
matrix1 = np.hstack([[address_feature, dayweek_feature]]).T
matrix2 = np.vstack([[matrix1, pddis_feature]]).T
X = matrix2
However, the dimensions do not match:
In [41]:
X_combined_features.shape
Out[41]:
(2, 1)
Note that I am stacking these matrices since I would like to use them with a scikit-learn classification algorithm. How should I hstack a number of different sparse matrices?
Use the sparse versions of vstack and hstack. As a general rule you need to use the sparse functions and methods, not the numpy ones with similar names; sparse matrices are not subclasses of numpy ndarray.
But your three matrices do not look sparse. They are 1x878049. One has 878048 nonzero elements - that means just one 0 element.
So you could just as well turn them into dense arrays (with .toarray() or .A) and use np.hstack or np.vstack:
np.hstack([address_feature.A, dayweek_feature.A])
And don't use the double brackets. All of the concatenate functions take a simple list or tuple of the arrays, and that list can have more than 2 arrays:
In [296]: A=sparse.csr_matrix([0,1,2,0,0,1])
In [297]: B=sparse.csr_matrix([0,0,0,1,0,1])
In [298]: C=sparse.csr_matrix([1,0,0,0,1,0])
In [299]: sparse.vstack([A,B,C])
Out[299]:
<3x6 sparse matrix of type '<class 'numpy.int32'>'
with 7 stored elements in Compressed Sparse Row format>
In [300]: sparse.vstack([A,B,C]).A
Out[300]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)
In [301]: sparse.hstack([A,B,C]).A
Out[301]: array([[0, 1, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]], dtype=int32)
In [302]: np.vstack([A.A,B.A,C.A])
Out[302]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)