Assume I have a sparse matrix:
>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I want to zero column 0 and 2. Below is what I want to get:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]])
Below is what I tried:
sp_mat = csr_matrix((data, indices, indptr), shape=(3, 3))
zero_cols = np.array([0, 2])
sp_mat[:, zero_cols] = 0
However, I get a warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Since the sp_mat I have is large, converting to lil_matrix is very slow.
What is an efficient way?
In [87]: >>> indptr = np.array([0, 2, 3, 6])
...: >>> indices = np.array([0, 2, 2, 0, 1, 2])
...: >>> data = np.array([1, 2, 3, 4, 5, 6])
...: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [88]: M
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Look at what happens with the csr assignment:
In [89]: M[:, [0, 2]] = 0
/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py:746: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [90]: M
Out[90]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [91]: M.data
Out[91]: array([0, 0, 0, 0, 0, 5, 0])
In [92]: M.indices
Out[92]: array([0, 2, 0, 2, 0, 1, 2], dtype=int32)
Not only does it give a warning, but it actually increases the number of 'sparse' terms, though most now have a 0 value. Those are only removed when we clean up:
In [93]: M.eliminate_zeros()
In [94]: M
Out[94]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In the indexed assignment, csr doesn't distinguish between setting 0s and other values; it treats them all the same.
I should note that the efficiency warning is given primarily to keep users from using indexed assignment repeatedly (as in an iteration). For one-time actions it is overly alarmist.
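If the assignment really is a one-off, you can silence the warning around it. A small sketch, using the sp_mat and zero_cols from the question (SparseEfficiencyWarning is importable from scipy.sparse):
import warnings
from scipy.sparse import SparseEfficiencyWarning
with warnings.catch_warnings():
    warnings.simplefilter('ignore', SparseEfficiencyWarning)
    sp_mat[:, zero_cols] = 0      # the assignment from the question
sp_mat.eliminate_zeros()          # drop the explicit zeros it leaves behind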
For indexed assignment, lil format is more efficient (or at least it doesn't warn about efficiency). But converting to/from that format is time consuming.
Another option is to find and set the new 0s directly, followed by an eliminate_zeros().
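A minimal sketch of that direct approach, using the sp_mat and zero_cols from the question (for csr, the indices array holds the column index of every stored entry):
import numpy as np
mask = np.isin(sp_mat.indices, zero_cols)   # stored entries that sit in the zeroed columns
sp_mat.data[mask] = 0                       # set them to 0 in place
sp_mat.eliminate_zeros()                    # then drop them from the structure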
Another is to use a matrix multiply. I think a diagonal sparse with 0's in the right columns will do the trick.
In [103]: M
Out[103]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [104]: D = sparse.diags([0,1,0], dtype=M.dtype)
In [105]: D
Out[105]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements (1 diagonals) in DIAgonal format>
In [106]: D.A
Out[106]:
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 0]])
In [107]: M1 = M*D
In [108]: M1
Out[108]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In [110]: M1.A
Out[110]:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]], dtype=int64)
If you multiply the matrix in-place, you don't get the efficiency warning. It's only changing the values of existing non-zero terms, so it isn't changing the sparsity of the matrix (at least not until you eliminate the zeros):
In [111]: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [112]: M[:,[0,2]] *= 0
In [113]: M
Out[113]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [114]: M.eliminate_zeros()
In [115]: M
Out[115]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
Matrix multiplication is the way to go.
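A compact way to package the diagonal multiply is a small helper (my own sketch, not a scipy function):
import numpy as np
from scipy import sparse
def zero_columns(M, cols):
    # 1 on the diagonal keeps a column, 0 zeroes it out
    keep = np.ones(M.shape[1], dtype=M.dtype)
    keep[cols] = 0
    return M @ sparse.diags(keep)
print(zero_columns(M, [0, 2]).toarray())
# [[0 0 0]
#  [0 0 0]
#  [0 5 0]]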
For my very large CSR matrix (size 2M*2M), direct assignment with sp_mat[:, zero_cols] = 0 results in an out-of-memory error. Suppose the indices of the zeroed columns are marked as True in the boolean array zero_mask; then multiplying by a diagonal matrix does the job efficiently (within 3 seconds).
import scipy.sparse as sp
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(int))
Here (~zero_mask).astype(int) gives a 1-D array of 0s and 1s that specifies which columns should be kept (1) and which should be zeroed (0).
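For example, if the columns to zero are given as an index array (like zero_cols in the first question), the boolean mask can be built first; a short sketch continuing the snippet above:
import numpy as np
zero_mask = np.zeros(sp_mat.shape[1], dtype=bool)
zero_mask[zero_cols] = True
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(int))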
Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
 [0 0 1]
 [1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. there is no longer a value stored for that position, where before there was a value stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
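A quick way to convince yourself of this, as a small sketch using the question's matrix:
import numpy as np
from scipy.sparse import csr_matrix
mat = csr_matrix(np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]]))
mat.data[0] = 0                # turn one stored value into an explicit zero
before = mat.toarray()
print(mat.nnz)                 # 3 stored entries, one of them an explicit zero
mat.eliminate_zeros()
print(mat.nnz)                 # 2 stored entries
print(np.array_equal(before, mat.toarray()))   # True: logical contents unchanged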
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand:
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data, and clean up afterwards.
Anyway, eliminate_zeros is not something a beginning user is likely to need.
I am writing code to remove multiple columns from several large, parallel scipy sparse.csc matrices (meaning all matrices have the same dim, and all nnz elements are in the same places) simultaneously and efficiently. I am doing this by indexing to only the columns I want to keep for one matrix and then reusing the indices and indptr lists for the others. However, when I index the csc matrix by a list, it reorders the data list, so I cannot reuse the indices. Is there a way to force scipy to keep the data list in the original order? Why is it reordering only when indexing by a list?
import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5],
[1,0,1,0,0,0],
[0,0,0,4,0,1],
[0,3,0,1,0,4]]))
print mat[:,3].data
returns array([4, 1])
print mat[:,[3]].data
returns array([1, 4])
In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5],[1,0,1,0,0,0],[0,0,0,4,
...: 0,1],[0,3,0,1,0,4]]))
...:
...:
In [44]: mat
Out[44]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Column format>
In [45]: mat.data
Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
In [46]: mat.indices
Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
In [47]: mat.indptr
Out[47]: array([ 0, 2, 3, 4, 6, 7, 10], dtype=int32)
scalar selection:
In [48]: m1 = mat[:,3]
In [49]: m1
Out[49]:
<4x1 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Column format>
In [50]: m1.data
Out[50]: array([4, 1])
In [51]: m1.indices
Out[51]: array([2, 3], dtype=int32)
In [52]: m1.indptr
Out[52]: array([0, 2], dtype=int32)
list indexing:
In [53]: m2 = mat[:,[3]]
In [54]: m2.data
Out[54]: array([1, 4], dtype=int64)
In [55]: m2.indices
Out[55]: array([3, 2], dtype=int32)
In [56]: m2.indptr
Out[56]: array([0, 2], dtype=int32)
sorting:
In [57]: m2.sort_indices()
In [58]: m2.data
Out[58]: array([4, 1], dtype=int64)
In [59]: m2.indices
Out[59]: array([2, 3], dtype=int32)
csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.
csc matrices have a method, sort_indices, to ensure the indices are ordered (within a column). Applying that might help to ensure the arrays are sorted in the same way.
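If the real goal is to reuse the pattern work across the parallel matrices, one option is to skip list indexing entirely and compute, once, which stored entries survive the column selection. This is my own sketch (mats and keep_cols are illustrative names, not from the question), assuming all matrices share exactly the same indptr/indices and keep_cols is sorted ascending:
import numpy as np
import scipy.sparse as sp
def drop_columns_parallel(mats, keep_cols):
    pattern = mats[0]
    # column number of every stored entry, in storage order
    col_of_entry = np.repeat(np.arange(pattern.shape[1]), np.diff(pattern.indptr))
    keep_entry = np.isin(col_of_entry, keep_cols)     # which stored entries survive
    new_indices = pattern.indices[keep_entry]
    counts = np.diff(pattern.indptr)[keep_cols]       # surviving entries per kept column
    new_indptr = np.concatenate(([0], np.cumsum(counts)))
    shape = (pattern.shape[0], len(keep_cols))
    # only the data arrays differ between the parallel matrices
    return [sp.csc_matrix((m.data[keep_entry], new_indices, new_indptr), shape=shape)
            for m in mats]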
In scipy, when I multiply a slice of a sparse matrix with an array containing only zeros, the result is a matrix that is less or equally sparse than before, even though it should be more or equally sparse. The same holds for setting parts of the matrix to 0 or False:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
This is actually computationally expensive for large matrices, because it iterates over all cells in the slice, not just the nonzero ones.
From a mathematical / logical point of view, when multiplying a sparse matrix or vector, all empty cells are certain to remain empty since 0*x == 0. The same holds for setting to zero: zero-cells do not need to be explicitly set to zero.
What is the best way to deal with this?
I am using scipy version 0.17.0
In working with sparse matrices, changing the sparsity pattern is generally a very expensive operation, and so scipy does not do this silently.
If you want to remove explicitly stored zeros from a sparse matrix, you should use the eliminate_zeros() method; for example:
>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99657 stored elements in Compressed Sparse Row format>
Scipy could call the eliminate_zeros routine automatically after doing this kind of operation, but the developers chose to give the user more flexibility and control when doing something as expensive as changing the sparsity structure.
To recreate your code (using int type for a more compact display):
In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]:
array([[0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data # show nonzero values by row
Out[19]:
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
Setting a row (or column) explicitly. Note the efficiency warning:
In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]:
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
list([1, 1]), list([]), list([1, 1]), list([1, 1]),
list([1, 1, 1, 1, 1, 1])], dtype=object)
So yes, it has set all values in the row to the specified value, and it doesn't attempt to distinguish between setting 0s as opposed to 1s. You can see the code used in M.__setitem__ and M._set_many (this is where the efficiency warning is generated).
As @jakevpd shows, you need to explicitly tell it to eliminate the excess 0s. It does not attempt to do that during the assignment.
In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]:
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
In general, setting values of a sparse matrix explicitly is discouraged, especially with csr. It's not even allowed with coo; lil is the recommended format if you need to do that.
In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]:
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
lil does take care to eliminate 0s.
Multiplication of a row by an array of 0s does not change the sparsity. Nor does it act in-place. It produces a new matrix:
In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0., 0., 0., 0., 0., 0., 0., 0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)
Multiplication with a sparse matrix does eliminate 0s (again, not in-place):
In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
(Notice also the difference in format between Out[29] and Out[32].)
As a general rule, multiplication, both element-wise and matrix, is the most efficient operation with csr matrices, especially if the other operand is also sparse. In fact, row/column sums are performed with matrix multiplication, as is advanced indexing.
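For instance, a column sum can be written as a product with a vector of ones; this is just an illustrative check, not scipy's actual internal code:
import numpy as np
from scipy import sparse
M = sparse.random(5, 4, density=0.3, format='csr')
ones = np.ones((M.shape[0], 1))
print(np.allclose(M.T @ ones, M.sum(axis=0).T))   # True: same result as the built-in sum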
I have the following code in python:
import numpy as np
from scipy.sparse import csr_matrix
M = csr_matrix(np.ones([2, 2],dtype=np.int32))
print(M)
print(M.data.shape)
for i in range(np.shape(M)[0]):
    for j in range(np.shape(M)[1]):
        if i == j:
            M[i, j] = 0
print(M)
print(M.data.shape)
The output of the first 2 prints is:
(0, 0) 1
(0, 1) 1
(1, 0) 1
(1, 1) 1
(4,)
The code is changing the values at the diagonal positions (i == j), setting them to zero.
After executing the loops then the output of the last 2 prints is:
(0, 0) 0
(0, 1) 1
(1, 0) 1
(1, 1) 0
(4,)
If I understand the concept of sparse matrices correctly, this should not be the case. It should not show me the zero values, and the output of the last 2 prints should look like this:
(0, 1) 1
(1, 0) 1
(2,)
Does anyone have explanation for this? Am I doing something wrong?
Yes, you are trying to change elements of the matrix one by one. :)
OK, it does work that way, though if you changed things the other way (setting a 0 to nonzero) you would get an efficiency warning.
To keep your kind of change fast, it only changes the value in the M.data array, and does not recalculate the indices. You have to invoke the separate csr_matrix.eliminate_zeros method to clean up the matrix. For best speed, call it once at the end of the loop.
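In other words, a sketch of the question's loop with the cleanup pulled out of the loop (same 2x2 M as above):
for i in range(M.shape[0]):
    M[i, i] = 0          # only rewrites values already stored in M.data
M.eliminate_zeros()      # one cleanup pass, after the loop
print(M.data.shape)      # (2,)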
There is a csr_matrix.setdiag method that lets you set the whole diagonal with one call. It still needs the cleanup.
In [1633]: M=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [1634]: M
Out[1634]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [1635]: M.A
Out[1635]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]], dtype=int32)
In [1636]: M.setdiag(0)
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [1637]: M
Out[1637]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 9 stored elements in Compressed Sparse Row format>
In [1638]: M.A
Out[1638]:
array([[0, 1, 2],
[3, 0, 5],
[6, 7, 0]])
In [1639]: M.data
Out[1639]: array([0, 1, 2, 3, 0, 5, 6, 7, 0])
In [1640]: M.eliminate_zeros()
In [1641]: M
Out[1641]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
In [1642]: M.data
Out[1642]: array([1, 2, 3, 5, 6, 7])
I have 3 sparse matrices:
In [39]:
mat1
Out[39]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 878048 stored elements in Compressed Sparse Row format>
In [37]:
mat2
Out[37]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 744315 stored elements in Compressed Sparse Row format>
In [35]:
mat3
Out[35]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 788618 stored elements in Compressed Sparse Row format>
From the documentation, I read that it is possible to hstack, vstack, and concatenate such matrices. So I tried to hstack them:
import numpy as np
matrix1 = np.hstack([[address_feature, dayweek_feature]]).T
matrix2 = np.vstack([[matrix1, pddis_feature]]).T
X = matrix2
However, the dimensions do not match:
In [41]:
X_combined_features.shape
Out[41]:
(2, 1)
Note that I am stacking such matrices since I would like to use them with a scikit-learn classification algorithm. Therefore, how should I hstack a number of different sparse matrices?
Use the sparse versions of hstack/vstack. As a general rule you need to use sparse functions and methods, not the numpy ones with similar names; sparse matrices are not subclasses of numpy ndarray.
But your three matrices do not look sparse. They are 1x878049, and one has 878048 nonzero elements - that means just one 0 element.
So you could just as well turn them into dense arrays (with .toarray() or .A) and use np.hstack or np.vstack.
np.hstack([address_feature.A, dayweek_feature.A])
And don't use the double brackets. All the concatenate functions take a simple list or tuple of the arrays, and that list can have more than 2 arrays:
In [296]: A=sparse.csr_matrix([0,1,2,0,0,1])
In [297]: B=sparse.csr_matrix([0,0,0,1,0,1])
In [298]: C=sparse.csr_matrix([1,0,0,0,1,0])
In [299]: sparse.vstack([A,B,C])
Out[299]:
<3x6 sparse matrix of type '<class 'numpy.int32'>'
with 7 stored elements in Compressed Sparse Row format>
In [300]: sparse.vstack([A,B,C]).A
Out[300]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)
In [301]: sparse.hstack([A,B,C]).A
Out[301]: array([[0, 1, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]], dtype=int32)
In [302]: np.vstack([A.A,B.A,C.A])
Out[302]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)
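Applied to the variables in the question (assuming address_feature, dayweek_feature and pddis_feature are the 1x878049 sparse row vectors shown above), a sketch would be:
from scipy import sparse
# stack the three 1x878049 row vectors into a 3x878049 matrix, then
# transpose so each sample is a row and each feature a column
X = sparse.vstack([address_feature, dayweek_feature, pddis_feature]).T.tocsr()
print(X.shape)   # (878049, 3)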