How to hstack several sparse matrices (feature matrices)? - python

I have 3 sparse matrices:
In [39]:
mat1
Out[39]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 878048 stored elements in Compressed Sparse Row format>
In [37]:
mat2
Out[37]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 744315 stored elements in Compressed Sparse Row format>
In [35]:
mat3
Out[35]:
(1, 878049)
<1x878049 sparse matrix of type '<type 'numpy.int64'>'
with 788618 stored elements in Compressed Sparse Row format>
From the documentation, I read that it is possible to hstack, vstack, and concatenate such matrices. So I tried to hstack them:
import numpy as np
matrix1 = np.hstack([[address_feature, dayweek_feature]]).T
matrix2 = np.vstack([[matrix1, pddis_feature]]).T
X_combined_features = matrix2
However, the dimensions do not match:
In [41]:
X_combined_features.shape
Out[41]:
(2, 1)
Note that I am stacking these matrices because I would like to use them with a scikit-learn classification algorithm. How should I hstack a number of different sparse matrices?

Use the sparse versions of hstack and vstack. As a general rule you need to use the sparse functions and methods, not the numpy ones with similar names; sparse matrices are not subclasses of numpy ndarray.
But your three matrices do not look very sparse. They are 1x878049, and one has 878048 stored elements - that is, just one 0 element.
So you could just as well turn them into dense arrays (with .toarray() or .A) and use np.hstack or np.vstack.
np.hstack([address_feature.A, dayweek_feature.A])
And don't use the double brackets. All of the concatenate functions take a simple list or tuple of the arrays, and that list can have more than two arrays:
In [296]: A=sparse.csr_matrix([0,1,2,0,0,1])
In [297]: B=sparse.csr_matrix([0,0,0,1,0,1])
In [298]: C=sparse.csr_matrix([1,0,0,0,1,0])
In [299]: sparse.vstack([A,B,C])
Out[299]:
<3x6 sparse matrix of type '<class 'numpy.int32'>'
with 7 stored elements in Compressed Sparse Row format>
In [300]: sparse.vstack([A,B,C]).A
Out[300]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)
In [301]: sparse.hstack([A,B,C]).A
Out[301]: array([[0, 1, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]], dtype=int32)
In [302]: np.vstack([A.A,B.A,C.A])
Out[302]:
array([[0, 1, 2, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]], dtype=int32)
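Applied to the original question, a minimal sketch (assuming address_feature, dayweek_feature, and pddis_feature are the 1x878049 CSR matrices shown above; names taken from the question's code):
from scipy import sparse

# Each feature matrix is 1 x n_samples, so transpose each into a column
# and stack the columns side by side into an (n_samples, n_features) matrix.
X = sparse.hstack([address_feature.T,
                   dayweek_feature.T,
                   pddis_feature.T]).tocsr()
# X.shape is (878049, 3); most scikit-learn estimators accept a CSR matrix directly.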

Why isn't eliminate_zeros() removing the zero entries?

Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
[0 0 1]
[1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. no value is stored for that position, where before a value was stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
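A minimal sketch of that distinction (the variable names here are mine, not from the question):
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 1], [1, 0, 1]]))
m.data[0] = 0            # overwrite a stored value with an explicit zero
print(m.nnz)             # 3 -- the explicit zero is still stored
m.eliminate_zeros()
print(m.nnz)             # 2 -- storage shrinks; m.toarray() is unchanged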
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand:
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr matrix (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data and clean up afterwards.
Anyway, eliminate_zeros is not something a beginning user is likely to need.

Create a sparse matrix from a list of tuples giving the row indices where each column has a 1

Problem:
I have a list of tuples, where each tuple represents a column of a 2D array and each element of the tuple is the row index at which that column is 1; the other entries, not listed in the tuple, are 0.
I want to create a sparse matrix from this list of tuples in an efficient way (trying not to use for loops).
Example:
# init values
list_tuples = [
(0, 2, 4),
(0, 2, 3),
(1, 3, 4)
]
n = len(list_tuples)  # number of columns
m = 5  # arbitrary, however m >= max(e for t in list_tuples for e in t) + 1
# what I need is a function which accepts these tuples and the shape of the array
# (at least the row size, because the column size can be inferred from the list of tuples)
A = some_function(list_tuples, array_shape=(m, n))
Then what I expect to have is an array of the form:
[
[1, 1, 0]
[0, 0, 1]
[1, 1, 0]
[0, 1, 1]
[1, 0, 1]
]
Your values are the indices that are required for the compressed sparse column format. You'll also need the indptr array, which for your data is the cumulative sum of the lengths of the tuples (prepended with 0). The data array would be an array of ones with the same length as the sum of the lengths of the tuples, which you can get from the last element of the cumulative sum. Here's how that looks with your example:
In [45]: from scipy.sparse import csc_matrix
In [46]: list_tuples = [
...: (0, 2, 4),
...: (0, 2, 3),
...: (1, 3, 4)
...: ]
In [47]: indices = sum(list_tuples, ()) # Flatten the tuples into one sequence.
In [48]: indptr = np.cumsum([0] + [len(t) for t in list_tuples])
In [49]: a = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr))
In [50]: a
Out[50]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [51]: a.A
Out[51]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])
Note that csc_matrix inferred the number of rows from the maximum that it found in the indices. You can use the shape parameter to override that, e.g.
In [52]: b = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr), shape=(7, len(list_tuples)))
In [53]: b
Out[53]:
<7x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
In [54]: b.A
Out[54]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1],
[0, 0, 0],
[0, 0, 0]])
You can also generate a coo_matrix pretty easily. The flattened list_tuples gives the row indices, and np.repeat can be used to create the column indices:
In [63]: from scipy.sparse import coo_matrix
In [64]: i = sum(list_tuples, ()) # row indices
In [65]: j = np.repeat(range(len(list_tuples)), [len(t) for t in list_tuples])
In [66]: c = coo_matrix((np.ones(len(i), dtype=int), (i, j)))
In [67]: c
Out[67]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in COOrdinate format>
In [68]: c.A
Out[68]:
array([[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
[0, 1, 1],
[1, 0, 1]])

Zero several columns in csr_matrix

Assume I have a sparse matrix:
>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I want to zero columns 0 and 2. Below is what I want to get:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]])
Below is what I tried:
sp_mat = csr_matrix((data, indices, indptr), shape=(3, 3))
zero_cols = np.array([0, 2])
sp_mat[:, zero_cols] = 0
However, I get a warning:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Since the sp_mat I have is large, converting to lil_matrix is very slow.
What is an efficient way?
In [87]: >>> indptr = np.array([0, 2, 3, 6])
...: >>> indices = np.array([0, 2, 2, 0, 1, 2])
...: >>> data = np.array([1, 2, 3, 4, 5, 6])
...: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [88]: M
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Look at what happens with the csr assignment:
In [89]: M[:, [0, 2]] = 0
/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py:746: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [90]: M
Out[90]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [91]: M.data
Out[91]: array([0, 0, 0, 0, 0, 5, 0])
In [92]: M.indices
Out[92]: array([0, 2, 0, 2, 0, 1, 2], dtype=int32)
Not only does it give a warning, but it actually increases the number of stored terms, even though most now have a 0 value. Those are only removed when we clean up:
In [93]: M.eliminate_zeros()
In [94]: M
Out[94]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In the indexed assignment, csr isn't distinguishing between setting 0s and other values; it treats them all the same.
I should note that the efficiency warning is given primarily to keep users from doing this repeatedly (as in an iteration). For one-time actions it is overly alarmist.
For indexed assignment, lil format is more efficient (or at least it doesn't warn about efficiency). But converting to/from that format is time consuming.
Another option is to find and set the new 0s directly, followed by eliminate_zeros, as sketched below.
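A hedged sketch of that direct approach (assuming sp_mat and zero_cols from the question):
import numpy as np

# For CSR, sp_mat.indices holds the column index of every stored entry.
mask = np.isin(sp_mat.indices, zero_cols)   # stored entries that fall in the target columns
sp_mat.data[mask] = 0                       # zero them without touching the sparsity structure
sp_mat.eliminate_zeros()                    # then drop the explicit zeros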
Another is to use a matrix multiply. I think a diagonal sparse with 0's in the right columns will do the trick.
In [103]: M
Out[103]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [104]: D = sparse.diags([0,1,0], dtype=M.dtype)
In [105]: D
Out[105]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements (1 diagonals) in DIAgonal format>
In [106]: D.A
Out[106]:
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 0]])
In [107]: M1 = M*D
In [108]: M1
Out[108]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In [110]: M1.A
Out[110]:
array([[0, 0, 0],
[0, 0, 0],
[0, 5, 0]], dtype=int64)
If you multiply the matrix in-place, you don't get the efficiency warning. It is only changing the values of existing non-zero terms, so it isn't changing the sparsity of the matrix (at least not until you eliminate the zeros):
In [111]: M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [112]: M[:,[0,2]] *= 0
In [113]: M
Out[113]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [114]: M.eliminate_zeros()
In [115]: M
Out[115]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
Matrix multiplication is the way to go.
For my very large CSR matrix (size 2M x 2M), direct assignment with sp_mat[:, zero_cols] = 0 results in an out-of-memory error. Suppose the indices of the zeroed columns are marked as True in the boolean array zero_mask; then multiplying by a diagonal matrix does the job efficiently (within 3 seconds).
import scipy.sparse as sp
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(int))
Here (~zero_mask).astype(int) gives a 1-d array of 0s and 1s that specifies which columns should be kept (1) and which should be zeroed (0).
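For completeness, a hedged sketch of building zero_mask from a list of column indices such as zero_cols in the question:
import numpy as np
import scipy.sparse as sp

zero_mask = np.zeros(sp_mat.shape[1], dtype=bool)
zero_mask[zero_cols] = True                 # columns to zero out
sp_mat = sp_mat @ sp.diags((~zero_mask).astype(sp_mat.dtype))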

Multiply slice of scipy sparse matrix without changing sparsity

In scipy, when I multiply a slice of a sparse matrix by an array containing only zeros, the result is a matrix that is less sparse, or at best equally sparse, than before, even though it should be at least as sparse. The same holds for setting parts of the matrix to 0 or False:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
This is actually computationally expensive for large matrices, because it iterates over all cells in the slice, not just the nonzero ones.
From a mathematical / logical point of view, when multiplying a sparse matrix or vector, all empty cells are certain to remain empty since 0*x == 0. The same holds for setting to zero: zero cells do not need to be explicitly set to zero.
What is the best way to deal with this?
I am using scipy version 0.17.0
In working with sparse matrices, changing the sparsity pattern is generally a very expensive operation, and so scipy does not do this silently.
If you want to remove explicitly stored zeros from a sparse matrix, you should use the eliminate_zeros() method; for example:
>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99657 stored elements in Compressed Sparse Row format>
Scipy could call the eliminate_zeros routine automatically after doing this kind of operation, but the developers chose to give the user more flexibility and control when doing something as expensive as changing the sparsity structure.
To recreate your code (using int type for a more compact display):
In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]:
array([[0, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data # show nonzero values by row
Out[19]:
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
Setting a row (or column) explicitly. Note the efficiency warning:
In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]:
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
list([1, 1]), list([]), list([1, 1]), list([1, 1]),
list([1, 1, 1, 1, 1, 1])], dtype=object)
So yes, it has set all values in the row to the specified value, and it doesn't attempt to distinguish between setting 0s as opposed to 1s. You can see the code used in M.__setitem__ and M._set_many (this is where the efficiency warning is generated).
As @jakevdp shows, you need to explicitly tell it to eliminate the excess 0s. It does not attempt to do that during the assignment.
In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]:
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
In general, setting values of a sparse matrix explicitly is discouraged, especially with csr. It's not even allowed with coo; lil is the recommended format if you need to do that.
In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]:
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
lil does take care to eliminate 0s.
Multiplication of a row by an array of 0s does not change the sparsity. Nor does it act in-place. It produces a new matrix:
In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0., 0., 0., 0., 0., 0., 0., 0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)
Multiplication with a sparse matrix does eliminate 0s (again, not in-place):
In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
(Notice also the difference in format between Out[29] and Out[32].)
As a general rule, multiplication, both element-wise and matrix, is the most efficient operation with csr matrices, especially if the other operand is also sparse. In fact, row/column sums are performed with matrix multiplication, as is advanced indexing.
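For instance, a row sum can be written as a multiplication by a column vector of ones (a sketch of the idea, not the exact internal code; M is the matrix from the session above):
import numpy as np

ones = np.ones((M.shape[1], 1), dtype=M.dtype)
row_sums = M @ ones            # the values match M.sum(axis=1)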

Delete rows in scipy matrix from list

I have a list of integers called cluster0Rand corresponding to certain indices in a scipy sparse matrix called data.
I want to create a new scipy matrix consisting of only the rows whose index is in the list.
For example,
data = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
cluster0Rand = [0,1]
The desired output would be:
csr_matrix([[1, 2, 0], [0, 0, 3]])
How can I do this efficiently, given that the real list is made up of thousands of indices and the scipy matrix is (10000, 100000)?
Given your example, plain indexing does the job:
In [300]: data = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
In [301]: idx = [0,1]
In [302]: data[idx,:]
Out[302]:
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>
In [303]: _.A
Out[303]:
array([[1, 2, 0],
[0, 0, 3]], dtype=int32)
This kind of indexing is slower with sparse matrices than with dense arrays, but it uses a sparse matrix strength, matrix multiplication: it turns idx into a selector matrix.
In [313]: (sparse.csr_matrix([[1,0,0],[0,1,0]])*data).A
Out[313]:
array([[1, 2, 0],
[0, 0, 3]], dtype=int32)
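A hedged sketch of building that selector matrix for an arbitrary index list (data and idx as above):
import numpy as np
from scipy import sparse

selector = sparse.csr_matrix(
    (np.ones(len(idx), dtype=data.dtype),    # one 1 per selected row
     (np.arange(len(idx)), idx)),            # row i of the selector picks row idx[i]
    shape=(len(idx), data.shape[0]))
subset = selector @ data                     # same rows as data[idx, :]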
