I have a scipy.sparse.coo_matrix that I want to convert to one bitset per column for further calculation. (For the purposes of this example, I'm testing on a 100K x 1M matrix.)
I'm currently doing something like this:
bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in itertools.izip(matrix.row, matrix.col):
    bitsets[j].add(i)
That works, but the COO matrix yields its values row by row. Ideally, I'd like to iterate column by column and build each bitset in one go instead of adding to a different bitset at every step.
I couldn't find a way to iterate the matrix column-wise. Is there one?
I don't mind converting to other sparse formats, but I couldn't find a way to iterate the matrix efficiently there either. (Using nonzero() on a CSC matrix has proven to be extremely inefficient...)
Thanks!
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print, as well as nonzero, returns the row and col arrays:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col, with the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say conversion to csc orders the columns, but it doesn't look like that:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
Transpose of csr produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
dtype=object)
In general, iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations make use of it indirectly (e.g. row sum). Another relatively fast set of operations is the ones that can work directly with the data attribute, without paying attention to row or column values.
coo doesn't implement indexing or iteration. csr and lil implement those.
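That said, if the goal is simply one bitset per column, the csc attributes give you each column's row indices as a contiguous slice, so each bitset can be built in one shot. A minimal sketch (assuming, beyond what the question shows, that intbitset accepts an iterable of ints in its constructor; the random matrix stands in for the 100K x 1M one):

import numpy as np
from scipy import sparse
from intbitset import intbitset   # third-party package from the question

matrix = sparse.random(10, 6, density=0.3, format='coo')  # stand-in for the real data
Mc = matrix.tocsc()               # one-time conversion; column j's rows become a contiguous slice
bitsets = [
    intbitset(Mc.indices[Mc.indptr[j]:Mc.indptr[j + 1]].tolist())  # build each bitset at once
    for j in range(Mc.shape[1])
]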
First of all, I want to summarize how I arrived at this particular problem. I wanted to create a song recommender using collaborative filtering. The problem is that I have a very large dataset at hand: 1M rows x 2.2M columns. If my understanding is correct, I need to create a sparse matrix to move forward with my idea, since I don't know of anything else that can hold a matrix of size 1M x 2.2M. Hence, a sparse matrix.
Now, since this matrix will only contain 1s and 0s, I've mapped out which cells should contain a 1 if I were to create that hypothetical monstrous matrix. The information I have looks like this:
rows         locations
row1         [56110, 78999, 1508886, 2090010]
row2         [1123, 976554]
...          ...
row1000000   [334555, 2200100]
The problem is that I don't know how to create a sparse matrix using this information. I've checked many sources but couldn't find any viable solution. If you could help me, I would very much appreciate it. Also, if you have any notes on collaborative filtering methods that utilize sparse matrices I would also be very grateful.
There are several ways you could do this. Here is one that creates a csr_matrix, since the data that you show is close to this format. (That docstring has a terse explanation of the csr_matrix attributes data, indices and indptr.) Whether or not this is the best method (for some definition of "best") depends on the actual "raw" form of your data (among other things).
I assume you can put the data that you show in the locations column into a list of lists, called locations. It is important that there is an entry in locations for each row, even if the list is empty. I also assume that the values given in locations are 0-based indices that correspond to the column of the matrix. Here's an example, for an array that has shape (5, 8).
In [23]: locations = [[2, 3], [], [1, 3, 5], [0, 1, 7], [7]]
To form indptr, we compute the cumulative sum of the lengths of the lists, and prepend a 0:
In [28]: lengths = np.array([len(t) for t in locations])
In [29]: lengths
Out[29]: array([2, 0, 3, 3, 1])
In [30]: indptr = np.concatenate(([0], lengths.cumsum()))
In [31]: indptr
Out[31]: array([0, 2, 2, 5, 8, 9])
indices is just the flattened version of locations. Note that sum() in the following is the Python builtin sum() function, not np.sum. That function call concatenates all the lists in locations.
In [32]: indices = sum(locations, start=[])
In [33]: indices
Out[33]: [2, 3, 1, 3, 5, 0, 1, 7, 7]
The data for the array is an array of 1s that is the same length as indices:
In [38]: data = np.ones_like(indices)
We now have all the pieces we need to create a SciPy csr_matrix:
In [39]: from scipy.sparse import csr_matrix
In [40]: A = csr_matrix((data, indices, indptr))
In [41]: A
Out[41]:
<5x8 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
In [42]: A.toarray()
Out[42]:
array([[0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1]])
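Putting those pieces together, here is a small self-contained sketch; the helper name build_binary_csr and the sample data are just for illustration:

import numpy as np
from scipy.sparse import csr_matrix

def build_binary_csr(locations, n_cols):
    # locations: one list of 0-based column indices per row (possibly empty)
    lengths = np.array([len(t) for t in locations])
    indptr = np.concatenate(([0], lengths.cumsum()))
    # flatten the lists (for very long inputs, itertools.chain.from_iterable is faster than sum)
    indices = np.array(sum(locations, []), dtype=np.int64)
    data = np.ones(len(indices), dtype=np.int8)   # every stored entry is a 1
    return csr_matrix((data, indices, indptr), shape=(len(locations), n_cols))

A = build_binary_csr([[2, 3], [], [1, 3, 5], [0, 1, 7], [7]], n_cols=8)
print(A.toarray())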
Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
 [0 0 1]
 [1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. there is no value stored for that position, where before there was a value stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
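A quick way to see the distinction is to compare nnz, which counts stored entries (explicit zeros included), with count_nonzero(), which counts logically nonzero values. A small sketch:

import numpy as np
from scipy.sparse import csr_matrix

mat = csr_matrix(np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]]))
mat.data[0] = 0              # turn one stored value (row 1, col 2) into an explicit zero
print(mat.nnz)               # 3: stored entries, the explicit zero included
print(mat.count_nonzero())   # 2: logically nonzero values
mat.eliminate_zeros()
print(mat.nnz)               # 2: the explicit zero is gone from storage
print(mat[0, 0])             # 0, as before; the logical contents never changed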
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is the easiest to understand:
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr (changing the "sparsity pattern") is more work than simply assigning 0 to the data. So the csr format lets you make a bunch of changes to data and clean up afterwards.
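For example, a minimal sketch of that workflow (the threshold and sizes are arbitrary):

from scipy import sparse

M = sparse.random(1000, 1000, density=0.01, format='csr')
M.data[M.data < 0.5] = 0   # cheap: only touches the stored values
M.eliminate_zeros()        # one pass to remove them from the sparsity structure
print(M.nnz)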
Anyway, eliminate_zeros is not something a beginning user is likely to need.
I am writing code to remove multiple columns from several large, parallel scipy sparse.csc matrices (meaning all matrices have the same dimensions and all nonzero elements are in the same places) simultaneously and efficiently. I am doing this by indexing only the columns I want to keep for one matrix and then reusing the indices and indptr arrays for the others. However, when I index the csc matrix by a list, it reorders the data array, so I cannot reuse the indices. Is there a way to force scipy to keep the data in the original order? Why does it reorder only when indexing by a list?
import scipy.sparse
import numpy as np
mat = scipy.sparse.csc_matrix(np.array([[1,0,0,0,2,5],
[1,0,1,0,0,0],
[0,0,0,4,0,1],
[0,3,0,1,0,4]]))
print mat[:,3].data
returns array([4, 1])
print mat[:,[3]].data
returns array([1, 4])
In [43]: mat = sparse.csc_matrix(np.array([[1,0,0,0,2,5], [1,0,1,0,0,0],
    ...:                                    [0,0,0,4,0,1], [0,3,0,1,0,4]]))
In [44]: mat
Out[44]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Column format>
In [45]: mat.data
Out[45]: array([1, 1, 3, 1, 4, 1, 2, 5, 1, 4], dtype=int64)
In [46]: mat.indices
Out[46]: array([0, 1, 3, 1, 2, 3, 0, 0, 2, 3], dtype=int32)
In [47]: mat.indptr
Out[47]: array([ 0, 2, 3, 4, 6, 7, 10], dtype=int32)
scalar selection:
In [48]: m1 = mat[:,3]
In [49]: m1
Out[49]:
<4x1 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Column format>
In [50]: m1.data
Out[50]: array([4, 1])
In [51]: m1.indices
Out[51]: array([2, 3], dtype=int32)
In [52]: m1.indptr
Out[52]: array([0, 2], dtype=int32)
list indexing:
In [53]: m2 = mat[:,[3]]
In [54]: m2.data
Out[54]: array([1, 4], dtype=int64)
In [55]: m2.indices
Out[55]: array([3, 2], dtype=int32)
In [56]: m2.indptr
Out[56]: array([0, 2], dtype=int32)
sorting:
In [57]: m2.sort_indices()
In [58]: m2.data
Out[58]: array([4, 1], dtype=int64)
In [59]: m2.indices
Out[59]: array([2, 3], dtype=int32)
csc indexing with a list uses matrix multiplication. It constructs an extractor matrix based on the index, and then does the dot multiply. So it's a brand new sparse matrix; not just a subset of the csc data and index attributes.
csc matrices have a sort_indices method to ensure the indices values are ordered (within a column). Applying that might help ensure the arrays are sorted in the same way.
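Alternatively, if you need the within-column order of data to match the original (so the same slices can be reused across your parallel matrices), you can assemble the column subset directly from the csc attributes. A sketch with a hypothetical helper:

import numpy as np
from scipy.sparse import csc_matrix

def take_columns(mat, cols):
    # Slice the csc attributes column by column, preserving the original
    # order of data/indices, and return the positions into mat.data so
    # parallel matrices with the same sparsity pattern can reuse them.
    pos_parts, indptr = [], [0]
    for c in cols:
        start, stop = mat.indptr[c], mat.indptr[c + 1]
        pos_parts.append(np.arange(start, stop))
        indptr.append(indptr[-1] + (stop - start))
    pos = np.concatenate(pos_parts) if pos_parts else np.array([], dtype=int)
    sub = csc_matrix((mat.data[pos], mat.indices[pos], np.array(indptr)),
                     shape=(mat.shape[0], len(cols)))
    return sub, pos

# For a parallel matrix `other` with identical sparsity:
#   other_sub = csc_matrix((other.data[pos], sub.indices, sub.indptr), shape=sub.shape)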
In scipy, when I multiply a slice of a sparse matrix by an array containing only zeros, the result is a matrix with at least as many stored elements as before, even though it should have the same number or fewer. The same holds for setting parts of the matrix to 0 or False:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix as csr
>>> M = csr(np.random.random((8,8))>0.9)
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 6 stored elements in Compressed Sparse Row format>
>>> M[:,0] = False
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
>>> M[:,0].multiply(np.array([[False] for i in xrange(8)]))
>>> M
<8x8 sparse matrix of type '<type 'numpy.bool_'>'
with 12 stored elements in Compressed Sparse Row format>
This is actually computationally expensive for large matrices, because it iterates over all cells in the slice, not just the nonzero ones.
From a mathematical / logical point of view, when multiplying a sparse matrix or vector, all empty cells are certain to remain empty, as 0*x == 0. The same holds for setting to zero: zero cells do not need to be explicitly set to zero.
What is the best way to deal with this?
I am using scipy version 0.17.0
In working with sparse matrices, changing the sparsity pattern is generally a very expensive operation, and so scipy does not do this silently.
If you want to remove explicitly stored zeros from a sparse matrix, you should use the eliminate_zeros() method; for example:
>>> M = csr(np.random.random((1000,1000))>0.9, dtype=float)
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M[:, 0] *= 0
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99740 stored elements in Compressed Sparse Row format>
>>> M.eliminate_zeros()
>>> M
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 99657 stored elements in Compressed Sparse Row format>
Scipy could call the eliminate_zeros routine automatically after doing this kind of operation, but the developers chose to give the user more flexibility and control when doing something as expensive as changing the sparsity structure.
To recreate your code (using int type for a more compact display):
In [16]: M = sparse.csr_matrix(np.random.random((8,8))>.7).astype(int)
In [17]: M
Out[17]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [18]: M.A
Out[18]:
array([[0, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 1, 1, 1, 1, 0, 1]])
In [19]: M.tolil().data # show nonzero values by row
Out[19]:
array([list([1, 1]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
Setting a row (or column) explicitly. Note the efficiency warning:
In [20]: M[0,:] = 0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [21]: M.tolil().data
Out[21]:
array([list([0, 0, 0, 0, 0, 0, 0, 0]), list([1, 1]), list([1]),
list([1, 1]), list([]), list([1, 1]), list([1, 1]),
list([1, 1, 1, 1, 1, 1])], dtype=object)
So yes, it has set all values in the row to the specified value, and it doesn't attempt to distinguish between setting 0s as opposed to 1s. You can see the code used in M.__setitem__ and M._set_many (this is where the efficiency warning is generated).
As @jakevdp shows, you need to explicitly tell it to eliminate the excess 0s. It does not attempt to do that during the assignment.
In [22]: M.eliminate_zeros()
In [23]: M.tolil().data
Out[23]:
array([list([]), list([1, 1]), list([1]), list([1, 1]), list([]),
list([1, 1]), list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
In general setting values of a matrix explicitly is discouraged, especially with csr. It's not even allowed with coo. lil is the recommended format if you need to do that.
In [24]: Ml = M.tolil()
In [25]: Ml[1,:] = 0
In [26]: Ml.data
Out[26]:
array([list([]), list([]), list([1]), list([1, 1]), list([]), list([1, 1]),
list([1, 1]), list([1, 1, 1, 1, 1, 1])], dtype=object)
lil does take care to eliminate 0s.
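So one workable pattern for the original example is a round trip through lil; a sketch, using the same kind of random setup and the column assignment from the question:

import numpy as np
from scipy import sparse

M = sparse.csr_matrix((np.random.random((8, 8)) > 0.7).astype(int))
Ml = M.tolil()          # cheap format for structural changes
Ml[:, 0] = 0            # lil drops the zeroed entries itself
M = Ml.tocsr()          # back to csr for fast arithmetic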
Multiplication of a row by an array of 0s does not change the sparsity. Nor does it act in-place. It produces a new matrix:
In [29]: M[1,:].multiply(np.zeros((1,8)))
Out[29]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
In [30]: _.A
Out[30]: array([[ 0., 0., 0., 0., 0., 0., 0., 0.]])
In [31]: M[1,:].A
Out[31]: array([[1, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)
Multiplication with a sparse matrix does eliminate 0s (again, not in-place):
In [32]: M[1,:].multiply(sparse.csr_matrix(np.zeros((1,8))))
Out[32]:
<1x8 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
(Notice also the difference in format between Out[29] and Out[32].)
As a general rule, multiplication, both element-wise and matrix, is the most efficient operation with csr matrices, especially if the other operand is also sparse. In fact, row/column sums are performed with matrix multiplication, as is advanced indexing.
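To illustrate that last point, a small sketch showing that a row sum is just a matrix product with a column of ones (sparse.random and the shapes are arbitrary):

import numpy as np
from scipy import sparse

M = sparse.random(5, 8, density=0.3, format='csr')
ones = np.ones((M.shape[1], 1))
row_sums = M @ ones                          # fast path: csr matrix multiplication
print(np.allclose(row_sums, M.sum(axis=1)))  # matches the built-in row sum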
I'm working with recommender systems but I'm struggling with the access times of the scipy sparse matrices.
In this case, I'm implementing TrustSVD, so I need an efficient structure to operate on both columns and rows (CSR, CSC). I've thought about using both structures, dictionaries, ... but either way it is always too slow, especially compared with numpy matrix operations.
for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = ratings[u, :].nonzero()[1]
    users_who_rated_j = ratings[:, j].nonzero()[0]
    # More code...
Extra:
Each loop iteration takes around 0.033s, so iterating once through 35,000 ratings means waiting about 19 min per SGD iteration, and for a minimum of 25 iterations we're talking about 8 hours. Moreover, that's just the access time; if I include the factorization part it would take around 2 days.
When you index a sparse matrix, especially just asking for a row or column, it not only has to select the values, but it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).
I think you could access the nonzero columns of each row faster by looking at the rows attribute of the lil format, or of its transpose:
In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]:
array([[0, 1, 0],
       [1, 0, 0],
       [0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)
You could also access these values in the csr format, but it's a bit more complicated:
In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)
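For example, a sketch of what that looks like in practice: precompute the per-row and per-column nonzero index lists once from the indptr/indices arrays, and reuse them inside the loop (names mirror the question; the random matrix is a stand-in for the ratings data):

import numpy as np
from scipy import sparse

ratings = sparse.random(1000, 800, density=0.01, format='csr')
Rr, Rc = ratings.tocsr(), ratings.tocsc()

# nonzero columns of each row, and nonzero rows of each column, as plain arrays
cols_of_row = [Rr.indices[Rr.indptr[u]:Rr.indptr[u + 1]] for u in range(Rr.shape[0])]
rows_of_col = [Rc.indices[Rc.indptr[j]:Rc.indptr[j + 1]] for j in range(Rc.shape[1])]

for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = cols_of_row[u]     # instead of ratings[u, :].nonzero()[1]
    users_who_rated_j = rows_of_col[j]    # instead of ratings[:, j].nonzero()[0]
    # ... rest of the update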