I'm working with recommender systems but I'm struggling with the access times of the scipy sparse matrices.
In this case, I'm implementing TrustSVD, so I need a structure that is efficient along both rows and columns (CSR, CSC). I've considered keeping both structures, dictionaries, and so on, but every approach is still too slow, especially compared with numpy matrix operations.
# ratings is the sparse user-item matrix
for u, j in zip(*ratings.nonzero()):
items_rated_by_u = ratings[u, :].nonzero()[1]
users_who_rated_j = ratings[:, j].nonzero()[0]
# More code...
Extra:
Each loop iteration takes around 0.033 s, so iterating once through 35,000 ratings means waiting about 19 min per SGD iteration, and with a minimum of 25 iterations we're talking about 8 h. Moreover, that's just the access time; if I include the factorization part, it would take around 2 days.
When you index a sparse matrix, especially just asking for a row or column, it not only has to select the values, but it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).
I think you could access the nonzero columns of each row faster through the rows attribute of the lil format, or of its transpose:
In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]:
array([[0, 1, 0],
[1, 0, 0],
[0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)
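Applied to the loop in the question, a minimal sketch: convert once, outside the SGD loop, so each lookup is just a list access (ratings is the matrix from the question):

items_by_user = ratings.tolil().rows    # items_by_user[u]: columns rated by user u
users_by_item = ratings.T.tolil().rows  # users_by_item[j]: users who rated item j

for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = items_by_user[u]
    users_who_rated_j = users_by_item[j]
    # More code...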
You could also access these values in the csr format, but it's a bit more complicated:
In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)
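For a csr matrix, the nonzero columns of row i are a contiguous slice of indices, delimited by indptr, so you can read them without constructing a new sparse matrix per row. A minimal sketch with the M from above:

Mr = M.tocsr()
i = 2
Mr.indices[Mr.indptr[i]:Mr.indptr[i+1]]  # array([1, 2]): nonzero columns of row i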
Related
First of all, I want to summarize how I arrived at this particular problem. I wanted to create a song recommender using collaborative filtering. The problem is that I have a very large dataset at hand: 1m rows x 2.2m columns. If my understanding is correct, I need a sparse matrix to move forward with my idea, since I do not know of anything else that can hold a matrix of size 1m x 2.2m. Hence, sparse matrix.
Now, since this matrix will only contain 1s or 0s in its cells, I've mapped out which cells should contain 1 if I were to create this hypothetical monstrous matrix. The information I have looks like this:
rows         locations
row1         [56110, 78999, 1508886, 2090010]
row2         [1123, 976554]
...          ...
row1000000   [334555, 2200100]
The problem is that I don't know how to create a sparse matrix from this information. I've checked many sources but couldn't find any viable solution. If you could help me, I would very much appreciate it. Also, if you have any notes on collaborative filtering methods that utilize sparse matrices, I would be very grateful.
There are several ways you could do this. Here is one that creates a csr_matrix, since the data that you show is close to this format. (That docstring has a terse explanation of the csr_matrix attributes data, indices and indptr.) Whether or not this is the best method (for some definition of "best") depends on the actual "raw" form of your data (among other things).
I assume you can put the data that you show in the locations column into a list of lists, called locations. It is important that there is an entry in locations for each row, even if the list is empty. I also assume that the values given in locations are 0-based indices that correspond to the column of the matrix. Here's an example, for an array that has shape (5, 8).
In [23]: locations = [[2, 3], [], [1, 3, 5], [0, 1, 7], [7]]
To form indptr, we compute the cumulative sum of the lengths of the lists, and prepend a 0:
In [28]: lengths = np.array([len(t) for t in locations])
In [29]: lengths
Out[29]: array([2, 0, 3, 3, 1])
In [30]: indptr = np.concatenate(([0], lengths.cumsum()))
In [31]: indptr
Out[31]: array([0, 2, 2, 5, 8, 9])
indices is just the flattened version of locations. Note that sum() in the following is the Python builtin sum() function, not np.sum. That function call concatenates all the lists in locations.
In [32]: indices = sum(locations, start=[])
In [33]: indices
Out[33]: [2, 3, 1, 3, 5, 0, 1, 7, 7]
The data for the array is an array of 1s that is the same length as indices:
In [38]: data = np.ones_like(indices)
We now have all the pieces we need to create a SciPy csr_matrix:
In [39]: from scipy.sparse import csr_matrix
In [40]: A = csr_matrix((data, indices, indptr))
In [41]: A
Out[41]:
<5x8 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
In [42]: A.toarray()
Out[42]:
array([[0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1]])
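Putting the pieces together, a compact sketch of the same construction (locations is the example list from above):

import itertools
import numpy as np
from scipy.sparse import csr_matrix

locations = [[2, 3], [], [1, 3, 5], [0, 1, 7], [7]]

lengths = np.array([len(t) for t in locations])
indptr = np.concatenate(([0], lengths.cumsum()))
# For very long lists, itertools.chain avoids the quadratic cost of sum():
indices = list(itertools.chain.from_iterable(locations))
data = np.ones_like(indices)

A = csr_matrix((data, indices, indptr))  # shape inferred as (5, 8)

For the real (1m, 2.2m) case it is safer to pass shape=(1000000, 2200000) explicitly, since the inferred number of columns is just max(indices) + 1.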
Code
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat.toarray())
Output
[[0 0 0]
[0 0 1]
[1 0 1]]
According to the documentation, this method removes the zero entries from the matrix. However, why are there still zeros?
From this website, I've gathered the following:
eliminate_zeros removes all zeros in your matrix from the sparsity pattern (i.e. no value is stored for that position, where before a value was stored, but it was 0).
I can still access those zero entries.
print(mat[0, 0])
The documentation should probably be more explicit. eliminate_zeros doesn't affect the logical contents of a sparse matrix at all.
eliminate_zeros changes the underlying representation of a sparse matrix without affecting its logical contents. It removes explicitly stored zeros from the data array backing the sparse matrix. It's used to reduce space consumption, and to prepare a sparse matrix for algorithms that assume there will be no explicitly stored zeros.
It does not remove logical zeros from the sparse matrix. That wouldn't be possible - you can't have a sparse matrix with a bunch of data-less holes in it. It's not like a masked array.
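A quick check with the matrix from the question:

import numpy as np
from scipy.sparse import csr_matrix

mat = csr_matrix(np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]]))
print(mat.nnz)     # 3 stored elements
mat.eliminate_zeros()
print(mat.nnz)     # still 3: there were no explicitly stored zeros to remove
print(mat[0, 0])   # 0 -- a logical zero; it was never stored to begin with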
To complement the other answer, I'll show the underlying data storage of your sparse matrix.
In [147]: from scipy import sparse
In [148]: arr = np.array([[0,0,0], [0,0,1], [1,0,1]])
The coo format is easiest to understand:
In [149]: M = sparse.coo_matrix(arr)
In [150]: M
Out[150]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
In [151]: print(M)
(1, 2) 1
(2, 0) 1
(2, 2) 1
The values are actually stored in 3 arrays:
In [152]: M.data,M.row,M.col
Out[152]:
(array([1, 1, 1]),
array([1, 2, 2], dtype=int32),
array([2, 0, 2], dtype=int32))
csr format changes the row/col arrays:
In [153]: Mr = M.tocsr()
In [154]: Mr.data, Mr.indices, Mr.indptr
Out[154]:
(array([1, 1, 1]),
array([2, 0, 2], dtype=int32),
array([0, 0, 1, 3], dtype=int32))
Now let's change one element of the data array:
In [155]: Mr.data[1] = 0
In [156]: Mr.data
Out[156]: array([1, 0, 1])
eliminate_zeros finds that 0, and removes it from the data structure:
In [157]: Mr.eliminate_zeros()
In [158]: Mr.data
Out[158]: array([1, 1])
In [159]: Mr.indices
Out[159]: array([2, 2], dtype=int32)
In [160]: Mr.A
Out[160]:
array([[0, 0, 0],
[0, 0, 1],
[0, 0, 1]])
In [161]: print(Mr) # show the coo style values
(1, 2) 1
(2, 2) 1
Changing the indices and indptr of a csr matrix (changing the "sparsity pattern") is more work than simply assigning 0 to data. So the csr format lets you make a bunch of changes to data and clean up afterwards.
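In code, that workflow looks something like this (the threshold rule is just for illustration):

# Cheap: edit the stored values without touching indices/indptr
Mr.data[Mr.data < 0.5] = 0   # hypothetical cleanup rule
# ...more cheap edits to Mr.data...
Mr.eliminate_zeros()         # one structural cleanup at the end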
Anyway, eliminate_zeros is not something a beginning user is likely to need.
I'm aware that MATLAB has a function to store 2D matrices in cell arrays, but how can I do this in Python? I need to store 4x4 matrices in each column of a 1x5 array. Is this possible? Thanks
I think you can build an array for each 4x4 matrix and then create another array that references the five 4x4 matrices.
a = np.array([[0,0,0,0],
[0,0,0,0],
[0,0,0,0],
[0,0,0,0]])
Create five arrays like this, as per your requirement, and then reference them all in one array.
mat = np.block([[a],[b],[c],[d],[e]])
Try something similar and you will find your expected results. Refer to the numpy documentation for array and block.
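Note that np.block([[a],[b],[c],[d],[e]]) stacks the blocks vertically into a single (20, 4) array rather than keeping five separate 4x4 matrices; a quick check, with np.stack as an alternative that keeps them separate:

import numpy as np

a = b = c = d = e = np.zeros((4, 4), int)
print(np.block([[a], [b], [c], [d], [e]]).shape)  # (20, 4): one stacked array
print(np.stack([a, b, c, d, e]).shape)            # (5, 4, 4): one matrix per slice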
While MATLAB matrices are close to numpy arrays, the equivalent of cell is ambiguous.
One option is a list, especially when the cell has shape (1,5): the size-1 dimension is just an artifact of MATLAB's "everything is 2d".
Another is an object dtype array. Like a list, the elements of such an array are references to objects elsewhere in memory. This is what scipy.io.loadmat uses when loading a MATLAB .mat file. But creating such an array can be tricky, especially when all the component arrays have the same shape (np.array will happily combine them into a higher-dimensional array instead).
Yet another option is to make a high-dimension array, e.g. (1,5,4,4). There was a time when MATLAB only allowed 2d, but numpy has allowed this from the beginning. Remember in numpy the leading dimensions are outermost (with the default 'C' ordering).
In [407]: alist = [np.ones((2,2),int) for _ in range(3)]
In [408]: alist
Out[408]:
[array([[1, 1],
[1, 1]]),
array([[1, 1],
[1, 1]]),
array([[1, 1],
[1, 1]])]
In [409]: arr = np.array(alist)
In [410]: arr
Out[410]:
array([[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]]])
In [411]: arr.shape
Out[411]: (3, 2, 2)
In [412]: arr1 = np.empty(3, object)
In [413]: arr1
Out[413]: array([None, None, None], dtype=object)
In [414]: arr1[:] = alist
In [415]: arr1
Out[415]:
array([array([[1, 1],
[1, 1]]), array([[1, 1],
[1, 1]]),
array([[1, 1],
[1, 1]])], dtype=object)
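For the shapes in the question, a minimal sketch of the closest equivalents of a 1x5 cell of 4x4 matrices:

import numpy as np

mats = [np.zeros((4, 4)) for _ in range(5)]  # a plain list: the simplest 'cell'

cell = np.empty((1, 5), object)  # object dtype array with a MATLAB-like (1,5) shape
cell[0, :] = mats                # assign the list, as in the session above
print(cell[0, 2].shape)          # (4, 4)

arr = np.stack(mats)             # or a plain 3d numeric array
print(arr.shape)                 # (5, 4, 4)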
I have a scipy.sparse.coo_matrix matrix which I want to convert to bitsets per column for further calculation. (for the purpose of the example, I'm testing on 100Kx1M).
I'm currently doing something like this:
from intbitset import intbitset

bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in zip(matrix.row, matrix.col):
    bitsets[j].add(i)
That works, but a COO matrix yields its values row by row. Ideally, I'd like to iterate by column and build each bitset at once, instead of adding to a different bitset every time.
Couldn't find a way to iterate the matrix column-based. Is there?
I don't mind converting to other sparse formats, but I couldn't find a way to iterate the matrix efficiently there. (Using nonzero() on a CSC matrix has proven to be extremely inefficient...)
Thanks!
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print, as well as nonzero, returns the row and col arrays:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col, with the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say conversion to csc orders the columns, but it doesn't look like that:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
Transpose of csr produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
dtype=object)
In general, iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation, and many other operations make use of it indirectly (e.g. row sum). Another relatively fast set of operations is those that work directly with the data attribute, without paying attention to row or column values.
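For example, a row sum is effectively a matrix product, and scaling all stored values touches only data:

import numpy as np

row_sums = M @ np.ones(M.shape[1])  # same result as M.sum(axis=1)
Md = M.tocsr()
Md.data *= 2                        # scales every stored value in place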
coo doesn't implement indexing or iteration. csr and lil implement those.
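That said, for the bitset-per-column goal, csc gives each column's row indices as a contiguous slice of its indices attribute, so each bitset can be built in one call. A minimal sketch, assuming the intbitset constructor accepts a sequence of ints (matrix is the coo_matrix from the question):

from intbitset import intbitset

Mc = matrix.tocsc()
bitsets = [
    intbitset(Mc.indices[Mc.indptr[j]:Mc.indptr[j + 1]].tolist())
    for j in range(Mc.shape[1])
]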
I have a large file where each line has a pair of 8 character strings. Something like:
ab1234gh iu9240gh
on each line.
This file really represents a graph and each string is a node id. I would like to read in the file and directly make a scipy sparse adjacency matrix. I will then run PCA on this matrix using one of the many tools available in Python.
Is there a neat way to do this or do I need to first make a graph in RAM and then convert that into a sparse matrix? As the file is large I would like to avoid intermediate steps if possible.
Ultimately I will feed the sparse adjacency matrix into http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD .
I think this is a regular task in sklearn, so there must be some tool in the package that does this, or an answer in other SO questions. We need to add the correct tag.
But just working from my knowledge of numpy and sparse, here's what I'd do:
Make a sample 2d array - N rows, 2 columns with character values:
In [638]: A=np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
In [639]: A
Out[639]:
array([['a', 'b'],
['b', 'd'],
['a', 'd'],
['b', 'c'],
['d', 'e']],
dtype='<U1')
Use np.unique to identify the unique strings and, as a bonus, to get a mapping from those strings back to the original array. This is the workhorse of the task.
In [640]: k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
In [641]: k1
Out[641]:
array(['a', 'b', 'c', 'd', 'e'],
dtype='<U1')
In [642]: k2
Out[642]: array([0, 1, 7, 3, 9], dtype=int32)
In [643]: k3
Out[643]: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)
I can reshape that inverse array to identify the row and col for each entry in A.
In [644]: rows,cols=k3.reshape(A.shape).T
In [645]: rows
Out[645]: array([0, 1, 0, 1, 3], dtype=int32)
In [646]: cols
Out[646]: array([1, 3, 3, 2, 4], dtype=int32)
With those, it is trivial to construct a sparse matrix that has a 1 at each 'intersection'.
In [648]: M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
In [649]: M
Out[649]:
<4x5 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
In [650]: M.A
Out[650]:
array([[0, 1, 0, 1, 0],
[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]])
The first row, a, has values in the 2nd and 4th columns, i.e. b and d, and so on.
============================
Originally I had:
In [648]: M=sparse.coo_matrix((np.ones(k1.shape,int),(rows,cols)))
This is wrong. The data array should match rows and cols in shape. Here it didn't raise an error because k1 happens to have the same size, but with a different mix of unique values it could raise an error.
====================
This approach assumes the whole dataset A can be loaded into memory; unique probably requires similar memory usage. Initially the coo matrix might not increase memory usage, since it uses the provided arrays as its attributes, but any calculations and/or conversion to csr or another format will make further copies.
I can imagine getting around memory issues by loading the dataset in chunks and using some other structure to collect the unique values and the mapping. You might even be able to construct a coo matrix from chunks. But sooner or later you'll hit memory limits; the scikit-learn code will make one or more copies of that sparse matrix.
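Putting the steps together, a minimal sketch for the file in the question (edges.txt is a hypothetical filename):

import numpy as np
from scipy import sparse

# Each line: two 8-character node ids separated by whitespace.
A = np.loadtxt('edges.txt', dtype='U8')

uniq, inverse = np.unique(A, return_inverse=True)
rows, cols = inverse.reshape(A.shape).T

# Pass shape explicitly so the adjacency matrix comes out square;
# inferring it from max(rows), max(cols) can give a non-square result,
# as in the 4x5 example above.
M = sparse.coo_matrix((np.ones(rows.shape, int), (rows, cols)),
                      shape=(len(uniq), len(uniq)))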