How can I create a sparse matrix from coordinate information?

First of all, I want to summarize how I arrived at this particular problem. I wanted to create a song recommender using a collaborative filtering method. But the problem is that I have a very large dataset at hand: 1M rows x 2.2M columns. If my understanding is correct, I need to create a sparse matrix in order to move forward with my idea, since I do not know of anything that can hold a dense matrix of size 1M x 2.2M. Hence, a sparse matrix.
Now, since this matrix will only contain 1s or 0s in its cells, I've mapped out which cells should have a 1 if I were to create the hypothetical monstrous matrix. The information I have looks like this:
rows         locations
row1         [56110, 78999, 1508886, 2090010]
row2         [1123, 976554]
...
row1000000   [334555, 2200100]
The problem is that I don't know how to create a sparse matrix using this information. I've checked many sources but couldn't find any viable solution. If you could help me, I would very much appreciate it. Also, if you have any notes on collaborative filtering methods that utilize sparse matrices I would also be very grateful.

There are several ways you could do this. Here is one that creates a csr_matrix, since the data that you show is close to this format. (The csr_matrix docstring has a terse explanation of the attributes data, indices and indptr.) Whether or not this is the best method (for some definition of "best") depends on the actual "raw" form of your data, among other things.
I assume you can put the data that you show in the locations column into a list of lists, called locations. It is important that there is an entry in locations for each row, even if the list is empty. I also assume that the values given in locations are 0-based indices that correspond to the column of the matrix. Here's an example, for an array that has shape (5, 8).
In [23]: locations = [[2, 3], [], [1, 3, 5], [0, 1, 7], [7]]
To form indptr, we compute the cumulative sum of the lengths of the lists, and prepend a 0:
In [28]: lengths = np.array([len(t) for t in locations])
In [29]: lengths
Out[29]: array([2, 0, 3, 3, 1])
In [30]: indptr = np.concatenate(([0], lengths.cumsum()))
In [31]: indptr
Out[31]: array([0, 2, 2, 5, 8, 9])
indices is just the flattened version of locations. Note that sum() in the following is the Python builtin sum() function, not np.sum. That function call concatenates all the lists in locations.
In [32]: indices = sum(locations, start=[])
In [33]: indices
Out[33]: [2, 3, 1, 3, 5, 0, 1, 7, 7]
The data for the array is an array of 1s that is the same length as indices:
In [38]: data = np.ones_like(indices)
We now have all the pieces we need to create a SciPy csr_matrix:
In [39]: from scipy.sparse import csr_matrix
In [40]: A = csr_matrix((data, indices, indptr))
In [41]: A
Out[41]:
<5x8 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
In [42]: A.toarray()
Out[42]:
array([[0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1]])
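
To put the pieces together at your scale, here is a minimal sketch, assuming locations is a Python list of 1,000,000 lists and that you know the number of columns (the helper name locations_to_csr is mine). Using an int8 data array keeps the cost at one byte per stored 1:

import numpy as np
from scipy.sparse import csr_matrix

def locations_to_csr(locations, n_cols):
    # Cumulative row lengths form indptr; flattening the lists forms indices.
    lengths = np.array([len(t) for t in locations])
    indptr = np.concatenate(([0], lengths.cumsum()))
    indices = np.fromiter((c for row in locations for c in row),
                          dtype=np.int64, count=int(lengths.sum()))
    data = np.ones(len(indices), dtype=np.int8)  # 0/1 matrix: int8 suffices
    return csr_matrix((data, indices, indptr),
                      shape=(len(locations), n_cols))

A = locations_to_csr([[2, 3], [], [1, 3, 5], [0, 1, 7], [7]], 8)  # the example above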


Fast computation of matrix rank over GF(2)

For my current project I need to be able to calculate the rank of 64*64 matrices with entries from GF(2). I was wondering if anyone had a good solution.
I've been using pyfinite for this, but it is rather slow since it's a pure python implementation. I've also tried to cythonise the code I've been using, but have had issues due to relying on pyfinite.
My next idea would be to write my own class in cython, but that seems a bit overkill for what I need.
I need the following functionality
matrix = GF2Matrix(size=64) # creating a 64*64 matrix
matrix.setRow(i, [1,0,1....,1]) # set row using list
matrix += matrix2 # addition of matrices
rank(matrix) # then computing the rank
Thanks for any ideas.
One way to efficiently represent a matrix over GF(2) is to store the rows as integers, interpreting each integer as a bit-string. So for example, the 4-by-4 matrix
[0 1 1 0]
[1 0 1 1]
[0 0 1 0]
[1 0 0 1]
(which has rank 3) could be represented as a list [6, 13, 4, 9] of integers. Here I'm thinking of the first column as corresponding to the least significant bit of the integer, and the last to the most significant bit, but the reverse convention would also work.
With this representation, row operations can be performed efficiently using Python's bitwise integer operations: ^ for addition, & for multiplication.
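As a minimal sketch of that encoding (the helper name row_to_int is mine, not part of the question's API):

def row_to_int(bits):
    # Pack a list of 0/1 entries into an integer; the first entry becomes
    # the least significant bit, matching the convention described above.
    value = 0
    for i, bit in enumerate(bits):
        value |= bit << i
    return value

row_to_int([0, 1, 1, 0])  # -> 6, the first row of the example matrix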
Then you can compute the rank using a standard Gaussian elimination approach.
Here's some reasonably efficient code. Given a collection rows of nonnegative integers representing a matrix as above, we repeatedly remove the last row in the list, and then use that row to eliminate all 1 entries from the column corresponding to its least significant bit. If the row is zero then it has no least significant bit and doesn't contribute to the rank, so we simply discard it and move on.
def gf2_rank(rows):
    """
    Find rank of a matrix over GF2.

    The rows of the matrix are given as nonnegative integers, thought
    of as bit-strings.

    This function modifies the input list. Use gf2_rank(rows.copy())
    instead of gf2_rank(rows) to avoid modifying rows.
    """
    rank = 0
    while rows:
        pivot_row = rows.pop()
        if pivot_row:
            rank += 1
            lsb = pivot_row & -pivot_row
            for index, row in enumerate(rows):
                if row & lsb:
                    rows[index] = row ^ pivot_row
    return rank
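As a quick sanity check against the 4-by-4 example above:

print(gf2_rank([6, 13, 4, 9]))  # -> 3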
Let's run some timings for random 64-by-64 matrices over GF2. random_matrices is a function to create a collection of random 64-by-64 matrices:
import random

def random_matrix():
    return [random.getrandbits(64) for row in range(64)]

def random_matrices(count):
    return [random_matrix() for _ in range(count)]
and here's the timing code:
import timeit

count = 1000
number = 10

timer = timeit.Timer(
    setup="ms = random_matrices({})".format(count),
    stmt="[gf2_rank(m.copy()) for m in ms]",
    globals=globals())
print(min(timer.repeat(number=number)) / count / number)
The result printed on my machine (2.7 GHz Intel Core i7, macOS 10.14.5, Python 3.7) is 0.0001984686384, so that's a touch under 200µs for a single rank computation.
200µs is quite respectable for a pure Python rank computation, but in case this isn't fast enough, we can follow your suggestion to use Cython. Here's a Cython function that takes a 1d NumPy array of dtype np.uint64, again thinking of each element of the array as a row of your 64-by-64 matrix over GF2, and returns the rank of that matrix.
# cython: language_level=3, boundscheck=False

from libc.stdint cimport uint64_t, int64_t

def gf2_rank(uint64_t[:] rows):
    """
    Find rank of a matrix over GF2.

    The matrix can have no more than 64 columns, and is represented
    as a 1d NumPy array of dtype `np.uint64`. As before, each integer
    in the array is thought of as a bit-string to give a row of the
    matrix over GF2.

    This function modifies the input array.
    """
    cdef size_t i, j, nrows, rank
    cdef uint64_t pivot_row, row, lsb

    nrows = rows.shape[0]
    rank = 0
    for i in range(nrows):
        pivot_row = rows[i]
        if pivot_row:
            rank += 1
            lsb = pivot_row & -pivot_row
            for j in range(i + 1, nrows):
                row = rows[j]
                if row & lsb:
                    rows[j] = row ^ pivot_row
    return rank
Running equivalent timings for 64-by-64 matrices, now represented as NumPy arrays of dtype np.uint64 and shape (64,), I get an average rank-computation time of 7.56µs, over 25 times faster than the pure Python version.
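In case it's useful, here's one way to build and call it, a sketch assuming the Cython source above is saved as gf2rank.pyx (a file name I'm choosing) and that Cython is installed:

cythonize -i gf2rank.pyx

and then, from Python:

import numpy as np
from gf2rank import gf2_rank

rows = np.array([6, 13, 4, 9], dtype=np.uint64)  # the 4x4 example again
print(gf2_rank(rows))                            # -> 3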
I wrote a Python package, galois, that extends NumPy arrays over Galois fields. Linear algebra on Galois field matrices is one of the intended use cases. It is written in Python but JIT-compiled using Numba for speed. It is quite fast, and most linear algebra routines are also compiled. (One exception: as of 08/11/2021 the row reduction routine hasn't been JIT-compiled, but that could be added.)
Here is an example using the galois library to do what you are describing.
Create a GF(2) array class and create an explicit array and a random array.
In [1]: import numpy as np
In [2]: import galois
In [3]: GF = galois.GF(2)
In [4]: A = GF([[0, 0, 1, 0], [0, 1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 0]]); A
Out[4]:
GF([[0, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0]], order=2)

In [5]: B = GF.Random((4,4)); B
Out[5]:
GF([[1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0]], order=2)
You can update an entire row (as you requested) like this.
In [6]: B[0,:] = [1,0,0,0]; B
Out[6]:
GF([[1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0]], order=2)
Matrix arithmetic works with normal binary operators. Here is matrix addition and matrix multiplication.
In [7]: A + B
Out[7]:
GF([[1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0]], order=2)

In [8]: A @ B
Out[8]:
GF([[1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0]], order=2)
There is an added method to the NumPy arrays called row_reduce() which performs Gaussian elimination on the matrix. You can also call the standard NumPy linear algebra functions on a Galois field array and get the correct result.
In [9]: A.row_reduce()
Out[9]:
GF([[1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0]], order=2)
In [10]: np.linalg.matrix_rank(A)
Out[10]: 3
Hope this helps! If there is additional functionality desired, please open an issue on GitHub.

Efficient way to iterate through coo_matrix elements ordered by column?

I have a scipy.sparse.coo_matrix which I want to convert to bitsets per column for further calculation. (For the purpose of the example, I'm testing on 100K x 1M.)
I'm currently doing something like this:
bitsets = [intbitset() for _ in range(matrix.shape[1])]
for i, j in zip(matrix.row, matrix.col):
    bitsets[j].add(i)
That works, but the COO matrix iterates over its values by row. Ideally, I'd like to iterate by column and then just build each bitset at once instead of adding to a different bitset every time.
I couldn't find a way to iterate the matrix column-wise. Is there one?
I don't mind converting to other sparse formats, but I couldn't find a way to efficiently iterate the matrix there either. (Using nonzero() on a CSC matrix has proven to be extremely inefficient...)
Thanks!
Make a small sparse matrix:
In [82]: M = sparse.random(5,5,.2, 'coo')*2
In [83]: M
Out[83]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [84]: print(M)
(1, 3) 0.03079661961875302
(0, 2) 0.722023291734881
(0, 3) 0.547594065264775
(1, 0) 1.1021150713641839
(1, 2) 0.585848976928308
That print, as well as nonzero, returns the row and col arrays:
In [85]: M.nonzero()
Out[85]: (array([1, 0, 0, 1, 1], dtype=int32), array([3, 2, 3, 0, 2], dtype=int32))
Conversion to csr orders the rows (but not necessarily the columns). nonzero converts back to coo and returns the row and col, with the new order.
In [86]: M.tocsr().nonzero()
Out[86]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
I was going to say conversion to csc orders the columns, but it doesn't look like that:
In [87]: M.tocsc().nonzero()
Out[87]: (array([0, 0, 1, 1, 1], dtype=int32), array([2, 3, 0, 2, 3], dtype=int32))
Transpose of csr produces a csc:
In [88]: M.tocsr().T.nonzero()
Out[88]: (array([0, 2, 2, 3, 3], dtype=int32), array([1, 0, 1, 0, 1], dtype=int32))
I don't fully follow what you are trying to do, or why you want a column sort, but the lil format might help:
In [90]: M.tolil().rows
Out[90]:
array([list([2, 3]), list([0, 2, 3]), list([]), list([]), list([])],
      dtype=object)
In [91]: M.tolil().T.rows
Out[91]:
array([list([1]), list([]), list([0, 1]), list([0, 1]), list([])],
      dtype=object)
In general iteration on sparse matrices is slow. Matrix multiplication in the csr and csc formats is the fastest operation. And many other operations make use of that indirectly (e.g. row sum). Another relatively fast set of operations are ones that can work directly with the data attribute, without paying attention to row or column values.
coo doesn't implement indexing or iteration. csr and lil implement those.
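For what it's worth, if the end goal is one bitset per column, here's a minimal sketch that avoids per-element iteration: convert to csc once, then slice the indices array with indptr to get each column's row indices in one shot. (Plain Python sets stand in for intbitset here; intbitset's constructor should accept the same iterables.)

import numpy as np
from scipy import sparse

M = sparse.random(1000, 500, density=0.01, format='coo')
C = M.tocsc()
# C.indices[C.indptr[j]:C.indptr[j+1]] holds the row indices of column j
column_sets = [set(C.indices[C.indptr[j]:C.indptr[j + 1]])
               for j in range(C.shape[1])]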

Delete rows in scipy matrix from list

I have a list of integers called cluster0Rand corresponding to certain indices in a scipy sparse matrix called data.
I want to create a new scipy matrix consisting of only the rows whose index is in the list.
For example,
data = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
cluster0Rand = [0,1]
The desired output would be:
csr_matrix([[1, 2, 0], [0, 0, 3]])
How can I do this efficiently, given that the real list is made up of thousands of indices and the scipy matrix is (10000, 100000)?
Given your example, plain indexing does the job:
In [300]: data = sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
In [301]: idx = [0,1]
In [302]: data[idx,:]
Out[302]:
<2x3 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in Compressed Sparse Row format>
In [303]: _.A
Out[303]:
array([[1, 2, 0],
       [0, 0, 3]], dtype=int32)
This kind of indexing is slower with sparse matrices than with dense arrays. But internally it uses a sparse matrix strength, matrix multiplication: it turns idx into a selector matrix.
In [313]: (sparse.csr_matrix([[1,0,0],[0,1,0]])*data).A
Out[313]:
array([[1, 2, 0],
       [0, 0, 3]], dtype=int32)
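For a general idx, here's a sketch of building that selector matrix programmatically instead of writing it out by hand (selector is my name; data and idx are the objects from the session above):

import numpy as np
from scipy import sparse

ones = np.ones(len(idx), dtype=data.dtype)
selector = sparse.csr_matrix((ones, (np.arange(len(idx)), idx)),
                             shape=(len(idx), data.shape[0]))
(selector * data).A          # same rows as data[idx, :].A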

Transfer matrix elements to another matrix's diagonal

I want to do something similar to here (in Python):
How to convert a column or row matrix to a diagonal matrix in Python?
that is:
1) set all elements of matrix A onto the diagonal of matrix B (all other elements of B should be 0), and 2) after performing some operation on B, recreate matrix A: take the elements off B's diagonal, in the same order as in the first step, and put them back into A.
Can you not just unravel your matrix onto the diagonal of another?
In [29]: import numpy as np
In [30]: a = np.array([[1,2],[3,4]])
In [31]: b = np.diag(a.ravel())
In [32]: b
Out[32]:
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])
Then, to go back:
In [33]: b.diagonal().reshape((2,2))
Out[33]:
array([[1, 2],
       [3, 4]])
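If A is large, note that the dense diagonal matrix squares the element count. Here's a sketch of the same round trip with scipy.sparse, in case memory matters:

import numpy as np
from scipy import sparse

a = np.array([[1, 2], [3, 4]])
b = sparse.diags(a.ravel()).tocsr()     # sparse diagonal matrix
# ... operate on b ...
a_back = b.diagonal().reshape(a.shape)  # recover the original layout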

Adding sparse matrices with explicit column and row order and different shapes

Let's say I have two sparse matrices, scipy.sparse.csr_matrix to be precise, that I would like to add element-wise, with the added complication that each row and column has an ID corresponding to a word.
For instance, one matrix might have columns and rows that correspond to ['cat', 'hat'], in that particular order. Another matrix could then have columns and rows that correspond to ['cat', 'mat', 'hat']. This means that when adding these matrices, I need to take into account the following things:
The matrices might have corresponding columns and rows in different orders.
The matrices might not be of the same shape.
Some columns and rows in one matrix might not be present in the other.
I have trouble coming up with a solution to this merging problem, and would hope that you could help me come up with an answer.
For added clarity, here's an example:
import scipy.sparse as sp
mat1_id2column = ['cat', 'hat']
mat1_id2row = ['cat', 'hat']
mat2_id2column = ['cat', 'mat', 'hat']
mat2_id2row = ['cat', 'mat', 'hat']
mat1 = sp.csr_matrix([[1, 0], [0, 1]])
mat2 = sp.csr_matrix([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
merge(mat1, mat2)
# Expected output:
id2column = ['cat', 'hat', 'mat']
id2row = ['cat', 'hat', 'mat']
merged = sp.csr_matrix([[2, 1, 0], [0, 2, 0], [1, 0, 0]])
Any help is much appreciated!
Start by building a row id index that will be the union of the row ids of the 2 matrices. Then do the same for columns. Using this, you can now translate from coordinates in the old matrices to coordinates in the new result matrix.
Do you see how to finish it from there or should I be more explicit?
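To be a little more explicit, here's a minimal sketch of that recipe, assuming (as in the example) that rows and columns share one vocabulary. merge, words and pos are my names; duplicate coordinates are summed when the coo result is converted:

import numpy as np
import scipy.sparse as sp

def merge(m1, ids1, m2, ids2):
    # Union vocabulary; sorted only to make the order deterministic.
    words = sorted(set(ids1) | set(ids2))
    pos = {w: i for i, w in enumerate(words)}
    data, rows, cols = [], [], []
    for m, ids in ((m1, ids1), (m2, ids2)):
        coo = m.tocoo()
        data.append(coo.data)
        rows.append([pos[ids[i]] for i in coo.row])
        cols.append([pos[ids[j]] for j in coo.col])
    merged = sp.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(len(words), len(words)))
    return merged.tocsr(), words   # duplicates are summed here

merged, id2word = merge(mat1, mat1_id2row, mat2, mat2_id2row)
# id2word -> ['cat', 'hat', 'mat']; merged.A -> [[2, 1, 0], [0, 2, 0], [1, 0, 0]]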
One way or another, you have to work out a unique mapping between your strings and row/column indices.
A start using dictionaries is:
from collections import defaultdict

def foo(dd, mat):
    # mat is a (matrix, id2row, id2column) triple
    for ij, v in mat[0].todok().items():
        dd[(mat[1][ij[0]], mat[2][ij[1]])] += v

dd = defaultdict(int)
foo(dd, (mat1, mat1_id2row, mat1_id2column))
foo(dd, (mat2, mat2_id2row, mat2_id2column))
print(dd)
produces
defaultdict(<class 'int'>, {('cat', 'hat'): 1,
                            ('hat', 'hat'): 2,
                            ('mat', 'cat'): 1,
                            ('cat', 'cat'): 2})
dd could then be turned back into a dok_matrix, once a master word order is chosen.
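For instance, a sketch of that last step (words, pos and out are my names):

words = ['cat', 'hat', 'mat']
pos = {w: i for i, w in enumerate(words)}
out = sp.dok_matrix((len(words), len(words)), dtype=int)
for (r, c), v in dd.items():
    out[pos[r], pos[c]] = v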
A different approach would take advantage of the way coo_matrix handles duplicates - they are added together when it is converted to a csr.
In this example take ['cat', 'mat', 'hat'] as the master index.
The defining arrays for mat2 are then
data: array([1, 1, 1, 1])
row : array([0, 0, 1, 2])
col : array([0, 2, 0, 2])
for mat1 they would be (I haven't worked out the code to do this yet)
data: array([1, 1])
row : array([0, 2])
col : array([0, 2])
Concatenate the respective arrays and create a new coo matrix, merged:
data: array([1, 1, 1, 1, 1, 1])
row : array([0, 0, 1, 2, 0, 2])
col : array([0, 2, 0, 2, 0, 2])
merged.A would be
array([[2, 0, 1],
       [1, 0, 0],
       [0, 0, 2]])
Another option is to use matrix multiplication to map the arrays onto larger ones that can be added. Again, I'm leaving the details of how to generate the mapping unspecified (but see the sketch after the code below). You'd have to generate a separate T for each different sequence of words, which may require the same amount of iterative work as the other approaches.
T1 = sp.csr_matrix(np.array([[1,0],[0,0],[0,1]]))
T2 = T1.T # same mapping for row and cols of mat1
T1.dot(mat1).dot(T2) + mat2
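Here's a sketch of how such a T could be generated from the word lists (master, pos and where are my names): put a 1 at (master position of mat1's word i, i), so that T1 maps mat1's row space into the master row space.

import numpy as np
import scipy.sparse as sp

master = ['cat', 'mat', 'hat']
pos = {w: i for i, w in enumerate(master)}
where = [pos[w] for w in mat1_id2row]       # master slot for each mat1 row
T1 = sp.csr_matrix((np.ones(len(where), dtype=int),
                    (where, np.arange(len(where)))),
                   shape=(len(master), len(mat1_id2row)))
T2 = T1.T
T1.dot(mat1).dot(T2) + mat2                 # mat1 in master coordinates, plus mat2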
