Slice a CSR matrix by a list of indexes - python

I'm struggling to understand the behaviour of slicing a sparse matrix.
I have this CSR matrix, say M:
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(0, 24104) 3
(0, 28061) 2
Now I have extracted the column indexes and I want to slice by them.
From that matrix I want to get the matrix
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(0, 24104) 3
and
(0, 28061) 2
Now if I do
M[0, training_set_index]
where training_set_index = [4136, 5553, 9089, 24104], I get
(0, 3) 3
(0, 2) 1
(0, 1) 1
(0, 0) 1
I just want a copy of the original matrix (preserving the indexes) that keeps only the columns listed in training_set_index. Is that possible? What's wrong?
Thanks

When I hear "sparse matrix", the first thing that comes to my mind is a lot of zeros :)
One approach is to convert the sparse matrix to a numpy array, do some fancy slicing, and then convert back to a sparse matrix:
import numpy as np
from scipy.sparse import csr_matrix
# create sparse matrix
training_set_index = [5553,24104] # for example I need this index
row = np.array([0, 0, 0, 0, 0])
col = np.array([4136, 5553, 9089, 24104, 28061])
data = np.array([1, 1, 1, 3, 2])
dim_0 = (1,28062)
S = csr_matrix((data, (row, col)), shape=dim_0)
print(S)
#(0, 4136) 1
#(0, 5553) 1
#(0, 9089) 1
#(0, 24104) 3
#(0, 28061) 2
# convert to numpy array
M = S.toarray()
# create zero arrays as wide as the matrix: x keeps the selected columns, y keeps the rest
x = np.zeros((S.shape[1],), dtype=int)
np.add.at(x, training_set_index, M[0, training_set_index])
y = M[0].copy()
y[training_set_index] = 0
# new sparse matrix
S_training = csr_matrix(x)
print(S_training)
#(0, 5553) 1
#(0, 24104) 3
Have a nice slicing!
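The dense round trip above allocates a full-width array, which may matter for wide matrices. Here is a minimal sketch of staying sparse instead, assuming the goal is a matrix of the same shape that keeps only the entries in the chosen columns. It converts to COO, filters the stored entries with np.isin, and rebuilds. The names M_train and M_rest are made up for illustration. For context, plain M[0, training_set_index] renumbers the columns because it builds a new, narrower matrix with one column per requested index.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same matrix as in the question
M = csr_matrix(
    ([1, 1, 1, 3, 2], ([0, 0, 0, 0, 0], [4136, 5553, 9089, 24104, 28061])),
    shape=(1, 28062),
)
training_set_index = [4136, 5553, 9089, 24104]

coo = M.tocoo()
keep = np.isin(coo.col, training_set_index)  # True where the column is selected

# Rebuild with the original shape so the column indexes are preserved
M_train = csr_matrix((coo.data[keep], (coo.row[keep], coo.col[keep])), shape=M.shape)
M_rest = csr_matrix((coo.data[~keep], (coo.row[~keep], coo.col[~keep])), shape=M.shape)

print(M_train)
print(M_rest)
```

Both results have the same shape as M, so an entry stored at column 24104 in M is still at column 24104 in M_train.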

Related

Accessing column in sparse CSR matrix

Having some issues with accessing the last column in the sparse CSR matrix. Ideally, I would like to convert the last column into some sort of array that can be used as my label set. My CSR matrix looks like this:
(0, 1976) 1
(0, 2916) 1
(0, 3871) 1
(0, 4437) 1
(0, 8202) 1
(0, 9458) 1
(0, 10597) 1
(1, 4801) 1
(1, 6903) 1
(1, 7525) 1
(2, 873) 1
(2, 1017) 1
(2, 1740) 1
(2, 1925) 1
(3, 1976) 1
(3, 5606) 1
(3, 6898) 1
I want to access the last column, which contains all the '1'. Is there a way in which I can do this?
A CSR matrix has indices and indptr attributes. The example below uses them to convert each row of the matrix into a string listing its column indices:
from scipy.sparse import csr_matrix

def sparse_to_string_list(matrix: csr_matrix):
    res = []
    indptr = matrix.indptr
    indices = matrix.indices
    for row in range(matrix.shape[0]):
        # column indices of the entries stored in this row
        arr = [k for k in indices[indptr[row]: indptr[row + 1]]]
        arr.sort()
        res.append(' '.join([str(k) for k in arr]))
    return res
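The question can be read two ways, so here is a small sketch covering both, on a made-up stand-in matrix. If "last column" means the last column of the matrix itself, slicing it and calling toarray() gives a dense label array. If it means the right-hand values in the printed output, those are the stored values in the .data attribute.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in matrix; its last column plays the role of the labels
M = csr_matrix(np.array([[0, 3, 1],
                         [2, 0, 0],
                         [0, 0, 1]]))

last = M.shape[1] - 1
labels = M[:, last].toarray().ravel()  # dense 1-D array, zeros included
features = M[:, :last]                 # everything except the last column, still sparse

stored_values = M.data                 # the right-hand values shown by print(M)

print(labels)
print(stored_values)
```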

Getting the index of the non-zero connections in a scipy CSR graph

Say I have a scipy csr graph. When I call the print routine on the graph object it returns the index and the value.
import numpy as np
from scipy.sparse import csr_matrix
G_dense = np.array([[0, 2, 1],
                    [2, 0, 0],
                    [1, 0, 0]])
G_masked = np.ma.masked_values(G_dense, 0)
G_sparse = csr_matrix(G_dense)
print(G_sparse)
# Return
# (0, 1) 2
# (0, 2) 1
# (1, 0) 2
# (2, 0) 1
If I use print(G_sparse.data) it shows [2 1 2 1] (only the weights).
I wanted to know if I can retrieve the indices,
i.e. (0,1), (0,2), (1,0), (2,0),
using some function.
Thank you in advance!
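One possible way to get those pairs, sketched on the same graph: nonzero() returns the row and column index arrays of the non-zero entries, and converting to COO format exposes row, col and data side by side. The variable names pairs and triples are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

G_sparse = csr_matrix(np.array([[0, 2, 1],
                                [2, 0, 0],
                                [1, 0, 0]]))

# Row and column indices of the non-zero entries
rows, cols = G_sparse.nonzero()
pairs = list(zip(rows.tolist(), cols.tolist()))

# COO format stores one (row, col, value) triplet per entry
coo = G_sparse.tocoo()
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

print(pairs)
print(triples)
```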

SciPy: Symmetric permutation of sparse CSR matrix

I'd like to symmetrically permute a sparse matrix, permuting rows and columns in the same way. For example, I would like to rotate the rows and columns, which takes:
1 2 3
0 1 0
0 0 1
to
1 0 0
0 1 0
2 3 1
In Octave or MATLAB, one can do this concisely with matrix indexing:
A = sparse([1 2 3; 0 1 0; 0 0 1]);
perm = [2 3 1];
Aperm = A(perm,perm);
I am interested in doing this in Python, with NumPy/SciPy. Here is an attempt:
#!/usr/bin/env python
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 0, 1, 2])
col = np.array([0, 1, 2, 1, 2])
data = np.array([1, 2, 3, 1, 1])
A = csr_matrix((data, (row, col)), shape=(3, 3))
p = np.array([1, 2, 0])
#Aperm = A[p,p] # gives [1,1,1], the permuted diagonal
Aperm = A[:,p][p,:] # works, but more verbose
Is there a cleaner way to accomplish this sort of symmetric permutation of a matrix?
(I'm more interested in concise syntax than I am in performance)
In MATLAB
A(perm,perm)
is a block operation. In numpy A[perm,perm] selects elements on the diagonal.
A[perm[:,None], perm]
is the block indexing. The MATLAB diagonal requires something like sub2ind. What's concise in one is more verbose in the other, and vice versa.
Actually numpy is using the same logic in both cases. It 'broadcasts' one index against the other: a (n,) against (n,) in the diagonal case, and (n,1) against (1,n) in the block case. The results are (n,) and (n,n) shaped.
This numpy indexing works with sparse matrices as well, though it isn't as fast. It actually uses matrix multiplication to do this sort of indexing, with an 'extractor' matrix built from the indices (maybe two of them, as in M*A*M.T).
MATLAB's documentation about a permutation matrix:
https://www.mathworks.com/help/matlab/math/sparse-matrix-operations.html#f6-13070
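As a rough sketch of the two ideas in this answer, applied to the matrix from the question: block indexing with a column-shaped row index, and an explicit permutation matrix. The permutation matrix P is built here by hand for illustration (P[i, p[i]] = 1); it is not claimed to be what scipy does internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 2, 3],
                         [0, 1, 0],
                         [0, 0, 1]]))
p = np.array([1, 2, 0])

# Block indexing: an (n,1) row index broadcast against an (n,) column index
Aperm = A[p[:, None], p]

# Same permutation with an explicit permutation matrix P, where P[i, p[i]] = 1
n = len(p)
P = csr_matrix((np.ones(n, dtype=A.dtype), (np.arange(n), p)), shape=(n, n))
Aperm2 = P @ A @ P.T

print(Aperm.toarray())
print(Aperm2.toarray())
```

Both should give the rotated matrix shown in the question.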

Iterate over sparse matrix and concatenate data and indices for each row

I have a dataframe and a vocabulary file which I am trying to fit to the dataframe's string columns. I am using scikit-learn's CountVectorizer, which produces a sparse matrix. I need to take the output of the sparse matrix and merge it with the dataframe, row by corresponding row.
Code:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["You can catch more flies with honey than you can with vinegar.",
"You can lead a horse to water, but you can't make him drink.",
"search not cleaning up on hard delete",
"updating firmware version failed",
"increase not service topology s memory",
"Nothing Matching Here"
]
vocabulary = ["catch more","lead a horse", "increase service", "updating" , "search", "vinegar", "drink", "failed", "not"]
vectorizer = CountVectorizer(analyzer=u'word', vocabulary=vocabulary,lowercase=True,ngram_range=(0,19))
SpraseMatrix = vectorizer.fit_transform(docs)
Below is the sparse matrix output:
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
Now, what I want to do is build a string for each row of the sparse matrix and add it to the corresponding document.
Ex: for doc 3 ("updating firmware version failed"), I want to get "3:1 7:1" from the sparse matrix (i.e. the column indexes of "updating" and "failed" and their frequencies) and add this to row 3 of the data frame.
I tried the code below, but it produces flattened output, whereas I want to get the submatrix for each row index, loop through it and build a concatenated string per row such as "3:1 7:1", then add this string as a new column to the data frame for each corresponding row.
cx = SpraseMatrix.tocoo()
for i,j,v in zip(cx.row, cx.col, cx.data):
    print((i,j,v))
(0, 0, 1)
(0, 5, 1)
(1, 6, 1)
(2, 4, 1)
(2, 8, 1)
(3, 3, 1)
(3, 7, 1)
(4, 8, 1)
I'm not entirely following what you want, but maybe the lil format will be easier to work with:
In [1122]: M = sparse.coo_matrix(([1,1,1,1,1,1,1,1],([0,0,1,2,2,3,3,4],[0,5,6,4,
...: 8,3,7,8])))
In [1123]: M
Out[1123]:
<5x9 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>
In [1124]: print(M)
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
In [1125]: Ml = M.tolil()
In [1126]: Ml.data
Out[1126]: array([list([1, 1]), list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)
In [1127]: Ml.rows
Out[1127]: array([list([0, 5]), list([6]), list([4, 8]), list([3, 7]), list([8])], dtype=object)
Its attributes are organized by row, which appears to be how you want it.
In [1130]: Ml.rows[3]
Out[1130]: [3, 7]
In [1135]: for i,(rd) in enumerate(zip(Ml.rows, Ml.data)):
...: print(' '.join(['%s:%s'%ij for ij in zip(*rd)]))
...:
0:1 5:1
6:1
4:1 8:1
3:1 7:1
8:1
You can also iterate through the rows of the csr format, but that requires a bit more math with the .indptr attribute.
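A minimal sketch of that CSR variant, assuming the same entries as above plus a sixth all-zero row for the document that matches nothing. For row r, the stored entries sit in the slice indptr[r]:indptr[r + 1] of both indices and data. The helper name rows_to_strings is made up for this example.

```python
from scipy import sparse

# Same entries as above, with shape (6, 9) so the last document gets an empty row
M = sparse.coo_matrix(
    ([1, 1, 1, 1, 1, 1, 1, 1],
     ([0, 0, 1, 2, 2, 3, 3, 4], [0, 5, 6, 4, 8, 3, 7, 8])),
    shape=(6, 9),
).tocsr()
M.sort_indices()  # make sure columns come out in ascending order within each row


def rows_to_strings(csr):
    """Return one 'col:value col:value' string per row of a CSR matrix."""
    out = []
    for r in range(csr.shape[0]):
        start, end = csr.indptr[r], csr.indptr[r + 1]  # this row's slice
        cols = csr.indices[start:end]
        vals = csr.data[start:end]
        out.append(' '.join(f'{c}:{v}' for c, v in zip(cols, vals)))
    return out


row_strings = rows_to_strings(M)
print(row_strings)
```

The list has one entry per matrix row, so it can be assigned as a new data frame column as long as the data frame rows are in the same order as the documents passed to the vectorizer.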

importing a python sparse matrix into MATLAB

I have a sparse matrix in CSR format in Python and I want to import it into MATLAB. MATLAB does not have a CSR sparse format; it has only one sparse format for all kinds of matrices. Since the matrix is very large in dense format, how can I import it as a MATLAB sparse matrix?
The scipy.io.savemat saves sparse matrices in a MATLAB compatible format:
In [1]: from scipy.io import savemat, loadmat
In [2]: from scipy import sparse
In [3]: M = sparse.csr_matrix(np.arange(12).reshape(3,4))
In [4]: savemat('temp', {'M':M})
In [8]: x=loadmat('temp.mat')
In [9]: x
Out[9]:
{'M': <3x4 sparse matrix of type '<type 'numpy.int32'>'
with 11 stored elements in Compressed Sparse Column format>,
'__globals__': [],
'__header__': 'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Sep 8 09:34:54 2014',
'__version__': '1.0'}
In [10]: x['M'].A
Out[10]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Note that savemat converted it to csc. It also transparently takes care of the index starting point difference.
And in Octave:
octave:4> load temp.mat
octave:5> M
M =
Compressed Column Sparse (rows = 3, cols = 4, nnz = 11 [92%])
(2, 1) -> 4
(3, 1) -> 8
(1, 2) -> 1
(2, 2) -> 5
...
octave:8> full(M)
ans =
0 1 2 3
4 5 6 7
8 9 10 11
The Matlab and Scipy sparse matrix formats are compatible. You need to get the data, indices and matrix size of the matrix in Scipy and use them to create a sparse matrix in Matlab. Here's an example:
from scipy.sparse import csr_matrix
from numpy import array
# create a sparse matrix
row = array([0,0,1,2,2,2])
col = array([0,2,2,0,1,2])
data = array([1,2,3,4,5,6])
mat = csr_matrix( (data,(row,col)), shape=(3,4) )
# get the data, shape and indices
(m,n) = mat.shape
s = mat.data
i = mat.tocoo().row
j = mat.indices
# display the matrix
print(mat)
Which prints out:
(0, 0) 1
(0, 2) 2
(1, 2) 3
(2, 0) 4
(2, 1) 5
(2, 2) 6
Use the values m, n, s, i, and j from Python to create a matrix in Matlab:
m = 3;
n = 4;
s = [1, 2, 3, 4, 5, 6];
% Index from 1 in Matlab.
i = [0, 0, 1, 2, 2, 2] + 1;
j = [0, 2, 2, 0, 1, 2] + 1;
S = sparse(i, j, s, m, n, m*n)
Which gives the same Matrix, only indexed from 1.
(1,1) 1
(3,1) 4
(3,2) 5
(1,3) 2
(2,3) 3
(3,3) 6
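To avoid copying the index lists by hand, the MATLAB call can be generated from Python. This is only a sketch: it takes the triplets from the COO form, shifts the indices to start at 1, and formats one sparse(i, j, s, m, n) line as text. The variable name matlab_line is made up, and for large matrices a file-based route such as savemat is likely more practical than pasting text.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 4))

coo = mat.tocoo()            # one (row, col, value) triplet per stored entry
m, n = coo.shape
i = (coo.row + 1).tolist()   # MATLAB indexes from 1
j = (coo.col + 1).tolist()
s = coo.data.tolist()

# A Python list prints as [a, b, c], which MATLAB reads as a row vector
matlab_line = f"S = sparse({i}, {j}, {s}, {m}, {n});"
print(matlab_line)
```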
