SciPy/NumPy: Normalize a csr_matrix - python

I'm trying to normalize a csr_matrix:
<5400x6845 sparse matrix of type '<type 'numpy.float64'>' with 91833 stored elements in Compressed Sparse Row format>
What I tried was this:
import numpy as np
from scipy import sparse
# ve is my csr_matrix
ve_sum = ve.sum(axis=1)
ve_sums = sparse.csr_matrix(np.tile(ve_sum, (1, ve.shape[1]))) # <-- here I get MemoryError
n_ve = ve/ve_sums
This is obviously not the correct way of doing this kind of easy normalization.
What is the correct way?

# Normalize the rows of ve.
row_sums = np.array(ve.sum(axis=1))[:,0]
row_indices, col_indices = ve.nonzero()
ve.data /= row_sums[row_indices]
This works in place and never builds a dense matrix. A quick Google search turns up the same approach.
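The snippet above relies on ve.nonzero() lining up with ve.data, which breaks if the matrix stores explicit zeros (nonzero() skips them), and it divides by zero for empty rows. A minimal sketch of a variant that avoids both, assuming a copy is acceptable and that all-zero rows should simply stay zero (normalize_rows is a hypothetical helper name, not a SciPy function):

```python
import numpy as np
from scipy import sparse

def normalize_rows(m):
    """Return a CSR copy of m whose non-empty rows each sum to 1."""
    m = sparse.csr_matrix(m, dtype=np.float64, copy=True)
    row_sums = np.asarray(m.sum(axis=1)).ravel()
    # Assumption: rows that sum to zero are left untouched.
    row_sums[row_sums == 0] = 1.0
    # np.diff(indptr) is the number of stored entries per row, so the
    # repeated vector has exactly one divisor per element of m.data.
    m.data /= np.repeat(row_sums, np.diff(m.indptr))
    return m

ve = sparse.csr_matrix([[1, 1, 0], [0, 0, 0], [2, 0, 6]])
n_ve = normalize_rows(ve)
print(n_ve.toarray())
```

Memory use stays proportional to the number of stored elements, which is what the np.tile attempt in the question could not achieve.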

Related

How to concatenate more sparse matrices into one in python

I have a problem in Python: I would like to merge several sparse matrices into one. The matrices are of type csr_matrix and all have the same number of rows. When I use hstack to stack them together I obtain an array of matrices, but I would like a single matrix with that same number of rows and with a column count equal to the sum of the column counts of the individual matrices.
Thanks for support.
You can do this using scipy.sparse.hstack. For example:
import numpy as np
from scipy import sparse
x = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
y = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
xy = sparse.hstack([x, y])
print(xy.shape)
# (10, 20)
print(type(xy))
# <class 'scipy.sparse.coo.coo_matrix'>
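As the output above shows, the result comes back in COO format by default. If CSR is needed afterwards, sparse.hstack takes a format argument. A small sketch (the shapes and the seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
x = sparse.csr_matrix(rng.integers(0, 2, size=(10, 4)))
y = sparse.csr_matrix(rng.integers(0, 2, size=(10, 6)))

# Same number of rows; the column counts add up (4 + 6).
xy = sparse.hstack([x, y], format='csr')
print(xy.shape)
print(xy.format)
```

Calling .tocsr() on the COO result should give the same matrix; the format argument just saves the extra step.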

Sparse matrix of zeros

I have this code :
M=np.zeros((N,N),dtype=complex)
M=sparse.bsr_matrix(M)
M[0][0]=complex(1,1)
print(M)
I am trying to create an NxN sparse matrix of zeros that I can then add numbers into. Could someone please tell me why it is giving me an error? Thanks!
Since bsr_matrix represents a block sparse matrix, you can't change its elements by index. Perhaps you were looking for something like csr_matrix? With that, paying attention to the warning it produces, you can do what you are trying to do:
In [192]: M = sparse.csr_matrix((N, N), dtype=complex)  # np.complex was removed from recent NumPy; the builtin complex is equivalent
In [193]: M[0, 0] = complex(1, 1)
C:\Users\<user>\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py:742: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [194]: print(M)
(0, 0) (1+1j)
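The warning above recommends lil_matrix for this kind of incremental filling. A minimal sketch of that route, assuming the matrix is filled element by element and converted once at the end (N = 5 and the second entry are arbitrary illustration values):

```python
import numpy as np
from scipy import sparse

N = 5
# LIL format is designed for cheap changes to the sparsity structure.
M = sparse.lil_matrix((N, N), dtype=complex)
M[0, 0] = complex(1, 1)
M[2, 3] = 4

# Convert once, after filling, to a format suited to arithmetic.
M_csr = M.tocsr()
print(M_csr)
```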
Try this:
M = np.zeros((N,N),dtype=complex)
M[0][0] = complex(1,1)
M = sparse.bsr_matrix(M)
print(M)
# (0, 0) (1+1j)
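bsr_matrix does not support assigning to individual elements, but you can set the values in a NumPy array and then convert it to a sparse matrix. Note that this allocates the full dense NxN array first, so it is only practical for small N.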

pairwise distance fails on a sparse matrix with an uninformative error message

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.spatial import distance
X = CountVectorizer().fit_transform(docs)
X = TfidfTransformer(use_idf=False).fit_transform(X)
print (X.shape) #prints (100, 1760)
However when I try to calculate the pair-distance I get this error:
distance.pdist(X, metric='cosine')
ValueError: A 2-dimensional array must be passed.
The shape indicates that X is a 2-dimensional array. What could be the issue?
=====Update July 6th 2017======
This is a bug in scipy, sklearn has the correct implementation for sparse matrices.
I have proposed a code-change to the scipy repository here:
https://github.com/scipy/scipy/pull/7566
=====Update Feb 23rd 2018======
If you got here, you probably encountered that issue as well.
It's been more than 8 months since a one-line fix I proposed was pushed to the scipy repository.
Please comment here or here, to get some attention from the scipy maintainers.
pdist starts with:
def pdist(X, metric='euclidean', p=2, w=None, V=None, VI=None):
    ....
    X = np.asarray(X, order='c')
    # The C code doesn't do striding.
    X = _copy_array_if_base_present(X)
    s = X.shape
    if len(s) != 2:
        raise ValueError('A 2-dimensional array must be passed.')
But if I make a scipy.sparse matrix, and apply asarray I don't get a 2d array:
In [258]: from scipy import sparse
In [259]: M = sparse.random(100,100, format='csr')
In [260]: M
Out[260]:
<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
In [263]: np.asarray(M)
Out[263]:
array(<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>, dtype=object)
In [264]: _.shape
Out[264]: ()
pdist is not designed to accept a sparse matrix. A sparse matrix is not a subclass of ndarray. You have to make it dense first.
In [266]: np.asarray(M.toarray()).shape
Out[266]: (100, 100)
Alternatively, sklearn's pairwise_distances handles sparse input:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
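If densifying the input is too expensive, the cosine case can also be computed with sparse products only. The following is a sketch, not pdist's own implementation: sparse_cosine_pdist is a hypothetical helper, it assumes no row of X is all zeros, and it still materializes the dense n-by-n result, so it only helps when the number of columns is the bottleneck.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import distance

def sparse_cosine_pdist(X):
    """Condensed cosine distances between the rows of a sparse matrix."""
    X = sparse.csr_matrix(X, dtype=np.float64)
    # Euclidean norm of each row (assumed non-zero).
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    Xn = sparse.diags(1.0 / norms) @ X       # rows scaled to unit length
    sim = (Xn @ Xn.T).toarray()              # dense n x n cosine similarities
    iu = np.triu_indices(X.shape[0], k=1)    # same pair ordering as pdist
    return 1.0 - sim[iu]

rng = np.random.default_rng(0)
dense = rng.random((20, 50)) * (rng.random((20, 50)) < 0.3)
dense[:, 0] = 1.0                            # guarantee no all-zero rows
X = sparse.csr_matrix(dense)

d_sparse = sparse_cosine_pdist(X)
d_dense = distance.pdist(X.toarray(), metric='cosine')
print(np.allclose(d_sparse, d_dense))
```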

Python: L1-norm of a sparse non-square matrix

I have a problem computing the 1-norm of a sparse matrix. I am using the function scipy.sparse.linalg.onenormest, but it gives me an error because the operator can only act on a square matrix.
Here is a code example:
from numpy import array
from scipy import sparse
from scipy.sparse.linalg import onenormest
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
onenormest(A)
This is the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\linalg\_onenormest.py", line 76, in onenormest
raise ValueError('expected the operator to act like a square matrix')
ValueError: expected the operator to act like a square matrix
onenormest works if I define A as a square matrix, but this is not what I want.
Does anyone know how to calculate the 1-norm of a sparse non-square matrix?
I think that you want numpy.linalg.norm instead:
from numpy import array, linalg
from scipy import sparse
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(linalg.norm(A.todense(), ord=1)) # 15.0
Passing A.data to norm does not work, since .data of a sparse matrix object is just the stored values as a flat vector, without the matrix structure.
If your sparse matrix is small, converting it to dense is fine. If it is large, densifying is a problem, in which case you can write your own routine.
If you are only interested in the L^1-norm, and casting to dense is not possible, then you could do it via something like this:
def sparseL1Norm(A):
    return max(abs(A).getcol(i).sum() for i in range(A.shape[1]))
This finds the L1-norm of each column:
from scipy import sparse
import numpy as np
row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6]) # made negative to exercise abs
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(abs(A).sum(axis=0))
yields
[[ 3  3 15]]
You could then take the max to find the L1-norm of the matrix:
print(abs(A).sum(axis=0).max())
# 15
abs(A) is a sparse matrix:
In [29]: abs(A)
Out[29]:
<5x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>
and sum and max are methods of the sparse matrix, so abs(A).sum(axis=0).max() computes the L1-norm without densifying the matrix.
Note: Most NumPy functions (such as np.abs) are not designed to work with sparse matrices. Although np.abs(A) returns the correct result, it arrives there through an indirect route. The more direct route is to use abs(A), which calls A.__abs__(). Thanks to pv. for pointing this out.
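Newer SciPy releases also provide scipy.sparse.linalg.norm, which accepts non-square sparse matrices. A short sketch comparing it with the manual column-sum approach above, assuming a SciPy version that ships this function:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, -4, -5, -6])
A = sparse.csc_matrix((data, (row, col)), shape=(5, 3))

# Manual route: largest absolute column sum, computed without densifying.
manual = abs(A).sum(axis=0).max()
# Library route: ord=1 is the matrix 1-norm.
builtin = sparse_norm(A, ord=1)
print(manual, builtin)
```

Both values should agree; the library call mainly saves having to remember which axis the 1-norm sums over.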

Calculate Similarity of Sparse Matrix

I am using Python with the numpy, scipy and scikit-learn modules.
I'd like to classify the rows of a very big sparse matrix (100,000 x 100,000).
The values in the matrix are all 0 or 1. The only thing I have is the indices where the value is 1.
a = [1,3,5,7,9]
b = [2,4,6,8,10]
which means
a = [0,1,0,1,0,1,0,1,0,1,0]
b = [0,0,1,0,1,0,1,0,1,0,1]
How can I convert the index arrays to a sparse array in scipy?
How can I classify those arrays quickly?
Thank you very much.
If you choose the sparse coo_matrix, you can create it by passing the indices like this:
import numpy as np
from scipy.sparse import coo_matrix
nrows = 100000
ncols = 100000
# scipy.array and scipy.ones were NumPy aliases that recent SciPy no longer provides
row = np.array([1,3,5,7,9])
col = np.array([2,4,6,8,10])
values = np.ones(col.size)
m = coo_matrix((values, (row,col)), shape=(nrows, ncols), dtype=float)
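Note that the snippet above treats the two lists as row and column coordinates. If instead each list holds the column indices of the ones in a single row, as the question's a and b suggest, a CSR matrix can be assembled directly from the lists. A minimal sketch under that reading (index_lists and ncols are illustration values, not from the original post):

```python
import numpy as np
from scipy import sparse

# Each inner list holds the column indices of the ones in that row.
index_lists = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
ncols = 11

# indptr[i]:indptr[i+1] marks where row i's entries sit in `indices`.
indptr = np.cumsum([0] + [len(ix) for ix in index_lists])
indices = np.concatenate(index_lists)
data = np.ones(len(indices), dtype=np.int8)

m = sparse.csr_matrix((data, indices, indptr),
                      shape=(len(index_lists), ncols))
print(m.toarray())
```

CSR row slicing is cheap, and a matrix in this form can be passed to scikit-learn estimators that accept sparse input.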
