Python: L1-norm of a sparse non-square matrix

I have a problem when trying to compute the 1-norm of a sparse matrix. I am using the function scipy.sparse.linalg.onenormest, but it raises an error because the operator can only act on square matrices.
Here is a code example:
from numpy import array
from scipy import sparse
from scipy.sparse.linalg import onenormest
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
onenormest(A)
This is the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\linalg\_onenormest.py", line 76, in onenormest
raise ValueError('expected the operator to act like a square matrix')
ValueError: expected the operator to act like a square matrix
The operator onenormest works if I define A as a square matrix, but that is not what I want.
Does anyone know how to calculate the 1-norm of a sparse non-square matrix?

I think that you want numpy.linalg.norm instead:
from numpy import array, linalg
from scipy import sparse
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print linalg.norm(A.todense(), ord=1)  # 15
Passing A.data does not work, since the .data attribute of a sparse matrix holds only the stored values - it is a flat vector, not the matrix itself.
If your sparse matrix is only small, then this is fine. If it is large, then obviously this is a problem. In which case, you can write your own routine.
If you are only interested in the L^1-norm, and casting to dense is not possible, then you could do it via something like this:
import numpy
sparseL1Norm = lambda A: max(numpy.abs(A).getcol(i).sum() for i in range(A.shape[1]))
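For example, applied to the A from the question this returns 15, matching the dense computation, while staying sparse throughout:
print sparseL1Norm(A)  # 15, same as linalg.norm(A.todense(), ord=1)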

This finds the L1-norm of each column:
from scipy import sparse
import numpy as np
row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6]) # made negative to exercise abs
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(abs(A).sum(axis=0))
yields
[[ 3 3 15]]
You could then take the max to find the L1-norm of the matrix:
print(abs(A).sum(axis=0).max())
# 15
abs(A) is a sparse matrix:
In [29]: abs(A)
Out[29]:
<5x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>
and sum and max are methods of the sparse matrix, so abs(A).sum(axis=0).max() computes the L1-norm without densifying the matrix.
Note: Most NumPy functions (such as np.abs) are not designed to work with sparse matrices. Although np.abs(A) returns the correct result, it arrives there through an indirect route. The more direct route is to use abs(A), which calls A.__abs__(). Thanks to pv. for pointing this out.
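As an aside not covered in the original answers: recent SciPy versions (0.16 and later, if memory serves) also provide scipy.sparse.linalg.norm, which computes matrix norms directly on a sparse matrix:
from scipy.sparse.linalg import norm
# ord=1 is the maximum absolute column sum, computed without densifying A.
print(norm(A, ord=1))  # 15 for the A above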

Related

Converting from Numpy.zeros(100,100) to using a Scipy.sparse.lil_matrix(100,100) Error

I am creating a finite volume solver. I had success using numpy.zeros to create a zero matrix and a for loop to fill specific locations of the matrix with the values I wish to calculate.
However, I need to use a larger matrix, specifically numpy.zeros((102400, 102400)), but I get the error "array is too big". I can create a numpy.zeros((10000, 10000)) matrix, but that seems to be the limit of my system (6 GB of RAM).
I was told that changing the matrix into a sparse matrix would free up memory and allow me to do the calculations. However, my code that was created to fill a zero matrix cannot be used on this sparse matrix, and I don't know why.
import numpy as np
import scipy as sp
from scipy import sparse

matA = sp.sparse.lil_matrix(m, m)
matb = sp.sparse.lil_matrix(m, 1)

i = 0
for row in range(Lrow):
    for column in range(Lcol):
        if row == 0 and column == 0:
            matA[i, i + 1] = -k * (delY / delX)
            matA[i, i + Lcol] = -k * (delX / delY)
            matA[i, i] = -(3 * matA[i, i + 1] + matA[i, i + Lcol])
Edit: m = 100000, and i is incremented at the end of the if statement with i = i + 1.
You are initializing your sparse matrices incorrectly. Take a look at the documentation for lil_matrix. Given that m is your shape parameter, you actually want to initialize the matrix as follows (note that the first argument is a tuple):
matA = scipy.sparse.lil_matrix((m, m))
matA
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in LInked List format>
The way you are doing it you end up with a 1x1 matrix, which I assume is not your intent:
matA = scipy.sparse.lil_matrix(m, m)
matA
<1x1 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in LInked List format>
The reason is that the first argument of lil_matrix expects an array, something array-like, another sparse matrix, or a shape tuple. When you call lil_matrix(m, m), the first argument is interpreted as array-like and the second is essentially ignored, so you just get a 1x1 matrix whose single value is m.
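For completeness, here is a sketch of the corrected initialization; the values of m, Lrow, Lcol, k, delX and delY are hypothetical stand-ins for the question's variables:
import scipy.sparse

m, Lrow, Lcol = 102400, 320, 320   # hypothetical: 320 * 320 = 102400 cells
k, delX, delY = 1.0, 0.1, 0.1      # hypothetical physical constants

matA = scipy.sparse.lil_matrix((m, m))   # note the tuples: m x m and m x 1,
matb = scipy.sparse.lil_matrix((m, 1))   # not 1x1 matrices

i = 0
for row in range(Lrow):
    for column in range(Lcol):
        if row == 0 and column == 0:
            matA[i, i + 1] = -k * (delY / delX)
            matA[i, i + Lcol] = -k * (delX / delY)
            matA[i, i] = -(3 * matA[i, i + 1] + matA[i, i + Lcol])
        i = i + 1   # per the question's edit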

Value Error: Setting an array element with sequence

def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = mat.zeros((d, d))
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            Q[:, i] = sc.divide(A[:, i], V[0, i])
    return Q
The input A is an adjacency matrix in sparse format. I am getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in get_column_normalized_matrix
ValueError: setting an array element with a sequence.
The problem is that you are trying to assign a sparse matrix into a dense one, and this is not done automatically. It is rather simple to fix, though: turn the sparse column into a dense one with .todense():
import scipy.sparse as sp
import numpy.matlib as mat
import scipy as sc

def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = mat.zeros((d, d))
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            # Explicitly turn the sparse column into a dense one:
            Q[:, i] = sc.divide(A[:, i], V[0, i]).todense()
    return Q
If you instead want the output to be sparse, then you have to ensure that your output matrix Q is sparse to begin with. That can be achieved as follows:
def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = sp.csr_matrix(A)  # create sparse output matrix
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            # update sparse matrix
            Q[:, i] = sc.divide(A[:, i], V[0, i])
    return Q
As can be seen, Q is created as a copy of A. This means the same elements are non-zero in both matrices, which makes the updates efficient, since no new non-zero elements need to be inserted.
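As a side note, not part of the fix above: if a vectorized alternative is acceptable, the column normalization can also be done without the Python loop by scaling with a sparse diagonal matrix, which keeps everything sparse. A minimal sketch:
import numpy as np
import scipy.sparse as sp

def column_normalize(A):
    # Column sums as a flat array (A.sum returns a 1 x n matrix).
    col_sums = np.asarray(A.sum(axis=0)).ravel().astype(float)
    # Reciprocal of each non-zero column sum; empty columns stay untouched.
    scale = np.divide(1.0, col_sums, out=np.zeros_like(col_sums),
                      where=col_sums != 0)
    # Right-multiplying by a diagonal matrix scales each column of A.
    return A.dot(sp.diags(scale))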

scipy cdist with sparse matrices

I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.
I have the following line, when both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:
distances = sp.spatial.distance.cdist(source_matrix, target_matrix)
And I end up getting the following partial exception traceback:
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
[XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
This seems to indicate that the sparse matrices are being treated as dense NumPy arrays, which both fails and defeats the point of using sparse matrices.
Any advice?
I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.
Take two random vectors, for example:
import scipy.sparse
import sklearn.metrics.pairwise
a = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
array([[ 3.14837228]])  # example output
Or even if a is a matrix and b is a vector:
a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
array([[ 2.9864606 ],  # example output
[ 3.33862248],
[ 3.45803465],
[ 3.15453179],
...
scipy.spatial.distance does not support sparse matrices, so sklearn is the better choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances to parallelize the computation if your vectors are very large.
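For example, a sketch with made-up sizes (n_jobs=-1 uses all available cores):
import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances

a = scipy.sparse.rand(m=5000, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1000, n=100, density=0.2, format='csr')

# n_jobs=-1 splits the distance computation across all cores.
D = pairwise_distances(X=a, Y=b, metric='euclidean', n_jobs=-1)
print(D.shape)  # (5000, 1000); note the result itself is a dense array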
Hope that helps

Performing PCA on large sparse matrix by using sklearn

I am trying to apply PCA to a huge sparse matrix. The following link says that RandomizedPCA in sklearn can handle a sparse matrix in scipy sparse format:
Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim
2
>>>X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> import numpy as np
>>> print np.count_nonzero(Xpca), Xpca.size
95000 100000
which suggests that 95000 of the entries are non-zero; however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
(99481, 100000)
99481 elements are close to 0 (< 1e-15), but not exactly 0.
In short: for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem of scikit-learn's implementation; it is just the mathematical definition of their Sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start from a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
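A rough sketch of that search loop; the target ratio, starting point, step and cap below are arbitrary placeholders, and refitting from scratch each round is wasteful but keeps the sketch simple:
from sklearn.decomposition import TruncatedSVD

def fit_enough_components(X, target_ratio=0.9, start=100, step=100, max_n=1000):
    # Grow n_components until the explained variance ratio reaches the target
    # or the (memory-motivated) cap max_n is hit.
    n = start
    while True:
        clf = TruncatedSVD(n_components=n)
        clf.fit(X)
        if clf.explained_variance_ratio_.sum() >= target_ratio or n >= max_n:
            return clf
        n += step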
PCA(X) is SVD(X - mean(X)).
Even if X is a sparse matrix, X - mean(X) is always a dense matrix.
Thus, randomized SVD (TruncatedSVD) is not as efficient here as randomized SVD of a sparse matrix.
However, delayed evaluation,
delay(X - mean(X)),
can avoid expanding the sparse matrix X into the dense matrix X - mean(X).
Delayed evaluation enables efficient PCA of a sparse matrix using randomized SVD.
This mechanism is implemented in my package:
https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism:
https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons with existing methods show that this mechanism drastically reduces the required memory size:
https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent:
https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
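For intuition, the general technique can be sketched with SciPy's LinearOperator (this is illustrative only, not code from the package above): the centering term is applied on the fly inside the matrix-vector products, so the dense matrix X - mean(X) is never formed.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def centered_operator(X):
    # Represents X - mean(X) implicitly via its matvec/rmatvec:
    #   (X - 1 mu^T) v   = X v   - (mu . v) 1
    #   (X - 1 mu^T)^T v = X^T v - sum(v) mu
    mu = np.asarray(X.mean(axis=0)).ravel()   # column means, shape (n,)
    m, n = X.shape
    def matvec(v):
        v = np.ravel(v)
        return X.dot(v) - np.full(m, mu.dot(v))
    def rmatvec(v):
        v = np.ravel(v)
        return X.T.dot(v) - v.sum() * mu
    return LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=np.float64)

X = sp.rand(10000, 500, density=0.001, format='csr')
U, s, Vt = svds(centered_operator(X), k=10)  # top PCA factors; X never densified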

Scipy Sparse - distance matrix (Scikit or Scipy)

I am trying to compute nearest neighbour clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn I get an error message using 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could calculate these distance matrices.
My matrix is highly sparse with a shape of: <364402x223209 sparse matrix of type <class 'numpy.float64'>
with 728804 stored elements in Compressed Sparse Row format>.
I have also tried methods such as pdist and KDTree in SciPy, but have received other errors about not being able to process the result.
Can anyone please point me to a solution that would effectively allow me calculate the distance matrix and/or the nearest neighbour result?
Some example code:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

file = 'FileLocation'
data = []
FILE = open(file, 'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user': str(int(templine[0])), str(int(templine[1])): int(templine[2])})
FILE.close()

vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbors in scikit-learn results in a memory error using:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
First, you can't use KDTree or pdist with a sparse matrix; you have to convert it to a dense one (whether that is an option for you is your call):
>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
Second, from the docs:
Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
You might want to try 'ball_tree' algorithm and see if it can handle your data.
From your comment:
Since it is a sparse matrix, I would expect there to be solutions to intelligently calculate the distances and store the result in a similarly sparse matrix.
Basic math shows that this is only possible if your input matrix contains a massive number of duplicates, because the Euclidean distance is only zero for two exactly equal points (this is actually one of the axioms of a distance). So if you remove duplicates, this might work.
Otherwise, depending on your problem, you might be able to use sklearn.metrics.pairwise_distances_argmin_min or cosine similarity, X * X.T, which has the reverse ordering compared to Euclidean distance.
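If cosine similarity fits your problem, here is an illustrative sketch (the random input is a hypothetical stand-in for the vectorized data): after L2-normalizing the rows, X * X.T is a sparse cosine-similarity matrix, and for unit-norm rows ||u - v||^2 = 2 - 2*cos(u, v), so larger similarity means smaller Euclidean distance, hence the reverse ordering:
import scipy.sparse
from sklearn.preprocessing import normalize

# Hypothetical random input standing in for the vectorized data.
X = scipy.sparse.rand(1000, 500, density=0.01, format='csr')

Xn = normalize(X, norm='l2', axis=1)  # unit-norm rows
sims = Xn.dot(Xn.T)                   # sparse cosine-similarity matrix
print(repr(sims))                     # stays sparse where rows share no features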
