scipy cdist with sparse matrices - python

I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.
I have the following line, when both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:
distances = sp.spatial.distance.cdist(source_matrix, target_matrix)
And I end up getting the following partial exception traceback:
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
[XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
This seems to indicate that the sparse matrices are being treated as dense numpy arrays, which both fails and defeats the point of using sparse matrices.
Any advice?

I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.
Take two random vectors for example:
import scipy.sparse
import sklearn.metrics

a = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 3.14837228]]) # example output
Or even if a is a matrix and b is a vector:
a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 2.9864606 ], # example output
[ 3.33862248],
[ 3.45803465],
[ 3.15453179],
...
scipy.spatial.distance does not support sparse matrices, so sklearn is the best choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which distributes the computation if your vectors are very large.
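For very large inputs, here is a minimal sketch of the n_jobs variant (the sizes and n_jobs=-1 here are just illustrative):
import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances

a = scipy.sparse.rand(m=500, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=500, n=100, density=0.2, format='csr')
D = pairwise_distances(X=a, Y=b, metric='euclidean', n_jobs=-1)  # -1 uses all cores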
Hope that helps

Related

Using a Sparse Matrix with sklearn Affinity Propagation

I am having problems using a scipy COO sparse matrix as the input for Affinity Propagation, but it works perfectly fine with a numpy array.
Just an example, say my similarity matrix is:
[[1.0, 0.9, 0.2]
[0.9, 1.0, 0.0]
[0.2, 0.0, 1.0]]
Numpy matrix version
import numpy as np
import sklearn.cluster
simnp = np.array([[1,0.9,0.2],[0.9,1,0],[0.2,0,1]])
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed")
affprop.fit(simnp)
works as expected.
Sparse Matrix version
import scipy.sparse as sps
import sklearn.cluster
simsps = sps.coo_matrix(([1,1,1,0.9,0.9,0.2,0.2],([0,1,2,0,1,0,2],[0,1,2,1,0,2,0])),(3,3))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed")
affprop.fit(simsps)
returns the following error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python\Python27\lib\site-packages\sklearn\cluster\affinity_propagation_.py", line 301, in fit
copy=self.copy, verbose=self.verbose, return_n_iter=True)
File "C:\Python\Python27\lib\site-packages\sklearn\cluster\affinity_propagation_.py", line 90, in affinity_propagation
preference = np.median(S)
File "C:\Python\Python27\lib\site-packages\numpy\lib\function_base.py", line 3084, in median
overwrite_input=overwrite_input)
File "C:\Python\Python27\lib\site-packages\numpy\lib\function_base.py", line 2997, in _ureduce
r = func(a, **kwargs)
File "C:\Python\Python27\lib\site-packages\numpy\lib\function_base.py", line 3158, in _median
return mean(part[indexer], axis=axis, out=out)
File "C:\Python\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2878, in mean
out=out, keepdims=keepdims)
File "C:\Python\Python27\lib\site-packages\numpy\core\_methods.py", line 70, in _mean
ret = ret.dtype.type(ret / rcount)
ValueError: setting an array element with a sequence.
My laptop does not have enough RAM to hold a dense matrix, hence my wanting to use a sparse one.
What am I doing wrong?
Thanks.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html
fit(X, y=None)
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples)
predict(X)
Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
fit(X, y=None)
Parameters:
X : array-like or sparse matrix, shape (n_samples, n_features)
So some of the methods do accept a sparse matrix, but AffinityPropagation.fit does not make that claim. Is that a documentation omission, or an indication that it does not work with a sparse matrix? Your error indicates the latter - for one reason or another, it has not been adapted to work with sparse matrices.
I'm not a user of scikit-learn, but I have answered a few questions about sparse matrices in that package. My impression is that sparse handling is relatively new, and that in some cases they have to use todense() to turn the sparse matrices back into dense ones.
Like I wrote in my comment, numpy code, by itself, does not handle sparse matrices correctly. It only works if it delegates the action to sparse methods. It appears that np.median and np.mean do not properly delegate to sparse.coo_matrix.mean.
Try:
np.median(simsps)
np.mean(simsps)
simsps.mean()
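To make that experiment concrete, here is a runnable version with the outcomes noted (exact behaviour may vary across numpy/scipy versions; the median failure matches the traceback above):
import numpy as np
import scipy.sparse as sps

simsps = sps.coo_matrix(([1,1,1,0.9,0.9,0.2,0.2],([0,1,2,0,1,0,2],[0,1,2,1,0,2,0])),(3,3))
print(simsps.mean())  # the sparse method itself works: ~0.578
np.median(simsps)     # raises ValueError: setting an array element with a sequence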
An update on the current status of sklearn (June 2019) may be useful.
Already at the time of the original question there was a fix for an issue reporting that AffinityPropagation was not working with sparse matrices. Recently (May 2019) it was reported again that AffinityPropagation does not work with sparse matrices.
The summary is:
fit works with a sparse matrix only if affinity is not precomputed but euclidean (since it calls sklearn.metrics.euclidean_distances, which handles sparse matrices). This actually gives no advantage in terms of memory consumption.
if affinity is precomputed, fit will not work with a sparse matrix; the current blocking line of code seems to be the computation of the median.
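For completeness, a minimal sketch of the path that does work (assuming a sklearn version with the fix above); note that the affinity matrix built internally is still dense, so no memory is saved:
import scipy.sparse as sps
import sklearn.cluster

X = sps.rand(50, 10, density=0.3, format='csr')  # toy sparse feature matrix
affprop = sklearn.cluster.AffinityPropagation(affinity='euclidean')
affprop.fit(X)  # accepted: euclidean_distances handles sparse input
print(affprop.labels_)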

Performing PCA on large sparse matrix by using sklearn

I am trying to apply PCA to a huge sparse matrix; the following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format.
Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim
2
>>>X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check it with a quick example:
>>> import numpy as np
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
which suggests that 95000 of the entries are non-zero, however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (<1e-15), but not 0.
Which means, in short, that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem with scikit-learn's implementation; it is simply the mathematical definition of their sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start from a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of the variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
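For instance, a rough sketch of that search (the sizes and component counts are illustrative only):
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.rand(1000, 1000, density=0.01, format='csr')
for k in (10, 50, 100, 200):
    clf = TruncatedSVD(n_components=k)
    clf.fit(X)
    # stop growing k once this is high enough for your memory budget
    print(k, clf.explained_variance_ratio_.sum())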
PCA(X) is SVD(X - mean(X)).
Even if X is a sparse matrix, X - mean(X) is always a dense matrix.
Thus, randomized SVD (TruncatedSVD) of it is not as efficient as randomized SVD of a sparse matrix.
However, delayed evaluation,
delay(X - mean(X)),
can avoid expanding the sparse matrix X into the dense matrix X - mean(X).
Delayed evaluation enables efficient PCA of a sparse matrix using the randomized SVD.
This mechanism is implemented in my package:
https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism:
https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons to existing methods show this mechanism drastically reduces the required memory size:
https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent:
https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
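As an independent illustration of the same idea (this is not the package's code, just a sketch using scipy.sparse.linalg.LinearOperator), X - mean(X) can be represented implicitly so that svds never materializes the dense centered matrix:
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import LinearOperator, svds

X = sps.rand(10000, 500, density=0.001, format='csr')
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_features,)

def matvec(v):
    v = np.ravel(v)
    return X.dot(v) - np.full(X.shape[0], float(np.dot(mu, v)))  # (X - 1*mu^T) v

def rmatvec(u):
    u = np.ravel(u)
    return X.T.dot(u) - mu * u.sum()  # (X - 1*mu^T)^T u

Xc = LinearOperator(X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
U, s, Vt = svds(Xc, k=10)  # top 10 components; the sparse X is never densified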

Python: L1-norm of a sparse non-square matrix

I have a problem while trying to compute the 1-norm of a sparse matrix. I am using the function scipy.sparse.linalg.onenormest, but it gives me an error because the operator can act only on a square matrix.
Here is a code example:
from numpy import array
from scipy import sparse
from scipy.sparse.linalg import onenormest

row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
onenormest(A)
This is the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\linalg\_onenormest.py", line 76, in onenormest
raise ValueError('expected the operator to act like a square matrix')
ValueError: expected the operator to act like a square matrix
The operator onenormest works if I define A as a square matrix, but this is not what I want.
Anyone knows how to calculate the 1-norm of a sparse non-square matrix?
I think that you want numpy.linalg.norm instead:
from numpy import array, linalg
from scipy import sparse

row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print linalg.norm(A.todense(), ord=1) #15
It does not work to call linalg.norm on A.data, since the .data attribute of a sparse matrix is just the flat array of stored values - it is treated as a vector rather than a matrix.
If your sparse matrix is only small, then this is fine. If it is large, then obviously this is a problem. In which case, you can write your own routine.
If you are only interested in the L^1-norm, and casting to dense is not possible, then you could do it via something like this:
import numpy
sparseL1Norm = lambda A: max(numpy.abs(A).getcol(i).sum() for i in range(A.shape[1]))
This finds the L1-norm of each column:
from scipy import sparse
import numpy as np
row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6]) # made negative to exercise abs
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(abs(A).sum(axis=0))
yields
[[ 3 3 15]]
You could then take the max to find the L1-norm of the matrix:
print(abs(A).sum(axis=0).max())
# 15
abs(A) is a sparse matrix:
In [29]: abs(A)
Out[29]:
<5x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>
and sum and max are methods of the sparse matrix, so abs(A).sum(axis=0).max() computes the L1-norm without densifying the matrix.
Note: Most NumPy functions (such as np.abs) are not designed to work with sparse matrices. Although np.abs(A) returns the correct result, it arrives there through an indirect route. The more direct route is to use abs(A), which calls A.__abs__(). Thanks to pv. for pointing this out.
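Also, newer SciPy versions (0.16+, to the best of my recollection - treat the version as an assumption) provide scipy.sparse.linalg.norm, which computes the ord=1 matrix norm (max absolute column sum) directly on the sparse matrix:
import numpy as np
from scipy import sparse
from scipy.sparse import linalg as spla

row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(spla.norm(A, 1))  # 15, computed without densifying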

Scipy Sparse - distance matrix (Scikit or Scipy)

I am trying to compute nearest neighbour clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn I get an error message using 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could calculate these distance matrices.
My matrix is highly sparse with a shape of: <364402x223209 sparse matrix of type <class 'numpy.float64'>
with 728804 stored elements in Compressed Sparse Row format>.
I have also tried methods such as pdist and kdtree in scipy, but have received other errors about not being able to process the input.
Can anyone please point me to a solution that would effectively allow me calculate the distance matrix and/or the nearest neighbour result?
Some example code:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbor in scikit-learn results in a memory error using:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
First, you can't use KDTree or pdist with a sparse matrix; you have to convert it to dense (whether that is an option for you is your call):
>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
Second, from the docs:
Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
You might want to try 'ball_tree' algorithm and see if it can handle your data.
From your comment:
Since it is a sparse matrix, I would expect there to be solutions to intelligently calculate the distances and store the result in a similarly sparse matrix.
Basic math shows that this is only possible if your input matrix contains a massive number of duplicates, because the Euclidean distance between two points is zero only when the points are exactly equal (this is actually one of the axioms of a distance). So if you remove duplicates, this might work.
Otherwise, depending on your problem, you might be able to use sklearn.metrics.pairwise_distances_argmin_min or cosine similarity, X * X.T, which has the reverse ordering compared to Euclidean distance.
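A minimal sketch of the cosine route (the sizes are illustrative): after L2-normalizing the rows, the similarity matrix is just a sparse matrix product, and for unit vectors a higher cosine similarity corresponds to a smaller Euclidean distance:
import scipy.sparse as sps
from sklearn.preprocessing import normalize

X = sps.rand(1000, 500, density=0.01, format='csr')
Xn = normalize(X)    # L2-normalize each row; stays sparse
S = Xn.dot(Xn.T)     # sparse similarity matrix, S[i,j] = cos(x_i, x_j)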

Python - sparse vectors/distance calculation

I'm looking for dynamically growing vectors in Python, since I don't know their length in advance. In addition, I would like to calculate distances between these sparse vectors, preferably using the distance functions in scipy.spatial.distance (although any other suggestions are welcome). Any ideas how to do this? (Initially, it doesn't need to be efficient.)
Thanks a lot in advance!
You can use regular python lists (which are dynamic) as vectors. Trivial example follows.
from scipy.spatial.distance import sqeuclidean
a = [1,2,3]
b = [0,0,0]
print sqeuclidean(a,b) # 14
As per aganders3's suggestion, do note that you can also use numpy arrays if needed:
import numpy
a = numpy.array([1,2,3])
If the sparse part of your question is crucial, I'd use scipy for that - it has support for sparse matrices. You can define a 1xn matrix and use it as a vector. This works (the parameter is the size of the matrix, filled with zeroes by default):
import scipy.sparse
sqeuclidean(scipy.sparse.coo_matrix((1,3)),scipy.sparse.coo_matrix((1,3))) # 0
There are many kinds of sparse matrices, some dictionary-based. You can define a row sparse matrix from a list like this:
scipy.sparse.csr_matrix([1,2,3])
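If you want to stay sparse end to end, the squared Euclidean distance can also be computed with sparse operations only (a small sketch, not using scipy.spatial at all):
import scipy.sparse as sps

a = sps.csr_matrix([[1, 2, 3]])
b = sps.csr_matrix([[0, 0, 0]])
d = a - b
print(d.multiply(d).sum())  # 14, matching sqeuclidean on the dense lists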
Here is how you can do it in numpy:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
c = np.sum(((a - b) ** 2)) # 14
