def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = mat.zeros((d, d))
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            Q[:, i] = sc.divide(A[:, i], V[0, i])
    return Q
Input A is an adjacency matrix in sparse format. I am getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in get_column_normalized_matrix
ValueError: setting an array element with a sequence.
The problem you have is that you are trying to assign a sparse matrix into a dense one. This is not done automatically. It is rather simple to fix, though, by turning the sparse matrix into a dense one, using .todense():
import scipy.sparse as sp
import numpy.matlib as mat
import scipy as sc

def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = mat.zeros((d, d))
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            # Explicitly turn the sparse result into a dense one:
            Q[:, i] = sc.divide(A[:, i], V[0, i]).todense()
    return Q
If you instead want the output to be sparse, then you have to ensure that your output matrix Q is sparse to begin with. That can be achieved as follows:
def get_column_normalized_matrix(A):
    d = sp.csr_matrix.get_shape(A)[0]
    Q = sp.csr_matrix(A)  # Create the sparse output matrix as a copy of A
    V = mat.zeros((1, d))
    sp.csr_matrix.sum(A, axis=0, dtype='int', out=V)
    for i in range(0, d):
        if V[0, i] != 0:
            # Update the sparse matrix column by column
            Q[:, i] = sc.divide(A[:, i], V[0, i])
    return Q
As can be seen, Q is created as a copy of A. This means the same elements are non-zero in both matrices, which ensures efficient updating, since no new elements need to be added to the sparsity structure.
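For reference, the same column normalization can also be written without the explicit loop, by rescaling with a diagonal matrix; a minimal sketch, assuming a scipy version that provides sp.diags:

import numpy as np
import scipy.sparse as sp

def column_normalize(A):
    # Column sums as a flat array.
    colsum = np.asarray(A.sum(axis=0)).ravel()
    # Reciprocal of each non-empty column sum; empty columns stay zero.
    scale = np.zeros_like(colsum, dtype=float)
    nonzero = colsum != 0
    scale[nonzero] = 1.0 / colsum[nonzero]
    # Right-multiplying by a diagonal matrix rescales each column and keeps the result sparse.
    return A.dot(sp.diags(scale))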
I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.
I have the following line, where both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:
distances = sp.spatial.distance.cdist(source_matrix, target_matrix)
And I end up getting the following partial exception traceback:
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
[XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
This seems to indicate that the sparse matrices are being treated as dense NumPy matrices, which both fails and defeats the point of using sparse matrices.
Any advice?
I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.
Take two random vectors, for example:
a = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 3.14837228]]) # example output
Or even if a is a matrix and b is a vector:
a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 2.9864606 ], # example output
[ 3.33862248],
[ 3.45803465],
[ 3.15453179],
...
Scipy's spatial.distance does not support sparse matrices, so sklearn would be the best choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which distributes the computation if your vectors are very large.
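For example, a quick sketch of the parallel version (n_jobs=-1 is sklearn's convention for using all processors; the shapes here are just illustrative):

import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances

a = scipy.sparse.rand(m=500, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=300, n=100, density=0.2, format='csr')

# Distribute the distance computation across all available cores.
distances = pairwise_distances(X=a, Y=b, metric='euclidean', n_jobs=-1)  # shape (500, 300)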
Hope that helps
I'm trying to raise all the values in a VxK matrix beta to the power of the corresponding values in a Vx1 column that is part of a dense VxN matrix. That is, each value in beta should be raised to the power of the corresponding row entry of the column, and this should be done for all K columns of beta. When I use np.power on a practice NumPy array for beta using:
np.power(head_beta.T, head_matrix[:,0])
I am able to obtain the results I want. The dimensions are (3, 10) for beta and (10,) for head_matrix[:,0] where in this case 3=K and 10=V.
However, if I do this on my actual matrix, which was obtained by using
matrix=csc_matrix((data,(row,col)), shape=(30784,72407) ).todense()
where data, row, and col are arrays, I am unable to do the same operation:
np.power(beta.T, matrix[:,0])
Here the dimensions are (10, 30784) for beta and (30784, 1) for matrix, where in this case 10=K and 30784=V. I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-29-9f55d4cb9c63> in <module>()
----> 1 np.power(beta.T, matrix[:,0])
ValueError: operands could not be broadcast together with shapes (10,30784) (30784,1)
It seems that the difference is that matrix is a matrix of shape (length, 1), while head_matrix is actually a NumPy array of shape (length,) that I created. How can I do this same operation with the column of a dense matrix?
In the problem case it can't broadcast (10,30784) and (30784,1). As you note it works when (10,N) is used with (N,). That's because it can expand the (N,) to (1,N) and on to (10,N).
M = sparse.csr_matrix(...).todense()
is an np.matrix, which is always 2d, so M[:,0] is (N,1). There are several solutions.
np.power(beta.T, M[:,0].T) # change to a (1,N)
np.power(beta, M[:,0]) # line up the expandable dimensions
convert the sparse matrix to an array:
A = sparse.....toarray()
np.power(beta.T, A[:,0])
M[:,0].squeeze() and M[:,0].ravel() both produce a (1,N) matrix. So does M[:,0].reshape(-1). That 2d quality is persistent, as long as it returns a matrix.
M[:,0].A1 produces a (N,) array
From a while back: Numpy matrix to array
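A small self-contained sketch of these options, with toy shapes standing in for K and V (all names here are illustrative):

import numpy as np
from scipy import sparse

V, K, N = 5, 3, 4
beta = np.random.rand(V, K)
M = sparse.rand(V, N, density=0.5, format='csc').todense()  # np.matrix, so M[:,0] is (V,1)

r1 = np.power(beta.T, M[:, 0].T)   # (K,V) with (1,V): broadcasts
r2 = np.power(beta, M[:, 0])       # (V,K) with (V,1): broadcasts
r3 = np.power(beta.T, M[:, 0].A1)  # (K,V) with (V,): broadcasts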
You can use the squeeze method on arrays to get rid of this extra dimension.
So
np.power(beta.T, matrix[:,0].squeeze()) should do the trick.
I am trying to apply PCA to a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle a sparse matrix in scipy sparse format:
Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim
2
>>>X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
>>> import numpy as np
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
which suggests that 95000 of the entries are non-zero, however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (<1e-15), but not 0.
Which means, in short, that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem of scikit-learn's implementation; it is just the mathematical definition of their sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the amount of data you can keep in memory and the percentage of variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
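A hedged sketch of that loop (the component counts and the variance target are arbitrary assumptions for illustration):

from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.rand(10000, 2000, density=0.0001, format='csr')

# Grow the number of components until enough variance is explained
# (or until the dense output no longer fits in memory).
for n_components in (10, 50, 100, 200):
    clf = TruncatedSVD(n_components)
    Xpca = clf.fit_transform(X)
    explained = clf.explained_variance_ratio_.sum()
    print(n_components, explained)
    if explained > 0.95:  # arbitrary target
        break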
PCA(X) is SVD(X - mean(X)).
Even if X is a sparse matrix, X - mean(X) is always a dense matrix.
Thus, a randomized SVD (TruncatedSVD) of X - mean(X) is not as efficient as a randomized SVD of a sparse matrix.
However, delayed evaluation,
delay(X - mean(X)),
can avoid expanding the sparse matrix X into the dense matrix X - mean(X).
Delayed evaluation enables efficient PCA of a sparse matrix using the randomized SVD.
This mechanism is implemented in my package:
https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism:
https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons with existing methods show that this mechanism drastically reduces the required memory size:
https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent:
https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
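The underlying idea can also be sketched with scipy's LinearOperator: represent X - mean(X) implicitly through its matrix-vector products and hand it to a truncated SVD, so the centered matrix is never materialized (an illustrative sketch, not the package's actual code):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

X = sparse.rand(1000, 200, density=0.01, format='csr')
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_features,)
ones = np.ones(X.shape[0])

# (X - 1 mu^T) v   = X v - (mu . v) 1
# (X - 1 mu^T)^T v = X^T v - mu * sum(v)
centered = LinearOperator(
    X.shape,
    matvec=lambda v: X.dot(v.ravel()) - mu.dot(v.ravel()) * ones,
    rmatvec=lambda v: X.T.dot(v.ravel()) - mu * v.sum(),
    dtype=np.float64,
)

U, s, Vt = svds(centered, k=10)  # truncated SVD of the centered data, i.e. PCA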
I have a problem when trying to compute the 1-norm of a sparse matrix. I am using the function scipy.sparse.linalg.onenormest, but it gives me an error because the operator can act only on square matrices.
Here is a code example:
from numpy import array
from scipy import sparse
from scipy.sparse.linalg import onenormest

row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
onenormest(A)
This is the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\linalg\_onenormest.py", line 76, in onenormest
raise ValueError('expected the operator to act like a square matrix')
ValueError: expected the operator to act like a square matrix
The operator onenormest works if I define A as a square matrix, but this is not what I want.
Does anyone know how to calculate the 1-norm of a sparse non-square matrix?
I think that you want numpy.linalg.norm instead:
from numpy import array, linalg
from scipy import sparse

row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print linalg.norm(A.todense(), ord=1)  # 15
Note that it does not work to pass A.data, since the .data attribute of a sparse matrix is just the stored values as a flat vector, not the matrix itself.
If your sparse matrix is only small, then this is fine. If it is large, then obviously this is a problem. In which case, you can write your own routine.
If you are only interested in the L^1-norm, and casting to dense is not possible, then you could do it via something like this:
import numpy
sparseL1Norm = lambda A: max([numpy.abs(A).getcol(i).sum() for i in range(A.shape[1])])
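Applied to the example matrix A above, a quick check against the dense computation:

print sparseL1Norm(A)  # 15, matching linalg.norm(A.todense(), ord=1)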
This finds the L1-norm of each column:
from scipy import sparse
import numpy as np
row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6]) # made negative to exercise abs
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(abs(A).sum(axis=0))
yields
[[ 3 3 15]]
You could then take the max to find the L1-norm of the matrix:
print(abs(A).sum(axis=0).max())
# 15
abs(A) is a sparse matrix:
In [29]: abs(A)
Out[29]:
<5x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>
and sum and max are methods of the sparse matrix, so abs(A).sum(axis=0).max() computes the L1-norm without densifying the matrix.
Note: Most NumPy functions (such as np.abs) are not designed to work with sparse matrices. Although np.abs(A) returns the correct result, it arrives there through an indirect route. The more direct route is to use abs(A), which calls A.__abs__(). Thanks to pv. for pointing this out.
I am trying to compute nearest neighbour clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn I get an error message using 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could calculate these distance matrices.
My matrix is highly sparse with a shape of: <364402x223209 sparse matrix of type <class 'numpy.float64'>
with 728804 stored elements in Compressed Sparse Row format>.
I have also tried methods such as pdist and kdtree in Scipy but have received other errors of not being able to process the result.
Can anyone please point me to a solution that would effectively allow me to calculate the distance matrix and/or the nearest neighbour result?
Some example code:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbors in scikit-learn results in a memory error using:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
First, you can't use KDTree or pdist with a sparse matrix; you have to convert it to dense first (whether that is an option is your call):
>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
Second, from the docs:
Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
You might want to try 'ball_tree' algorithm and see if it can handle your data.
From your comment:
Since it is a sparse matrix, I would expect there to be solutions to intelligently calculate the distances and store the result in a similarly sparse matrix.
Basic math shows that this is only possible if your input matrix contains a massive number of duplicates, because the Euclidean distance is only zero for two exactly equal points (this is actually one of the axioms of a distance metric). So if you remove duplicates, this might work.
Otherwise, depending on your problem, you might be able to use sklearn.metrics.pairwise_distances_argmin_min or cosine similarity, X * X.T, which has the reverse ordering compared to Euclidean distance.
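For instance, a minimal sketch of the cosine-similarity route, which stays sparse end to end (normalize is sklearn's row-normalization helper; the shapes are illustrative):

import scipy.sparse
from sklearn.preprocessing import normalize

X = scipy.sparse.rand(1000, 500, density=0.01, format='csr')

# L2-normalize the rows so that X * X.T contains cosine similarities.
Xn = normalize(X, norm='l2')
S = Xn * Xn.T  # sparse similarity matrix; larger values mean closer points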