Python: computing pairwise distances causes memory error

I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.
from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist should return a condensed numpy array with shape (57832*57831/2,).
However, this causes a memory error.
Traceback (most recent call last):
File "/home/munichong/git/DomainClassification/NameSuggestion#Verisign/classification_DMOZ/main.py", line 101, in <module>
result_clustering = clf_clustering.getCVResult(shuffle)
File "/home/munichong/git/DomainClassification/NameSuggestion#Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
self.centroids_of_categories(X_train, y_train)
File "/home/munichong/git/DomainClassification/NameSuggestion#Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
File "/home/munichong/git/DomainClassification/NameSuggestion#Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
distance_dict, rho_dict = self.compute_distances_and_rhos(X)
File "/home/munichong/git/DomainClassification/NameSuggestion#Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
pairwise_distances = pdist(X, 'cosine')
File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
My laptop has 16 GB of RAM. How should I fix this? Or is there a better way?

Matrix-based algorithms become prohibitive on large data sets.
The memory requirements are straightforward to estimate. Even when exploiting symmetry, many implementations max out at about 65000 instances, and even 64-bit implementations on big machines eventually give up: a 1000000x1000000 matrix in double precision, exploiting symmetry, needs 4 TB of RAM.
Use better algorithms that don't need O(n^2) memory and runtime.
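For a rough sense of scale, here is a minimal sketch (not from the answer above; it uses scikit-learn's pairwise_distances_chunked as one illustrative alternative) that estimates the size of the condensed distance array and then reduces the distances chunk by chunk instead of materializing them all at once:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked
n = 57832
condensed_bytes = n * (n - 1) // 2 * 8  # float64 entries in the condensed form
print(condensed_bytes / 1e9, "GB")      # roughly 13 GB for 57832 vectors
X = np.random.rand(n, 200)              # stand-in for the real data
for chunk in pairwise_distances_chunked(X, metric='cosine', working_memory=1024):
    # `chunk` holds the cosine distances for a block of rows; reduce it here
    # (thresholding, nearest neighbours, ...) instead of storing everything
    pass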

Related

What is the rank list in Tucker decomposition?

I want to decompose a 4D tensor using Tucker decomposition in Python. I found a library, tensorly, to do this.
I only want to perform the decomposition on the first and second dimensions. To perform Tucker decomposition on only some modes (not all of them) with tensorly, I have to use the partial_tucker function. This is my code:
import numpy as np
import tensorly as tl
from tensorly.decomposition import partial_tucker

F = 256
D = 96
h = 5
w = 6
ranks = [89, 48]
modes = [0, 1]
tensor = tl.tensor((np.arange(F*D*h*w).reshape((F, D, h, w))).astype(np.float64))
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
This code works well, but when I am trying to change the rank list, for example:
ranks = [3,4]
I get an error as follows:
Traceback (most recent call last):
File "D:\PhD_Thessaloniki\Codes\LRF_Convolutional\tucker-decomposition.py", line 49, in <module>
core, factors = partial_tucker(tensor, modes=modes, rank=ranks)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\decomposition\_tucker.py", line 109, in partial_tucker
eigenvecs, _, _ = svd_fun(unfold(core_approximation, mode), n_eigenvecs=rank[index], random_state=random_state)
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\tensorly\backend\core.py", line 913, in partial_svd
S, V = scipy.sparse.linalg.eigsh(
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 1689, in eigsh
params.iterate()
File "C:\Users\Milad\Anaconda3\envs\tensorly\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py", line 571, in iterate
raise ArpackError(self.info, infodict=self.iterate_infodict)
scipy.sparse.linalg._eigen.arpack.arpack.ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.
I don't know whether there is a constraint on how the rank list can be defined in Tucker decomposition, but when I try to perform the decomposition on only one dimension, for example:
ranks = [3]
modes = [0]
or
ranks = [4]
modes = [1]
it works well again.
I want to know:
Is this an algorithmic constraint or a problem in the code (tensorly)?
What exactly is the problem?
What rank lists are valid?
Thanks
Tucker decomposition relies on higher-order PCA. The error you are seeing occurs in the sparse SVD applied to an unfolding of the main tensor.
You can try a different SVD function (the svd parameter of partial_tucker); you can see the available options using tensorly.tenalg.svd.SVD_FUNS.
You might also want to test with a tensor of random elements, using tensorly.random.random_tensor or tensorly.randn.
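A hedged sketch of those suggestions (the exact option names depend on the tensorly version, so check SVD_FUNS as mentioned above; 'numpy_svd' is used here only as an illustrative non-ARPACK backend):
import numpy as np
import tensorly as tl
from tensorly.decomposition import partial_tucker
tensor = tl.tensor(np.random.randn(256, 96, 5, 6))  # random test tensor
core, factors = partial_tucker(tensor, modes=[0, 1], rank=[3, 4],
                               svd='numpy_svd')     # pick a backend listed in SVD_FUNS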

euclidean distance calculation using Python and Dask

I'm attempting to identify elements in the Euclidean distance matrix that fall under a certain threshold. I then take the positional arguments from this search and use them to compare elements in a second array (for the sake of demonstration this array is the first eigenvector of a PCA, but the sort is the most relevant part for my question). The application needs to handle an unknown number of observations, but should run effectively on several million.
import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform(1, 100, (5000, 3))  # illustrative shape: 5000 observations, 3 features
searchValues = np.where(cdist(data, data) < threshold)
My problem is twofold.
Firstly, the Euclidean distance matrix quickly becomes too large to compute with a single call to scipy.spatial.distance.cdist(). To solve this issue I apply cdist in batches over the dataset and implement the search iteratively.
cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
The second problem is a runtime issue that results from constructing the distance matrix iteratively. When I use my iterative approach the runtime increases exponentially. This isn't unexpected given the nature of the iterative approach.
import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
    searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
Neither of these issues is unexpected given the nature of the problem. To try to offset both problems I've tried using Dask to implement both a large-data framework in Python and parallelization of the batch process. However, this hasn't resulted in a significant improvement in the computation time, and I have a pretty strict memory limitation with this iterative method in Dask (it requires taking in batches of only 1000 observations at a time).
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag
#dask.delayed
def eucDist(comparison):
    return da.asarray(cdist(comparison[0], comparison[1]))

#dask.delayed
def findValues(euclideanMatrix):
    return np.where(euclideanMatrix < threshold)

start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
    comp = dask.delayed(eucDist)(comparison)
    test.append(comp)

look = []
with ProgressBar():
    for element in test:
        look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in Python. Any help with that, or any recommendations for how I can improve the initial comparison code, would be appreciated.
You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).
I believe that the dask-image package has some dask-enabled distance algorithms.
https://github.com/dask/dask-image
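Independently of dask_distance, a plain NumPy/SciPy sketch of the same blocked idea may help: compute the distance matrix block by block, threshold each block immediately, and keep only the indices, so the full n x n matrix never exists in memory (the block size below is an arbitrary illustrative choice):
import numpy as np
from scipy.spatial.distance import cdist

def close_pairs(data, threshold, block=10000):
    # returns the same indices as np.where(cdist(data, data) < threshold),
    # without ever building the full distance matrix
    n = data.shape[0]
    rows, cols = [], []
    for i in range(0, n, block):
        for j in range(0, n, block):
            d = cdist(data[i:i+block], data[j:j+block])
            r, c = np.where(d < threshold)
            rows.append(r + i)
            cols.append(c + j)
    return np.concatenate(rows), np.concatenate(cols)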

trying to do ward clustering on n by n matrix in scipy

I have a similarity score between 0 and 1 for each entry to every other entry in a 100 by 100 matrix. So e.g. [0,0] would be 1, [0,1] might be .54, etc. The similarity scores were generated using the Jensen-Shannon divergence.
I want to use ward clustering (but am open to other suggestions) to cluster these together and I tried the following code:
x = np.array(mylist)
print x.shape  # (100, 100)
clustering = scipy.cluster.hierarchy.ward(x)
scipy.cluster.hierarchy.dendrogram(clustering)
but I am getting the error:
Traceback (most recent call last):
File "C:/Python27/ward.py", line 38, in <module>
clustering = scipy.cluster.hierarchy.ward(y)
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 465, in ward
return linkage(y, method='ward', metric='euclidean')
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 620, in linkage
y = _convert_to_double(np.asarray(y, order='c'))
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 928, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Do I need to do some transformation on my array first or use some other method?
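One common approach, sketched below under the assumption that mylist really is a well-formed 100 by 100 list of floats (the "setting an array element with a sequence" error usually means the rows have unequal lengths), is to convert the similarity matrix into a distance matrix and pass its condensed form to linkage, so the 100 by 100 array is not misread as 100 observations with 100 features:
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

sim = np.asarray(mylist, dtype=float)       # raises here if the rows are ragged
dist = 1.0 - sim                            # one simple similarity -> distance conversion
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)  # 1-D condensed distance vector
clustering = hierarchy.linkage(condensed, method='ward')
hierarchy.dendrogram(clustering)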

scipy.optimize.curve_fit() - array must not contain infs or NaNs

I am trying to fit some data to a curve in Python using scipy.optimize.curve_fit. I am running into the error ValueError: array must not contain infs or NaNs.
I don't believe either my x or y data contain infs or NaNs:
>>> x_array = np.asarray_chkfinite(x_array)
>>> y_array = np.asarray_chkfinite(y_array)
>>>
To give some idea of what my x_array and y_array look like at either end (x_array is counts and y_array is quantiles):
>>> type(x_array)
<type 'numpy.ndarray'>
>>> type(y_array)
<type 'numpy.ndarray'>
>>> x_array[:5]
array([0, 0, 0, 0, 0])
>>> x_array[-5:]
array([2919, 2965, 3154, 3218, 3461])
>>> y_array[:5]
array([ 0.9999582, 0.9999163, 0.9998745, 0.9998326, 0.9997908])
>>> y_array[-5:]
array([ 1.67399000e-04, 1.25549300e-04, 8.36995200e-05,
4.18497600e-05, -2.22044600e-16])
And my function:
>>> def func(x,alpha,beta,b):
... return ((x/1)**(-alpha) * ((x+1*b)/(1+1*b))**(alpha-beta))
...
Which I am executing with:
>>> popt, pcov = curve_fit(func, x_array, y_array)
resulting in the error stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 426, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 338, in leastsq
cov_x = inv(dot(transpose(R),R))
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 285, in inv
a1 = asarray_chkfinite(a)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
I'm guessing the error might not be with respect to my arrays, but rather an array created by scipy in an intermediate step? I've had a bit of a dig through the relevant scipy source files, but things get hairy pretty quickly when debugging the problem that way. Is there something obvious I'm doing wrong here? I've seen it casually mentioned in other questions that certain initial parameter guesses (of which I currently don't supply any explicitly) can sometimes result in these kinds of errors, but even if this is the case, it would be good to know a) why that is and b) how to avoid it.
Why it is failing
It is not your input arrays that contain NaNs or infs; rather, evaluating your objective function at some x points for some values of the parameters produces NaNs or infs. In other words, the array of values func(x, alpha, beta, b) contains NaNs or infs for some x, alpha, beta and b encountered during the optimization routine.
scipy.optimize's curve fitting uses the Levenberg-Marquardt algorithm, also called damped least squares. It is an iterative procedure: a new estimate of the optimal parameters is computed at each iteration. At some point during the optimization, the algorithm explores a region of parameter space where your function is not defined.
How to fix
1/Initial guess
The initial guess for the parameters is decisive for convergence. If the initial guess is far from the optimal solution, you are more likely to explore regions where the objective function is undefined. So, if you have a better idea of what your optimal parameters are, feeding the algorithm this initial guess may avoid the error.
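For example, a minimal sketch of passing a starting point (the values below are purely illustrative, not fitted to this data; func, x_array and y_array are as defined in the question):
from scipy.optimize import curve_fit
p0 = (1.0, 2.0, 1.0)  # hypothetical initial guess for (alpha, beta, b)
popt, pcov = curve_fit(func, x_array, y_array, p0=p0)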
2/Model
You could also modify your model so that it does not return NaNs. For parameter values where the original function func is not defined, you want the objective function to take huge values, in other words values far from the Y data being fitted.
At points where your objective function is not defined, you may return a big float, for instance AVG(Y)*10e5 with AVG the average, so that you are guaranteed to be much bigger than the average of the Y values being fitted.
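A sketch of such a guarded model, mirroring the suggestion above (the penalty value is an arbitrary choice following the AVG(Y)*10e5 idea; y_array is the data from the question):
import numpy as np

penalty = 10e5 * np.mean(y_array)  # much bigger than the Y values being fitted

def func_guarded(x, alpha, beta, b):
    x = np.asarray(x, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        y = x**(-alpha) * ((x + b) / (1 + b))**(alpha - beta)
    # replace undefined (NaN/inf) values with the large penalty
    return np.where(np.isfinite(y), y, penalty)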
Link
You could have a look at this post: Fitting data to an equation in python vs gnuplot
Your function has a negative power (x^-alpha), which is the same as (1/x)^alpha. If x is ever 0, your function will return inf and your curve fit operation will break; I'm surprised a warning/error isn't thrown earlier informing you of a divide by zero.
BTW why are you multiplying and dividing by 1?
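For completeness, a small sketch with the redundant *1 and /1 removed and the x == 0 samples dropped before fitting (since x**-alpha is undefined at zero); this is an assumption about intent, not the asker's code:
import numpy as np
from scipy.optimize import curve_fit

def func(x, alpha, beta, b):
    return x**(-alpha) * ((x + b) / (1 + b))**(alpha - beta)

mask = x_array > 0  # the first entries of x_array are zero counts
popt, pcov = curve_fit(func, x_array[mask].astype(float), y_array[mask])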
I was able to reproduce this error in python2.7 like so:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this produces the error:
This always produces the error:
/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py:58: RuntimeWarning: invalid value encountered in sqrt
return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
Traceback (most recent call last):
File "main.py", line 43, in <module>
ica()
File "main.py", line 18, in ica
ica.fit(X)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 523, in fit
self._fit(X, compute_sources=False)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 479, in _fit
compute_sources=compute_sources, return_n_iter=True)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 335, in fastica
W, n_iter = _ica_par(X1, **kwargs)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 108, in _ica_par
- g_wtx[:, np.newaxis] * W)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 55, in _sym_decorrelation
s, u = linalg.eigh(np.dot(W, W.T))
File "/usr/lib64/python2.7/site-packages/scipy/linalg/decomp.py", line 297, in eigh
a1 = asarray_chkfinite(a)
File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 613, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
Solution:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
#this is a column wise normalization function which flattens the
#two dimensional array from very large and very small numbers to
#reasonably sized numbers between roughly -1 and 1
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this works correctly.
Why does that normalization step fix the error?
I found the eureka moment here: sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"
What I think is happening is that numpy is being fed gigantic numbers and very tiny numbers, and inside its tiny brain it's creating NaNs and infs. So it's a bug in sklearn. The workaround is to normalize your input data so that there are no very large or very small numbers.
Bad sklearn! NO biscuit!

Create a sparse diagonal matrix from row of a sparse matrix

I process rather large matrices in Python/SciPy. I need to extract rows from a large matrix (loaded as a coo_matrix) and use them as diagonal elements. Currently I do that in the following fashion:
import numpy as np
from scipy import sparse
import profile

def computation(A):
    for i in range(A.shape[0]):
        diag_elems = np.array(A[i,:].todense())
        ith_diag = sparse.spdiags(diag_elems, 0, A.shape[1], A.shape[1], format="csc")
        #...

#create some random matrix
A = (sparse.rand(1000, 100000, 0.02, format="csc")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
What I see from the profile output is that most of the time is consumed by the get_csr_submatrix function while extracting diag_elems. That makes me think I am either using an inefficient sparse representation of the initial data or extracting rows from a sparse matrix the wrong way. Can you suggest a better way to extract a row from a sparse matrix and represent it in diagonal form?
EDIT
The following variant removes the bottleneck from the row extraction (notice that simply changing 'csc' to 'csr' is not sufficient; A[i,:] must be replaced with A.getrow(i) as well). However, the main question is how to omit the materialization (.todense()) and create the diagonal matrix from the sparse representation of the row.
import numpy as np
from scipy import sparse

def computation(A):
    for i in range(A.shape[0]):
        diag_elems = np.array(A.getrow(i).todense())
        ith_diag = sparse.spdiags(diag_elems, 0, A.shape[1], A.shape[1], format="csc")
        #...

#create some random matrix
A = (sparse.rand(1000, 100000, 0.02, format="csr")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
If I create the DIAgonal matrix from a 1-row CSR matrix directly, as follows:
diag_elems = A.getrow(i)
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1])
then I can neither specify the format="csc" argument nor convert ith_diag to CSC format:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/profile.py", line 70, in run
prof = prof.run(statement)
File "/usr/local/lib/python2.6/profile.py", line 456, in run
return self.runctx(cmd, dict, dict)
File "/usr/local/lib/python2.6/profile.py", line 462, in runctx
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "<stdin>", line 4, in computation
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/construct.py", line 56, in spdiags
return dia_matrix((data, diags), shape=(m,n)).asformat(format)
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/base.py", line 211, in asformat
return getattr(self,'to' + format)()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/dia.py", line 173, in tocsc
return self.tocoo().tocsc()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/coo.py", line 263, in tocsc
data = np.empty(self.nnz, dtype=upcast(self.dtype))
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/sputils.py", line 47, in upcast
raise TypeError,'no supported conversion for types: %s' % args
TypeError: no supported conversion for types: object
Here's what I came up with:
def computation(A):
    for i in range(A.shape[0]):
        idx_begin = A.indptr[i]
        idx_end = A.indptr[i+1]
        row_nnz = idx_end - idx_begin
        diag_elems = A.data[idx_begin:idx_end]
        diag_indices = A.indices[idx_begin:idx_end]
        ith_diag = sparse.csc_matrix((diag_elems, (diag_indices, diag_indices)), shape=(A.shape[1], A.shape[1]))
        ith_diag.eliminate_zeros()
The Python profiler reports 1.464 seconds versus 5.574 seconds before. The code takes advantage of the underlying dense arrays (indptr, indices, data) that define a sparse CSR matrix. Here's my crash course: A.indptr[i]:A.indptr[i+1] defines which elements in the dense arrays correspond to the non-zero values in row i, A.data is a dense 1-D array of the non-zero values of A, and A.indices holds the columns where those values go.
I would do some more testing to make very certain this does the same thing as before. I only checked a few cases.
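A tiny, purely illustrative sketch of that indptr/indices/data layout, using a small random CSR matrix:
import numpy as np
from scipy import sparse

A = sparse.random(5, 8, density=0.4, format="csr")
i = 2
start, end = A.indptr[i], A.indptr[i + 1]
row_values = A.data[start:end]       # non-zero values stored for row i
row_columns = A.indices[start:end]   # the columns those values sit in
assert np.allclose(A.getrow(i).toarray()[0, row_columns], row_values)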
