Python MemoryError with metric-learn

I am trying to learn a new distance metric using neighborhood component analysis (using this source).
My data matrix has shape 11000x128 (11k samples, 128 features). However, I keep getting a MemoryError; even after downsizing my input to 1000x128 I still get it. Here is my code snippet. I am running on a c4.8xlarge machine on AWS (36 cores, 60 GiB memory). Is there a way I can get around this? And how do I make sure that I am using all the available cores?
from metric_learn import NCA
num_el = 1000
nca = NCA(max_iter=1, learning_rate=0.01)
nca.fit(feats_annotated[:num_el], labels_annotated[:num_el])
MemoryError Traceback (most recent call last)
<ipython-input-5-bfd6fe47f16d> in <module>()
1 num_el = 1000
2 nca = NCA(max_iter=1, learning_rate=0.01)
----> 3 nca.fit(feats_annotated[:num_el], labels_annotated[:num_el])
/home/user/anaconda2/lib/python2.7/site-packages/metric_learn/nca.pyc in fit(self, X, labels)
34 # Run NCA
35 dX = X[:,None] - X[None] # shape (n, n, d)
---> 36 tmp = np.einsum('...i,...j->...ij', dX, dX) # shape (n, n, d, d)
37 masks = labels[:,None] == labels[None]
38 learning_rate = self.params['learning_rate']
/home/user/anaconda2/lib/python2.7/site-packages/numpy/core/einsumfunc.pyc in einsum(*operands, **kwargs)
946 # If no optimization, run pure einsum
947 if optimize_arg is False:
--> 948 return c_einsum(*operands, **kwargs)
949
950 valid_einsum_kwargs = ['out', 'dtype', 'order', 'casting']
MemoryError:
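A quick back-of-the-envelope check (based only on the shapes shown in the traceback) explains why downsizing to 1000 rows doesn't help: the einsum at line 36 materializes an (n, n, d, d) array of float64s.
import numpy as np
n, d = 1000, 128
# size of the (n, n, d, d) float64 intermediate built by the einsum above
print(n**2 * d**2 * 8 / 2**30)   # ~122 GiB, already twice the 60 GiB of RAM
With n = 1000 and d = 128 that is about 122 GiB; the full 11000x128 input would need over 14 TiB. The blow-up is in this NCA implementation, not in the data itself, and extra cores can't help, since a MemoryError is about RAM rather than CPU. Newer metric-learn releases reworked NCA to avoid materializing this intermediate, so upgrading the library is probably the practical fix (an assumption about versions newer than the one in the traceback).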

Related

gap statistics LinAlgError: Last 2 dimensions of the array must be square

I am trying to find the gap statistic of my clusters like this:
import numpy as np
from sklearn.cluster import KMeans
from gap_statistic import OptimalK
# create function
def KMeans_clustering_func(X, k):
    # Include any clustering algorithm that can return cluster centers
    m = KMeans(random_state=11, n_clusters=k)
    m.fit(X)
    # Return the location of each cluster center and the labels for each point.
    return m.cluster_centers_, m.predict(X)
# create a wrapper around OptimalK to extract cluster centers and cluster labels
optimalK = OptimalK(clusterer=KMeans_clustering_func)
# run OptimalK on the input data (subset_scaled_interim) and the candidate cluster counts
n_clusters = optimalK(X, cluster_array=np.arange(2, 21))
I encounter this error:
LinAlgError Traceback (most recent call last)
Input In [211], in <cell line: 19>()
15 optimalK = OptimalK(clusterer=KMeans_clustering_func)
17 # run optimal K on the input data (subset_scaled_interim) and number of clusters
---> 19 n_clusters = optimalK(X, cluster_array=np.arange(2,25))
20 print('Optimal clusters: ', n_clusters)
22 # Gap Statistics data frame
File ~/.local/lib/python3.8/site-packages/gap_statistic/optimalK.py:134, in OptimalK.__call__(self, X, n_refs, cluster_array)
131 engine = self._process_non_parallel
133 # Calculate the gaps for each cluster count.
--> 134 for gap_calc_result in engine(X, n_refs, cluster_array):
135
136 # Assign this loop's gap statistic to gaps
137 gap_df = gap_df.append(
138 {
139 "n_clusters": gap_calc_result.n_clusters,
(...)
147 ignore_index=True,
148 )
149 gap_df["gap_k+1"] = gap_df["gap_value"].shift(-1)
File ~/.local/lib/python3.8/site-packages/gap_statistic/optimalK.py:361, in OptimalK._process_with_joblib(self, X, n_refs, cluster_array)
356 raise EnvironmentError(
357 "joblib is not installed; cannot use joblib as the parallel backend!"
358 )
360 with Parallel(n_jobs=self.n_jobs) as parallel:
--> 361 for gap_calc_result in parallel(
362 delayed(self._calculate_gap)(X, n_refs, n_clusters)
363 for n_clusters in cluster_array
364 ):
365 yield gap_calc_result
LinAlgError: Last 2 dimensions of the array must be square
How can I fix it?
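For what it's worth, that exception text comes from NumPy's square-matrix guard (np.linalg routines such as det, inv and solve call _assert_stacked_square), so something inside _calculate_gap is being handed a 2-D array whose last two dimensions differ. A minimal illustration of where the message originates (not the fix itself):
import numpy as np
# any non-square trailing dimensions trigger the same message
np.linalg.det(np.ones((3, 2)))   # LinAlgError: Last 2 dimensions of the array must be square
One thing worth ruling out (an assumption, since X is not shown in the question) is the shape and type of X itself: make sure it reaches OptimalK as a plain 2-D numpy array, e.g. X = np.asarray(X).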

skcuda.linalg.PCA's fit_transform throws error

I am trying to run PCA (principal component analysis) on a GPU, using skcuda.linalg.PCA for that purpose, but it's not working. From their tutorial (https://scikit-cuda.readthedocs.io/en/latest/generated/skcuda.linalg.PCA.html):
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA
pca = cuPCA(n_components=4) # map the data to 4 dimensions
X = np.random.rand(1000,100) # 1000 samples of 100-dimensional data vectors
X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
X_gpu.set(X) # copy data to gpu
T_gpu = pca.fit_transform(X_gpu) # calculate the principal components
When I run it I get the following error:
cublasInternalError Traceback (most recent call last)
<ipython-input-31-02aaf0fa19e4> in <module>
8 X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
9 X_gpu.set(X) # copy data to gpu
---> 10 T_gpu = pca.fit_transform(X_gpu)
/opt/conda/lib/python3.7/site-packages/skcuda/linalg.py in fit_transform(self, X_gpu)
204 cuGemv (self.h, 'n', p, k, -1.0, P_gpu.gpudata, p, U_gpu.gpudata, 1, 1.0, P_gpu[:,k].gpudata, 1)
205
--> 206 l2 = cuNrm2(self.h, p, P_gpu[:,k].gpudata, 1)
207 cuScal(self.h, p, 1.0/l2, P_gpu[:,k].gpudata, 1)
208 cuGemv(self.h, 'n', n, p, 1.0, R_gpu.gpudata, n, P_gpu[:,k].gpudata, 1, 0.0, T_gpu[:,k].gpudata, 1)
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasDnrm2(handle, n, x, incx)
1295 n, int(x), incx,
1296 ctypes.byref(result))
-> 1297 cublasCheckStatus(status)
1298 return np.float64(result.value)
1299
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasCheckStatus(status)
177 raise cublasError
178 else:
--> 179 raise e
180
181 # Helper functions:
cublasInternalError
Initially I was running on my own data and got this error. Then I decided to run the example and got exactly the same error. Does anyone know what the problem is here? I am using a Kaggle notebook with a Tesla T4 GPU. Thanks.
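If the goal is simply PCA on a GPU rather than skcuda specifically, one workaround is RAPIDS cuML, which ships a scikit-learn-style PCA. A minimal sketch, assuming the cuml package is available in the notebook image (it is not part of skcuda):
import numpy as np
from cuml import PCA as cuPCA

X = np.random.rand(1000, 100)   # 1000 samples of 100-dimensional data vectors
pca = cuPCA(n_components=4)     # map the data to 4 dimensions
T = pca.fit_transform(X)        # computed on the GPU
print(T.shape)                  # (1000, 4)
scikit-cuda has not seen a release in several years, so a mismatch between it and the CUDA toolkit in the Kaggle image is a plausible (unverified) cause of the cublasInternalError.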

LinAlgError: 0-dimensional array given. Array must be at least two-dimensional

I am getting the above error; the message complains about a 0-dimensional array:
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-110-2e59b52b853b> in <module>()
8
9 # compute det and C_N
---> 10 const = np.sqrt(np.linalg.det((dof-2)/dof)*C_N)
11 print(const)
<__array_function__ internals> in det(*args, **kwargs)
/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in det(a)
2110 """
2111 a = asarray(a)
-> 2112 _assert_stacked_2d(a)
2113 _assert_stacked_square(a)
2114 t, result_t = _commonType(a)
/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in _assert_stacked_2d(*arrays)
205 if a.ndim < 2:
206 raise LinAlgError('%d-dimensional array given. Array must be '
--> 207 'at least two-dimensional' % a.ndim)
208
209 def _assert_stacked_square(*arrays):
LinAlgError: 0-dimensional array given. Array must be at least two-dimensional
My code is:
import numpy as np
npts = 5000
dof = 3
X_r = np.arange(npts)
product = X_r * X_r.transpose()
Rowsum = [np.sum(product[i]) for i in range(npts)]
C_N = np.sum(Rowsum)/(npts - 1)
# compute det and C_N
const = np.sqrt(np.linalg.det(((dof-2)/dof)*C_N))
print(const)
Many thanks to the intrepid who care and dare to read through all this.
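The cause is visible in the snippet: Rowsum is collapsed with np.sum, so C_N is a scalar, and ((dof-2)/dof)*C_N is therefore 0-dimensional, while np.linalg.det requires at least a 2-D square array. If a scalar really is what you want here, its "determinant" is just the value itself; a minimal illustration:
import numpy as np
c = 2.5                                   # a 0-dimensional value, like C_N above
print(np.linalg.det(np.atleast_2d(c)))    # wraps it as [[2.5]]; det == 2.5
If C_N was instead meant to be a covariance matrix, the reductions that build it (the row sums and the final np.sum) would need rethinking, but that intent isn't visible from the snippet.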

Can't differentiate wrt numpy arrays of dtype int64?

I am a newbie to numpy. Today, while using it for linear regression, I ran into the error below:
KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/autograd/numpy/numpy_extra.py in new_array_node(value, tapes)
84 try:
---> 85 return array_dtype_mappings[value.dtype](value, tapes)
86 except KeyError:
KeyError: dtype('int64')
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-4-aebe8f7987b0> in <module>()
24 return cost/float(np.size(y))
25
---> 26 weight_h, cost_h = gradient_descent(least_squares, alpha, max_its, w)
27
28 # a)
<ipython-input-2-1b74c4f818f4> in gradient_descent(g, alpha, max_its, w)
12 for k in range(max_its):
13 # evaluate the gradient
---> 14 grad_eval = gradient(w)
15
16 # take gradient descent step
~/anaconda3/lib/python3.6/site-packages/autograd/core.py in gradfun(*args, **kwargs)
19 #attach_name_and_doc(fun, argnum, 'Gradient')
20 def gradfun(*args,**kwargs):
---> 21 return backward_pass(*forward_pass(fun,args,kwargs,argnum))
22 return gradfun
23
~/anaconda3/lib/python3.6/site-packages/autograd/core.py in forward_pass(fun, args, kwargs, argnum)
57 tape = CalculationTape()
58 arg_wrt = args[argnum]
---> 59 start_node = new_node(safe_type(getval(arg_wrt)), [tape])
60 args = list(args)
61 args[argnum] = merge_tapes(start_node, arg_wrt)
~/anaconda3/lib/python3.6/site-packages/autograd/core.py in new_node(value, tapes)
185 def new_node(value, tapes=[]):
186 try:
--> 187 return Node.type_mappings[type(value)](value, tapes)
188 except KeyError:
189 return NoDerivativeNode(value, tapes)
~/anaconda3/lib/python3.6/site-packages/autograd/numpy/numpy_extra.py in new_array_node(value, tapes)
85 return array_dtype_mappings[value.dtype](value, tapes)
86 except KeyError:
---> 87 raise TypeError("Can't differentiate wrt numpy arrays of dtype {0}".format(value.dtype))
88 Node.type_mappings[anp.ndarray] = new_array_node
89
TypeError: Can't differentiate wrt numpy arrays of dtype int64
I really have no idea what happened. I guess it might be related to the structure of arrays in numpy. Or did I forget to install a package? Below is my original code.
# import statements
datapath = 'datasets/'
from autograd import numpy as np
# import automatic differentiator to compute gradient module
from autograd import grad
import matplotlib.pyplot as plt

# gradient descent function
def gradient_descent(g, alpha, max_its, w):
    # compute gradient module using autograd
    gradient = grad(g)
    # run the gradient descent loop
    weight_history = [w]    # weight history container
    cost_history = [g(w)]   # cost function history container
    for k in range(max_its):
        # evaluate the gradient
        grad_eval = gradient(w)
        # take gradient descent step
        w = w - alpha*grad_eval
        # record weight and cost
        weight_history.append(w)
        cost_history.append(g(w))
    return weight_history, cost_history

# load in dataset
csvname = datapath + 'kleibers_law_data.csv'
data = np.loadtxt(csvname, delimiter=',')

# get input and output of dataset
x = data[:-1,:]
y = data[-1:,:]
x = np.log(x)
y = np.log(y)

# Data Initiation
alpha = 0.01
max_its = 1000
w = np.array([0,0])

# linear model
def model(x, w):
    a = w[0] + np.dot(x.T, w[1:])
    return a.T

def least_squares(w):
    cost = np.sum((model(x,w)-y)**2)
    return cost/float(np.size(y))

weight_h, cost_h = gradient_descent(least_squares, alpha, max_its, w)

# a)
k = np.linspace(-5.5, 7.5, 250)
y = weight_h[max_its][0] + k*weight_h[max_its][1]
plt.figure()
plt.plot(x, y, label='Linear Line', color='g')
plt.xlabel('log of mass')
plt.ylabel('log of metabolic rate')
plt.title("Answer Of a")
plt.legend()
plt.show()

# b)
w0 = weight_h[max_its][0]
w1 = weight_h[max_its][1]
print("Nonlinear relationship between the body mass x and the metabolic rate y is "
      + str(w0) + " + " + "log(xp)" + str(w1) + " = " + "log(yp)")

# c)
x2 = np.log(10)
Kj = np.exp(w0 + w1*x2)*1000/4.18
print("It needs " + str(Kj) + " calories")
Could someone help me to figure it out? Thanks a lot.
Here are the important parts of your error:
---> 14 grad_eval = gradient(w)
...
TypeError: Can't differentiate wrt numpy arrays of dtype int64
Your gradient function is saying it doesn't like differentiating arrays of ints, which makes some sense, since it probably wants more precision than an int can give. You probably need them to be doubles or floats. For a simple solution, I believe you can just change your initializer from:
w = np.array([0,0])
which is going to automatically cast those 0s as ints, to:
w = np.array([0.0,0.0])
Those decimals after the 0s will let it know you want floats. There are other ways to tell it what kind of array you want (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.array.html), but this is a simple one.
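For instance, two equivalent ways to ask for a float array of zeros (just illustrations of that dtype machinery; either works here):
import numpy as np
w = np.array([0, 0], dtype=float)   # explicit dtype
w = np.zeros(2)                     # float64 by default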

Sparse Matrix Addition yields 'ValueError: setting an array element with a sequence.'

The lines in question are:
from numpy import shape
from scipy import sparse

# Make an efficient matrix that can be built incrementally
K = sparse.lil_matrix((N, N))
# Calculate K matrix (<i|pHp|j> in the LGL-nodes basis)
for i in range(Ne):
    idx_s, idx_e = i*(Np-1), i*(Np-1)+Np
    print(shape(K[idx_s:idx_e, idx_s:idx_e]))
    print(shape(dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)))
    K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
But currently, NumPy is yielding this error:
(8, 8)
(8, 8)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-62-cc7cc21f07e5> in <module>()
22
23 for _ in range(N):
---> 24 ll, q = getLL(Ne, Np, x_d, w_d, dmat_d, x, w, dL, peq*peq, data)
25 peq = (peq*q)
26
<ipython-input-61-a52c13d48b87> in getLL(Ne, Np, x_d, w_d, dmat_d, x, w, dmat, peq, data)
15 print(shape(K[idx_s:idx_e, idx_s:idx_e]))
16 print(shape(dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)))
---> 17 K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
18
19 # Re-make matrix for efficient vector products
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/scipy/sparse/lil.py in __iadd__(self, other)
157
158 def __iadd__(self,other):
--> 159 self[:,:] = self + other
160 return self
161
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/scipy/sparse/lil.py in __setitem__(self, index, x)
307
308 # Make x and i into the same shape
--> 309 x = np.asarray(x, dtype=self.dtype)
310 x, _ = np.broadcast_arrays(x, i)
311
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
This is a little cryptic, as the error seems to happen somewhere inside the NumPy library, not in my code. But I'm not terribly familiar with numpy per se, so perhaps I'm indirectly causing it.
Both slices have the same shape, so that doesn't seem to be the actual problem.
The problem is that
dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
is not a simple array. It has the right shape, but its elements are sparse matrices (the 'sequence' in the error message).
Turning the inner sparse matrix into a dense array should solve the problem:
dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np).A).dot(dmat)
The np.dot method is not aware of sparse matrices, at least not in your version of numpy (1.8?), so it treats them as sequences. Newer versions are 'sparse'-aware.
Another solution is to use the sparse matrix product (dot or *):
sparse.spdiags(...).dot(dmat etc)
I had to play around to get reasonable values for N, Np, Ne, dmat, peq. You really should have given us small samples; it makes testing ideas much easier.
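For what it's worth, here is a minimal runnable sketch of the densify fix, with made-up sizes standing in for the question's N, Np, dmat and w*peq (none of which were given):
import numpy as np
from scipy import sparse

Np = 8
dmat = np.random.rand(Np, Np)               # stand-in dense operator
wpeq = np.random.rand(Np)                   # stand-in for w*peq[idx_s:idx_e]

D = sparse.spdiags(wpeq, 0, Np, Np)
block = dmat.T.dot(D.toarray()).dot(dmat)   # .toarray() is equivalent to the .A above
print(block.shape, type(block))             # (8, 8) <class 'numpy.ndarray'>

K = sparse.lil_matrix((Np, Np))
K[:Np, :Np] += block                        # the in-place add now succeeds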
