hierarchical clustering in scipy - memory error - python

Here is my code:
import numpy as np
from scipy.cluster.hierarchy import fclusterdata
def mydist(p1, p2):
    return 1
Y = np.random.randn(100000,2)
fclust1 = fclusterdata(Y, 1.0, metric=mydist)
It produces the following error:
MemoryError Traceback (most recent call last)
<ipython-input-52-818db8791e96> in <module>()
----> 1 fclust1 = fclusterdata(Y, 1.0, metric=mydist)
C:\Anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in fclusterdata(X, t, criterion, metric, depth, method, R)
1682 'array.')
1683
-> 1684 Y = distance.pdist(X, metric=metric)
1685 Z = linkage(Y, method=method)
1686 if R is None:
C:\Anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, p, w, V, VI)
1218
1219 m, n = s
-> 1220 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1221
1222 wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
MemoryError:
So I am guessing my input array is too large. I am a bit surprised, since my distance function is trivial. What is the maximum input size that fclusterdata can accept?

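Hierarchical clustering usually requires a pairwise distance matrix.
That means you need O(n^2) memory. For n = 100000 points, pdist has to allocate n(n-1)/2, i.e. about 5 billion doubles, which is roughly 40 GB; that is the np.zeros call that fails in your traceback. It does not 'see' that your distance is constant (and it doesn't make sense to optimize for this either).
So there is no fixed maximum size; the limit is whatever fits in your memory. It's not a very scalable algorithm.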
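To get a feel for the numbers, here is a small sketch that estimates the size of the condensed distance array that pdist allocates (n(n-1)/2 values of 8 bytes each, matching the np.zeros call in the traceback). The helper name condensed_distance_bytes is made up for this illustration, and the estimate ignores whatever extra memory linkage needs afterwards.

```python
def condensed_distance_bytes(n_points, itemsize=8):
    """Bytes needed for the condensed pairwise distance array of n_points rows.

    itemsize=8 assumes float64 entries, as in the np.zeros call in the traceback.
    """
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * itemsize


for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"n={n:>7}: {gib:10.3f} GiB")
```

Because the requirement grows quadratically, ten times as many points need about a hundred times as much memory.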

Related

curve_fit multivariable arrays non-linear regression

I am trying to fit the coefficients of a multivariate function with curve_fit. All variables are arrays of shape (1000,). Manually, I can fit the curves as follows. First I define my function, where the variables are [dphi1,dphi2,phi1,phi2,M] and the coefficients are [c1,c2,c3,c4,c5,c6,c7,c8,c9]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import math
# Function definition
def ddphi1(dphi1,dphi2,phi1,phi2,M,c1,c2,c3,c4,c5,c6,c7,c8,c9):
    return (-(c1*np.sin(phi1-phi2)*np.cos(phi1-phi2)*dphi1**2)-(c2*np.sin(phi1-phi2)*dphi2**2)+(c3*np.cos(phi1-phi2)*np.sin(phi2))+(c4*np.cos(phi1-phi2)*(dphi2-dphi1))-(c5*np.sin(phi1))+c6*M-c7*dphi1)/(c8-(c9*np.cos(phi1-phi2)*np.cos(phi1-phi2)))
My first guess for the coefficients:
p = [0.5625, 0.375, 27.590625000000003, 0.09375, 55.18125, 62.5, 0.425, 1, 0.5625]
I calculate the values of the function iteratively. I take the length from any one of the variables, since they all have the same size:
n = len(time1)
y = np.empty(n)
for i in range(n):
    y[i] = ddphi1(dphi11[i],dphi22[i],phi11[i],phi22[i],M[i],p[0],p[1],p[2],p[3],p[4],p[5],p[6],p[7],p[8])
plt.plot(time1, ddphi11)
plt.plot(time1, y, 'r')
[Figure: predicted vs. real data]
Now the idea is to calculate the coefficients automatically with curve_fit as follows. Here ddphi1 is my callback function and ddphi11 is my data, of shape (1000,) like the other variables:
from scipy.optimize import curve_fit
g = [0.56, 0.37, 27.63, 0.094, 55.18, 62.5, 0.625, 1, 0.56]
c,cov =curve_fit(ddphi1,(dphi1,dphi2,phi1,phi2,M),ddphi11,g)
print(c)
and I receive this error
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-158-e8e42e7b1216> in <module>()
1 from scipy.optimize import curve_fit
2
----> 3 c,cov =curve_fit(ddphi1,(dphi1,dphi2,phi1,phi2,M),ddphi11,g=[0.56, 0.37, 27.63, 0.094, 55.18, 62.5, 0.625, 1, 0.56])
4 print(c)
1 frames
/usr/local/lib/python3.7/dist-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
719 # non-array_like `xdata`.
720 if check_finite:
--> 721 xdata = np.asarray_chkfinite(xdata, float)
722 else:
723 xdata = np.asarray(xdata, float)
/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in asarray_chkfinite(a, dtype, order)
484
485 """
--> 486 a = asarray(a, dtype=dtype, order=order)
487 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
488 raise ValueError(
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
I have seen that most of the data that goes into curve_fit is in the form of lists. Maybe there is a solution when dealing with arrays? I hope you can help me, as I am new to Python.
I was finally able to solve it. The arrays just needed to be stacked into a single variable:
X=np.column_stack([dphi11,dphi22,phi11,phi22,M])
Then define the model in terms of that variable:
def model(X,c1,c2,c3,c4,c5,c6,c7,c8,c9):
    dphi1 = X[:,0]
    dphi2 = X[:,1]
    phi1 = X[:,2]
    phi2 = X[:,3]
    M = X[:,4]
    f = (-(c1*np.sin(phi1-phi2)*np.cos(phi1-phi2)*dphi1**2)-(c2*np.sin(phi1-phi2)*dphi2**2)+(c3*np.cos(phi1-phi2)*np.sin(phi2))+(c4*np.cos(phi1-phi2)*(dphi2-dphi1))-(c5*np.sin(phi1))+c6*M-c7*dphi1)/(c8-(c9*np.cos(phi1-phi2)*np.cos(phi1-phi2)))
    return f
and the magic begins
guesses = [0.56, 0.37, 27.63, 0.094, 55.18, 62.5, 0.625, 1, 0.56]
from scipy.optimize import curve_fit
popt, pcov = curve_fit(model, X, ddphi11, guesses)
print(popt)
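For readers who want to try the same pattern without the pendulum data, here is a minimal self-contained sketch on synthetic data. The three-parameter model, the variable names and the noise level are all invented for illustration; only the idea of stacking the independent variables into one array and unpacking its columns inside the model is taken from the solution above.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(X, a, b, c):
    # X has shape (n_samples, 2); each column is one independent variable.
    x1 = X[:, 0]
    x2 = X[:, 1]
    return a * x1 + b * np.sin(x2) + c


# Synthetic data generated from known parameters plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, 200)
x2 = rng.uniform(0.0, 3.0, 200)
X = np.column_stack([x1, x2])
true_params = (2.0, -1.5, 0.5)
y = model(X, *true_params) + rng.normal(scale=0.01, size=200)

# curve_fit passes X to the model unchanged and fits a, b and c.
popt, pcov = curve_fit(model, X, y, p0=[1.0, 1.0, 0.0])
print(popt)
```

The fitted values should land close to true_params; with real data it is still worth checking pcov and plotting the fit against the measurements.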

Fitting zenithal equal area projection with astropy and fit_wcs_from_points

I'm trying to use astropy.wcs.utils.fit_wcs_from_points to fit points projected with zenithal equal area projection (WCS code ZEA). This projection is popular in all-sky cameras.
I have started by projecting a set of celestial coordinates with a known WCS, to see if I can recover it. Input values are obtained by:
x, y = w.world_to_pixel(lon * u.deg, lat * u.deg)
world_coords = SkyCoord(lon * u.deg, lat * u.deg)
and the projection is:
from astropy import wcs
w = wcs.WCS(naxis=2)
scale = 0.095
w.wcs.crpix = [1290, 1950]
w.wcs.cdelt = [scale, scale]
w.wcs.crval = [0, 90]
w.wcs.ctype = ["ALON-ZEA", "ALAT-ZEA"]
In all my tests I get the following exception when I perform the fitting:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-cfb0b01c0d18> in <module>
----> 1 astropy.wcs.utils.fit_wcs_from_points([x, y], world_coords,
2 proj_point=SkyCoord(0 * u.deg, 90 * u.deg),
3 projection='ZEA')
/usr/lib64/python3.9/site-packages/astropy/wcs/utils.py in fit_wcs_from_points(xy, world_coords, proj_point, projection, sip_degree)
1076 # and cd terms are way off.
1077 p0 = np.concatenate([wcs.wcs.cd.flatten(), wcs.wcs.crpix.flatten()])
-> 1078 fit = least_squares(_linear_wcs_fit, p0,
1079 args=(lon, lat, xp, yp, wcs))
1080 wcs.wcs.crpix = np.array(fit.x[4:6])
/usr/lib64/python3.9/site-packages/scipy/optimize/_lsq/least_squares.py in least_squares(fun, x0, jac, bounds, method, ftol, xtol, gtol, x_scale, loss, f_scale, diff_step, tr_solver, tr_options, jac_sparsity, max_nfev, verbose, args, kwargs)
812
813 if not np.all(np.isfinite(f0)):
--> 814 raise ValueError("Residuals are not finite in the initial point.")
815
816 n = x0.size
ValueError: Residuals are not finite in the initial point.
Things I have tried, none of which changed the result:
- passing the actual test WCS transformation as projection, instead of ZEA
- setting CRPIX to 0 in my test WCS, so all the points are around (0, 0)
- removing proj_point
- trying astropy 4.2.1 (the latest release)
This sample code reproduces the problem:
import numpy as np
import astropy.wcs.utils
from astropy.coordinates import SkyCoord
import astropy.units as u
x0 = np.array([702.4, 1480.4, 1223.5, 897, 1916.6])
y0 = np.array([1925.8, 2269.3, 2679.1, 1632.7, 1586.3])
zea_lon = [268.8, 145.6, 181.6, 305.4, 56.5]
zea_lat = [31.8, 54.0, 15.2, 40.6, 16.1]
world_coords0 = SkyCoord(zea_lon * u.deg, zea_lat * u.deg)
astropy.wcs.utils.fit_wcs_from_points([x0, y0], world_coords0,
                                      proj_point=SkyCoord(0 * u.deg, 90 * u.deg),
                                      projection='ZEA')

skcuda.linalg.PCA's fit_transform throws error

I am trying to run PCA (Principal component analysis) on GPU. I am using skcuda.linalg.PCA for that purpose, but it's not working. From their tutorial (https://scikit-cuda.readthedocs.io/en/latest/generated/skcuda.linalg.PCA.html):
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA
pca = cuPCA(n_components=4) # map the data to 4 dimensions
X = np.random.rand(1000,100) # 1000 samples of 100-dimensional data vectors
X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
X_gpu.set(X) # copy data to gpu
T_gpu = pca.fit_transform(X_gpu) # calculate the principal components
When I run it I get the following error:
cublasInternalError Traceback (most recent call last)
<ipython-input-31-02aaf0fa19e4> in <module>
8 X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
9 X_gpu.set(X) # copy data to gpu
---> 10 T_gpu = pca.fit_transform(X_gpu)
/opt/conda/lib/python3.7/site-packages/skcuda/linalg.py in fit_transform(self, X_gpu)
204 cuGemv (self.h, 'n', p, k, -1.0, P_gpu.gpudata, p, U_gpu.gpudata, 1, 1.0, P_gpu[:,k].gpudata, 1)
205
--> 206 l2 = cuNrm2(self.h, p, P_gpu[:,k].gpudata, 1)
207 cuScal(self.h, p, 1.0/l2, P_gpu[:,k].gpudata, 1)
208 cuGemv(self.h, 'n', n, p, 1.0, R_gpu.gpudata, n, P_gpu[:,k].gpudata, 1, 0.0, T_gpu[:,k].gpudata, 1)
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasDnrm2(handle, n, x, incx)
1295 n, int(x), incx,
1296 ctypes.byref(result))
-> 1297 cublasCheckStatus(status)
1298 return np.float64(result.value)
1299
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasCheckStatus(status)
177 raise cublasError
178 else:
--> 179 raise e
180
181 # Helper functions:
cublasInternalError
Initially, I was running on my own data and I got this error. Then I decided to run the example and I got exactly the same error. Does anyone know what the problem is here? I am using a Kaggle notebook with a Tesla T4 GPU. Thanks.

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

I need to compute an integral of the form g(u)jn(u), where g(u) is a smooth function without zeros and jn(u) is the Bessel function, which has infinitely many zeros, but I get the following error:
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
First I need to change from the variable x to the variable u and integrate over u. However, the function u(x) is not analytically invertible, so I need to use interpolation to perform this inversion numerically.
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.linspace(0.1, 100, 1000)
u = lambda x: x*np.exp(x)
dxdu_x = lambda x: 1/((1+x) * np.exp(x)) ## dxdu as function of x: not invertible
dxdu_u = InterpolatedUnivariateSpline(u(x), dxdu_x(x)) ## dxdu as function of u: change of variable
After this, the integral is:
from mpmath import mp
def f(n):
    integrand = lambda U: dxdu_u(U) * mp.besselj(n,U)
    bjz = lambda nth: mp.besseljzero(n, nth)
    return mp.quadosc(integrand, [0,mp.inf], zeros=bjz)
I use quadosc from mpmath rather than quad from scipy because quadosc is better suited to integrating rapidly oscillating functions such as Bessel functions. On the other hand, this forces me to use two different packages: scipy to calculate dxdu_u by interpolation, and mpmath to calculate the Bessel function mp.besselj(n,U) and the integral of the product dxdu_u(U) * mp.besselj(n,U). I suspect that mixing the two packages causes a conflict. When I run:
print(f(0))
I got the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-ac2976a6b736> in <module>
12 return mp.quadosc(integrand, [0,mp.inf], zeros=bjz)
13
---> 14 f(0)
<ipython-input-38-ac2976a6b736> in f(n)
10 integrand = lambda U: dxdu_u(U) * mp.besselj(n,U)
11 bjz = lambda nth: mp.besseljzero(n, nth)
---> 12 return mp.quadosc(integrand, [0,mp.inf], zeros=bjz)
13
14 f(0)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
Does anyone know how I can solve this problem?
Thanks
The full traceback (the part you snipped) shows that the error is raised in the __call__ method of the InterpolatedUnivariateSpline object. So the problem is indeed that the mpmath integration routine feeds in its mpf high-precision numbers, and scipy has no way of dealing with them.
The simplest fix is then to manually cast the offending argument of the integrand to a float:
integrand = lambda U: dxdu_u(float(U)) * mp.besselj(n,U)
In general this is prone to numerical errors (mpmath uses its high-precision variables on purpose!) so proceed with caution. In this specific case it might be OK, because the interpolation is actually done in double precision. Still, best check the results.
A possible alternative might be to avoid mpmath and use the weight argument of scipy.integrate.quad; see the docs (scroll down to the weight="sin" part).
Another alternative is to stick with mpmath all the way and implement the interpolation yourself in pure Python (this way, mpf objects are probably fine since they should support the usual arithmetic). It's likely that a simple linear interpolation is enough. If it's not, coding up your own cubic spline interpolator is not too big a deal.
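The pure-Python route could look roughly like the sketch below. It is only an illustration: the function name make_linear_interpolator is made up, and fractions.Fraction is used as a stand-in for mpf so that the example runs without mpmath installed. The assumption is that mpf values support the same comparisons and arithmetic, which should be checked before relying on it.

```python
from bisect import bisect_right
from fractions import Fraction


def make_linear_interpolator(xs, ys):
    """Return f(x) that linearly interpolates the points (xs[i], ys[i]).

    xs must be strictly increasing. Only comparisons and basic arithmetic
    are used, so x can be any numeric type that supports them.
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two points of equal-length xs and ys")

    def interp(x):
        # Locate the segment containing x; clamp to the first/last segment,
        # so values outside the range are linearly extrapolated.
        i = bisect_right(xs, x) - 1
        i = max(0, min(i, len(xs) - 2))
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return interp


f_lin = make_linear_interpolator([0.0, 1.0, 2.0], [0.0, 10.0, 40.0])
print(f_lin(0.5))             # plain float input
print(f_lin(Fraction(3, 2)))  # non-float numeric input (stand-in for mpf)
```

Since the knots here are ordinary floats, the result is still limited to double precision; to keep mpmath's precision the knots themselves would have to be stored as mpf values.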
The full traceback:
In [443]: f(0)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-443-6bfbdbfff9c4> in <module>
----> 1 f(0)
<ipython-input-440-7ebeff3611f6> in f(n)
2 integrand = lambda U: dxdu_u(U) * mp.besselj(n,U)
3 bjz = lambda nth: mp.besseljzero(n, nth)
----> 4 return mp.quadosc(integrand, [0,mp.inf], zeros=bjz)
5
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in quadosc(ctx, f, interval, omega, period, zeros)
998 # raise ValueError("zeros do not appear to be correctly indexed")
999 n = 1
-> 1000 s = ctx.quadgl(f, [a, zeros(n)])
1001 def term(k):
1002 return ctx.quadgl(f, [zeros(k), zeros(k+1)])
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in quadgl(ctx, *args, **kwargs)
807 """
808 kwargs['method'] = 'gauss-legendre'
--> 809 return ctx.quad(*args, **kwargs)
810
811 def quadosc(ctx, f, interval, omega=None, period=None, zeros=None):
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in quad(ctx, f, *points, **kwargs)
740 ctx.prec += 20
741 if dim == 1:
--> 742 v, err = rule.summation(f, points[0], prec, epsilon, m, verbose)
743 elif dim == 2:
744 v, err = rule.summation(lambda x: \
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in summation(self, f, points, prec, epsilon, max_degree, verbose)
230 print("Integrating from %s to %s (degree %s of %s)" % \
231 (ctx.nstr(a), ctx.nstr(b), degree, max_degree))
--> 232 results.append(self.sum_next(f, nodes, degree, prec, results, verbose))
233 if degree > 1:
234 err = self.estimate_error(results, prec, epsilon)
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in sum_next(self, f, nodes, degree, prec, previous, verbose)
252 case the quadrature rule is able to reuse them.
253 """
--> 254 return self.ctx.fdot((w, f(x)) for (x,w) in nodes)
255
256
/usr/local/lib/python3.6/dist-packages/mpmath/ctx_mp_python.py in fdot(ctx, A, B, conjugate)
942 hasattr_ = hasattr
943 types = (ctx.mpf, ctx.mpc)
--> 944 for a, b in A:
945 if type(a) not in types: a = ctx.convert(a)
946 if type(b) not in types: b = ctx.convert(b)
/usr/local/lib/python3.6/dist-packages/mpmath/calculus/quadrature.py in <genexpr>(.0)
252 case the quadrature rule is able to reuse them.
253 """
--> 254 return self.ctx.fdot((w, f(x)) for (x,w) in nodes)
255
256
<ipython-input-440-7ebeff3611f6> in <lambda>(U)
1 def f(n):
----> 2 integrand = lambda U: dxdu_u(U) * mp.besselj(n,U)
3 bjz = lambda nth: mp.besseljzero(n, nth)
4 return mp.quadosc(integrand, [0,mp.inf], zeros=bjz)
5
at this point it starts using the scipy interpolation code
/usr/local/lib/python3.6/dist-packages/scipy/interpolate/fitpack2.py in __call__(self, x, nu, ext)
310 except KeyError:
311 raise ValueError("Unknown extrapolation mode %s." % ext)
--> 312 return fitpack.splev(x, self._eval_args, der=nu, ext=ext)
313
314 def get_knots(self):
/usr/local/lib/python3.6/dist-packages/scipy/interpolate/fitpack.py in splev(x, tck, der, ext)
366 return tck(x, der, extrapolate=extrapolate)
367 else:
--> 368 return _impl.splev(x, tck, der, ext)
369
370
/usr/local/lib/python3.6/dist-packages/scipy/interpolate/_fitpack_impl.py in splev(x, tck, der, ext)
596 shape = x.shape
597 x = atleast_1d(x).ravel()
--> 598 y, ier = _fitpack._spl_(x, der, t, c, k, ext)
599
600 if ier == 10:
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
_fitpack._spl_ is probably compiled code (for speed). It can't take the mpmath objects directly; their values have to be passed as C-compatible doubles.
To illustrate the problem, make a numpy array of mpmath objects:
In [444]: one,two = mp.mpmathify(1), mp.mpmathify(2)
In [445]: arr = np.array([one,two])
In [446]: arr
Out[446]: array([mpf('1.0'), mpf('2.0')], dtype=object)
In [447]: arr.astype(float) # default 'unsafe' casting
Out[447]: array([1., 2.])
In [448]: arr.astype(float, casting='safe')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-448-4860036bcca8> in <module>
----> 1 arr.astype(float, casting='safe')
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'
With integrand = lambda U: dxdu_u(float(U)) * mp.besselj(n,U),
In [453]: f(0) # a minute or so later
Out[453]: mpf('0.61060303588231069')

StatsModel quantile regression ValueError

I got an error after running a quantile regression with the Python statsmodels module. The error is the following:
ValueError Traceback (most recent call last)
<ipython-input-221-3547de1b5e0d> in <module>()
16 model = smf.quantreg(fit_formula, train)
17
---> 18 fitted_model = model.fit(0.2)
19
20 #fitted_model.predict(test)
in fit(self, q, vcov, kernel, bandwidth, max_iter, p_tol, **kwargs)
177 resid = np.abs(resid)
178 xstar = exog / resid[:, np.newaxis]
--> 179 diff = np.max(np.abs(beta - beta0))
180 history['params'].append(beta)
181 history['mse'].append(np.mean(resid*resid))
ValueError: operands could not be broadcast together with shapes (178,) (176,)
I thought it might be caused by constant features, so I removed those, but I still get the same error. What is the cause? My code is as follows:
quantiles = np.arange(.05, .99, .1)
cols = train.columns.tolist()[1:-2]
fit_formula = ''
for c in cols:
    fit_formula = fit_formula + ' + ' + c
fit_formula = 'revenue ~ ' + train.columns.tolist()[0] + fit_formula
model = smf.quantreg(fit_formula, train)
fitted_model = model.fit(0.2)
I think your design matrix is singular, i.e. this does not hold for your data:
np.linalg.matrix_rank(model.exog) == model.exog.shape[1]
Guessing from looking at the code: The parameter, beta, is initialized for the iteration loop with
exog_rank = np_matrix_rank(self.exog)
beta = np.ones(exog_rank)
which has a different length than the beta from the auxiliary weighted least squares regression, so the convergence check fails. The iteratively reweighted step uses a generalized inverse, pinv, which does not raise an exception for a singular design matrix.
Based on your traceback, (178,) (176,), you would still have two collinear columns that need to be dropped.
(That's a bug: Either it should raise a proper exception for the singular case, or handle it with pinv throughout.)
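As a rough way to run the rank check above and see which columns are responsible, here is a small sketch on a made-up design matrix. The helper redundant_columns is hypothetical (it is not part of statsmodels or numpy); it simply adds columns one at a time and flags those that do not increase the rank. In the question's setting the matrix to inspect would be model.exog.

```python
import numpy as np


def redundant_columns(exog, tol=None):
    """Return indices of columns that are linear combinations of earlier ones."""
    exog = np.asarray(exog, dtype=float)
    keep, redundant = [], []
    for j in range(exog.shape[1]):
        candidate = exog[:, keep + [j]]
        # A column is kept only if it raises the rank of the kept set.
        if np.linalg.matrix_rank(candidate, tol=tol) > len(keep):
            keep.append(j)
        else:
            redundant.append(j)
    return redundant


# Made-up design matrix: the last column is an exact combination of two others.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
exog = np.column_stack([np.ones(50), a, b, 2.0 * a - b])

rank = np.linalg.matrix_rank(exog)
print(rank, exog.shape[1])      # rank is smaller than the number of columns
print(redundant_columns(exog))  # indices of the columns to drop
```

Checking one column at a time is slow for wide matrices, and nearly collinear columns depend on the tolerance, so treat the output as a hint for which features to inspect rather than a definitive answer.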
