I'm attempting to fit a 2D point cloud (x and y coordinates). So far, I've had limited success with scipy's interpolation packages, most notably UnivariateSpline, which produced the following (sub-optimal) fit (please ignore the colors):
However, this is obviously the wrong package, since I need a final curve that can bend back in on itself (so it is no longer a 1D function) near the edges of my parabolic point cloud.
I then read up on interp2d, for example, but don't understand what my z array would be. Is there a better class of packages that I'm perhaps overlooking?
Update 1: as suggested in the comments, I've redone this using scipy.interpolate.splprep; my general setup is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splprep, splev

pts = np.vstack((X.ravel(), Y.ravel()))  # X and Y contain my points
(tck, u), fp, ier, msg = splprep(pts, u=None, per=0, k=3, full_output=True)  # s = optional parameter (default used here)
print('Spline score:', fp)  # goodness of fit flatlines after a given s value (and higher), which captures the default s-value as well
u_new = np.linspace(u.min(), u.max(), 1000)  # dense parameter values to evaluate the spline on
x_new, y_new = splev(u_new, tck, der=0)
plt.plot(x_new, y_new, 'k')
plt.show()
The plot is below. Can anyone suggest a method for automating the choice of s... possibly a loop that assesses the coefficient of determination for each s? Or is there something baked in?
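One possible automation (a sketch): sweep s over the range the FITPACK documentation suggests when weights reflect the data's standard deviation, m +/- sqrt(2m) with m the number of points, and stop once the achieved residual fp flatlines; the 1% threshold here is an arbitrary choice.

import numpy as np
from scipy.interpolate import splprep

m = pts.shape[1]  # number of points
prev_fp = None
for s in np.linspace(m - np.sqrt(2*m), m + np.sqrt(2*m), 20):
    (tck, u), fp, ier, msg = splprep(pts, s=s, k=3, full_output=True)
    if prev_fp is not None and abs(fp - prev_fp) < 0.01 * prev_fp:
        break  # fp has flatlined; keep this s
    prev_fp = fp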
Update 2:
I've since re-run this on two different point clouds and found that the ordering of the points significantly alters the outcome. When I re-order the points in the point cloud to lie along an initial parabola fit, I get much better results with my spline. Even so, the results are still sub-optimal, as can be seen below.
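For reference, the reordering step can be as simple as the following sketch (it assumes the cloud opens along x, i.e. the curve doubles back in x near its edges, so y parameterizes the curve monotonically; swap X and Y otherwise):

import numpy as np

coeffs = np.polyfit(Y, X, 2)            # initial parabola fit x ~ f(y), used to verify the sort axis
order = np.argsort(Y)                   # y traverses the single-valued branch monotonically
pts = np.vstack((X[order], Y[order]))   # feed splprep the points in curve order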
Are there any further adjustments I can make with this method? Alternatively, does anyone have a suggestion for a competitive approach I can investigate?
Update 3: actually, setting knots = 5 helps out tremendously:
I faced a similar issue, and the solution to my problem was to use Principal Curves (https://hastie.su.domains/Papers/Principal_Curves.pdf).
This implementation available on GitHub could help you: https://github.com/zsteve/pcurvepy.
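If you want to stay in numpy/scipy, the core of the Hastie-Stuetzle algorithm from that paper is short enough to sketch directly. This is a minimal illustration, not pcurvepy's actual API, and the initial ordering along x is an assumption:

import numpy as np
from scipy.interpolate import UnivariateSpline

def principal_curve(X, Y, n_iter=10, s=None, n_dense=500):
    # Initial parameterization: rank of each point along x
    # (an assumption; any rough ordering of the cloud will do).
    t = np.argsort(np.argsort(X)).astype(float)
    for _ in range(n_iter):
        # Smoothing step: smooth each coordinate against the parameter.
        order = np.argsort(t)
        fx = UnivariateSpline(t[order], X[order], s=s)
        fy = UnivariateSpline(t[order], Y[order], s=s)
        # Projection step: re-parameterize every point by its nearest
        # point on a dense sampling of the current curve.
        td = np.linspace(t.min(), t.max(), n_dense)
        d2 = (X[:, None] - fx(td))**2 + (Y[:, None] - fy(td))**2
        t = td[np.argmin(d2, axis=1)]
        # Break ties so the spline abscissae stay strictly increasing.
        t = t + 1e-9 * np.argsort(np.argsort(t))
    return fx, fy, t

# Usage: evaluate the fitted curve on a dense parameter grid, e.g.
# fx, fy, t = principal_curve(X, Y)
# u = np.linspace(t.min(), t.max(), 500)
# plt.plot(fx(u), fy(u), 'k')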
The original problem
While translating MATLAB code to Python, I ran into the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike, which returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be coupled to the confidence intervals as well; however, my statistics background is not strong enough to tell whether this is true.
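For what it's worth, that feeling is correct: the square roots of the diagonal of acov are the asymptotic standard errors, and Wald-type intervals follow directly. A sketch of the plain, untransformed version (MATLAB may apply parameter transformations before doing this):

import numpy as np
from scipy.stats import norm

def wald_ci(parmhat, acov, alpha=0.05):
    # 100*(1-alpha)% intervals: parmhat +/- z_{1-alpha/2} * sqrt(diag(acov))
    se = np.sqrt(np.diag(acov))
    z = norm.ppf(1 - alpha / 2)
    return np.column_stack((parmhat - z * se, parmhat + z * se))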
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stack Overflow for two days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question applies to many more (if not all) of the distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the Generalized Pareto fit; the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats While scipy.stats provides nice fitting performance, it does not offer a way to calculate confidence intervals on the parameter estimates of the fitted distribution.
scipy.optimize.curve_fit As an alternative, I have seen it suggested to use scipy.optimize.curve_fit, as this does provide the estimated covariance of the fitted parameters. However, that fitting method uses least squares, whereas I need to use MLE, and I didn't see a way to make curve_fit use MLE instead. Therefore it seems that I cannot use curve_fit.
statsmodels.GenericLikelihoodModel Next I found a suggestion to use statsmodels' GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Data contains 24 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale)
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()
res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping Lastly, I found mentions of bootstrapping. While I more or less understand the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000, for example) from your data set (24 points in my case), fit each sub-sample, and record the fit results. Then do a statistical analysis on the N results, i.e. calculate the mean, standard deviation, and then the confidence interval, as in Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
However, I need my code to run fast, and I am not sure whether any implementation I write myself will be fast enough.
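For reference, a minimal sketch of the resampling scheme described above (a percentile bootstrap; n_boot = 1000 and the percentile method are arbitrary choices):

import numpy as np
from scipy.stats import genpareto

def bootstrap_ci(data, n_boot=1000, alpha=0.05, seed=0):
    # Percentile-bootstrap confidence intervals for (shape, scale)
    # of a Generalized Pareto fit with the location fixed at 0.
    rng = np.random.default_rng(seed)
    fits = np.empty((n_boot, 2))
    for i in range(n_boot):
        sample = rng.choice(data, size=len(data), replace=True)
        c, _, scale = genpareto.fit(sample, floc=0)
        fits[i] = (c, scale)
    lo, hi = np.percentile(fits, [100*alpha/2, 100*(1 - alpha/2)], axis=0)
    return np.column_stack((lo, hi))  # row 0: shape CI, row 1: scale CI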
Conclusions Does anyone know of a Python function that calculates this in an efficient manner, or can anyone point me to a topic where this has already been explained in a way that works for my case?
I had the same issue with GenericLikelihoodModel, and I came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers), which suggests using different starting parameter values to get a result with a positive-definite Hessian. That solved my problem.
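In code, the retry might look like this (a sketch reusing the Genpareto class and the params starting values from the question; the shift values are arbitrary):

import numpy as np

for shift in (0.0, 1e-3, 1e-2, 1e-1):
    res = Genpareto(data).fit(start_params=np.asarray(params) + shift, disp=0)
    try:
        ok = np.all(np.isfinite(res.bse))  # standard errors available?
    except ValueError:                     # Hessian inversion failed
        ok = False
    if ok:
        break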
I am trying to solve the elliptic differential equation using the fourth-order Runge-Kutta method in Python.
After execution, I get only a very small part of the plot that should be obtained, along with an error saying:
"RuntimeWarning: invalid value encountered in double_scalars"
import numpy as np
import matplotlib.pyplot as plt
#Define constants
g=9.8
L=1.04
#Define the differential Function
def fun(y,x):
    return -(2*(g/L)*(np.cos(y)-np.cos(np.pi/6)))**(1/2)
#Define variable arrays
x=np.zeros(1000)
y=np.zeros(1000)
y[0]=np.pi/6
dx=0.5
#Runge-Kutta Method
for i in range(len(y)-1):
    k1=fun(x[i],y[i])
    k2=fun(x[i]+dx/2, y[i]+dx*k1/2)
    k3=fun(x[i]+dx/2, y[i]+dx*k2/2)
    k4=fun(x[i]+dx, y[i]+dx*k3)
    y[i+1]=y[i]+dx/6*(k1+2*k2+2*k3+k4)
    x[i+1]=x[i]+dx
#print(y)
#print(x)
plt.plot(x,y)
plt.xlabel('Time')
plt.ylabel('Theta')
plt.grid()
And the graph I obtain looks something like this:
My question is: why am I getting the error message? Thanks for helping!
Several points lead to this behavior. First, you switched the order of the arguments in the ODE function, probably to make it compatible with odeint. Use the tfirst=True optional argument to avoid that and have the independent variable always first.
The actual source of the error is the term
(np.cos(y)-np.cos(np.pi/6))**(1/2)
Remember that in your version y has the value x[i], so at some point the expression under the root becomes negative.
If you correct the first point, you will probably still encounter the second error, as the exact solution approaches the fixed point parabolically, so the stages of RK4 are likely to overshoot. One can fix that by providing a sufficiently safeguarded square-root function:
def softroot(x): return x/max(1e-12,abs(x))**0.5
#Define the differential Function
def fun(x,y):
    return -(2*(g/L)*softroot(np.cos(y)-np.cos(np.pi/6)))
#Define variable arrays
dx=0.01
x=np.arange(0,1,dx)
y=np.zeros(x.shape)
y[0]=np.pi/6
...
results in a plot that just stays at the initial value, since the solution already starts in the fixed point. Shifting the initial point slightly down, to y[0]=np.pi/6-1e-8, produces a jump to the fixed point below.
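For completeness, here is a full runnable version combining these corrections (a sketch: the f(x, y) argument order, the softroot safeguard, and the suggested 1e-8 shift, grafted onto the question's RK4 loop):

import numpy as np
import matplotlib.pyplot as plt

g, L = 9.8, 1.04

def softroot(x):
    return x/max(1e-12, abs(x))**0.5

def fun(x, y):
    return -(2*(g/L)*softroot(np.cos(y) - np.cos(np.pi/6)))

dx = 0.01
x = np.arange(0, 1, dx)
y = np.zeros(x.shape)
y[0] = np.pi/6 - 1e-8   # nudge the start off the fixed point

for i in range(len(y)-1):
    k1 = fun(x[i], y[i])
    k2 = fun(x[i]+dx/2, y[i]+dx*k1/2)
    k3 = fun(x[i]+dx/2, y[i]+dx*k2/2)
    k4 = fun(x[i]+dx, y[i]+dx*k3)
    y[i+1] = y[i] + dx/6*(k1 + 2*k2 + 2*k3 + k4)

plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Theta')
plt.grid()
plt.show()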
I need to transfer some MATLAB code into Python. I got stuck at a numpy.arange I use to set points continuously along an arc of a circle with a given angle (in radians).
I got this far (example for points on the x-axis):
import numpy as np

def sensor_data_arc_x():
    # angle and radius are defined elsewhere
    theta = np.arange(0, angle/2, 2*np.pi/360)
    return np.multiply(radius, np.cos(np.transpose(theta)))
I know numpy.arange does not include the endpoint, although the MATLAB equivalent does; the array is always one element short, which is messing up my further computations.
Is there a way to include the endpoint?
I recommend that you work through a tutorial on for loops -- the information you need is there, as well as other hints on using controlled iteration. To solve your immediate need, simply increase your upper bound by one loop increment:
inc = 2*np.pi/360
theta = np.arange(0, angle/2 + inc, inc)
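If angle/2 is an exact multiple of the step (as in the MATLAB original), a sketch of an alternative that avoids floating-point drift at the endpoint is to compute the point count and use np.linspace, which includes the endpoint by default:

import numpy as np

inc = 2*np.pi/360
n = int(round((angle/2) / inc)) + 1   # number of points, endpoint included
theta = np.linspace(0, angle/2, n)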
The smooth.spline function in R allows a tradeoff between roughness (as defined by the integrated square of the second derivative) and fitting the points (as defined by the sum of the squared residuals). This tradeoff is accomplished by the spar or df parameter. At one extreme you get the least-squares line; at the other you get a very wiggly curve which intersects all of the data points (or the mean, if you have duplicated x values with different y values).
I have looked at scipy.interpolate.UnivariateSpline and other spline variants in Python; however, they seem to only trade off by increasing the number of knots and setting a threshold (called s) for the allowed sum of squared residuals. By contrast, smooth.spline in R allows knots at all the x values, without necessarily having a wiggly curve that hits all the points -- the penalty comes from the second derivative.
Does Python have a spline fitting mechanism that behaves in this way? Allowing all knots but penalizing the second derivative?
You can use R functions in Python with rpy2:
import numpy as np
import matplotlib.pyplot as plt
import rpy2.robjects as robjects

r_y = robjects.FloatVector(y_train)
r_x = robjects.FloatVector(x_train)

r_smooth_spline = robjects.r['smooth.spline']  # extract the R function
spline1 = r_smooth_spline(x=r_x, y=r_y, spar=0.7)  # run the smoothing function
ySpline = np.array(robjects.r['predict'](spline1, robjects.FloatVector(x_smooth)).rx2('y'))
plt.plot(x_smooth, ySpline)
If you want to set lambda directly, spline1 = r_smooth_spline(x=r_x, y=r_y, lambda=42) doesn't work, because lambda already has another meaning in Python, but there is a solution: How to use the lambda argument of smooth.spline in RPy WITHOUT Python interpreting it as lambda.
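For reference, the usual workaround (a one-line sketch): lambda is only reserved at the Python syntax level, so it can be passed through a keyword dictionary instead of as a literal keyword:

spline1 = r_smooth_spline(x=r_x, y=r_y, **{'lambda': 42})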
To get the code running you first need to define the data x_train and y_train, and you can define x_smooth = np.linspace(-3, 5, 1920) if you want to plot it between -3 and 5 at Full HD resolution.
Note that this code is not fully compatible with Jupyter notebooks for the latest versions of rpy2. You can fix this by running !pip install -Iv rpy2==3.4.2, as described in "NotImplementedError: Conversion 'rpy2py' not defined for objects of type '<class 'rpy2.rinterface.SexpClosure'>' only after I run the code twice".
I've been looking for exactly the same thing, but would rather not have to translate the code to Python. The Splinter package seems like an option, however: https://github.com/bgrimstad/splinter
I'm working on some Python code to interpolate irregular data onto a 180° lat x 360° lon spherical grid. The code is currently hanging when I call the following:
import numpy as np
from math import pi
from scipy.interpolate import LSQSphereBivariateSpline

def geo_interp(lats, lons, data, grid_size_deg):
    deg2rad = pi/180.
    new_lats = np.linspace(grid_size_deg, 180, 180) * deg2rad
    new_lons = np.linspace(grid_size_deg, 360, 360) * deg2rad
    knotst, knotsp = new_lats.copy(), new_lons.copy()
    knotst[0] += .0001
    knotst[-1] -= .0001
    knotsp[0] += .0001
    knotsp[-1] -= .0001
    lut = LSQSphereBivariateSpline(lats.ravel(), lons.ravel(),
                                   data.T.ravel(), knotst, knotsp)
    data_interp = lut(new_lats, new_lons)
    return data_interp
The arrays I'm using as arguments when I call the above subroutine all fit the requirements of LSQSphereBivariateSpline as listed in the documentation. When I run it, it takes much longer than I feel it should to process a 180x360 dataset.
When I run the script with python -m trace --trace, the last line of output before nothing happens for a long time is
fitpack2.py(1025): w=w, eps=eps)
As far as I can tell, line 1025 of fitpack2.py is in a comment, which is even more confusing.
So my questions are:
1. Is there a way to tell if it's hanging or just very slow?
2. If it's hanging, how might I fix it?
The only thing I can think of is that I have no idea what I'm doing as far as choosing knots. Is there a good way to choose those? I just went with the grid I'll be interpolating to later, since the example in the doc seemed to be an arbitrary grid.
UPDATE: It finally finished after about 3 hours, but the "interpolated data" looked like random noise. Also, if this is relevant: as far as I can tell, LSQSphereBivariateSpline is the only function I can use for this, because my lats and lons are not strictly increasing.
Also, I should add that when it finished, it printed the following warning:

Warning (from warnings module):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/interpolate/fitpack2.py", line 1029
    warnings.warn(message)
UserWarning:
WARNING. The coefficients of the spline returned have been computed as the
minimal norm least-squares solution of a (numerically) rank
deficient system (deficiency=16336, rank=48650). Especially if the rank
deficiency, which is computed by 6+(nt-8)*(np-7)+ier, is large,
the results may be inaccurate. They could also seriously depend on
the value of eps.
SOLUTION: I had far too many knots, which caused both the glacial pace and the useless results.
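For illustration, a sketch of what the fix looks like with a deliberately coarse knot grid (one knot every 10 degrees is an arbitrary choice; lats, lons, and data are the same arrays as in the question, already in radians). The least-squares system stays small and well-conditioned, and the spline can still be evaluated on the full 1-degree grid:

import numpy as np
from scipy.interpolate import LSQSphereBivariateSpline

deg2rad = np.pi/180.
knotst = np.linspace(10, 170, 17) * deg2rad   # interior colatitude knots, in (0, pi)
knotsp = np.linspace(10, 350, 35) * deg2rad   # interior longitude knots, in (0, 2*pi)
lut = LSQSphereBivariateSpline(lats.ravel(), lons.ravel(),
                               data.T.ravel(), knotst, knotsp)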