The smooth.spline function in R allows a tradeoff between roughness (as defined by the integrated square of the second derivative) and fitting the points (as defined by the sum of squared residuals). This tradeoff is controlled by the spar or df parameter. At one extreme you get the least-squares line; at the other you get a very wiggly curve that interpolates all of the data points (or their means, if you have duplicated x values with different y values).
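In other words, the fitted curve minimizes the penalized residual sum of squares

\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int f''(t)^2 \, dt,

where the roughness weight \lambda is what spar or df ultimately controls.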
I have looked at scipy.interpolate.UnivariateSpline and other spline variants in Python; however, they seem to trade off only by increasing the number of knots and setting a threshold (called s) on the allowed sum of squared residuals. By contrast, smooth.spline in R allows knots at all the x values without necessarily producing a wiggly curve that hits all the points -- the penalty comes from the second derivative.
Does Python have a spline fitting mechanism that behaves in this way? Allowing all knots but penalizing the second derivative?
You can use R functions in Python with rpy2:
import numpy as np
import matplotlib.pyplot as plt
import rpy2.robjects as robjects
r_y = robjects.FloatVector(y_train)
r_x = robjects.FloatVector(x_train)
r_smooth_spline = robjects.r['smooth.spline']  # extract the R smooth.spline function
spline1 = r_smooth_spline(x=r_x, y=r_y, spar=0.7)  # run the smoothing function
ySpline = np.array(robjects.r['predict'](spline1, robjects.FloatVector(x_smooth)).rx2('y'))
plt.plot(x_smooth, ySpline)
If you want to set lambda directly: spline1 = r_smooth_spline(x=r_x, y=r_y, lambda=42) doesn't work, because lambda already has another meaning in Python, but there is a solution: How to use the lambda argument of smooth.spline in RPy WITHOUT Python interpreting it as lambda.
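A workaround along those lines (a sketch only; the value 0.5 is arbitrary) is to pass the argument via dictionary unpacking, since the clash is only with Python's keyword syntax:

# 'lambda' is a reserved word in Python, so pass it as a string key instead
spline2 = r_smooth_spline(x=r_x, y=r_y, **{'lambda': 0.5})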
To get the code running you first need to define the data x_train and y_train, and you can define x_smooth = np.array(np.linspace(-3, 5, 1920)) if you want to plot the spline between -3 and 5 at Full-HD resolution.
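For instance, a minimal self-contained setup could look like this (the noisy-sine data is just a made-up placeholder):

import numpy as np

# made-up example data; replace with your own observations
x_train = np.linspace(-3, 5, 50)
y_train = np.sin(x_train) + np.random.normal(scale=0.3, size=x_train.size)

# dense grid on which to evaluate and plot the fitted spline between -3 and 5
x_smooth = np.array(np.linspace(-3, 5, 1920))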
Note that this code is not fully compatible with Jupyter notebooks for the latest versions of rpy2. You can fix this by using !pip install -Iv rpy2==3.4.2, as described in "NotImplementedError: Conversion 'rpy2py' not defined for objects of type '<class 'rpy2.rinterface.SexpClosure'>' only after I run the code twice".
I've been looking for exactly the same thing, but would rather not have to translate the code to Python. The Splinter package seems like an option: https://github.com/bgrimstad/splinter
The original problem
While translating MATLAB code to python, I have the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike, which returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be linked to the confidence intervals as well, but my statistics background is not strong enough to tell whether that is true.
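(For reference, the standard asymptotic construction, which I believe is essentially what gpfit/gplike rely on, is \hat{\theta}_i \pm z_{1-\alpha/2} \sqrt{[I(\hat{\theta})^{-1}]_{ii}}, i.e. each estimate plus or minus a normal quantile times the square root of the corresponding diagonal element of the inverse Fisher information matrix.)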
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stackoverflow for 2 days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the generalized pareto fit, the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats: While scipy.stats provides nice fitting performance, this library does not offer a way to calculate confidence intervals on the parameter estimates of the distribution fitter.
scipy.optimize.curve_fit: As an alternative, I have seen it suggested to use scipy.optimize.curve_fit instead, as this does provide the estimated covariance of the fitted parameters. However, that fitting method uses least squares, whereas I need to use MLE, and I did not see a way to make curve_fit use MLE instead. Therefore it seems that I cannot use curve_fit.
statsmodels.GenericLikelihoodModel: Next I found a suggestion to use statsmodels.GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Data contains 24 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
    nparams = 2

    def loglike(self, params):
        # params = (shape, loc, scale)
        return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()
res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping: Lastly, I found mentions of bootstrapping. While I roughly understand the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000, for example) from your data set (24 points in my case), do a fit on each resample, and record the fit result. Then do a statistical analysis on the N results, i.e. calculate the mean, standard deviation and then the confidence interval, like Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution (a rough sketch of this is included after the next paragraph). I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
However I need my code to run fast, and I am not sure if any implementation that I make will do this calculation fast.
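For concreteness, a bare-bones percentile-bootstrap version of the above idea might look like this (using the data array above; 1000 resamples and 95% percentile intervals are arbitrary choices, not a recommendation):

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
n_boot = 1000
boot_params = np.empty((n_boot, 2))  # columns: shape, scale (loc fixed at 0)

for i in range(n_boot):
    resample = rng.choice(data, size=len(data), replace=True)  # resample with replacement
    c, loc, scale = genpareto.fit(resample, floc=0)
    boot_params[i] = (c, scale)

# 95% percentile confidence intervals for the shape and scale estimates
ci_low, ci_high = np.percentile(boot_params, [2.5, 97.5], axis=0)
print('shape CI:', (ci_low[0], ci_high[0]))
print('scale CI:', (ci_low[1], ci_high[1]))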
Conclusions: Does anyone know of a Python function that calculates this in an efficient manner, or can anyone point me to a topic where this has already been explained in a way that works for my case?
I had the same issue with GenericLikelihoodModel, and I came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers) which suggests using different starting parameter values to get a result with an invertible Hessian. That solved my problem.
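In the example above, that would amount to something like the following (the specific numbers are arbitrary guesses, not recommended values):

# hand-pick a different starting point instead of the scipy fit result
res = Genpareto(data).fit(start_params=(0.1, 0.0, 1.2))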
I'm attempting to fit a 2D point cloud (x and y coordinates). So far, I've had limited success with scipy's interpolation packages, most notably UnivariateSpline, which produced the following (sub-optimal) fit (please ignore the colors):
However, this is obviously the wrong package, since I need a final curve that can bend back in on itself (so no longer a 1d function) near the edges of my parabolic point-cloud.
I then read up on interp2d, for example, but don't understand what my z array would be. Is there a better class of packages that I'm perhaps overlooking?
Update 1: as suggested in the comments, I've redone this using scipy.interpolate.splprep; my general setup is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splprep, splev

pts = np.vstack((X.ravel(), Y.ravel()))  # X and Y contain my points
(tck, u), fp, ier, msg = splprep(pts, u=None, per=0, k=3, full_output=True)  # s = optional parameter (default used here)
print('Spline score:', fp)  # goodness of fit flatlines after a given s value (and higher), which captures the default s-value as well

u_new = np.linspace(u.min(), u.max(), 1000)  # u_new was not defined in the original snippet; a dense parameter grid is one reasonable choice
x_new, y_new = splev(u_new, tck, der=0)
plt.plot(x_new, y_new, 'k')
plt.show()
The plot is below. Can anyone suggest a method for automating the choice of s, possibly a loop that assesses the coefficient of determination for each s value? Or is there something baked in?
Update 2:
I've since re-run this on two different point clouds, and found that the ordering of the points significantly alters the outcome. When I re-order the points in the point cloud to lie along an initial parabolic fit, I get much better results with my spline. Even so, the results are still sub-optimal, as can be seen below.
Are there any further adjustments I can make with this method? Alternatively, does anyone have a suggestion for a competitive approach I can investigate?
Update 3: actually, setting knots = 5 helps out tremendously:
I faced a similar issue, and the solution to my problem was to use Principal Curves (https://hastie.su.domains/Papers/Principal_Curves.pdf).
This implementation available on GitHub could help you: https://github.com/zsteve/pcurvepy.
I need to transfer some MATLAB code into Python. I got stuck at a numpy.arange call that I use to set points continuously on the arc of a circle with a given angle (in radians).
I got this far (example for points on x-axis):
import numpy as np

def sensor_data_arc_x():
    # angle and radius are assumed to be defined elsewhere in the module
    theta = np.arange(0, angle/2, 2*np.pi/360)
    return np.multiply(radius, np.cos(np.transpose(theta)))
I know numpy.arange does not include the endpoint, although the Matlab equivalent does; the array is always one element short, which is messing up my further computations.
Is there an way to include the endpoint?
I recommend that you work through a tutorial on for loops -- the information you need is there, as well as other hints on using controlled iteration. To solve your immediate need, simply increase your upper bound by one loop increment:
inc = 2*np.pi/360
theta = np.arange(0, angle/2 + inc, inc)
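A quick check of what this does (the angle value here is just an illustrative placeholder):

import numpy as np

angle = np.pi / 2            # placeholder value; use whatever the rest of the code defines
inc = 2 * np.pi / 360        # one-degree steps in radians
theta = np.arange(0, angle / 2 + inc, inc)

print(np.isclose(theta, angle / 2).any())  # True: the endpoint angle/2 is now covered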
I am working on translating a model from MATLAB to Python. The crux of the model lies in MATLAB's ode15s. In the MATLAB execution, the ode15s has standard options:
options = odeset();
[t, P] = ode15s(@MODELfun, tspan, y0, options, params);
For reference, y0 is a vector of size 98, as is the output of MODELfun.
My Python attempt at an equivalent is as follows:
import scipy.integrate

ode15s = scipy.integrate.ode(Model.fun)
ode15s.set_integrator('vode', method='bdf', order=15)
ode15s.set_initial_value(y0).set_f_params(params)
dt = 1
while ode15s.successful() and ode15s.t < duration:
    ode15s.integrate(ode15s.t + dt)
This though, does not seem to be working. Any suggestions, or an alternative?
Edit:
After looking at the output, the result I'm getting from the Python is either no change in some elements of y0 over time, or a constant change at each step for the rest of the y0. Any experience with something like this?
According to the SciPy wiki for Matlab Users, the right way to use the ode15s equivalent is
scipy.integrate.ode(f).set_integrator('vode', method='bdf', order=15)
One point to make clear is that, unlike Matlab's ode15s, the scipy integrator 'vode' does not support models with a mass matrix. So any recommendation should include this caveat.
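For completeness, here is a minimal self-contained sketch of that pattern; the right-hand side f, the step size and the horizon are made-up placeholders standing in for MODELfun and the real time grid:

import numpy as np
from scipy.integrate import ode

def f(t, y):
    # toy right-hand side standing in for MODELfun; replace with the real model
    return [-50.0 * (y[0] - np.cos(t))]

solver = ode(f)
solver.set_integrator('vode', method='bdf', order=15)  # as in the wiki snippet quoted above
solver.set_initial_value([0.0], 0.0)

dt, t_end = 0.1, 5.0
ts, ys = [0.0], [0.0]
while solver.successful() and solver.t < t_end:
    solver.integrate(solver.t + dt)
    ts.append(solver.t)
    ys.append(solver.y[0])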
I am trying to estimate a few parameters using constrained maximum likelihood in R, more specifically with constrOptim() from the stats package. I am programming in Python and calling R via RPy2.
In my model, I am assuming that the data follow the Beta-distribution, so I created a simulated dataset by using prespecified values for the parameters and now I am trying to estimate these parameters in order to verify that my estimation program works fine.
What I have observed is that my estimation is quite sensitive to the initial parameters. For example, I have 11 parameters to estimate (let's call them pam1..pam11) and their true values are:
pam1=0.2, pam2=0.3, pam3=0.4, pam4=0.7, pam5=0.55, pam6=0.45, pam7=0.1, pam8=0.01, pam9=0.01, pam10=45, pam11=45
In the constrOptim() I am setting the starting parameters as:
start_param=FloatVector((pam1,pam2,pam3,pam4,pam5,pam6,pam7,pam8,pam9,pam10,pam11))
where I set the starting values. I have observed that when I am using different sets of starting values the results change. For example when I am using the set
start_param=FloatVector((0.2,0.3,0.4,0.6,0.7,0.8,0.3,0.011,0.011,15,15))
and I obtain the following estimates
$par
[1]  0.20851065  0.30348571  0.43616932  0.73695654  0.58287221  0.45541506
[7]  0.11191879  0.02233908  0.01988878 46.57249043 45.48544918
$value
[1] -215.9711
$convergence
[1] 0
but when I am using another set as for example:
start_param=FloatVector((0.2,0.3,0.4,0.75,0.55,0.45,0.3,0.05,0.05,59,59))
the results change and it seems that I am losing convergence
$par
[1] 0.17218738 0.27165359 0.48458978 0.80295773 0.62618983 0.43254786
[7] 0.12426385 0.02991442 0.01853252 57.78269692 59.35376216
$value
[1] -146.9858
$convergence
[1] 1
My question is the following:
I have seen that Stata has an option that searches for better starting values for the numerical optimization algorithm. I tried to supply multiple sets of starting values as a matrix, but this did not work.
Is there an option in constrOptim that will allow me to do something like this?
Many thanks in advance.
For additional information, the specification I use for the constrOptim() is:
res = statsr.constrOptim(start_param, Rmaxlikelihood, grad='NULL', ui=ui, ci=ci, method="Nelder-Mead", control=list("maxit=3000,trace=F"))
I came across a function in R which does exactly what I was looking for.
The package ‘Rsolnp’ has the function gosolnp, which is described as performing random initialization and multiple restarts of the solnp solver.
It is quite efficient, and the documentation provides examples of how to use it.
More: http://cran.r-project.org/web/packages/Rsolnp/Rsolnp.pdf