LMFIT confidence interval uncertainty estimates error in Python

The error output is:
MinimizerException:
Cannot determine Confidence Intervals without sensible uncertainty estimates
Why did I get this error? How can I calculate uncertainty estimates and solve this problem?
import glob

import numpy as np
import lmfit
from lmfit.models import VoigtModel
from scipy.special import wofz

for dosya1 in glob.glob("mean*"):
    data1 = np.genfromtxt(dosya1, names=["wavelength", "mean"])
    x = data1["wavelength"]
    y = data1["mean"]

    mod = VoigtModel()
    pars = mod.guess(y, x=x)
    pars['gamma'].set(value=0.7, vary=True, expr="")
    out = mod.fit(y, pars, x=x)

    pars = lmfit.Parameters()
    pars.add_many(('amp', out.params["amplitude"].value),
                  ('sig', out.params["sigma"].value),
                  ('gam', out.params["gamma"].value),
                  ('cent', out.params["center"].value))

    def residual(p):
        amp = p["amp"].value
        sig = p["sig"].value
        gam = p["gam"].value
        cent = p["cent"].value
        # Voigt profile: real part of the Faddeeva function wofz(),
        # evaluated at z = (x - cent + i*gam) / (sig*sqrt(2))
        z = (x - cent + 1j*gam) / (sig*np.sqrt(2))
        return amp*wofz(z).real / (sig*np.sqrt(2*np.pi)) - y

    mini = lmfit.Minimizer(residual, pars)
    result = mini.minimize()
    ci = lmfit.conf_interval(mini, result)
    lmfit.printfuncs.report_ci(ci)

You will get this error message if lmfit.minimize() (actually leastsq(), which it calls) is unable to estimate uncertainties by inverting the curvature matrix. conf_interval() uses these values (which are often very good estimates, by the way) as the scale for explicitly exploring parameter space. There are several possible reasons why leastsq() might fail to estimate uncertainties; common ones are that one or more of the variables does not alter the fit, or that the residual contains NaNs.
It is hard to predict when this might happen. You should allow for the possibility and check that the initial fit succeeded in making initial estimates of the uncertainties (check result.errorbars) before calling conf_interval(), as in the sketch below.
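A minimal guard, assuming the mini and result objects from the question's code:
result = mini.minimize()
if result.errorbars:
    # leastsq() produced uncertainty estimates, so conf_interval() can use
    # them as scales for exploring parameter space
    ci = lmfit.conf_interval(mini, result)
    lmfit.printfuncs.report_ci(ci)
else:
    # no sensible uncertainties: look for NaNs in the residual or for
    # variables that do not alter the fit
    print(lmfit.fit_report(result))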

Related

Can I tell scipy's curve_fit to find the best parameters that meet some conditions?

I have this set of experimental data:
import numpy as np
from scipy.optimize import curve_fit

x_data = np.array([0, 2, 5, 10, 15, 30, 60, 120])
y_data = np.array([1.00, 0.71, 0.41, 0.31, 0.29, 0.36, 0.26, 0.35])
t = np.linspace(min(x_data), max(x_data), 151)
[scatter plot of x_data vs y_data]
I want to fit them with a curve that follows an exponential behaviour for t < t_lim and a linear behaviour for t > t_lim, where t_lim is a value I can set as I want. I want to use curve_fit to find the best fit subject to two conditions:
1. The end point of the first (exponential) part must be the starting point of the second (linear) part: in other words, no jump discontinuity in the middle.
2. The second (linear) part must be descending.
I tried the following:
t_lim = 15

def y(t, k, m, q):
    # exponential branch for t < t_lim, linear branch for t >= t_lim
    return np.concatenate((np.exp(-k*t)[t < t_lim], (m*t + q)[t >= t_lim]))

popt, pcov = curve_fit(y, x_data, y_data, p0=[0.5, -0.005, 0.005])
y_model = y(t, *popt)
I obtain this kind of curve:
[plot of the resulting piecewise fit]
I don't know how to tell Python to find the best values of m, k, q that meet the two conditions (no jump discontinuity, and m < 0).
Instead of trying to add these conditions as explicit constraints, I'd modify the form of y so that they are always satisfied.
For example, try replacing m with -m**2. That way, the coefficient of the linear part will always be negative.
For the continuity condition, how about this: for an exponential with a given decay factor and a linear curve with a given slope that are supposed to meet at a given t_lim, there is exactly one value of q that satisfies the condition, namely q = exp(-k*t_lim) - m*t_lim. You can compute that value explicitly and just plug it in.
Basically, q won't be a fit parameter anymore; instead, inside y, you compute the correct q from k, m and t_lim, as in the sketch below.
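A minimal sketch of this reparameterization, reusing t_lim, x_data and y_data from the question (m_sqrt is a hypothetical name for the unconstrained parameter whose negated square gives the slope):
def y(t, k, m_sqrt):
    m = -m_sqrt**2                  # the slope is always negative
    q = np.exp(-k*t_lim) - m*t_lim  # continuity at t = t_lim fixes q
    # np.where evaluates both branches and picks per element,
    # so this also works for unsorted t
    return np.where(t < t_lim, np.exp(-k*t), m*t + q)

popt, pcov = curve_fit(y, x_data, y_data, p0=[0.5, 0.07])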
This post is not a direct answer to the question. This is a preliminary study.
First: fitting a simple exponential function plus a constant (without a decreasing or increasing linear part).
The result is not bad, considering the wide scatter on the right part.
Second: fitting an exponential function plus a linear function (without taking the expected decrease on the right into account).
The slope of the linear part is very low: 0.000361.
But the slope is positive, which is not what is wanted.
Since the scatter is very large, one suspects that the slope of the linear function might be governed mainly by the scatter. To check this hypothesis, the same fit was computed without one point. Taking only the first seven points (that is, dropping the eighth point), the result is:
Now the slope is negative, as wanted. But this is a misleading result.
Of course, if some technical reason implies that the slope is necessarily negative, one could use a piecewise function made of an exponential and a linear function. But what would be the credibility of such a model?
This doesn't answer the question. Nevertheless, I hope this inspection will be of interest.
For information:
The usual nonlinear regression methods are often non-convergent in cases of large scatter, due to the difficulty of setting initial values of the parameters sufficiently close to the unknown correct values. To avoid this difficulty, the above fits were made with an unusual method that doesn't require "guessed" initial values. For the principle, refer to: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
In the referenced document, the case of an exponential plus a linear function isn't fully treated. To fill this gap, the method is shown below with the numerical calculation (MathCAD).
If more accuracy is needed, use nonlinear regression software with the values of p, a, b, c found above as initial values to start the iterative calculation.

Problems with curve_fit from scipy.optimize

I know that there are some similar questions, but since none of them got me any further, I decided to ask my own.
I am sorry if the answer to my problem is already out there, but I really couldn't find it.
I tried fitting f(x) = a*x**b to rather linear data using curve_fit. It runs without errors, but the result is way off, as the plot I made shows.
The thing is, I don't really know what I am doing, but on the other hand fitting is always more of an art than a science, and there was at least one general bug in scipy.optimize.
My data looks like this:
x-values:
[16.8, 2.97, 0.157, 0.0394, 14.000000000000002, 8.03, 0.378, 0.192, 0.0428, 0.029799999999999997, 0.000781, 0.0007890000000000001]
y-values:
[14561.766666666666, 7154.7950000000001, 661.53750000000002, 104.51446666666668, 40307.949999999997, 15993.933333333332, 1798.1166666666666, 1015.0476666666667, 194.93800000000002, 136.82833333333332, 9.9531566666666684, 12.073133333333333]
That's my code (using a really nice example in the last answer to that question):
import numpy as np
from scipy.optimize import curve_fit

# xvalues and yvalues are numpy arrays holding the data listed above

def func(x, p0, p1):  # a function that we think will follow the data distribution
    return p0*(x**p1)

# initial parameters p0, which curve_fit iterates over to find the best fit
popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0))  # alternatively p0=(3107, 0.944); these are user defined
print(popt)  # contains the two best-fit parameters

# performing the sum of squares
p0, p1 = popt
residuals = yvalues - func(xvalues, p0, p1)
fres = np.sum(residuals**2)
print('chi-square')
print(fres)  # this is the chi-square value

xaxis = np.linspace(5e-4, 20)  # we could plot with xdata, but the fit would not look good
curve_y = func(xaxis, p0, p1)
The starting values are from a gnuplot fit that looks plausible, but I need to cross-check.
This is the printed output (first the fitted p0 and p1, then chi-square):
[ 4.67885857e+03 6.24149549e-01]
chi-square
424707043.407
I guess this is a difficult question, so many thanks in advance!
When fitting, curve_fit minimizes the sum of (data - model)^2 / error^2.
If you don't pass in errors (as you are doing here), curve_fit assumes that all of the points have an error of 1.
In this case, as your data span many orders of magnitude, the points with the largest y values dominate the objective function and cause curve_fit to fit them at the expense of the others.
The best way of fixing this would be to include the errors on your yvalues in the fit (it looks like you have them, since there are error bars in the plot you made). You can do this by passing them in as the sigma parameter of curve_fit, as sketched below.
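For example (yerrors is a hypothetical name standing in for the measurement errors behind the error bars; the 5% relative errors are for illustration only):
yerrors = 0.05*np.asarray(yvalues)  # placeholder: use your real measurement errors here

popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0),
                       sigma=yerrors, absolute_sigma=True)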
I would rethink the experimental part; two data points are questionable.
The plot you showed looks pretty good because you took the log.
You could do a linear fit on log(x) and log(y); that way you limit the impact of the largest residuals. Another approach would be robust regression (RANSAC from sklearn or least_squares from scipy). The log-log fit is sketched below.
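A minimal sketch of the log-log idea, reusing the question's data: a straight line in log space corresponds to y = p0*x**p1 in the original space, and every point gets comparable weight.
logx, logy = np.log(xvalues), np.log(yvalues)
p1, logp0 = np.polyfit(logx, logy, 1)  # slope = exponent, intercept = log(p0)
p0 = np.exp(logp0)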
Nevertheless, you should either gather more data points or repeat the measurements.

Scipy leastsq constraint by ks_2samp

I want to fit a histogram with the sum of two Gaussians, each with its own amplitude, mean and deviation. To do that, I used scipy's curve_fit, but the KS-test afterwards was awful. That was mostly because the first few values (those at the most negative x values) were not very accurate, and therefore the cumulative function was way off. I also noted that the cumulative function was off by 20%, and therefore an accurate outcome of the KS-test is impossible.
Then I tried to put a constraint on the integral, following this question. The relevant code is the following (plotting omitted):
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import leastsq

def gauss2(x, A, mu, sigma, A2, mu2, sigma2):
    if A2 < 0:
        return 1000  # crude penalty against a negative second amplitude
    return A*np.exp(-(x-mu)**2/(2.*sigma**2)) + A2*np.exp(-(x-mu2)**2/(2.*sigma2**2))

def residuals(p, x, y):
    # penalize deviations of the total integral from 1
    integral = quad(gauss2, -300, 300, args=(p[0], p[1], p[2], p[3], p[4], p[5]))[0]
    penalization = abs(1 - integral)*10000
    print(penalization)
    return y - gauss2(x, p[0], p[1], p[2], p[3], p[4], p[5]) - penalization

hist, bin_edges = np.histogram(data, density=True, bins=bins)
hist_cm = np.cumsum(hist)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
coeff, ier = leastsq(residuals, x0=(0.01, 0., 60., 0.01, 150., 40.),
                     args=(bin_centres, hist))
hist_fit = gauss2(bin_centres, *coeff)
hist_fit_cm = np.cumsum(hist_fit)
KStest = stats.ks_2samp(hist_cm, hist_fit_cm)
This results in a pretty good estimate and a P-value of 0.629. As far as I know, this means that the histogram and the fit have a 62.9% chance of coming from the same data; is this correct?
Now I thought I could improve the answer by penalizing not the integral but the P-value. For this I changed residuals to the following:
def residuals(p, x, y):
    global bin_centres  # defined globally above
    iets = np.cumsum(gauss2(bin_centres, p[0], p[1], p[2], p[3], p[4], p[5]))
    pizza = stats.ks_2samp(np.cumsum(y), iets)[1]  # P-value of the KS-test
    penalization = 1000*(1 - pizza)
    return y - gauss2(x, p[0], p[1], p[2], p[3], p[4], p[5]) - penalization
Since the P-value (which I call pizza) should be as close to 1 as possible, the penalization becomes smaller with a higher P-value. But this gives results that make less sense: the P-value turns out to be 0.160. When plotting it's even worse: two spikes instead of the smooth fit I obtained with the first method.
Is a KS-test a good penalization method, instead of the integral? How can I implement it in a good way, then?
(A brief answer, as far as I understand from reading the code.)
The first penalization, penalization = abs(1-integral)*10000, is a constraint on the total integral. I think this is the same as imposing A + A2 == 1 so the mixture in gauss2 integrates to one. An alternative without constraints would be to impose this directly, for example by using a logit function for the mixing probability, as sketched below.
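A sketch of that idea, assuming scipy's expit as the logistic function (theta is a hypothetical unconstrained parameter that replaces the two amplitudes):
from scipy.special import expit

def gauss2_mix(x, theta, mu, sigma, mu2, sigma2):
    # expit maps any real theta to a mixing weight w in (0, 1), so the
    # weighted sum of two normalized Gaussians always integrates to one
    w = expit(theta)
    g1 = np.exp(-(x - mu)**2/(2.*sigma**2)) / (sigma*np.sqrt(2*np.pi))
    g2 = np.exp(-(x - mu2)**2/(2.*sigma2**2)) / (sigma2*np.sqrt(2*np.pi))
    return w*g1 + (1 - w)*g2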
The Kolmogorov-Smirnov penalization uses the maximum absolute deviation between the empirical and the parametric cdf, approximately (*)
ks_dist = np.max(np.abs(np.cumsum(y) - iets))
The P-value is just a monotonic transformation of this distance, but it will have a different curvature and will penalize differently.
(*) The actual calculation looks at all the step points directly.
As an aside: the Kolmogorov-Smirnov test is designed for continuous, not discrete or binned, variables. The appropriate distance measure here would be based on a chi-square test or power divergence. However, this only matters when ks_2samp is used as a hypothesis test, not if we just use it as a distance measure.
Another aside: the use of integrate.quad could be replaced by using norm.cdf directly, as sketched below.
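A sketch of the closed form: each Gaussian term contributes its amplitude times sigma*sqrt(2*pi) times the cdf mass inside the integration window.
from scipy.stats import norm

def gauss2_integral(A, mu, sigma, A2, mu2, sigma2, lo=-300, hi=300):
    mass1 = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    mass2 = norm.cdf(hi, mu2, sigma2) - norm.cdf(lo, mu2, sigma2)
    return A*sigma*np.sqrt(2*np.pi)*mass1 + A2*sigma2*np.sqrt(2*np.pi)*mass2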

Mathematical background of statsmodels wls_prediction_std

wls_prediction_std returns the standard deviation and confidence interval of my fitted model data. I need to know how the confidence intervals are calculated from the covariance matrix. (I already tried to figure it out by looking at the source code, but wasn't able to.) I was hoping some of you could help me out by writing out the mathematical expression behind wls_prediction_std.
There should be a variation on this in any textbook, without the weights.
For OLS, Greene (5th edition, which I used) has
se^2 = s^2 (1 + x (X'X)^{-1} x')
where s^2 is the estimate of the residual variance, x is the vector of explanatory variables for which we want to predict, and X is the matrix of explanatory variables used in the estimation.
This is the prediction variance for an observation; the second part alone is the variance for the predicted mean y_predicted = x beta_estimated.
wls_prediction_std uses the variance of the parameter estimates directly.
Assuming x is fixed, y_predicted is just a linear transformation of the random variable beta_estimated, so the variance of y_predicted is just
x Cov(beta_estimated) x'
To this we still need to add the estimate of the error variance; a sketch of both steps follows.
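A minimal sketch, assuming a fitted statsmodels OLS/WLS results object res and a 1-D array of explanatory variables x_row (both hypothetical names):
import numpy as np

cov = np.asarray(res.cov_params())  # covariance of beta_estimated
var_mean = x_row @ cov @ x_row      # variance of the predicted mean x*beta
var_obs = var_mean + res.scale      # add the residual variance estimate s^2
se_obs = np.sqrt(var_obs)           # standard error for a new observation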
As far as I remember, there are estimators with better small-sample properties.
I added the weights but never managed to verify them, so the function has remained in the sandbox for years. (Stata doesn't return prediction errors with weights.)
Aside:
Using the covariance of the parameter estimate should also be correct if we use a sandwich robust covariance estimator, while Greene's formula above is only correct if we don't have any misspecified heteroscedasticity.
What wls_prediction_std doesn't take into account is that, if we have a model for the heteroscedasticity, then the error variance could also depend on the explanatory variables, i.e. on x.

Standard error in non-linear regression

I have been doing some Monte Carlo physics simulations with Python, and I am unable to determine the standard errors for the coefficients of a non-linear least-squares fit.
Initially I was using SciPy's scipy.stats.linregress for my model, since I thought it would be linear, but I noticed it is actually some sort of power function. I then used NumPy's polyfit with degree 2, but I can't find any way to determine the standard errors of the coefficients.
I know gnuplot can determine the errors for me, but I need to do fits for over 30 different cases. I was wondering if anyone knows of a way for Python to read the standard errors from gnuplot, or is there some other library I can use?
Finally found the answer to this long-standing question! I'm hoping this can at least save someone a few hours of hopeless research on this topic. SciPy has a function called curve_fit in its optimize module. It uses the least-squares method to determine the coefficients and, best of all, it gives you the covariance matrix. The diagonal of that matrix holds the variance of each coefficient, and taking the square root of those values gives the standard error of each coefficient! SciPy doesn't have much documentation for this, so here's a sample code for a better understanding:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot

def func(x, a, b, c):
    return a*x**2 + b*x + c  # Refer [1]

x = np.linspace(0, 4, 50)
y = func(x, 2.6, 2, 3) + 4*np.random.normal(size=len(x))  # Refer [2]

coeff, var_matrix = curve_fit(func, x, y)
variance = np.diagonal(var_matrix)  # Refer [3]
SE = np.sqrt(variance)  # Refer [4]

# ====== making a dictionary to print the results ======
results = {'a': [coeff[0], SE[0]], 'b': [coeff[1], SE[1]], 'c': [coeff[2], SE[2]]}
print("Coeff\tValue\t\tError")
for v, c in results.items():
    print(v, "\t", c[0], "\t", c[1])
# ====== end results printing ======

y2 = func(x, coeff[0], coeff[1], coeff[2])  # y values for the fitted model
plot.plot(x, y)
plot.plot(x, y2)
plot.show()
[1] What this function returns is critical, because it defines what will be used to fit the model.
[2] Using the function to create some arbitrary data plus some noise.
[3] Saves the covariance matrix's diagonal to a 1-D array.
[4] Taking the square root of the variance gives the standard error (SE).
It looks like gnuplot uses Levenberg-Marquardt, and there's a Python implementation available (mpfit); you can get the error estimates from the mpfit.covar attribute. (Incidentally, you should worry about what the error estimates "mean": are other parameters allowed to adjust to compensate, for example?)
