Standard error in non-linear regression - python

I have been doing some Monte Carlo physics simulations with Python and I am unable to determine the standard error for the coefficients of a non-linear least-squares fit.
Initially I was using SciPy's scipy.stats.linregress for my model since I thought it would be linear, but it turned out to be some sort of power function. I then used NumPy's polyfit with degree 2, but I can't find any way to determine the standard error of the coefficients.
I know gnuplot can determine the errors for me, but I need to do fits for over 30 different cases. Does anyone know of a way for Python to read the standard errors from gnuplot, or is there some other library I can use?

Finally found the answer to this long-standing question! I hope this saves someone a few hours of fruitless research. SciPy has a function called curve_fit in its optimize module. It uses least squares to determine the coefficients and, best of all, it returns the covariance matrix. The diagonal of that matrix contains the variance of each coefficient, so taking the square root of the diagonal gives the standard error of each coefficient. SciPy's documentation on this is sparse, so here's a sample code for a better understanding:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot

def func(x, a, b, c):
    return a*x**2 + b*x + c  # Refer [1]

x = np.linspace(0, 4, 50)
y = func(x, 2.6, 2, 3) + 4*np.random.normal(size=len(x))  # Refer [2]

coeff, var_matrix = curve_fit(func, x, y)
variance = np.diagonal(var_matrix)  # Refer [3]
SE = np.sqrt(variance)  # Refer [4]

# ====== Making a dictionary to print results ========
results = {'a': [coeff[0], SE[0]], 'b': [coeff[1], SE[1]], 'c': [coeff[2], SE[2]]}
print("Coeff\tValue\t\tError")
for name, (value, err) in results.items():
    print(name, "\t", value, "\t", err)
# ====== End results printing ========

y2 = func(x, *coeff)  # y values for the fitted model
plot.plot(x, y)
plot.plot(x, y2)
plot.show()
[1] What this function returns is critical, because it defines what will be used to fit the model.
[2] Using the function to create some arbitrary data plus some noise.
[3] Saves the covariance matrix's diagonal to a plain 1-D array.
[4] Taking the square root of the variance gives the standard error (SE).
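One follow-up worth knowing (a short sketch reusing func, x and y from above; the error array here is made up): curve_fit can also take per-point uncertainties via sigma, and absolute_sigma=True keeps the covariance in absolute units instead of rescaling it by the reduced chi-square.
yerr = np.full_like(x, 4.0)  # hypothetical 1-sigma measurement errors
coeff, var_matrix = curve_fit(func, x, y, sigma=yerr, absolute_sigma=True)
SE = np.sqrt(np.diagonal(var_matrix))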

It looks like gnuplot uses Levenberg-Marquardt, and there's a Python implementation (mpfit) available - you can get the error estimates from the mpfit.covar attribute. (Incidentally, you should think about what the error estimates "mean" - are other parameters allowed to adjust to compensate, for example?)
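If you would rather stay inside SciPy than pull in mpfit, here is a minimal sketch of the same idea (not the mpfit API): run a Levenberg-Marquardt fit with scipy.optimize.least_squares and build the covariance estimate from the returned Jacobian. The power-law model and data below are made up for illustration.
import numpy as np
from scipy.optimize import least_squares

def residuals(p, x, y):
    a, b = p
    return a * x**b - y  # power-law model, just as an example

x = np.linspace(1, 10, 40)
y = 2.5 * x**1.3 + 0.5 * np.random.normal(size=x.size)

fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y), method='lm')

# Covariance estimate: (J^T J)^-1 scaled by the residual variance
dof = len(y) - len(fit.x)
s_sq = 2 * fit.cost / dof              # fit.cost is 0.5 * sum(residuals**2)
cov = np.linalg.inv(fit.jac.T @ fit.jac) * s_sq
stderr = np.sqrt(np.diag(cov))
print(fit.x, stderr)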

Related

Is there any way to calculate the parameter covariance matrix, instead of the correlation matrix, in the Python lmfit package?

I have re-parameterized the Arrhenius equation into the form k = kref*exp(-(E/R)*((1/T) - (1/Tref))) and I wanted to estimate the parameters E and kref, which I got from the lmfit package, along with the correlations between them.
However, the whole point of the re-parameterization was to see whether we get a lower correlation between k0 and E than with the original Arrhenius equation, i.e. k = k0*exp(-E/(R*T)), where kref = k0*exp(-E/(R*Tref)). To check that, I derived the following relation:
Cov(k0,E)/k0 = Var(E)/(R*Tref) - Cov(kref,E)/kref
So my question is: is there any way we can find Var(E) and also the standard deviation of kref?
I'm not certain I understand what you are asking - your notation is not clear. It is really always better to provide a minimal and complete example of working code.
The fit result from lmfit does include the covariance matrix (covar) between the variable parameters. Is that what you are looking for?
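For reference, here is a minimal sketch (with made-up data and the re-parameterized Arrhenius form from the question) of where the covariance matrix and the per-parameter standard errors live on an lmfit ModelResult:
import numpy as np
from lmfit import Model

R = 8.314
Tref = 350.0

def arrhenius(T, kref, E):
    return kref * np.exp(-(E / R) * (1.0 / T - 1.0 / Tref))

T = np.linspace(300, 400, 25)
k = arrhenius(T, 2.0, 50000.0) * (1 + 0.02 * np.random.normal(size=T.size))  # fake data

model = Model(arrhenius)
result = model.fit(k, T=T, kref=1.0, E=40000.0)

print(result.covar)                  # covariance matrix of the varying parameters
print(result.params['E'].stderr)     # standard deviation of E, i.e. sqrt(Var(E))
print(result.params['kref'].stderr)  # standard deviation of kref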

Constraining OLS (or WLS) coefficients using statsmodels

I have a regression of the form model = sm.GLM(y, X, w=weight), which ends up being a simple weighted OLS. (Note that specifying w as the array of error weights actually works in sm.GLM identically to sm.WLS, despite it not being in the documentation.)
I'm using GLM because it allows me to impose additional constraints via fit_constrained(). My X consists of 6 independent variables, and I want to constrain the resulting coefficients of 2 of them to be positive. But I cannot figure out the syntax to get fit_constrained() to work; the documentation is extremely bare and I cannot find any good examples anywhere. All I really need is the correct syntax for specifying these constraints. Thanks!
fit_constrained is meant for linear equality constraints, i.e. requiring that some linear combination of your coefficients equals a given value; it is not meant for defining bounds.
The closest you can get is to use SciPy's least_squares and define the bounds there. For example, we set up a dataset with 6 covariates:
from scipy.optimize import least_squares
import numpy as np
np.random.seed(100)
x = np.random.uniform(0,1,(30,6))
y = np.random.normal(0,2,30)
The objective function does the matrix multiplication and returns the residuals:
def fun(b, x, y):
    return b[0] + np.matmul(x, b[1:]) - y
The first coefficient is the intercept. Let's say we require the coefficients on the 1st and 6th covariates (b[1] and b[6]) to be non-negative:
res_lsq = least_squares(fun, [1, 1, 1, 1, 1, 1, 1], args=(x, y),
                        bounds=([-np.inf, 0, -np.inf, -np.inf, -np.inf, -np.inf, 0], +np.inf))
And we check the result:
res_lsq.x
array([-1.74342242e-01, 2.09521327e+00, -2.02132481e-01, 2.06247855e+00,
-3.65963504e+00, 6.52264332e-01, 5.33657765e-20])
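For completeness, here is roughly what fit_constrained itself is for: imposing linear equality constraints on the coefficients. This is a sketch with made-up column names, just to show the syntax; it is not a solution to the bounds problem above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(size=(30, 3)), columns=['x1', 'x2', 'x3'])
X = sm.add_constant(X)
y = 1 + X['x1'] + 2 * X['x2'] - X['x3'] + rng.normal(scale=0.5, size=30)

model = sm.GLM(y, X)
# Force the coefficients of x1 and x2 to sum to 3 (a linear *equality*, not a bound)
res = model.fit_constrained('x1 + x2 = 3')
print(res.params)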

statsmodels in Python: how exactly are duplicated features handled?

I am a heavy R user and am currently learning Python.
I have a question about how statsmodels.api handles duplicated features.
In my understanding, this function is the Python counterpart of glm in R, so I am expecting it to return the maximum likelihood estimates (MLE).
My question is: which algorithm does statsmodels employ to obtain the MLE?
In particular, how does that algorithm handle duplicated features?
To clarify my question, I generate a sample of size 50 from a Bernoulli distribution with a single covariate x1.
import statsmodels.api as sm
import pandas as pd
import numpy as np
def ilogit(eta):
    return 1.0 - 1.0/(np.exp(eta)+1)
## generate samples
Nsample = 50
cov = {}
cov["x1"] = np.random.normal(0,1,Nsample)
cov = pd.DataFrame(cov)
true_value = 0.5
resp = {}
resp["FAIL"] = np.random.binomial(1, ilogit(true_value*cov["x1"]))
resp = pd.DataFrame(resp)
resp["NOFAIL"] = 1 - resp["FAIL"]
Then fit the logistic regression as:
## fit logistic regression
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This returns a summary table. The estimated coefficient is more or less similar to the true value (= 0.5).
Then I create a duplicate column, namely x2, and fit the logistic regression model again. (glm in R package would return NA for x2)
cov["x2"] = cov["x1"]
fit = sm.GLM(resp,cov,family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
This outputs another summary table. Surprisingly, it works, and the coefficient estimates of x1 and x2 are exactly identical (= 0.1182). Since the previous fit returned a coefficient estimate for x1 of 0.2364, the estimate has been halved.
Then I increase the number of duplicated features to 9 and fit the model:
for icol in range(3, 10):
    cov["x" + str(icol)] = cov["x1"]
fit = sm.GLM(resp, cov, family=sm.families.Binomial(sm.families.links.logit)).fit()
fit.summary()
As expected, the estimates of each duplicated variable are the same (0.0263) and they seem to be 9 times smaller than the original estimate for x1 (0.2364).
I am surprised by this behaviour of the maximum likelihood estimates. Could you explain why this is happening, and also what kind of algorithm is employed behind statsmodels.api?
The short answer:
GLM is using the Moore-Penrose generalized inverse, pinv, in this case, which corresponds to a principal component regression where components with zero eigenvalues are dropped. A zero eigenvalue is defined by the default threshold (rcond) in numpy.linalg.pinv.
statsmodels does not have a systematic policy towards collinearity. Some nonlinear optimization routines raise an exception when the matrix inverse fails. However, the linear regression models, OLS and WLS, use the generalized inverse by default, in which case we see the behavior as above.
The default optimization algorithm in GLM.fit is iteratively reweighted least squares irls which uses WLS and inherits the default behavior of WLS for singular design matrices.
The version in statsmodels master also has the option of using the standard scipy optimizers, where the behavior with respect to singular or near-singular design matrices will depend on the details of the optimization algorithm.
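A quick way to see the pinv behaviour in isolation (a sketch with synthetic data, independent of statsmodels): when a column is duplicated, the minimum-norm least-squares solution splits the coefficient evenly across the copies.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
y = 0.5 * x1 + rng.normal(scale=0.1, size=50)

X_single = x1[:, None]
X_dup = np.column_stack([x1, x1])     # duplicated feature

b_single = np.linalg.pinv(X_single) @ y
b_dup = np.linalg.pinv(X_dup) @ y

print(b_single)   # e.g. [0.5...]
print(b_dup)      # each entry is roughly half of b_single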

Need Python polynomial fit function that returns covariance

I want to do least-squares polynomial fits on data sets (X,Y,Yerr) and obtain the covariance matrices of the fit parameters. Also, since I have many data sets, CPU-time is an issue, so I'm seeking an analytical (=fast) solution. I found the following (non-ideal) options:
numpy.polyfit does the fit, but doesn't take into account the errors Yerr, nor does it return the covariance;
numpy.polynomial.polynomial.polyfit does accept Yerr as an input (in the form of weights), but doesn't return covariance either;
scipy.optimize.curve_fit and scipy.optimize.leastsq can be tailored to fit polynomials and return the covariance matrix, but - being iterative methods - these are much slower than the polyfit routines (which yield an analytical solution);
Does Python provide an analytical polynomial fit routine that returns the covariance of the fit parameters (or do I have to write one myself :-) ?
Update:
It appears that as of NumPy 1.7.0, numpy.polyfit not only accepts weights but also returns the covariance matrix of the coefficients. So, issue resolved! :-)
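For reference, a minimal sketch of that call (the weights convention used here, w = 1/Yerr, is the usual one for polyfit, but check it against your own error model):
import numpy as np

x = np.linspace(0, 1, 100)
yerr = np.full_like(x, 0.1)
y = 1 + 2*x + 3*x**2 + yerr * np.random.normal(size=x.size)

# degree-2 fit; w are weights (not variances), cov=True returns the covariance matrix
coeffs, cov = np.polyfit(x, y, 2, w=1/yerr, cov=True)
param_err = np.sqrt(np.diag(cov))
print(coeffs, param_err)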
You want a fast weighted least squares model that returns the covariance matrix without additional overhead? In general, the right covariance matrix will depend on the data generating process (DGP), because different DGPs (say, heteroscedastic errors) imply different distributions of the parameter estimates (think White vs. OLS standard errors). But if you can assume WLS is the right way to do it, then I believe you would use the asymptotic variance estimate for beta under WLS, (1/n X'V^-1 X)^-1, where V is the weighting matrix built from the Yerrs. That's a pretty simple formula if numpy.polynomial.polynomial.polyfit is working for you.
I looked for an online reference but couldn't find one; see Fumio Hayashi, Econometrics, 2000, Princeton University Press, pp. 133-137 for a derivation and discussion.
Update 12/4/12:
There is another Stack Overflow question that comes close:
numpy.polyfit has no keyword 'cov', which has a nice explanation (with code) of how to use scikits.statsmodels to do what you want. I believe you'll want to replace the line:
result = sm.OLS(Y,reg_x_data).fit()
with
result = sm.WLS(Y,reg_x_data, weights).fit()
where you define weights as a function of Yerr, as before with numpy.polynomial.polynomial.polyfit. More details on using WLS in statsmodels are available at
the statsmodels website.
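Putting that together, a minimal sketch of the WLS route (using the common weights = 1/Yerr**2 convention; adjust to your error model), with a Vandermonde matrix as the polynomial design:
import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 1, 100)
yerr = np.full_like(x, 0.1)
y = 1 + 2*x + 3*x**2 + yerr * np.random.normal(size=x.size)

reg_x_data = np.vander(x, 3, increasing=True)   # columns: 1, x, x**2
result = sm.WLS(y, reg_x_data, weights=1/yerr**2).fit()

print(result.params)        # fitted polynomial coefficients
print(result.cov_params())  # covariance matrix of the coefficients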
Here it is using scipy.linalg.lstsq:
import numpy as np
import scipy.linalg

# generate the test data
N = 100
xs = np.random.uniform(size=N)
errs = np.random.uniform(0, 0.1, size=N)  # errors
ys = 1 + 2 * xs + 3 * xs ** 2 + errs * np.random.normal(size=N)

# do the fit
polydeg = 2
A = np.vstack([1 / errs] + [xs ** _ / errs for _ in range(1, polydeg + 1)]).T
result = scipy.linalg.lstsq(A, ys / errs)[0]
covar = np.linalg.inv(np.dot(A.T, A))
print(result)
print(covar)
>> [ 0.99991811 2.00009834 3.00195187]
[[ 4.82718910e-07 -2.82097554e-06 3.80331414e-06]
[ -2.82097554e-06 1.77361434e-05 -2.60150367e-05]
[ 3.80331414e-06 -2.60150367e-05 4.22541049e-05]]

Integrating a function using non-uniform measure (python/scipy)

I would like to integrate a function in Python and provide the probability density (measure) used to sample values. If it's not obvious, integrating f(x)dx on [a,b] implicitly uses the uniform probability density over [a,b], and I would like to use my own probability density (e.g. exponential).
I can do it myself using np.random.*, but then:
I miss the optimizations available in scipy.integrate.quad. Or maybe all those optimizations assume the uniform density?
I need to do the error estimation myself, which is not trivial. Or maybe it is? Maybe the error is just the variance of sum(f(x))/n?
Any ideas?
As unutbu said, if you have the density function, then you can just integrate the product of your function with the pdf using scipy.integrate.quad.
For the distributions that are available in scipy.stats, we can also just use the expect method.
For example
>>> from scipy import stats
>>> f = lambda x: x**2
>>> stats.norm.expect(f, loc=0, scale=1)
1.0000000000000011
>>> stats.norm.expect(f, loc=0, scale=np.sqrt(2))
1.9999999999999996
scipy.integrate.quad also has some predefined weight functions, although they are not normalized to be probability density functions.
The approximation error depends on the settings for the call to integrate.quad.
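To make the first option concrete with the exponential density mentioned in the question (a sketch; the integrand and the rate are arbitrary):
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = lambda x: x**2

# E[f(X)] for X ~ Exponential(scale=1), computed two ways
val_quad, err = quad(lambda x: f(x) * stats.expon.pdf(x), 0, np.inf)
val_expect = stats.expon.expect(f)

print(val_quad, val_expect)   # both should be close to 2.0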
For the sake of brevity, here are the approaches that were suggested for calculating the expected value of f(x) under the probability density p(x):
Assuming p is given in closed form, use scipy.integrate.quad to integrate f(x)p(x).
Assuming p can be sampled from, draw N samples X from p, then estimate the expected value by np.mean(f(X)) and the error by np.std(f(X))/np.sqrt(N) (see the sketch after this list).
Assuming p is available in scipy.stats, use e.g. stats.norm.expect(f).
Assuming we have the CDF of the distribution rather than p(x), compute H = Inverse[CDF] and then integrate f(H(x)) using scipy.integrate.quad.
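A minimal sketch of the sampling route (the second option above), again with an exponential density and an arbitrary integrand:
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2

N = 100000
X = rng.exponential(scale=1.0, size=N)   # samples from p

estimate = np.mean(f(X))
error = np.std(f(X)) / np.sqrt(N)        # standard error of the Monte Carlo estimate
print(estimate, "+/-", error)            # should be close to 2.0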
Another possibility would be to integrate x -> f(H(x)), where H is the inverse of the cumulative distribution function of your probability distribution.
[This is because of the change of variable: substituting y = CDF(x) and noting that p(x) = CDF'(x) yields dy = p(x)dx, and thus int{f(x)p(x)dx} == int{f(x)dy} == int{f(H(y))dy}, with H the inverse of the CDF.]
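In code, using ppf (the inverse CDF) from scipy.stats (a sketch; the exponential example again):
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = lambda x: x**2
H = stats.expon.ppf                      # inverse CDF of the exponential

val, err = quad(lambda y: f(H(y)), 0, 1)
print(val)                               # again close to 2.0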
