is there an equivalent of R's nls in statsmodels? - python

Does statsmodels support nonlinear regression to an arbitrary equation? (I know that there are some forms that are already built in, e.g. for logistic regression, but I am after something more flexible)
In the solution https://stats.stackexchange.com/a/44249 to a question about non-linear regression,
the code is in R and uses the function nls. There it has the equation's parameters defined with start = list(a1=0, ...). These are of course just some initial guesses and not the final fitted values. But what is different here compared to lm is that the parameters don't need to be from the columns of the input data.
I've been able to use statsmodels.formula.api.ols as an equivalent for R's lm, but when I try to use it with an equation that has parameters (and not weights for the inputs / combinations of inputs), statsmodels complains about the parameters not being defined. It does not seem to have an equivalent argument as start= so it isn't obvious how to introduce them.
Is there a different class or function in statsmodels that accepts definition of these initial parameter values?
EDIT:
My current attempt and also workaround following suggestion with lmfit
from statsmodels.formula.api import ols
import numpy as np
import pandas as pd
def eqn_poly(x, a, b):
''' simple polynomial '''
return a*x**2.0 + b*x
def eqn_nl(x, a, b):
''' fractional equation '''
return 1.0 / ((a+x)*b)
x = np.arange(0, 3, 0.1)
y1 = eqn_poly(x, 0.1, 0.5)
y2 = eqn_nl(x, 0.1, 0.5)
sigma =0.05
y1_noise = y1 + sigma * np.random.randn(*y1.shape)
y2_noise = y2 + sigma * np.random.randn(*y2.shape)
df1 = pd.DataFrame(np.vstack([x, y1_noise]).T, columns= ['x', 'y'])
df2 = pd.DataFrame(np.vstack([x, y2_noise]).T, columns= ['x', 'y'])
res1 = ols("y ~ 1 + x + I(x ** 2.0)", df1).fit()
print res1.summary()
res3 = ols("y ~ 1 + x + I(x ** 2.0)", df2).fit()
#res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
# ^^^ this fails if a, b are not initialised ^^^
# so initialise a, b
a,b = 1.0, 1.0
res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
print res2.summary()
# ===> and now the fitting is bad, it has an intercept -4.79, and a weight
# on the equation 15.7.
Giving lmfit the formula it is able to find parameters.
import lmfit
mod = lmfit.Model(eqn_nl)
lm_result = mod.fit(y2_noise, x=x, a=1.0, b=1.0)
print lm_result.fit_report()
# ===> this one works fine, a=0.101, b=0.4977
But trying to put y1, x into ols doesn't seem to work ("PatsyError: model is missing required outcome variables"). I didn't really follow that suggestion.

consider scipy.optimize.curve_fit as desired R.nls-like function

Related

generic curve fitting by changing model function in python

I am using SciPy.optimize.curve_fit https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
to get the coefficients of curve fitting function, the SciPy function takes the model function as its first argument
so if I want to make a linear curve fit, I pass to it the following function:
def objective(x, a, b):
return a * x + b
If I want polynomial curve fitting of second degree, I pass the following:
def objective(x, a, b, c):
return a * x + b * x**2 + c
And so on, what I want to achieve is to make this model function generic
for example, if the user wanted to curve fit to a polynomial of 5th degree by inputting 5 it should change to
def objective(x, a, b, c, d, e, f):
return (a * x) + (b * x**2) + (c * x**3) + (d * x**4) + (e * x**5) + f
While the code is running
is this possible?
And if it is not possible using SciPy because it requires to change a function, is there any other way to achieve what I want ?
If you really want to implement it on your own, you can either use a variable number of coefficients *coeffs:
def objective(x, *coeffs):
result = coeffs[-1]
for i, coeff in enumerate(coeffs[:-1]):
result += coeff * x**(i+1)
return result
or use np.polyval:
import numpy as np
def objective(x, *coeffs):
return np.polyval(x, coeffs)
However, note that there's no need to use curve_fit. You can directly use np.polyfit to do a least-squares polynomial fit.
The task can be accomplished in a few ways. If you want to use scipy, you can simply create a dictionary to refer to specific function by using numerical input from user:
import scipy.optimize as optimization
polnum = 2 # suppose this is input from user
def objective1(x, a, b):
return a * x + b
def objective2(x, a, b, c):
return a * x + b * x**2 + c
# Include some more functions
# Do not include round brackets in the dictionary
object_dict = {
1: objective1,
2: objective2
# Include the numbers with corresponding functions
}
opt = optimization.curve_fit(object_dict[polnum], x, y) # Curve fitted
print(opt[0]) # Returns parameters
However, I would suggest you going with a bit better way where you do not have to define each function:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
polnum = 2 # suppose this is input from user
# Creates a Polynomial of polnum degree
model = make_pipeline(PolynomialFeatures(polnum), LinearRegression)
# You can change the estimator Linear Regression to any other offered in sklearn.linear_model
# Make sure that x and y data are of the same shape
model.fit(x, y)
# Now you can use the model to check some other values
y_predict = model.predict(x_predict[:, np.newaxis])

Python curve fitting with constraints

I have been looking for Python curve fitting with constraints. One of the option is to use lmfit module and another option is to use penalization to enforce the constraints. I have following code in which I am trying to enforce a+b=3.6 as the constraint. In other words, y=3.6 when x=1 and x is always >=1 in my case.
import numpy as np
import scipy.optimize as sio
def func(x, a, b, c):
return a+b*x**c
x = [1, 2, 4, 8, 16]
y = [3.6, 3.96, 4.31, 5.217, 6.842]
lb = np.ones(3, dtype=float)
ub = np.ones(3, dtype=float)*10.
popt, pcov = sio.curve_fit(func, x, y)
print(popt)
Ideally, I would like to use the lmfit approach and spent good amount of trying to understand examples but could not succeed. Can someone help with an example?
If I understand your question correctly, you want to model some data with
def func(x, a, b, c):
return a+b*x**c
and for a particular set of data you want to impose the constraint that a+b=3.6. You could, just "hardwire that in", changing the function to be
def func2(x, b, c):
a = 3.6 - b
return a+b*x**c
and now you have a model function with only two variables: b, and c.
That would not be very flexible but would work.
Using lmfit gives back some of that flexibility. To do a completely unconstrained fit, you would say
from lmfit import Model
mymodel = Model(func)
params = mymodel.make_params(a=2, b=1.6, c=0.5)
result = mymodel.fit(y, params, x=x)
(Just as an aside: scipy.optimize.curve_fit permits you to not specify initial values for the parameters and implicitly sets them to 1 without telling you. This is a terrible mis-feature - always give initial values).
If you do want to impose the constraint a+b=3.6, you could then do
params['a'].expr = '3.6-b'
result2 = mymodel.fit(y, params, x=x)
print(result2.fit_report())
When I do that with the data you provided, this prints (note that it reports 2 variables, not 3):
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 34
# data points = 5
# variables = 2
chi-square = 0.01066525
reduced chi-square = 0.00355508
Akaike info crit = -26.7510142
Bayesian info crit = -27.5321384
[[Variables]]
a: 3.28044833 +/- 0.04900625 (1.49%) == '3.6-b'
b: 0.31955167 +/- 0.04900626 (15.34%) (init = 1.6)
c: 0.86901253 +/- 0.05281279 (6.08%) (init = 0.5)
[[Correlations]] (unreported correlations are < 0.100)
C(b, c) = -0.994
Your code hinted at using (but did not actually use) upper and lower bounds for the parameter values. Those are also possible with lmfit, as with
params['b'].min = 1
params['b'].min = 10
and so forth. I'm not sure you need them here, and would caution against trying to set bounds too tightly.

Co variance matrix of a model in python

To find the co variance matrix of a fitted model in python (equivalent to vcov() (R fucntion) in python)
lmfit <- lm(formula = Y ~ X, data=Data_df)
lmpred <- predict(lmfit, newdata=Data_df, se.fit=TRUE, interval = "prediction")
std_er <- sqrt(((X0) %*% vcov(lmfit)) %*% t(X0))
trying to convert the above code in python. For which i need to find the co variance matrix of the fitted model ie, vcov.
I wont be able to use np.cov() as im trying to find the co variance matrix of the model.
i have already used statsmodels.regression.linear_model.OLSResults.cov_params(), But i m not getting the same values as in R.
The scipy ODR code can independently calculate the parameter covariance matrix, here is an example extracted from the source code of my zunzun.com online curve fitter:
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats
x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])
def f(x,b0,b1):
return b0 + (b1 * x)
def f_wrapper_for_odr(beta, x): # parameter order for odr
return f(x, *beta)
parameters, cov= curve_fit(f, x, y)
model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x,y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
ci = []
for i in range(len(parameters)):
ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])
tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0 # coef. p-values
for i in range(len(parameters)):
print('parameter:', parameters[i])
print(' conf interval:', ci[i][0], ci[i][1])
print(' tstat:', tstat_beta[i])
print(' pstat:', pstat_beta[i])
print()
print('Covariance matrix:')
print(cov_beta)
Please provide specific details on what you're using.
Assuming you're using numpy arrays for your data, there's numpy.cov estimator
This works for when vcov() returns a 1x1 dataframe. I solved my function in Python using:
fit = scipy.optimize.minimize(fun, x0=x, method = 'L-BFGS-B')
Then, I specified the hessian inverse return value as follows:
vcov = fit['hess_inv'].todense().ravel()
This gave me the same result ~(±1e-3) as stats4::vcov() in R for scenarios where vcov() returns a 1x1 data frame.

Add constraints to scipy.optimize.curve_fit?

I have the option to add bounds to sio.curve_fit. Is there a way to expand upon this bounds feature that involves a function of the parameters? In other words, say I have an arbitrary function with two or more unknown constants. And then let's also say that I know the sum of all of these constants is less than 10. Is there a way I can implement this last constraint?
import numpy as np
import scipy.optimize as sio
def f(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0, 100, 101)
y = 2*x**2 + 3*x + 4
popt, pcov = sio.curve_fit(f, x, y, \
bounds = [(0, 0, 0), (10 - b - c, 10 - a - c, 10 - a - b)]) # a + b + c < 10
Now, this would obviously error, but I think it helps to get the point across. Is there a way I can incorporate a constraint function involving the parameters to a curve fit?
Thanks!
With lmfit, you would define 4 parameters (a, b, c, and delta). a and b can vary freely. delta is allowed to vary, but has a maximum value of 10 to represent the inequality. c would be constrained to be delta-a-b (so, there are still 3 variables: c will vary, but not independently from the others). If desired, you could also put bounds on the values for a, b, and c. Without testing, your code would be approximately::
import numpy as np
from lmfit import Model, Parameters
def f(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0, 100.0, 101)
y = 2*x**2 + 3*x + 4.0
fmodel = Model(f)
params = Parameters()
params.add('a', value=1, vary=True)
params.add('b', value=4, vary=True)
params.add('delta', value=5, vary=True, max=10)
params.add('c', expr = 'delta - a - b')
result = fmodel.fit(y, params, x=x)
print(result.fit_report())
Note that if you actually get to a situation where the constraint expression or bounds dictate the values for the parameters, uncertainties may not be estimated.
curve_fit and least_squares only accept box constraints. In scipy.optimize, SLSQP can deal with more complicated constraints.
For curve fitting specifically, you can have a look at lmfit package.

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
although despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
one = a*x + b
two = c*x + d
return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
""" hyperbola(x) with parameters
a/b = asymptotic slope
c = curvature at vertex
d = offset to vertex
e = vertical offset
"""
return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
def rot_hyperbola(x, a, b, c, d, e, th):
pars = a, b, c, 0, 0 # do the shifting after rotation
xd = x - d
hsin = hyperbola(xd, *pars)*np.sin(th)
xcos = xd*np.cos(th)
return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little, Applied Linear Regression by Sanford, and the Correlation and Regression lecture by Steiger had some good info on it. They all however lack the right model, the piecewise function should be
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
dfseg = pd.read_csv('segreg.csv')
def err(w):
th0 = w['th0'].value
th1 = w['th1'].value
th2 = w['th2'].value
gamma = w['gamma'].value
fit = th0 + th1*dfseg.Temp + th2*np.maximum(0,dfseg.Temp-gamma)
return fit-dfseg.C
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1=mi.params['th1'];b2=mi.params['th2']
gamma = int(mi.params['gamma'].value)
import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma,gamma), data=dfseg).fit()
print reslin.summary()
x0 = np.array(range(0,gamma,1))
x1 = np.array(range(0,80-gamma,1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)* x1)
plt.scatter(dfseg.Temp, dfseg.C)
plt.hold(True)
plt.plot(x0,y0)
plt.plot(x1+gamma,y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used #user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) in #user423805 's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, three gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
if(len(thresholds) < 2):
print("Error: expected at least two thresholds")
return None
applicable_gammas = filter(lambda gamma: x > gamma , gammas)
#base result
y = thresholds[0] + (thresholds[1] * x)
#additional factors calculated depending on x value
for i in range(0, len(applicable_gammas)):
y = y + ( thresholds[i + 2] * ( x - applicable_gammas[i] ) )
return y
def least_splines_calc_array(x_array, thresholds, gammas):
y_array = map(lambda x: least_splines_calc(x, thresholds, gammas), x_array)
return y_array
def err(params, x, data):
th0 = params['th0'].value
th1 = params['th1'].value
th2 = params['th2'].value
th3 = params['th3'].value
gamma1 = params['gamma1'].value
gamma2 = params['gamma2'].value
thresholds = np.array([th0, th1, th2, th3])
gammas = np.array([gamma1, gamma2])
fit = least_splines_calc_array(x, thresholds, gammas)
return np.array(fit)-np.array(data)
p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err_alt, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas to re-use linear_splines_calc to plot the linear splines regression.
Reference: While there's various places that explain least splines (I think #user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations) , the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton) : https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines. Excerpt below:
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
There is a straightforward method (not iterative, no initial guess) pp.12-13 in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from the scanning of the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels in not accurate. So, don't be surprised by additional deviation.
Note that the abscisses and ordinates scales have been devised by 1000.
The equations of the two segments are
The approximate values of the five parameters are written on the above figure.

Categories

Resources