Minimize squared error for fixed parameter regression - python

I'm currently comparing several traditional methods for selecting an exponential trend in a series of points. Trends are often selected in my industry without concern for measures of fit, and this means that a common method is to simply measure yr/yr change and average these results over a period of time. This produces a factor, but doesn't produce a fit, so it's difficult to compare it to an exponential regression or other approach. So my question:
If I have a pre-selected, fixed trend factor for an exponential curve, is there a simple method for optimizing the 'intercept' value which would minimize the squared error of the overall fit to a set of data? Consider the following:
import numpy as np
from sklearn.metrics import r2_score
from scipy.optimize import curve_fit
#Define exponential function
def ex(x, a, b):
    return a*b**x
#Seed data with normally distributed error
x=np.linspace(1,20,20)
np.random.seed(100)
y=ex(x,100,1.01)+3*np.random.randn(20)
#Exponential regression for trend value, fitted values, and r_sq measure
popt, pcov = curve_fit(ex, x, y)
trend,fit,r_sq=(popt[1])-1, ex(x,*popt), r2_score(y,ex(x,*popt))
#Mean Yr/Yr change as an alternative measure of trend
trend_yryr=np.mean(np.diff(y)/y[1:])
print(trend)
print(trend_yryr)
The mean year/year change produces a different trend value for the data, and I'd like to compare it to the exponential regression's selected trend. My goal is to find the intercept which would minimize squared error for this alternative trend value over the data, and measure that squared error for comparison. Thanks for the help.

One way of fixing a parameter when using curve_fit is to pass a lambda function that hardcodes the parameters you want to fix. Here's how I did it:
# ... all your preamble, but importing matplotlib
new_b = trend_yryr + 1
popt2, pcov2 = curve_fit(lambda x, a: ex(x, a, new_b), x, y)
fit2, r_sq2 = ex(x, *popt2, new_b), r2_score(y, ex(x, *popt2, new_b))
popt is array([100.24989416, 1.00975864]) and popt2 is array([99.09513612]). Plotting the results shows how the two fits compare.
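A minimal sketch of plotting code for that comparison (assuming matplotlib.pyplot is imported as plt, per the preamble comment) could be:
plt.scatter(x, y, color='k', label='data')
plt.plot(x, fit, label=f'free fit, b = {popt[1]:.4f}')
plt.plot(x, fit2, label=f'fixed b = {new_b:.4f}')
plt.legend()
plt.show()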
You could use lmfit to do this perhaps a bit more elegantly, but it's essentially the same idea.
from lmfit import Parameters, minimize
def residual(params, x, y):
    a = params['a']
    b = params['b']
    model = ex(x, a, b)
    return model - y
p1 = Parameters()
p1.add('a', value=1, vary=True)
p1.add('b', value=1, vary=True)
p2 = Parameters()
p2.add('a', value=1, vary=True)
p2.add('b', value=np.mean(np.diff(y)/y[1:]) + 1, vary=False) # important
out1 = minimize(residual, p1, args=(x, y))
out2 = minimize(residual, p2, args=(x, y))
The outputs of both methods are essentially the same, so I won't post them again.
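Worth adding: once b is fixed, the model a*b**x is linear in a, so the least-squares intercept has a closed form. Setting the derivative of sum((y_i - a*b**x_i)**2) with respect to a to zero gives a = sum(y_i*b**x_i) / sum(b**(2*x_i)). A quick sketch using the variables defined above (this reproduces what the constrained curve_fit does):
# closed-form least-squares intercept for a fixed growth factor b
w = new_b ** x                     # b**x evaluated at each data point
a_closed = np.sum(y * w) / np.sum(w * w)
print(a_closed)                    # should agree with popt2[0] from the constrained curve_fit
print(np.sum((y - ex(x, a_closed, new_b))**2))   # the minimized squared error for comparison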

Related

Comparison of curve_fit and scipy.odr - absolute sigma

I currently want to fit data with errors in both x and y, and I'm using the scipy.odr package to get my results. I'm just wondering about the correct use of the errors sx and sy. So here is an example.
Let's assume I'm measuring a voltage V for different currents I.
Say I have measured 1 V with an error of +/- 0.1 V, and so on. If I assume no error in the current measurement, I can use scipy's curve_fit as follows. absolute_sigma is set to True because my +/- 0.1 V error is an absolute error.
So I'm getting:
import numpy as np
from scipy.optimize import curve_fit

y = [1, 2.5, 3, 4]
x = [1, 2, 3, 4]
yerr = [0.1, 0.1, 0.1, 0.1]

def func(x, a, b):
    return a*x + b

popt, pcov = curve_fit(func, x, y, sigma=yerr, absolute_sigma=True)
print(popt)
print(np.sqrt(np.diag(pcov)))
[0.95 0.25]
[0.04472136 0.12247449]
In a second step I want to use the odr package with errors in both current and voltage.
According to the documentation it should be used as follows: sx and sy are the errors for my measurement data.
So I should expect to get results similar to curve_fit if I use a very small error for sx.
from scipy.odr import Model, RealData, ODR

x_err = [1e-10] * len(x)
y_err = yerr

def linear_func(p, x):
    m, c = p
    return m*x + c

linear_model = Model(linear_func)
data = RealData(x, y, sx=x_err, sy=y_err)
odr = ODR(data, linear_model, beta0=[0.4, 0.4])
out = odr.run()
out.pprint()
Beta: [0.94999996 0.24999994]
Beta Std Error: [0.13228754 0.36228459]
But as you can see, the result is different from the curve_fit above with absolute_sigma=True.
Using the same data with curve_fit and absolute_sigma=False leads to the same results as the ODR fit.
popt, pcov = curve_fit(func, x, y,sigma=yerr,absolute_sigma=False)
print(popt)
print(np.sqrt(np.diag(pcov)))
[0.95 0.25]
[0.13228756 0.36228442]
So I guess the ODR fit does not really handle my absolute errors the way curve_fit with absolute_sigma=True does. Is there any way to do that, or am I missing something?
The option absolute_sigma=True in curve_fit() gives out the real covariance matrix, in the sense that np.sqrt(np.diag(pcov)) yields the 1-sigma standard deviation as the error as defined in, e.g., Numerical Recipes; i.e., it can carry a unit, such as meters. A very helpful summary comes with kmpfit.
ODR, however, gives out the standard error derived from a scaled covariance matrix. See How to compute standard error from ODR results? or the example below.
This scaling is performed by ODR such that the reduced chi2, calculated with the input weights scaled in the same manner, is approximately 1. Or, from the curve_fit docs:
This constant is set by demanding that the reduced chisq for the optimal parameters popt when using the scaled sigma equals unity.
The remaining question is now: What does sd_beta really mean?
It's the standard error, see e.g., Standard error of mean versus standard deviation
(there may exist conditions where the magnitude of both are the same. See comments below)
see also the preceding discussion here: https://mail.python.org/pipermail/scipy-user/2013-February/034196.html
Now, by scaling pcov with the reduced chi2, one obtains for the parameter errors the same output as a) curve_fit(..., absolute_sigma=False) and b) ODR, which are a priori relative errors:
# calculate chi2
residuals = (y - func(x, *popt))
chi_arr = residuals / yerr
chi2_red = (chi_arr**2).sum() / (len(x)-len(popt))
print('red. chi2:\t\t\t\t', chi2_red)
print('np.sqrt(np.diag(pcov) * chi2_red):\t', np.sqrt(np.diag(pcov) * chi2_red))
yielding:
red. chi2: 8.75
np.sqrt(np.diag(pcov) * chi2_red): [0.13228756 0.36228442]
Note, however, that for curve_fit(..., absolute_sigma=True) and ODR the covariance matrices (pcov from curve_fit and cov_beta from the ODR output) are still the same. Here, the disadvantage of the back-and-forth rescaling becomes apparent!
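A practical consequence, assuming the relationship described above holds: if you want absolute_sigma=True-style errors out of ODR, you can take them directly from the unscaled cov_beta, or equivalently undo the scaling of sd_beta using out.res_var (the reduced chi2). A sketch using the ODR output from above:
# absolute 1-sigma parameter errors from the ODR output
perr_abs = np.sqrt(np.diag(out.cov_beta))
# equivalently, remove the reduced-chi2 scaling from sd_beta
perr_abs_alt = out.sd_beta / np.sqrt(out.res_var)
print(perr_abs)   # should be close to curve_fit's absolute_sigma=True result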
Being relative errors, they of course imply that if the input errors are scaled, the magnitude of the relative errors stays the same:
# scaled uncertainty
yerr = np.asfarray([0.1, 0.1, 0.1, 0.1]) * 2
popt, pcov = curve_fit(func, x, y, sigma=yerr, absolute_sigma=False)
print('popt:\t\t\t', popt)
print('np.sqrt(np.diag(pcov)):\t', np.sqrt(np.diag(pcov)))
with the same output of:
popt: [0.95 0.25]
np.sqrt(np.diag(pcov)): [0.13228756 0.36228442]

How to compute standard deviation errors with scipy.optimize.least_squares

I compare fitting with optimize.curve_fit and optimize.least_squares. With curve_fit I get the covariance matrix pcov as an output, and from it I can calculate the standard deviation errors of my fitted variables:
perr = np.sqrt(np.diag(pcov))
If I do the fitting with least_squares, I do not get any covariance matrix output and I am not able to calculate the standard deviation errors for my variables.
Here's my example:
#import modules
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import least_squares

noise = 0.5
N = 100
t = np.linspace(0, 4*np.pi, N)

# generate data
def generate_data(t, freq, amplitude, phase, offset, noise=0, n_outliers=0, random_state=0):
    # formula for data generation with noise and outliers
    y = np.sin(t * freq + phase) * amplitude + offset
    rnd = np.random.RandomState(random_state)
    error = noise * rnd.randn(t.size)
    outliers = rnd.randint(0, t.size, n_outliers)
    error[outliers] *= 10
    return y + error

#generate data
data = generate_data(t, 1, 3, 0.001, 0.5, noise, n_outliers=10)

#initial guesses
p0 = np.ones(4)
x0 = np.ones(4)

# create the function we want to fit
def my_sin(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset

# create the residual function we want to fit for least_squares
def my_sin_lsq(x, t, y):
    # freq = x[0]
    # amplitude = x[1]
    # phase = x[2]
    # offset = x[3]
    return (np.sin(t*x[0] + x[2])*x[1] + x[3]) - y

# now do the fit for curve_fit
fit = curve_fit(my_sin, t, data, p0=p0)
print('Curve fit output:' + str(fit[0]))

#now do the fit for least_squares
res_lsq = least_squares(my_sin_lsq, x0, args=(t, data))
print('Least_squares output:' + str(res_lsq.x))

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = my_sin(t, *p0)
#data_first_guess_lsq = x0[2]*np.sin(t*x0[0]+x0[1])+x0[3]
data_first_guess_lsq = my_sin(t, *x0)

# recreate the fitted curve using the optimized parameters
data_fit = my_sin(t, *fit[0])
data_fit_lsq = my_sin(t, *res_lsq.x)

#calculation of residuals
residuals = data - data_fit
residuals_lsq = data - data_fit_lsq
ss_res = np.sum(residuals**2)
ss_tot = np.sum((data - np.mean(data))**2)
ss_res_lsq = np.sum(residuals_lsq**2)
ss_tot_lsq = np.sum((data - np.mean(data))**2)

#R squared
r_squared = 1 - (ss_res/ss_tot)
r_squared_lsq = 1 - (ss_res_lsq/ss_tot_lsq)
print('R squared curve_fit is:' + str(r_squared))
print('R squared least_squares is:' + str(r_squared_lsq))

plt.figure()
plt.plot(t, data)
plt.title('curve_fit')
plt.plot(t, data_first_guess)
plt.plot(t, data_fit)
plt.plot(t, residuals)

plt.figure()
plt.plot(t, data)
plt.title('lsq')
plt.plot(t, data_first_guess_lsq)
plt.plot(t, data_fit_lsq)
plt.plot(t, residuals_lsq)

#error
perr = np.sqrt(np.diag(fit[1]))
print('The standard deviation errors for curve_fit are:' + str(perr))
I would be very thankful for any help, best wishes.
PS: I got a lot of input from this source and used part of the code from Robust regression.
The result of optimize.least_squares has an attribute called jac. From the documentation:
jac : ndarray, sparse matrix or LinearOperator, shape (m, n)
Modified Jacobian matrix at the solution, in the sense that J^T J is a Gauss-Newton approximation of the Hessian of the cost function. The type is the same as the one used by the algorithm.
This can be used to estimate the Covariance Matrix of the parameters using the following formula: Sigma = (J'J)^-1.
J = res_lsq.jac
cov = np.linalg.inv(J.T.dot(J))
To find the 1-sigma standard deviation of the parameters one can then use:
perr = np.sqrt(np.diagonal(cov))
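As a quick check against the question's example (a sketch, assuming res_lsq, fit and the cov computed above are in scope): note that when no sigma is given, curve_fit scales its covariance by the residual variance, so the two sets of numbers will differ by roughly that factor.
perr_lsq = np.sqrt(np.diagonal(cov))
perr_cf = np.sqrt(np.diag(fit[1]))   # from curve_fit's pcov in the question
print('least_squares errors:', perr_lsq)
print('curve_fit errors:    ', perr_cf)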
The SciPy routine optimize.least_squares requires the user to provide as input a function fun(...) which returns a vector of residuals. This is typically defined as
residuals = (data - model)/sigma
where data and model are vectors with the data to fit and the corresponding model predictions for each data point, while sigma is the 1σ uncertainty in each data value.
In this situation, and assuming one can trust the input sigma uncertainties, one can use the output Jacobian matrix jac returned by least_squares to estimate the covariance matrix. Moreover, assuming the covariance matrix is diagonal, or simply ignoring non-diagonal terms, one can also obtain the 1σ uncertainty perr in the model parameters (often called "formal errors") as follows (see Section 15.4.2 of Numerical Recipes 3rd ed.)
import numpy as np
from scipy import linalg, optimize
res = optimize.least_squares(...)
U, s, Vh = linalg.svd(res.jac, full_matrices=False)
tol = np.finfo(float).eps*s[0]*max(res.jac.shape)
w = s > tol
cov = (Vh[w].T/s[w]**2) @ Vh[w]  # robust covariance matrix
perr = np.sqrt(np.diag(cov)) # 1sigma uncertainty on fitted parameters
The above code to obtain the covariance matrix is formally the same as the following simpler one (as suggested by Alex), but the above has the major advantage that it works even when the Jacobian is close to degenerate, which is a common occurrence in real-world least-squares fits.
cov = linalg.inv(res.jac.T @ res.jac)  # covariance matrix when jac not degenerate
If one does not trust the input uncertainties sigma, one can still assume that the fit is good and estimate the data uncertainties from the fit itself. This corresponds to assuming chi**2/DOF=1, where DOF is the number of degrees of freedom. In this case, one can use the following lines to rescale the covariance matrix before computing the uncertainties:
chi2dof = np.sum(res.fun**2)/(res.fun.size - res.x.size)
cov *= chi2dof
perr = np.sqrt(np.diag(cov)) # 1sigma uncertainty on fitted parameters

How to do linear regression, taking errorbars into account?

I am doing a computer simulation of a physical system of finite size, and afterwards I extrapolate to infinity (the thermodynamic limit). Some theory says that the data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example, the data points look like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
c, m = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w = [1.0 / ty for ty in y_err], full=False)  # coefficients are returned lowest degree first
Here we use the inverse of the errorbar for each point as a weight in the least-squares approximation. So if a point is not really that reliable, it will not influence the result a lot, which is reasonable.
But I can not figure out how to get something that combines both these methods.
What I really want is what the second method does, meaning a regression where every point influences the result with a different weight. But at the same time I want to know how accurate my result is, meaning I want to know the errorbars of the resulting coefficients.
How can I do this?
Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
'x': x_list,
'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
kind='scatter',
x='x',
y='y',
style='o',
alpha=1.,
ax=ax,
title='x vs y scatter',
edgecolor='#ff8300',
s=40
)
# weighted prediction
wp, = ax.plot(
wls_fit.predict(),
ws['y'],
color='#e55ea2',
lw=1.,
alpha=1.0,
)
# unweighted prediction
op, = ax.plot(
ols_fit.predict(),
ws['y'],
color='k',
ls='solid',
lw=1,
alpha=1.0,
)
leg = plt.legend(
(op, wp),
('Ordinary Least Squares', 'Weighted Least Squares'),
loc='upper left',
fontsize=8)
plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.
I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's "gsl_fit_wlinear" function. This is useful if you want to know exactly what your function is doing when it performs the fit.
def wlinear_fit(x, y, w):
    """
    Fit (x,y,w) to a linear function, using exact formulae for weighted linear
    regression. This code was translated from the GNU Scientific Library (GSL),
    it is an exact copy of the function gsl_fit_wlinear.
    """
    # compute the weighted means and weighted deviations from the means
    # wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
    W = np.sum(w)
    wm_x = np.average(x, weights=w)
    wm_y = np.average(y, weights=w)
    dx = x - wm_x
    dy = y - wm_y
    wm_dx2 = np.average(dx**2, weights=w)
    wm_dxdy = np.average(dx*dy, weights=w)
    # In terms of y = a + b x
    b = wm_dxdy / wm_dx2
    a = wm_y - wm_x*b
    cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
    cov_11 = 1.0 / (W*wm_dx2)
    cov_01 = -wm_x / (W*wm_dx2)
    # Compute chi^2 = \sum w_i (y_i - (a + b * x_i))^2
    chi2 = np.sum(w * (y - (a + b*x))**2)
    return a, b, cov_00, cov_11, cov_01, chi2
To perform your fit, you would do
a, b, cov_00, cov_11, cov_01, chi2 = wlinear_fit(np.array(x_list), np.array(y_list), 1.0/np.array(y_err)**2)
This returns the best estimates for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate of the error on a is then the square root of cov_00, and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not the inverse standard deviations as the weights for the data points.
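Following that description, extracting the coefficient errors from the call above would look like this (a sketch):
a_err = np.sqrt(cov_00)    # 1-sigma error on the intercept a
b_err = np.sqrt(cov_11)    # 1-sigma error on the slope b
print('a = %g +/- %g' % (a, a_err))
print('b = %g +/- %g' % (b, b_err))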
sklearn.linear_model.LinearRegression supports specification of weights during fit:
from sklearn.linear_model import LinearRegression

x_data = np.array(x_list).reshape(-1, 1)  # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)

model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).
I found this document helpful in understanding and setting up my own weighted least squares routine (applicable to any programming language).
Typically, learning and using optimized routines is the best way to go, but there are times when understanding the guts of a routine is important.
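Another option, in line with the curve_fit discussion earlier on this page, is scipy.optimize.curve_fit itself: it takes the y errorbars via sigma and returns a covariance matrix for the coefficients, so you get both the weighted fit and its errorbars in one call. A minimal sketch using x_list, y_list and y_err from the question:
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, c):
    return m*x + c

popt, pcov = curve_fit(line, np.array(x_list), np.array(y_list),
                       sigma=np.array(y_err), absolute_sigma=True)
m, c = popt
m_err, c_err = np.sqrt(np.diag(pcov))   # 1-sigma errors on slope and intercept
print('m = %g +/- %g' % (m, m_err))
print('c = %g +/- %g' % (c, c_err))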

Scipy Fmin Gaussian model to real data

I've been trying to solve this for a bit and really just haven't seen an example or anything that my brain is able to use to move forward.
The goal is to find a model Gaussian curve by minimizing the total chi-squared between the real data and the model resulting from unknown parameters that require sensible estimations (the Gaussian is of unknown position, amplitude and width). scipy.optimize.fmin has come up but I've never used this before and I'm still very new to python...
Ultimately, I'd like to plot the original data along with the model - I have used pyplot before; it's just generating the model and using fmin that has me completely bewildered. I'm essentially here:
def gaussian(a, b, c, x):
    return a*np.exp(-(x-b)**2/(2*c**2))
I've seen multiple ways to generate a model and this has rendered me confused and thus I have no code! I have imported my data file through np.loadtxt.
Thanks for anyone that can suggest a framework or help at all.
There are basically four (or five) main steps involved in model fitting problems like this:
1. Define your forward model, yhat = F(P, x), that takes a set of parameters P and your independent variable x, and estimates your response variable y
2. Define your loss function, loss = L(P, x, y), that you'd like to minimize over your parameters
3. Optional: define a function that returns the Jacobian matrix, i.e. the partial derivatives of your loss function w.r.t. your model parameters.*
4. Make an initial guess at your model parameters
5. Plug all these into one of the optimizers and get the fitted parameters for your model
Here's a worked example to get you started:
import numpy as np
from scipy.optimize import minimize
from matplotlib import pyplot as pp
# function that defines the model we're fitting
def gaussian(P, x):
    a, b, c = P
    return a*np.exp(-(x-b)**2 / (2*c**2))
# objective function to minimize
def loss(P, x, y):
    yhat = gaussian(P, x)
    return ((y - yhat)**2).sum()
# generate a gaussian distribution with known parameters
amp = 1.3543
pos = 64.546
var = 12.234
P_real = np.array([amp, pos, var])
# we use the vector of real parameters to generate our fake data
x = np.arange(100)
y = gaussian(P_real, x)
# add some gaussian noise to make things harder
y_noisy = y + np.random.randn(y.size)*0.5
# minimize needs an initial guess at the model parameters
P_guess = np.array([1, 50, 25])
# minimize provides a unified interface to all of scipy's solvers. you
# can also access them individually in scipy.optimize, but the
# standalone versions have annoying differences in their syntax. for now
# we'll use the Nelder-Mead solver, which doesn't use the Jacobian. we
# also need to hand it x and y_noisy as additional args to loss()
res = minimize(loss, P_guess, method='Nelder-Mead', args=(x, y_noisy))
# res is a dict containing the results of the optimization. in particular we
# want the optimized model parameters:
P_fit = res['x']
# we can pass these to gaussian() to evaluate our fitted model
y_fit = gaussian(P_fit, x)
# now let's plot the results:
fig, ax = pp.subplots(1, 1)
# ax.hold(True)  # not needed; hold was removed from modern matplotlib
ax.plot(x, y, '-r', lw=2, label='Real')
ax.plot(x, y_noisy, '-k', alpha=0.5, label='Noisy')
ax.plot(x, y_fit, '--b', lw=5, label='Fit')
ax.legend(loc=0, fancybox=True)
*Some solvers, e.g. conjugate gradient methods, take the Jacobian as an additional argument, and by and large these solvers are faster and more robust, but if you're feeling lazy and performance isn't all that critical then you can usually get away without providing the Jacobian, in which case it will use the finite differences method to estimate the gradients.
You can read more about the different solvers here.
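If you do want to supply a gradient, here is a sketch of what the analytic gradient of the loss above (w.r.t. a, b and c) could look like, passed via the jac argument to a gradient-based solver such as 'BFGS':
# analytic gradient of loss() w.r.t. P = (a, b, c) for yhat = a*exp(-(x-b)**2/(2*c**2))
def loss_grad(P, x, y):
    a, b, c = P
    g = np.exp(-(x - b)**2 / (2*c**2))   # unit-amplitude Gaussian
    r = y - a*g                          # residuals
    dL_da = -2*np.sum(r * g)
    dL_db = -2*np.sum(r * a * g * (x - b) / c**2)
    dL_dc = -2*np.sum(r * a * g * (x - b)**2 / c**3)
    return np.array([dL_da, dL_db, dL_dc])

res_bfgs = minimize(loss, P_guess, jac=loss_grad, method='BFGS', args=(x, y_noisy))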

How to calculate error for polynomial fitting (in slope and intercept)

Hi, I want to calculate the errors in the slope and intercept that are calculated by the scipy polyfit function. I have a (+/-) uncertainty for ydata, so how can I include it when calculating the uncertainty of the slope and intercept? My code is:
from scipy import polyfit
import pylab as plt
from numpy import *
data = loadtxt("data.txt")
xdata,ydata = data[:,0],data[:,1]
x_d,y_d = log10(xdata),log10(ydata)
polycoef = polyfit(x_d, y_d, 1)
yfit = 10**( polycoef[0]*x_d+polycoef[1] )
plt.subplot(111)
plt.loglog(xdata,ydata,'.k',xdata,yfit,'-r')
plt.show()
Thanks a lot
You could use scipy.optimize.curve_fit instead of polyfit. It has a parameter sigma for errors of ydata. If you have your error for every y value in a sequence yerror (so that yerror has the same length as your y_d sequence) you can do:
polycoef, _ = scipy.optimize.curve_fit(lambda x, a, b: a*x+b, x_d, y_d, sigma=yerror)
For an alternative see the paragraph Fitting a power-law to data with errors in the Scipy Cookbook.
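To actually get the slope and intercept uncertainties out of that call, also capture the covariance matrix it returns (a sketch, assuming yerror contains absolute 1-sigma errors on y_d):
import numpy as np
import scipy.optimize

polycoef, pcov = scipy.optimize.curve_fit(lambda x, a, b: a*x + b, x_d, y_d,
                                          sigma=yerror, absolute_sigma=True)
slope_err, intercept_err = np.sqrt(np.diag(pcov))   # 1-sigma errors on a and b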
