I have a data set which in theory is described by a polynomial of the second degree. I would like to fit this data, and I have used numpy.polyfit to do so. However, the downside is that the error on the returned coefficients is not available. I therefore decided to also fit the data using scipy.odr. The weird thing was that the coefficients for the polynomial deviated from each other between the two methods.
I do not understand this, and therefore decided to test both fitting routines on a set of data that I produce myself:
import numpy
import scipy.odr
import matplotlib.pyplot as plt
x = numpy.arange(-20, 20, 0.1)
y = 1.8 * x**2 -2.1 * x + 0.6 + numpy.random.normal(scale = 100, size = len(x))
#Define function for scipy.odr
def fit_func(p, t):
return p[0] * t**2 + p[1] * t + p[2]
#Fit the data using numpy.polyfit
fit_np = numpy.polyfit(x, y, 2)
#Fit the data using scipy.odr
Model = scipy.odr.Model(fit_func)
Data = scipy.odr.RealData(x, y)
Odr = scipy.odr.ODR(Data, Model, [1.5, -2, 1], maxit = 10000)
output = Odr.run()
#output.pprint()
beta = output.beta
betastd = output.sd_beta
print "poly", fit_np
print "ODR", beta
plt.plot(x, y, "bo")
plt.plot(x, numpy.polyval(fit_np, x), "r--", lw = 2)
plt.plot(x, fit_func(beta, x), "g--", lw = 2)
plt.tight_layout()
plt.show()
An example of an outcome is as follows:
poly [ 1.77992643 -2.42753714 3.86331152]
ODR [ 3.8161735 -23.08952492 -146.76214989]
In the included image, the solution of numpy.polyfit (red dashed line) corresponds to the data pretty well, while the solution of scipy.odr (green dashed line) is basically completely off. I do have to note that the difference between numpy.polyfit and scipy.odr was smaller for the actual data set I wanted to fit. However, I do not understand where the difference between the two comes from, why the difference is extremely big in my own test example, and which fitting routine is better.
I hope you can provide answers that help me gain a better understanding of the two fitting routines and, in the process, answer the questions I have.
The way you are using ODR, it does a full orthogonal distance regression. To have it do an ordinary nonlinear least-squares fit instead, add
Odr.set_job(fit_type=2)
before starting the optimization and you will get what you expected.
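For clarity, with the code from the question this amounts to something like the following (only the set_job line is new):
Odr = scipy.odr.ODR(Data, Model, [1.5, -2, 1], maxit=10000)
Odr.set_job(fit_type=2)  # fit_type=2 selects ordinary least squares instead of full ODR
output = Odr.run()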
The reason the full ODR fails so badly is that no weights/standard deviations were specified. It apparently has a hard time interpreting that point cloud and assumes equal weights for x and y. If you provide estimated standard deviations, odr will yield a good (though of course different) result, too.
Data = scipy.odr.RealData(x, y, sx=0.1, sy=10)
The actual problem is that the odr output has the beta coefficients in the opposite order from the one numpy.polyfit uses, so the green curve is not calculated correctly. To plot it, use instead
plt.plot(x, fit_func(beta[::-1], x), "g--", lw = 2)
Related
I have already checked post1, post2, post3, and post4, but they didn't help.
I have data about a specific plant, including two variables called "Age" and "Height". The correlation between them is non-linear.
To fit a model, one solution I came up with is as follows:
If the non-linear function is
Height = b0 + b1*Age + b2*Age^2
then we can bring in a new variable K where
K = Age^2
so we have changed the first non-linear function into a multiple linear regression one. Based on this, I have the following code:
data['K'] = data["Age"].pow(2)
x = data[["Age", "K"]]
y = data["Height"]
model = LinearRegression().fit(x, y)
print(model.score(x, y)) # = 0.9908571840250205
Am I doing this correctly?
How can I do the same with cubic and exponential functions?
Thanks.
For cubic polynomials:
data['x2'] = data["Age"].pow(2)
data['x3'] = data["Age"].pow(3)
x = data[["Age", "x2","x3"]]
y = data["Height"]
model = LinearRegression().fit(x, y)
print(model.score(x, y))
You can handle exponential data by fitting log(y), or find a library that can fit polynomials automatically, e.g. https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
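For the exponential case, here is a minimal sketch of the log-transform idea, assuming a model of the form Height = A * exp(B * Age) with strictly positive heights (the variable names are only illustrative):
import numpy as np

log_y = np.log(data["Height"])
B, logA = np.polyfit(data["Age"], log_y, 1)  # polyfit returns [slope, intercept]
A = np.exp(logA)
height_pred = A * np.exp(B * data["Age"])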
Hopefully you don't have a religious fervor for using SKLearn here because the answer I'm going to suggest is going to completely ignore it.
If you're interested in doing regression analysis where you get complete autonomy over the fitting function, I'd suggest cutting directly down to the least-squares optimization algorithm that drives a lot of this type of work, which you can do using scipy:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
x, y = np.array([0,1,2,3,4,5]), np.array([0,1,4,9,16,25])
# initial_guess[i] maps to p[i] in function_to_fit, and must be reasonable
initial_guess = [1, 1, 1]
def function_to_fit(x, p):
return pow(p[0]*x, 2) + p[1]*x + p[2]
def residuals(p,y,x):
return y - function_to_fit(x,p)
cnsts = leastsq(
residuals,
initial_guess,
args=(y, x)
)[0]
fig, ax = plt.subplots()
ax.plot(x, y, 'o')
xi = np.arange(0,10,0.1)
ax.plot(xi, function_to_fit(xi, cnsts))  # the model is vectorized over numpy arrays, so pass xi directly
plt.show()
Now, this is a numeric approach to the solution, so I would recommend taking a moment to make sure you understand the limitations of such an approach, but for problems like these I've found they're more than adequate for functionalizing non-linear data sets without doing some hand-waving to make them fit inside a linearizable manifold.
So I had some points in a dataframe that led me to believe I was dealing with a power law curve. After some googling, I used what I found in this post to go about curve fitting.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func_powerlaw(x, m, c, c0):
return c0 + x**m * c
target_func = func_powerlaw
X = np.array(selection_to_feed.selection[1:])
y = np.array(selection_to_feed.avg_feed_size[1:])
popt, pcov = curve_fit(func_powerlaw, X, y, p0 =np.asarray([-1,10**5,0]))
curvex = np.linspace(0,5000,1000)
curvey = target_func(curvex, *popt)
plt.figure(figsize=(10, 5))
plt.plot(curvex, curvey, '--')
plt.plot(X, y, 'ro')
plt.legend()
plt.show()
This is the result:
[Plot: the fitted curve (blue dashed line) with the data points (red)]
The problem is, the curve fit results in negative values for the first few values (as you can see in the blue line), and in the actual relationship, no negative Y values can exist.
A few questions:
What can I do to make sure no negative Y values are output? Really, an X of 0 should have a Y value of 0 as well.
Is power law curve fitting even the right thing to do? How would you describe this curve?
Thank you!
If you are only looking for a simple approximating equation with a better fit, I extracted data from your plot and added the known data point [0,0] per your note. Since the uncertainty for the [0,0] point is zero - that is, you are 100% certain of that value - I used a weighted regression where that one known point was given an extremely high weight and the weight for all other points was 1. This had the effect of forcing the curve through the [0,0] point, which can be done with any software that allows weighted fitting. I found that a Standard Geometric plus offset equation, "y = a * pow(x, (b * x)) + offset", with parameters:
a = -1.0704001788540748E+02
b = -1.5095055897637395E-03
Offset = 1.0704001788540748E+02
fits as shown in the attached plot and passes through [0,0]. My suggestion is to perform a regression using this equation with the actual data plus the known [0,0] point, using these values as the initial parameter estimates - and if possible using a very large weight for the [0,0] point as I did.
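As a rough sketch of that idea with scipy's curve_fit (different software from what I used): curve_fit accepts per-point sigma values, so giving the added [0, 0] point a tiny sigma plays the role of the very large weight described above. The equation and starting values are the ones quoted above; X and y are the arrays from the question, and func_geometric_offset is just my name for the model.
import numpy as np
from scipy.optimize import curve_fit

def func_geometric_offset(x, a, b, offset):
    # Standard Geometric plus offset: y = a * x**(b*x) + offset
    return a * np.power(x, b * x) + offset

X0 = np.insert(np.asarray(X, dtype=float), 0, 0.0)   # prepend the known [0, 0] point
y0 = np.insert(np.asarray(y, dtype=float), 0, 0.0)
sigma = np.ones_like(y0)
sigma[0] = 1e-6                                      # tiny sigma = very large weight on [0, 0]
p0 = [-1.07e2, -1.51e-3, 1.07e2]                     # initial estimates from the fit above
popt, pcov = curve_fit(func_geometric_offset, X0, y0, p0=p0, sigma=sigma, maxfev=20000)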
I am having a hard time trying to understand why my Gaussian fit to a set of data (ydata) does not work well if I shift the interval of x-values corresponding to that data (xdata1 to xdata2). The Gaussian is written as
f(x) = A * exp(-(x - mean)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
where A is just an amplitude factor. Changing some of the values of the data, it is easy to make it work for both cases, but one can also easily find cases in which it does not work well for xdata1 and in which the covariance of the parameters is not estimated.
I am using scipy.optimize.curve_fit in Spyder with Python 3.7.1 on Windows 7.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
xdata1 = np.linspace(-9,4,20, endpoint=True) # works fine
xdata2 = xdata1+2
ydata = np.array([8,9,15,12,14,20,24,40,54,94,160,290,400,420,300,130,40,10,8,4])
def gaussian(x, amp, mean, sigma):
return amp*np.exp(-(((x-mean)**2)/(2*sigma**2)))/(sigma*np.sqrt(2*np.pi))
popt1, pcov1 = curve_fit(gaussian, xdata1, ydata)
popt2, pcov2 = curve_fit(gaussian, xdata2, ydata)
fig, ([ax1, ax2]) = plt.subplots(nrows=1, ncols=2,figsize=(9, 4))
ax1.plot(xdata1, ydata, 'b+:', label='xdata1')
ax1.plot(xdata1, gaussian(xdata1, *popt1), 'r-', label='fit')
ax1.legend()
ax2.plot(xdata2, ydata, 'b+:', label='xdata2')
ax2.plot(xdata2, gaussian(xdata2, *popt2), 'r-', label='fit')
ax2.legend()
The problem is that your second attempt at fitting a Gaussian is getting stuck in a local minimum while searching parameter space: curve_fit is a wrapper around least_squares, which uses gradient descent to minimize the cost function, and this is liable to get stuck in local minima.
You should try providing reasonable starting parameters (by using the p0 argument of curve_fit) to avoid this:
#... your code
y_max = np.max(ydata)
max_pos = xdata2[ydata == y_max][0]  # x-position of the maximum, used as the initial mean
initial_guess = [y_max, max_pos, 1]  # amplitude, mean, std
popt2, pcov2 = curve_fit(gaussian, xdata2, ydata, p0=initial_guess)
As you can see, this provides a reasonable fit.
You should write a function which can provide reasonable estimates of the starting parameters. Here I just found the maximum y value and used it to determine the initial parameters. I've found this works well for fitting normal distributions, but you could consider other methods.
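For reference, a minimal sketch of such an estimator, using the y-weighted mean and spread of the data as starting values (estimate_gaussian_p0 is just an illustrative name, and this is only one possible approach):
import numpy as np

def estimate_gaussian_p0(x, y):
    """Rough starting values [amplitude, mean, sigma] for the gaussian defined above."""
    mean = np.sum(x * y) / np.sum(y)                        # y-weighted mean of x
    sigma = np.sqrt(np.sum(y * (x - mean)**2) / np.sum(y))  # y-weighted spread
    amp = y.max() * sigma * np.sqrt(2 * np.pi)              # undo the 1/(sigma*sqrt(2*pi)) factor
    return [amp, mean, sigma]

popt2, pcov2 = curve_fit(gaussian, xdata2, ydata, p0=estimate_gaussian_p0(xdata2, ydata))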
Edit:
You can also solve the problem by scaling the amplitude: the amplitude is so large the parameter space is distorted and the gradient descent simply follows the direction of greatest change in the amplitude and effectively ignores the sigma. Consider the following plot in parameter space (Colour is the sum of the squared residuals of the fit for given parameters and the white cross shows the optimal solution):
Make sure to make note of the different scales for the x and y axis.
One needs to make a large number of 'unit'-sized steps in y (amplitude) to reach the minimum from the point x, y = (0, 0), whereas less than one 'unit'-sized step is enough to reach the minimum in x (sigma). The algorithm simply takes steps in amplitude, as this is the steepest gradient. When it reaches the amplitude that minimises the cost function, it simply stops, as it appears to have converged, and makes little or no change to the sigma parameter.
One way to fix this is to scale your ydata to un-distort the parameter space: divide your ydata by 100 and you will see your fit works without providing any starting parameters!
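Concretely, a minimal sketch of that rescaling with the arrays from the question (the factor of 100 is the one suggested above, and the variable names are mine; remember to scale the fitted amplitude back up):
scale = 100.0
popt2s, pcov2s = curve_fit(gaussian, xdata2, ydata / scale)
amp, mean, sigma = popt2s
amp *= scale  # mean and sigma are unaffected by the rescaling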
I want to fit a function with vector output using Scipy's curve_fit (or something more appropriate if available). For example, consider the following function:
import numpy as np
def fmodel(x, a, b):
return np.vstack([a*np.sin(b*x), a*x**2 - b*x, a*np.exp(b/x)])
Each component is a different function but they share the parameters I wish to fit. Ideally, I would do something like this:
x = np.linspace(1, 20, 50)
a = 0.1
b = 0.5
y = fmodel(x, a, b)
y_noisy = y + 0.2 * np.random.normal(size=y.shape)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=fmodel, xdata=x, ydata=y_noisy, p0=[0.3, 0.1])
But curve_fit does not work with functions that have vector output, and the error "Result from function call is not a proper array of floats." is thrown. What I did instead is to flatten out the output like this:
def fmodel_flat(x, a, b):
    return fmodel(x[0:len(x)//3], a, b).flatten()
popt, pcov = curve_fit(f=fmodel_flat, xdata=np.tile(x, 3),
ydata=y_noisy.flatten(), p0=[0.3, 0.1])
and this works. If instead of a vector function I am actually fitting several functions with different inputs as well but which share model parameters, I can concatenate both input and output.
Is there a more appropriate way to fit vector function with Scipy or perhaps some additional module? A main consideration for me is efficiency - the actual functions to fit are much more complex and fitting can take some time, so if this use of curve_fit is mangled and is leading to excessive runtimes I would like to know what I should do instead.
If I can be so blunt as to recommend my own package symfit, I think it does precisely what you need. An example on fitting with shared parameters can be found in the docs.
Your specific problem stated above would become:
import numpy as np
from symfit import variables, parameters, Model, Fit, sin, exp
x, y_1, y_2, y_3 = variables('x, y_1, y_2, y_3')
a, b = parameters('a, b')
a.value = 0.3
b.value = 0.1
model = Model({
y_1: a * sin(b * x),
y_2: a * x**2 - b * x,
y_3: a * exp(b / x),
})
xdata = np.linspace(1, 20, 50)
ydata = model(x=xdata, a=0.1, b=0.5)
y_noisy = ydata + 0.2 * np.random.normal(size=(len(model), len(xdata)))
fit = Fit(model, x=xdata, y_1=y_noisy[0], y_2=y_noisy[1], y_3=y_noisy[2])
fit_result = fit.execute()
Check out the docs for more!
I think what you're doing is perfectly fine from an efficiency stand point. I'll try to look at the implementation and come up with something more quantitative, but for the time being here is my reasoning.
What you're doing during curve fitting is optimizing the parameters (a,b) such that
res = sum_i |f(x_i; a,b)-y_i|^2
is minimal. By this I mean that you have data points (x_i,y_i) of arbitrary dimensionality, two parameters (a,b) and a fitting model that approximates the data at query points x_i.
The curve fitting algorithm starts from a starting (a,b) pair, puts this into a black box that computes the above square error, and tries to come up with a new (a',b') pair that produces a smaller error. My point is that the error above is really a black box for the fitting algorithm: the configurational space of the fitting is defined merely by the (a,b) parameters. If you imagine how you'd implement a simple curve fitting function, you could imagine that you try to do, say, a gradient descent, with the square error as cost function.
Now, it should be irrelevant to the fitting procedure how the black box computes the error. It's easy to see that the dimensionality of x_i is really irrelevant for scalar functions, since it doesn't matter if you have 1000 1d query points to fit for, or a 10x10x10 grid in 3d space. What matters is that you have 1000 points x_i for which you need to compute f(x_i) ~ y_i from the model.
The only subtlety that should further be noted is that in case of a vector-valued function, the calculation of the error is not trivial. In my opinion, it's fine to define the error at each x_i point using the 2-norm of the vector-valued function. But hey: in this case, the square error at point x_i is
|f(x_i; a,b)-y_i|^2 == sum_k (f(x_i; a,b)[k]-y_i[k])^2
which implies that the square error for each component is accumulated. This means that what you're doing right now is exactly right: by replicating your x_i points and taking each component of the function into account individually, your square error will contain exactly the 2-norm of the error at each point.
So my point is what you're doing is mathematically correct, and I don't expect any behaviour of the fitting procedure to depend on the way how multivariate/vector-valued functions are handled.
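If you want to avoid the tiling trick, a roughly equivalent alternative (a sketch of my own, not something curve_fit offers directly) is to call scipy.optimize.least_squares yourself with a residual helper that flattens the vector output; the squared 2-norm it minimizes is exactly the accumulated error described above. Here residuals is an illustrative name, and fmodel, x, and y_noisy are taken from the question.
import numpy as np
from scipy.optimize import least_squares

def residuals(params, x, y_obs):
    a, b = params
    return (fmodel(x, a, b) - y_obs).ravel()  # least_squares expects a 1-D residual array

res = least_squares(residuals, x0=[0.3, 0.1], args=(x, y_noisy))
a_fit, b_fit = res.x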
I am doing a computer simulation for some physical system of finite size, and after this I am doing extrapolation to the infinity (Thermodynamic limit). Some theory says that data should scale linearly with system size, so I am doing linear regression.
The data I have is noisy, but for each data point I can estimate errorbars. So, for example data points looks like:
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
Let's say I am trying to do this in Python.
First way that I know is:
m, c, r_value, p_value, std_err = scipy.stats.linregress(x_list, y_list)
I understand this gives me errorbars of the result, but this does not take into account errorbars of the initial data.
Second way that I know is:
c, m = numpy.polynomial.polynomial.polyfit(x_list, y_list, 1, w=[1.0 / ty for ty in y_err], full=False)  # coefficients are returned in order of increasing degree: intercept first, then slope
Here we use the inverse of the errorbar for each point as a weight in the least-squares approximation. So if a point is not really that reliable, it will not influence the result a lot, which is reasonable.
But I can not figure out how to get something that combines both these methods.
What I really want is what second method does, meaning use regression when every point influences the result with different weight. But at the same time I want to know how accurate my result is, meaning, I want to know what are errorbars of the resulting coefficients.
How can I do this?
Not entirely sure if this is what you mean, but…using pandas, statsmodels, and patsy, we can compare an ordinary least-squares fit and a weighted least-squares fit which uses the inverse of the noise you provided as a weight matrix (statsmodels will complain about sample sizes < 20, by the way).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import statsmodels.formula.api as sm
x_list = [0.3333333333333333, 0.2886751345948129, 0.25, 0.23570226039551587, 0.22360679774997896, 0.20412414523193154, 0.2, 0.16666666666666666]
y_list = [0.13250359351851854, 0.12098339583333334, 0.12398501145833334, 0.09152715, 0.11167239583333334, 0.10876248333333333, 0.09814170444444444, 0.08560799305555555]
y_err = [0.003306749165349316, 0.003818446389148108, 0.0056036878203831785, 0.0036635292592592595, 0.0037034897788415424, 0.007576672222222223, 0.002981084130692832, 0.0034913019065973983]
# put x and y into a pandas DataFrame, and the weights into a Series
ws = pd.DataFrame({
'x': x_list,
'y': y_list
})
weights = pd.Series(y_err)
wls_fit = sm.wls('x ~ y', data=ws, weights=1 / weights).fit()
ols_fit = sm.ols('x ~ y', data=ws).fit()
# show the fit summary by calling wls_fit.summary()
# wls fit r-squared is 0.754
# ols fit r-squared is 0.701
# let's plot our data
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w')
ws.plot(
kind='scatter',
x='x',
y='y',
style='o',
alpha=1.,
ax=ax,
title='x vs y scatter',
edgecolor='#ff8300',
s=40
)
# weighted prediction
wp, = ax.plot(
wls_fit.predict(),
ws['y'],
color='#e55ea2',
lw=1.,
alpha=1.0,
)
# unweighted prediction
op, = ax.plot(
ols_fit.predict(),
ws['y'],
color='k',
ls='solid',
lw=1,
alpha=1.0,
)
leg = plt.legend(
(op, wp),
('Ordinary Least Squares', 'Weighted Least Squares'),
loc='upper left',
fontsize=8)
plt.tight_layout()
fig.set_size_inches(6.40, 5.12)
plt.show()
WLS residuals:
[0.025624005084707302,
0.013611438189866154,
-0.033569595462217161,
0.044110895217014695,
-0.025071632845910546,
-0.036308252199571928,
-0.010335514810672464,
-0.0081511479431851663]
The mean squared error of the residuals for the weighted fit (wls_fit.mse_resid or wls_fit.scale) is 0.22964802498892287, and the r-squared value of the fit is 0.754.
You can obtain a wealth of data about the fits by calling their summary() method, and/or doing dir(wls_fit), if you need a list of every available property and method.
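Since the question specifically asks for error bars on the fitted coefficients, note that the fit results expose them directly; for example, with the wls_fit object from above:
print(wls_fit.params)      # fitted coefficients
print(wls_fit.bse)         # standard errors of the coefficients
print(wls_fit.conf_int())  # their confidence intervals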
I wrote a concise function to perform the weighted linear regression of a data set, which is a direct translation of GSL's "gsl_fit_wlinear" function. This is useful if you want to know exactly what your function is doing when it performs the fit.
import numpy as np

def wlinear_fit(x, y, w):
"""
Fit (x,y,w) to a linear function, using exact formulae for weighted linear
regression. This code was translated from the GNU Scientific Library (GSL),
it is an exact copy of the function gsl_fit_wlinear.
"""
# compute the weighted means and weighted deviations from the means
# wm denotes a "weighted mean", wm(f) = (sum_i w_i f_i) / (sum_i w_i)
W = np.sum(w)
wm_x = np.average(x,weights=w)
wm_y = np.average(y,weights=w)
dx = x-wm_x
dy = y-wm_y
wm_dx2 = np.average(dx**2,weights=w)
wm_dxdy = np.average(dx*dy,weights=w)
# In terms of y = a + b x
b = wm_dxdy / wm_dx2
a = wm_y - wm_x*b
cov_00 = (1.0/W) * (1.0 + wm_x**2/wm_dx2)
cov_11 = 1.0 / (W*wm_dx2)
cov_01 = -wm_x / (W*wm_dx2)
# Compute chi^2 = \sum w_i (y_i - (a + b * x_i))^2
chi2 = np.sum (w * (y-(a+b*x))**2)
return a,b,cov_00,cov_11,cov_01,chi2
To perform your fit, you would do
a, b, cov_00, cov_11, cov_01, chi2 = wlinear_fit(np.array(x_list), np.array(y_list), 1.0/np.array(y_err)**2)
Which will return the best estimate for the coefficients a (the intercept) and b (the slope) of the linear regression, along with the elements of the covariance matrix cov_00, cov_01 and cov_11. The best estimate on the error on a is then the square root of cov_00 and the one on b is the square root of cov_11. The weighted sum of the residuals is returned in the chi2 variable.
IMPORTANT: this function accepts inverse variances, not inverse standard deviations, as the weights for the data points.
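As a small follow-up to the call above, the error bars the question asks for come straight from the returned covariance terms:
a_err = np.sqrt(cov_00)  # standard error of the intercept a
b_err = np.sqrt(cov_11)  # standard error of the slope b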
sklearn.linear_model.LinearRegression supports specification of weights during fit:
import numpy as np
from sklearn.linear_model import LinearRegression

x_data = np.array(x_list).reshape(-1, 1)  # The model expects shape (n_samples, n_features).
y_data = np.array(y_list)
y_err = np.array(y_err)
model = LinearRegression()
model.fit(x_data, y_data, sample_weight=1/y_err)
Here the sample weight is specified as 1 / y_err. Different versions are possible and often it's a good idea to clip these sample weights to a maximum value in case the y_err varies strongly or has small outliers:
sample_weight = 1 / y_err
sample_weight = np.minimum(sample_weight, MAX_WEIGHT)
where MAX_WEIGHT should be determined from your data (by looking at the y_err or 1 / y_err distributions, e.g. if they have outliers they can be clipped).
I found this document helpful in understanding and setting up my own weighted least squares routine (applicable for any programming language).
Typically, learning and using optimized routines is the best way to go, but there are times when understanding the guts of a routine is important.