finding regression to second order poynomial python - python

I have a piece of code below and want to find the regression to the line (how good the data points match this line). I want my fit to be second order polynomial. How can i do this? and is there a method which takes errors into consideration.
plt.errorbar(x,y,fmt='*')
z = np.polyfit(x, y, 2)
xxx=np.linspace(0.65,2,10)
ppp = np.poly1d(z)
plt.plot(xxx,ppp(xxx))

According to the documentation on numpy.polyfit, it can also return residuals, which are the errors you are looking for. Look at the Returns section. And you can set the polynomial degree that you want with the parameter deg.

Related

How to achieve line of best fit instead of a logarithmic

I attempted to fit a best fit line with my data points using scipy.optimize.curvefit function:
x = np.array([0,2246,2600,3465,4392])
y = [-0.763,0.052,0.081,0.266,0.179]
yerror = [0.201,0.113,0.139,0.162,0.204]
plt.errorbar(wavelength,A,yerr=B, xerr=None, fmt='o')
def func(x, a, b, c):#the best fit function
return a + (b * x)**c
popt, pcov = scipy.optimize.curve_fit(func, x, y)
x_fit = np.linspace(0, np.max(x), 1000) # create curve line of best fit
plt.plot(x_fit, func(x_fit, *popt), "b")
My popt value is: array([-7.63283206e-01, 2.23580046e-04, 2.63164486e-01])
where the first value -7.63283206e-01 is the intercept I wish it to show in the graph.
The data points and best fit are plotted here using code above and gives a logarithmic curve, but I want the line of best fit to pass through the y axis like this instead to illustrate a straighter curve.
Thanks in advance!
"Best fit" means nothing until the criteria of fitting be specified (Least mean square error or least mean square relative error or least mean absolute error or etc.). The "best fit" is different for each one.
Sine there is no criteria of fitting specified, why not choosing the simplest method, without iterative process and without need of guessed initial values of parameters.
The method below, from https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales gives :
If you specify a particular criteria of fitting then an iterative method with guessed initial values are required. You can use the above values a,b,c as initial values for a robust and fast convergence.

Fitting data from scatterplot

I have a Dataframe with two columns which I scatter plotted and got something like the following picture:
I would like to know if there is a way to find a distribution curve who best fits it, since the tutorials I've found focus in the distribution of one variable only (e.g. this case. I'm looking for something like this:
Does anyone have any directions or sample code for this case?
You can try fitting different degrees of polynomial using numpy.polyfit. It takes x, y and degree of fitting polynomial as inputs.
You can write a loop which iterates from 1 to say 5 for the degrees. Plot the f(x) using the coefficients which are returned by the function.
for d in degrees:
Fit using np.polyfit(x, y, d)
Get coefficients and optionally plot f(x) for degree d
Then find sum of squares (yi - f(xi))^2
Note that the sum of squares is just an indication of the error - in general it would go down as the degree increases but the plotting will kind of show you if you are overfitting to the data.
This is just one of the ways to go about solving the problem.

Problems with curve_fit from scipy.optimze

I know that there are some similar questions, but since none of them brought me any further, I decided to ask one of my own.
I am sorry, if the answer to my problem is already somewhere out there, but I really couldn't find it.
I tried fitting f(x) = a*x**b to rather linear data using curve_fit. It compiles properly, but the result is way off as shown below:
The thing is, that I don't really know what I am doing, but on the other hand fitting always is more of an art than science and there was at least one general bug with scipy.optimize.
My data looks like this:
x-values:
[16.8, 2.97, 0.157, 0.0394, 14.000000000000002, 8.03, 0.378, 0.192, 0.0428, 0.029799999999999997, 0.000781, 0.0007890000000000001]
y-values:
[14561.766666666666, 7154.7950000000001, 661.53750000000002, 104.51446666666668, 40307.949999999997, 15993.933333333332, 1798.1166666666666, 1015.0476666666667, 194.93800000000002, 136.82833333333332, 9.9531566666666684, 12.073133333333333]
That's my code (using a really nice example in the last answer to that question):
def func(x,p0,p1): # HERE WE DEFINE A FUNCTION THAT WE THINK WILL FOLLOW THE DATA DISTRIBUTION
return p0*(x**p1)
# Here you give the initial parameters for p0 which Python then iterates over to find the best fit
popt, pcov = curve_fit(func,xvalues,yvalues, p0=(1.0,1.0))#p0=(3107,0.944)) #THESE PARAMETERS ARE USER DEFINED
print(popt) # This contains your two best fit parameters
# Performing sum of squares
p0 = popt[0]
p1 = popt[1]
residuals = yvalues - func(xvalues,p0,p1)
fres = sum(residuals**2)
print 'chi-square'
print(fres) #THIS IS YOUR CHI-SQUARE VALUE!
xaxis = np.linspace(5e-4,20) # we can plot with xdata, but fit will not look good
curve_y = func(xaxis,p0,p1)
The starting values are from a fit with gnuplot, that is plausible but I need to cross-check.
This is printed output (first fitted p0, p1, then chi-square):
[ 4.67885857e+03 6.24149549e-01]
chi-square
424707043.407
I guess this is a difficult question, therefore much thanks in advance!
When fitting curve_fit optimizes the sum of (data - model)^2 / (error)^2
If you don't pass in errors (as you are doing here) curve_fit assumes that all of the points have an error of 1.
In this case, as your data spans many orders of magnitude, the points with the largest y values dominate the objective function, and causes curve_fit to attempt to fit them at the expense of the others.
The best way of fixing this would be including the errors in your yvalues in the fit (it looks like you do as you have error bars in the plot you have made!). You can do this by passing them in as the sigma parameter of curve_fit.
I would rethink the experimental part. Two datapoints are questionable:
The image you showed us looks pretty good because you took the log:
You could do a linear fit on log(x) and log(y). In this way you might limit the impact of the largest residuals. Another approach would be robust regression (RANSAC from sklearn or least_squares from scipy).
Nevertheless you should either gather more datapoints or repeat the measurements.

Extrapolating data from a curve using Python

I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st order function (green). The x-axis is days since beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isnt accurate enough and Im unaware of any methods of nonlinear regression that can work in this instance.
This solution isnt accurate enough as if I feed
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x,y,1)
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x,y, '.', x_new, y_new, '-')
EDIT:
I have now tried the curve_fit using a logarithmic function as the curve and data behaviour seems to conform to:
def func(x, a, b):
return a*numpy.log(x)+b
x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot( future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting well enough to your actual data then either:
You are using the function wrong, e.g. You are using 1st order polynomials - So if you are convinced that it is a polynomial then try higher order polynomials.
You are using the wrong function, it is always worth taking a look at:
your data curve &
what you know about the process that is generating the data
to come up with some speculation/theorem/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating on, etc. try them!
Finally, if you are not getting a consistent long term trend then you might be able to justify using cubic splines.

Python LeastSquares plot

I have to draw plot using least squares method in Python 3. I have list of x and y values:
y = [186,273,308,484]
x = [2.25,2.34,2.47,2.56]
There are many more values for x and for y, there is only a shortcut. And now, I know, that f(x)=y should be a linear function. I can get cofactor „a” and „b” of this function, by calculating:
delta_x = x[len(x)]-x[0] and delta_y = y[len(y)]-y[0]
Etc, using tangent function. I know, how to do it.
But there are also uncertainties of y, about 2 percent of y. So I have y_errors table, which contains all uncertainties of y.
But what now, how I can draw least squares?
Of course I have been used Google, I saw docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#least-square-fitting-leastsq, but there are some problems.
I tried to edit example from scipy.org to my own purpose. So I edited x, y, y_meas variables, putting here my own lists. But now, I dont know, what is p0 variable in this example. And what should I must edit to make my example working.
Of course I can edit also residuals function. It must get only one variable - y_true. In addition to this I dont understand arquments of leastsq function.
Sorry for my english and for asking such newbie question. But I dont understand this method. Thank You in advance.
I believe you are trying to fit a set of {x, y} (and possibly sigma_y: the uncertainties in y) values to a linear expression. This is known as linear regression, and For linear regression (or indeed, for regression of any polynomial) you can use numpy's polyfit. The uncertainties can be used for the weights::
weight = 1/sigma_y
where sigma_y is the standard deviation in y.
The least-squares routines in scipy.optimize allow you to fit a non-linear function to data, but you have to write the function that computes the "residual" (data - model) in terms of variables that are to be adjusted in order to minimize the calculated residual.

Categories

Resources