Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to new data; I am just focusing on the data I have. I realize that the higher the degree, the better the fit. However, I want something that penalizes complexity, or that looks at where the error curve elbows. When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of the fit, the lower the error will be, but eventually it plateaus like the image above. Therefore, if I want to automatically compute the degree where the error curve elbows: if my error is E and d is my degree, I want to maximize the second difference (E[d-1] - E[d]) - (E[d] - E[d+1]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
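As a rough sketch of the approach described above (the toy data and variable names are placeholders), one could compute the fit error over a range of degrees and pick the degree where the second difference of the error curve peaks:

```python
import numpy as np

# Toy data; x, y would be the actual points being fit.
x = np.linspace(0, 5, 60)
y = np.sin(x) + 0.1 * np.random.randn(x.size)

degrees = list(range(1, 11))
errors = []
for d in degrees:
    coeffs = np.polyfit(x, y, d)
    errors.append(np.sum((np.polyval(coeffs, x) - y) ** 2))

# Second difference of the error curve: a large value marks the "elbow",
# the degree after which adding more terms stops paying off.
E = np.array(errors)
second_diff = (E[:-2] - E[1:-1]) - (E[1:-1] - E[2:])
elbow_degree = degrees[1:-1][int(np.argmax(second_diff))]
print(elbow_degree)
```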
To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
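As a minimal sketch of how this could look with numpy.polyfit, assuming the standard least-squares (Gaussian likelihood) forms of AIC and BIC (the helper name and toy data below are my own):

```python
import numpy as np

def poly_aic_bic(x, y, max_degree=10):
    """Fit polynomials of increasing degree and score each with AIC and BIC.

    Uses the least-squares forms AIC = n*ln(RSS/n) + 2k and
    BIC = n*ln(RSS/n) + k*ln(n), with k = degree + 1 fitted coefficients.
    """
    n = len(x)
    scores = []
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
        k = d + 1
        scores.append((d, n * np.log(rss / n) + 2 * k, n * np.log(rss / n) + k * np.log(n)))
    return scores

# Toy data; the degree with the lowest AIC (or BIC) wins -- lower is better.
x = np.linspace(0, 10, 50)
y = 2 * x**2 - 3 * x + 1 + np.random.normal(scale=5, size=x.size)
best_degree = min(poly_aic_bic(x, y), key=lambda t: t[1])[0]
print(best_degree)
```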
Related
I am fitting datasets with some broken power laws; the data has asymmetrical errors in X and Y, and I'd like to be able to introduce constraints on the fitted parameters (e.g. not below 0, or within a certain range).
Using Scipy.ODR, I can fit the data well, including the asymmetrical errors on both axes; however, I can't seem to find any way in the documentation to introduce bounds on my fitted parameters, and discussions online seem to suggest this is flat out impossible with this module: https://stackoverflow.com/a/17786438/19086741
Using Lmfit, I can also fit the data well and can introduce bounds to the fitted parameters. However, discussions online once again state that Lmfit is not able to handle asymmetrical errors, and errors on both axes.
Is there some module, or perhaps I am missing something with one of these modules, that would allow me to meet both of my requirements in this case? Many thanks.
Sorry, I don't have a good answer for you. As you note, Lmfit does not support ODR regression which allows for uncertainties in the (single) independent variable as well as uncertainties in the dependent variables.
I think this would be possible in principle. Unfortunately, ODR has a very different interface from the other minimization routines, making a wrapper as "another possible solving algorithm for lmfit" a bit challenging. I am sure that none of the developers would object to someone trying this, but it would take some effort.
FWIW, you say "both axes" as if you are certain there are exactly 2 axes. ODR supports exactly 1 independent variable; lmfit is not limited to this assumption.
You also say that lmfit cannot handle asymmetric uncertainties. That is only partially true. The lmfit.Model interface allows only a single uncertainty value per data point. But with the lmfit.minimize interface, you write your own objective function to calculate the array you want minimized, and so can weight some residual of "data" and "model" any way you want.
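For illustration only, here is one way such a hand-written objective might look with lmfit.minimize. The asymmetric-weighting scheme and the toy data are my own assumptions, not an lmfit feature:

```python
import numpy as np
import lmfit

# Toy data with asymmetric per-point uncertainties; replace with your own arrays.
x = np.linspace(0, 10, 40)
data = 2.5 * x + 1.0 + np.random.normal(scale=0.5, size=x.size)
err_lo = np.full_like(x, 0.4)    # downward uncertainty on each point
err_hi = np.full_like(x, 0.8)    # upward uncertainty on each point

params = lmfit.Parameters()
params.add('slope', value=1.0, min=0)     # bounds are set per parameter
params.add('intercept', value=0.0)

def residual(params, x, data, err_lo, err_hi):
    p = params.valuesdict()
    model = p['slope'] * x + p['intercept']
    # One possible scheme for asymmetric y errors (an assumption, not an lmfit
    # feature): use the upper error bar when the model sits above the data point,
    # and the lower error bar otherwise.
    sigma = np.where(model > data, err_hi, err_lo)
    return (data - model) / sigma

result = lmfit.minimize(residual, params, args=(x, data, err_lo, err_hi))
print(lmfit.fit_report(result))
```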
This question is about curve fitting in Python.
First, I would say that I do not know the model function to pass to the curve_fit function in the scipy library; therefore, I am trying to use polyfit, which is OK if I am interested in interpolation, but my goal is to predict values at future points, in other words extrapolation.
I have attached a screenshot of a raw signal, smoothed, and its polyfit result. It has the correct poly order but still fails at extrapolation. My conclusion is that polyfit is not the right approach here, but I cannot estimate the curve function. What are your thoughts?
Please note that this is not a distribution, since the y values may keep slowly decreasing indefinitely, even below 0.
I'd say the function looks like an exponential Gaussian, but again it's not a distribution, so I don't want to do that.
My last thought was to split the plot into two: the first part can certainly be modeled as a polynomial and the second as an exponential (the values are different from the first image because it's a different signal).
Then, maybe combine the two. What do you think about this?
Attached is a screenshot of this too.
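As a rough sketch of the split-and-combine idea, under assumed synthetic data and an assumed split point (every name below is a placeholder), one could fit each segment separately and stitch the two models together:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-in for the smoothed signal; x_split marks where the behaviour changes.
x = np.linspace(0, 20, 400)
y = np.where(x < 8, 0.1 * x**2, 6.4 * np.exp(-0.3 * (x - 8)))
x_split = 8.0

left = x < x_split
poly_coeffs = np.polyfit(x[left], y[left], 2)      # polynomial model for the first part

def exp_tail(x, a, k):
    """Decaying exponential for the second part, anchored at the split point."""
    return a * np.exp(-k * (x - x_split))

popt, _ = curve_fit(exp_tail, x[~left], y[~left], p0=[y[~left][0], 0.1])

def combined(xq):
    """Piecewise model: polynomial before the split, fitted exponential after it."""
    xq = np.asarray(xq, dtype=float)
    return np.where(xq < x_split, np.polyval(poly_coeffs, xq), exp_tail(xq, *popt))

print(combined([5.0, 15.0, 25.0]))   # the last value is an extrapolation
```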
Since many curves can fit the data and extrapolate differently, you need to choose the right basis functions to get the behaviour you want.
So far you have tried polynomials, for instance; these, however, tend to ±infinity, which is perhaps not what you want.
I would try and use curve_fit on a sum of Hermite polynomials or Laguerre polynomials. For instance, for Laguerre polynomials, you could try
a + b*exp(-k*x) + c*(1 - x)*exp(-k*x) + d*(x^2 - 4*x + 2)*exp(-k*x) + ...
Python has a lot of convenience functions built in for this, see e.g. https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.polynomials.laguerre.html
Note however that you should also fit k to your data, which you could use curve_fit for.
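A minimal sketch of what that might look like with scipy.optimize.curve_fit, using the damped basis written out above (the parameter names and the synthetic signal are my own assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def damped_laguerre(x, a, b, c, d, k):
    """Sum of the first few Laguerre-style terms, each damped by exp(-k*x).

    Mirrors the basis written out above; the parameter names are arbitrary.
    """
    decay = np.exp(-k * x)
    return a + b * decay + c * (1 - x) * decay + d * (x**2 - 4 * x + 2) * decay

# Synthetic stand-in for the observed signal; p0 provides rough starting guesses,
# including one for the decay rate k, which is fitted along with the coefficients.
xdata = np.linspace(0, 10, 200)
ydata = 1.0 + 2.0 * np.exp(-0.5 * xdata) + np.random.normal(scale=0.05, size=xdata.size)
popt, pcov = curve_fit(damped_laguerre, xdata, ydata, p0=[1, 1, 0.1, 0.01, 0.5])
```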
I have this data that I fit a linear function to and the fit determines other work (never mind, not important). I'm using numpy.polyfit, and when I simply include the data and the degree of the fit, nothing else, it produces this plot:
Now, the fit is okay, but the general consensus is that the line of best fit is being skewed by those red data points above it, and I should actually be fitting to the data just below it, which forms a nice linear shape (beginning around that congested blob of blue points). So I attempted to add a weighting to my call to polyfit, and I chose an arbitrary weighting of 1/sqrt(y-values), so that the smaller y-values are weighted more favourably. This gave the following:
Which admittedly is better, but I'm still unsatisfied, as now it appears the line is too low. I would ideally like a middle ground, but since I chose a really arbitrary weighting, I was wondering whether in general there is a way to perform a more robust fit using Python, and whether this can be done using polyfit? Using a separate package is fine too, if it works.
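For reference, the arbitrary weighting described above can be passed to numpy.polyfit through its w argument; the data here is a made-up stand-in:

```python
import numpy as np

# Synthetic stand-in for the data; the noise is skewed upward to mimic the outliers.
x = np.linspace(1, 10, 50)
y = 2 * x + 1 + np.random.exponential(scale=3, size=x.size)

# polyfit accepts per-point weights through its w keyword; 1/sqrt(y) is the
# ad-hoc weighting described above, shown for illustration rather than recommended.
coeffs = np.polyfit(x, y, 1, w=1.0 / np.sqrt(y))
print(coeffs)
```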
This question doesn't really have much to do with programming or Python; it has more to do with statistics or linear algebra.
You could compare the error of a best-fit line against a best-fit quadratic and see which has less error. But a lot of it is context related.
If you have 500 data points, then you could find a degree-499 polynomial that models your dataset with zero error. But if you weight your data points, then the weighting needs to make sense for the data.
If you want your best fit line to "look right" then just cut the foreplay and draw it where you want it. If you want it to make sense then ask a mathematician for a formula that makes sense then follow it.
statsmodels has robust linear estimators, RLM, with various weight functions that should work well in cases like this.
http://www.statsmodels.org/dev/generated/statsmodels.robust.robust_linear_model.RLM.html
http://www.statsmodels.org/dev/examples/index.html#robust
These are M-estimators that are robust to "y outliers", but not to "x outliers" that are influential outlying regressors.
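As a minimal, hedged example of what that might look like (the data here is synthetic, and HuberT is just one of the available weight norms):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the data; a few points are pushed up to mimic the "y outliers".
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.0, size=x.size)
y[::10] += 15

X = sm.add_constant(x)                      # design matrix [1, x] for a straight-line fit
res = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(res.params)                           # robust intercept and slope
```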
I have this array of data and I need to calculate the area under the curve, so I used the Numpy and Scipy libraries, which contain the functions trapz (Numpy) and integrate.simps (Scipy) for numerical integration, and both gave me a really nice result.
The problem now is that I need the error for each one, or at least the error for the trapezoidal rule. The thing is that the error formula requires the function itself, which obviously I don't have. I have been researching ways to obtain the error but always return to the same point...
Here are the pages for scipy.integrate http://docs.scipy.org/doc/scipy/reference/integrate.html and trapz in Numpy http://docs.scipy.org/doc/numpy/reference/generated/numpy.trapz.html. I have looked at a lot of code for numerical integration, but I prefer to use the existing functions...
Any ideas please?
While cel is right that you cannot determine an integration error if you don't know the function, there is something you can do.
You can use curve fitting to fit a function through the available data points. You can then use that function for error estimation.
If you expect the data to fit a certain kind of function, like a sine, log, or exponential, it is good to use that as a basis for curve fitting.
For instance, if you are measuring the drag on a moving car, it is known that this is mostly proportional to the velocity squared because of air resistance.
However, if you do not have any knowledge about the applicable function, then assuming you have N data points, there is a polynomial of degree N-1 that fits exactly through all those data points. Determining such a polynomial from the data amounts to solving a system of linear equations. See e.g. polynomial interpolation. You could use this polynomial as an estimate for the unknown real function. Note however that outside the range of data points this polynomial might be wildly inaccurate.
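One hedged way to turn this into a number (the polynomial degree and the toy data below are arbitrary choices): fit a curve to the points, integrate it on a much finer grid, and treat the difference from the plain trapezoidal result as a rough error estimate.

```python
import numpy as np

# Synthetic stand-in for the sampled data points.
x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.02 * np.random.randn(x.size)

area_trapz = np.trapz(y, x)                      # the value you already have

# Fit a modest-degree polynomial as a stand-in for the unknown function,
# integrate it on a much finer grid, and use the difference as a rough error estimate.
coeffs = np.polyfit(x, y, 5)
x_fine = np.linspace(x[0], x[-1], 2000)
area_fine = np.trapz(np.polyval(coeffs, x_fine), x_fine)

error_estimate = abs(area_fine - area_trapz)
print(area_trapz, error_estimate)
```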
I am looking for a "method" to get a formula, a formula which comes from fitting a set of data (3000 points). I was using Legendre polynomials, but for more than 20 points they do not give exact values. I can write a chi² test, but the algorithm needs a lot of time to calculate N parameters, and at the beginning I don't know what the function looks like, so it takes time. I was thinking about splines... Maybe...
So the input is: 3000 points
Output : f(x) = ... something
I want to have a formula from the fit. What is the best way to do this in Python?
Let the force be with us!
Nykon
How about a polynomial fit:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
or some other interpolation scheme:
http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
It is difficult to recommend a suitable method without knowing more about the dataset and something about how good of a fit is required.
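As a small illustrative sketch of both routes (synthetic data, arbitrary degree and smoothing factor, all names are placeholders):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for the 3000 measured points (x must be increasing for the spline).
x = np.linspace(0, 10, 3000)
y = np.exp(-0.3 * x) * np.sin(2 * x) + 0.01 * np.random.randn(x.size)

# Option 1: a low-degree polynomial fit gives a compact, explicit formula,
# though whether it fits well depends entirely on the data.
formula = np.poly1d(np.polyfit(x, y, 7))
print(formula)                                  # printable f(x) = c7*x^7 + ... + c0

# Option 2: a smoothing spline; s trades smoothness against fidelity,
# but the result is piecewise rather than a single neat formula.
spline = UnivariateSpline(x, y, s=len(x) * 0.01**2)
```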
Except, a spline does not give you a "formula", at least not unless you have the wherewithal to deal with all of the piecewise segments. Even then, it will not be easily written down, or give you anything that is at all pretty to look at.
A simple spline gives you an interpolant. Worse, for 3000 points, an interpolating spline will give you roughly that many cubic segments! You did say interpolation before. Of course, an interpolating polynomial of that high an order will be complete crapola anyway, so don't think you can just go back there.
If all that you need is a tool that can provide an exact interpolation at any point, and you really don't need to have an explicit formula, then an interpolating spline is a good choice.
Or do you really want an approximant? A function that will APPROXIMATELY fit your data, smoothing out any noise? The fact is, a lot of the time when people who have no idea what they are doing say "interpolation" they really do mean approximation, smoothing. This is possible of course, but there are entire books written on the subject of curve fitting, the modeling of empirical data. Your first goal is then to choose an intelligent model that will represent this data. Best of course is if you have some intelligent choice of model from physical understanding of the relationship under study; then you can estimate the parameters of that model using a nonlinear regression scheme, of which there are many to be found.
If you have no model, and are unwilling to choose one that roughly has the proper shape, then you are left with generic models in the form of splines, which can be fit in a regression sense, or with high order polynomial models, for which I have little respect.
My point in all of this is YOU need to make some choices and do some research on a choice of model.
The only exact formula would be an interpolating polynomial of degree 2999 (one less than the number of points).
How good does the fit need to be? What type of formula do you expect?
You could sample your observed points (randomly is best) and fit a cubic spline to this sample (if you repeat this procedure, you can create a distribution of splines). Fitting a spline to 3,000 points is a bit much, but generating a distribution of splines based on samples could give you an idea of what the function will look like. As Josh mentioned above, http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html is a good place to start your search.
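A minimal sketch of that resampling idea, assuming synthetic data and an arbitrary smoothing factor (the names below are placeholders):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for the ~3000 observed points; x is kept sorted for the spline.
x = np.linspace(0, 10, 3000)
y = np.log1p(x) + 0.05 * np.random.randn(x.size)

rng = np.random.default_rng(0)
grid = np.linspace(x.min(), x.max(), 200)
curves = []
for _ in range(50):
    idx = np.sort(rng.choice(x.size, size=300, replace=False))    # random subsample, kept ordered
    spl = UnivariateSpline(x[idx], y[idx], s=len(idx) * 0.05**2)  # arbitrary smoothing factor
    curves.append(spl(grid))

# The spread across the resampled splines hints at how stable the fitted shape is.
mean_curve = np.mean(curves, axis=0)
spread = np.std(curves, axis=0)
```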