Curve Fitting in Python for extrapolation, Regression analysis

This question is regarding curve fitting in python.
First, I should say that I do not know which model function to pass to the curve_fit function in the scipy library, so I am using polyfit instead. That is fine for interpolation, but my goal is to predict values at future points, in other words, extrapolation.
I have attached a screenshot of a raw signal, its smoothed version, and the polyfit result. The polynomial has the correct order but still fails at extrapolation. My conclusion is that polyfit is not the right approach here, but I cannot work out what the curve function should be. What are your thoughts?
Please note that this is not a distribution since the y values may keep slowly decreasing infinitely, even below 0.
I'd say the function looks like an exponential Gaussian, but again it's not a distribution, so I don't want to do that.
My last thought was to split the plot into two parts: the first can certainly be modeled as a polynomial and the second as an exponential. (The values differ from the first PNG because this is a different signal.)
Then, maybe combine the two. What do you think about this?
Attached is a screenshot of this too.

Since many curves can fit the data and extrapolate differently, you need to choose the right basis functions to get the behaviour you want.
So far you have tried polynomials, for instance. These, however, tend to ±infinity outside the data range, which is perhaps not what you want.
I would try and use curve_fit on a sum of Hermite polynomials or Laguerre polynomials. For instance, for Laguerre polynomials, you could try
a + b*exp(-k x) + c*(1-x)*exp(-k x) + d*(x^2 - 4*x + 2)*exp(-k x) + ...
NumPy has a lot of convenience functions for this, see e.g. https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.polynomials.laguerre.html
Note however that you should also fit k to your data, which you could use curve_fit for.
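As a rough sketch of the suggestion above (not a definitive implementation): the model below is a constant plus the first three Laguerre polynomials damped by exp(-k x), with k fitted alongside the coefficients. The function name damped_laguerre, the synthetic data, and the starting guess p0 are all assumptions made for illustration. Note that numpy's lagval uses the normalized L2(x) = (x^2 - 4x + 2)/2, so its coefficient d is twice the d in the formula above.

```python
import numpy as np
from numpy.polynomial.laguerre import lagval
from scipy.optimize import curve_fit


def damped_laguerre(x, k, a, b, c, d):
    """a + [b*L0(x) + c*L1(x) + d*L2(x)] * exp(-k*x), L_n = Laguerre polynomials."""
    return a + lagval(x, [b, c, d]) * np.exp(-k * x)


# Synthetic stand-in for the smoothed signal (hypothetical data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = damped_laguerre(x, 0.8, -0.5, 1.0, 2.0, -1.0)
y = y + rng.normal(scale=0.01, size=x.size)

# Starting guess (an assumption); k is constrained to be non-negative so
# that the damped terms cannot grow when extrapolating.
p0 = [1.0, 0.0, 1.0, 1.0, 1.0]
lower = [0.0, -np.inf, -np.inf, -np.inf, -np.inf]
popt, pcov = curve_fit(damped_laguerre, x, y, p0=p0, bounds=(lower, np.inf))

# Extrapolate beyond the fitted range.
x_future = np.linspace(10.0, 15.0, 50)
y_future = damped_laguerre(x_future, *popt)

rms = np.sqrt(np.mean((damped_laguerre(x, *popt) - y) ** 2))
print("fitted k:", popt[0], "rms residual:", rms)
```

Because every non-constant term is multiplied by exp(-k x), the extrapolated curve settles toward the constant a instead of running off to ±infinity the way a plain polynomial does.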

Related

Issues with Modified Gaussian Fit

I have a curve-fitting problem. I am measuring a system's response over a range of input values (both input and output values are scalar and real).
I change a particular parameter for the system between trials that results in different outputs for the same inputs.
This behavior is illustrated below:
I need to fit a model that essentially takes that parameter and the x values and produces the observed y values as closely as possible.
I am trying to fit a Gaussian function as follows:
I am modeling $f_\alpha(c)$ as $a0 + a1*c$
I am modeling $f_\beta(c)$ as $b0 + b1*c$
Essentially, this makes the peak and mean location shift as a function of the parameter c.
The problem is I am having convergence issues related to $f_\beta(c)$. If I just set the median to a constant I am able to estimate a0 and a1, but obviously the fit is poor.
I am using scipy.optimize.curve_fit
So my question is basically: is there a better way to tackle this problem? For example, a function that can model this better, or better forms for $f_\alpha$ and $f_\beta$?
Sample data here: https://drive.google.com/file/d/1QPjrpxaDnnj3pmqzjZgjJJLKWnQjLikc
Sample code here: https://drive.google.com/open?id=135P9euXYAoa9CR3hrLOg6ja_6XfGndDN
(Dataset is slightly different than what I have shown in the figure in the posting)
Two questions:
1. Is there a methodical way to make the guesses such that I am guaranteed to converge?
2. Are there better functions that I could use to approximate my observations?
Thanks for any help in advance

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data, I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes or looks at where the error elbows? When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use NumPy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of the fit, the lower the error will be, but eventually it plateaus like in the image above. Therefore I want to automatically compute the degree of polynomial where the error curve elbows: if my error is E and d is my degree, I want to maximize (E[d-1] - E[d]) - (E[d] - E[d+1]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like NumPy or SciPy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
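For illustration, the elbow idea could be sketched like this, reading the elbow as the degree d with the largest discrete second difference of the error, (E[d-1] - E[d]) - (E[d] - E[d+1]). The helper name elbow_degree and the synthetic cubic data are assumptions, and this is only one of several possible ways to define an elbow.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def elbow_degree(x, y, max_degree):
    """Return (degree at the elbow, residual sum of squares per degree)."""
    errors = []
    for degree in range(max_degree + 1):
        coeffs = P.polyfit(x, y, degree)
        errors.append(np.sum((y - P.polyval(x, coeffs)) ** 2))
    errors = np.array(errors)
    # Second difference of the error curve for d = 1 .. max_degree - 1:
    # a big drop going into d followed by a small drop going out of d.
    bend = (errors[:-2] - errors[1:-1]) - (errors[1:-1] - errors[2:])
    return int(np.argmax(bend)) + 1, errors


# Hypothetical noisy cubic data.
rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 100)
y = 5.0 * x**3 - 3.0 * x + rng.normal(scale=0.1, size=x.size)

best_degree, errors = elbow_degree(x, y, max_degree=8)
print("elbow at degree", best_degree)
```

With real data the error curve is rarely this clean, so the second difference may need smoothing or a threshold before the maximum is meaningful.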

To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
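A minimal sketch of how that comparison might look for polynomial fits, assuming independent Gaussian errors so that, up to a constant, AIC = n*ln(RSS/n) + 2k and BIC = n*ln(RSS/n) + k*ln(n), with k the number of fitted coefficients. The function name information_criteria and the test data are made up for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def information_criteria(x, y, max_degree):
    """Return a list of (degree, AIC, BIC) for least-squares polynomial fits."""
    n = len(x)
    rows = []
    for degree in range(max_degree + 1):
        coeffs = P.polyfit(x, y, degree)
        rss = float(np.sum((y - P.polyval(x, coeffs)) ** 2))
        k = degree + 1  # number of fitted coefficients
        aic = n * np.log(rss / n) + 2 * k
        bic = n * np.log(rss / n) + k * np.log(n)
        rows.append((degree, aic, bic))
    return rows


# Hypothetical data: a cubic plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=x.size)

table = information_criteria(x, y, max_degree=8)
best_aic = min(table, key=lambda row: row[1])[0]
best_bic = min(table, key=lambda row: row[2])[0]
print("degree chosen by AIC:", best_aic, "by BIC:", best_bic)
```

The lowest value wins in both cases. BIC penalizes extra coefficients more heavily than AIC once n is larger than about 8, so it tends to pick the lower degree when the two disagree.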

How to calculate the error for the trapezoidal rule if I only have data? (Python)

I have an array of data and I need to calculate the area under the curve, so I used trapz from NumPy and integrate.simps from SciPy for numerical integration, which gave me a really nice result in both cases.
The problem now is that I need the error for each one, or at least the error for the trapezoidal rule. The formula for that asks for a function, which obviously I don't have. I have been researching a way to obtain the error but always return to the same point...
Here are the pages for scipy.integrate http://docs.scipy.org/doc/scipy/reference/integrate.html and trapz in NumPy http://docs.scipy.org/doc/numpy/reference/generated/numpy.trapz.html I have looked at a lot of code for numerical integration and would prefer to use the existing functions...
Any ideas please?

While cel is right that you cannot determine an integration error if you don't know the function, there is something you can do.
You can use curve fitting to fit a function through the available data points. You can then use that function for error estimation.
If you expect the data to fit a certain kind of function like a sine, log or exponential it is good to use that as a basis for curve fitting.
For instance, if you are measuring the drag on a moving car, it is known that this is mostly proportional to the velocity squared because of air resistance.
However, if you do not have any knowledge about the applicable function, then assuming you have N data points, there is a polynomial of degree N-1 that passes exactly through all those data points. Determining such a polynomial from the data amounts to solving a system of linear equations. See e.g. polynomial interpolation. You could use this polynomial as an estimate for the unknown real function. Note however that outside the range of data points this polynomial might be wildly inaccurate.
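Here is a small sketch of that idea, with one swap: it uses a cubic spline as the fitted function rather than a single N-1 degree polynomial, since a spline is usually better behaved. The sine samples stand in for measured data, and the whole thing assumes the data are smooth and essentially noise-free. Two estimates are shown: the difference between the spline's integral and the trapezoid result, and the textbook per-interval error term -h^3/12 * f''(x) with f'' taken from the spline.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import CubicSpline

# Stand-in for measured data (hypothetical): samples of sin(x) on [0, pi].
x = np.linspace(0.0, np.pi, 21)
y = np.sin(x)

# Trapezoidal rule on the raw samples (same rule as numpy's trapz).
area_trapz = trapezoid(y, x)

# Fit a smooth function through the points and integrate it exactly.
spline = CubicSpline(x, y)
area_spline = spline.integrate(x[0], x[-1])

# Estimate 1: difference between the two integrals.
error_by_difference = area_spline - area_trapz

# Estimate 2: sum of the per-interval trapezoid error terms, with the
# second derivative of the spline evaluated at each interval midpoint.
h = np.diff(x)
midpoints = 0.5 * (x[:-1] + x[1:])
error_by_formula = np.sum(-h**3 / 12.0 * spline(midpoints, nu=2))

print(area_trapz, error_by_difference, error_by_formula)
```

Both numbers estimate (true integral) minus (trapezoid result). If the data are noisy, the second derivative of an interpolating spline becomes unreliable and a smoothing fit would be a safer basis.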

Curve Fitting with Known Integrals Python

I have some data that are the integrals of an unknown curve within bins. For your interest, the data is ocean wave energy and the bins are for directions, e.g. 0-15 degrees. If possible, I would like to fit a curve on to the data that conserves the integrals within the bins. I've tried sketching it on a notepad with a pencil and it seems like it could be possible. Does anyone know of any curve-fitting tool in Python to do this, for example in the scipy interpolation sub-package?
Thanks in advance
Edit:
Thanks for the help. If I do it, it looks like I will try the method that is recommended in section 4 of this paper: http://journals.ametsoc.org/doi/abs/10.1175/1520-0485%281996%29026%3C0136%3ATIOFFI%3E2.0.CO%3B2. In theory, it basically uses matrices to make some 'fake' data from the known integrals between each band. When plotted, this data then produces an interpolated line graph that preserves the integrals.

It's a little outside my bailiwick, but I can suggest having a look at SciKits to see if there's anything there that might be useful. Other packages to browse would be pandas and StatsModels. Good luck!

If you have a curve f(x) which is an approximation to the integral of another curve g(x), i.e. f=int(g,x) then the two are related by the Fundamental theorem of calculus, that is, your original function is the derivative of the first curve g = df/dx. As such you can use numpy.diff or any of the higher order methods to approximate df/dx to obtain an estimate of your original curve.
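A tiny sketch of what this looks like with binned integrals, using made-up bin edges and values: the cumulative sum plays the role of f, and a first-order finite difference with numpy.diff gives an estimate of g.

```python
import numpy as np

# Hypothetical bins: edges in degrees and the integral of g over each bin.
edges = np.arange(0.0, 91.0, 15.0)  # 0, 15, ..., 90
bin_integrals = np.array([1.0, 3.0, 4.5, 3.5, 1.5, 0.5])

# f evaluated at the bin edges: running integral of g, starting from 0.
cumulative = np.concatenate(([0.0], np.cumsum(bin_integrals)))

# First-order finite difference df/dx, attributed to the bin centers.
centers = 0.5 * (edges[:-1] + edges[1:])
g_estimate = np.diff(cumulative) / np.diff(edges)

print(np.column_stack((centers, g_estimate)))
```

At first order this simply returns the mean value of g in each bin, which preserves the bin integrals by construction; a smoother curve needs a higher-order method.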

One possibility: calculate the cumulative sum of the bin volumes (np.cumsum), fit an interpolating spline to it, and then take the derivative to get the curve.
scipy splines have methods to calculate the derivatives.
The only limitation, in case it is relevant for you: the spline through the cumulative sum might not be monotonic, so the derivative might be negative over some intervals.
I guess that the literature on smoothing a histogram looks at similar constraints on the volume of the integral/bin, but I don't have any references ready.
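The recipe above could be sketched as follows, with hypothetical bin edges and integrals. CubicSpline is one possible choice of interpolating spline. The PchipInterpolator variant is my own addition, not part of the suggestion above: it is a shape-preserving interpolant, so for increasing cumulative data its derivative should not go negative.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# Hypothetical bins: edges in degrees and the known integral in each bin.
edges = np.arange(0.0, 91.0, 15.0)
bin_integrals = np.array([1.0, 3.0, 4.5, 3.5, 1.5, 0.5])

# Cumulative integral at the bin edges.
cumulative = np.concatenate(([0.0], np.cumsum(bin_integrals)))

# Interpolating spline through the cumulative sum, then its derivative.
curve = CubicSpline(edges, cumulative).derivative()

# Check: integrating the curve over each bin gives back the bin integrals.
recovered = np.array(
    [curve.integrate(a, b) for a, b in zip(edges[:-1], edges[1:])]
)

# Monotone alternative (assumption: negative values are not acceptable).
monotone_curve = PchipInterpolator(edges, cumulative).derivative()

grid = np.linspace(edges[0], edges[-1], 181)
print(recovered)
print(curve(grid).min(), monotone_curve(grid).min())
```

Either curve conserves the integral within each bin, because the underlying interpolant passes exactly through the cumulative sums at the bin edges.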

1/ fit2histogram
Your question is about fitting a histogram. I just came across the documentation for a Python package for Multi-Variate Pattern Analysis, PyMVPA, which proposes a function for histogram fitting. An example is here: PyMVPA.
However, I guess that the set of available distributions is limited to well-known distributions.
2/ integral computation
As already mentioned, the next solution is to approximate the integral value and to fit a model to the resulting set of data. Then either you know an explicit expression for the derivative, or you compute the derivative: finite differences or an analytical method.

large set of data, interpolation

I am looking for a "method" to get a formula that comes from fitting a set of data (3000 points). I was using Legendre polynomials, but for more than 20 points they do not give exact values. I can write a chi2 test, but the algorithm needs a lot of time to calculate N parameters, and at the beginning I don't know what the function looks like, so it takes time. I was thinking about splines... Maybe...
So the input is: 3000 points
Output : f(x) = ... something
I want to have a formula from the fit. What is the best way to do this in Python?
May the force be with us!
Nykon

How about a polynomial fit:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
or some other interpolation scheme:
http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
It is difficult to recommend a suitable method without knowing more about the dataset and something about how good of a fit is required.

Except, a spline does not give you a "formula", at least not unless you have the wherewithal to deal with all of the piecewise segments. Even then, it will not be easily written down, or give you anything that is at all pretty to look at.
A simple spline gives you an interpolant. Worse, for 3000 points, an interpolating spline will give you roughly that many cubic segments! You did say interpolation before. Of course, an interpolating polynomial of that high an order will be complete crapola anyway, so don't think you can just go back there.
If all that you need is a tool that can provide an exact interpolation at any point, and you really don't need to have an explicit formula, then an interpolating spline is a good choice.
Or do you really want an approximant? A function that will APPROXIMATELY fit your data, smoothing out any noise? The fact is, a lot of the time when people who have no idea what they are doing say "interpolation" they really do mean approximation, smoothing. This is possible of course, but there are entire books written on the subject of curve fitting, the modeling of empirical data. Your first goal is then to choose an intelligent model that will represent this data. Best of course is if you have some intelligent choice of model from physical understanding of the relationship under study; then you can estimate the parameters of that model using a nonlinear regression scheme, of which there are many to be found.
If you have no model, and are unwilling to choose one that roughly has the proper shape, then you are left with generic models in the form of splines, which can be fit in a regression sense, or with high order polynomial models, for which I have little respect.
My point in all of this is YOU need to make some choices and do some research on a choice of model.
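To make the "spline fit in a regression sense" option concrete, here is a hedged sketch using scipy's UnivariateSpline on 3000 synthetic points. The data, the noise level sigma, and the choice of smoothing factor s = n * sigma**2 are assumptions for the example; with real data the noise level has to be estimated or s tuned by hand.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical noisy data set of 3000 points.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 3000)
sigma = 0.2
truth = np.sin(x) + 0.1 * x
y = truth + rng.normal(scale=sigma, size=x.size)

# Smoothing (regression) spline: knots are added only until the residual
# sum of squares drops to about s, so it does not chase every point.
spline = UnivariateSpline(x, y, k=3, s=x.size * sigma**2)

knots = spline.get_knots()
rms_vs_truth = np.sqrt(np.mean((spline(x) - truth) ** 2))
print("number of knots:", len(knots), "rms error vs truth:", rms_vs_truth)
```

The result is still a piecewise cubic rather than a tidy formula, but it typically has far fewer segments than an interpolating spline through all 3000 points.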

The only formula would be a polynomial of order 3000.
How good does the fit need to be? What type of formula do you expect?

You could sample your observed points (randomly is best) and fit a cubic spline to this sample (if you repeat this procedure, you can create a distribution of splines). Fitting a spline to 3,000 points is a bit much, but generating a distribution of splines based on samples could give you an idea of what the function will look like. As Josh mentioned above, http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html is a good place to start your search.
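A rough sketch of that resampling idea, with made-up data; the sample size of 30, the 50 repetitions, and the choice to always keep the two end points (so no spline has to extrapolate) are all assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical data set of 3000 noisy points.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 3000)
y = np.sin(x) + rng.normal(scale=0.05, size=x.size)

grid = np.linspace(x[0], x[-1], 200)
curves = []
for _ in range(50):
    # Random interior sample plus both end points, sorted so that the
    # x values are strictly increasing as CubicSpline requires.
    interior = rng.choice(np.arange(1, x.size - 1), size=30, replace=False)
    idx = np.sort(np.concatenate(([0], interior, [x.size - 1])))
    curves.append(CubicSpline(x[idx], y[idx])(grid))
curves = np.array(curves)

# Summarize the distribution of splines at each grid point.
median_curve = np.median(curves, axis=0)
lower, upper = np.percentile(curves, [5, 95], axis=0)
print(curves.shape, float(np.max(upper - lower)))
```

Individual splines can overshoot where two sampled points happen to lie close together, which is why the median and a percentile band are more informative than any single curve.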
