large set of data, interpolation

large set of data, interpolation - python

I am looking for a "method" to get a formula, formula which comes from fitting a set of data (3000 point). I was using Legendre polynomial, but for > 20 points it gives not exact values. I can write chi2 test, but algorithm needs a loot of time to calculate N parameters, and at the beginning I don't know how the function looks like, so it takes time. I was thinking about splines... Maybe ...
So the input is: 3000 pints
Output : f(x) = ... something
I want to have a formula from fit. What is a best way to do this in python?
Let the force would be with us!
Nykon

How about a polynomial fit:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
or some other interpolation scheme:
http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
It is difficult to recommend a suitable method without knowing more about the dataset and something about how good of a fit is required.

Except, a spline does not give you a "formula", at least not unless you have the wherewithal to deal with all of the piecewise segments. Even then, it will not be easily written down, or give you anything that is at all pretty to look at.
A simple spline gives you an interpolant. Worse, for 3000 points, an interpolating spline will give you roughly that many cubic segments! You did say interpolation before. OF course, an interpolating polynomial of that high an order will be complete crapola anyway, so don't think you can just go back there.
If all that you need is a tool that can provide an exact interpolation at any point, and you really don't need to have an explicit formula, then an interpolating spline is a good choice.
Or do you really want an approximant? A function that will APPROXIMATELY fit your data, smoothing out any noise? The fact is, a lot of the time when people who have no idea what they are doing say "interpolation" they really do mean approximation, smoothing. This is possible of course, but there are entire books written on the subject of curve fitting, the modeling of empirical data. You first goal is then to choose an intelligent model, that will represent this data. Best of course is if you have some intelligent choice of model from physical understanding of the relationship under study, then you can estimate the parameters of that model using a nonlinear regression scheme, of which there are many to be found.
If you have no model, and are unwilling to choose one that roughly has the proper shape, then you are left with generic models in the form of splines, which can be fit in a regression sense, or with high order polynomial models, for which I have little respect.
My point in all of this is YOU need to make some choices and do some research on a choice of model.

The only formula would be a polynomial of order 3000.
How good does the fit need to be? What type of formula do you expect?

You could sample your observed points (randomly is best) and fit a cubic spline to this sample (if you repeat this procedure, you can create a distribution of splines). Fitting a spline to 3,000 points is a bit much, but generating a distribution of spline based on a sample could give you an idea of what the function will look like. As Josh mentioned above, http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html is a good place to start your search.

Related

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data, I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes or looks at where the error elbows? When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be but eventually it plateaus like the image above. Therefore if I want to automatically compute the degree of polynomial where the error curve elbows: if my error is E and d is my degree, I want to maximize (E[d+1]-E[d]) - (E[d+1] - E[d]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries lik Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!

To select the "right" fit and prevent over-fitting, you can use the Akiake Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.

Curve Fitting in Python for extrapolation, Regression analysis

This question is regarding curve fitting in python.
First, I would say that I do not know the curve fit function to insert into "curve_fit" function in the scipy library; therefore, I am trying to use a polyfit which is OK if I am interested in interpolation but my goal is to predict values at future points, in other words extrapolation.
I have attached a screenshot of a raw signal, smoothed and its polyfit result. It has the correct poly order but still fails at extrapolation. My conclusion is that poly fit is not the right approach here, but I can not estimate the curve function. What are you thoughts?
Please note that this is not a distribution since the y values may keep slowly decreasing infinitely, even below 0.
I'd say the function looks like an exponential Gaussian but again it's not a distribution so dont want to do that.
My last thought was to split the plot into two, the first model can certainly be modeled as a polynomial and the second as an exponential. (values are different than first png cuz it's of a different signal).
Then, maybe combine the two. What do you think about this?
Attached is a screenshot of this too.

Since many curves can fit the data and extrapolate differently, you need to choose the right basis functions to get the behaviour you want.
So far you have tried polynomials for instance, these however tend to +- infinite, which is perhaps not what you want.
I would try and use curve_fit on a sum of Hermite polynomials or Laguerre polynomials. For instance, for Laguerre polynomials, you could try
a + b*exp(-k x) + c*(1-x)*exp(-k x) + d*(x^2 - 4*x + 2)*exp(-k x) + ...
Python has a lot of convenience functions built in for this, see e.g. https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.polynomials.laguerre.html
Note however that you should also fit k to your data, which you could use curve_fit for.

Python - trying to perform a more robust linear fit

I have this data that I fit a linear function to and the fit determines other work (never mind, not important). I'm using numpy.polyfit, and when I simply include the data and the degree of the fit, nothing else, it produces this plot:
Now, the fit is okay, but the general consensus is the line of best fit is being skewed by those red data points above it and I should actually be fitting to the data just below it which forms a nice linear shape (beginning around that congested blob of blue points). So I attempted to add a weighting to my call to polyfit, and I chose an arbitrary weighting of 1/sqrt(y-values), so basically the smaller y-values will be weighted towards more favourably. This gave the following:
Which admittedly is better but I'm still unsatisfied, as now it appears the line is too low. I would ideally like a middle-ground, but since I chose really an arbitrary weighting, I was wondering if in general there is a way to perform a more robust fit using Python, or even if this can be done using polyfit? Using a separate package if it works will be fine too.

This question doesn't really have much to do with programming or python and more to do with statistics or linear algebra.
You could try seeing the error difference between a best fit line or best fit quadratic see which has less error. But a lot of it is context related.
If you have 500 data points, then you could find a 500th order polynomial to model your dataset with zero error. But if you weight your data points then it needs to make sense for the data.
If you want your best fit line to "look right" then just cut the foreplay and draw it where you want it. If you want it to make sense then ask a mathematician for a formula that makes sense then follow it.

statsmodels has robust linear estimators, RLM, with various weight functions that should work well in cases like this.
http://www.statsmodels.org/dev/generated/statsmodels.robust.robust_linear_model.RLM.html
http://www.statsmodels.org/dev/examples/index.html#robust
These are M-estimators that are robust to "y outliers", but not to "x outliers" that are influential outlying regressors.

Curve_fit not converging means...?

I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation between the best match with the catalogue for each object in my list. My initial list is supossed to be the position of a known object, but it could happend that it is not detected in the catalog, and my coordinates may suffer from small offsets.
They way I am computing the maximum radius is by fitting the gaussian kernel density of the separation with a gaussian, and use the center + 3sigmas value. The method works nicely for most of the cases, but when a small subsample of my list has an offset, I have two gaussians instead. In these cases, I will specify the max radius in a different way.
My problem is that when this happens, curve_fit can't normally do the fit with one gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and in which cases the "different way" is used. Could someone give me a hand on what this means in mathematical terms?

There are varying lengths to which you can go justifying this or that fitting ansatz --- which strongly depends on the details of your specific case (eg: why do you expect a gaussian to work in a first place? to what depth you need/want to delve into why exactly a certain fitting procedure fails and what exactly is a fail etc).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you look for is way of justifying why in a certain case a gaussian is not a good fitting ansatz, one way would be to calculate the moments: for a gaussian distribution 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relation between moments is very different, it sounds reasonable that these data can't be fit by a gaussian.

Curve Fitting with Known Integrals Python

I have some data that are the integrals of an unknown curve within bins. For your interest, the data is ocean wave energy and the bins are for directions, e.g. 0-15 degrees. If possible, I would like to fit a curve on to the data that conserves the integrals within the bins. I've tried sketching it on a notepad with a pencil and it seems like it could be possible. Does anyone know of any curve-fitting tool in Python to do this, for example in the scipy interpolation sub-package?
Thanks in advance
Edit:
Thanks for the help. If I do it, it looks like I will try the method that is recommended in section 4 of this paper: http://journals.ametsoc.org/doi/abs/10.1175/1520-0485%281996%29026%3C0136%3ATIOFFI%3E2.0.CO%3B2. In theory, it basically uses matrices to make some 'fake' data from the known integrals between each band. When plotted, this data then produces an interpolated line graph that preserves the integrals.

It's a little outside my bailiwick, but I can suggest having a look at SciKits to see if there's anything there that might be useful. Other packages to browse would be pandas and StatsModels. Good luck!

If you have a curve f(x) which is an approximation to the integral of another curve g(x), i.e. f=int(g,x) then the two are related by the Fundamental theorem of calculus, that is, your original function is the derivative of the first curve g = df/dx. As such you can use numpy.diff or any of the higher order methods to approximate df/dx to obtain an estimate of your original curve.

One possibility: calculate the cumulative sum of the bin volumes (np.cumsum), fit an interpolating spline to it, and then take the derivative to get the curve.
scipy splines have methods to calculate the derivatives.
The only limitation, in case it is relevant in your case, the spline through the cumulative sum might not be monotonic, and the derivative might be negative over some intervals.
I guess that the literature on smoothing a histogram looks at similar constraints on the volume of the integral/bin, but I don't have any references ready.

1/ fit2histogram
Your question is about fitting an histogram. I just came through documentation for some Python package for Multi-Variate Pattern Analysis, PyMVPA, and some function for histogram fitting is proposed. An example is here: PyMVPA.
However, I guess that set of available distributions is limited to famous distributions.
2/ integral computation
As already mentionned, next solution is to approximate integral value, and to fit a model to the resulting set of data. Either you know explicit expression for the derivative, or you use computational derivation: finite difference, analytical method.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.