I have a curve that shows a multi-step feature, in each step, I guess it can be fitted by a linear function or a higher-order polynomial. The figure is shown here. I think it is like one flat line connecting with a steap line and continues. I have found some posts to fit the step-like function. But unfortunately, all I can find is for functions only with one step. I have tried to extend them to my case, but I failed. Could you please tell me how can I fit my curves like this kind?
And also my curves will turn to show fewer features of clear steps, shown like this, If the method could work them both, that would be much greater.
Any help would be much appreciated!
First of all, you need to know how many discontinuities you have and where they are. You can get them by smoothing the data with a median filter (preserves discontinuities) and check for the peaks in the differences.
Once you got the discontinuities, you can either fit a multi-step function, or just loop through the segments and fit individually.
Here a list of helpful functions:
fitting: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
median filter: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.medfilt.html
find locations of maxima: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.argrelmax.html
Edit: you could use a one-sided maximum filter instead of the median filter to make the function monotonic and then search for maxima in the differences. You could also do this with a threshold.
Also helpful for those analysis is the pandas library using pandas.Series.rolling: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html#pandas.Series.rolling
Edit: note, that the problem with multi step function fitting is the degeneracy of the solution.
Each step introduces two additional parameters (location and step hight), but any permutation would lead to the same solution. Another special case happens, when two step fall into the same gap, then the steps can arbitrarily share the gap. These cases give the optimizer a hard time. You could add a constraint to the steps to sort them (xstep1 < xstep2 < ...), but this adds unnecessary complexity in my opinion. Therefore better find the gaps and fit the segments individually.
Related
I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works out fine enough, but since this function determines the local maxima of the data, it is not neglecting the noise in the data which causes overshoot. Therefore what I'm determining isn't actually the location of the most likely maxima, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure that you can consider those points to be outliers so easily, as they look to be close to the place I would expect them to be. But if you don't think they are a valid approximation let me tell you three other ways you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around the peaks. You can for instance, suppose that the shape of the plot is the sum of some background model (maybe constant or maybe more complicated) plus some Gaussian peaks (or Lorentzian).
This is what we usually do in physics. Of course it will be more accurate taking knowledge from the underlying processes, which I don't have.
Having a good model, this way is definitely better as taking the maximum values, as even if they are not outliers, they still have errors which you want to reduce.
Second option
But if you want a easier way, just a rough estimation and you already found the location of the three peaks programmatically, you can make the average of a few points around the maximum. If you do it so, the function np.where or np.argwhere tend to be useful for this kind of things.
Third option
The easiest option is taking the value by hand. It could sound unacceptable for academic proposes and it probably is. Even worst, it is not a programmatic way, and this is SO. But at the end, it depends on why and for what you need those values and on the confidence interval you need for your measurement.
With python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, contains constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent software. Basically, the model data depends on four parameters, all of which are limited to a certain range, which I am currently feeding mannualy to the software (planned is automatic).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
Is it possible to do clustering without providing any input apart from the data? The clustering method/algorithm should decide from the data on how many logical groups the data can be divided, even it doesn't require me to input the threshold eucledian distance on which the clusters are built, this also needs to be learned from the data.
Could you please suggest me what is closest solution for my problem?
Why not code your algorithm to create a list of clusters ranging from size 1 to n (which could be defined in a config file so that you can avoid hard coding and just fix it once).
Once that is done, compute the clusters of size 1 to n. Choose the value which gives you the smallest Mean Square Error.
This would require some additional work by your machine to determine the optimal number of logical groups the data can be divided (bounded between 1 and n).
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, such as standardizing the input prior to clusterings such as the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data, and try other parameters to learn more about your data.
Methods that claim to be "parameter free" usually just have some hidden parameters set so it works on the few toy example it was demonstrated on.
I use a python package called emcee to fit a function to some data points. The fit looks great, but when I want to plot the value of each parameter at each step I get this:
In their example (with a different function and data points) they get this:
Why is my function converging so fast, and why does it have that weird shape in the beginning. I apply MCMC using likelihood and posterior probability. And even if the fit looks very good, the error on parameters of function are very small (10^10 smaller than the actual value) and I think it is because of the random walks. Any idea how to fix it? Here is their code for fitting: http://dan.iel.fm/emcee/current/user/line/ I used the same code with the obvious modifications for my data points and fitting function.
I would not say that your function is converging faster than the emcee line-fitting example you're linked to. In the example, the walkers start exploring the most likely values in the parameter space almost immediately, whereas in your case it takes more than 200 iterations to reach the high probability region.
The jump in your trace plots looks like a burn-in. It is a common feature of MCMC sampling algorithms where your walkers are given starting points away from the bulk of the posterior and then must find their way to it. It looks like in your case the likelihood function is fairly smooth so you only need a hundred or so iterations to achieve this, giving your function this "weird shape" you're talking about.
If you can constrain the starting points better, do so; if not, you might consider discarding this initial length before further analysis is made (see here and here for discussions on burn in lengths).
As for whether the errors are realistic or not, you need to inspect the resulting posterior models for that, because the ratio of the actual parameter value to its uncertainties does not have a say in this. For example, if we take your linked example and change the true value of b to 10^10, the resulting errors will be ten magnitudes smaller while remaining perfectly valid.
I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation between the best match with the catalogue for each object in my list. My initial list is supossed to be the position of a known object, but it could happend that it is not detected in the catalog, and my coordinates may suffer from small offsets.
They way I am computing the maximum radius is by fitting the gaussian kernel density of the separation with a gaussian, and use the center + 3sigmas value. The method works nicely for most of the cases, but when a small subsample of my list has an offset, I have two gaussians instead. In these cases, I will specify the max radius in a different way.
My problem is that when this happens, curve_fit can't normally do the fit with one gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and in which cases the "different way" is used. Could someone give me a hand on what this means in mathematical terms?
There are varying lengths to which you can go justifying this or that fitting ansatz --- which strongly depends on the details of your specific case (eg: why do you expect a gaussian to work in a first place? to what depth you need/want to delve into why exactly a certain fitting procedure fails and what exactly is a fail etc).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you look for is way of justifying why in a certain case a gaussian is not a good fitting ansatz, one way would be to calculate the moments: for a gaussian distribution 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relation between moments is very different, it sounds reasonable that these data can't be fit by a gaussian.