I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough. However, because the function returns the local maxima of the raw data, it does not ignore the noise, which causes overshoot. So what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure you can dismiss those points as outliers so easily, since they look close to where I would expect the peaks to be. But if you don't think they are a valid approximation, here are three other approaches you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around the peaks. You can, for instance, suppose that the shape of the plot is the sum of some background model (maybe constant, maybe more complicated) plus some Gaussian (or Lorentzian) peaks.
This is what we usually do in physics. Of course, the model will be more accurate if it draws on knowledge of the underlying processes, which I don't have.
Given a good model, this approach is definitely better than taking the maximum values: even if they are not outliers, they still carry errors that you want to reduce.
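As a rough sketch of this first option: the snippet below assumes one Gaussian on a constant background for each peak and refines every find_peaks result with scipy.optimize.curve_fit in a small window. The data are synthetic stand-ins for the plot in the question, and the names gaussian_plus_background and refine_peak, the window size and the find_peaks thresholds are all made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.signal import find_peaks


def gaussian_plus_background(x, amplitude, center, sigma, background):
    """One Gaussian peak sitting on a constant background (assumed model)."""
    return background + amplitude * np.exp(-0.5 * ((x - center) / sigma) ** 2)


def refine_peak(x, y, peak_index, half_window):
    """Fit the model in a window around one find_peaks result; return the fitted center."""
    lo = max(peak_index - half_window, 0)
    hi = min(peak_index + half_window + 1, len(x))
    xw, yw = x[lo:hi], y[lo:hi]
    # Starting values read off the raw data inside the window.
    p0 = [yw.max() - yw.min(), x[peak_index], (xw[-1] - xw[0]) / 4, yw.min()]
    popt, _ = curve_fit(gaussian_plus_background, xw, yw, p0=p0)
    return popt[1]


# Synthetic data: three noisy Gaussian peaks on a flat background (illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 501)
true_centers = [2.0, 5.0, 8.0]
y = 1.0 + sum(3.0 * np.exp(-0.5 * ((x - c) / 0.4) ** 2) for c in true_centers)
y = y + rng.normal(0, 0.15, x.size)

# Coarse peak positions from the raw data, then one local fit per peak.
peaks, _ = find_peaks(y, prominence=1.5, distance=50)
fitted_centers = [refine_peak(x, y, p, half_window=40) for p in peaks]
```

If the peaks overlap, fitting them one window at a time is probably not good enough, and a single fit of the summed model over the whole range would be the next thing to try.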
Second option
But if you want an easier way, just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum. If you do that, the functions np.where and np.argwhere tend to be useful for selecting the points.
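A minimal sketch of the averaging idea, assuming the peak index is already known. The function name average_around_peak and the width are hypothetical, and weighting the positions by height above the window minimum is just one possible choice. The plain mean of the heights will sit a bit below the true peak height if the window is wide.

```python
import numpy as np


def average_around_peak(x, y, peak_index, width):
    """Rough peak estimate from the points within `width` of a detected maximum."""
    # np.where picks the neighbours by x distance, so uneven sampling is fine.
    near = np.where(np.abs(x - x[peak_index]) <= width)[0]
    xw, yw = x[near], y[near]
    # Position: average of x weighted by height above the lowest point in the window.
    # (Assumes the window is not perfectly flat, otherwise the weights sum to zero.)
    position = np.average(xw, weights=yw - yw.min())
    # Height: plain mean of the selected points.
    height = yw.mean()
    return position, height


# Synthetic single peak for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 501)
y = 1.0 + 3.0 * np.exp(-0.5 * ((x - 5.0) / 0.4) ** 2) + rng.normal(0, 0.1, x.size)

peak_index = int(np.argmax(y))
position, height = average_around_peak(x, y, peak_index, width=0.3)
```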
Third option
The easiest option is reading the values off by hand. That may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic approach, and this is SO. But in the end, it depends on why you need those values and on the confidence interval you need for your measurement.
Related
I am using sklearn.cluster.DBSCAN on my dataset. However, it tends to classify some of my data points as noise, even though they are not. However, if I increase eps even more, it will start merging unrelated clusters. I figured that my best attempt would be if I kept the clustering phase the same, but increased the allowed range for the "neighbour finding" phase. Is there such a possibility? My only other approach would be to build a kd-tree of all non-noise points, and for each noise point to look for the closest non-noise point and evaluate if they belong together. However, of course it would be better if this was working in a built-in way.
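For what it's worth, the kd-tree fallback described above could look roughly like this. It uses scipy.spatial.cKDTree and assumes the labels come from DBSCAN, where -1 marks noise. The function name reassign_noise and the max_distance threshold are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def reassign_noise(points, labels, max_distance):
    """Give each noise point (label -1) the label of its nearest clustered
    point, but only if that point lies within max_distance."""
    labels = np.asarray(labels).copy()
    noise = labels == -1
    if not noise.any() or noise.all():
        return labels  # nothing to reassign, or nothing to attach to
    clustered_labels = labels[~noise]
    tree = cKDTree(points[~noise])
    dist, idx = tree.query(points[noise])
    labels[noise] = np.where(dist <= max_distance, clustered_labels[idx], -1)
    return labels


# Tiny hand-made example: two clusters and two noise points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [0.5, 1.5],
                   [10.0, 10.0], [10.0, 11.0], [50.0, 50.0]])
labels = np.array([0, 0, -1, 1, 1, -1])
new_labels = reassign_noise(points, labels, max_distance=1.0)
```

Here the point at (0.5, 1.5) joins cluster 0, while the far-away point at (50, 50) stays noise.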
With python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, contains constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in independent software. Basically, the model data depends on four parameters, all of which are limited to a certain range, and which I am currently feeding to the software manually (automation is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
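A minimal sketch of the procedure just described, using finite differences for the gradient. The real model lives in external software, so toy_model here is a made-up stand-in with two parameters, and the step size, nudge size and iteration count are guesses that would need tuning for a real problem.

```python
import numpy as np


def mse(params, model, t, y):
    """Mean square error between the model and the measured points."""
    return np.mean((model(t, *params) - y) ** 2)


def gradient_descent(model, t, y, start, step=0.1, eps=1e-6, n_iter=2000, tol=1e-10):
    params = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        base = mse(params, model, t, y)
        grad = np.zeros_like(params)
        # Nudge each parameter in turn and see how the cost changes.
        for i in range(params.size):
            nudged = params.copy()
            nudged[i] += eps
            grad[i] = (mse(nudged, model, t, y) - base) / eps
        # Move all parameters a little in the direction that lowers the cost.
        new_params = params - step * grad
        converged = np.max(np.abs(new_params - params)) < tol
        params = new_params
        if converged:
            break
    return params


def toy_model(t, amplitude, offset):
    """Hypothetical stand-in for the external simulation."""
    return offset + amplitude * np.sin(t)


# Unevenly sampled synthetic "measurement".
rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 20, 200))
y = toy_model(t, 2.0, 1.5) + rng.normal(0, 0.1, t.size)

fitted = gradient_descent(toy_model, t, y, start=[1.0, 0.0])
```

This toy cost function has a single minimum, so the local-minimum problem does not show up here. With a model that calls out to external software, every cost evaluation is one simulation run, so the loop above would be slow.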
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? In that case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy or scipy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
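A small sketch with numpy's FFT, assuming evenly spaced samples (which holds for the model output but not for the measured data; that would first need resampling, or a method built for uneven sampling such as a Lomb-Scargle periodogram). The signal is synthetic.

```python
import numpy as np

dt = 0.1                      # constant time step
t = dt * np.arange(1000)      # 1000 samples, total length 100
signal = (2.0 * np.sin(2 * np.pi * t / 5.0 + 0.3)
          + 0.5 * np.sin(2 * np.pi * t / 1.25))

# Real-input FFT; subtract the mean so the zero-frequency bin does not dominate.
spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(signal.size, d=dt)

k = np.argmax(np.abs(spectrum))                     # strongest frequency bin
dominant_period = 1.0 / freqs[k]
amplitude = 2.0 * np.abs(spectrum[k]) / signal.size
phase = np.angle(spectrum[k])                       # phase of the equivalent cosine term
```

The amplitude comes out this cleanly only because both periods fit a whole number of times into the record. With real data the power leaks into neighbouring bins.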
I have a dataset for which I need to fit the plot. I am using leastsq() for fitting. However, currently I need to give initial guess values manually which is really affecting the fitting. Is there any way to first calculate the initial guess values which I can pass in leastsq()?
No, you can't really calculate an initial guess.
You'll just have to make an educated guess, and that really depends on your data and model.
If the initial guess affects the fitting, there is likely something else going on; you're probably getting stuck in local minima. Your model may be too complex, or your data range may be so large that you run into floating point precision limits and the fitting algorithm can't detect any changes for parameter changes. The latter can often be avoided by normalizing your data (and model), or, for example, using log(-log) space instead of linear space.
Or avoid leastsq altogether, and use a different minimization method (which will likely be much slower, but may produce overall better and more consistent results), such as the Nelder-Mead amoeba method.
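For example, a Nelder-Mead fit through scipy.optimize.minimize could look like the sketch below. The exponential model, the synthetic data and the starting values are made up for illustration. Note that minimize wants a single scalar to minimize (here the sum of squared residuals), not the residual vector that leastsq takes.

```python
import numpy as np
from scipy.optimize import minimize


def model(x, amplitude, rate):
    """Hypothetical model: a simple exponential decay."""
    return amplitude * np.exp(-rate * x)


def sum_of_squares(params, x, y):
    """Scalar cost function for minimize."""
    return np.sum((model(x, *params) - y) ** 2)


# Synthetic data for illustration.
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 100)
y = model(x, 4.0, 1.3) + rng.normal(0, 0.05, x.size)

result = minimize(sum_of_squares, x0=[1.0, 0.5], args=(x, y), method="Nelder-Mead")
amplitude_fit, rate_fit = result.x
```

It still needs a starting point, but it uses no derivatives, so it tends to be less sensitive to a poor guess.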
I use a python package called emcee to fit a function to some data points. The fit looks great, but when I want to plot the value of each parameter at each step I get this:
In their example (with a different function and data points) they get this:
Why is my function converging so fast, and why does it have that weird shape in the beginning? I apply MCMC using the likelihood and posterior probability. And even though the fit looks very good, the errors on the parameters of the function are very small (10^10 times smaller than the actual value), and I think it is because of the random walks. Any idea how to fix it? Here is their code for fitting: http://dan.iel.fm/emcee/current/user/line/ I used the same code with the obvious modifications for my data points and fitting function.
I would not say that your function is converging faster than the emcee line-fitting example you linked to. In the example, the walkers start exploring the most likely values in the parameter space almost immediately, whereas in your case it takes more than 200 iterations to reach the high probability region.
The jump in your trace plots looks like a burn-in. It is a common feature of MCMC sampling algorithms where your walkers are given starting points away from the bulk of the posterior and then must find their way to it. It looks like in your case the likelihood function is fairly smooth so you only need a hundred or so iterations to achieve this, giving your function this "weird shape" you're talking about.
If you can constrain the starting points better, do so; if not, you might consider discarding this initial length before further analysis is made (see here and here for discussions on burn in lengths).
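Discarding the burn-in is just slicing the chain. The sketch below uses a fake chain built with numpy so it runs without emcee, and it assumes the array is laid out as (steps, walkers, parameters); check the axis order of the chain your emcee version gives you. The function name discard_burn_in and the cut at 300 steps are made up for illustration.

```python
import numpy as np


def discard_burn_in(chain, n_burn):
    """Drop the first n_burn steps and merge all walkers into one flat sample.

    Assumes chain has shape (n_steps, n_walkers, n_dim).
    """
    return chain[n_burn:].reshape(-1, chain.shape[-1])


# Fake chain: walkers start far from the target and relax towards it.
rng = np.random.default_rng(4)
n_steps, n_walkers, n_dim = 1000, 20, 2
target = np.array([3.0, -1.0])
steps = np.arange(n_steps)[:, None, None]
chain = (target + 5.0 * np.exp(-steps / 30.0)
         + rng.normal(0, 0.1, (n_steps, n_walkers, n_dim)))

flat = discard_burn_in(chain, n_burn=300)
estimate = flat.mean(axis=0)   # no longer pulled towards the starting points
```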
As for whether the errors are realistic or not, you need to inspect the resulting posterior distributions, because the ratio of a parameter's value to its uncertainty says nothing about that. For example, if we take your linked example and change the true value of b to 10^10, the resulting errors will be about ten orders of magnitude smaller than the parameter value while remaining perfectly valid.
I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation between each object in my list and its best match in the catalogue. Each entry in my initial list is supposed to be the position of a known object, but it could happen that the object is not detected in the catalogue, and my coordinates may suffer from small offsets.
The way I compute the maximum radius is by fitting the Gaussian kernel density estimate of the separations with a Gaussian and using the center + 3 sigma value. The method works nicely in most cases, but when a small subsample of my list has an offset, I get two Gaussians instead. In these cases, I will specify the max radius in a different way.
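To make the procedure concrete, here is a rough sketch of how I read it, with scipy.stats.gaussian_kde and scipy.optimize.curve_fit. The separations are synthetic and the function names are hypothetical; this is not the actual code.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gaussian_kde


def gaussian(x, amplitude, center, sigma):
    return amplitude * np.exp(-0.5 * ((x - center) / sigma) ** 2)


def max_radius_from_separations(separations, n_sigma=3.0):
    """Fit a Gaussian to the KDE of the separations; return center + n_sigma * sigma."""
    kde = gaussian_kde(separations)
    grid = np.linspace(separations.min(), separations.max(), 200)
    density = kde(grid)
    # Starting values read off the density itself.
    p0 = [density.max(), grid[np.argmax(density)], separations.std()]
    popt, _ = curve_fit(gaussian, grid, density, p0=p0)
    return popt[1] + n_sigma * abs(popt[2])


# Synthetic separations from a single population (illustration only).
rng = np.random.default_rng(5)
separations = np.abs(rng.normal(0.5, 0.1, 500))
max_radius = max_radius_from_separations(separations)
```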
My problem is that when this happens, curve_fit can't normally do the fit with one gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and in which cases the "different way" is used. Could someone give me a hand on what this means in mathematical terms?
There are varying lengths to which you can go in justifying this or that fitting ansatz, and that strongly depends on the details of your specific case (e.g., why do you expect a Gaussian to work in the first place? How deeply do you need or want to delve into why exactly a certain fitting procedure fails, and what exactly counts as a failure?).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you are looking for is a way of justifying why a Gaussian is not a good fitting ansatz in a certain case, one way would be to calculate the moments: for a Gaussian distribution, the 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relation between the moments is very different, it sounds reasonable to say that these data can't be fit by a Gaussian.
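A quick way to look at this with scipy.stats: for a Gaussian, the skewness (standardized 3rd moment) and the excess kurtosis (standardized 4th moment minus 3) are both zero. The two samples below are synthetic, one from a single Gaussian and one with a small offset subsample mixed in, and shape_summary is a made-up helper name.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def shape_summary(sample):
    """Skewness and excess kurtosis; both are close to 0 for Gaussian data."""
    return skew(sample), kurtosis(sample)


rng = np.random.default_rng(6)
# One population of separations.
single = rng.normal(0.5, 0.1, 1000)
# Same population plus a 10% subsample with an offset.
double = np.concatenate([rng.normal(0.5, 0.1, 900), rng.normal(1.5, 0.1, 100)])

single_skew, single_kurt = shape_summary(single)
double_skew, double_kurt = shape_summary(double)
```

For the single population both numbers should come out near zero, while the offset subsample pushes the skewness well above 1. How large a deviation you accept before dropping the single-Gaussian fit is something you would still have to choose and state in the paper.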