I have some sampled (univariate) data - but the clock driving the sampling process is inaccurate - resulting in a random slip of (less than) 1 sample every 30. A more accurate clock at approximately 1/30 of the frequency provides reliable samples for the same data ... allowing me to establish a good estimate of the clock drift.
I am looking to interpolate the sampled data to correct for this so that I 'fit' the high frequency data to the low-frequency. I need to do this 'real time' - with no more than the latency of a few low-frequency samples.
I recognise that there is a wide range of interpolation algorithms - and, among those I've considered, a spline based approach looks most promising for this data.
I'm working in Python - and have found the scipy.interpolate package - though I could see no obvious way to use it to 'stretch' n samples to correct a small timing error. Am I overlooking something?
I am interested in pointers to either a suitable published algorithm, or - ideally - a Python library function to achieve this sort of transform. Is this supported by SciPy (or anything else)?
UPDATE...
I'm beginning to realise that what, at first, seemed a trivial problem isn't as straightforward as I first thought. I am no longer convinced that naive use of splines will suffice. I've also realised that my problem can be better described without reference to 'clock drift'... like this:
A single random variable is sampled at two different frequencies - one low and one high, with no common divisor - e.g. 5 Hz and 144 Hz. If we assume sample 0 is identical at both sample rates, sample 1 at 5 Hz falls between samples 28 and 29 at 144 Hz. I want to construct a new series - at 720 Hz, say - that fits all the known data points "as smoothly as possible".
I had hoped to find an 'out of the box' solution.
Before you can ask the programming question, it seems to me you need to investigate a more fundamental scientific one.
Before you can start picking out particular equations to fit badfastclock to goodslowclock, you should investigate the nature of the drift. Let both clocks run a while, and look at their points together. Is badfastclock bad because it drifts linearly away from real time? If so, a simple quadratic equation should fit badfastclock to goodslowclock, just as a quadratic equation describes the linear acceleration of an object in gravity; i.e., if badfastclock is accelerating linearly away from real time, you can deterministically shift badfastclock toward real time. However, if you find that badfastclock is bad because it is jumping around, then smooth curves -- even complex smooth curves like splines -- won't fit. You must understand the data before trying to manipulate it.
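A minimal sketch of that check, assuming you can collect readings of both clocks for the same events. The arrays t_true and t_fast below are synthetic and hypothetical, and the drift coefficient is made up for illustration:

```python
import numpy as np

# Synthetic data: the fast clock's reading drifts smoothly away from true time.
t_true = np.linspace(0.0, 10.0, 51)      # reference times from the good slow clock
t_fast = t_true + 0.002 * t_true**2      # hypothetical bad-clock readings

# Fit a quadratic that maps bad-clock time onto reference time.
coeffs = np.polyfit(t_fast, t_true, deg=2)
t_corrected = np.polyval(coeffs, t_fast)

# Small, structureless residuals suggest smooth drift;
# large or jumpy residuals suggest jitter that no smooth curve will fix.
residual = t_true - t_corrected
max_err = np.max(np.abs(residual))
print(max_err)
```

If the residuals on your real data look like noise with the same size as the original error, the clock is jumping around rather than drifting, and a polynomial correction will not help.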
Based on your updated question, if the data is smooth in time, just place all the samples on a single time axis, and interpolate on the resulting non-uniform grid (time).
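Here is a rough sketch of that idea using scipy.interpolate.CubicSpline, which accepts non-uniformly spaced sample times. The signal function is a hypothetical stand-in used only to generate test data, and the sketch works on a whole batch rather than in real time:

```python
import numpy as np
from scipy.interpolate import CubicSpline

f_slow, f_fast, f_out = 5.0, 144.0, 720.0
duration = 2.0

def signal(t):
    # Hypothetical smooth underlying signal, used only to generate test data.
    return np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.cos(2 * np.pi * 3.0 * t)

# Sample times of the two series (index / rate avoids accumulated float error).
t_slow = np.arange(int(duration * f_slow)) / f_slow
t_fast = np.arange(int(duration * f_fast)) / f_fast

# Put all samples on one time axis; drop coincident times (e.g. t = 0).
t_all = np.concatenate([t_slow, t_fast])
y_all = np.concatenate([signal(t_slow), signal(t_fast)])
t_all, idx = np.unique(t_all, return_index=True)
y_all = y_all[idx]

# Interpolate on the non-uniform grid and resample at 720 Hz.
spline = CubicSpline(t_all, y_all)
n_out = int(np.floor((t_all[-1] - t_all[0]) * f_out)) + 1
t_out = t_all[0] + np.arange(n_out) / f_out
y_out = spline(t_out)
```

An interpolating spline passes through every sample, so with noisy data a smoothing or least-squares spline may be a better fit. For the low-latency case you would have to rebuild the spline over a short sliding window of recent samples.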
I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works out fine enough, but since this function determines the local maxima of the data, it is not neglecting the noise in the data which causes overshoot. Therefore what I'm determining isn't actually the location of the most likely maxima, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure that you can consider those points to be outliers so easily, as they look to be close to the place I would expect them to be. But if you don't think they are a valid approximation let me tell you three other ways you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around the peaks. You can, for instance, suppose that the shape of the plot is the sum of some background model (maybe constant or maybe more complicated) plus some Gaussian peaks (or Lorentzian).
This is what we usually do in physics. Of course it will be more accurate taking knowledge from the underlying processes, which I don't have.
Given a good model, this approach is definitely better than taking the maximum values: even if they are not outliers, they still carry errors which you want to reduce.
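A minimal sketch of such a fit with scipy.optimize.curve_fit, assuming a constant background plus a single Gaussian peak. The data is synthetic and the parameter values are made up; your real model may need several peaks or a different line shape:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, background, amplitude, center, width):
    # Constant background plus one Gaussian peak (an assumed model).
    return background + amplitude * np.exp(-0.5 * ((x - center) / width) ** 2)

# Synthetic noisy data with a peak at x = 4.2.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = model(x, 1.0, 5.0, 4.2, 0.6) + rng.normal(0.0, 0.2, x.size)

# Use the raw maximum only as the starting guess for the fit.
i_max = np.argmax(y)
p0 = [np.median(y), y[i_max] - np.median(y), x[i_max], 1.0]
popt, pcov = curve_fit(model, x, y, p0=p0)

center_fit = popt[2]                 # fitted peak location
center_err = np.sqrt(pcov[2, 2])     # its 1-sigma uncertainty from the fit
```

The fitted center uses all points around the peak, so a single noisy sample has much less influence than it does on a plain maximum.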
Second option
But if you want an easier way, just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum. If you do so, the functions np.where and np.argwhere tend to be useful for this kind of thing.
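Something along these lines, as a rough sketch on synthetic data. The window half-width (5 samples) and the find_peaks thresholds are arbitrary assumptions you would tune for your own data:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic data: two Gaussian bumps of height 1 plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 500)
clean = np.exp(-0.5 * ((x - 3.0) / 0.5) ** 2) + np.exp(-0.5 * ((x - 7.0) / 0.5) ** 2)
y = clean + rng.normal(0.0, 0.05, x.size)

# Locate the peaks, then average a few samples around each one.
peaks, _ = find_peaks(y, height=0.5, distance=100)
half = 5  # half-width of the averaging window, in samples (assumed)
peak_heights = []
for p in peaks:
    lo, hi = max(p - half, 0), min(p + half + 1, y.size)
    peak_heights.append(y[lo:hi].mean())
peak_heights = np.array(peak_heights)
```

The averaged height is less sensitive to a single noisy sample, at the cost of a small downward bias when the window is wide compared to the peak.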
Third option
The easiest option is taking the value by hand. It may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic way, and this is SO. But in the end, it depends on why you need those values and on the confidence interval you need for your measurement.
With python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, contains constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in independent software. Basically, the model data depends on four parameters, all of which are limited to a certain range, which I am currently feeding manually to the software (automation is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
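A bare-bones sketch of that loop, using finite differences for the gradient. The simulate function is a hypothetical stand-in for the external software, with two parameters instead of four so the example stays short, and the learning rate and step count are arbitrary:

```python
import numpy as np

def simulate(params, t):
    # Hypothetical stand-in for the external simulation software.
    offset, amplitude = params
    return offset + amplitude * np.sin(2 * np.pi * t)

def mse(params, t, y):
    # Mean-square error, evaluated only at the measured time stamps.
    return np.mean((simulate(params, t) - y) ** 2)

def gradient_descent(params, t, y, lr=0.2, eps=1e-6, n_steps=500):
    params = np.asarray(params, dtype=float).copy()
    for _ in range(n_steps):
        grad = np.zeros_like(params)
        for i in range(params.size):
            # Nudge one parameter at a time to estimate the slope of the cost.
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (mse(params + step, t, y) - mse(params - step, t, y)) / (2 * eps)
        params -= lr * grad  # move all parameters downhill
    return params

# Synthetic "measurements" with irregular time steps.
rng = np.random.default_rng(3)
t_obs = np.sort(rng.uniform(0.0, 3.0, 80))
y_obs = simulate([2.0, 0.7], t_obs) + rng.normal(0.0, 0.05, t_obs.size)

fit = gradient_descent([0.0, 0.0], t_obs, y_obs)
```

Because the cost is evaluated only at the measured times, gaps and uneven sampling are not a problem, as long as the model can be evaluated (or interpolated) at those times. With a real simulator each cost evaluation is a full run, and parameters such as period or phase make the cost non-convex, so the local-minimum caveat applies.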
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
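For example, a small sketch with numpy.fft on a synthetic, evenly sampled curve. The period, amplitude and offset are made up, and this only applies directly to data with constant time steps, such as the model output:

```python
import numpy as np

# Synthetic evenly sampled curve: one sine wave plus a constant offset.
dt = 0.1
t = np.arange(1000) * dt
y = 1.5 * np.sin(2 * np.pi * t / 12.5 + 0.3) + 10.0

# Remove the mean so the zero-frequency bin does not dominate.
spectrum = np.fft.rfft(y - y.mean())
freqs = np.fft.rfftfreq(t.size, d=dt)

# Read period, amplitude and phase off the strongest bin.
k = np.argmax(np.abs(spectrum))
period = 1.0 / freqs[k]
amplitude = 2.0 * np.abs(spectrum[k]) / t.size
phase = np.angle(spectrum[k])
```

The amplitude estimate is exact here only because the record contains a whole number of cycles. On real data the peak leaks into neighbouring bins, so treat these values as starting guesses for a fit.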
I am trying to solve a problem very similar to the one discussed in this post
I have a broadband signal, which contains a component with time-varying frequency. I need to monitor the phase of this component over time. I am able to track the frequency shifts by (a somewhat brute force method of) peak tracking in the spectrogram. I need to "clean up" the signal around this time varying peak to extract the Hilbert phase (or, alternatively, I need a method of tracking the phase that does not involve the Hilbert transform).
To summarize that previous post: varying the coefficients of a FIR/IIR filter in time causes bad things to happen (it does not just shift the passband, it also completely confuses the filter state in ways that cause surprising transients). However, there probably is some way to adjust filter coefficients in time (probably by jointly modifying the filter coefficients and the filter state in some intelligent way). This is beyond my expertise, but I'd be open to any solutions.
There were two classes of solutions that seem plausible: one is to use a resonator filter (basically a damped harmonic oscillator driven by the signal) with a time-varying frequency. This model is simple enough to avoid surprising filter transients. I will try this -- but resonators have very poor attenuation in the stop band (if they can even be said to have a stop band?). This makes me nervous as I'm not 100% sure how the resonator filters will behave.
The other suggestion was to use a filter bank and smoothly interpolate between various band-pass filtered signals according to the frequency. This approach seems appealing, but I suspect it has some hidden caveats. I imagine that linearly mixing two band-pass filtered signals might not always do what you would expect, and might cause weird things? But, this is not my area of expertise, so if mixing over a filter bank is considered a safe solution (one that has been analyzed and published before), I would use it.
Another potential class of solutions occurs to me, which is to just take the phase from the frequency peak in a sliding short-time Fourier transform (could be windowed, multitaper, etc). If anyone knows any prior literature on this I'd be very interested. Related, would be to take the phase at the frequency power peak from a sliding complex Morlet wavelet transform over the band of interest.
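To make the STFT variant concrete, here is a rough sketch with scipy.signal.stft on a synthetic drifting tone. The sample rate, drift, noise level, window length and search band are all assumptions chosen for illustration:

```python
import numpy as np
from scipy.signal import stft

# Synthetic test signal: a tone drifting from 50 Hz to 70 Hz, buried in noise.
fs = 1000.0
t = np.arange(0, 4.0, 1.0 / fs)
f_inst = 50.0 + 5.0 * t
phase_true = 2 * np.pi * np.cumsum(f_inst) / fs
rng = np.random.default_rng(4)
x = np.cos(phase_true) + 0.5 * rng.normal(size=t.size)

# Sliding short-time Fourier transform (Hann window by default).
f, t_seg, Z = stft(x, fs=fs, nperseg=256)

# Within the band of interest, pick the strongest bin in each time slice.
band = (f >= 30.0) & (f <= 100.0)
Zb = Z[band]
k_peak = np.argmax(np.abs(Zb), axis=0)
f_peak = f[band][k_peak]

# Phase of the complex STFT coefficient at the peak bin.
phase_peak = np.angle(Zb[k_peak, np.arange(Zb.shape[1])])
```

The frequency track comes out cleanly, but the phase is only defined per window and per bin, so it jumps whenever the peak hops to a neighbouring bin. That matches the worry about surprising phase behaviour, and it would need some unwrapping or bin-interpolation logic on top.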
So, I guess, basically I have three classes of solutions in mind.
1. Resonator filters with time-varying frequency.
2. Using a filter bank, possibly with mixing?
3. Pulling phase from a STFT or CWT, (these can be considered a subset of the filter bank approach)
My suspicion is that in (2,3) surprising things will happen to the phase from time to time, and that in (1) we may not be able to reject as much noise as we'd like. It's not clear to me that this problem even has a perfect solution (uncertainty principle in time-frequency resolution?).
Anyway, if anyone has solved this before, and... even better, if anyone knows any papers that sound directly applicable here, I would be grateful.
Not sure if this will help, but googling "monitor phase of time varying component" resulted in this: Link
Hope that helps.
I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation between each object in my list and its best match in the catalogue. My initial list is supposed to contain the positions of known objects, but it could happen that an object is not detected in the catalogue, and my coordinates may suffer from small offsets.
The way I am computing the maximum radius is by fitting the Gaussian kernel density estimate of the separations with a Gaussian, and using the center + 3 sigma value. The method works nicely in most cases, but when a small subsample of my list has an offset, I get two Gaussians instead. In these cases, I will specify the max radius in a different way.
My problem is that when this happens, curve_fit can't normally do the fit with one gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and in which cases the "different way" is used. Could someone give me a hand on what this means in mathematical terms?
There are varying lengths to which you can go justifying this or that fitting ansatz --- which strongly depends on the details of your specific case (e.g.: why do you expect a Gaussian to work in the first place? to what depth do you need/want to delve into why exactly a certain fitting procedure fails, what exactly counts as a failure, etc).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you are looking for is a way of justifying why in a certain case a Gaussian is not a good fitting ansatz, one way would be to calculate the moments: for a Gaussian distribution the 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relation between moments is very different, it sounds reasonable that these data can't be fit by a Gaussian.
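As a rough illustration of the moments idea: for a Gaussian the skewness and the excess kurtosis are both zero, so large sample values argue against a single-Gaussian fit. The samples below are synthetic stand-ins for the separations, and the mixture proportions are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical separations: one clean population, and one with an offset subsample.
single = rng.normal(0.5, 0.1, 2000)
mixture = np.concatenate([rng.normal(0.5, 0.1, 1700), rng.normal(1.5, 0.1, 300)])

def gaussian_moment_check(sample):
    # Skewness and excess kurtosis; both are 0 for an ideal Gaussian.
    return stats.skew(sample), stats.kurtosis(sample)

skew_single, kurt_single = gaussian_moment_check(single)
skew_mix, kurt_mix = gaussian_moment_check(mixture)
print(skew_single, kurt_single)
print(skew_mix, kurt_mix)
```

What counts as "very different" from zero depends on the sample size, so for a publication you would still want a proper test or at least quoted uncertainties on these numbers.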
I am looking for a "method" to get a formula from fitting a set of data (3000 points). I was using Legendre polynomials, but for more than 20 points they do not give exact values. I can write a chi2 test, but the algorithm needs a lot of time to calculate N parameters, and at the beginning I don't know what the function looks like, so it takes time. I was thinking about splines... Maybe ...
So the input is: 3000 points
Output : f(x) = ... something
I want to have a formula from fit. What is a best way to do this in python?
May the force be with us!
Nykon
How about a polynomial fit:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
or some other interpolation scheme:
http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
It is difficult to recommend a suitable method without knowing more about the dataset and something about how good of a fit is required.
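For what it's worth, a low-degree polynomial fit does give you an explicit formula. A minimal sketch on synthetic data, where the cubic and the noise level are made up:

```python
import numpy as np

# Synthetic data: 3000 noisy points from a known cubic.
rng = np.random.default_rng(6)
x = np.linspace(-1.0, 1.0, 3000)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0.0, 0.05, x.size)

# Least-squares fit of a degree-3 polynomial; coefficients are highest power first.
coeffs = np.polyfit(x, y, deg=3)
formula = np.poly1d(coeffs)   # callable, and printable as a formula

# Root-mean-square residual as a crude measure of fit quality.
rms = np.sqrt(np.mean((formula(x) - y) ** 2))
print(formula)
print(rms)
```

Whether degree 3 (or any low degree) is adequate depends entirely on the shape of your data, so check the residuals before trusting the formula.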
Except, a spline does not give you a "formula", at least not unless you have the wherewithal to deal with all of the piecewise segments. Even then, it will not be easily written down, or give you anything that is at all pretty to look at.
A simple spline gives you an interpolant. Worse, for 3000 points, an interpolating spline will give you roughly that many cubic segments! You did say interpolation before. Of course, an interpolating polynomial of that high an order will be complete crapola anyway, so don't think you can just go back there.
If all that you need is a tool that can provide an exact interpolation at any point, and you really don't need to have an explicit formula, then an interpolating spline is a good choice.
Or do you really want an approximant? A function that will APPROXIMATELY fit your data, smoothing out any noise? The fact is, a lot of the time when people who have no idea what they are doing say "interpolation" they really do mean approximation, smoothing. This is possible of course, but there are entire books written on the subject of curve fitting, the modeling of empirical data. Your first goal is then to choose an intelligent model that will represent this data. Best of course is if you have some intelligent choice of model from physical understanding of the relationship under study; then you can estimate the parameters of that model using a nonlinear regression scheme, of which there are many to be found.
If you have no model, and are unwilling to choose one that roughly has the proper shape, then you are left with generic models in the form of splines, which can be fit in a regression sense, or with high order polynomial models, for which I have little respect.
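As a sketch of a spline fit in the regression sense, scipy.interpolate.LSQUnivariateSpline fits a cubic spline with a handful of fixed knots by least squares. The data is synthetic, and the choice of 8 evenly spaced interior knots is an arbitrary assumption:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Synthetic data: 3000 noisy samples of a smooth curve.
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 3000)
y = np.sin(x) + rng.normal(0.0, 0.1, x.size)

# A few interior knots: the spline has far fewer pieces than data points,
# so it smooths the noise instead of interpolating it.
knots = np.linspace(0.0, 10.0, 10)[1:-1]
spline = LSQUnivariateSpline(x, y, knots, k=3)

y_fit = spline(x)
max_dev = np.max(np.abs(y_fit - np.sin(x)))  # deviation from the noise-free curve
print(max_dev)
```

You still don't get a pretty closed-form formula, only knots and coefficients (spline.get_knots() and spline.get_coeffs()), but the result can be evaluated anywhere and described with a dozen numbers.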
My point in all of this is YOU need to make some choices and do some research on a choice of model.
The only formula would be a polynomial of order 3000.
How good does the fit need to be? What type of formula do you expect?
You could sample your observed points (randomly is best) and fit a cubic spline to this sample (if you repeat this procedure, you can create a distribution of splines). Fitting a spline to 3,000 points is a bit much, but generating a distribution of splines based on samples could give you an idea of what the function will look like. As Josh mentioned above, http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html is a good place to start your search.