Maximum Mean Discrepancy (implementation with Python and Scikitlearn) and Uncertanties

Maximum Mean Discrepancy (implementation with Python and Scikitlearn) and Uncertanties - python

I am using this code from Jindong Wang to estimate MMD (Maximum Mean Discrepancy) with the aim of distinguishing between different characteristics of time series that I artificially generate following this skcikit-learn example. I started with simple A*sin(wx+phi) to test if it is possible to differentiate phases, amplitudes or frequencies using such an approach by comparing each data set with the primary sen(x). The idea is that distances must be increasing as I chose larger frequencies or amplitudes. I have two questions.
How can I estimate an uncertainty related to the MMD distances? (this is a more theoretical question)
How can I optimize (in terms of memory and my arrays) to be able to use long time series with more than 10000-time points with x-y elements?
Why it works fine for differences in amplitudes but not for phases or frequencies? Could it be related to the sampling frequency of the data?

Related

Wavelet for time series

I am trying to use wavelets coefficients as feature for neural networks on a time series data and I am bit confused on usage of the same. Do I need to find the coefficients on entire time series at once, or use a sliding window for finding the same. I mean, will finding coefficients on entire time series for once, include the future data points while determining those coefficients? What should be the approach to go about using Wavelets on a time series data without look ahead bias if any?

It is hard to provide you with a detailed answer without knowing what you are trying to achieve.
In a nutshell, you first need to decide whether you want to apply a discrete (DWT) or a continous (CWT) wavelet transform to your time series.
A DWT will allow you to decompose your input data into a set of discrete levels, providing you with information about the frequency content of the signal i.e. determining whether the signal contains high frequency variations or low frequency trends. Think of it as applying several band-pass filters to your input data.
I do not think that you should apply a DWT to your entire time series at once. Since you are working with financial data, maybe decomposing your input signal into 1-day windows and applying a DWT on these subsets would do the trick for you.
In any case, I would suggest:
Installing the pywt toolbox and playing with a dummy time series to understand how wavelet decomposition works.
Checking out the abundant literature available about wavelet analysis of financial data. For instance, if you are interested into financial time series forecasting, you might want to read this paper.
Posting your future questions on the DSP stack exchange, unless you have a specific coding-related answer.

How can I statistically compare a lightcurve data set with the simulated lightcurve?

With python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, contains constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent software. Basically, the model data depends on four parameters, all of which are limited to a certain range, which I am currently feeding mannualy to the software (planned is automatic).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.

If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html

Comparing multiple signals for similarity

I have multiple (between 2 and 100) signals and need to determine when a significant number diverge from the rest. We're exploring machine learning techniques, but we also want tackle this as a signal processing problem and see where we get the best results.
This very informative post suggests that best results come from a weighted ensemble of techniques, including:
Similarity in time domain (static): Multiply in place and sum.
Similarity in time domain (with shift*): Take FFT of each signal, multiply, and IFFT. (matlab's xcorr)
Similarity in frequency domain (static**): Take FFT of each signal, multiply, and sum.
Similarity in frequency domain (with shift*): Multiply the two signals and take FFT. This will show if the signals share similar spectral shapes.
Similarity in energy (or power if different lengths)
But this is a fairly high-level outline. Can anyone point me to a more thorough discussion of these techniques, preferably with some python code or in lieu of that, some code in R?

How to extract the peak at a specific frequency in python

I want to know how much energy is at a specific frequency in a signal. I am using FFT to get the spectrum, and the frequency step is determined by the length of my signal.
My spectrum looks, for example, like this :
I want to get the spectrum peak at a specific frequency, -0.08. However, the discretization of the spectrum only give me a peaks at -0.0729 and -0.0833.
Is there a way to shift the spectrum to make sure there is a data point at the frequency I want? Or a way to get the value without necessarily using fft?
Thank you very much!

What you're actually doing when you take a DFT (or any Fourier Transform) is measuring how much of your signal "intersects" with sines of certain frequencies. This is done by summing product of your signal with the complex conjugate of the wave of whatever frequency. Technically, this is called an inner product, which is a generalization of the dot product, and measures how "close" a signal is to another. So if you're only interested in one frequency, don't take the whole DFT, just look at one you want.
I'm not sure what your units are, so I'll assume you want the peak at f0 = -0.08 Hz (If your units are something else, like normalized to the sampling frequency, then you'll need to account for that). This corresponds to the complex exponential exp(2*pi*j*f0*t). Because you're sampling, your t is discrete, so t = n/fs, where fs is the sampling frequency (in Hz).
# assuming you're using numpy arrays
w = exp(-2*pi*1j*f0*arange(len(signal))/fs)
peak = abs(sum(signal*w))
There are different definitions of the DFT; I'm pretty sure numpy's corresponds to what I have above. The extra minus in the exponential is because it's the complex conjugate.
Notice that it's unlikely that w is actually periodic. If the number of samples is large enough this doesn't really matter. A good heuristic is at least 10 periods.

If you have discrete data but need an output for a continuous variable you'll necessarily need some kind of interpolation function. For a value by request style I would advise Scipy interp1d (example of the use of a interp1d function). I believe it's the fastest way to achieve your intended results.

Strange FFT peaks when working with many tight frequencies

I'm using a slightly modified version of this python code to do frequency analysis:
FFT wrong value?
Lets say I have a pack of sine waves in the time domain that are very close together in frequency, while sharing the same amplitude. This is how they look like in the frequency domain, using FFT on 1024 samples from which I strip out the second half, giving 512 bins of resolution:
This is when I apply a FFT over the same group of waves but this time with 128 samples (64 bins):
I expected a plateau-ish frequency response but it looks like the waves in the center are being cancelled. What are those "horns" I see? Is this normal?

I believe your result is correct. The peaks are at ±f1 and ±f2), corresponding to the respective frequency components of the two signals shown in your first plot.
I assume that you are shifting the DC component back to the center? What "waves in the center" are you referring to?
There are a couple of other potential issues that you should be aware of:
Aliasing: by inspection it appears that you have enough samples across your signal but keep in mind that artificial (or aliased) frequencies can be created by the FFT, if there are not enough sample points to capture the underlying frequency. Specifically, if your frequency is f, then you need your data sample spacing to be at least, Δx = 1/(2*f), or smaller.
Windowing: your signal is windowed (has a finite extent) so there will also be some broadening, ringing, or redistribution of power about each spatial frequency due to edge affects.
Since I don't know the details of your data, I went ahead and created a sinusoid and then sampled the data close to what appears to be your sampling rate. For example, below is a sinusoid with 64 points and with a signal frequency at 10 cycles (count the peaks):
The FFT result is then:
which shows the same quantitative features as yours, but without having your data, its difficult for me to match your exact situation (spacing and taper).
Next I applied a super-Gauss window function (shown below) to simulate the finite extent of your data:
After applying the window to the input signal we have:
The corresponding FFT result shows some additional power redistribution, due to the finite extent of the data:
Although I can't match your exact situation, I believe your results appear as expected and some qualitative features of your data have been identified. Hope this helps.

Sine waves closely spaced in the frequency domain will occasionally nearly cancel out in the time domain. Since your second FFT is 8 times shorter than your first FFT, you may have windowed just such an short area of cancellation. Try a different time location of shorter time window to see something different (or different phases of sinusoids).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.