Comparing multiple signals for similarity - python

I have multiple (between 2 and 100) signals and need to determine when a significant number diverge from the rest. We're exploring machine learning techniques, but we also want to tackle this as a signal processing problem and see where we get the best results.
This very informative post suggests that best results come from a weighted ensemble of techniques, including:
Similarity in time domain (static): Multiply in place and sum.
Similarity in time domain (with shift*): Take the FFT of each signal, multiply one by the complex conjugate of the other, and take the IFFT (MATLAB's xcorr).
Similarity in frequency domain (static**): Take FFT of each signal, multiply, and sum.
Similarity in frequency domain (with shift*): Multiply the two signals and take FFT. This will show if the signals share similar spectral shapes.
Similarity in energy (or power if different lengths)
But this is a fairly high-level outline. Can anyone point me to a more thorough discussion of these techniques, preferably with some python code or in lieu of that, some code in R?
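A minimal Python sketch of those measures for two equal-length 1-D NumPy arrays x and y (the function names and normalisation choices are illustrative, not taken from the post):

    import numpy as np

    def sim_time_static(x, y):
        # Similarity in the time domain (static): multiply element-wise and sum (a dot product).
        return float(np.dot(x, y))

    def sim_time_shifted(x, y):
        # Similarity in the time domain with shift: cross-correlation via FFT
        # (FFT both signals, multiply one by the conjugate of the other, IFFT),
        # analogous to MATLAB's xcorr.
        n = len(x) + len(y) - 1
        X = np.fft.fft(x, n)
        Y = np.fft.fft(y, n)
        return np.fft.ifft(X * np.conj(Y)).real  # peak value/position give similarity/lag

    def sim_freq_static(x, y):
        # Similarity in the frequency domain (static): multiply the spectra and sum.
        X = np.fft.fft(x)
        Y = np.fft.fft(y)
        return float(np.abs(np.sum(X * np.conj(Y))))

    def sim_energy(x, y):
        # Compare energies (use mean power instead if the lengths differ).
        return abs(float(np.sum(np.asarray(x) ** 2) - np.sum(np.asarray(y) ** 2)))

Normalising the signals first (e.g. to zero mean and unit energy) usually makes the scores easier to compare across signal pairs.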

Related

Maximum Mean Discrepancy (implementation with Python and scikit-learn) and Uncertainties

I am using this code from Jindong Wang to estimate MMD (Maximum Mean Discrepancy), with the aim of distinguishing between different characteristics of time series that I generate artificially following this scikit-learn example. I started with a simple A*sin(wx+phi) to test whether it is possible to differentiate phases, amplitudes, or frequencies with such an approach by comparing each data set with a reference sin(x). The idea is that the distances must increase as I choose larger frequencies or amplitudes. I have three questions.
How can I estimate an uncertainty related to the MMD distances? (This is a more theoretical question.)
How can I optimize (in terms of memory and my arrays) so that I can use long time series with more than 10,000 time points with x-y elements?
Why does it work fine for differences in amplitudes but not for phases or frequencies? Could it be related to the sampling frequency of the data?
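For reference, a minimal NumPy sketch of a (biased) squared-MMD estimate with an RBF kernel, in the spirit of the linked code (the bandwidth sigma and the sample layout are illustrative choices, not taken from Jindong Wang's implementation):

    import numpy as np

    def rbf_mmd2(X, Y, sigma=1.0):
        # Biased estimate of the squared MMD between two samples.
        # X, Y: arrays of shape (n_samples, n_features); a univariate series can
        # be passed as series.reshape(-1, 1).
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)

        def k(A, B):
            # Pairwise squared Euclidean distances, then the RBF kernel.
            sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
            return np.exp(-sq / (2.0 * sigma**2))

        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

On question 1, a common (if approximate) way to attach an uncertainty to the estimate is to bootstrap or permute the samples and look at the spread of the recomputed statistic. On question 2, the n x n kernel matrices dominate memory, so evaluating them in blocks (or subsampling) is the usual way to handle very long series.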

Clustering method for three-dimensional vectors

I have N three-dimensional vectors
(x,y,z)
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.
The general scikit-learn clustering page does a decent job of providing useful background on clustering methods and gives a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how your knowledge of the dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means), while others attempt to determine the number of clusters from other input parameters (like DBSCAN).
While focusing on methods that attempt to find the number of clusters might seem preferable, it is also possible to use a method that expects the number of clusters and simply test many different reasonable cluster counts to determine which one is optimal. One such example with K-Means is this.
The easiest clustering algorithms are probably K-Means (if your three features are numerical) and K-Medoids (which allows any type of features).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between each observation in the dataset and the cluster centres, they assign each observation to the closest cluster. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the Elbow method or the Silhouette score that allow you to determine numerically which value of K would be a reasonable number of clusters.
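For example, a minimal scikit-learn sketch that scans K and picks the value with the best silhouette score (the data array X and the range of K are placeholders):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 3)  # stand-in for your N x 3 vectors

    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])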

How to determine correlation values for angular data?

I have a number of time series of angular data. These values are not vectors (no magnitude), just angles. I need to determine how correlated the various time series are with each other (e.g., I would like to obtain a correlation matrix) over the duration of the data. For example, some are measured very close to each other and I expect they will be highly correlated, but I'm also interested in seeing how correlated the more distant measurements are.
How would I go about adapting this angular data in order to obtain a correlation matrix? I thought about vectorizing it (i.e., converting to unit vectors), but then I'm not sure how to do the correlation analysis with this two-dimensional data, as I've only done it with one-dimensional data previously. Of course, I can't simply analyze the correlation of the raw angles themselves, due to the nature of angular data (the wrap-around at 0/360 degrees).
I'm working in Python, so if anyone has any recommendations on relevant packages I would appreciate it.
I have found a solution in the Astropy Python package. The following function is suitable for circular correlation:
https://docs.astropy.org/en/stable/api/astropy.stats.circcorrcoef.html
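A short usage sketch (the angle values below are dummy data; for a correlation matrix, apply the function to each pair of series):

    import numpy as np
    from astropy import units as u
    from astropy.stats import circcorrcoef

    # Two angular time series in degrees (dummy values).
    alpha = np.array([356, 97, 211, 232, 343, 292, 157, 302]) * u.deg
    beta = np.array([119, 162, 221, 259, 270, 29, 97, 292]) * u.deg

    print(circcorrcoef(alpha, beta))  # circular correlation coefficient

    # Pairwise matrix over a list of series.
    series = [alpha, beta]
    n = len(series)
    corr = np.array([[float(circcorrcoef(series[i], series[j])) for j in range(n)]
                     for i in range(n)])
    print(corr)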

Wavelet for time series

I am trying to use wavelet coefficients as features for a neural network on time series data, and I am a bit confused about how to use them. Do I need to compute the coefficients on the entire time series at once, or use a sliding window? That is, will computing the coefficients on the entire time series at once leak future data points into those coefficients? What should the approach be for using wavelets on time series data without look-ahead bias, if any?
It is hard to provide you with a detailed answer without knowing what you are trying to achieve.
In a nutshell, you first need to decide whether you want to apply a discrete (DWT) or a continuous (CWT) wavelet transform to your time series.
A DWT will allow you to decompose your input data into a set of discrete levels, providing you with information about the frequency content of the signal, i.e. determining whether the signal contains high-frequency variations or low-frequency trends. Think of it as applying several band-pass filters to your input data.
I do not think that you should apply a DWT to your entire time series at once. Since you are working with financial data, maybe decomposing your input signal into 1-day windows and applying a DWT on these subsets would do the trick for you.
In any case, I would suggest:
Installing the PyWavelets (pywt) toolbox and playing with a dummy time series to understand how wavelet decomposition works (a small sketch is given after this list).
Checking out the abundant literature available about wavelet analysis of financial data. For instance, if you are interested in financial time series forecasting, you might want to read this paper.
Posting your future questions on the DSP Stack Exchange, unless you have a specific coding-related question.
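Following the first suggestion, here is a minimal PyWavelets sketch of a causal, sliding-window DWT feature extraction (the window length, wavelet, and decomposition level are illustrative choices):

    import numpy as np
    import pywt

    def windowed_dwt_features(series, window=64, wavelet="db4", level=3):
        # Compute DWT coefficients on trailing windows only, so no future samples
        # leak into the features used at time t.
        features = []
        for t in range(window, len(series) + 1):
            segment = series[t - window:t]              # data up to time t only
            coeffs = pywt.wavedec(segment, wavelet, level=level)
            features.append(np.concatenate(coeffs))     # approximation + detail coefficients
        return np.array(features)

    # Dummy series just to exercise the function.
    x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * np.random.randn(1000)
    F = windowed_dwt_features(x)
    print(F.shape)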

Sum of multiple distributions

Background
I am trying to estimate the potential energy supply within a geographical area using spatially explicit data. For this purpose, I built a Bayesian network (HydeNet package) and attached it to a raster stack in R. The Bayesian network model reads the input data (e.g. resource supply, conversion efficiency) for each cell location from the raster stack and computes the corresponding energy supply (MCMC simulations). As a result, I obtain a new raster layer with a specific probability distribution of the expected energy supply for each raster cell.
However, I am equally interested in the total energy supply within the study area. That means I need to aggregate (sum) the potential supply of all the raster cells in order to get the overall supply potential within the area.
Click here for visual example
Research
The mathematical operation I want to do is called convolution. R provides a corresponding function called convolve that makes use of the Fast Fourier Transform.
The examples I found so far (e.g. example 1, 2) were limited to the addition of two distributions at a time. However, I would like to sum up multiple distributions (thousands, millions).
Question
How can I sum up (convolve) multiple probability distributions?
I have up to 18,000,000 probability distributions, so computational efficiency will certainly be a big issue.
Further, I am mainly interested in a solution in R, but other solutions (notably Python) are appreciated too.
I don't know if convolving multiple distributions at once would result in a speed increase. Wouldn't something like a123 = convolve(a1, a2, a3) behind the scenes simplify to a12 = convolve(a1, a2); a123 = convolve(a12, a3)? Regardless, in R you could try the foreach package and run all the convolutions in parallel. On a quad core that would (theoretically) speed up the calculations by a factor of 4. If you really want more speed you could try the OpenCL package to see if you can run these calculations in parallel on a GPU, but programming-wise this is not easy to get into. If I were you I would focus on these kinds of solutions rather than trying to speed up the convolution functions themselves.
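For what it's worth, a minimal Python sketch of the sequential pairwise FFT convolution described above, assuming each distribution has been discretised onto a common bin width as a probability vector (sizes and names are illustrative):

    import numpy as np
    from functools import reduce
    from scipy.signal import fftconvolve

    def convolve_all(pmfs):
        # Fold pairwise FFT convolutions over a list of discrete probability vectors;
        # the result is the distribution of the sum of the independent quantities.
        # Renormalise after each step to limit floating-point drift.
        def step(acc, p):
            out = fftconvolve(acc, p)
            return out / out.sum()
        return reduce(step, pmfs)

    # Dummy example: sum of 1,000 small discretised distributions.
    rng = np.random.default_rng(0)
    pmfs = [rng.random(20) for _ in range(1000)]
    pmfs = [p / p.sum() for p in pmfs]
    total = convolve_all(pmfs)
    print(total.shape, total.sum())

For millions of cells it may also be worth noting that, by the central limit theorem, the sum of many independent distributions is often well approximated by a normal distribution with the summed means and variances, which sidesteps the convolution entirely.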
