Clustering of 1D signal - python

I've got several 1D signals, showing two or more bands. An example is shown below.
I need to extract the datapoints belonging to a single band.
My first simple approach was to take a moving average of the data and get the indices where the data is larger than the average.
import scipy.ndimage

def seperate(x):
    # smooth the signal to get a curve that runs between the bands
    average = scipy.ndimage.gaussian_filter(x, 10)
    # boolean array marking the samples above the smoothed curve (the upper band)
    idx = x > average
    # return the indices of the upper and lower band
    return idx, ~idx
Plotting these and the average curve would look like this, where red denotes the upper and blue the lower band.
This works quite well for this example, but fails when more than two bands are present and/or the bands are not that well separated.
I'm looking for a more robust and general solution. I was looking into scikit-learn and was wondering if one of the clustering algorithms can be used to achieve this.
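For instance, if the bands mainly differ in amplitude, a Gaussian mixture model from scikit-learn can assign each sample to a band. The sketch below is only illustrative (the function name and the assumption that the number of bands is known up front are mine); for strongly drifting bands you may want to fit it on the residual x - gaussian_filter(x, 10) instead of the raw values.

import numpy as np
from sklearn.mixture import GaussianMixture

def separate_bands(x, n_bands=2):
    # cluster the sample values themselves; assumes the bands are separated in amplitude
    gmm = GaussianMixture(n_components=n_bands, random_state=0)
    labels = gmm.fit_predict(np.asarray(x).reshape(-1, 1))
    # one boolean index array per band, analogous to the idx / ~idx pair above
    return [labels == k for k in range(n_bands)]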

Have a look at time series similarity measures.
Indeed, in that literature I have seen the binary thresholding you tried referred to as "threshold crossing", among many other techniques.
In general, there is no "one size fits all" time series similarity. Different types of signals require different measures. This can probably best be seen by the fact that some are much better analyzed after FFT, while for others FFT makes absolutely no sense.

Related

How Do I Filter a Numpy Array to Have Only One Y Value per X Value

I'm trying to measure and map out the resistance of some vactrols for an electronics project. I used an Arduino ohmmeter to measure their resistance. Sometimes the data goes out of range and gets a bit wonky at the extremes.
Here are three sets of data, and I want to remove the part at the right end of each plot where it curls back to the left. I'm really not sure how else to express it. It should be pretty simple but I'm really quite stuck. Thanks in advance!
vactrol curves
If you have a reference threshold value, you can just naively filter the data with NumPy using something like data[data < threshold], with threshold set for example to 10_000. Alternatively, you can put NaN values in place of the out-of-range samples if you want to keep the array shape (because it may not always make sense to just remove them) using data[data >= threshold] = np.nan.
If you do not have a reference value, then things start to be a bit more complex. There are fancy ways to detect such patterns efficiently, but most are complex.
The simplest solution is to analyse the standard deviation of your input data over a sliding window and detect outliers based on the resulting local standard deviation. You can see how to do that here (you need to combine this with something like data[sdValues < threshold] to remove the outliers). Note however that this method is very sensitive to values near 0.
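A minimal sketch of that sliding-window idea (the window size and threshold are made-up values you would have to tune; requires NumPy >= 1.20 for sliding_window_view):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_std_mask(data, window=15, threshold=500.0):
    # pad the edges so the rolling std has the same length as the input
    padded = np.pad(data, window // 2, mode='edge')
    sd = sliding_window_view(padded, window).std(axis=-1)
    # True where the local variation stays below the threshold
    return sd[:len(data)] < threshold

# clean = data[local_std_mask(data)]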
An alternative solution is to compute a Gaussian/median filter and then measure the relative difference (or another, more advanced distance metric) between it and your input data (a bit like a high-pass filter). You can take a look at this post to do that.
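And a sketch of that filter-and-compare variant, here with a median filter and a purely illustrative relative threshold:

import numpy as np
from scipy.ndimage import median_filter

def outlier_mask(data, size=21, rel_threshold=0.5):
    # smooth, then flag samples that deviate strongly from the smoothed curve
    smoothed = median_filter(data, size=size, mode='nearest')
    rel_diff = np.abs(data - smoothed) / np.maximum(np.abs(smoothed), 1e-12)
    return rel_diff > rel_threshold

# cleaned = data[~outlier_mask(data)]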
For these two methods, you need to define an arbitrary threshold. But unlike the naive method, this threshold is directly related to the data variation and not to the raw data itself. It is up to you to find a good threshold given the data variation, the outliers, and the expected final result.
Note: you might be interested in using scipy.signal (especially to compute filters).

Is there a way to find the n most distant vectors in an array?

I have an array of thousands of doc2vec vectors with 90 dimensions. For my current purposes I would like to find a way to "sample" the different regions of this vector space, to get a sense of the diversity of the corpus. For example, I would like to partition my space into n regions, and get the most relevant word vectors for each of these regions.
I've tried clustering with hdbscan (after reducing the dimensionality with UMAP) to carve the vector space at its natural joints, but it really doesn't work well.
So now I'm wondering whether there is a way to sample the "far out regions" of the space (n vectors that are most distant from each other).
Would that be a good strategy?
How could I do this?
Many thanks in advance!
Wouldn't a random sample from all vectors necessarily encounter all of the various 'regions' in the set?
If there are "natural joints" and clusters to the documents, some clustering algorithm should be able to find the N clusters, then the smaller number of NxN distances between each cluster's centroid to each other cluster's centroid might identify those "furthest out" clusters.
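A rough sketch of that idea (KMeans is just one possible clustering step, and the number of clusters is a guess you would have to tune):

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

def far_out_clusters(vectors, n_clusters=10):
    # cluster the doc-vectors, then rank clusters by how far their centroid is from the others
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(vectors)
    dist = squareform(pdist(kmeans.cluster_centers_))
    # average centroid-to-centroid distance, most distant clusters first
    order = np.argsort(dist.mean(axis=1))[::-1]
    return kmeans.labels_, order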
Note for any vector, you can use the Doc2Vec doc-vectors most_similar() with a topn value of 0/false-ish to get the (unsorted) similarities to all other model doc-vectors. You could then find the least-similar vectors in that set. If your dataset is small enough for it to be practical to do this for "all" (or some large sampling) of doc-vectors, then perhaps other docs that appear in the "bottom N" least-similar, for the most number of other vectors, would be the most "far out".
Whether this idea of "far out" is actually shown in the data, or useful, isn't clear. (In high-dimensional spaces, everything can be quite "far" from everything else in ways that don't match our 2d/3d intuitions, and slight differences in some vectors being a little "further" might not correspond to useful distinctions.)
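A sketch of that bottom-N counting idea, assuming a gensim 4.x Doc2Vec model (where most_similar(topn=None) returns the unsorted similarities to all doc-vectors); note it is O(n^2) and only practical for modest collections:

import numpy as np

def most_far_out_tags(model, bottom_n=10):
    tags = model.dv.index_to_key
    counts = np.zeros(len(tags), dtype=int)
    for tag in tags:
        # similarities of this doc-vector to every doc-vector in the model
        sims = model.dv.most_similar(positive=[tag], topn=None)
        counts[np.argsort(sims)[:bottom_n]] += 1
    # tags that most often appear in other docs' bottom-N, i.e. the most "far out"
    return [tags[i] for i in np.argsort(counts)[::-1]]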

Machine learning : find the closest results to a queried vector

I have thousands of vectors of about 20 features each.
Given one query vector, and a set of potential matches, I would like to be able to select the best N matches.
I have spent a couple of days trying out regression (using SVM), training my model with a data set I have created myself : each vector is the concatenation of the query vector and a result vector, and I give a score (subjectively evaluated) between 0 and 1, 0 for perfect match, 1 for worst match.
I haven't had great results, and I believe one reason could be that it is very hard to subjectively assign these scores. What would be easier on the other hand is to subjectively rank results (score being an unknown function):
score(query, resultA) > score(query, resultB) > score(query, resultC)
So I believe this is more a problem of Learning to rank and I have found various links for Python:
http://fa.bianp.net/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/
https://gist.github.com/agramfort/2071994
...
but I haven't really been able to understand how it works. I am confused by all the terminology (pairwise ranking, etc.), and since I know nothing about machine learning I feel a bit lost, so I don't understand how to apply this to my problem.
Could someone please help me clarify things, point me to the exact category of problem I am trying to solve, and even better how I could implement this in Python (scikit-learn) ?
It seems to me that what you are trying to do is to simply compute the distances between the query and the rest of your data, then return the closest N vectors to your query. This is a search problem.
There is no ordering, you simply measure the distance between your query and "thousands of vectors". Finally, you sort the distances and take the smallest N values. These correspond to the most similar N vectors to your query.
For increased efficiency at making comparisons, you can use KD-Trees or other efficient search structures: http://scikit-learn.org/stable/modules/neighbors.html#kd-tree
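A minimal sketch of that search with scikit-learn's KD-tree backed nearest-neighbour search (the function name is mine):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def best_matches(query, candidates, n=5):
    # build the search structure once, then query for the n closest vectors (Euclidean by default)
    nn = NearestNeighbors(n_neighbors=n, algorithm='kd_tree').fit(candidates)
    distances, indices = nn.kneighbors(np.asarray(query).reshape(1, -1))
    return indices[0], distances[0]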
Then, take a look at the Wikipedia page on Lp space. Before picking an appropriate metric, you need to think about the data and its representation:
What kind of data are you working with? Where does it come from and what does it represent? Is the feature space comprised of only real numbers, or does it contain binary values, categorical values, or a mix of them? See the Wikipedia article on homogeneous vs. heterogeneous data.
For a real-valued feature space, the Euclidean distance (L2) is usually the metric of choice, and with 20 features you should be fine. Start with this one. Otherwise, you might have to think about cityblock distance (L1) or other metrics such as Pearson's correlation, cosine distance, etc.
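If you want to experiment with different metrics before committing to one, scipy's cdist makes swapping them trivial; a small sketch (the metric names are scipy's, everything else is illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def rank_by_metric(query, candidates, metric='euclidean', n=5):
    # metric can be 'euclidean' (L2), 'cityblock' (L1), 'cosine', 'correlation', ...
    dists = cdist(np.asarray(query).reshape(1, -1), np.asarray(candidates), metric=metric)[0]
    return np.argsort(dists)[:n]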
You might have to do some engineering on the data before you can do anything else.
Are the features on the same scale? e.g. x1 = [0,1], x2 = [0, 100]
If not, then try scaling your features. This is usually a matter of trial and error since some features might be noisy in which case scaling might not help.
To explain this, think about a data set with two features: height and weight. If height is in centimetres (values in the hundreds) and weight is in kilograms (values in the tens), then you should aim to convert the centimetres to metres so both features weigh roughly equally. This is generally a good idea for feature spaces with a wide range of values, provided you have a large sample of values for both features. You'd ideally like to have all your features approximately normally distributed, with only a bit of noise - see the central limit theorem.
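In scikit-learn this scaling step is typically done with StandardScaler (MinMaxScaler is an alternative); a tiny illustrative example with made-up height/weight values:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[180.0, 75.0],    # height in cm, weight in kg
              [165.0, 60.0],
              [172.0, 68.0]])

# zero mean and unit variance per column, so neither feature dominates the distance
X_scaled = StandardScaler().fit_transform(X)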
Are all of the features relevant?
If you are working with real valued data, you can use Principal Component Analysis (PCA) to rank the features and keep only the relevant ones.
Otherwise, you can try feature selection http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
Reducing the dimension of the space increases performance, although it is not critical in your case.
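A minimal PCA sketch (random data as a stand-in for your scaled 20-feature matrix, and the number of kept components is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 20)            # stand-in for your scaled feature matrix
pca = PCA(n_components=10)              # how many components to keep is up to you
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)    # variance explained by each kept component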
If your data consists of continuous, categorical and binary values, then aim to scale or standardize the data. Use your knowledge about the data to come up with an appropriate representation. This is the bulk of the work and is more or less a black art. Trial and error.
As a side note, metric based methods such as knn and kmeans simply store data. Learning begins where memory ends.

Python - Clustering with K-means. Some columns with zero variance

I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these using heatmaps. Each array is pretty sparse, with only about 1-7 of the 20 values being nonzero at each of the 99 positions.
However, I would like to cluster these samples in terms of how similar their frequency profiles are (minimum euclidean distance or something like that). I have arranged each 99x20 array into a 1980x1 array and aggregated them into a 200x1980 observation array.
Before finding the clusters, I have tried whitening the data using scipy.cluster.vq.whiten. whiten normalizes each column by its variance, but due to the way I've flattened my data arrays, I have some (8) columns with all zero frequencies, so the variance is zero. Therefore the whitened array has infinite values and the centroid finding fails (or gives ~200 centroids).
My question is, how should I go about resolving this? So far, I've tried
Don't whiten the data. This causes k-means to give different centroids every time it's run (somewhat expected), despite increasing the iter keyword considerably.
Transposing the arrays before I flatten them. The zero variance columns just shift.
Is it ok to just delete some of these zero variance columns? Would this bias the clustering in any way?
EDIT: I have also tried using my own whiten function which just does
import numpy as np

for i in range(arr.shape[1]):
    if np.abs(arr[:, i].std()) < 1e-8:  # skip (near) zero-variance columns
        continue
    arr[:, i] /= arr[:, i].std()
This seems to work, but I'm not sure if this is biasing the clustering in any way.
Thanks
Removing the columns of all 0's should not bias the data. If you have N-dimensional data, but one dimension is all the same number, it is exactly the same as having (N-1)-dimensional data. This property of effective dimensionality is called rank.
Consider 3-D data, but all of your data points are on the x=0 plane. Can you see how this is exactly the same as 2D data?
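In practice, dropping those columns before whitening is a one-liner; a small sketch (the tolerance is arbitrary):

import numpy as np

def drop_constant_columns(obs, eps=1e-12):
    # keep only the columns whose standard deviation is not (numerically) zero
    return obs[:, obs.std(axis=0) > eps]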
First of all, dropping constant columns is perfectly fine. Obviously they do not contribute information, so no reason to keep them.
However, K-means is not particularly good for sparse vectors. The problem is that most likely the resulting "centroids" will be more similar to each other than to the cluster members.
See, in sparse data, every object is to some extent an outlier. And K-means is quite sensitive to outliers because it tries to minimize the sum of squares.
I suggest that you do the following:
Find a similarity measure that works for your domain. Spend quite a lot of time on this: how to capture similarity for your particular use case.
Once you have that similarity, compute the 200x200 similarity matrix. As your data set is really tiny, you can actually run expensive clustering methods such as hierarchical clustering, that would not scale to thousands of objects. If you want, you could also try OPTICS clustering or DBSCAN. But in particular DBSCAN is actually more interesting if your data set is much larger. For tiny data sets, hierarchical clustering is fine.
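A minimal sketch of that last step with scipy's hierarchical clustering, assuming you have already turned your similarities into a symmetric 200x200 distance matrix (the linkage method and cluster count are just examples):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_distances(dist_matrix, n_clusters=5):
    # linkage expects the condensed (upper-triangular) form of the distance matrix
    condensed = squareform(dist_matrix, checks=False)
    Z = linkage(condensed, method='average')
    # cut the dendrogram into the requested number of clusters
    return fcluster(Z, t=n_clusters, criterion='maxclust')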

Peak-finding algorithm for Python/SciPy

I can write something myself by finding zero-crossings of the first derivative or something, but it seems like a common-enough function to be included in standard libraries. Anyone know of one?
My particular application is a 2D array, but usually it would be used for finding peaks in FFTs, etc.
Specifically, in these kinds of problems, there are multiple strong peaks, and then lots of smaller "peaks" that are just caused by noise that should be ignored. These are just examples; not my actual data:
1-dimensional peaks:
2-dimensional peaks:
The peak-finding algorithm would find the location of these peaks (not just their values), and ideally would find the true inter-sample peak, not just the index with maximum value, probably using quadratic interpolation or something.
Typically you only care about a few strong peaks, so they'd either be chosen because they're above a certain threshold, or because they're the first n peaks of an ordered list, ranked by amplitude.
As I said, I know how to write something like this myself. I'm just asking if there's a pre-existing function or package that's known to work well.
Update:
I translated a MATLAB script and it works decently for the 1-D case, but could be better.
Updated update:
sixtenbe created a better version for the 1-D case.
The function scipy.signal.find_peaks, as its name suggests, is useful for this. But it's important to understand its parameters width, threshold, distance and above all prominence well, to get a good peak extraction.
According to my tests and the documentation, the concept of prominence is "the useful concept" to keep the good peaks, and discard the noisy peaks.
What is (topographic) prominence? It is "the minimum height necessary to descend to get from the summit to any higher terrain", as can be seen here:
The idea is:
The higher the prominence, the more "important" the peak is.
Test:
I used a (noisy) frequency-varying sinusoid on purpose because it shows many difficulties. We can see that the width parameter is not very useful here because if you set a minimum width too high, then it won't be able to track very close peaks in the high frequency part. If you set width too low, you would have many unwanted peaks in the left part of the signal. Same problem with distance. threshold only compares with the direct neighbours, which is not useful here. prominence is the one that gives the best solution. Note that you can combine many of these parameters!
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.sin(2*np.pi*(2**np.linspace(2,10,1000))*np.arange(1000)/48000) + np.random.normal(0, 1, 1000) * 0.15
peaks, _ = find_peaks(x, distance=20)
peaks2, _ = find_peaks(x, prominence=1) # BEST!
peaks3, _ = find_peaks(x, width=20)
peaks4, _ = find_peaks(x, threshold=0.4) # Required vertical distance to its direct neighbouring samples, pretty useless
plt.subplot(2, 2, 1)
plt.plot(peaks, x[peaks], "xr"); plt.plot(x); plt.legend(['distance'])
plt.subplot(2, 2, 2)
plt.plot(peaks2, x[peaks2], "ob"); plt.plot(x); plt.legend(['prominence'])
plt.subplot(2, 2, 3)
plt.plot(peaks3, x[peaks3], "vg"); plt.plot(x); plt.legend(['width'])
plt.subplot(2, 2, 4)
plt.plot(peaks4, x[peaks4], "xk"); plt.plot(x); plt.legend(['threshold'])
plt.show()
I'm looking at a similar problem, and I've found some of the best references come from chemistry (peak finding in mass-spec data). For a good, thorough review of peak-finding algorithms read this. It is one of the clearest reviews of peak-finding techniques that I've run across. (Wavelets are the best for finding peaks of this sort in noisy data.)
It looks like your peaks are clearly defined and aren't hidden in the noise. That being the case, I'd recommend using smooth Savitzky-Golay derivatives to find the peaks (if you just differentiate the data above, you'll have a mess of false positives). This is a very effective technique and is pretty easy to implement (you do need a matrix class with basic operations). If you simply find the zero crossings of the first S-G derivative, I think you'll be happy.
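A short sketch of that recipe with SciPy's built-in Savitzky-Golay filter (the window length and polynomial order are illustrative and must suit your sampling):

import numpy as np
from scipy.signal import savgol_filter

def sg_peaks(y, window=11, polyorder=3):
    # smoothed first derivative of the signal
    dy = savgol_filter(y, window_length=window, polyorder=polyorder, deriv=1)
    # a local maximum is where the derivative crosses zero going from + to -
    return np.where((dy[:-1] > 0) & (dy[1:] <= 0))[0]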
There is a function in SciPy named scipy.signal.find_peaks_cwt which sounds like it is suitable for your needs; however, I don't have experience with it, so I cannot vouch for it.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html
For those not sure which peak-finding algorithms to use in Python, here is a rapid overview of the alternatives: https://github.com/MonsieurV/py-findpeaks
Wanting an equivalent of the MATLAB findpeaks function myself, I've found that the detect_peaks function from Marcos Duarte is a good catch.
Pretty easy to use:
import numpy as np
from vector import vector, plot_peaks   # sample signal and plotting helpers from the py-findpeaks repository
from libs import detect_peaks           # Marcos Duarte's detect_peaks module

print('Detect peaks with minimum height and distance filters.')
indexes = detect_peaks.detect_peaks(vector, mph=7, mpd=2)
print('Peaks are: %s' % (indexes))
Which will give you:
To detect both positive and negative peaks, PeakDetect is helpful.
import numpy as np
import matplotlib.pyplot as plt
from peakdetect import peakdetect

# data is your 1D signal; lookahead is the distance to look ahead from a candidate
# peak to determine if it is the actual peak - change it as necessary for your data
peaks = peakdetect(data, lookahead=20)
higherPeaks = np.array(peaks[0])
lowerPeaks = np.array(peaks[1])
plt.plot(data)
plt.plot(higherPeaks[:, 0], higherPeaks[:, 1], 'ro')
plt.plot(lowerPeaks[:, 0], lowerPeaks[:, 1], 'ko')
Detecting peaks in a spectrum in a reliable way has been studied quite a bit, for example in all the work on sinusoidal modelling for music/audio signals in the '80s. Look for "sinusoidal modeling" in the literature.
If your signals are as clean as the example, a simple "give me something with an amplitude higher than N neighbours" should work reasonably well. If you have noisy signals, a simple but effective way is to look at your peaks over time, to track them: you then detect spectral lines instead of spectral peaks. In other words, you compute the FFT on a sliding window of your signal to get a set of spectra over time (also called a spectrogram). You then look at the evolution of the spectral peaks over time (i.e. in consecutive windows).
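A rough sketch of that spectrogram-based tracking with SciPy (the segment length and prominence are illustrative):

import numpy as np
from scipy.signal import spectrogram, find_peaks

def spectral_peaks_over_time(signal, fs, nperseg=1024, prominence=1.0):
    freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=nperseg)
    peak_tracks = []
    for frame in Sxx.T:                              # one spectrum per time window
        idx, _ = find_peaks(frame, prominence=prominence)
        peak_tracks.append(freqs[idx])               # frequencies of the peaks in this window
    return times, peak_tracks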
There are standard statistical functions and methods for finding outliers in data, which is probably what you need in the first case. Using derivatives would handle your second case. I'm not sure of a method that handles both continuous functions and sampled data, however.
I do not think that what you are looking for is provided by SciPy. I would write the code myself, in this situation.
The spline interpolation and smoothing routines from scipy.interpolate are quite nice and might be helpful for fitting peaks and then finding the location of their maxima.
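For example, fitting a degree-4 interpolating spline lets you read off sub-sample peak locations from the roots of its derivative; a small sketch:

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

def spline_peak_locations(x, y):
    # k=4 so that the derivative is a cubic spline, for which roots() is supported
    spline = InterpolatedUnivariateSpline(x, y, k=4)
    crit = spline.derivative().roots()
    # keep only the critical points that are maxima (negative second derivative)
    return crit[spline.derivative(n=2)(crit) < 0]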
First things first: the definition of "peak" is vague without further specification. For example, for the following series, would you call 5-4-5 one peak or two?
1-2-1-2-1-1-5-4-5-1-1-5-1
In this case, you'll need at least two thresholds: 1) a high threshold, above which an extreme value can register as a peak; and 2) a low threshold, so that extreme values separated by small values below it become two separate peaks.
Peak detection is a well-studied topic in Extreme Value Theory literature, also known as "declustering of extreme values". Its typical applications include identifying hazard events based on continuous readings of environmental variables e.g. analysing wind speed to detect storm events.
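A minimal sketch of that two-threshold declustering (the thresholds themselves are up to you):

import numpy as np

def decluster_peaks(y, high, low):
    peaks = []
    best = None                                   # index of the best sample in the current excursion
    for i, v in enumerate(y):
        if v > high and (best is None or v > y[best]):
            best = i
        if v < low and best is not None:
            peaks.append(best)                    # the excursion ended: register its maximum
            best = None
    if best is not None:
        peaks.append(best)
    return np.array(peaks)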
As mentioned at the bottom of this page, there is no universal definition of a peak. Therefore a universal algorithm that finds peaks cannot work without bringing in additional assumptions (conditions, parameters, etc.). This page provides some of the most stripped-down suggestions. All the literature listed in the answers above describes more or less roundabout ways of doing the same, so feel free to take your pick.
In any case, it is your duty to narrow down the properties a feature needs to have in order to be classified as a peak, based on your experience and properties of spectra (curves) in question (noise, sampling, bandwidths, etc.)
