Peak-finding algorithm for Python/SciPy

I can write something myself by finding zero-crossings of the first derivative or something, but it seems like a common-enough function to be included in standard libraries. Anyone know of one?
My particular application is a 2D array, but usually it would be used for finding peaks in FFTs, etc.
Specifically, in these kinds of problems, there are multiple strong peaks, and then lots of smaller "peaks" that are just caused by noise that should be ignored. These are just examples; not my actual data:
1-dimensional peaks:
2-dimensional peaks:
The peak-finding algorithm would find the location of these peaks (not just their values), and ideally would find the true inter-sample peak, not just the index with maximum value, probably using quadratic interpolation or something.
Typically you only care about a few strong peaks, so they'd either be chosen because they're above a certain threshold, or because they're the first n peaks of an ordered list, ranked by amplitude.
As I said, I know how to write something like this myself. I'm just asking if there's a pre-existing function or package that's known to work well.
Update:
I translated a MATLAB script and it works decently for the 1-D case, but could be better.
Updated update:
sixtenbe created a better version for the 1-D case.

The function scipy.signal.find_peaks, as its name suggests, is useful for this. But to get good peak extraction it's important to understand its parameters width, threshold, distance and, above all, prominence.
According to my tests and the documentation, prominence is the key concept for keeping the good peaks and discarding the noisy ones.
What is (topographic) prominence? It is "the minimum height necessary to descend to get from the summit to any higher terrain", as illustrated here:
The idea is:
The higher the prominence, the more "important" the peak is.
Test:
I deliberately used a (noisy) frequency-varying sinusoid because it illustrates many difficulties. The width parameter is not very useful here: if you set the minimum width too high, it won't be able to track the very close peaks in the high-frequency part, and if you set it too low, you get many unwanted peaks in the left part of the signal. The same problem occurs with distance. threshold only compares a sample with its direct neighbours, which is not useful here. prominence is the one that gives the best solution. Note that you can combine many of these parameters!
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.sin(2*np.pi*(2**np.linspace(2,10,1000))*np.arange(1000)/48000) + np.random.normal(0, 1, 1000) * 0.15
peaks, _ = find_peaks(x, distance=20)
peaks2, _ = find_peaks(x, prominence=1) # BEST!
peaks3, _ = find_peaks(x, width=20)
peaks4, _ = find_peaks(x, threshold=0.4) # Required vertical distance to its direct neighbouring samples, pretty useless
plt.subplot(2, 2, 1)
plt.plot(peaks, x[peaks], "xr"); plt.plot(x); plt.legend(['distance'])
plt.subplot(2, 2, 2)
plt.plot(peaks2, x[peaks2], "ob"); plt.plot(x); plt.legend(['prominence'])
plt.subplot(2, 2, 3)
plt.plot(peaks3, x[peaks3], "vg"); plt.plot(x); plt.legend(['width'])
plt.subplot(2, 2, 4)
plt.plot(peaks4, x[peaks4], "xk"); plt.plot(x); plt.legend(['threshold'])
plt.show()

I'm looking at a similar problem, and I've found some of the best references come from chemistry (peak finding in mass-spec data). For a good, thorough review of peak-finding algorithms, read this; it is one of the clearest reviews of peak-finding techniques I've run across. (Wavelets are the best for finding peaks of this sort in noisy data.)
It looks like your peaks are clearly defined and aren't hidden in the noise. In that case I'd recommend using smooth Savitzky-Golay derivatives to find the peaks (if you just differentiate the data above, you'll get a mess of false positives). This is a very effective technique and is pretty easy to implement (you do need a matrix class with basic operations). If you simply find the zero crossings of the first S-G derivative, I think you'll be happy.
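Here is a minimal sketch of that approach, assuming you use scipy.signal.savgol_filter for the smoothed derivative (so no hand-rolled matrix class is needed); the test signal and the window/polynomial settings are placeholders you would tune to your data.
import numpy as np
from scipy.signal import savgol_filter
# Placeholder signal: three Gaussian bumps plus noise.
t = np.linspace(0, 10, 1000)
x = (np.exp(-(t - 2)**2 / 0.1) + 0.7 * np.exp(-(t - 5)**2 / 0.2)
     + 0.4 * np.exp(-(t - 8)**2 / 0.05) + np.random.normal(0, 0.02, t.size))
# Smoothed first derivative: window length and polynomial order are the tuning knobs.
d1 = savgol_filter(x, window_length=31, polyorder=3, deriv=1)
# Peaks sit where the smoothed derivative crosses zero going from + to -.
sign_change = (d1[:-1] > 0) & (d1[1:] <= 0)
peak_indices = np.where(sign_change)[0]
# Optionally discard small crossings by requiring a minimum height.
peak_indices = peak_indices[x[peak_indices] > 0.2]
print(peak_indices, x[peak_indices])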

There is a function in SciPy named scipy.signal.find_peaks_cwt which sounds suitable for your needs; however, I don't have experience with it, so I can't vouch for it.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html
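For what it's worth, a minimal usage sketch might look like the following; the test signal and the widths range are only illustrative, and widths should roughly bracket the expected peak widths in samples.
import numpy as np
from scipy.signal import find_peaks_cwt
# Placeholder signal: a few bumps of different widths plus noise.
t = np.linspace(0, 10, 1000)
x = (np.exp(-(t - 2)**2 / 0.2) + 0.6 * np.exp(-(t - 5)**2 / 0.5)
     + 0.8 * np.exp(-(t - 8)**2 / 0.1) + np.random.normal(0, 0.05, t.size))
# 'widths' is the range of peak widths (in samples) the wavelet transform is matched against.
peak_indices = find_peaks_cwt(x, widths=np.arange(5, 80))
print(peak_indices, t[peak_indices])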

For those unsure which peak-finding algorithm to use in Python, here is a quick overview of the alternatives: https://github.com/MonsieurV/py-findpeaks
Wanting an equivalent of the MATLAB findpeaks function myself, I've found that the detect_peaks function from Marcos Duarte is a good choice.
Pretty easy to use:
import numpy as np
from vector import vector, plot_peaks  # sample data and plotting helpers shipped with the py-findpeaks examples
from libs import detect_peaks          # Marcos Duarte's detect_peaks module
print('Detect peaks with minimum height and distance filters.')
indexes = detect_peaks.detect_peaks(vector, mph=7, mpd=2)
print('Peaks are: %s' % (indexes))
Which will give you:

To detect both positive and negative peaks, PeakDetect is helpful.
import numpy as np
import matplotlib.pyplot as plt
from peakdetect import peakdetect
# data is your 1-D signal array.
peaks = peakdetect(data, lookahead=20)
# lookahead is the distance to look ahead from a candidate to determine if it is the actual peak.
# Change lookahead as necessary.
higherPeaks = np.array(peaks[0])   # maxima as (index, value) pairs
lowerPeaks = np.array(peaks[1])    # minima as (index, value) pairs
plt.plot(data)
plt.plot(higherPeaks[:, 0], higherPeaks[:, 1], 'ro')
plt.plot(lowerPeaks[:, 0], lowerPeaks[:, 1], 'ko')

Detecting peaks in a spectrum in a reliable way has been studied quite a bit, for example in all the work on sinusoidal modelling for music/audio signals in the 1980s. Look for "sinusoidal modeling" in the literature.
If your signals are as clean as the example, a simple "give me something with an amplitude higher than its N neighbours" should work reasonably well. If you have noisy signals, a simple but effective approach is to look at your peaks in time, i.e. to track them: you then detect spectral lines instead of spectral peaks. In other words, you compute the FFT on a sliding window of your signal to get a set of spectra over time (also called a spectrogram), and then look at the evolution of each spectral peak across consecutive windows.
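As a rough sketch of that idea (not anyone's production code), you could compute the spectrogram with scipy.signal.spectrogram and pick the prominent bins per frame; the signal and the STFT settings below are placeholders.
import numpy as np
from scipy.signal import spectrogram, find_peaks
fs = 8000
t = np.arange(0, 2, 1 / fs)
# Placeholder signal: two tones plus noise.
sig = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t) + np.random.normal(0, 0.3, t.size)
# Spectrogram: a set of spectra over time (sliding-window FFT).
f, times, Sxx = spectrogram(sig, fs=fs, nperseg=512, noverlap=384)
# For each time frame, keep the spectral peaks that stand out from the noise floor.
tracks = []
for i in range(Sxx.shape[1]):
    frame = Sxx[:, i]
    peaks, _ = find_peaks(frame, prominence=frame.max() * 0.1)
    tracks.append(f[peaks])   # frequencies of the peaks in this frame
# A peak that persists across consecutive frames is a spectral line, not noise.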

There are standard statistical functions and methods for finding outliers in data, which is probably what you need in the first case. Using derivatives would solve your second. I'm not sure of a method that handles both continuous functions and sampled data, however.

I do not think that what you are looking for is provided by SciPy. I would write the code myself in this situation.
The spline interpolation and smoothing from scipy.interpolate are quite nice and might be helpful in fitting peaks and then finding the locations of their maxima.
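As an illustration of that suggestion, here is a small sketch using scipy.interpolate.UnivariateSpline: fit a smoothing spline of degree 4 so that its derivative is cubic and its roots (the candidate extrema) can be found directly. The data and the smoothing factor are placeholders.
import numpy as np
from scipy.interpolate import UnivariateSpline
# Placeholder data: a noisy peak sampled on a coarse grid.
x = np.linspace(0, 10, 50)
y = np.exp(-(x - 4.3)**2 / 2.0) + np.random.normal(0, 0.02, x.size)
# Smoothing spline; 's' controls the trade-off between smoothness and fidelity.
spline = UnivariateSpline(x, y, k=4, s=0.05)
# The extrema of the spline are the roots of its (cubic) derivative.
crit = spline.derivative().roots()
crit = crit[spline.derivative(2)(crit) < 0]   # keep only maxima
peak_x = crit[np.argmax(spline(crit))]
print("sub-sample peak location:", peak_x)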

First things first, the definition of "peak" is vague if without further specifications. For example, for the following series, would you call 5-4-5 one peak or two?
1-2-1-2-1-1-5-4-5-1-1-5-1
In this case, you'll need at least two thresholds: 1) a high threshold, above which an extreme value registers as a peak; and 2) a low threshold, so that extreme values separated by small values below it count as two separate peaks.
Peak detection is a well-studied topic in the Extreme Value Theory literature, also known as "declustering of extreme values". Typical applications include identifying hazard events from continuous readings of environmental variables, e.g. analysing wind speed to detect storm events.
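A minimal sketch of the two-threshold (declustering) idea applied to the series above; the threshold values and the helper function name are just illustrative.
import numpy as np
def decluster_peaks(x, high=4, low=2):
    # Split the series into clusters of values above `low`, and report one
    # peak per cluster, but only if the cluster exceeds `high`.
    x = np.asarray(x)
    above = x > low
    peaks = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                       # a new cluster begins
        elif not flag and start is not None:
            cluster = x[start:i]
            if cluster.max() > high:        # only register clusters with a real extreme
                peaks.append(start + int(np.argmax(cluster)))
            start = None
    if start is not None and x[start:].max() > high:
        peaks.append(start + int(np.argmax(x[start:])))
    return peaks
series = [1, 2, 1, 2, 1, 1, 5, 4, 5, 1, 1, 5, 1]
print(decluster_peaks(series, high=4, low=2))   # 5-4-5 counts as one peak, the last 5 as another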

As mentioned at the bottom of this page, there is no universal definition of a peak. Therefore a universal peak-finding algorithm cannot work without bringing in additional assumptions (conditions, parameters, etc.). This page provides some of the most stripped-down suggestions. All the literature listed in the answers above is a more or less roundabout way of doing the same, so feel free to take your pick.
In any case, it is up to you to narrow down the properties a feature needs to have in order to be classified as a peak, based on your experience and the properties of the spectra (curves) in question (noise, sampling, bandwidths, etc.).

Related

Approximate maximum of an unknown curve

I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough, but since this function finds local maxima it does not ignore the noise in the data, which causes overshoot. So what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure you can dismiss those points as outliers so easily, since they look close to where I would expect them to be. But if you don't think they are a valid approximation, let me suggest three other approaches you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around the peaks. You can, for instance, suppose that the shape of the plot is the sum of some background model (maybe constant, maybe more complicated) plus some Gaussian (or Lorentzian) peaks.
This is what we usually do in physics. Of course it will be more accurate if you use knowledge of the underlying processes, which I don't have.
With a good model, this approach is definitely better than just taking the maximum values: even if those points are not outliers, they still carry errors that you want to reduce.
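As a sketch of this first option (assuming a constant background plus a single Gaussian; the data, model and initial guesses are placeholders):
import numpy as np
from scipy.optimize import curve_fit
def model(x, bg, amp, mu, sigma):
    # Constant background plus one Gaussian peak.
    return bg + amp * np.exp(-(x - mu)**2 / (2 * sigma**2))
# Placeholder data: one noisy peak.
x = np.linspace(0, 10, 200)
y = 0.5 + 2.0 * np.exp(-(x - 6.2)**2 / (2 * 0.4**2)) + np.random.normal(0, 0.1, x.size)
# Initial guesses: background level, peak height, rough peak position, rough width.
p0 = [np.median(y), y.max() - np.median(y), x[np.argmax(y)], 0.5]
popt, pcov = curve_fit(model, x, y, p0=p0)
perr = np.sqrt(np.diag(pcov))   # 1-sigma parameter uncertainties
print("peak position: %.3f +/- %.3f" % (popt[2], perr[2]))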
Second option
But if you want an easier way, just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum. The functions np.where or np.argwhere tend to be useful for this kind of thing.
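A quick sketch of this second option; the window half-width and the data are placeholders:
import numpy as np
# y is the data array and peak_index one of the indices found earlier (placeholders).
y = np.exp(-(np.arange(100) - 40)**2 / 50.0) + np.random.normal(0, 0.05, 100)
peak_index = int(np.argmax(y))
half_width = 3   # how many samples on each side to include
idx = np.argwhere(np.abs(np.arange(y.size) - peak_index) <= half_width).ravel()
peak_height_estimate = y[idx].mean()
print(peak_height_estimate)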
Third option
The easiest option is reading off the value by hand. That may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic approach, and this is SO. But in the end, it depends on why and for what you need those values, and on the confidence interval you need for your measurement.

Fitting multiple Lorentzians to data using scipy in Python3

Okay, so I appreciate this will require a bit of patience, but bear with me. I'm analysing some Raman spectra and have written the basis of a program that uses SciPy's curve_fit to fit multiple Lorentzians to the peaks in my data. The trick is that I have so much data that I want the program to identify initial guesses for the Lorentzians automatically, rather than doing it manually. On the whole, the program gives it a good go (and might be of use to others with a similar use case and simpler data), but I don't know SciPy well enough to tune curve_fit to make it work on many different examples.
Code repo here: https://github.com/btjones-me/raman_spectroscopy
An example of it working well can be seen in fig 1.
Part of the problem is my peak finding algorithm, which sometimes struggles to find the appropriate starting positions for each lorentzian. You can see this in fig 2.
The next problem is that, for some reason, curve_fit occasionally catastrophically diverges (my best guess is due to rounding errors). You can see this behaviour in fig 3.
Finally, while I usually make good guesses for the height and x position of each Lorentzian, I haven't found a good way of predicting the width (FWHM) of the curves. Predicting this might help curve_fit.
If anybody can help with either of these problems in any way I would appreciate it greatly. I'm open to any other methods or suggestions including additional third party libraries, so long as they improve upon the current implementation. Many thanks to anybody who attempts this one!
Here it is working exactly as I intend:
Below you can see the peak-finding method has failed to identify all the peaks. There are many peak-finding algorithms, but the one I use is SciPy's find_peaks_cwt() (it's not usually this bad; this is an extreme case).
Here it's just totally off. This happens fairly often and I can't really understand why, nor stop it from happening. Possibly it occurs when I tell it to find more or fewer peaks than are present in the spectrum, but that's just a guess.
I've done this in Python 3.5.2. PS I know I won't be winning any medals for code layout, but as always comments on code style and better practices are welcomed.
I stumbled across this because I'm attempting to solve the exact same problem; here is my solution. For each peak, I only fit my Lorentzian in the region of the domain within plus or minus half the distance to the next closest peak. Here is my function that breaks up the domain:
import numpy as np

def get_local_indices(peak_indices):
    # Returns the array indices of the points closest to each peak (distance to the next peak / 2).
    chunks = []
    for i in range(len(peak_indices)):
        peak_index = peak_indices[i]
        if i > 0 and i < len(peak_indices) - 1:
            distance = min(abs(peak_index - peak_indices[i+1]), abs(peak_indices[i-1] - peak_index))
        elif i == 0:
            distance = abs(peak_index - peak_indices[i+1])
        else:
            distance = abs(peak_indices[i-1] - peak_index)
        min_index = peak_index - int(distance / 2)
        max_index = peak_index + int(distance / 2)
        indices = np.arange(min_index, max_index)
        chunks.append(indices)
    return chunks
And here is a picture of the resulting graph; the dashed lines indicate the regions in which the Lorentzians are fitted:
Lorentzian Fitting Plot
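This is not the poster's code, but a sketch of how each chunk returned by get_local_indices might then be fitted with its own Lorentzian using curve_fit; the initial-guess heuristics are only illustrative.
import numpy as np
from scipy.optimize import curve_fit
def lorentzian(x, amp, x0, gamma, offset):
    # Single Lorentzian peak on a constant offset; gamma is the half-width at half-maximum.
    return offset + amp * gamma**2 / ((x - x0)**2 + gamma**2)
def fit_chunks(x, y, peak_indices, chunks):
    # Fit one Lorentzian per chunk (x, y are the full spectrum arrays).
    fits = []
    for peak_index, indices in zip(peak_indices, chunks):
        xs, ys = x[indices], y[indices]
        p0 = [y[peak_index] - ys.min(),   # height guess
              x[peak_index],              # position guess
              (xs[-1] - xs[0]) / 4,       # width guess from the chunk size
              ys.min()]                   # offset guess
        try:
            popt, _ = curve_fit(lorentzian, xs, ys, p0=p0, maxfev=5000)
            fits.append(popt)
        except RuntimeError:              # the occasional diverging fit
            fits.append(None)
    return fits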

Interpolation algorithm to correct a slight clock drift

I have some sampled (univariate) data - but the clock driving the sampling process is inaccurate - resulting in a random slip of (less than) 1 sample every 30. A more accurate clock at approximately 1/30 of the frequency provides reliable samples for the same data ... allowing me to establish a good estimate of the clock drift.
I am looking to interpolate the sampled data to correct for this so that I 'fit' the high frequency data to the low-frequency. I need to do this 'real time' - with no more than the latency of a few low-frequency samples.
I recognise that there is a wide range of interpolation algorithms - and, among those I've considered, a spline based approach looks most promising for this data.
I'm working in Python - and have found the scipy.interpolate package - though I could see no obvious way to use it to 'stretch' n samples to correct a small timing error. Am I overlooking something?
I am interested in pointers to either a suitable published algorithm, or - ideally - a Python library function to achieve this sort of transform. Is this supported by SciPy (or anything else)?
UPDATE...
I'm beginning to realise that what at first seemed a trivial problem isn't as straightforward as I first thought. I am no longer convinced that naive use of splines will suffice. I've also realised that my problem can be better described without reference to 'clock drift', like this:
A single random variable is sampled at two different frequencies - one low and one high, with no common divisor - e.g. 5 Hz and 144 Hz. If we assume sample 0 is identical at both sample rates, sample 1 at 5 Hz falls between samples 28 and 29 at 144 Hz. I want to construct a new series - at 720 Hz, say - that fits all the known data points "as smoothly as possible".
I had hoped to find an 'out of the box' solution.
Before you can ask the programming question, it seems to me you need to investigate a more fundamental scientific one.
Before you start picking out particular equations to fit badfastclock to goodslowclock, you should investigate the nature of the drift. Let both clocks run for a while and look at their points together. Is badfastclock bad because it drifts linearly away from real time? If so, a simple quadratic equation should fit badfastclock to goodslowclock, just as a quadratic equation describes the position of a uniformly accelerating object; i.e., if badfastclock is accelerating linearly away from real time, you can deterministically shift it back toward real time. However, if you find that badfastclock is bad because it is jumping around, then smooth curves - even complex smooth curves like splines - won't fit. You must understand the data before trying to manipulate it.
Based on your updated question: if the data is smooth in time, just place all the samples on one time trace and interpolate on the sparse grid (time).
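A minimal sketch of that suggestion, assuming you have already estimated the drift factor from the two clocks; the rates, the drift value and the test signal are placeholders.
import numpy as np
from scipy.interpolate import interp1d
drift = 1.0005                                   # hypothetical clock-rate error estimated from the two clocks
t_high = np.arange(0, 2, 1 / 144.0) * drift      # drift-corrected fast-clock sample times
t_low = np.arange(0, 2, 1 / 5.0)                 # trusted slow-clock sample times
signal = lambda t: np.sin(2 * np.pi * 1.3 * t)   # stand-in for the measured variable
y_high, y_low = signal(t_high), signal(t_low)
# Place all samples on one common (sorted, de-duplicated) time axis.
t_all, idx = np.unique(np.concatenate([t_low, t_high]), return_index=True)
y_all = np.concatenate([y_low, y_high])[idx]
# Resample onto a regular 720 Hz grid; a cubic interpolant keeps the result smooth.
t_720 = np.arange(t_all[0], t_all[-1], 1 / 720.0)
y_720 = interp1d(t_all, y_all, kind='cubic')(t_720)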

Clustering of 1D signal

I've got several 1D signals, showing two or more bands. An example is shown below.
I need to extract the datapoints belonging to a single band.
My first simple approach was taking a moving average of the data and getting the indices where the data is larger than the average.
import scipy.ndimage

def seperate(x):
    # Moving average via a Gaussian filter.
    average = scipy.ndimage.gaussian_filter(x, 10)
    # Boolean mask marking the samples of the upper band.
    idx = x > average
    # Return masks for the upper and lower band.
    return idx, ~idx
Plotting these and the average curve looks like this, where red denotes the upper and blue the lower band:
This works quite well for this example, but fails when more than two bands are present and/or the bands are not that well separated.
I'm looking for a more robust and general solution. I was looking into scikit-learn and was wondering if one of the clustering algorithms can be used to achieve this.
Have a look at time-series similarity measures.
Indeed, the binary thresholding you tried there is sometimes called "threshold crossing", and there are many more such measures.
In general, there is no "one size fits all" time series similarity. Different types of signals require different measures. This can probably best be seen by the fact that some are much better analyzed after FFT, while for others FFT makes absolutely no sense.

Curve_fit not converging means...?

I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation from the best match in the catalogue for each object in my list. My initial list is supposed to contain the positions of known objects, but it can happen that an object is not detected in the catalogue, and my coordinates may suffer from small offsets.
The way I am computing the maximum radius is by fitting the Gaussian kernel density of the separations with a Gaussian, and using the centre + 3 sigma value. The method works nicely for most cases, but when a small subsample of my list has an offset, I get two Gaussians instead. In those cases, I specify the max radius in a different way.
My problem is that when this happens, curve_fit usually can't do the fit with one Gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and explain in which cases the "different way" is used. Could someone give me a hand with what this means in mathematical terms?
There are varying lengths to which you can go in justifying this or that fitting ansatz, which strongly depends on the details of your specific case (e.g. why do you expect a Gaussian to work in the first place? how deeply do you need or want to delve into why exactly a certain fitting procedure fails, and what exactly counts as a failure, etc.).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you are looking for is a way of justifying why, in a certain case, a Gaussian is not a good fitting ansatz, one option is to calculate the moments: for a Gaussian distribution the 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that the relations between the moments of your data are very different, it sounds reasonable to conclude that the data can't be fit by a Gaussian.
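As a small sketch of that moments check (the sample data below are placeholders): for a Gaussian the skewness and the excess kurtosis are both zero, and scipy.stats also provides a formal normality test whose p-value you can quote.
import numpy as np
from scipy import stats
# Placeholder sample: a mixture of two offsets, i.e. the "two gaussians" case.
separations = np.concatenate([np.random.normal(0.3, 0.1, 800),
                              np.random.normal(1.5, 0.1, 100)])
skew = stats.skew(separations)
kurt = stats.kurtosis(separations)   # excess kurtosis: 0 for a true Gaussian
print("skewness = %.2f, excess kurtosis = %.2f" % (skew, kurt))
# A formal normality test gives a p-value to report.
stat, pvalue = stats.normaltest(separations)
print("D'Agostino-Pearson p-value = %.3g" % pvalue)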
