Fitting multiple Lorentzians to data using scipy in Python3 - python

Okay so I appreciate this will require a bit of patience but bear with me. I'm analysing some Raman spectra and have written the basis of a program that uses Scipy's curve_fit to fit multiple Lorentzians to the peaks in my data. The trick is that I have so much data that I want the program to automatically identify initial guesses for the Lorentzians, rather than specifying them manually. On the whole, the program gives it a good go (and might be of use to others with a similar use case with simpler data), but I don't know Scipy well enough to tune curve_fit to make it work on many different examples.
Code repo here: https://github.com/btjones-me/raman_spectroscopy
An example of it working well can be seen in fig. 1.
Part of the problem is my peak finding algorithm, which sometimes struggles to find the appropriate starting positions for each lorentzian. You can see this in fig 2.
The next problem is that, for some reason, curve_fit occasionally catastrophically diverges (my best guess is due to rounding errors). You can see this behaviour in fig 3.
Finally while I usually make good guesses with the height and x position of each lorentzian, I haven't found a good way of predicting the width, or FWHM of the curves. Predicting this might help curve_fit.
If anybody can help with either of these problems in any way I would appreciate it greatly. I'm open to any other methods or suggestions including additional third party libraries, so long as they improve upon the current implementation. Many thanks to anybody who attempts this one!
Here it is working exactly as I intend:
Below you can see the peak finding method has failed to identify all the peaks. There are many peak finding algorithms, but the one I use is Scipy's 'find_peaks_cwt()' (it's not usually this bad, this is an extreme case).
Here it's just totally way off. This happens fairly often and I can't really understand why, nor stop it from happening. Possibly it occurs when I tell it to find more/fewer peaks than are present in the spectrum, but that's just a guess.
I've done this in Python 3.5.2. PS I know I won't be winning any medals for code layout, but as always comments on code style and better practices are welcomed.

I stumbled across this because I'm attempting to solve the exact same problem, so here is my solution. For each peak, I only fit my lorentzian in the region of the domain ± half the distance to the next closest peak. Here is my function that breaks up the domain:
import numpy as np

def get_local_indices(peak_indices):
    # returns the array indices of the points closest to each peak
    # (within half the distance to the nearest neighbouring peak)
    chunks = []
    for i in range(len(peak_indices)):
        peak_index = peak_indices[i]
        if 0 < i < len(peak_indices) - 1:
            distance = min(abs(peak_index - peak_indices[i + 1]),
                           abs(peak_indices[i - 1] - peak_index))
        elif i == 0:
            distance = abs(peak_index - peak_indices[i + 1])
        else:
            distance = abs(peak_indices[i - 1] - peak_index)
        min_index = peak_index - int(distance / 2)
        max_index = peak_index + int(distance / 2)
        chunks.append(np.arange(min_index, max_index))
    return chunks
And here is a picture of the resulting graph; the dashed lines indicate which areas the lorentzians are fit in:
Lorentzian Fitting Plot
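For completeness, here is a minimal sketch of how those chunks might feed into curve_fit. The lorentzian model and the x_data, y_data and peak_indices arrays are illustrative assumptions, not part of the original code:
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, x0, gamma, amp):
    # single Lorentzian centred at x0 with half-width gamma and peak height amp
    return amp * gamma**2 / ((x - x0)**2 + gamma**2)

fits = []
for chunk in get_local_indices(peak_indices):   # peak_indices, x_data, y_data assumed to exist
    xs, ys = x_data[chunk], y_data[chunk]
    p0 = [xs[len(xs) // 2],            # centre guess: the peak sits in the middle of its chunk
          (xs[-1] - xs[0]) / 4,        # rough half-width guess from the chunk size
          ys.max()]                    # height guess
    popt, _ = curve_fit(lorentzian, xs, ys, p0=p0)
    fits.append(popt)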

Related

Is having a large outlier value going to be a problem for lightgbm?

I have a classification task at hand. I'm using lightgbm for that.
I have a particular feature whose histogram looks like the one below:
Basically, all values are nicely concentrated on the left, with a few values on the right.
LightGBM uses approximate splits rather than exact ones; it therefore has to build a histogram and find bin edges.
Does anyone happen to know how exactly the bin ranges are defined? I'm asking because if not enough bins are built, the whole variable becomes useless, since too many values get crammed into the lower bins. Ultimately, the question is: what are the consequences of having a few very high values in a column?
Also, this seems to be a relevant piece of code, but I'm not good enough with C++ to read and understand it relatively quickly.
UPD: Just to clarify, it's a poor visualization here. The largest number is 5.02e03, not 6265.02e03
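Not an answer to how the edges are chosen internally, but a small sketch of the knob that governs the histogram: max_bin caps the number of bins LightGBM builds per feature. The synthetic data here is only for illustration:
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 9990), rng.normal(5000, 10, 10)]).reshape(-1, 1)  # a few huge values
y = rng.integers(0, 2, len(x))

# max_bin is the parameter that caps how many histogram bins are built per feature
train = lgb.Dataset(x, label=y, params={"max_bin": 255})
booster = lgb.train({"objective": "binary", "verbosity": -1}, train, num_boost_round=10)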

How to set good parameters for clustering high-density data with DBSCAN?

I want to cluster some stars based on their positions (X, Y, Z) using DBSCAN, but I do not know how to tune the parameters to get a sensible number of clusters to plot afterwards.
This is what the data looks like.
What are the right parameters for these data?
The number of rows is 1.202672e+06 (about 1.2 million).
import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.read_csv('datasets/full_dataset.csv')
clusters = DBSCAN(eps=0.5, min_samples=40, metric="euclidean", algorithm="auto")
clusters.fit(data[["X", "Y", "Z"]])  # column names assumed from the description above
min_samples is arguably one of the tougher ones to choose, but you can decide that by just looking at the results and deciding how much noise you are okay with.
Choosing eps can be aided by running k-NN to understand the density distribution of your data; I believe the DBSCAN paper discusses this in more detail. There might even be a way to plot this in Python (in R it is kNNdistplot).
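As a rough sketch of that k-distance idea (not an official recipe; X stands for your (n_samples, 3) array of positions and k mirrors min_samples):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 40
nbrs = NearestNeighbors(n_neighbors=k).fit(X)     # X: (n_samples, 3) positions, assumed
distances, _ = nbrs.kneighbors(X)
# last column ~ distance to the k-th neighbour (the query point itself counts as the first)
kth = np.sort(distances[:, -1])

plt.plot(kth)
plt.xlabel("points sorted by k-NN distance")
plt.ylabel("distance to %d-th nearest neighbour" % k)
plt.show()  # the height of the 'elbow' in this curve is a candidate eps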
I would prefer to use OPTICS, which essentially runs all eps values simultaneously. However, I haven't found a decent implementation of it in either Python or R. In fact, there is an incorrect implementation in Python which doesn't follow the original OPTICS paper at all.
If you really want to use OPTICS, I recommend the Java implementation available in ELKI.
If anyone else has heard of a proper python implementation, I'd love to hear it.
If you want to go the trial-and-error route, start with a much smaller eps and go from there.

Smart way to detect a point that lies too far from a row of points?

I'm working on a Python script whose goal is to detect whether a point lies off a row of points (GPS readings from an agricultural machine).
The input data are shapefiles, and I use the GeoPandas library for all the geoprocessing.
My first idea was to make a buffer around the two points neighbouring the considered point and then check whether my point falls inside the buffer, but the results aren't good.
So I wonder whether there is a smarter mathematical method, maybe with a scikit library. Can somebody help me?
Try ArcGIS: build two new attributes with the X and Y coordinates, then calculate the distance between the points you want.
The question is kinda vague, but my guess would be to fit an approximation/regression line (numpy.polyfit of 2nd degree, I believe) and take the points with the largest distance from the line, probably with a threshold relative to the overall fit loss.
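A minimal sketch of that idea, assuming pts is an (n, 2) array of the row's x, y coordinates and using an arbitrary 3-sigma cut-off:
import numpy as np

x, y = pts[:, 0], pts[:, 1]               # pts: (n, 2) array of positions along the row, assumed
coeffs = np.polyfit(x, y, deg=2)          # fit a 2nd-degree polynomial to the row
residuals = y - np.polyval(coeffs, x)     # vertical distance of each point from the fitted curve
threshold = 3 * residuals.std()           # flag points further than 3 sigma from the fit
outliers = np.where(np.abs(residuals) > threshold)[0]
print(outliers)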

How to calculate a hanging rope from points of different heights

Does anyone know how to calculate how far a curve should dip down at a certain point, given the two points it's connected to (and maybe the weight, but that doesn't matter as much)?
From what I've been looking up, all the formulas involve tension and a few other things, but I only really need to know the coordinates, and can presume these are all constant.
This is the formula I have at the moment; it's very basic, so it would only work with two points at the same height. One method is from Wikipedia and the other I got from somewhere else; unfortunately they give different results, so I've no idea which is correct.
import math

x = [-2, 2]
a = 1.0
for i in range(x[0], x[1] + 1):
    # method 1
    y1 = a * math.cosh(i / a) - a
    # method 2 (from wikipedia); note (a/2)*(e**(i/a) + e**(-i/a)) equals a*math.cosh(i/a),
    # so the two methods differ only by the constant vertical offset a
    y2 = (a / 2) * (math.exp(i / a) + math.exp(-i / a))
Oh also, it's for Maya, so it'd be a huge help if it could be done with normal Python functions without having to install scipy.optimize.
Thanks :)
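Not a definitive answer, but here is a standard-library-only sketch (no SciPy, so Maya-friendly) that fits a catenary y = a*cosh((x - x0)/a) + c through two endpoints at different heights. It assumes the rope length is known, since the two endpoints alone don't pin the curve down:
import math

def fit_catenary(p1, p2, length):
    # Returns (a, x0, c) such that y = a*cosh((x - x0)/a) + c passes through
    # p1 and p2 with arc length `length` between them.
    (x1, y1), (x2, y2) = p1, p2
    h, v = x2 - x1, y2 - y1
    if length <= math.hypot(h, v):
        raise ValueError("rope must be longer than the straight-line distance")
    # Solve 2*a*sinh(h/(2*a)) = sqrt(length**2 - v**2) for a; the left side
    # decreases as a grows, so plain bisection works (no scipy.optimize needed).
    target = math.sqrt(length ** 2 - v ** 2)
    f = lambda a: 2.0 * a * math.sinh(h / (2.0 * a)) - target
    lo, hi = 1e-9, 1.0
    while f(hi) > 0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    x0 = 0.5 * (x1 + x2) - a * math.atanh(v / length)
    c = y1 - a * math.cosh((x1 - x0) / a)
    return a, x0, c

# Example: endpoints at different heights, rope 6 units long.
a, x0, c = fit_catenary((-2.0, 0.0), (2.0, 1.0), 6.0)
for xi in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(xi, a * math.cosh((xi - x0) / a) + c)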

Peak-finding algorithm for Python/SciPy

I can write something myself by finding zero-crossings of the first derivative or something, but it seems like a common-enough function to be included in standard libraries. Anyone know of one?
My particular application is a 2D array, but usually it would be used for finding peaks in FFTs, etc.
Specifically, in these kinds of problems, there are multiple strong peaks, and then lots of smaller "peaks" that are just caused by noise that should be ignored. These are just examples; not my actual data:
1-dimensional peaks:
2-dimensional peaks:
The peak-finding algorithm would find the location of these peaks (not just their values), and ideally would find the true inter-sample peak, not just the index with maximum value, probably using quadratic interpolation or something.
Typically you only care about a few strong peaks, so they'd either be chosen because they're above a certain threshold, or because they're the first n peaks of an ordered list, ranked by amplitude.
As I said, I know how to write something like this myself. I'm just asking if there's a pre-existing function or package that's known to work well.
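(As an aside, here is a minimal sketch of the parabolic-interpolation refinement mentioned above; the test array is made up:)
import numpy as np

def refine_peak(y, i):
    # Fit a parabola through the samples at i-1, i, i+1 and return the
    # estimated inter-sample peak position and height.
    a, b, c = y[i - 1], y[i], y[i + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # lies in (-0.5, 0.5) for a true local maximum
    return i + offset, b - 0.25 * (a - c) * offset

y = np.array([0.0, 1.0, 3.0, 2.5, 0.5])
print(refine_peak(y, int(np.argmax(y))))       # peak lands between samples 2 and 3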
Update:
I translated a MATLAB script and it works decently for the 1-D case, but could be better.
Updated update:
sixtenbe created a better version for the 1-D case.
The function scipy.signal.find_peaks, as its name suggests, is useful for this. But it's important to understand its parameters width, threshold, distance and above all prominence well to get a good peak extraction.
According to my tests and the documentation, the concept of prominence is "the useful concept" to keep the good peaks, and discard the noisy peaks.
What is (topographic) prominence? It is "the minimum height necessary to descend to get from the summit to any higher terrain", as can be seen here:
The idea is:
The higher the prominence, the more "important" the peak is.
Test:
I used a (noisy) frequency-varying sinusoid on purpose because it shows many difficulties. We can see that the width parameter is not very useful here because if you set a minimum width too high, then it won't be able to track very close peaks in the high frequency part. If you set width too low, you would have many unwanted peaks in the left part of the signal. Same problem with distance. threshold only compares with the direct neighbours, which is not useful here. prominence is the one that gives the best solution. Note that you can combine many of these parameters!
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
x = np.sin(2*np.pi*(2**np.linspace(2,10,1000))*np.arange(1000)/48000) + np.random.normal(0, 1, 1000) * 0.15
peaks, _ = find_peaks(x, distance=20)
peaks2, _ = find_peaks(x, prominence=1) # BEST!
peaks3, _ = find_peaks(x, width=20)
peaks4, _ = find_peaks(x, threshold=0.4) # Required vertical distance to its direct neighbouring samples, pretty useless
plt.subplot(2, 2, 1)
plt.plot(peaks, x[peaks], "xr")
plt.plot(x)
plt.legend(['distance'])
plt.subplot(2, 2, 2)
plt.plot(peaks2, x[peaks2], "ob")
plt.plot(x)
plt.legend(['prominence'])
plt.subplot(2, 2, 3)
plt.plot(peaks3, x[peaks3], "vg")
plt.plot(x)
plt.legend(['width'])
plt.subplot(2, 2, 4)
plt.plot(peaks4, x[peaks4], "xk")
plt.plot(x)
plt.legend(['threshold'])
plt.show()
I'm looking at a similar problem, and I've found that some of the best references come from chemistry (peak finding in mass-spec data). For a good, thorough review of peak-finding algorithms, read this. It is one of the best and clearest reviews of peak-finding techniques that I've run across. (Wavelets are the best for finding peaks of this sort in noisy data.)
It looks like your peaks are clearly defined and aren't hidden in the noise. That being the case, I'd recommend using smoothed Savitzky-Golay derivatives to find the peaks (if you just differentiate the data above, you'll have a mess of false positives). This is a very effective technique and is pretty easy to implement (you do need a matrix class with basic operations). If you simply find the zero crossings of the first S-G derivative, I think you'll be happy.
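A small sketch of that approach using scipy.signal.savgol_filter (the synthetic signal, window length and polynomial order here are only illustrative and need tuning for real data):
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0, 10, 500)
y = np.exp(-(x - 3)**2 / 0.1) + np.exp(-(x - 7)**2 / 0.2) + 0.02 * np.random.randn(x.size)

dy = savgol_filter(y, window_length=31, polyorder=3, deriv=1)  # smoothed first derivative
# peaks are where the smoothed derivative crosses zero going from positive to negative
crossings = np.where((dy[:-1] > 0) & (dy[1:] <= 0))[0]
print("peak positions:", x[crossings])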
There is a function in scipy named scipy.signal.find_peaks_cwt which sounds like it is suitable for your needs; however, I don't have experience with it, so I cannot vouch for it.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html
For those not sure about which peak-finding algorithms to use in Python, here is a quick overview of the alternatives: https://github.com/MonsieurV/py-findpeaks
Wanting an equivalent of the MATLAB findpeaks function myself, I've found that the detect_peaks function from Marcos Duarte is a good catch.
Pretty easy to use:
import numpy as np
from vector import vector, plot_peaks
from libs import detect_peaks
print('Detect peaks with minimum height and distance filters.')
indexes = detect_peaks.detect_peaks(vector, mph=7, mpd=2)
print('Peaks are: %s' % (indexes))
Which will give you:
To detect both positive and negative peaks, PeakDetect is helpful.
import numpy as np
import matplotlib.pyplot as plt
from peakdetect import peakdetect

# `data` is your 1-D signal
peaks = peakdetect(data, lookahead=20)
# lookahead is the distance to look ahead from a peak candidate to determine if it is the actual peak;
# change it as necessary
higherPeaks = np.array(peaks[0])
lowerPeaks = np.array(peaks[1])
plt.plot(data)
plt.plot(higherPeaks[:, 0], higherPeaks[:, 1], 'ro')
plt.plot(lowerPeaks[:, 0], lowerPeaks[:, 1], 'ko')
Detecting peaks in a spectrum in a reliable way has been studied quite a bit, for example in all the work on sinusoidal modelling for music/audio signals in the '80s. Look for "Sinusoidal Modeling" in the literature.
If your signals are as clean as the example, a simple "give me something with an amplitude higher than N neighbours" should work reasonably well. If you have noisy signals, a simple but effective way is to look at your peaks in time, to track them: you then detect spectral lines instead of spectral peaks. In other words, you compute the FFT on a sliding window of your signal to get a set of spectra in time (also called a spectrogram). You then look at the evolution of the spectral peaks in time (i.e. in consecutive windows).
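A rough sketch of that sliding-window idea with scipy.signal.spectrogram; the test signal and parameters are made up for illustration:
import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(2 * fs) / fs
sig = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(t.size)   # noisy 440 Hz tone

f, times, Sxx = spectrogram(sig, fs=fs, nperseg=512, noverlap=256)
peak_track = f[np.argmax(Sxx, axis=0)]   # dominant frequency in each window
print(peak_track[:10])                   # should hover around 440 Hz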
There are standard statistical functions and methods for finding outliers in data, which is probably what you need in the first case. Using derivatives would solve your second. I'm not sure of a method that handles both continuous functions and sampled data, however.
I do not think that what you are looking for is provided by SciPy. I would write the code myself, in this situation.
The spline interpolation and smoothing from scipy.interpolate are quite nice and might be quite helpful in fitting peaks and then finding the location of their maximum.
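An illustrative sketch of that spline idea: fit a spline of order 4 so that its derivative (order 3) supports .roots(), then read sub-sample peak positions off the derivative's zeros. The synthetic data and smoothing factor are assumptions:
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 200)
y = np.exp(-(x - 4)**2) + 0.5 * np.exp(-(x - 7)**2 / 0.5) + 0.01 * np.random.randn(x.size)

spl = UnivariateSpline(x, y, k=4, s=0.05)    # s controls smoothing; tune it to your noise level
extrema = spl.derivative().roots()           # zeros of dy/dx, i.e. candidate peaks and troughs
peaks = [xe for xe in extrema if spl.derivative(2)(xe) < 0]   # keep maxima only
print(peaks)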
First things first, the definition of "peak" is vague if without further specifications. For example, for the following series, would you call 5-4-5 one peak or two?
1-2-1-2-1-1-5-4-5-1-1-5-1
In this case, you'll need at least two thresholds: 1) a high threshold, above which an extreme value can register as a peak; and 2) a low threshold, so that extreme values separated by small values below it become two separate peaks.
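A small sketch of that two-threshold idea (not from any particular library): values at or above high start a peak, and a new peak can only begin after the series has dropped below low.
def decluster_peaks(series, high, low):
    # return (index, value) of one peak per excursion above `high`,
    # where excursions are separated by dips below `low`
    peaks, current = [], None
    for i, v in enumerate(series):
        if v >= high:
            if current is None or v > current[1]:
                current = (i, v)        # start or update the running peak
        elif v < low and current is not None:
            peaks.append(current)       # the excursion ended, commit the peak
            current = None
    if current is not None:
        peaks.append(current)
    return peaks

print(decluster_peaks([1, 2, 1, 2, 1, 1, 5, 4, 5, 1, 1, 5, 1], high=4, low=2))
# -> [(6, 5), (11, 5)]: 5-4-5 counts as one peak because it never dips below `low`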
Peak detection is a well-studied topic in Extreme Value Theory literature, also known as "declustering of extreme values". Its typical applications include identifying hazard events based on continuous readings of environmental variables e.g. analysing wind speed to detect storm events.
As mentioned at the bottom of this page, there is no universal definition of a peak. Therefore a universal algorithm that finds peaks cannot work without bringing in additional assumptions (conditions, parameters, etc.). This page provides some of the most stripped-down suggestions. All the literature listed in the answers above is a more or less roundabout way of doing the same thing, so feel free to take your pick.
In any case, it is your duty to narrow down the properties a feature needs to have in order to be classified as a peak, based on your experience and properties of spectra (curves) in question (noise, sampling, bandwidths, etc.)
