I'm trying to generalise some code so that it can fit multiple (n from 1 to >10) Gaussian curves/peaks within a single dataset.
Using scipy.optimize.curve_fit I can get pretty good fits when I hard-code functions for 1-3 Gaussians, and I've managed to write functions which run without error for a generalised, arbitrary number of Gaussians. However, the output fit is very poor, even though the initial parameters I pass in are identical to those used to generate the 'raw' data, i.e. a best-case scenario.
Also, there is a non-zero chance the specific function may need to be modified from a simple Gaussian at some point, but for now it should be OK.
My code example is below, followed by the output figure.
import numpy as np
import pandas as pd
import scipy
import scipy.optimize
import matplotlib.pyplot as plt
from matplotlib import gridspec
amp1 = 1
cen1 = 1
sigma1 = 0.05
df=pd.DataFrame(index=np.linspace(0,10,num=1000),columns=['int'])
def _ngaussian(x, amps,cens,sigmas):
    fn = 0
    if len(amps)== len(cens)== len(sigmas):
        for i in range(len(amps)):
            fn = fn+amps[i]*(1/(sigmas[i]*(np.sqrt(2*np.pi))))*\
            (np.exp((-1.0/2.0)*(((x-cens[i])/sigmas[i])**2)))
    else:
        print('Your inputs have unequal lengths')
    return fn
amps = [1,1.1,0.9]
cens = [1,2,1.7]
sigmas=[0.05]*3
popt_peaks = [amps,cens,sigmas]
df['peaks'] = _ngaussian(df.index, *popt_peaks)
# Optionally adding noise to the raw data
#noise = np.random.normal(0,0.1,len(df['peaks']))
#df['peaks'] = df['peaks']+noise
def wrapper_fit_func(x, *args):
    N = len(args)
    a, b, c = list(args[0][:N]),list(args[0][N:N*2]),list(args[0][2*N:3*N])
    return _ngaussian(x, a, b, c)
def unwrapper_fit_func(x, *args):
    N = int(len(args)/3)
    a, b, c = list(args[:N]),list(args[N:N*2]),list(args[2*N:3*N])
    return _ngaussian(x, a, b, c)
popt_fitpeaks, pcov_fitpeaks = scipy.optimize.curve_fit(lambda x, *popt_peaks: wrapper_fit_func(x, popt_peaks),
                                                        df.index, df['peaks'], p0=popt_peaks,
                                                        method='lm')
df['peaks_fit'] = unwrapper_fit_func(df.index, *popt_fitpeaks)
fig = plt.figure(figsize=(8,8))
gs = gridspec.GridSpec(1,1)
ax1 = fig.add_subplot(gs[0])
ax1.set_xlim(0,3)
ax1.plot(df.index, df['peaks'], "b",label='ideal data')
ax1.plot(df.index, df['peaks_fit'], "g",label='fit data')
ax1.legend(loc='upper right')
If you're interested, the context is in analytical chemistry, nuclear magnetic resonance (NMR) and Fourier transform ion cyclotron resonance mass spectrometry (FTICR MS) signal processing.
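One thing worth checking in the code above: curve_fit passes the parameters to the lambda as nine separate values, and the lambda hands them to wrapper_fit_func as a single tuple, so inside the wrapper len(args) is 1 and only one Gaussian is built from the first three flattened values. A minimal sketch of an alternative, assuming the same area-normalised Gaussian, is to keep all parameters in one flat sequence and split it into thirds inside the model function. The name n_gaussians and the starting values are illustrative and not taken from the original code.

```python
import numpy as np
from scipy.optimize import curve_fit

def n_gaussians(x, *params):
    """Sum of N area-normalised Gaussians.

    params is one flat sequence laid out as amps + cens + sigmas,
    so its length must be a multiple of 3.
    """
    n = len(params) // 3
    amps, cens, sigmas = params[:n], params[n:2 * n], params[2 * n:3 * n]
    total = np.zeros_like(x, dtype=float)
    for amp, cen, sigma in zip(amps, cens, sigmas):
        total += amp / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - cen) / sigma) ** 2)
    return total

# Noise-free test data built from the values used in the question
x = np.linspace(0, 10, 1000)
true_params = [1, 1.1, 0.9, 1, 2, 1.7, 0.05, 0.05, 0.05]
y = n_gaussians(x, *true_params)

# Illustrative starting values, deliberately a little off from the truth
p0 = [0.8, 0.8, 0.8, 1.05, 1.95, 1.65, 0.08, 0.08, 0.08]
popt, pcov = curve_fit(n_gaussians, x, y, p0=p0)

n = len(popt) // 3
print("amplitudes:", popt[:n])
print("centres:   ", popt[n:2 * n])
print("sigmas:    ", popt[2 * n:])
```

Because curve_fit takes the number of parameters from the length of p0, the same function works for any number of peaks as long as p0 has 3*N entries.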
You might find lmfit (https://lmfit.github.io/lmfit-py/, disclosure: I am a lead author) useful for this. It provides an easy-to-use Model class for modeling data, including builtin Models for Gaussian, Voigt, and similar lineshapes making it easy to compare model functions.
Lmfit models can be added (or multiplied) to make a Composite Model, making it easy to support 1, 2, 3, etc. Gaussians and to include different baseline functions as well. There are docs and several examples at the link above. A small rewrite of your example (including adding a bit of noise) might look like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel
amp1 = 1
cen1 = 1
sigma1 = 0.05
df=pd.DataFrame(index=np.linspace(0,10,num=1000),columns=['int'])
def _ngaussian(x, amps,cens,sigmas):
    fn = 0
    if len(amps)== len(cens)== len(sigmas):
        for i in range(len(amps)):
            fn = fn+amps[i]*(1/(sigmas[i]*(np.sqrt(2*np.pi))))*\
            (np.exp((-1.0/2.0)*(((x-cens[i])/sigmas[i])**2)))
        fn = fn+np.random.normal(size=len(x), scale=0.05)
    else:
        print('Your inputs have unequal lengths')
    return fn
amps = [1.30, 0.92, 2.11]
cens = [1.10, 1.73, 2.06]
sigmas=[0.05, 0.09, 0.07]
popt_peaks = [amps,cens,sigmas]
df['peaks'] = _ngaussian(df.index, *popt_peaks)
# create a model with 3 Gaussians: pretty easy to generalize
# to a loop to make N peaks
model = (GaussianModel(prefix='p1_') +
         GaussianModel(prefix='p2_') +
         GaussianModel(prefix='p3_') )
# create Parameters (named from function arguments). For
# Gaussian, Lorentzian, Voigt, etc these are "center", "amplitude", "sigma"
params = model.make_params(p1_center=1.0, p1_amplitude=2, p1_sigma=0.1,
                           p2_center=1.5, p2_amplitude=2, p2_sigma=0.1,
                           p3_center=2.0, p3_amplitude=2, p3_sigma=0.1)
# Parameters can have min/max bounds, be fixed (`.vary = False`)
# or constrained to a mathematical expression of other Parameter values
params['p1_center'].min = 0.8
params['p1_center'].max = 1.5
params['p2_center'].min = 1.1
params['p2_center'].max = 1.9
params['p3_center'].min = 1.88
params['p3_center'].max = 3.00
# run the fit
result = model.fit(df['peaks'], params, x=df.index)
# print out the fit results
print(result.fit_report())
# plot results
plt.plot(df.index, df['peaks'], 'o', label='data')
plt.plot(df.index, result.best_fit, '-', label='fit')
plt.legend()
plt.gca().set_xlim(0, 3)
plt.show()
This will produce a fit plot like this:
and print out a report of
[[Model]]
((Model(gaussian, prefix='p1_') + Model(gaussian, prefix='p2_')) + Model(gaussian, prefix='p3_'))
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 102
# data points = 1000
# variables = 9
chi-square = 6.88439024
reduced chi-square = 0.00694691
Akaike info crit = -4960.49871
Bayesian info crit = -4916.32892
[[Variables]]
p1_amplitude: 1.29432022 +/- 0.00428720 (0.33%) (init = 2)
p1_center: 1.09993745 +/- 1.9012e-04 (0.02%) (init = 1)
p1_sigma: 0.04970776 +/- 1.9012e-04 (0.38%) (init = 0.1)
p2_amplitude: 0.91875183 +/- 0.00604913 (0.66%) (init = 2)
p2_center: 1.73039597 +/- 6.7594e-04 (0.04%) (init = 1.5)
p2_sigma: 0.09054027 +/- 7.0994e-04 (0.78%) (init = 0.1)
p3_amplitude: 2.10077395 +/- 0.00533617 (0.25%) (init = 2)
p3_center: 2.06019332 +/- 2.0105e-04 (0.01%) (init = 2)
p3_sigma: 0.06970239 +/- 2.0752e-04 (0.30%) (init = 0.1)
p1_fwhm: 0.11705282 +/- 4.4770e-04 (0.38%) == '2.3548200*p1_sigma'
p1_height: 10.3878975 +/- 0.03440799 (0.33%) == '0.3989423*p1_amplitude/max(2.220446049250313e-16, p1_sigma)'
p2_fwhm: 0.21320604 +/- 0.00167179 (0.78%) == '2.3548200*p2_sigma'
p2_height: 4.04824243 +/- 0.02582408 (0.64%) == '0.3989423*p2_amplitude/max(2.220446049250313e-16, p2_sigma)'
p3_fwhm: 0.16413657 +/- 4.8866e-04 (0.30%) == '2.3548200*p3_sigma'
p3_height: 12.0238006 +/- 0.02922330 (0.24%) == '0.3989423*p3_amplitude/max(2.220446049250313e-16, p3_sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(p3_amplitude, p3_sigma) = 0.622
C(p2_amplitude, p2_sigma) = 0.621
C(p1_amplitude, p1_sigma) = 0.577
C(p2_sigma, p3_sigma) = -0.299
C(p2_sigma, p3_amplitude) = -0.271
C(p2_amplitude, p3_sigma) = -0.239
C(p2_sigma, p3_center) = 0.226
C(p2_amplitude, p3_amplitude) = -0.210
C(p2_center, p3_sigma) = -0.192
C(p2_amplitude, p3_center) = 0.171
C(p2_center, p3_amplitude) = -0.160
C(p2_center, p3_center) = 0.126
SciPy was installed from the same channel and at the same version in each environment.
Even so, the result of least_squares differs depending on the environment.
The only difference between the environments is the PC they run on.
version:1.9.1 py39h316f440_0
channel:conda-forge
environment:windows
I've attached the source code I ran.
If the conditions are identical apart from the environment, I would like to get identical results.
What causes the difference, and how can I make the results match?
Thank you.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import least_squares
import random
random.seed(134)
import numpy as np
np.random.seed(134)
def report_params(fit_params_values, fit_param_names):
    for each in range(len(fit_param_names)):
        print(fit_param_names[each], 'is', fit_params_values[each])
# define your modules
def pCon1():
    # This is the module for a specific instantiation of a constitutive promoter
    # the input is nothing
    # the output is a protein production amount per time unit
    pCon1_production_rate = 100
    return pCon1_production_rate
def pLux1(LuxR, AHL):
    # This is the module for a specific instantiation of a lux promoter
    # the input is a LuxR amount and an AHL amount
    # the output is a protein production amount per time unit
    # For every promoter there is some function that determines what the promoter's
    # maximal and basal expression are based on the amount of transcriptional factor
    # that is floating around in the cell. These numbers are empirically determined, and
    # for demonstration purposes are fictionally and arbitrarily filled in here.
    # These functions take the form of Hill functions.
    basal_n = 2
    basal_basal = 2
    basal_max = 2
    basal_kd = 2
    basal_expression_rate = basal_basal + (basal_max * (LuxR**basal_n / (LuxR**basal_n + basal_kd)))
    max_n = 2
    max_max = 2
    max_kd = 2
    maximal_expression_rate = (LuxR**max_n / (LuxR**max_n + max_kd))
    pLux1_n = 2
    pLux1_kd = 10
    pLux1_production_rate = basal_expression_rate + maximal_expression_rate*(AHL**pLux1_n / (pLux1_kd + AHL**pLux1_n))
    return pLux1_production_rate
def simulation_set_of_equations(y, t, *args):
    # Args are strictly for parameters we want to eventually estimate.
    # Everything else must be hardcoded below. Sorry for the inconvenience.
    # Unpack your parameters
    k_pCon_express = args[0] # A summation of transcription and translation from a pCon promoter
    k_pLux_express = args[1] # A summation of transcription and translation from a pLux promoter
    k_loss = args[2] # A summation of dilution and degradation
    # Unpack your current amount of each species
    LuxR, GFP, AHL = y
    # Determine the change in each species
    dLuxR = pCon1() - k_loss*LuxR
    dGFP = pLux1(LuxR, AHL)*k_pLux_express - k_loss*GFP
    dAHL = 0 # for now we're assuming AHL was added exogenously and never degrades
    # Return the change in each species; make sure same order as your init values
    # scipy.odeint will take these values and apply them to the current value of each species in the next time step for you
    return [dLuxR, dGFP, dAHL]
# Parameters
k_pCon_express = 101
k_pLux_express = 50
k_loss = 0.1
params = (k_pCon_express, k_pLux_express, k_loss)
param_names = ['k_pCon_express', 'k_pLux_express', 'k_loss'] # somehow this is honestly necessary in Python?!
# Initial Conditions
# LuxR, GFP, AHL
init_P = [1000, 0, 11]
# Timesteps
n_steps = 500
t = np.linspace(0, 30, n_steps)
num_P = odeint(simulation_set_of_equations, init_P, t, args = (params))
plt.plot(t, num_P[:,0], c='b', label = 'LuxR')
plt.plot(t, num_P[:,1], c='g', label = 'GFP')
plt.plot(t, num_P[:,2], c='r', label = 'AHL')
plt.xlabel('Time')
plt.ylabel('Concentration')
plt.legend(loc = 'best')
plt.grid()
plt.yscale('log')
plt.show()
# Create experimental data. Just take the regular simulation data and add some gaussian noise to it.
noise = np.random.normal(0, 10, num_P.shape)
exp_P = num_P + noise
exp_t = t[::10]
exp_P = exp_P[::10]
def residuals(params):
    params = tuple(params)
    sim_P = odeint(simulation_set_of_equations, init_P, exp_t, args = params)
    res = sim_P - exp_P
    return res.flatten()
initial_guess = (100, 100, 100)
low_bounds = [0, 0, 0]
up_bounds = [1000, 1000, 1000]
fitted_params = least_squares(residuals, initial_guess, bounds=(low_bounds, up_bounds)).x
# small reminder: .x is the fitted parameters attribute of the least_squares output
# With least_squares function, unlike, say, curve_fit, it does not compute the covariance matrix for you
# TODO calculate standard deviation of parameter estimation
# (will this ever be used other than sanity checking?)
print(params)
report_params(fitted_params, param_names)
(101, 50, 0.1)
k_pCon_express is 100.0
k_pLux_express is 49.9942246627
k_loss is 0.100037839987
plt.plot(t, odeint(simulation_set_of_equations, init_P, t, args = tuple(params))[:,1], c='r', label='GFP - Given Param Simulation')
plt.scatter(exp_t, exp_P[:,1], c='b', label='GFP - Fake Experimental Data')
plt.plot(t, odeint(simulation_set_of_equations, init_P, t, args = tuple(fitted_params))[:,1], c='g', label='GFP - Fitted Param Simulation')
plt.legend(loc = 'best')
plt.xlabel('Time')
plt.ylabel('Concentration')
plt.grid()
plt.yscale('log')
plt.show()
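The TODO in the comments above asks for the standard deviation of the parameter estimates, which least_squares does not return directly. A common approximation builds the covariance from the Jacobian at the solution, cov ≈ s² (JᵀJ)⁻¹, where s² is the residual variance. The sketch below shows this on a small stand-alone exponential-decay problem rather than the ODE model; the model, data and names are assumptions made for illustration, and the approximation is not reliable when a fitted parameter sits on one of its bounds.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical stand-in problem: y = a * exp(-b * t) plus Gaussian noise
rng = np.random.default_rng(134)
t = np.linspace(0, 10, 200)
true_params = np.array([5.0, 0.3])
y_obs = true_params[0] * np.exp(-true_params[1] * t) + rng.normal(0, 0.05, t.size)

def residuals(p):
    return p[0] * np.exp(-p[1] * t) - y_obs

fit = least_squares(residuals, x0=[3.0, 0.5], bounds=([0, 0], [100, 100]))

# Residual variance: sum of squared residuals over degrees of freedom
dof = fit.fun.size - fit.x.size
s2 = np.sum(fit.fun ** 2) / dof

# Gauss-Newton approximation of the parameter covariance matrix
cov = s2 * np.linalg.inv(fit.jac.T @ fit.jac)
stderr = np.sqrt(np.diag(cov))

for name, value, err in zip(["a", "b"], fit.x, stderr):
    print(f"{name} = {value:.4f} +/- {err:.4f}")
```

For the ODE fit above the same lines would apply to the object returned by least_squares, provided the full result is kept instead of only its .x attribute.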
I want to fit a linear equation of the form Y = coff1*A + coff2*B + C and calculate the constants coff1, coff2, and C. I tried several approaches, but the coefficient values differ hugely between them. What is an efficient way to fit this type of equation?
import pandas as pd
import lmfit
dataset = {'A': [0.021426, -0.003970,0.001040, -0.003789, 0.009423, 0.046421, 0.039426, 0.027010, 0.024423, 0.022277],
'B': [ 0.000056, 0.000098, 0.000057, 0.000066, 0.000047 ,-0.009798,-0.008069,-0.005124,-0.004505,-0.004006],
'y': [242245.852, 153763.713, 205788.950, 161561.380, 250021.084,235739.216, 283089.372, 429715.097, 480362.889, 531978.557]}
data = pd.DataFrame(dataset)
A=data['A']
B=data['B']
y=data['y']
def resid(params,A, B, ydata):
    f_rot = params['f_rot'].value
    xip = params['xip'].value
    xim = params['xim'].value
    f_obs =f_rot+A*xip-B*xim
    return f_obs - ydata
#data = np.loadtxt("C:/Users/USER/Desktop/Combined_values_FINAL_EAP.txt")
params = lmfit.Parameters()
params.add('f_rot', 15)
params.add('xip', 2)
params.add('xim', 10)
fit =lmfit.minimize(resid, params, args=(y,A,B), method='least_squares')
lmfit.report_fit(fit)
#Answer of this fit
#f_rot: 3.9748e-04 +/- 0.00139658 (351.36%) (init = 150000)
# xip: 3.7470e-10 +/- 4.7904e-09 (1278.48%) (init = 2)
# xim: 0.19744062 +/- 0.03694633 (18.71%) (init = 10)
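Since the model is linear in coff1, coff2 and C, it can be solved directly by ordinary least squares with no starting values. Here is a minimal sketch using numpy.linalg.lstsq on the data from the question; the variable names are illustrative. Note also that in the lmfit call above the residual function is defined as resid(params, A, B, ydata) but is called with args=(y, A, B), so the arguments arrive in a different order than the function expects, which may account for some of the odd values.

```python
import numpy as np

A = np.array([0.021426, -0.003970, 0.001040, -0.003789, 0.009423,
              0.046421, 0.039426, 0.027010, 0.024423, 0.022277])
B = np.array([0.000056, 0.000098, 0.000057, 0.000066, 0.000047,
              -0.009798, -0.008069, -0.005124, -0.004505, -0.004006])
y = np.array([242245.852, 153763.713, 205788.950, 161561.380, 250021.084,
              235739.216, 283089.372, 429715.097, 480362.889, 531978.557])

# Design matrix with one column per unknown: coff1 (A), coff2 (B), C (constant)
X = np.column_stack([A, B, np.ones_like(A)])

# Solve min ||X @ coef - y||^2 directly
coef, residual_ss, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
coff1, coff2, C = coef
y_fit = X @ coef

print("coff1 =", coff1)
print("coff2 =", coff2)
print("C     =", C)
print("sum of squared residuals =", np.sum((y - y_fit) ** 2))
```

A and B are several orders of magnitude smaller than y, so large coefficient values are to be expected; rescaling the columns before fitting is one option if that becomes a numerical problem.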
I want to perform a Weibull probability fit with 95% confidence bounds in Python. As test data, I use failure cycles from a measurement, plotted against the reliability R(t).
So far I have found a way to perform the Weibull fit, but I have not managed to get the confidence bounds. The Weibull plot for the same test data set was already produced with Origin, so I know which shape to "expect" for the confidence interval, but I do not understand how to get there.
I found information about Weibull confidence intervals on reliawiki (cf. Bounds on Reliability based on Fisher Matrix confidence bounds) and used the description there to calculate the variance and the upper and lower confidence bounds (R_U and R_L).
Here is a working code example for my Weibull fit and my confidence bounds with the test data set, based on the description on reliawiki (cf. Bounds on Reliability). For the fit, I used an OLS model.
import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy.optimize import curve_fit
import math
import statsmodels.api as sm
def weibull_ticks(y, pos):
    return "{:.0f}%".format(100 * (1 - np.exp(-np.exp(y))))
def loglog(x):
    return np.log(-np.log(1 - np.asarray(x)))
class weibull_example(object):
    def __init__(self, dat):
        self.fits = {}
        dat.index = np.arange(1, len(dat) + 1)
        dat.sort_values('data', inplace=True)
        #define yaxis-values
        dat['percentile'] = dat.index*1/len(dat)
        self.data = dat
        self.fit()
        self.plot_data()
    def fit(self):
        #fit the data points with a the OLS model
        self.data=self.data[:-1]
        x0 = np.log(self.data.dropna()['data'].values)
        Y = loglog(self.data.dropna()['percentile'])
        Yx = sm.add_constant(Y)
        model = sm.OLS(x0, Yx)
        results = model.fit()
        yy = loglog(np.linspace(.001, .999, 100))
        YY = sm.add_constant(yy)
        XX = np.exp(results.predict(YY))
        self.eta = np.exp(results.params[0])
        self.beta = 1 / results.params[1]
        self.fits['syx'] = {'results': results, 'model': model,
                            'line': np.row_stack([XX, yy]),
                            'beta': self.beta,
                            'eta': self.eta}
        cov = results.cov_params()
        #get variance and covariance
        self.beta_var = cov[1, 1]
        self.eta_var = cov[0, 0]
        self.cov = cov[1, 0]
    def plot_data(self, fit='yx'):
        dat = self.data
        #plot data points
        plt.semilogx(dat['data'], loglog(dat['percentile']), 'o')
        fit = 's' + fit
        self.plot_fit(fit)
        ax = plt.gca()
        formatter = mpl.ticker.FuncFormatter(weibull_ticks)
        ax.yaxis.set_major_formatter(formatter)
        yt_F = np.array([0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
                         0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
        yt_lnF = loglog(yt_F)
        plt.yticks(yt_lnF)
        plt.ylim(loglog([.01, .99]))
    def plot_fit(self, fit='syx'):
        dat = self.fits[fit]['line']
        plt.plot(dat[0], dat[1])
        #calculate variance to get confidence bound
        def variance(x):
            return (math.log(x) - math.log(self.eta)) ** 2 * self.beta_var + \
                   (self.beta/self.eta) ** 2 * self.eta_var - \
                   2 * (math.log(x) - math.log(self.eta)) * (-self.beta/self.eta) * self.cov
        #calculate confidence bounds
        def confidence_upper(x):
            return 1-np.exp(-np.exp(self.beta*(math.log(x)-math.log(self.eta)) - 0.95*np.sqrt(variance(x))))
        def confidence_lower(x):
            return 1-np.exp(-np.exp(self.beta*(math.log(x)-math.log(self.eta)) + 0.95*np.sqrt(variance(x))))
        yvals_1 = list(map(confidence_upper, dat[0]))
        yvals_2 = list(map(confidence_lower, dat[0]))
        #plot confidence bounds
        plt.semilogx(dat[0], loglog(yvals_1), linestyle="solid", color="black", linewidth=2,
                     label="fit_u_1", alpha=0.8)
        plt.semilogx(dat[0], loglog(yvals_2), linestyle="solid", color="green", linewidth=2,
                     label="fit_u_1", alpha=0.8)
def main():
    fig, ax1 = plt.subplots()
    # raw strings keep the LaTeX backslashes from being read as escape sequences
    ax1.set_xlabel(r"$Cycles\ til\ Failure$")
    ax1.set_ylabel(r"$Weibull\ Percentile$")
    #my data points
    data = pd.DataFrame({'data': [1556, 2595, 11531, 38079, 46046, 57357]})
    weibull_example(data)
    plt.savefig("Weibull.png")
    plt.close(fig)
if __name__ == "__main__":
    main()
The confidence bounds in my plot do not look as I expected. I tried a lot of different 'variances', just to understand the function and to check whether the problem is simply a typing error. By now I am convinced that the problem is more general and that I have misunderstood something in the description on reliawiki. Unfortunately, I really cannot see what the problem is, and I do not know anyone else I can ask. I did not find an appropriate answer on the internet or in other forums.
That's why I decided to ask this question here. It's the first time I have asked a question in a forum, so I hope that I have explained everything sufficiently and that the code example is useful.
Thank you very much :)
Apologies for the very late answer, but I'll provide it for any future readers.
Rather than try implementing this yourself, you may want to consider using a package designed for exactly this called reliability.
Here is the example for your use case.
Remember to upvote this answer if it helps you :)
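For readers who still want to compute the bounds by hand, here is a rough sketch of the Fisher-matrix bounds on reliability as I read the ReliaWiki description: with u = β(ln t − ln η), the bounds are u ± K·sqrt(Var(u)), where K is the standard normal quantile for the chosen confidence level (about 1.96 for two-sided 95%, not 0.95). The function name and all numeric inputs below are hypothetical. The sketch also assumes that the variances and covariance are those of β and η themselves, whereas the OLS covariance in the question appears to belong to the regression coefficients ln η and 1/β.

```python
import numpy as np
from scipy.stats import norm

def weibull_reliability_bounds(t, beta, eta, var_beta, var_eta, cov_beta_eta, ci=0.95):
    """Two-sided bounds on R(t) via the delta method on u = beta*(ln t - ln eta)."""
    t = np.asarray(t, dtype=float)
    u = beta * (np.log(t) - np.log(eta))
    # Partial derivatives of u with respect to beta and eta
    du_dbeta = np.log(t) - np.log(eta)
    du_deta = -beta / eta
    var_u = (du_dbeta ** 2 * var_beta
             + du_deta ** 2 * var_eta
             + 2 * du_dbeta * du_deta * cov_beta_eta)
    # Standard normal quantile for a two-sided interval
    k = norm.ppf(0.5 + ci / 2)
    u_lower = u - k * np.sqrt(var_u)
    u_upper = u + k * np.sqrt(var_u)
    r_point = np.exp(-np.exp(u))
    r_upper = np.exp(-np.exp(u_lower))  # smaller u gives higher reliability
    r_lower = np.exp(-np.exp(u_upper))
    return r_lower, r_point, r_upper

# Hypothetical parameter estimates and (co)variances, for illustration only
t = np.logspace(3, 5, 50)
r_lo, r_hat, r_up = weibull_reliability_bounds(
    t, beta=0.8, eta=30000.0, var_beta=0.04, var_eta=1.0e8, cov_beta_eta=100.0)
print(r_lo[:3], r_hat[:3], r_up[:3])
```

The unreliability bounds needed for a Weibull probability plot are then 1 − r_up (lower) and 1 − r_lo (upper).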
I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit the Gamma distribution but not on the whole data but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to when I fitted the Gamma distribution on all the samples using the following python code which makes use of scipy.stats.gamma:
img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, density=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, density=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, density=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x<max_data, gammapdf/norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
The rest is just optimization technicalities.
It's common practise to work with logarithms, so I minimized what's sometimes called the "energy": the negative log-likelihood, i.e. the sum over the data of the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
Minimize:
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to:
- Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
- Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse hessian, which is useful for uncertainty estimation. Enforcement of nonnegativity for the parameters may also be a good idea.
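To make the pieces above easier to try out, here is a self-contained sketch on synthetic data. It follows the same idea (maximum likelihood for a gamma density renormalized on x < max_data) but works with log-densities directly and adds parameter bounds, as suggested. The synthetic sample, the truncation point, the bound values and the choice of L-BFGS-B are assumptions for illustration, not part of the original data or answer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Synthetic stand-in for the image data: gamma samples, truncated on the right
rng = np.random.default_rng(0)
true_alpha, true_beta = 12.0, 2.0
samples = gamma.rvs(true_alpha, scale=true_beta, size=20000, random_state=rng)
max_data = 30.0
data = samples[samples < max_data]

def negative_log_likelihood(p):
    alpha, beta = p
    log_pdf = gamma.logpdf(data, alpha, loc=0, scale=beta)
    # Renormalize by the probability mass to the left of the truncation point
    log_norm = gamma.logcdf(max_data, alpha, loc=0, scale=beta)
    return -np.sum(log_pdf - log_norm)

initial_guess = [np.mean(data), 2.0]
# Bounds keep both parameters positive and inside a numerically safe range
result = minimize(negative_log_likelihood, initial_guess, method='L-BFGS-B',
                  bounds=[(1e-3, 100.0), (1e-3, 100.0)])
fit_alpha, fit_beta = result.x
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))
```

With L-BFGS-B the result object also carries an approximation of the inverse Hessian in result.hess_inv, which could serve as a starting point for the uncertainty estimate mentioned above.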
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach using a manually created dataset in excel that more or less matched the plot given.
Raw Data
Outline
1. Import the data into a Pandas dataframe.
2. Mask the indices after the max response index.
3. Create a mirror image of the remaining data.
4. Append the mirror image while leaving a buffer of empty space.
5. Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y.
mask = df.index.values <= df.Y.argmax()
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
# np.float was removed from recent NumPy releases; the builtin float is equivalent
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size) # idxs=[0, 1, 2,...,len(data)]
mid_idxs = idxs.mean() # len(data)/2
# idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
plt.show()
Result
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.
I would like to use Pseudo-Voigt function to fit the data points below.
I looked at matplotlib and numpy but haven't found a way yet.
The data looks like this:
[3.3487290833206163, 3.441076831745743, 7.7932863251851305, 7.519064207516034, 7.394406511652473, 11.251458210206666, 4.679476113847004, 8.313048016542345, 9.348006472917458, 6.086336477997078, 10.765370342398741, 11.402519337778239, 11.151689287913552, 8.546151698722557, 8.323886291540909, 7.133249200994414, 10.242189407441712, 8.887686444395982, 10.759444780127321, 9.21095463298772, 15.693160143294264, 9.239683298899614, 9.476116297451632, 10.128625585058783, 10.94392508956097, 10.274287987647595, 9.552394167463973, 9.51931115335406, 9.923989117054466, 8.646255122559495, 12.207746464070603, 15.249531807666745, 9.820667193850705, 11.913964012172858, 9.506862412612637, 15.858588835799232, 14.918486963658015, 15.089436171053094, 14.38496801289269, 14.42394419048644, 15.759311758218061, 17.063349232010786, 12.232863723786215, 10.988245956134314, 19.109899560493286, 18.344353100589824, 17.397232553539542, 12.372706600456558, 13.038720878764792, 19.100965014037367, 17.094480819566147, 20.801679461435484, 15.763762333448557, 22.302320507719728, 23.394129891315963, 19.884812694503303, 22.09743700979689, 16.995815335935077, 24.286037929073284, 25.214705826961016, 25.305223543285013, 22.656121668613896, 30.185701748800568, 28.28382587095781, 35.63753811848088, 35.59816270398698, 35.64529822281625, 36.213428394807224, 39.56541841125095, 46.360702383473075, 55.84449512752349, 64.50142387788203, 77.75090937376423, 83.00423387164669, 111.98365374689226, 121.05211901294848, 176.82062069814936, 198.46769832454626, 210.52624393366017, 215.36708238568033, 221.58003148955638, 209.7551225151964, 198.4104196333782, 168.13949002992925, 126.0081896958841, 110.39003569380478, 90.88743461485616, 60.5443025644061, 71.00628698937221, 61.616294708485384, 45.32803695045095, 43.85638472551629, 48.863070901568086, 44.65252243455522, 41.209120125948104, 36.63478075990383, 36.098369542551325, 37.75419965137265, 41.102019290969956, 26.874409332756752, 24.63314900554918, 26.05340465966265, 
26.787053802870535, 16.51559065528567, 19.367731289491633, 17.794958746427422, 19.52785218727518, 15.437635249660396, 21.96712662378481, 15.311043443598177, 16.49893493905559, 16.41202114648668, 17.904512123179114, 14.198812322372405, 15.296623848360126, 14.39383356078112, 10.807540004905345, 17.405310725810278, 15.309786310492559, 15.117665282794073, 15.926377010540376, 14.000223621497955, 15.827757539949431, 19.22355433703294, 12.278007446886507, 14.822245428954957, 13.226674931853903, 10.551237809932955, 8.58081654372226, 10.329123069771072, 13.709943935412294, 11.778442391614956, 14.454930746849122, 10.023352452542506, 11.01463585064886, 10.621062477382623, 9.29665510291416, 9.633579419680572, 11.482703531988037, 9.819073927883121, 12.095918617534196, 9.820590920621864, 9.620109753045565, 13.215701804432598, 8.092085538619543, 9.828015669152578, 8.259655585415379, 9.424189583067022, 13.149985946123934, 7.471175119197948, 10.947567075630904, 10.777888096711512, 8.477442195191612, 9.585429992609711, 7.032549866566089, 5.103962051624133, 9.285999577275545, 7.421574444036404, 5.740841317806245, 2.3672530845679]
You can use lmfit (pip install --user lmfit):
http://lmfit.github.io/lmfit-py/builtin_models.html#pseudovoigtmodel
http://lmfit.github.io/lmfit-py/builtin_models.html#example-1-fit-peaked-data-to-gaussian-lorentzian-and-voigt-profiles
import numpy as np
from lmfit.models import PseudoVoigtModel
x = np.arange(0, 160)
y = np.array(values)  # values: the list of numbers posted in the question
mod = PseudoVoigtModel()
pars = mod.guess(y, x=x)
out = mod.fit(y, pars, x=x)
print(out.fit_report(min_correl=0.25))
out.plot()
which results in:
[[Model]]
Model(pvoigt)
[[Fit Statistics]]
# function evals = 73
# data points = 160
# variables = 4
chi-square = 10762.372
reduced chi-square = 68.990
[[Variables]]
amplitude: 4405.17064 +/- 83.84199 (1.90%) (init= 2740.16)
sigma: 5.63732815 +/- 0.236117 (4.19%) (init= 5)
center: 79.5249321 +/- 0.103164 (0.13%) (init= 79)
fraction: 1.21222411 +/- 0.052349 (4.32%) (init= 0.5)
fwhm: 11.2746563 +/- 0.472234 (4.19%) == '2.0000000*sigma'
[[Correlations]] (unreported correlations are < 0.250)
C(sigma, fraction) = -0.774
C(amplitude, fraction) = 0.314
You could use the nmrglue library:
from nmrglue import lineshapes1d as ls
ls.sim_pvoigt_fwhm(x, x0, fwhm, eta)
where
x: Array of values at which to evaluate the distribution.
x0: Center of the distribution.
fwhm: Full-width at half-maximum of the Pseudo Voigt profile.
eta: Lorentzian/Gaussian mixing parameter.
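If you would rather not depend on an extra library, the same parametrisation is easy to write by hand. Below is a minimal sketch of a pseudo-Voigt profile as an eta-weighted sum of a Lorentzian and a Gaussian sharing one FWHM, fitted with scipy.optimize.curve_fit. The amplitude and offset parameters, the synthetic data and the starting values are my own assumptions; this is not the nmrglue implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, amplitude, x0, fwhm, eta, offset):
    """Peak-height pseudo-Voigt: eta * Lorentzian + (1 - eta) * Gaussian."""
    gauss = np.exp(-4 * np.log(2) * ((x - x0) / fwhm) ** 2)
    lorentz = 1 / (1 + 4 * ((x - x0) / fwhm) ** 2)
    return offset + amplitude * (eta * lorentz + (1 - eta) * gauss)

# Synthetic data roughly resembling the posted peak (hypothetical values)
rng = np.random.default_rng(1)
x = np.arange(0, 160, dtype=float)
true_params = (200.0, 80.0, 12.0, 0.6, 10.0)
y = pseudo_voigt(x, *true_params) + rng.normal(0, 3.0, x.size)

# Starting values estimated from the data; eta is constrained to [0, 1]
p0 = [y.max() - y.min(), x[np.argmax(y)], 10.0, 0.5, y.min()]
lower = [0.0, x.min(), 1e-6, 0.0, -np.inf]
upper = [np.inf, x.max(), np.inf, 1.0, np.inf]
popt, pcov = curve_fit(pseudo_voigt, x, y, p0=p0, bounds=(lower, upper))

for name, value in zip(["amplitude", "x0", "fwhm", "eta", "offset"], popt):
    print(f"{name}: {value:.3f}")
```

Constraining eta to the interval [0, 1] keeps the mixing fraction physically meaningful; the unconstrained lmfit result above reports a fraction greater than 1.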