I have the following emission spectrum of neon, collected on a Raman spectrometer (background-subtracted data):
x=np.array([[1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981]])
y=np.array([[-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01]])
I have fitted a single Voigt function using lmfit, specifically:
model = VoigtModel()+ ConstantModel()
params=model.make_params(center=1123.096389, amplitude=1000, sigma=0.27)
result = model.fit(y.flatten(), params, x=x.flatten())
There is a second peak on the left-hand shoulder (sorry, I can't post an image). People using commercial peak-fitting software fit the first Voigt, then add the second, and the software then adjusts the fits of both. How would I do this in Python?
A related question: is there a way to optimize how many points to include in the peak fit? Right now, I am only feeding in x and y data covering a fixed spectral range. Commercial software, however, optimizes how much of the range to include in a given peak fit (I presume using residuals). How would I recreate this?
Thanks!
You can do it manually like so:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import VoigtModel, ConstantModel
x=np.array([1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981])
y=np.array([-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01])
model = VoigtModel() + ConstantModel()
params=model.make_params(center=1123.0, amplitude=1000, sigma=0.27)
result1 = model.fit(y.flatten(), params, x=x.flatten())
rest = y-result1.best_fit
model = VoigtModel() + ConstantModel()
params=model.make_params(center=1120.5, amplitude=200, sigma=0.27)
result2 = model.fit(rest, params, x=x.flatten())
rest -= result2.best_fit
plt.plot(x, y, label='Original')
plt.plot(x, result1.best_fit, label='1123.0')
plt.plot(x, result2.best_fit, label='1120.5')
plt.plot(x, rest, label='residual')
plt.legend()
plt.show()
You have to make sure that the residual makes sense. In this case it is quite close to 0, so I'd argue that the fit is fine.
lmfit optimizes the peak position during the fit, so you don't need to pinpoint its exact value in the initial guess. It is also worth pointing out that, because of the limited resolution of this data (and of spectroscopy in general), the highest point is not necessarily the centre of the peak. For the same reason, some apparent shoulders might not be real shoulders, though in this case it looks like one.
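For a fit that adjusts both peaks at once, which is what the question describes the commercial packages doing, one option is to put both Voigt profiles and the constant into a single model and fit all parameters together. The sketch below does this with scipy.optimize.curve_fit and scipy.special.voigt_profile instead of lmfit, purely to stay self-contained. The starting values and bounds are rough guesses read off the data, and tying gamma to sigma is an assumption that copies lmfit's default VoigtModel constraint. In lmfit the equivalent is a composite model in which each VoigtModel is given its own prefix argument so the parameter names do not clash.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import voigt_profile

# Data from the question (background-subtracted neon line).
x = np.array([
    1114.120887, 1114.682293, 1115.243641, 1115.80493, 1116.366161,
    1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443,
    1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267,
    1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635,
    1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548,
    1128.138556, 1128.698505, 1129.258397, 1129.81823, 1130.378005,
    1130.937722, 1131.497381, 1132.056981])
y = np.array([
    -4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00,
    -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01,
    -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01,
    1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02,
    7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01,
    1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00,
    7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00,
    -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00,
    -8.43249208e-01])


def two_voigt(x, a1, c1, s1, a2, c2, s2, offset):
    """Sum of two area-normalised Voigt profiles plus a constant."""
    # gamma is set equal to sigma for each peak (an assumption that
    # copies lmfit's default VoigtModel behaviour).
    peak1 = a1 * voigt_profile(x - c1, s1, s1)
    peak2 = a2 * voigt_profile(x - c2, s2, s2)
    return peak1 + peak2 + offset


# Starting values and bounds are rough guesses, not fitted results.
p0 = [1000.0, 1123.0, 0.3, 200.0, 1121.0, 0.3, 0.0]
lower = [0.0, 1122.0, 0.05, 0.0, 1119.5, 0.05, -50.0]
upper = [np.inf, 1124.0, 3.0, np.inf, 1122.0, 3.0, 50.0]

# All seven parameters are optimised simultaneously.
popt, pcov = curve_fit(two_voigt, x, y, p0=p0, bounds=(lower, upper))
residual = y - two_voigt(x, *popt)

print("main peak centre:    ", popt[1])
print("shoulder peak centre:", popt[4])
print("residual RMS:        ", np.sqrt(np.mean(residual**2)))
```

The sequential residual fit above is a reasonable way to obtain starting values; the joint fit then lets the two peaks trade intensity where they overlap.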
For your related question: judging by the lmfit documentation, it fits over the whole range you pass in. Residuals don't seem to be a solution, since you run into the same problem (which range to consider). I believe that the commercial software you mention uses Multivariate Curve Resolution (MCR). These deconvolution problems have been a hot topic for decades; if you are interested in this kind of solution, I suggest reading up on MCR.
This is my first time posting a question and I'm going to try to make it as clear as I can but feel free to ask questions.
I'm trying to fit a model to a curve using scipy.optimize.curve_fit, as below:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x,EM):
    return (((4.0*EM*(np.sqrt(8*10**-9)))/(3.0*(1.0-(0.5**2))*8*10**-9))*(((((x))*1*10**-9)**((3.0/2.0)))))
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
dstart=20.0
xdata=np.array(xdata[::-1])
xdata=xdata-dstart
xdata=list(xdata)
xdata1=[]
ydata1=[]
for i in range(len(xdata)):
    if xdata[i]>0:
        xdata1.append(xdata[i])
        ydata1.append(ydata[i])
xdata=np.array(xdata1)
ydata=np.array(ydata1)
popt, pcov = curve_fit(func2, xdata, ydata)
a=popt[0]
print("E=", popt[0]/10**6)
t=func2(xdata,a)
ax=pyplot.figure().add_subplot(1,1,1)
ax.plot(xdata,t, color="blue",mew=2.0,label="Hertz Fit")
ax.plot(xdata,ydata,ls="",marker="x",color="red",mew=2.0,label="Data")
ax.legend(loc=2)
pyplot.show()
The "dstart" value basically cuts off the lower portion of the data I don't want to fit, because it is negative and the model doesn't like negative numbers. Currently I have to set "dstart" manually before running the code, and only then do I see the final result.
I started by doing this fitting in Excel with Solver to vary both the "EM" variable and the "dstart" variable simultaneously by nesting the code which adjusts the xdata by "dstart" and cuts off the negative values into the function being fit.
Essentially what I want is:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x,EM,dstart):
    xdata=np.array(x[::-1])
    xdata=dstart-xdata
    xdata=list(xdata)
    xdata1=[]
    for i in range(len(xdata)):
        if xdata[i]>0:
            xdata1.append(xdata[i])
    global xdata2
    xdata2=np.array(xdata1)
    return (((4.0*EM*(np.sqrt(8*10**-9)))/(3.0*(1.0-(0.5**2))*8*10**-9))*(((((xdata2))*1*10**-9)**((3.0/2.0)))))
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
xdata2=list(xdata2)
ydata1=[]
for i in range(len(xdata2)):
    if xdata2[i]>0:
        ydata1.append(ydata[i])
popt, pcov = curve_fit(func2, xdata, ydata)
But this doesn't work: I get the error "ValueError: operands could not be broadcast together with shapes (28,) (30,)". I think what I need is for curve_fit to take the xdata, adjust it by the first guess of "dstart", guess EM and check the fit and the minimized error, then try a new "dstart" to adjust the xdata, guess EM again and check the fit, and so on. As I'm still fairly new to Python I'm definitely out of my element with the curve fit, and I would just use Excel if I didn't have potentially thousands of curves to run.
Any help would be appreciated!
I'll split this in two: conceptual and coding-related.
Conceptual:
Let's start by rephrasing your question. As it stands the answer is: Yes, obviously. Simply absorb the parameter-dependent change of x in the target function. But that won't solve your problem. What you really seem to be interested in is what to do with parameters for which some of the x cannot be processed by your function. There is no one-size-fits-all for that.
You could choose to deem such parameters as unacceptable in which case you'd have to resort to constrained optimisation. There are a few solvers in scipy that can do that.
You could choose to remove the difficult points from the data set before fitting.
You could introduce soft constraints and penalise bad values instead of ruling them out completely.
Programming style:
Avoid for loops in numerical programs. There are gazillions of posts about that on this site, so I'll only give one example:
xdata2=list(xdata2)
ydata1=[]
for i in range(len(xdata2)):
if xdata2[i]>0:
ydata1.append(ydata[i])
can be written in one line that will execute much faster and return an array instead of a list:
ydata1 = ydata[xdata2 > 0]
Look at the numpy tutorial/docs, or search this site for "vectorization", if you want to learn this technique.
Apart from that, no complaints.
Why your second program doesn't work:
You are sieving both your x and your y, so they should end up with the same shape. But then you go on to use the old copy of y instead of the new one, whereas you do use the new x. That's why the shapes don't match.
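As a small illustration of keeping the two arrays aligned, the sketch below builds one boolean mask and applies it to both x and y. The values are hypothetical toy numbers chosen only to show the shapes.

```python
import numpy as np

# Hypothetical toy values, only to show the shapes.
x = np.array([-2.0, -0.5, 0.5, 1.5, 3.0])
y = np.array([0.0, 0.1, 0.4, 1.1, 2.9])

mask = x > 0          # one boolean mask ...
x_kept = x[mask]      # ... applied to x
y_kept = y[mask]      # ... and to y, so both stay aligned

print(x_kept.shape, y_kept.shape)
```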
Btw., the way you've set it up (modifying x within func2) more or less implements the absorb strategy I mentioned earlier. Only, since func2 has no access to y, it cannot change the shape of x.
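One way to absorb "dstart" into the target function without changing the shape of x is to clamp the shifted values at zero instead of removing them, so points before contact simply contribute zero force. The sketch below is a minimal version of that idea, not a definitive implementation. It assumes the dstart - x convention from the second listing and that the force is zero before contact, it fits EM in MPa to keep the two parameters on similar scales, and it uses synthetic stand-in data rather than the question's measurements. The names R, NU, PREFACTOR and hertz are mine.

```python
import numpy as np
from scipy.optimize import curve_fit

# Constants copied from func2 in the question; the names are mine.
R = 8e-9      # the 8*10**-9 factor
NU = 0.5      # the 0.5 inside (1 - 0.5**2)
PREFACTOR = 4.0 * np.sqrt(R) / (3.0 * (1.0 - NU**2) * R)


def hertz(x, em_mpa, dstart):
    """Model with the contact point dstart as a fit parameter.

    Points with dstart - x <= 0 are clamped to zero indentation instead of
    being removed, so the output always has the same shape as x.
    """
    delta = np.clip(dstart - x, 0.0, None)
    return PREFACTOR * (em_mpa * 1e6) * (delta * 1e-9) ** 1.5


# Synthetic stand-in data (NOT the question's measurements).
rng = np.random.default_rng(0)
x = np.linspace(3.5, 34.5, 80)
y = hertz(x, 30.0, 20.0) + rng.normal(0.0, 0.005, x.size)

# Both EM and dstart are varied by the optimiser; p0 is a rough guess.
popt, pcov = curve_fit(hertz, x, y, p0=[10.0, 25.0])
print("E (MPa) =", popt[0], " dstart =", popt[1])
```

Because the model returns one value per input point, curve_fit never sees a shape mismatch, and the same function can be reused unchanged across many curves.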
Here is my problem: I have experimental data to fit with a model, and I used curve_fit from scipy to do it. The script runs without any error or warning, but it doesn't give a satisfying result (it gives me a nearly straight line instead of two Lorentzian-shaped peaks).
The strangest part is that when I pass an array of initial guesses to the fitting function, none of the guessed parameters is modified except the third one (and even that one stays far from the expected value). I did pay attention to the order of the guessed parameters.
Here is the part of the code that does the fit.
X = 927.
Z = 88.
M = 5.e-15
O1 = 92975.
O2 = 93570.
bm = np.arctan2(Z,X)
P0 = 0.
T = np.pi/2.
TM = np.pi/3.
G = 20.
File ="Data.txt"
open(File, "rb")
dat = np.loadtxt(File)
O = dat[:,1]
D = np.sqrt(1./20. *10**(dat[:,7]/10.)*1/((X**2+Z**2)*10**(6)))
def model(W,o1,o2,p0,t,tm,g):
    DB = np.abs((1./M)*(np.cos(bm-tm)*(p0*np.cos(t-tm)/(o1**2-W**2-1.j*g*W))+np.sin(bm-tm)*(p0*np.sin(t-tm)/((o2**2-W**2-1.j*g*W)))))
    return DB
guess = np.array([O1,O2,P0,T,TM,G])
fit , pcov = curve_fit(model, O , D , guess)
I searched for a whole month for an error, but found nothing. Is it possible that the function is too complex for curve_fit?
Thank you in advance for your help. Don't hesitate to ask if you need further information or data.
Here is a plot of O vs. D. The red points are the experimental data and the blue line is the function evaluated with the returned fit parameters (which are unmodified, so they are just the initial guesses).
D = model(O)
It's very hard to tell what is going on with the mixture of constants and very long formulae, but here are a couple of points to consider:
If variables are not changing from their initial values, you should be careful about scaling. Your (X**2+Z**2)*10**(6) will be around ~1e16, which might make it hard to make good numerical derivatives. You may need to modify the value of epsfcn sent to leastsq().
It looks like your model function calculates a complex array. I believe curve_fit() can handle only strictly real values.
You might find the lmfit module useful.
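To make the scaling point concrete, here is a minimal sketch of one common workaround: let the optimizer work on dimensionless parameters of order 1 by expressing each parameter in units of its (non-zero) initial guess, and divide the data by its maximum. The single-resonance function, the synthetic data and all numbers are illustrative assumptions, not the model or data from the question.

```python
import numpy as np
from scipy.optimize import curve_fit


def resonance(W, w0, g, a):
    """Magnitude of a single damped resonance (a simplified stand-in)."""
    return a / np.sqrt((w0**2 - W**2) ** 2 + (g * W) ** 2)


# Synthetic data; all numbers here are illustrative assumptions.
rng = np.random.default_rng(1)
W = np.linspace(92800.0, 93200.0, 400)
true = np.array([93000.0, 20.0, 200.0])
y = resonance(W, *true) + rng.normal(0.0, 1e-6, W.size)

guess = np.array([92990.0, 30.0, 100.0])   # must be non-zero to act as scales
y_scale = y.max()


def scaled(W, *u):
    # u holds the parameters in units of the initial guess (u = 1 at the start).
    return resonance(W, *(np.asarray(u) * guess)) / y_scale


u_opt, u_cov = curve_fit(scaled, W, y / y_scale, p0=np.ones(guess.size))
p_opt = u_opt * guess                      # back to physical units
print(p_opt)
```

A guess of exactly zero cannot serve as a scale in this scheme, so a parameter such as an amplitude needs a non-zero starting value.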
Well, people, thanks to you I finally found a solution!
Instead of using curve_fit, I tried using leastsq directly, following this tutorial, to see what would happen. It worked better than expected: the fit succeeded and gives me the right positions of the peaks and their amplitudes. Here is the corrected code as it works for me.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
X = 927.0
Z = 88.
M = 5.e-15
O1 = 92975.
O2 = 93570.
bm = np.arctan2(Z,X)
P0=1.e-12
T=np.pi/2.
TM=np.pi/3.
G=20.
File ="Data.csv"
open(File, "rb")
dat = np.loadtxt(File)
O = dat[:,1]
D = np.sqrt(1/1000. *10**(dat[:,7]/10.)*50.*1/((X**2+Z**2)*10**(6)))
def resid(p, y, W) :
    o1,o2,p0,t,tm,g = p
    err=y-(np.abs((1./M)*(np.cos(bm-tm)*(p0*np.cos(t-tm)/(o1**2-W**2-1.j*g*W))+np.sin(bm-tm)*(p0*np.sin(t-tm)/((o2**2-W**2-1.j*g*W))))))
    return err
def peval(W,p) :
    return np.abs((1./M)*(np.cos(bm-p[4])*(p[2]*np.cos(p[3]-p[4])/(p[0]**2-W**2-1.j*p[5]*W))+np.sin(bm-p[4])*(p[2]*np.sin(p[3]-p[4])/((p[1]**2-W**2-1.j*p[5]*W)))))
guess = np.array([O1,O2,P0,T,TM,G])
plsq = leastsq(resid,guess,args=(D,O))
print(plsq[0])
plt.yscale('log')
Again, thank you for your attention
I'm trying to get a simple PyMC2 model working in PyMC3. I've gotten the model to run but the models give very different MAP estimates for the variables. Here is my PyMC2 model:
import numpy as np
import pymc
theta = pymc.Normal('theta', 0, .88)
X1 = pymc.Bernoulli('X2', p=pymc.Lambda('a', lambda theta=theta:1./(1+np.exp(-(theta-(-0.75))))), value=[1],observed=True)
X2 = pymc.Bernoulli('X3', p=pymc.Lambda('b', lambda theta=theta:1./(1+np.exp(-(theta-0)))), value=[1],observed=True)
model = pymc.Model([theta, X1, X2])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=25000, burn=5000)
trace = (mcmc.trace('theta')[:])
print("\nThe MAP value for theta is", trace.sum()/len(trace))
That seems to work as expected. I had all sorts of trouble figuring out how to use the equivalent of the pymc.Lambda object in PyMC3. I eventually came across the Deterministic object. The following is my code:
import pymc3
with pymc3.Model() as model:
    theta = pymc3.Normal('theta', 0, 0.88)
    X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+np.exp(-(theta-(-0.75))))), observed=[1])
    X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+np.exp(-(theta-(0))))), observed=[1])
    start=pymc3.find_MAP()
    step=pymc3.NUTS(state=start)
    trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
pymc3.traceplot(trace)
The problem I'm having is that my MAP estimate for theta using PyMC2 is ~0.68 (correct), while the estimate PyMC3 gives is ~0.26 (incorrect). I suspect this has something to do with the way I'm defining the deterministic function. PyMC3 won't let me use a lambda function, so I just have to write the expression in-line. When I try to use lambda theta=theta:... I get this error:
AsTensorError: ('Cannot convert <function <lambda> at 0x157323e60> to TensorType', <type 'function'>)
Something to do with Theano?? Any suggestions would be greatly appreciated!
It works when you use a theano tensor instead of a numpy function in your Deterministic.
import numpy as np
import pymc3
import theano.tensor as tt
with pymc3.Model() as model:
    theta = pymc3.Normal('theta', 0, 0.88)
    X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+tt.exp(-(theta-(-0.75))))), observed=[1])
    X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+tt.exp(-(theta-(0))))), observed=[1])
    start=pymc3.find_MAP()
    step=pymc3.NUTS(state=start)
    trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
print("\nThe MAP value for theta is", np.median(trace['theta']))
pymc3.traceplot(trace);
Here's the output:
Just in case someone else has the same problem, I think I found an answer. After trying different sampling algorithms I found that:
find_MAP gave the incorrect answer
the NUTS sampler gave the incorrect answer
the Metropolis sampler gave the correct answer, yay!
I read somewhere else that the NUTS sampler doesn't work with Deterministic. I don't know why. Maybe that's the case with find_MAP too? But for now I'll stick with Metropolis.
Also, NUTS doesn't handle discrete variables. If you want to use NUTS, you have to split up the samplers:
step1 = pymc3.NUTS([theta])
step2 = pymc3.BinaryMetropolis([X1,X2])
trace = pymc3.sample(10000, [step1, step2], start)
EDIT:
Missed that 'b' and 'c' were defined inline. Removed them from the NUTS function call
The MAP value is not defined as the mean of a distribution, but as its maximum. With pymc2 you can find it with:
M = pymc.MAP(model)
M.fit()
theta.value
which returns array(0.6253614422469552)
This agrees with the MAP that you find with find_MAP in pymc3, which you call start:
{'theta': array(0.6253614811102668)}
The issue of which is a better sampler is a different one, and does not depend on the calculation of the MAP. The MAP calculation is an optimization.
See: https://pymc-devs.github.io/pymc/modelfitting.html#maximum-a-posteriori-estimates for pymc2.
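The difference between the two numbers can be checked without any sampler, because the model has a single parameter. The sketch below evaluates the unnormalised log posterior on a grid and reports both its maximum (the MAP) and its mean. It assumes that 0.88 is the precision tau of the Normal prior, as in PyMC2's parameterisation; the grid range and resolution are arbitrary choices of mine.

```python
import numpy as np


def log_sigmoid(z):
    """log(1 / (1 + exp(-z)))"""
    return -np.log1p(np.exp(-z))


# Grid over theta; the range and resolution are arbitrary choices.
theta = np.linspace(-6.0, 6.0, 120001)

# Unnormalised log posterior: Normal prior with precision tau = 0.88
# (PyMC2's parameterisation, an assumption here) times two Bernoulli
# likelihoods that were both observed as 1.
tau = 0.88
log_post = (-0.5 * tau * theta**2
            + log_sigmoid(theta + 0.75)
            + log_sigmoid(theta))

weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

theta_map = theta[np.argmax(log_post)]      # posterior mode (MAP)
theta_mean = np.sum(theta * weights)        # posterior mean

print("MAP :", theta_map)
print("mean:", theta_mean)
```

The mode should land close to the value returned by pymc.MAP and find_MAP above. The posterior is skewed to the right, so the mean sits above the mode, and a mean computed from a trace should not be read as a MAP.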