Fitting two Voigt curves, one after the other, using lmfit - python

I have the following emission spectra of Neon collected on a Raman (background subtracted data):
x=np.array([[1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981]])
y=np.array([[-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01]])
I have fitted a single Voigt function using lmfit, specifically:
model = VoigtModel() + ConstantModel()
params = model.make_params(center=1123.096389, amplitude=1000, sigma=0.27)
result = model.fit(y.flatten(), params, x=x.flatten())
There is a second peak on the left-hand (LH) shoulder (sorry, can't post an image). People using commercial peak-fitting software fit the first Voigt, then add the second, and the software then adjusts the fits of both. How would I do this in Python?
A related question - is there a way to optimize how many points to include in the peak fit. Right now, I am only feeding x and y data covering a set spectral range to do the peak fitting. But commercial software optimizes how much range to include in a given peak fit (I presume using residuals). How would I recreate this?
Thanks!

You can do it manually like so:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import VoigtModel, ConstantModel
x=np.array([1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981])
y=np.array([-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01])
# Fit the main peak (plus a constant background) first
model = VoigtModel() + ConstantModel()
params = model.make_params(center=1123.0, amplitude=1000, sigma=0.27)
result1 = model.fit(y, params, x=x)

# Subtract that fit, then fit the shoulder peak to the residual
rest = y - result1.best_fit
model = VoigtModel() + ConstantModel()
params = model.make_params(center=1120.5, amplitude=200, sigma=0.27)
result2 = model.fit(rest, params, x=x)
rest -= result2.best_fit
plt.plot(x, y, label='Original')
plt.plot(x, result1.best_fit, label='1123.0')
plt.plot(x, result2.best_fit, label='1120.5')
plt.plot(x, rest, label='residual')
plt.legend()
plt.show()
You have to make sure that the residual makes sense. In this case it is quite close to 0, so I'd argue that it is fine.
lmfit does optimize the fit, so it is not necessary to pinpoint the exact value of the peak position. It is also important to point out that, because of the resolution of this data (and of spectroscopy in general), the highest point is not necessarily the centre of the peak. For the same reason, some shoulders might not be real shoulders, though in this case it looks like this one is.
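If you want both peaks refined together, which is what the commercial software does once the second peak is added, lmfit can also fit a two-Voigt composite model in one go. A minimal sketch using model prefixes, seeded with the two centres used above (the p1_/p2_ names are just prefix choices, not anything from the question):
model = VoigtModel(prefix='p1_') + VoigtModel(prefix='p2_') + ConstantModel()
params = model.make_params(p1_center=1123.0, p1_amplitude=1000, p1_sigma=0.27,
                           p2_center=1120.5, p2_amplitude=200, p2_sigma=0.27,
                           c=0)
# Both peaks and the constant offset are now adjusted simultaneously
result = model.fit(y, params, x=x)
print(result.fit_report())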
For your related question: judging by the lmfit documentation, it uses the full range you feed it. Residuals are not really a solution, since you run into the same problem (what range to consider). I believe the commercial software you mention uses Multivariate Curve Resolution (MCR). These deconvolution problems have been a hot topic for decades; if you are interested in that kind of solution, I suggest reading up on MCR.

Related

Scipy curve fit doesn't perform a fit and raises "Covariance of the parameters could not be estimated" error

I am trying to do a simple linear curve fit with scipy; normally this method works fine for me, but this time, for a reason unknown to me, it doesn't.
(I suspect that maybe the numbers are so big that they reach the limit of what can be stored under a given data type.)
Regardless of the reason, the idea is to make a plot that looks like this:
As you can see on the axes there, the numbers are of a common order of magnitude. This time, however, I tried to fit much bigger data points, on the order of 1e10. For this I used the following code (here I present only the code for making a scatter plot and then fitting a single data set).
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
ucrt_T = 2/np.sqrt(3)
ucrt_U = 0.1/np.sqrt(3)
T = [314.1, 325.1, 335.1, 345.1, 355.1, 365.1, 374.1, 384.1, 393.1]
T_to_4th = [9733560790.61, 11170378213.80, 12609495509.84, 14183383217.88, 15900203737.92, 17768359469.96, 19586229219.65, 21765930026.49, 23878782252.31]
ucrt_T_lst = [143130823.11, 158701221.00, 173801148.95, 189829733.26, 206814686.75, 224783722.22, 241820148.88, 261735288.93, 280568229.17]
UBlack = [1.9,3.1, 4.4, 5.6, 7.0, 8.7, 10.2, 11.8, 13.4]
def lin_function(x, a, b):
    return a*x + b

def line_fit_2():
    # Add the data points to the plot
    plt.scatter(UBlack, T_to_4th, color='blue')
    plt.errorbar(UBlack, T_to_4th, yerr=ucrt_T, fmt='o')

    # BLACK series
    VltBlack = np.array(UBlack)
    Tt4 = np.array(T_to_4th)

    popt, pcov = curve_fit(lin_function, VltBlack, Tt4, absolute_sigma=False)
    perr = np.sqrt(np.diag(pcov))
    y = lin_function(VltBlack, *popt)

    # Plot styling and appearance
    #plt.plot(Pressure1, y, '--', color='g', label="fit with: $a={:.3f}\pm{:.3f}$, $b={:.3f}\pm{:.3f}$".format(popt[0], perr[0], popt[1], perr[1]))
    plt.plot(VltBlack, y, '--', color='green')
    plt.ylabel(r'$T^4$ in $[K^4]$')
    plt.xlabel(r'Thermometer voltage U in [mV]')
    plt.legend(['Fit', 'Data points'])
    plt.grid()
    plt.show()

line_fit_2()
If you run it, you will find that the scatter plot is created, but the fit isn't executed properly: only a horizontal line is added. Additionally, a warning is raised: "OptimizeWarning: Covariance of the parameters could not be estimated".
I would be very happy to know what I am doing wrong or how to resolve this problem. All help is appreciated!
You've pretty much already answered your question, so I'll just confirm your suspicion: the OptimizeWarning is raised because the underlying optimization algorithm doesn't work properly/diverges when the numbers involved are this large.
The solution is very simple: just scale your input data before using the fitting tool. Just keep the scaling in mind when you add labels to your x/y axes:
T_to_4th = np.array([9733560790.61, 11170378213.80, 12609495509.84, 14183383217.88, 15900203737.92, 17768359469.96, 19586229219.65, 21765930026.49, 23878782252.31])/10e6
ucrt_T_lst = np.array([143130823.11, 158701221.00, 173801148.95, 189829733.26, 206814686.75, 224783722.22, 241820148.88, 261735288.93, 280568229.17])/10e6
What I did is just divide the lists of big numbers by a fixed scale factor (10e6, i.e. 1e7, here). The values are then no longer in their original unit: dividing kPa values by 1e6, for example, would express them in GPa.
To divide the entire list by the same value, first convert it to a numpy array.
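If you need the fitted slope and intercept back in the original units afterwards, you can undo the scaling; a minimal sketch (the scale name is ours, reusing lin_function, UBlack, and T_to_4th from above):
scale = 10e6  # same factor used to shrink T_to_4th above
popt, pcov = curve_fit(lin_function, UBlack, T_to_4th)  # fit on the scaled data
a_scaled, b_scaled = popt

# y was divided by scale, so a*x + b was too; multiply back to recover
a, b = a_scaled * scale, b_scaled * scale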
Hope this helps :)

Python - fitting data with exponential function

I am aware that there are a few questions about a similar subject, although I couldn't find a proper answer.
I would like to fit some data with a function (called Bastenaire) and get the parameter values. Here is the code:
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize
def bastenaire(s, A, B, C, sd):
    logNB = np.log(A) - C*(s - sd) - np.log(s - sd)
    return np.exp(logNB) - B
S=np.array([659,646,634,623,613,595,580,565,551,535,515,493,473,452,432,413,394,374,355,345])
N=np.array([46963,52934,59975,65522,74241,87237,101977,116751,133665,157067,189426,233260,281321,355558,428815,522582,630257,768067,902506,1017280])
fitmb,fitmob=optimize.curve_fit(bastenaire,S,N,p0=(30000,2000000000,0.2,250))
plt.scatter(N,S)
plt.plot(bastenaire(S,*fitmb),S,label='bastenaire')
plt.legend()
plt.show()
However, the curve fit cannot identify the correct parameters, and I get: OptimizeWarning: Covariance of the parameters could not be estimated.
I get the same result when I give no initial parameter values.
Is there any way to tweak something and get results? Should my dataset cover a wider range of values?
Thank you!
Broc
Fitting is tough: you need to constrain the parameter space using bounds and (often) check your initial values.
To make it work, I searched for an initial value where the function had the correct look, then estimated some constraints:
bounds = np.array([(1e4, 1e12), (-np.inf, np.inf), (1e-20, 1e-2), (-2000., 20000)]).T
fitmb, fitmob = optimize.curve_fit(bastenaire,S, N,p0=(1e7,-100.,1e-5,250.), bounds=bounds)
returns
(array([ 1.00000000e+10,  1.03174824e+04,  7.53169772e-03, -7.32901325e+01]),
 array([[ 2.24128391e-06,  6.17858390e+00, -1.44693602e-07, -5.72040842e-03],
        [ 6.17858390e+00,  1.70326029e+07, -3.98881486e-01, -1.57696515e+04],
        [-1.44693602e-07, -3.98881486e-01,  1.14650323e-08,  4.68707940e-04],
        [-5.72040842e-03, -1.57696515e+04,  4.68707940e-04,  1.93358414e+01]]))
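One way to find such an initial value is to plot the model at a candidate p0 on top of the data before fitting. A minimal sketch, reusing S, N, bastenaire, and the matplotlib import from the question (the p0 here is the one used above):
p0 = (1e7, -100., 1e-5, 250.)  # candidate initial guess
plt.scatter(N, S, label='data')
plt.plot(bastenaire(S, *p0), S, label='initial guess')  # same orientation as the question's plot
plt.legend()
plt.show()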

Can I use scipy curve_fit in Python when one of the fitted parameters changes the xdata input array values?

This is my first time posting a question and I'm going to try to make it as clear as I can but feel free to ask questions.
I'm trying to fit a model to a curve using the scipy.curve_fit method as below:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x, EM):
    # Hertz-type contact model: prefactor * x^(3/2), with unit conversions baked in
    return (4.0*EM*np.sqrt(8e-9)) / (3.0*(1.0 - 0.5**2)*8e-9) * (x*1e-9)**1.5
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
dstart = 20.0
xdata = np.array(xdata[::-1])
xdata = xdata - dstart
xdata = list(xdata)
xdata1 = []
ydata1 = []
for i in range(len(xdata)):
    if xdata[i] > 0:
        xdata1.append(xdata[i])
        ydata1.append(ydata[i])
xdata = np.array(xdata1)
ydata = np.array(ydata1)
popt, pcov = curve_fit(func2, xdata, ydata)
a = popt[0]
print("E=", popt[0]/10**6)
t = func2(xdata, a)
ax=pyplot.figure().add_subplot(1,1,1)
ax.plot(xdata,t, color="blue",mew=2.0,label="Hertz Fit")
ax.plot(xdata,ydata,ls="",marker="x",color="red",mew=2.0,label="Data")
ax.legend(loc=2)
pyplot.show()
The "dstart" value basically cuts off the lower portion of the code I don't want to fit because it is negative and the model doesn't like negative numbers. Currently I have to manually set "dstart" before running the code and then I see the final result.
I started by doing this fitting in Excel with Solver to vary both the "EM" variable and the "dstart" variable simultaneously by nesting the code which adjusts the xdata by "dstart" and cuts off the negative values into the function being fit.
Essentially what I want is:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x, EM, dstart):
    xdata = np.array(x[::-1])
    xdata = dstart - xdata
    xdata = list(xdata)
    xdata1 = []
    for i in range(len(xdata)):
        if xdata[i] > 0:
            xdata1.append(xdata[i])
    global xdata2
    xdata2 = np.array(xdata1)
    return (4.0*EM*np.sqrt(8e-9)) / (3.0*(1.0 - 0.5**2)*8e-9) * (xdata2*1e-9)**1.5
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
xdata2 = list(xdata2)
ydata1 = []
for i in range(len(xdata2)):
    if xdata2[i] > 0:
        ydata1.append(ydata[i])
popt, pcov = curve_fit(func2, xdata, ydata)
But this doesn't work, as I get a ValueError: "operands could not be broadcast together with shapes (28,) (30,)". I think what I need is for curve_fit to bring in the xdata, adjust it by the first guessed "dstart", guess EM, and check the fit and the error; then try a new "dstart", adjust the xdata, guess EM, check the fit, and so on. As I'm still fairly new to Python, I'm definitely out of my element with the curve fit, and I would just use Excel if I didn't have potentially thousands of curves to run.
Any help would be appreciated!
I'll split this into two parts: conceptual and coding-related.
Conceptual:
Let's start by rephrasing your question. As it stands, the answer is: yes, obviously. Simply absorb the parameter-dependent change of x into the target function. But that won't solve your problem. What you really seem to be interested in is what to do with parameters for which some of the x cannot be processed by your function. There is no one-size-fits-all answer for that.
You could choose to deem such parameters as unacceptable in which case you'd have to resort to constrained optimisation. There are a few solvers in scipy that can do that.
You could choose to remove the difficult points from the data set before fitting.
You could introduce soft constraints and penalise bad values instead of ruling them out completely.
Programming style:
Avoid for loops in numerical programs. There are gazillions of posts on that on this site, so I'll give only one example:
xdata2 = list(xdata2)
ydata1 = []
for i in range(len(xdata2)):
    if xdata2[i] > 0:
        ydata1.append(ydata[i])
can be written in one line that will execute much faster and return an array instead of a list:
ydata1 = ydata[xdata2 > 0]
Look at the numpy tutorial/docs or search this site for "vectorization" if you want to learn this technique.
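In this case, since x and y must be sieved together, one boolean mask can filter both arrays so their shapes always match (a sketch, assuming both xdata2 and ydata are numpy arrays):
mask = xdata2 > 0        # boolean mask over the shifted x values
xdata1 = xdata2[mask]    # sieve x ...
ydata1 = ydata[mask]     # ... and y with the same mask, so shapes match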
Apart from that, no complaints.
Why your second program doesn't work.
You are sieving both your x and your y, so they should have the same shape. But then you go on and use an old copy of y instead of the new one, whereas you do use the new x. That's why the shapes don't match.
Btw. the way you've set it up (modifying x within func2) more or less implements the absorb strategy I mentioned earlier. Only, since you have no access to y inside the model function, you cannot change the shape of x.
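For example, here is a minimal sketch of that absorb strategy that keeps the shapes consistent; func3, the clamping choice, and the p0 values are ours, not from the question, and x is assumed already reversed as in your first script:
import numpy as np
from scipy.optimize import curve_fit

def func3(x, EM, dstart):
    # Shift x by dstart inside the model and clamp negatives at zero, so the
    # output always has the input's shape and the 3/2 power never sees a
    # negative base (clamped points simply contribute zero to the model).
    xs = np.clip(x - dstart, 0.0, None)
    return (4.0*EM*np.sqrt(8e-9)) / (3.0*(1.0 - 0.5**2)*8e-9) * (xs*1e-9)**1.5

# popt, pcov = curve_fit(func3, xdata, ydata, p0=[1e6, 20.0])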

How to overplot fit results for discrete values in pymc3?

I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
A simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
import numpy as np
import matplotlib.pyplot as plt

# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])

def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))

with Model() as model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, 0.01)
    beta = Normal('beta', 0, 0.01)
    # Calculate probabilities of death
    theta = Deterministic('theta', invlogit(alpha + beta * dose))
    # Data likelihood
    deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start, progressbar=True)

# Posterior median of theta at each dose level
death_fit = np.percentile(trace.theta, 50, axis=0)
plt.plot(dose, death_fit, 'g', marker='.', lw=1.25, ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
Instead, if you want to plot dose vs pred_death where pred_death takes into account the uncertainty in the posterior for alpha and beta, then probably the easiest way is to use the function sample_ppc. Maybe something like:
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
    plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using posterior predictive checks (PPC) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc.
Other options could be to plot the mean value plus some interval of interest.
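For instance, a minimal sketch of that interval idea, assuming the trace from the model above with numpy and matplotlib already imported:
# Posterior median of theta with a 95% credible band at each dose level
theta_med = np.percentile(trace['theta'], 50, axis=0)
theta_lo, theta_hi = np.percentile(trace['theta'], [2.5, 97.5], axis=0)

plt.plot(dose, theta_med, 'g', marker='.')
plt.fill_between(dose, theta_lo, theta_hi, color='g', alpha=0.3)
plt.show()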

How can I obtain segmented linear regressions with a priori breakpoints?

I need to explain this in excruciating detail because I don't have the statistics background to explain it more succinctly. I'm asking here on SO because I am looking for a Python solution, but I might go to stats.SE if that's more appropriate.
I have downhole well data, it might be a bit like this:
Rt T
0.0000 15.0000
4.0054 15.4523
25.1858 16.0761
27.9998 16.2013
35.7259 16.5914
39.0769 16.8777
45.1805 17.3545
45.6717 17.3877
48.3419 17.5307
51.5661 17.7079
64.1578 18.4177
66.8280 18.5750
111.1613 19.8261
114.2518 19.9731
121.8681 20.4074
146.0591 21.2622
148.8134 21.4117
164.6219 22.1776
176.5220 23.4835
177.9578 23.6738
180.8773 23.9973
187.1846 24.4976
210.5131 25.7585
211.4830 26.0231
230.2598 28.5495
262.3549 30.8602
266.2318 31.3067
303.3181 37.3183
329.4067 39.2858
335.0262 39.4731
337.8323 39.6756
343.1142 39.9271
352.2322 40.6634
367.8386 42.3641
380.0900 43.9158
388.5412 44.1891
390.4162 44.3563
395.6409 44.5837
(the Rt variable can be considered a proxy for depth, and T is temperature). I also have a priori data giving me the temperature at Rt=0 and, not shown, some markers that I can use as breakpoints, guides to breakpoints, or at least compare to any discovered breakpoints.
The linear relationship of these two variables is in some depth intervals affected by some processes. A simple linear regression is
q, T0, r_value, p_value, std_err = stats.linregress(Rt, T)
and looks like this, where you can see the deviations clearly, and the poor fit for T0 (which should be 15):
I want to be able to perform a series of linear regressions (joining at ends of each segment), but I want to do it:
(a) by NOT specifying the number or locations of breaks,
(b) by specifying the number and location of breaks, and
(c) calculate the coefficients for each segment
I think I can do (b) and (c) by just splitting the data up and doing each bit separately with a bit of care, but I don't know about (a), and I wonder if someone knows a simpler way this can be done.
I have seen this: https://stats.stackexchange.com/a/20210/9311, and I think MARS might be a good way to deal with it, but that's just because it looks good; I don't really understand it. I tried it with my data in a blind cut'n'paste way and have the output below, but again, I don't understand it:
The short answer is that I solved my problem using R to create a linear regression model, and then used the segmented package to generate the piecewise linear regression from the linear model. I was able to specify the expected number of breakpoints (or knots) n as shown below using psi=NA and K=n.
The long answer is:
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
# example data:
bullard <- structure(list(Rt = c(5.1861, 10.5266, 11.6688, 19.2345, 59.2882,
68.6889, 320.6442, 340.4545, 479.3034, 482.6092, 484.048, 485.7009,
486.4204, 488.1337, 489.5725, 491.2254, 492.3676, 493.2297, 494.3719,
495.2339, 496.3762, 499.6819, 500.253, 501.1151, 504.5417, 505.4038,
507.6278, 508.4899, 509.6321, 522.1321, 524.4165, 527.0027, 529.2871,
531.8733, 533.0155, 544.6534, 547.9592, 551.4075, 553.0604, 556.9397,
558.5926, 561.1788, 562.321, 563.1831, 563.7542, 565.0473, 566.1895,
572.801, 573.9432, 575.6674, 576.2385, 577.1006, 586.2382, 587.5313,
589.2446, 590.1067, 593.4125, 594.5547, 595.8478, 596.99, 598.7141,
599.8563, 600.2873, 603.1429, 604.0049, 604.576, 605.8691, 607.0113,
610.0286, 614.0263, 617.3321, 624.7564, 626.4805, 628.1334, 630.9889,
631.851, 636.4198, 638.0727, 638.5038, 639.646, 644.8184, 647.1028,
647.9649, 649.1071, 649.5381, 650.6803, 651.5424, 652.6846, 654.3375,
656.0508, 658.2059, 659.9193, 661.2124, 662.3546, 664.0787, 664.6498,
665.9429, 682.4782, 731.3561, 734.6619, 778.1154, 787.2919, 803.9261,
814.335, 848.1552, 898.2568, 912.6188, 924.6932, 940.9083), Tem = c(12.7813,
12.9341, 12.9163, 14.6367, 15.6235, 15.9454, 27.7281, 28.4951,
34.7237, 34.8028, 34.8841, 34.9175, 34.9618, 35.087, 35.1581,
35.204, 35.2824, 35.3751, 35.4615, 35.5567, 35.6494, 35.7464,
35.8007, 35.8951, 36.2097, 36.3225, 36.4435, 36.5458, 36.6758,
38.5766, 38.8014, 39.1435, 39.3543, 39.6769, 39.786, 41.0773,
41.155, 41.4648, 41.5047, 41.8333, 41.8819, 42.111, 42.1904,
42.2751, 42.3316, 42.4573, 42.5571, 42.7591, 42.8758, 43.0994,
43.1605, 43.2751, 44.3113, 44.502, 44.704, 44.8372, 44.9648,
45.104, 45.3173, 45.4562, 45.7358, 45.8809, 45.9543, 46.3093,
46.4571, 46.5263, 46.7352, 46.8716, 47.3605, 47.8788, 48.0124,
48.9564, 49.2635, 49.3216, 49.6884, 49.8318, 50.3981, 50.4609,
50.5309, 50.6636, 51.4257, 51.6715, 51.7854, 51.9082, 51.9701,
52.0924, 52.2088, 52.3334, 52.3839, 52.5518, 52.844, 53.0192,
53.1816, 53.2734, 53.5312, 53.5609, 53.6907, 55.2449, 57.8091,
57.8523, 59.6843, 60.0675, 60.8166, 61.3004, 63.2003, 66.456,
67.4, 68.2014, 69.3065)), .Names = c("Rt", "Tem"), class = "data.frame", row.names = c(NA,
-109L))
library(segmented) # Version: segmented_0.2-9.4
# create a linear model
out.lm <- lm(Tem ~ Rt, data = bullard)
# Set X breakpoints: Set psi=NA and K=n:
o <- segmented(out.lm, seg.Z=~Rt, psi=NA, control=seg.control(display=FALSE, K=3))
slope(o) # defaults to confidence level of 0.95 (conf.level=0.95)
# Trickery for placing text labels
r <- o$rangeZ[, 1]
est.psi <- o$psi[, 2]
v <- sort(c(r, est.psi))
xCoord <- rowMeans(cbind(v[-length(v)], v[-1]))
Z <- o$model[, o$nameUV$Z]
id <- sapply(xCoord, function(x) which.min(abs(x - Z)))
yCoord <- broken.line(o)[id]
# create the segmented plot, add linear regression for comparison, and text labels
plot(o, lwd=2, col=2:6, main="Segmented regression", res=TRUE)
abline(out.lm, col="red", lwd=1, lty=2) # dashed line for linear regression
text(xCoord, yCoord,
labels=formatC(slope(o)[[1]][, 1] * 1000, digits=1, format="f"),
pos = 4, cex = 1.3)
What you want is technically called spline interpolation, particularly order-1 spline interpolation (which would join straight line segments; order-2 joins parabolas, etc).
There is already a question here on Stack Overflow dealing with Spline Interpolation in Python, which will help you on your question. Here's the link. Post back if you have further questions after trying those tips.
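A minimal sketch of that idea for cases (b) and (c), assuming Rt and T are numpy arrays of the data above; the knot positions here are placeholder guesses, not fitted breakpoints:
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Order-1 (piecewise-linear) least-squares spline with fixed interior knots;
# Rt must be strictly increasing, as it is in the data above.
knots = [100.0, 200.0, 300.0]
spl = LSQUnivariateSpline(Rt, T, t=knots, k=1)

# Case (c): the derivative of an order-1 spline is piecewise constant, so
# evaluating it once inside each segment gives the per-segment slopes.
print(spl.derivative()([50.0, 150.0, 250.0, 350.0]))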
A very simple method (not iterative, with no initial guess and no bounds to specify) is provided on pages 30-31 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf. The result is:
NOTE: The method is based on fitting an integral equation. The present example is not a favourable case, because the distribution of the abscissas of the points is far from regular (there are no points in large ranges), which makes the numerical integration less accurate. Nevertheless, the piecewise fit is surprisingly not bad.
