How can I obtain segmented linear regressions with a priori breakpoints? - python

I need to explain this in excruciating detail because I don't have the statistics background to put it more succinctly. I'm asking here on SO because I am looking for a Python solution, but I might go to stats.SE if that's more appropriate.
I have downhole well data, it might be a bit like this:
Rt T
0.0000 15.0000
4.0054 15.4523
25.1858 16.0761
27.9998 16.2013
35.7259 16.5914
39.0769 16.8777
45.1805 17.3545
45.6717 17.3877
48.3419 17.5307
51.5661 17.7079
64.1578 18.4177
66.8280 18.5750
111.1613 19.8261
114.2518 19.9731
121.8681 20.4074
146.0591 21.2622
148.8134 21.4117
164.6219 22.1776
176.5220 23.4835
177.9578 23.6738
180.8773 23.9973
187.1846 24.4976
210.5131 25.7585
211.4830 26.0231
230.2598 28.5495
262.3549 30.8602
266.2318 31.3067
303.3181 37.3183
329.4067 39.2858
335.0262 39.4731
337.8323 39.6756
343.1142 39.9271
352.2322 40.6634
367.8386 42.3641
380.0900 43.9158
388.5412 44.1891
390.4162 44.3563
395.6409 44.5837
(the Rt variable can be considered a proxy for depth, and T is temperature). I also have 'a priori' data giving me the temperature at Rt=0 and, not shown, some markers that I can use as breakpoints, guides to breakpoints, or at least compare to any discovered breakpoints.
The linear relationship between these two variables is affected by certain processes in some depth intervals. A simple linear regression is
q, T0, r_value, p_value, std_err = stats.linregress(Rt, T)
and looks like this, where you can see the deviations clearly, and the poor fit for T0 (which should be 15):
I want to be able to perform a series of linear regressions (joined at the ends of each segment), but I want to do it:
(a) by NOT specifying the number or locations of breaks,
(b) by specifying the number and location of breaks, and
(c) calculate the coefficients for each segment
I think I can do (b) and (c) by just splitting the data up and handling each piece separately with a bit of care (a sketch of that idea follows below), but I don't know about (a), and wonder if someone knows a simpler way this can be done.
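For (b) and (c), a minimal sketch, assuming the breakpoints are already known: fit a continuous piecewise-linear model on a hinge-function basis so the segments join at their ends, then read the per-segment slopes off the coefficients. The function name piecewise_fit and the example breakpoint values are placeholders, not values from the data above.
import numpy as np
def piecewise_fit(rt, t, breakpoints):
    # Continuous piecewise-linear least squares with fixed breakpoints:
    # T ~ b0 + b1*Rt + sum_k g_k * max(Rt - bk, 0)
    rt = np.asarray(rt, dtype=float)
    t = np.asarray(t, dtype=float)
    columns = [np.ones_like(rt), rt] + [np.clip(rt - bk, 0.0, None) for bk in breakpoints]
    A = np.column_stack(columns)
    coeffs, _, _, _ = np.linalg.lstsq(A, t, rcond=None)
    intercept = coeffs[0]            # fitted T at Rt = 0
    slopes = np.cumsum(coeffs[1:])   # slope of each successive segment
    return intercept, slopes
# hypothetical usage with guessed breakpoints (the a priori markers would go here):
# T0, segment_slopes = piecewise_fit(Rt, T, breakpoints=[60.0, 200.0, 300.0])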
I have seen this: https://stats.stackexchange.com/a/20210/9311, and I think MARS might be a good way to deal with it, but that's just because it looks good; I don't really understand it. I tried it with my data in a blind cut-and-paste way, but again, I don't understand the output.

The short answer is that I solved my problem using R to create a linear regression model, and then used the segmented package to generate the piecewise linear regression from the linear model. I was able to specify the expected number of breakpoints (or knots) n as shown below using psi=NA and K=n.
The long answer is:
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
# example data:
bullard <- structure(list(Rt = c(5.1861, 10.5266, 11.6688, 19.2345, 59.2882,
68.6889, 320.6442, 340.4545, 479.3034, 482.6092, 484.048, 485.7009,
486.4204, 488.1337, 489.5725, 491.2254, 492.3676, 493.2297, 494.3719,
495.2339, 496.3762, 499.6819, 500.253, 501.1151, 504.5417, 505.4038,
507.6278, 508.4899, 509.6321, 522.1321, 524.4165, 527.0027, 529.2871,
531.8733, 533.0155, 544.6534, 547.9592, 551.4075, 553.0604, 556.9397,
558.5926, 561.1788, 562.321, 563.1831, 563.7542, 565.0473, 566.1895,
572.801, 573.9432, 575.6674, 576.2385, 577.1006, 586.2382, 587.5313,
589.2446, 590.1067, 593.4125, 594.5547, 595.8478, 596.99, 598.7141,
599.8563, 600.2873, 603.1429, 604.0049, 604.576, 605.8691, 607.0113,
610.0286, 614.0263, 617.3321, 624.7564, 626.4805, 628.1334, 630.9889,
631.851, 636.4198, 638.0727, 638.5038, 639.646, 644.8184, 647.1028,
647.9649, 649.1071, 649.5381, 650.6803, 651.5424, 652.6846, 654.3375,
656.0508, 658.2059, 659.9193, 661.2124, 662.3546, 664.0787, 664.6498,
665.9429, 682.4782, 731.3561, 734.6619, 778.1154, 787.2919, 803.9261,
814.335, 848.1552, 898.2568, 912.6188, 924.6932, 940.9083), Tem = c(12.7813,
12.9341, 12.9163, 14.6367, 15.6235, 15.9454, 27.7281, 28.4951,
34.7237, 34.8028, 34.8841, 34.9175, 34.9618, 35.087, 35.1581,
35.204, 35.2824, 35.3751, 35.4615, 35.5567, 35.6494, 35.7464,
35.8007, 35.8951, 36.2097, 36.3225, 36.4435, 36.5458, 36.6758,
38.5766, 38.8014, 39.1435, 39.3543, 39.6769, 39.786, 41.0773,
41.155, 41.4648, 41.5047, 41.8333, 41.8819, 42.111, 42.1904,
42.2751, 42.3316, 42.4573, 42.5571, 42.7591, 42.8758, 43.0994,
43.1605, 43.2751, 44.3113, 44.502, 44.704, 44.8372, 44.9648,
45.104, 45.3173, 45.4562, 45.7358, 45.8809, 45.9543, 46.3093,
46.4571, 46.5263, 46.7352, 46.8716, 47.3605, 47.8788, 48.0124,
48.9564, 49.2635, 49.3216, 49.6884, 49.8318, 50.3981, 50.4609,
50.5309, 50.6636, 51.4257, 51.6715, 51.7854, 51.9082, 51.9701,
52.0924, 52.2088, 52.3334, 52.3839, 52.5518, 52.844, 53.0192,
53.1816, 53.2734, 53.5312, 53.5609, 53.6907, 55.2449, 57.8091,
57.8523, 59.6843, 60.0675, 60.8166, 61.3004, 63.2003, 66.456,
67.4, 68.2014, 69.3065)), .Names = c("Rt", "Tem"), class = "data.frame", row.names = c(NA,
-109L))
library(segmented) # Version: segmented_0.2-9.4
# create a linear model
out.lm <- lm(Tem ~ Rt, data = bullard)
# Set X breakpoints: Set psi=NA and K=n:
o <- segmented(out.lm, seg.Z=~Rt, psi=NA, control=seg.control(display=FALSE, K=3))
slope(o) # defaults to confidence level of 0.95 (conf.level=0.95)
# Trickery for placing text labels
r <- o$rangeZ[, 1]
est.psi <- o$psi[, 2]
v <- sort(c(r, est.psi))
xCoord <- rowMeans(cbind(v[-length(v)], v[-1]))
Z <- o$model[, o$nameUV$Z]
id <- sapply(xCoord, function(x) which.min(abs(x - Z)))
yCoord <- broken.line(o)[id]
# create the segmented plot, add linear regression for comparison, and text labels
plot(o, lwd=2, col=2:6, main="Segmented regression", res=TRUE)
abline(out.lm, col="red", lwd=1, lty=2) # dashed line for linear regression
text(xCoord, yCoord,
labels=formatC(slope(o)[[1]][, 1] * 1000, digits=1, format="f"),
pos = 4, cex = 1.3)
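For a pure-Python counterpart of this workflow, one option (an assumption on my part, not something from the original answer) is the third-party pwlf package, which searches for the breakpoint locations itself and so also covers case (a):
# Sketch only: assumes `pip install pwlf` and that rt and tem hold the Rt and Tem
# columns of the bullard data as 1-D numpy arrays.
import numpy as np
import pwlf
my_pwlf = pwlf.PiecewiseLinFit(rt, tem)
breakpoints = my_pwlf.fit(4)   # 4 segments, i.e. roughly the K=3 interior breakpoints above
print("breakpoints:", breakpoints)
# per-segment slopes from the fitted line evaluated at the breakpoints
seg_slopes = np.diff(my_pwlf.predict(breakpoints)) / np.diff(breakpoints)
print("segment slopes:", seg_slopes)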

What you want is technically called spline interpolation, particularly order-1 spline interpolation (which would join straight line segments; order-2 joins parabolas, etc).
There is already a question here on Stack Overflow dealing with spline interpolation in Python which should help with your problem; post back if you have further questions after trying those tips.
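As a hedged sketch of that suggestion in SciPy: a degree-1 smoothing spline is a piecewise-linear fit, and the knots SciPy places act like breakpoints. The smoothing factor s below is a guess, not a recommended value.
import numpy as np
from scipy.interpolate import UnivariateSpline
# assumes Rt and T are 1-D numpy arrays with Rt strictly increasing
spl = UnivariateSpline(Rt, T, k=1, s=2.0)      # k=1 -> straight-line pieces
knots = spl.get_knots()                        # knot locations ~ breakpoints
slopes = np.diff(spl(knots)) / np.diff(knots)  # slope of each linear piece
print("knots:", knots)
print("segment slopes:", slopes)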

A very simple method (not iterative, no initial guess, no bounds to specify) is given on pages 30-31 of the paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf .
NOTE: The method is based on fitting an integral equation. The present example is not a favourable case because the distribution of the abscissas of the points is far from regular (there are no points over large ranges), which makes the numerical integration less accurate. Nevertheless, the piecewise fit is surprisingly not bad.

Related

Fitting two Voigt curves, one after the other, using lmfit

I have the following emission spectrum of neon, collected on a Raman spectrometer (background-subtracted data):
x=np.array([[1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981]])
y=np.array([[-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01]])
I have fitted a single Voigt function using lmfit, specifically:
model = VoigtModel()+ ConstantModel()
params=model.make_params(center=1123.096389, amplitude=1000, sigma=0.27)
result = model.fit(y.flatten(), params, x=x.flatten())
There is a second peak on the LH shoulder (sorry, I can't post an image). People using commercial peak-fitting software fit the first Voigt, then add the second, and the software then adjusts the fits of both. How would I do this in Python?
A related question: is there a way to optimize how many points to include in the peak fit? Right now I am only feeding in x and y data covering a set spectral range. But commercial software optimizes how much range to include in a given peak fit (I presume using residuals). How would I recreate this?
Thanks!
You can do it manually like so:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import VoigtModel, ConstantModel
x=np.array([1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981])
y=np.array([-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01])
# Fit the first (main) Voigt peak plus a constant offset
model = VoigtModel() + ConstantModel()
params = model.make_params(center=1123.0, amplitude=1000, sigma=0.27)
result1 = model.fit(y.flatten(), params, x=x.flatten())
# Fit a second Voigt to the residual left by the first fit
rest = y - result1.best_fit
model = VoigtModel() + ConstantModel()
params = model.make_params(center=1120.5, amplitude=200, sigma=0.27)
result2 = model.fit(rest, params, x=x.flatten())
rest -= result2.best_fit
plt.plot(x, y, label='Original')
plt.plot(x, result1.best_fit, label='1123.0')
plt.plot(x, result2.best_fit, label='1120.5')
plt.plot(x, rest, label='residual')
plt.legend()
plt.show()
You have to make sure that the residual makes sense. In this case, it is quite close to 0, so I'd argue that it is fine.
lmfit does optimize the fit, so it is not necessary to pinpoint the exact value of the peak position. It is also worth pointing out that, because of the resolution of this data (and of spectroscopy in general), the highest points are not necessarily the centre of the peak. For the same reason, some shoulders might not really be shoulders, though in this case it looks like it is one.
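If you want both peaks adjusted together, which is closer to what the commercial software does, a hedged alternative is a single composite model with two prefixed Voigt components fitted simultaneously; the prefixes and starting guesses below are assumptions, not part of the fit above.
from lmfit.models import VoigtModel, ConstantModel
# Sketch: both peaks plus a constant offset in one model, refined together.
two_peaks = VoigtModel(prefix='v1_') + VoigtModel(prefix='v2_') + ConstantModel()
params = two_peaks.make_params(v1_center=1123.0, v1_amplitude=1000, v1_sigma=0.27,
                               v2_center=1120.5, v2_amplitude=200, v2_sigma=0.27,
                               c=0.0)
result = two_peaks.fit(y, params, x=x)
print(result.fit_report())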
For your related question: judging by the lmfit documentation, it uses all of the range you give it. Residuals are not really a solution, since you run into the same problem (what range to consider). I believe the commercial software you mention uses Multivariate Curve Resolution (MCR); these deconvolution problems have been a hot topic for decades, so if you are interested in that kind of solution I suggest reading about MCR.

StandardScaler.inverse_transform() returns the same array as the input :/ Is sklearn broken or am I?

Good evening,
I'm currently pursuing a PhD in chemistry, and in this context I'm trying to apply my limited knowledge of Python and statistics to discriminate samples based on their IR spectra.
After a few weeks of data acquisition I was finally able to build my data set and was about to see what PCA can offer (this was the easy part).
I was able to build my script and get the loadings, scores and everything else that I could possibly need or want. However, I used StandardScaler from sklearn.preprocessing to scale down my data, so (correct me if I'm wrong) I should get back loadings in this "standard scaled" space.
As my data are actual IR spectra, those loadings have a chemical meaning (even though they are not real spectra), e.g. if my PC1 loadings have a peak at XX cm-1, I know that samples with a high PC1 score are likely to contain compounds that absorb at this wavenumber.
So I want to reverse the StandardScaler transformation. I've tried to use StandardScaler.inverse_transform(), however it appears to return the same array that I gave it... which is very frustrating...
I'm trying to do the same thing with my sample spectra and it gives me the same result again. Here is the portion of my script where I tried this:
Wavenumbers = DFF.columns
# in fact this is a little more complicated, but that's the spirit
Spectre = DFF.values.tolist()
# btw DFF is my pandas DataFrame containing the spectra, with features = wavenumbers
SS = StandardScaler(copy=True)
DFF = SS.fit_transform(DFF)  # at this point I use SS for preprocessing before PCA
# I'm then trying to invert SS and get back the first spectrum of the dataset
D = SS.inverse_transform(DFF[0])
# However at this point DFF[0] and D are almost exactly the same, I'm sure, because:
plt.plot(Wavenumbers, D)
plt.plot(Wavenumbers, DFF[0])  # the curves are the same, and:
for i, j in enumerate(D):
    if j == DFF[0][i]:
        pass
    else:
        print("{}".format(j - DFF[0][i]))  # prints nothing bigger than 10e-16
The problem is more than likely my syntax or how I used StandardScaler, but I have no one around me to ask for help with this. Can anyone tell me what I did wrong, or give me a hint on how I could get my loadings back in the "actual real IR spectra" space?
PS: sorry for the wacky English, I hope it is understandable.
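For reference, a minimal round-trip sketch with sklearn's StandardScaler; the point of the sketch (my assumption about where things go wrong, not a diagnosis from the question) is to keep the unscaled dataframe in its own variable so there is something left to compare the inverse transform against.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
DFF = pd.DataFrame(np.random.rand(10, 5))    # stand-in for the IR spectra dataframe
SS = StandardScaler(copy=True)
DFF_scaled = SS.fit_transform(DFF)           # scaled copy; DFF itself is left untouched
DFF_back = SS.inverse_transform(DFF_scaled)  # should recover the original values
print(np.allclose(DFF_back, DFF.values))     # True: the round trip works
print(np.allclose(DFF_scaled, DFF.values))   # False: scaled data differs from the original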
Good evening,
After putting the problem aside for a few days, I finally re-coded the function I needed (as suggested by Robert Dodier).
As a reminder, I wanted a function that could take my data from a pandas dataframe and mean-centre it in order to do PCA, but that could also reverse the preprocessing for later use.
Here is the code I ended up with :
import pandas as pd
import numpy as np

class Scaler:
    std = []
    mean = []

    def fit(self, DF):
        self.std = []
        self.mean = []
        for c in DF.columns:
            self.std.append(DF[c].std())
            self.mean.append(DF[c].mean())

    def transform(self, DF):
        X = np.zeros(shape=DF.shape)
        for i, c in enumerate(DF.columns):
            for j in range(len(DF.index)):
                X[j][i] = (DF[c][j] - self.mean[i]) / self.std[i]
        return X

    def reverse(self, X):
        Y = np.zeros(shape=X.shape)
        for i in range(len(X[0])):
            for j in range(len(X)):
                Y[j][i] = X[j][i] * self.std[i] + self.mean[i]
        return Y

    def fit_transform(self, DF):
        self.fit(DF)
        X = self.transform(DF)
        return X
It's pretty slow and surely very low-tech, but it seems to do the job just fine. I hope it will save some time for other Python beginners.
I designed it to behave as closely as I think sklearn.preprocessing.StandardScaler does (a vectorized sketch of the same idea follows the example below).
Example:
S = Scaler()         # create a scaler object
S.fit(DF)            # fit the scaler to the dataframe (calculates mean and std for every column; DF must be a pd.DataFrame)
X = S.transform(DF)  # returns a np.array with mean-centred data
Y = S.reverse(X)     # reverses the transformation to get back the original data
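The per-column loops above can also be vectorized with pandas/numpy; this is only a sketch of that idea, using the same column means and standard deviations as the class above.
import numpy as np
import pandas as pd
def fit_stats(DF):
    # column means and standard deviations, as in Scaler.fit above
    return DF.mean(), DF.std()
def transform(DF, mean, std):
    return ((DF - mean) / std).to_numpy()
def reverse(X, mean, std):
    return X * std.to_numpy() + mean.to_numpy()
# hypothetical usage:
# mean, std = fit_stats(DF)
# X = transform(DF, mean, std)
# Y = reverse(X, mean, std)   # recovers the original values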
Again, sorry for the hastily typed English, and thanks to Robert for taking the time to answer.

Can I use scipy curve_fit in Python when one of the fitted parameters changes the xdata input array values?

This is my first time posting a question and I'm going to try to make it as clear as I can but feel free to ask questions.
I'm trying to fit a model to a curve using the scipy.curve_fit method as below:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x, EM):
    return (((4.0*EM*(np.sqrt(8*10**-9)))/(3.0*(1.0-(0.5**2))*8*10**-9))*(((((x))*1*10**-9)**((3.0/2.0)))))
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
dstart=20.0
xdata=np.array(xdata[::-1])
xdata=xdata-dstart
xdata=list(xdata)
xdata1=[]
ydata1=[]
for i in range(len(xdata)):
    if xdata[i] > 0:
        xdata1.append(xdata[i])
        ydata1.append(ydata[i])
xdata=np.array(xdata1)
ydata=np.array(ydata1)
popt, pcov = curve_fit(func2, xdata, ydata)
a=popt[0]
print "E=", popt[0]/10**6
t=func2(xdata,a)
ax=pyplot.figure().add_subplot(1,1,1)
ax.plot(xdata,t, color="blue",mew=2.0,label="Hertz Fit")
ax.plot(xdata,ydata,ls="",marker="x",color="red",mew=2.0,label="Data")
ax.legend(loc=2)
pyplot.show()
The "dstart" value basically cuts off the lower portion of the code I don't want to fit because it is negative and the model doesn't like negative numbers. Currently I have to manually set "dstart" before running the code and then I see the final result.
I started by doing this fitting in Excel with Solver, varying both the "EM" variable and the "dstart" variable simultaneously by nesting the code that adjusts the xdata by "dstart" and cuts off the negative values inside the function being fit.
Essentially what I want is:
import numpy as np
import matplotlib.pyplot as pyplot
import scipy
from scipy.optimize import curve_fit
def func2(x, EM, dstart):
    xdata = np.array(x[::-1])
    xdata = dstart - xdata
    xdata = list(xdata)
    xdata1 = []
    for i in range(len(xdata)):
        if xdata[i] > 0:
            xdata1.append(xdata[i])
    global xdata2
    xdata2 = np.array(xdata1)
    return (((4.0*EM*(np.sqrt(8*10**-9)))/(3.0*(1.0-(0.5**2))*8*10**-9))*(((((xdata2))*1*10**-9)**((3.0/2.0)))))
ydata=[-0.003428768, -0.009050058, -0.0037997673999999996, -0.0003833233, -0.007557649, -0.0034860994, -0.0009856887, -0.0017508664, -0.00036931394999999996,
-0.0040713947, -0.005737315000000001, 0.0005120568, -0.007336486, -0.00719302, -0.0039941817, -0.0029785274, -0.0013044578, -0.008190335, -0.00833507,
-0.0074282060000000006, -0.009629990000000001, -0.009425125, -0.008662485999999999, -0.0019445216, -0.008331748, -0.009513038, -0.0047609017, -0.004364422,
-0.010325097, -0.0036570733, -0.0060091914, -0.005655772, -0.0045517069999999995, -0.00066998035, 0.006374902, 0.006445733, 0.0019101816,
0.010262737999999999, 0.011139007, 0.018161469, 0.016963122, 0.022915895, 0.027177791, 0.028707139, 0.040105638, 0.044088004, 0.041657403,
0.052325636999999994, 0.062399405, 0.07020844, 0.076979915, 0.08888523, 0.099634745, 0.10961602, 0.12188646, 0.13677225, 0.15639512, 0.16833586,
0.18849944000000002, 0.21515548, 0.23989769000000002, 0.26319308, 0.29388397, 0.321042, 0.35637776, 0.38564656999999997, 0.4185209, 0.44986692,
0.48931552999999994, 0.52583893, 0.5626885, 0.6051665, 0.6461075, 0.69644346, 0.7447817, 0.7931281, 0.8381386000000001, 0.8883482, 0.9395609999999999,
0.9853629, 1.0377034, 1.0889026, 1.1334094]
xdata=[34.51388, 33.963736999999995,
33.510695, 33.04127, 32.477253, 32.013624, 31.536019999999997, 31.02925, 30.541649999999997,
30.008646, 29.493828, 29.049707, 28.479668, 27.980956, 27.509590000000003, 27.018721, 26.533737, 25.972296,
25.471065, 24.979228000000003, 24.459624, 23.961517, 23.46839, 23.028454, 22.471411, 21.960924, 21.503428000000003,
21.007033, 20.453855, 20.013475, 19.492528, 18.995746999999998, 18.505670000000002, 18.040403, 17.603387, 17.104082,
16.563634, 16.138298000000002, 15.646187, 15.20897, 14.69833, 14.25156, 13.789688, 13.303409, 12.905278, 12.440909, 11.919262,
11.514609, 11.104646, 10.674512, 10.235055, 9.84145, 9.437523, 9.026733, 8.63639, 8.2694065, 7.944733, 7.551445, 7.231599999999999,
6.9697434, 6.690793299999999, 6.3989780000000005, 6.173159, 5.9157856, 5.731453, 5.4929328, 5.2866156, 5.066648000000001, 4.9190496,
4.745381399999999, 4.574569599999999, 4.4540283, 4.3197597000000005, 4.2694026, 4.2012034, 4.133134, 4.035212, 3.9837262, 3.9412007, 3.8503475999999996,
3.8178950000000005, 3.7753053999999997, 3.6728842]
xdata2 = list(xdata2)
ydata1 = []
for i in range(len(xdata2)):
    if xdata2[i] > 0:
        ydata1.append(ydata[i])
popt, pcov = curve_fit(func2, xdata, ydata)
But this doesn't work, as I get "ValueError: operands could not be broadcast together with shapes (28,) (30,)". I think what I need is for curve_fit to bring in the xdata, adjust it by the first guessed "dstart", guess EM, check the fit and minimize the error, then try a new "dstart" to adjust the xdata, guess EM again, check the fit, and so on. As I'm still fairly new to Python I'm definitely out of my element with the curve fit, and I would just use Excel if I didn't have potentially thousands of curves to run.
Any help would be appreciated!
I'll split this in two: conceptual and coding related
Conceptual:
Let's start by rephrasing your question. As it stands the answer is: Yes, obviously. Simply absorb the parameter-dependent change of x in the target function. But that won't solve your problem. What you really seem to be interested in is what to do with parameters for which some of the x cannot be processed by your function. There is no one-size-fits-all for that.
You could choose to deem such parameters as unacceptable in which case you'd have to resort to constrained optimisation. There are a few solvers in scipy that can do that.
You could choose to remove the difficult points from the data set before fitting.
You could introduce soft constraints and penalise bad values instead of ruling them out completely.
Programming style:
Avoid for loops in numerical programs. There are gazillions of posts on that on this site, so I'll only give one example:
xdata2 = list(xdata2)
ydata1 = []
for i in range(len(xdata2)):
    if xdata2[i] > 0:
        ydata1.append(ydata[i])
can be written in one line that will execute much faster and return an array instead of a list:
ydata1 = ydata[xdata2 > 0]
look at the numpy tutorial/docs or search this site for "vectorization" if you want to learn this technique.
Apart from that, no complaints.
Why your second program doesn't work.
You are sieving both your x and your y, so they should have the same shape. But then you go on and use an old copy of y instead of the new one, whereas you do use the new x. That's why the shapes don't match.
Btw, the way you've set it up (modifying x within func2) is more or less the absorb strategy I mentioned earlier. Only, since you have no access to y inside func2, you cannot change its shape to match the filtered x.
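A hedged sketch of that absorb strategy: shift x by dstart inside the model and clamp the negative part to zero instead of dropping points, so x and y always keep the same length and curve_fit can vary both EM and dstart. The clamping choice and the starting guesses are assumptions, and the sign of the shift follows the second code block above.
import numpy as np
from scipy.optimize import curve_fit
def hertz_model(x, EM, dstart):
    # shifted depth, with negative values clamped to 0
    # (the x-reversal from the question is omitted here for simplicity)
    xs = np.clip(dstart - x, 0.0, None)
    return (4.0*EM*np.sqrt(8e-9)) / (3.0*(1.0 - 0.5**2)*8e-9) * (xs*1e-9)**1.5
# hypothetical usage with the xdata/ydata arrays from the question and rough guesses:
# popt, pcov = curve_fit(hertz_model, np.array(xdata), np.array(ydata), p0=[1e6, 20.0])
# EM_fit, dstart_fit = popt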

Non-linear fit in Python 2.7 doesn't give any good result

Here is my problem: I have experimental data to fit to a model. To do this, I used curve_fit from scipy. The script runs without any error or warning, but doesn't give a satisfying result (it gives me a quasi-line instead of the two Lorentzian-shaped peaks).
But the strangest part is that when I give a guess array to the fitting function, none of the guessed parameters is modified except for the third one (and even that stays far from the expected value), even though I paid attention to the order of the guessed parameters.
I give you the part of the code that does the fit.
import numpy as np
from scipy.optimize import curve_fit
X = 927.
Z = 88.
M = 5.e-15
O1 = 92975.
O2 = 93570.
bm = np.arctan2(Z,X)
P0 = 0.
T = np.pi/2.
TM = np.pi/3.
G = 20.
File ="Data.txt"
open(File, "rb")
dat = np.loadtxt(File)
O = dat[:,1]
D = np.sqrt(1./20. *10**(dat[:,7]/10.)*1/((X**2+Z**2)*10**(6)))
def model(W, o1, o2, p0, t, tm, g):
    DB = np.abs((1./M)*(np.cos(bm-tm)*(p0*np.cos(t-tm)/(o1**2-W**2-1.j*g*W))+np.sin(bm-tm)*(p0*np.sin(t-tm)/((o2**2-W**2-1.j*g*W)))))
    return DB
guess = np.array([O1,O2,P0,T,TM,G])
fit , pcov = curve_fit(model, O , D , guess)
I have searched and researched for a whole month to find the error, but still nothing. Is it possible that the function is too complex for curve_fit?
Thank you in advance for your help. Don't hesitate to ask if you need further information or data.
Here is a plot of O vs D. The red points are the experimental data and the blue line is the function evaluated with the returned fit parameters (unmodified, so they are just the guess values):
D = model(O, *guess)
It's very hard to tell what is going on with the mixture of constants and very long formulae, but a couple of points to consider:
If variables are not changing from their initial values, you should be careful about scaling. Your (X**2+Z**2)*10**(6) will be on the order of 1e12, which might make it hard to compute good numerical derivatives. You may need to modify the value of epsfcn passed to leastsq().
It looks like your model function calculates a complex array. I believe curve_fit() can handle only strictly real values.
You might find the lmfit module useful.
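As a hedged sketch of that last suggestion: lmfit can wrap the same model function directly, with W as the independent variable; the starting values simply reuse the guesses from the question and nothing is tuned.
import lmfit
# Sketch: wrap the model() defined above; O and D are the data arrays from the question.
lmodel = lmfit.Model(model, independent_vars=['W'])
params = lmodel.make_params(o1=O1, o2=O2, p0=P0, t=T, tm=TM, g=G)
result = lmodel.fit(D, params, W=O)
print(result.fit_report())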
Well people, thanks to you I finally found a solution!
Instead of using curve_fit, I tried using leastsq directly, following a tutorial, to see what would happen. It worked better than expected: the fit succeeded and gives me the right positions of the peaks and their amplitudes. Here is the corrected code as it works for me.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
X = 927.0
Z = 88.
M = 5.e-15
O1 = 92975.
O2 = 93570.
bm = np.arctan2(Z,X)
P0=1.e-12
T=np.pi/2.
TM=np.pi/3.
G=20.
File ="Data.csv"
open(File, "rb")
dat = np.loadtxt(File)
O = dat[:,1]
D = np.sqrt(1/1000. *10**(dat[:,7]/10.)*50.*1/((X**2+Z**2)*10**(6)))
def resid(p, y, W):
    o1, o2, p0, t, tm, g = p
    err = y - (np.abs((1./M)*(np.cos(bm-tm)*(p0*np.cos(t-tm)/(o1**2-W**2-1.j*g*W))+np.sin(bm-tm)*(p0*np.sin(t-tm)/((o2**2-W**2-1.j*g*W))))))
    return err

def peval(W, p):
    return np.abs((1./M)*(np.cos(bm-p[4])*(p[2]*np.cos(p[3]-p[4])/(p[0]**2-W**2-1.j*p[5]*W))+np.sin(bm-p[4])*(p[2]*np.sin(p[3]-p[4])/((p[1]**2-W**2-1.j*p[5]*W)))))
guess = np.array([O1,O2,P0,T,TM,G])
plsq = leastsq(resid,guess,args=(D,O))
print plsq[0]
plt.yscale('log')
Again, thank you for your attention

PyMC2 and PyMC3 give different results...?

I'm trying to get a simple PyMC2 model working in PyMC3. I've gotten the model to run but the models give very different MAP estimates for the variables. Here is my PyMC2 model:
import numpy as np
import pymc
theta = pymc.Normal('theta', 0, .88)
X1 = pymc.Bernoulli('X2', p=pymc.Lambda('a', lambda theta=theta:1./(1+np.exp(-(theta-(-0.75))))), value=[1],observed=True)
X2 = pymc.Bernoulli('X3', p=pymc.Lambda('b', lambda theta=theta:1./(1+np.exp(-(theta-0)))), value=[1],observed=True)
model = pymc.Model([theta, X1, X2])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=25000, burn=5000)
trace = (mcmc.trace('theta')[:])
print "\nThe MAP value for theta is", trace.sum()/len(trace)
That seems to work as expected. I had all sorts of trouble figuring out how to use the equivalent of the pymc.Lambda object in PyMC3. I eventually came across the Deterministic object. The following is my code:
import numpy as np
import pymc3
with pymc3.Model() as model:
    theta = pymc3.Normal('theta', 0, 0.88)
    X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+np.exp(-(theta-(-0.75))))), observed=[1])
    X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+np.exp(-(theta-(0))))), observed=[1])
    start = pymc3.find_MAP()
    step = pymc3.NUTS(state=start)
    trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
pymc3.traceplot(trace)
The problem I'm having is that my MAP estimate for theta using PyMC2 is ~0.68 (correct), while the estimate PyMC3 gives is ~0.26 (incorrect). I suspect this has something to do with the way I'm defining the deterministic function. PyMC3 won't let me use a lambda function, so I just have to write the expression in-line. When I try to use lambda theta=theta:... I get this error:
AsTensorError: ('Cannot convert <function <lambda> at 0x157323e60> to TensorType', <type 'function'>)
Something to do with Theano?? Any suggestions would be greatly appreciated!
It works when you use a theano tensor instead of a numpy function in your Deterministic.
import numpy as np
import pymc3
import theano.tensor as tt
with pymc3.Model() as model:
    theta = pymc3.Normal('theta', 0, 0.88)
    X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+tt.exp(-(theta-(-0.75))))), observed=[1])
    X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+tt.exp(-(theta-(0))))), observed=[1])
    start = pymc3.find_MAP()
    step = pymc3.NUTS(state=start)
    trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
print "\nThe MAP value for theta is", np.median(trace['theta'])
pymc3.traceplot(trace);
Here's the output:
Just in case someone else has the same problem, I think I found an answer. After trying different sampling algorithms I found that:
find_MAP gave the incorrect answer
the NUTS sampler gave the incorrect answer
the Metropolis sampler gave the correct answer, yay!
I read somewhere else that the NUTS sampler doesn't work with Deterministic. I don't know why. Maybe that's the case with find_MAP too? But for now I'll stick with Metropolis.
Also, NUTS doesn't handle discrete variables. If you want to use NUTS, you have to split up the samplers:
step1 = pymc3.NUTS([theta])
step2 = pymc3.BinaryMetropolis([X1,X2])
trace = pymc3.sample(10000, [step1, step2], start)
EDIT:
Missed that 'b' and 'c' were defined inline. Removed them from the NUTS function call
The MAP value is not defined as the mean of a distribution, but as its maximum. With pymc2 you can find it with:
M = pymc.MAP(model)
M.fit()
theta.value
which returns array(0.6253614422469552)
This agrees with the MAP that you find with find_MAP in pymc3, which you call start:
{'theta': array(0.6253614811102668)}
The issue of which is a better sampler is a different one, and does not depend on the calculation of the MAP. The MAP calculation is an optimization.
See: https://pymc-devs.github.io/pymc/modelfitting.html#maximum-a-posteriori-estimates for pymc2.
