Fitting data in Python using curve_fit

I'm trying to fit data in python to obtain the coefficients for the best fit.
The equation I need to fit is:
Vs = a * qt^b * fs^c * ov^d
where I have the data for qt, fs and ov and need to obtain the values for a, b, c and d.
The code I'm using is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
qt = np.array([10867.073, 8074.986, 2208.366, 3066.566, 2945.326, 4795.766, 2249.813, 2018.3])
fs = np.array([229.6, 17.4, 5.3, 0.1, 0.1, 0.1, 0.1, 0.1])
ov = np.array([19.159, 29.054, 37.620, 44.854, 51.721, 58.755, 65.622, 72.492])
Vs = np.array([149.787, 125.3962, 133.927, 110.047, 149.787, 137.809, 201.506, 154.925])
d = np.array([1.018, 1.518, 2.0179, 2.517, 3.017, 3.517, 4.018, 4.52])
def func(a, b, c, d):
    return a*qt**b*fs**c*ov**d
popt, pcov = curve_fit(func, Vs, d)
print(popt[0], popt[1], popt[2])
plt.plot(Vs, d, 'ro',label="Original Data")
plt.plot(Vs, func(Vs,*popt), label="Fitted Curve")
plt.gca().invert_yaxis()
plt.show()
Which produces the following output (Significant figures cut by me):
-0.333528 -0.1413381 -0.3553966
I was hoping to get something more like the example below, where the data is fitted, even though that fit isn't perfect (note that the one below is just an example and is not correct).

The main hitch is the graphical representation of the result. What is drawn has no significance: even with perfect data and a perfect fit, the points will appear very scattered, which is misleading.
It is better to plot (Vs from the computation) divided by (Vs from the data) and compare it to 1, which is the exact value the ratio would take if the fit were perfect.
Note that your problem is a simple linear regression in logarithmic scale:
ln(Vs) = A + b*ln(qt) + c*ln(fs) + d*ln(ov)
Linear regression straight away gives A, b, c and d, and then a = exp(A).
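For concreteness, here is a minimal sketch (my own illustration, not part of the answer above) of that log-space linear regression with numpy, using the arrays from the question; it also prints the computed/measured ratio suggested above:
import numpy as np

qt = np.array([10867.073, 8074.986, 2208.366, 3066.566, 2945.326, 4795.766, 2249.813, 2018.3])
fs = np.array([229.6, 17.4, 5.3, 0.1, 0.1, 0.1, 0.1, 0.1])
ov = np.array([19.159, 29.054, 37.620, 44.854, 51.721, 58.755, 65.622, 72.492])
Vs = np.array([149.787, 125.3962, 133.927, 110.047, 149.787, 137.809, 201.506, 154.925])

# Design matrix for ln(Vs) = A + b*ln(qt) + c*ln(fs) + d*ln(ov)
X = np.column_stack([np.ones_like(qt), np.log(qt), np.log(fs), np.log(ov)])
A, b, c, d = np.linalg.lstsq(X, np.log(Vs), rcond=None)[0]
a = np.exp(A)
print(a, b, c, d)

# Diagnostic suggested above: computed Vs divided by measured Vs, ideally all 1
Vs_fit = a * qt**b * fs**c * ov**d
print(Vs_fit / Vs)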


Curve fitting in Python with constraint of zero slope at edges

I am looking to curve fit the following data, such that I get it to fit a trend with the condition of zero slope at the edges. The output of polyfit fits that data, but not with zero slopes at the edges.
Here is what I'm looking to output - pardon my Paint job. I need it to fit like this so I can properly remove this sine/cosine bias in the data that isn't real towards the center.
Here is the data:
[0.23353535 0.25586247 0.26661164 0.26410896 0.24963951 0.22670266
0.19955422 0.17190263 0.1598439 0.17351905 0.18212444 0.18438673
0.17952432 0.18314894 0.19265689 0.19432385 0.19605163 0.20326011
0.20890851 0.20590997 0.21856518 0.23771665 0.24530019 0.23940831
0.22078396 0.23075128 0.2346082 0.22466281 0.24384843 0.26339594
0.26414153 0.24664183 0.24278978 0.31023648 0.3614195 0.37773436
0.3505998 0.28893167 0.23965877 0.24063917 0.27922502 0.32716477
0.36553767 0.42293146 0.50968856 0.5458872 0.52192533 0.45243764
0.36313155 0.3683921 0.40942553 0.4420537 0.46145585 0.4648034
0.4523771 0.4272876 0.39404616 0.3570107 0.35060245 0.3860975
0.3996996 0.44551122 0.46611032 0.45998383 0.4309985 0.38563925
0.37105605 0.4074444 0.48815584 0.5704579 0.6448988 0.7018853
0.73397845 0.73739105 0.7122451 0.6618154 0.591451 0.5076601
0.48578677 0.47347385 0.4791471 0.48306277 0.47025493 0.43479836
0.44380915 0.45868078 0.5341566 0.57549906 0.55790776 0.56244135
0.57668275 0.561856 0.67564166 0.7512851 0.76957643 0.7266262
0.734133 0.7231936 0.6776926 0.60511285 0.51599765 0.5579323
0.56723005 0.5440337 0.5775593 0.5950776 0.5722321 0.57858473
0.5652703 0.54723704 0.59561515 0.7071321 0.8169259 0.91443264
0.9883759 1.0275097 1.0235045 0.9737119 1.029139 1.1354861
1.1910824 1.1826864 1.1092159 0.9832138 0.9643041 0.92324203
0.9093703 0.88915515 1.0007693 1.0542978 1.0857164 1.0211861
0.88474303 0.8458009 0.76522666 0.7478076 0.90081936 1.0690157
1.1569089 1.1493248 1.0622779 1.0327609 0.9805119 0.9583969
0.8973544 0.9543319 0.9777171 0.94951093 0.97323567 1.0244237
1.0569099 1.0951824 1.0771195 1.3078191 1.7212077 2.09409
2.320331 2.3279085 2.125451 1.7908521 1.4180487 1.0744424
1.0218129 1.0916439 1.1255138 1.125803 1.1139745 1.2187989
1.300092 1.3025533 1.2312403 1.221301 1.2535597 1.2298189
1.1458241 1.1012102 1.0889369 1.1558667 1.3051153 1.4143198
1.6345526 1.8093723 1.9037704 1.8961821 1.7866236 1.5958548
1.3865516 1.5308585 1.6140417 1.627337 1.5733193 1.4981418
1.5048542 1.4935548 1.4798748 1.4131776 1.3792214 1.3728334
1.3683671 1.3593615 1.2995907 1.2965002 1.366058 1.4795257
1.5462885 1.61591 1.5968509 1.5222199 1.6210756 1.7074443
1.8351102 2.3187535 2.6568012 2.7676315 2.6480794 2.3636303
2.0673316 1.9607923 1.8074365 1.713272 1.5893831 1.4734347
1.507817 1.5213271 1.6091452 1.7162323 1.7608733 1.7497622
1.9187828 2.0197518 2.0487514 2.01107 1.9193696 1.7904462
1.8558109 2.1955926 2.4700975 2.6562278 2.675197 2.6645825
2.6295316 2.4182043 2.2114453 2.2506614 2.2086055 2.0497518
1.9557768 1.901191 2.067513 2.1077373 2.0159333 1.8138607
1.5413624 1.600069 1.7631899 1.9541935 1.9340311 1.805134
2.0671906 2.2247658 2.2641945 2.3594956 2.2504601 1.9749025
1.8905054 2.0679731 2.1193469 2.0307171 2.0717037 2.0340347
1.925536 1.7820351 1.9467943 2.315468 2.4834185 2.3751369
2.0240622 1.9363666 2.1732547 2.3113241 2.3264208 2.22015
2.0187428 1.7619076 1.796859 1.8757095 2.0501778 2.44711
2.6179967 2.508112 2.1694388 1.7242104 1.7671669 1.862043
1.8392721 1.7120028 1.6650634 1.6319505 1.482931 1.5240219
1.5815579 1.5691646 1.4766116 1.3731087 1.4666644 1.4061015
1.3652745 1.425564 1.4006845 1.5000012 1.581379 1.6329607
1.6444355 1.6098644 1.5300899 1.6876912 1.8968476 2.048039
2.1006014 2.0271482 1.8300935 1.6986666 1.9628603 2.0521066
1.9337255 1.6407858 1.2583638 1.2110122 1.2476432 1.2360718
1.2886397 1.2862154 1.2343681 1.1458222 1.209224 1.2475786
1.2353342 1.1797879 1.0963987 1.0928186 1.1553882 1.1569618
1.1932304 1.3002363 1.3386917 1.2973225 1.1816871 1.0557054
0.9350373 0.896656 0.8565816 0.90168726 0.9897751 1.02342
1.0232298 1.1199353 1.1466643 1.1081418 1.0377598 1.0348651
1.0223045 1.0607077 1.0089502 0.885213 1.023178 1.1131796
1.1331098 1.0779471 0.9626393 0.81472665 0.85455835 0.87542623
0.87286425 0.89130884 0.9545931 1.0355722 1.0201533 0.93568784
0.9180018 0.8202782 0.7450139 0.72550577 0.68578506 0.6431666
0.66193295 0.6386373 0.7060119 0.7650972 0.80093855 0.803342
0.76590335 0.7151591 0.6946282 0.7136788 0.7714012 0.8022328
0.79840165 0.8543819 0.8586749 0.8028453 0.7383879 0.73423904
0.65107304 0.61139977 0.5940311 0.6151931 0.59349155 0.54995483
0.5837645 0.5891752 0.56406695 0.5638191 0.5762535 0.58305734
0.5830114 0.57470953 0.5568098 0.52852243 0.49031836 0.45275375
0.47168964 0.46634504 0.4600581 0.45332378 0.41508177 0.3834329
0.4137769 0.41392407 0.3824464 0.36310086 0.434278 0.48041886
0.49433306 0.475708 0.43060693 0.36886734 0.34740242 0.34108457
0.36160505 0.40907663 0.43613982 0.4394311 0.42070773 0.38575593
0.3827834 0.4338096 0.46581286 0.45669746 0.40830874 0.3505502
0.32584783 0.3381971 0.33949164 0.36409503 0.3759155 0.3610108
0.37174097 0.39990777 0.38925973 0.34376588 0.32478797 0.32705626
0.3228174 0.30941254 0.28542265 0.2687348 0.25517422 0.26127565
0.27331188 0.3028561 0.31277937 0.29953563 0.2660389 0.27051866
0.2913383 0.30363902 0.30684754 0.3011791 0.28737035 0.26648855
0.26413882 0.25501928 0.23947525 0.21937743 0.19659272 0.18965112
0.21511254 0.23329383 0.24157354 0.2391297 0.22697571 0.20739041
0.1855308 0.18856761 0.19565174 0.20542233 0.21473111 0.22244582
0.22726117 0.22789808 0.22336568 0.21322969 0.20314343 0.2031754
0.19738965 0.1959791 0.20284075 0.20859875 0.21363212 0.21804498
0.22160804 0.22381367]
This came close, but not exactly, as the edges aren't zero slope: How do I fit a sine curve to my data with pylab and numpy?
Is there anything available that will let me do this without having to write up a custom algorithm to handle this? Thanks.
Here is a Lorentzian type of peak equation fitted to your data, and for the "x" values I used an index similar to what I see on the Example Output plot in your post. I have also zoomed in on the peak center to better display the sinusoidal shapes you mention. You may be able to subtract the predicted value from this peak equation to condition or pre-process the raw data as you discuss.
a = 1.7056067124682076E+02
b = 7.2900803359572393E+01
c = 2.5047064423525464E+02
d = 1.4184767800540945E+01
Offset = -2.4940994412221318E-01
y = a/ (b + pow((x-c)/d, 2.0)) + Offset
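For reference, a minimal sketch (my own, not the answerer's original code) of fitting this Lorentzian-type peak with scipy's curve_fit, assuming ydata holds the values listed in the question, x is simply the sample index, and the starting values are taken near the reported solution:
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(len(ydata))

def lorentzian(x, a, b, c, d, offset):
    return a / (b + ((x - c) / d)**2) + offset

p0 = [170.0, 73.0, 250.0, 14.0, -0.25]  # rough guesses near the reported parameters
popt, pcov = curve_fit(lorentzian, x, ydata, p0=p0)
print(popt)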
Starting from your own example based on a sine fit, I added constraints such that the derivatives of the model have to be zero at the end points. I did this using symfit, a package I wrote to make this kind of thing easier. If you prefer to do this using scipy, you can adapt the example to that syntax; symfit is just a wrapper around scipy's minimizers that adds symbolic manipulations using sympy.
import matplotlib.pyplot as plt
from symfit import variables, parameters, Model, Fit, sin, D, Eq

# xdata and ydata are assumed to hold the question's data (x = sample index).

# Make variables and parameters
x, y = variables('x, y')
a, b, c, d = parameters('a, b, c, d')

# Initial guesses
b.value = 1e-2
c.value = 100

# Create a model object
model = Model({y: a * sin(b * x + c) + d})

# Take the derivative and constrain the end-points to be equal to zero.
dydx = D(model[y], x).doit()
constraints = [Eq(dydx.subs(x, xdata[0]), 0),
               Eq(dydx.subs(x, xdata[-1]), 0)]

# Do the fit!
fit = Fit(model, x=xdata, y=ydata, constraints=constraints)
fit_result = fit.execute()
print(fit_result)

plt.plot(xdata, ydata)
plt.plot(xdata, model(x=xdata, **fit_result.params).y)
plt.show()
This prints the following (using the then-current symfit PR #221, which has better reporting of the results):
Parameter   Value          Standard Deviation
a           8.790393e-01   1.879788e-02
b           1.229586e-02   3.824249e-04
c           9.896017e+01   1.011472e-01
d           1.001717e+00   2.928506e-02
Status message Optimization terminated successfully.
Number of iterations 10
Objective <symfit.core.objectives.LeastSquares object at 0x0000016F670DF080>
Minimizer <symfit.core.minimizers.SLSQP object at 0x0000016F78057A58>
Goodness of fit qualifiers:
chi_squared 29.72125657199736
objective_value 14.86062828599868
r_squared 0.8695978050586373
Constraints:
--------------------
Question: a*b*cos(c) == 0?
Answer: 1.5904051811454707e-17
Question: a*b*cos(511*b + c) == 0?
Answer: -6.354261416082215e-17
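If you prefer plain scipy, here is a minimal sketch of the same idea (my adaptation, not part of the answer above): minimize the sum of squared residuals with SLSQP and force the derivative a*b*cos(b*x + c) to be zero at both end points. It assumes ydata holds the question's values and x is the sample index:
import numpy as np
from scipy.optimize import minimize

xdata = np.arange(len(ydata))

def model(p, x):
    a, b, c, d = p
    return a * np.sin(b * x + c) + d

def sum_of_squares(p):
    return np.sum((model(p, xdata) - ydata)**2)

# dy/dx = a*b*cos(b*x + c) must vanish at the first and last x
constraints = [
    {'type': 'eq', 'fun': lambda p: p[0] * p[1] * np.cos(p[1] * xdata[0] + p[2])},
    {'type': 'eq', 'fun': lambda p: p[0] * p[1] * np.cos(p[1] * xdata[-1] + p[2])},
]

res = minimize(sum_of_squares, x0=[1.0, 1e-2, 100.0, 1.0],
               method='SLSQP', constraints=constraints)
print(res.x)  # a, b, c, d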

scipy curve_fit returns initial estimates

To fit a hyperbolic function I am trying to use the following code:
import numpy as np
from scipy.optimize import curve_fit
def hyperbola(x, s_1, s_2, o_x, o_y, c):
    # x   > Input x values
    # s_1 > slope of line 1
    # s_2 > slope of line 2
    # o_x > x offset of crossing of asymptotes
    # o_y > y offset of crossing of asymptotes
    # c   > curvature of hyperbola
    b_2 = (s_1 + s_2) / 2
    b_1 = (s_2 - s_1) / 2
    return o_y + b_1 * (x - o_x) + b_2 * np.sqrt((x - o_x) ** 2 + c ** 2 / 4)
min_fit = np.array([-3.0, 0.0, -2.0, -10.0, 0.0])
max_fit = np.array([0.0, 3.0, 3.0, 0.0, 10.0])
guess = np.array([-2.5/3.0, 4/3.0, 1.0, -4.0, 0.5])
vars, covariance = curve_fit(f=hyperbola, xdata=n_step, ydata=n_mean, p0=guess, bounds=(min_fit, max_fit))
Where n_step and n_mean are measurement values generated earlier on. The code runs fine and gives no error message, but it only returns the initial guess with a very small change. Also, the covariance matrix contains only zeros. I tried to do the same fit with a better initial guess, but that does not have any influence.
Further, I plotted the exact same function with the initial guess as input, and that indeed gives me a function which is close to the real values. Does anyone know where I am making a mistake here? Or am I using the wrong function to make my fit?
The issue must lie with n_step and n_mean (which are not given in the question as currently stated); when trying to reproduce the issue with some arbitrarily chosen set of input parameters, the optimization works as expected. Let's try it out.
First, let's define some arbitrarily chosen input parameters in the given parameter space by
params = [-0.1, 2.95, -1, -5, 5]
Let's see what that looks like:
import matplotlib.pyplot as plt
xs = np.linspace(-30, 30, 100)
plt.plot(xs, hyperbola(xs, *params))
Based on this, let us define some rather crude inputs for xdata and ydata by
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params)
With these, let us run the optimization and see if we match our given parameters:
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [-0.1 2.95 -1. -5. 5. ]
That is, the fit is perfect even though our params are rather different from our guess. In other words, if we are free to choose n_step and n_mean, then the method works as expected.
In order to try to challenge the optimization slightly, we could also try to add a bit of noise:
np.random.seed(42)
xdata = np.linspace(-30, 30, 10)
ydata = hyperbola(xdata, *params) + np.random.normal(0, 10, size=len(xdata))
vars, covariance = curve_fit(f=hyperbola, xdata=xdata, ydata=ydata, p0=guess, bounds=(min_fit, max_fit))
print(vars) # [ -1.18173287e-01 2.84522636e+00 -1.57023215e+00 -6.90851334e-12 6.14480856e-08]
plt.plot(xdata, ydata, '.')
plt.plot(xs, hyperbola(xs, *vars))
Here we note that the optimum ends up being different from both our provided params and the guess, while still staying within the bounds provided by min_fit and max_fit, and it still provides a good fit.
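As a small extra check (my addition, not part of the answer), the covariance matrix returned by curve_fit can be turned into one-sigma uncertainties for the fitted parameters; a covariance of all zeros, as reported in the question, gives zero uncertainties, which is another hint that something is wrong with the inputs rather than with curve_fit itself:
# One-sigma uncertainties of the fitted parameters, derived from the
# covariance matrix that curve_fit returns alongside the optimum.
perr = np.sqrt(np.diag(covariance))
print(perr)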

How to run non-linear regression in python

I have the following information (a dataframe) in Python:
product   baskets   scaling_factor
12345     475       95.5
12345     108       57.7
12345     2         1.4
12345     38        21.9
12345     320       88.8
and I want to run the following non-linear regression and estimate the parameters a, b and c.
The equation that I want to fit is:
scaling_factor = a - (b*np.exp(c*baskets))
In SAS we usually run the following model (it uses the Gauss-Newton method):
proc nlin data=scaling_factors;
parms a=100 b=100 c=-0.09;
model scaling_factor = a - (b * (exp(c*baskets)));
output out=scaling_equation_parms
parms=a b c;
Is there a similar way to estimate the parameters in Python using non-linear regression, and how can I see the plot in Python?
For problems like these I always use scipy.optimize.minimize with my own least squares function. The optimization algorithms don't handle large differences between the various inputs well, so it is a good idea to scale the parameters in your function so that the parameters exposed to scipy are all on the order of 1 as I've done below.
import numpy as np
import scipy.optimize

baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])

def lsq(arg):
    a = arg[0] * 100
    b = arg[1] * 100
    c = arg[2] * 0.1
    now = a - (b * np.exp(c * baskets)) - scaling_factor
    return np.sum(now**2)

guesses = [1, 1, -0.9]
res = scipy.optimize.minimize(lsq, guesses)
print(res.message)
# 'Optimization terminated successfully.'
print(res.x)
# [ 0.97336709 0.98685365 -0.07998282]
print([lsq(guesses), lsq(res.x)])
# [7761.0093358076601, 13.055053196410928]
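To get back to the parameters of the original model scaling_factor = a - b*exp(c*baskets), undo the scaling applied inside lsq (my addition for clarity):
# Undo the internal scaling: a and b were scaled by 100, c by 0.1
a, b, c = res.x[0] * 100, res.x[1] * 100, res.x[2] * 0.1
print(a, b, c)  # roughly 97.3, 98.7, -0.0080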
Of course, as with all minimization problems it is important to use good initial guesses since all of the algorithms can get trapped in a local minimum. The optimization method can be changed by using the method keyword; some of the possibilities are
‘Nelder-Mead’
‘Powell’
‘CG’
‘BFGS’
‘Newton-CG’
The default is BFGS according to the documentation.
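For example, continuing from the code above (illustration only), the same objective minimized with Nelder-Mead instead of the default:
res_nm = scipy.optimize.minimize(lsq, guesses, method='Nelder-Mead')
print(res_nm.x)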
Agreeing with Chris Mueller, I'd also use scipy but scipy.optimize.curve_fit.
The code looks like:
###the top two lines are required on my linux machine
import matplotlib
matplotlib.use('Qt4Agg')
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import numpy as np
from scipy.optimize import curve_fit #we could import more, but this is what we need
###defining your fitfunction
def func(x, a, b, c):
    return a - b * np.exp(c * x)
###OP's data
baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])
###let us guess some start values
initialGuess=[100, 100,-.01]
guessedFactors=[func(x,*initialGuess ) for x in baskets]
###making the actual fit
popt,pcov = curve_fit(func, baskets, scaling_factor,initialGuess)
#one may want to
print(popt)
print(pcov)
###preparing data for showing the fit
basketCont=np.linspace(min(baskets),max(baskets),50)
fittedData=[func(x, *popt) for x in basketCont]
###preparing the figure
fig1 = plt.figure(1)
ax=fig1.add_subplot(1,1,1)
###the three sets of data to plot
ax.plot(baskets,scaling_factor,linestyle='',marker='o', color='r',label="data")
ax.plot(baskets,guessedFactors,linestyle='',marker='^', color='b',label="initial guess")
ax.plot(basketCont,fittedData,linestyle='-', color='#900000',label="fit with ({0:0.2g},{1:0.2g},{2:0.2g})".format(*popt))
###beautification
ax.legend(loc=0, title="graphs", fontsize=12)
ax.set_ylabel("factor")
ax.set_xlabel("baskets")
ax.grid()
ax.set_title(r"$\mathrm{curve}_\mathrm{fit}$")
###putting the covariance matrix nicely
tab = [['{:.2g}'.format(j) for j in i] for i in pcov]
the_table = plt.table(cellText=tab,
                      colWidths=[0.2]*3,
                      loc='upper right', bbox=[0.483, 0.35, 0.5, 0.25])
plt.text(250,65,'covariance:',size=12)
###putting the plot
plt.show()
###done
Eventually, giving you a figure showing the data, the initial guess, and the fitted curve.

scipy curve fitting negative value

I would like to fit a curve with curve_fit and prevent it from becoming negative. Unfortunately, the code below does not work. Any hints? Thanks a lot!
# Imports
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
xData = [0.0009824379203203417, 0.0011014182912933933, 0.0012433979929054324, 0.0014147106052612918, 0.0016240300315499524, 0.0018834904507916608, 0.002210485320720769, 0.002630660216394964, 0.0031830988618379067, 0.003929751681281367, 0.0049735919716217296, 0.0064961201261998095, 0.008841941282883075, 0.012732395447351627, 0.019894367886486918, 0.0353677651315323, 0.07957747154594767, 0.3183098861837907]
yData = [99.61973156923796, 91.79478510744039, 92.79302188621314, 84.32927272723863, 77.75060981602016, 75.62801782349504, 70.48026800610839, 72.21240551953743, 68.14019252499526, 55.23015406920851, 57.212682880377464, 50.777016257727176, 44.871140881319626, 40.544138806850846, 32.489105158795525, 25.65367127756607, 19.894206907130403, 13.057996247388862]
def func(x, m, c, d):
    '''
    Fitting Function
    I put d as an absolute number to prevent negative values for d?
    '''
    return x**m * c + abs(d)
p0 = [-1, 1, 1]
coeff, _ = curve_fit(func, xData, yData, p0) # Fit curve
m, c, d = coeff[0], coeff[1], coeff[2]
print("d: " + str(d)) # Why is it negative!!
Your model actually works fine, as the following plot shows. I used your code and plotted the original data and the data you obtain with the fitted parameters:
As you can see, the data can nicely be reproduced, but you indeed obtain a negative value for d (which need not be a bad thing, depending on the context of the model). If you want to avoid it, I recommend using lmfit, where you can constrain your parameters to certain ranges. The next plot shows the outcome.
As you can see, it also reproduces the data well and you obtain a positive value for d as desired.
namely:
m: -0.35199747
c: 8.48813181
d: 0.05775745
Here is the entire code that reproduces the figures:
# Imports
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
#additional import
from lmfit import minimize, Parameters, Parameter, report_fit
xData = [0.0009824379203203417, 0.0011014182912933933, 0.0012433979929054324, 0.0014147106052612918, 0.0016240300315499524, 0.0018834904507916608, 0.002210485320720769, 0.002630660216394964, 0.0031830988618379067, 0.003929751681281367, 0.0049735919716217296, 0.0064961201261998095, 0.008841941282883075, 0.012732395447351627, 0.019894367886486918, 0.0353677651315323, 0.07957747154594767, 0.3183098861837907]
yData = [99.61973156923796, 91.79478510744039, 92.79302188621314, 84.32927272723863, 77.75060981602016, 75.62801782349504, 70.48026800610839, 72.21240551953743, 68.14019252499526, 55.23015406920851, 57.212682880377464, 50.777016257727176, 44.871140881319626, 40.544138806850846, 32.489105158795525, 25.65367127756607, 19.894206907130403, 13.057996247388862]
def func(x, m, c, d):
    '''
    Fitting Function
    I put d as an absolute number to prevent negative values for d?
    '''
    print(m, c, d)
    return np.power(x, m) * c + d
p0 = [-1, 1, 1]
coeff, _ = curve_fit(func, xData, yData, p0) # Fit curve
m, c, d = coeff[0], coeff[1], coeff[2]
print("d: " + str(d)) # Why is it negative!!
plt.scatter(xData, yData, s=30, marker = "v",label='P')
plt.scatter(xData, func(xData, *coeff), s=30, marker = "v",color="red",label='curvefit')
plt.show()
#####the new approach starts here
def func2(params, x, data):
    m = params['m'].value
    c = params['c'].value
    d = params['d'].value
    model = np.power(x, m) * c + d
    return model - data  # that's what you want to minimize
# create a set of Parameters
params = Parameters()
params.add('m', value= -2) #value is the initial condition
params.add('c', value= 8.)
params.add('d', value= 10.0, min=0) #min=0 prevents that d becomes negative
# do fit, here with leastsq model
result = minimize(func2, params, args=(xData, yData))
# calculate final result
final = yData + result.residual
# write error report
report_fit(params)
try:
    import pylab
    pylab.plot(xData, yData, 'k+')
    pylab.plot(xData, final, 'r')
    pylab.show()
except:
    pass
You could use the scipy.optimize.curve_fit method's bounds option to specify the maximum bound and the minimum bound.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
bounds is a two-tuple (lower, upper) of array_like or scalars. In your case, you just need to specify the lower bound for d. You could use
bounds=([-np.inf, -np.inf, 0], np.inf)
Note: if you provide a scalar as one of the two entries (e.g. the upper bound above), it is automatically applied as the bound for all three coefficients.
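For concreteness, a minimal sketch (my illustration, not part of this answer) of that call applied to the question's data, reusing xData, yData and the imports from the question, and using the plain power-law model without the abs() workaround so that the bound on d does the work (func_plain is my own helper name):
def func_plain(x, m, c, d):
    return np.power(x, m) * c + d

p0 = [-1, 1, 1]
coeff, pcov = curve_fit(func_plain, xData, yData, p0=p0,
                        bounds=([-np.inf, -np.inf, 0], np.inf))
print(coeff)  # the third value, d, is now constrained to be >= 0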
You just need to add one little argument to constrain your parameters. That is:
curve_fit(func, xData, yData, p0, bounds=([m1,c1,d1],[m2,c2,d2]))
where m1, c1, d1 are the lower bounds of the parameters (in your case they should be 0) and m2, c2, d2 are the upper bounds.
If you want all of m, c and d to be positive, the code goes like the following:
curve_fit(func, xData, yData, p0, bounds=(0, numpy.inf))
where all the parameters have a lower bound of 0 and an upper bound of infinity (no bound).

Fit a non-linear function to data/observations with pyMCMC/pyMC

I am trying to fit some data with a Gaussian (and more complex) function(s). I have created a small example below.
My first question is, am I doing it right?
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
It is very hard to find nice guides on how to do this kind of regression in pyMC, perhaps because it's easier to use least squares or a similar approach. However, I have many parameters in the end and need to see how well we can constrain them and compare different models, and pyMC seemed like the good choice for that.
import pymc
import numpy as np
import matplotlib.pyplot as plt; plt.ion()
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
# Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
# define the model/function to be fitted.
def model(x, f):
    amp = pymc.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pymc.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pymc.Normal('ps', 0.13, 40, value=0.15)

    @pymc.deterministic(plot=False)
    def gauss(x=x, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pymc.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pymc.MCMC(model(x,f))
MDL.sample(1e4)
# extract and plot results
y_min = MDL.stats()['gauss']['quantiles'][2.5]
y_max = MDL.stats()['gauss']['quantiles'][97.5]
y_fit = MDL.stats()['gauss']['mean']
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=2, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
I realize that I might have to run more iterations and use burn-in and thinning in the end. The figure plotting the data and the fit is seen below.
The pymc.Matplot.plot(MDL) figures look like this, showing nicely peaked distributions. This is good, right?
My first question is, am I doing it right?
Yes! You need to include a burn-in period, which you know. I like to throw out the first half of my samples. You don't need to do any thinning, but sometimes it will make your post-MCMC work faster to process and smaller to store.
The only other thing I advise is to set a random seed, so that your results are "reproducible": np.random.seed(12345) will do the trick.
Oh, and if I was really giving too much advice, I'd say import seaborn to make the matplotlib results a little more beautiful.
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
One way is to include a latent variable for each error. This works in your example, but will not be feasible if you have many more observations. I'll give a little example to get you started down this road:
import pymc as pm

# add noise to observed x values
x_obs = pm.rnormal(mu=x, tau=(1e4)**-2)

# define the model/function to be fitted.
def model(x_obs, f):
    amp = pm.Uniform('amp', 0.05, 0.4, value=0.15)
    size = pm.Uniform('size', 0.5, 2.5, value=1.0)
    ps = pm.Normal('ps', 0.13, 40, value=0.15)
    x_pred = pm.Normal('x', mu=x_obs, tau=(1e4)**-2)  # this allows error in x_obs

    @pm.deterministic(plot=False)
    def gauss(x=x_pred, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps

    y = pm.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()

MDL = pm.MCMC(model(x_obs, f))
MDL.use_step_method(pm.AdaptiveMetropolis, MDL.x_pred)  # use AdaptiveMetropolis to "learn" how to step
MDL.sample(200000, 100000, 10)  # run chain longer since there are more dimensions
It looks like it may be hard to get good answers if you have noise in x and y:
Here is a notebook collecting this all up.
EDIT: Important note
This has been bothering me for a while now. The answers given by myself and Abraham here are correct in the sense that they add variability to x. HOWEVER: Note that you cannot simply add uncertainty in this way to cancel out the errors you have in your x-values, so that you regress against "true x". The methods in this answer can show you how adding errors to x affects your regression if you have the true x. If you have a mismeasured x, these answers will not help you. Having errors in the x-values is a very tricky problem to solve, as it leads to "attenuation" and an "errors-in-variables effect". The short version is: having unbiased, random errors in x leads to bias in your regression estimates. If you have this problem, check out Carroll, R.J., Ruppert, D., Crainiceanu, C.M. and Stefanski, L.A., 2006. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC., or for a Bayesian approach, Gustafson, P., 2003. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. CRC Press. I ended up solving my specific problem using Carroll et al.'s SIMEX method along with PyMC3. The details are in Carstens, H., Xia, X. and Yadavalli, S., 2017. Low-cost energy meter calibration method for measurement and verification. Applied energy, 188, pp.563-575. It is also available on ArXiv
I converted Abraham Flaxman's answer above into PyMC3, in case someone needs it. The changes are very minor, but they can be confusing nevertheless.
The first is that the @deterministic decorator is replaced by a distribution-like call, var = pymc3.Deterministic(). Second, when generating a vector of normally distributed random variables,
rvs = pymc2.rnormal(mu=mu, tau=tau)
is replaced by
rvs = pymc3.Normal('var_name', mu=mu, tau=tau,shape=size(var)).random()
The complete code is as follows:
import numpy as np
from pymc3 import *
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(12345)
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
#Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
with Model() as model3:
    amp = Uniform('amp', 0.05, 0.4, testval=0.15)
    size = Uniform('size', 0.5, 2.5, testval=1.0)
    ps = Normal('ps', 0.13, 40, testval=0.15)
    gauss = Deterministic('gauss', amp*np.exp(-1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.)))+ps)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    trace = sample(2000, start=start)
# extract and plot results
y_min = np.percentile(trace.gauss,2.5,axis=0)
y_max = np.percentile(trace.gauss,97.5,axis=0)
y_fit = np.percentile(trace.gauss,50,axis=0)
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=1, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
Which results in the y_error figure.
For errors in x (note the 'x' suffix to variables):
# define the model/function to be fitted in PyMC3:
with Model() as modelx:
    x_obsx = Normal('x_obsx', mu=x, tau=(1e4)**-2, shape=40)
    ampx = Uniform('ampx', 0.05, 0.4, testval=0.15)
    sizex = Uniform('sizex', 0.5, 2.5, testval=1.0)
    psx = Normal('psx', 0.13, 40, testval=0.15)
    x_pred = Normal('x_pred', mu=x_obsx, tau=(1e4)**-2*np.ones_like(x_obsx), testval=5*np.ones_like(x_obsx), shape=40)  # this allows error in x_obs
    gauss = Deterministic('gauss', ampx*np.exp(-1*(np.pi**2*sizex*x_pred/(3600.*180.))**2/(4.*np.log(2.)))+psx)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start = find_MAP()
    step = NUTS()
    tracex = sample(20000, start=start)
Which results in the x_error_graph figure.
The last observation is that when doing
traceplot(tracex[100:])
plt.tight_layout();
(result not shown), we can see that sizex seems to be suffering from 'attenuation' or 'regression dilution' due to the error in the measurement of x.
