Fit two normal distributions (histograms) with MCMC using pymc? - python

I am trying to fit line profiles as detected with a spectrograph on a CCD. For ease of consideration, I have included a demonstration that, if solved, is very similar to the one I actually want to solve.
I've looked at this:
https://stats.stackexchange.com/questions/46626/fitting-model-for-two-normal-distributions-in-pymc
and various other questions and answers, but they are doing something fundamentally different than what I want to do.
import pymc as mc
import numpy as np
import pylab as pl
def GaussFunc(x, amplitude, centroid, sigma):
    return amplitude * np.exp(-0.5 * ((x - centroid) / sigma)**2)
wavelength = np.arange(5000, 5050, 0.02)
# Profile 1
centroid_one = 5025.0
sigma_one = 2.2
height_one = 0.8
profile1 = GaussFunc(wavelength, height_one, centroid_one, sigma_one, )
# Profile 2
centroid_two = 5027.0
sigma_two = 1.2
height_two = 0.5
profile2 = GaussFunc(wavelength, height_two, centroid_two, sigma_two, )
# Measured values
noise = np.random.normal(0.0, 0.02, len(wavelength))
combined = profile1 + profile2 + noise
# If you want to plot what this looks like
pl.plot(wavelength, combined, label="Measured")
pl.plot(wavelength, profile1, color='red', linestyle='dashed', label="1")
pl.plot(wavelength, profile2, color='green', linestyle='dashed', label="2")
pl.title("Feature One and Two")
pl.legend()
My question: Can PyMC (or some variant) give me the mean, amplitude, and sigma for the two components used above?
Please note that the functions that I will actually fit on my real problem are NOT Gaussians -- so please provide the example using a generic function (like GaussFunc in my example), and not a "built-in" pymc.Normal() type function.
Also, I understand model selection is another issue: so with the current noise, 1 component (profile) might be all that is statistically justified. But I'd like to see what the best solution for 1, 2, 3, etc. components would be.
I'm also not wed to the idea of using PyMC -- if scikit-learn, astroML, or some other package seems perfect, please let me know!
EDIT:
I failed a number of ways, but one of the things that I think was on the right track was the following:
sigma_mc_one = mc.Uniform('sig', 0.01, 6.5)
height_mc_one = mc.Uniform('height', 0.1, 2.5)
centroid_mc_one = mc.Uniform('cen', 5015., 5040.)
But I could not construct a mc.model that worked.

This is not the most concise PyMC code, but I wrote it this way to help the reader. It should run and give (really) accurate results.
I used Uniform priors with liberal ranges because I really have no idea what we are modelling. You probably have some idea about the centroid locations, though, and can use better priors there.
### Suggested one runs the above code first.
### Unknowns we are interested in
est_centroid_one = mc.Uniform("est_centroid_one", 5000, 5050 )
est_centroid_two = mc.Uniform("est_centroid_two", 5000, 5050 )
est_sigma_one = mc.Uniform( "est_sigma_one", 0, 5 )
est_sigma_two = mc.Uniform( "est_sigma_two", 0, 5 )
est_height_one = mc.Uniform( "est_height_one", 0, 5 )
est_height_two = mc.Uniform( "est_height_two", 0, 5 )
#std deviation of the noise, converted to precision by tau = 1/sigma**2
precision= 1./mc.Uniform("std", 0, 1)**2
#Set up the model's relationships.
@mc.deterministic( trace = False)
def est_profile_1(x = wavelength, centroid = est_centroid_one, sigma = est_sigma_one, height= est_height_one):
    return GaussFunc( x, height, centroid, sigma )
@mc.deterministic( trace = False)
def est_profile_2(x = wavelength, centroid = est_centroid_two, sigma = est_sigma_two, height= est_height_two):
    return GaussFunc( x, height, centroid, sigma )
@mc.deterministic( trace = False )
def mean( profile_1 = est_profile_1, profile_2 = est_profile_2 ):
    return profile_1 + profile_2
observations = mc.Normal("obs", mean, precision, value = combined, observed = True)
model = mc.Model([est_centroid_one,
est_centroid_two,
est_height_one,
est_height_two,
est_sigma_one,
est_sigma_two,
precision])
#always a good idea to MAP it prior to MCMC, so as to start with good initial values
map_ = mc.MAP( model )
map_.fit()
mcmc = mc.MCMC( model )
mcmc.sample( 50000,40000 ) #try running for longer if not happy with convergence.
Important
Keep in mind the algorithm is agnostic to labeling, so the results might show profile1 with all the characteristics from profile2 and vice versa.
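One way to cope with this label switching after sampling is to relabel every draw so that "component one" is always the one with the smaller centroid. The sketch below uses small NumPy arrays as stand-ins for the posterior traces; the array names and the helper relabel_by_centroid are hypothetical and not part of PyMC, and the trick only makes sense when the components are reasonably well separated in centroid.

```python
import numpy as np

def relabel_by_centroid(centroids, *others):
    """Reorder each draw so that column 0 holds the smaller centroid.

    `centroids` and every array in `others` have shape (n_draws, n_components).
    """
    order = np.argsort(centroids, axis=1)
    relabeled = [np.take_along_axis(centroids, order, axis=1)]
    for arr in others:
        # Apply the same per-draw permutation to the other parameters.
        relabeled.append(np.take_along_axis(arr, order, axis=1))
    return relabeled

# Toy draws (made-up numbers) in which the second draw has the labels swapped.
centroid_draws = np.array([[5025.0, 5027.0],
                           [5027.1, 5024.9]])
height_draws = np.array([[0.8, 0.5],
                         [0.5, 0.8]])

centroid_sorted, height_sorted = relabel_by_centroid(centroid_draws, height_draws)
print(centroid_sorted)
print(height_sorted)
```

After relabeling, per-component summaries (means, quantiles) can be taken column by column without mixing the two profiles.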

Related

Adding constraints to my fitting model using lmfit

I am trying to fit a complex conductivity model (the drude-smith-anderson model) using lmfit.minimize. In that fitting, I want constraints on my parameters c and c1 such that 0<c<1, -1<c1<0 and 0<1+c1-c<1. So, I am using the following code:
#reference: Juluri B.K. "Fitting Complex Metal Dielectric Functions with Differential Evolution Method". http://juluribk.com/?p=1597.
#reference: https://lmfit.github.io/lmfit-py/fitting.html
#import libraries (numdifftools needs to be installed but doesn't need to be imported)
import matplotlib.pyplot as plt
import numpy as np
import lmfit as lmf
import math as mt
#define the complex conductivity model
def model(params,w):
    sigma0 = params["sigma0"].value
    tau = params["tau"].value
    c = params["c"].value
    d = params["d"].value
    c1 = params["c1"].value
    druidanderson = (sigma0/(1-1j*2*mt.pi*w*tau))*(1 + c1/(1-1j*2*mt.pi*w*tau)) - sigma0*c/(1-1j*2*mt.pi*w*d*tau)
    return druidanderson
#defining the complex residues (chi squared is sum of squares of residues)
def complex_residuals(params,w,exp_data):
    delta = model(params,w)
    residual = (abs((delta.real - exp_data.real) / exp_data.real) + abs(
        (delta.imag - exp_data.imag) / exp_data.imag))
    return residual
# importing data from CSV file
importpath = input("Path of CSV file: ") #Asking the location of where your data file is kept (give input in form of path\name.csv)
frequency = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(0)) #path to be changed to the file from which data is taken
conductivity = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(1)) + 1j*np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(2)) #path to be changed to the file from which data is taken
frequency = frequency[np.logical_not(np.isnan(frequency))]
conductivity = conductivity[np.logical_not(np.isnan(conductivity))]
w_for_fit = frequency
eps_for_fit = conductivity
#defining the bounds and initial guesses for the fitting parameters
params = lmf.Parameters()
params.add("sigma0", value = float(input("Guess for \u03C3\u2080: ")), min =10 , max = 5000) #bounds have to be changed manually
params.add("tau", value = float(input("Guess for \u03C4: ")), min = 0.0001, max =10) #bounds have to be changed manually
params.add("c1", value = float(input("Guess for c1: ")), min = -1 , max = 0) #bounds have to be changed manually
params.add("constraint", value = float(input("Guess for constraint: ")), min = 0, max=1)
params.add("c", expr="1+c1-constraint", min = 0, max = 1) #bounds have to be changed manually
params.add("d", value = float(input("Guess for \u03C4_1/\u03C4: ")),min = 100, max = 100000) #bounds have to be changed manually
# minimizing the chi square
minimizer_results = lmf.minimize(complex_residuals, params, args=(w_for_fit, eps_for_fit), method = 'differential_evolution', strategy='best1bin',
popsize=50, tol=0.01, mutation=(0, 1), recombination=0.9, seed=None, callback=None, disp=True, polish=True, init='latinhypercube')
lmf.printfuncs.report_fit(minimizer_results, show_correl=False)
As a result for the fit, I get the following output:
sigma0: 3489.38961 (init = 1000)
tau: 1.2456e-04 (init = 0.01)
c1: -0.99816132 (init = -1)
constraint: 0.98138820 (init = 1)
c: 0.00000000 == '1+c1-constraint'
d: 7333.82306 (init = 1000)
These values don't make any sense as 1+c1-c = -0.97954952 which is not 0 and is thus invalid. How to fix this issue?
Your code is not runnable. The use of input() is sort of stunning - please do not do that. Write code that is pleasant to read and separates i/o from logic.
To make a floating point residual from a complex array, use complex_array.view(float)
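As a minimal illustration of that tip, the sketch below builds a complex residual from two made-up arrays (the names and numbers are placeholders, not the asker's data) and reinterprets it as a float array with interleaved real and imaginary parts.

```python
import numpy as np

# Made-up model and data values, only to show the mechanics.
model_values = np.array([1.0 + 2.0j, 3.0 - 4.0j])
data_values = np.array([0.5 + 2.5j, 3.0 - 3.0j])

complex_residual = model_values - data_values

# view(float) reinterprets each complex128 as two float64 values:
# [re0, im0, re1, im1, ...]. No copy is made.
float_residual = complex_residual.view(float)
print(float_residual)
```

The resulting array is twice as long as the complex one, so a least-squares minimizer sees the real and imaginary misfits as separate residuals.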
Guessing any parameter value to be at or very close to its limit (here, c) is a very bad idea, likely to make the fit harder.
More to your question, you defined c as "evaluate 1+c1-constraint and then apply the bounds min=0, max=1". That is literally, precisely, and exactly what your
params.add("c", expr="1+c1-constraint", min = 0, max = 1)
means: calculate c as 1+c1-constraint, and then apply the bounds [0, 1]. The code is doing exactly what you told it to do.
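The arithmetic can be checked by hand with the reported values. The sketch below is plain Python and does not call lmfit; the clip helper only imitates the "evaluate, then apply bounds" behaviour described above. The second half shows one possible reparametrization, an assumption of mine rather than something from the question: a hypothetical free parameter frac in (0, 1) with c = frac*(1+c1), which keeps 0<c<1 and 0<1+c1-c<1 whenever -1<c1<0.

```python
def clip(value, lower, upper):
    """Imitate 'evaluate the expression, then apply the bounds'."""
    return max(lower, min(value, upper))

# Best-fit values reported in the question.
c1 = -0.99816132
constraint = 0.98138820

raw_c = 1 + c1 - constraint      # what the expression evaluates to
c = clip(raw_c, 0.0, 1.0)        # what ends up reported after bounding
print(raw_c, c)

# Hypothetical alternative: parametrize c as a fraction of (1 + c1).
def c_from_fraction(c1, frac):
    return frac * (1 + c1)

c1_alt, frac = -0.4, 0.25        # arbitrary values inside their bounds
c_alt = c_from_fraction(c1_alt, frac)
delta = 1 + c1_alt - c_alt       # equals (1 - frac) * (1 + c1_alt)
print(c_alt, delta)
```

In lmfit terms this might be expressed with bounded parameters c1 and frac plus an expression such as expr="frac*(1+c1)" for c, with no bounds needed on c itself.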
Unless you know what you are doing (I suspect maybe not ;)), I would strongly advise doing a fit with the default leastsq method before trying to use differential_evolution. It turns out that differential_evolution is not a very good global fitting method (shgo is generally better, though no "global" solver should be considered as very reliable). But, unless you know that you need such a method, you probably do not.
I would also strongly advise you to plot your data and some models evaluated with what you think are reasonable parameters.

lmfit - SineModel+ConstantModel appears inaccurate fit

I'm trying to fit a simple sine function to some experimental data using lmfit and I find that the SineModel with a constant model offset returns, what looks like an inaccurate fit to the data (to me). I suppose it may be helpful to highlight that I am most interested in the frequency of the peaks (and I appreciate that I can simply use a scipy.find_peaks() but would prefer to show a fit to the data).
I use the function below for lmfit model:
def Sine(self, x_axis, y_axis):
    sine = SineModel()
    const = ConstantModel()
    x_fit = np.linspace(min(x_axis), max(x_axis), x_axis.size)
    guess_sine = sine.guess(y_axis, x=x_fit)
    pars = sine.guess(y_axis, x=x_fit)
    sine_offset = SineModel() + ConstantModel()
    pars.add('c', value=1, vary=True)
    result = sine_offset.fit(y_axis, pars, x=x_fit)
    return result
Sine function output (graph and report results) are provided here:
SineModel+ConstModel
I then tried to define my own function, defining my own parameters and evaluating in the same lmfit method, providing sensible "guess" initial values etc.
def Sine_User2(self, x_axis, y_axis):
    def sine_func(x, amplitude, freq, shift, c):
        return amplitude * np.sin(freq * x + shift) + c
    sinemodel = Model(sine_func)
    # Take a FFT of the data to provide a guess starting location for the curve fitting
    x = np.array(x_axis)
    y = np.array(y_axis)
    ff = np.fft.fftfreq(len(x), (x[1] - x[0])) # assume uniform spacing
    Fyy = abs(np.fft.fft(y))
    guess_freq = abs(ff[np.argmax(Fyy[1:]) + 1]) * 2. * np.pi
    guess_amp = np.std(y) * 2.**0.5
    guess_offset = np.mean(y)
    x_fit = np.linspace(min(x_axis), max(x_axis), x_axis.size)
    params = sinemodel.make_params(amplitude = guess_amp, freq = guess_freq, shift = 0, c = guess_offset )
    result = sinemodel.fit(y_axis, params, x = x_fit)
    return result
The output of the user defined model appears to provide a much closer fit to the data, however, the report does not provide uncertainties citing a warning that the "Uncertainties could not be estimated":
SineUser2 function outputs (graph and report results) are provided here: User Defined Model
I then tried to include min/max values for the parameters by replacing the "sinemodel.make_params" line with:
params = Parameters()
params.add('amplitude', value=guess_amp, min = 0)
params.add('freq', value=guess_freq, min=0)
params.add('shift', value=0, min=-2*np.pi, max=2*np.pi)
params.add('c', value=guess_offset)
But the results resort back to the SineModel+ConstModel results seen in the first linked graph/report results. Therefore it must be something to do with the way I'm setting initial values.
The fit using the "SineUser2" function appears to be better. Is there a way to improve the fit for "Sine" function in the first block of code.
Why are the uncertainties not calculated in the second function "Sine_User2"?
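For reference, the FFT-based starting guess used in Sine_User2 can be tried in isolation on a synthetic signal. The sketch below is not a fix for either fit; all numbers are made up, and the signal is chosen so that its frequency falls exactly on an FFT bin. On a short record the bins are spaced 1/(N*dx) apart, so the frequency guess can be off by up to half a bin.

```python
import numpy as np

# Synthetic, evenly spaced signal with a known angular frequency (assumed values).
n_points, spacing = 200, 0.01
x = np.arange(n_points) * spacing
true_omega = 2.0 * np.pi * 5.0                 # 5 cycles per unit of x
y = 1.5 * np.sin(true_omega * x + 0.3) - 11.0

freqs = np.fft.rfftfreq(n_points, d=spacing)   # cycles per unit of x
spectrum = np.abs(np.fft.rfft(y - y.mean()))   # remove the offset before the FFT
peak = np.argmax(spectrum[1:]) + 1             # skip the zero-frequency bin

guess_omega = 2.0 * np.pi * freqs[peak]        # angular frequency for sin(omega*x)
guess_amp = np.std(y) * np.sqrt(2.0)           # std of a pure sine is A/sqrt(2)
guess_offset = y.mean()
print(guess_omega, guess_amp, guess_offset)
```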
Data (.csv):
Wavelength (nm),Power (dBm),,,,,
1549.9,-13.76008731,,,,,
1549.905,-13.69423162,,,,,
1549.91,-12.59004339,,,,,
1549.915,-11.31061848,,,,,
1549.92,-10.58731809,,,,,
1549.925,-10.19024329,,,,,
1549.93,-10.07301418,,,,,
1549.935,-10.19513172,,,,,
1549.94,-10.45582159,,,,,
1549.945,-11.15984161,,,,,
1549.95,-12.15876596,,,,,
1549.955,-13.44674933,,,,,
1549.96,-13.56388277,,,,,
1549.965,-12.2513065,,,,,
1549.97,-11.08699015,,,,,
1549.975,-10.43829185,,,,,
1549.98,-10.12861158,,,,,
1549.985,-10.0962929,,,,,
1549.99,-10.1852173,,,,,
1549.995,-10.55438183,,,,,
1550,-11.19555345,,,,,
1550.005,-12.28715299,,,,,
1550.01,-13.5153863,,,,,
1550.015,-13.47019261,,,,,
1550.02,-12.12394732,,,,,
1550.025,-11.01946751,,,,,
1550.03,-10.42138778,,,,,
1550.035,-10.14438079,,,,,
1550.04,-10.05681218,,,,,
1550.045,-10.17148605,,,,,
1550.05,-10.56046759,,,,,
1550.055,-11.11621478,,,,,
1550.06,-12.19930263,,,,,
1550.065,-13.48428349,,,,,
1550.07,-13.43424913,,,,,
1550.075,-12.08019952,,,,,
1550.08,-11.08731704,,,,,
1550.085,-10.45730899,,,,,
1550.09,-10.11278169,,,,,
1550.095,-10.00651194,,,,,
,,,,,,

Old PyMC3 style grouping traceplot plotted with Arviz

I have an old blogpost where I am training a PyMC3 model. You can find the blogpost here but the gist of the model is shown below.
with pm.Model() as model:
    mu_intercept = pm.Normal('mu_intercept', mu=40, sd=5)
    mu_slope = pm.HalfNormal('mu_slope', 10, shape=(n_diets,))
    mu = mu_intercept + mu_slope[df.diet-1] * df.time
    sigma_intercept = pm.HalfNormal('sigma_intercept', sd=2)
    sigma_slope = pm.HalfNormal('sigma_slope', sd=2, shape=n_diets)
    sigma = sigma_intercept + sigma_slope[df.diet-1] * df.time
    weight = pm.Normal('weight', mu=mu, sd=sigma, observed=df.weight)
    approx = pm.fit(20000, random_seed=42, method="fullrank_advi")
In this dataset I'm estimating the effect of Diet on the weight of chickens. This is what the traceplot looks like.
Look at how pretty it is! Each diet has its own line! Beautiful!
Arviz Changes
This traceplot was made using the older PyMC3 API. Nowadays this functionality has moved to arviz. So I tried redoing this work but ... the plot looks very different.
The code that I'm using here is slightly different. I'm using pm.Data now but I doubt that's supposed to cause this difference.
with pm.Model() as mod:
    time_in = pm.Data("time_in", df['time'].astype(float))
    diet_in = pm.Data("diet_in", dummies)
    intercept = pm.Normal("intercept", 0, 2)
    time_effect = pm.Normal("time_weight_effect", 0, 2, shape=(4,))
    diet = pm.Categorical("diet", p=[0.25, 0.25, 0.25, 0.25], shape=(4,), observed=diet_in)
    sigma = pm.HalfNormal("sigma", 2)
    sigma_time_effect = pm.HalfNormal("time_sigma_effect", 2, shape=(4,))
    weight = pm.Normal("weight",
                       mu=intercept + time_effect.dot(diet_in.T)*time_in,
                       sd=sigma + sigma_time_effect.dot(diet_in.T)*time_in,
                       observed=df.weight)
    trace = pm.sample(5000, return_inferencedata=True)
What do I need to do to get the different colors per DIET back in?
There's a parameter for it in the new plot_trace function. This does the trick:
az.plot_trace(trace, compact=True)

How to get the confidence interval of a Weibull distribution using Python?

I want to perform a Weibull probability fit with 95% confidence bounds in Python. As test data, I use the cycles to failure from a measurement, plotted against the reliability R(t).
So far, I have found a way to perform the Weibull fit, but I still cannot get the confidence bounds. The Weibull plot for the same test data set was already made with Origin, therefore I know which shape I would "expect" for the confidence interval. But I do not understand how to get there.
I found information about Weibull confidence intervals on reliawiki (cf. Bounds on Reliability based on Fisher Matrix confidence bounds) and used the description there to calculate the variance and the upper and lower confidence bound (R_U and R_L).
Here is a working code example for my Weibull fit and my confidence bounds with the test data set, based on the description on reliawiki (cf. Bounds on Reliability). For the fit, I used an OLS model fit.
import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy.optimize import curve_fit
import math
import statsmodels.api as sm
def weibull_ticks(y, pos):
    return "{:.0f}%".format(100 * (1 - np.exp(-np.exp(y))))
def loglog(x):
    return np.log(-np.log(1 - np.asarray(x)))
class weibull_example(object):
    def __init__(self, dat):
        self.fits = {}
        dat.index = np.arange(1, len(dat) + 1)
        dat.sort_values('data', inplace=True)
        #define yaxis-values
        dat['percentile'] = dat.index*1/len(dat)
        self.data = dat
        self.fit()
        self.plot_data()
    def fit(self):
        #fit the data points with the OLS model
        self.data=self.data[:-1]
        x0 = np.log(self.data.dropna()['data'].values)
        Y = loglog(self.data.dropna()['percentile'])
        Yx = sm.add_constant(Y)
        model = sm.OLS(x0, Yx)
        results = model.fit()
        yy = loglog(np.linspace(.001, .999, 100))
        YY = sm.add_constant(yy)
        XX = np.exp(results.predict(YY))
        self.eta = np.exp(results.params[0])
        self.beta = 1 / results.params[1]
        self.fits['syx'] = {'results': results, 'model': model,
                            'line': np.row_stack([XX, yy]),
                            'beta': self.beta,
                            'eta': self.eta}
        cov = results.cov_params()
        #get variance and covariance
        self.beta_var = cov[1, 1]
        self.eta_var = cov[0, 0]
        self.cov = cov[1, 0]
    def plot_data(self, fit='yx'):
        dat = self.data
        #plot data points
        plt.semilogx(dat['data'], loglog(dat['percentile']), 'o')
        fit = 's' + fit
        self.plot_fit(fit)
        ax = plt.gca()
        formatter = mpl.ticker.FuncFormatter(weibull_ticks)
        ax.yaxis.set_major_formatter(formatter)
        yt_F = np.array([0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
                         0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
        yt_lnF = loglog(yt_F)
        plt.yticks(yt_lnF)
        plt.ylim(loglog([.01, .99]))
    def plot_fit(self, fit='syx'):
        dat = self.fits[fit]['line']
        plt.plot(dat[0], dat[1])
        #calculate variance to get confidence bound
        def variance(x):
            return (math.log(x) - math.log(self.eta)) ** 2 * self.beta_var + \
                   (self.beta/self.eta) ** 2 * self.eta_var - \
                   2 * (math.log(x) - math.log(self.eta)) * (-self.beta/self.eta) * self.cov
        #calculate confidence bounds
        def confidence_upper(x):
            return 1-np.exp(-np.exp(self.beta*(math.log(x)-math.log(self.eta)) - 0.95*np.sqrt(variance(x))))
        def confidence_lower(x):
            return 1-np.exp(-np.exp(self.beta*(math.log(x)-math.log(self.eta)) + 0.95*np.sqrt(variance(x))))
        yvals_1 = list(map(confidence_upper, dat[0]))
        yvals_2 = list(map(confidence_lower, dat[0]))
        #plot confidence bounds
        plt.semilogx(dat[0], loglog(yvals_1), linestyle="solid", color="black", linewidth=2,
                     label="fit_u_1", alpha=0.8)
        plt.semilogx(dat[0], loglog(yvals_2), linestyle="solid", color="green", linewidth=2,
                     label="fit_u_1", alpha=0.8)
def main():
    fig, ax1 = plt.subplots()
    ax1.set_xlabel(r"$Cycles\ til\ Failure$")
    ax1.set_ylabel(r"$Weibull\ Percentile$")
    #my data points
    data = pd.DataFrame({'data': [1556, 2595, 11531, 38079, 46046, 57357]})
    weibull_example(data)
    plt.savefig("Weibull.png")
    plt.close(fig)
if __name__ == "__main__":
    main()
The confidence bounds in my plot do not look like I expected. I tried a lot of different 'variances', just to understand the function and to check whether the problem is just a typing error. Meanwhile, I am convinced that the problem is more general and that I misunderstood something in the description on reliawiki. Unfortunately, I really do not see what the problem is and I do not know anyone else I can ask. I did not find an appropriate answer on the internet or on other forums.
That's why I decided to ask this question here. It's the first time I ask a question in a forum. Therefore, I hope that I explained everything sufficiently and that the code example is useful.
Thank you very much :)
Apologies for the very late answer, but I'll provide it for any future readers.
Rather than try implementing this yourself, you may want to consider using a package designed for exactly this called reliability.
Here is the example for your use case.
Remember to upvote this answer if it helps you :)
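For readers who still want to follow the Fisher-matrix route from the question: as far as I can tell from the ReliaWiki description, the bounds are built as u ± K*sqrt(Var(u)), where K is a standard normal quantile and not the confidence level itself. The question's code multiplies by 0.95, which may be one reason the bounds look off (it is probably not the only one). A minimal standard-library sketch of the quantile, with variable names of my own choosing:

```python
from statistics import NormalDist

confidence_level = 0.95
alpha = 1.0 - confidence_level

# Two-sided bounds put alpha/2 in each tail; one-sided bounds put alpha in one tail.
z_two_sided = NormalDist().inv_cdf(1.0 - alpha / 2.0)
z_one_sided = NormalDist().inv_cdf(1.0 - alpha)
print(round(z_two_sided, 3), round(z_one_sided, 3))
```

For 95% two-sided bounds the multiplier is therefore about 1.96, roughly twice the 0.95 used in confidence_upper and confidence_lower.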

Fit a non-linear function to data/observations with pyMCMC/pyMC

I am trying to fit some data with a Gaussian (and more complex) function(s). I have created a small example below.
My first question is, am I doing it right?
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
It is very hard to find nice guides on how to do this kind of regression in pyMC, perhaps because it's easier to use least squares or a similar approach. However, I will have many parameters in the end and need to see how well we can constrain them and compare different models, so pyMC seemed like a good choice for that.
import pymc
import numpy as np
import matplotlib.pyplot as plt; plt.ion()
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
# Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
# define the model/function to be fitted.
def model(x, f):
    amp = pymc.Uniform('amp', 0.05, 0.4, value= 0.15)
    size = pymc.Uniform('size', 0.5, 2.5, value= 1.0)
    ps = pymc.Normal('ps', 0.13, 40, value=0.15)
    @pymc.deterministic(plot=False)
    def gauss(x=x, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps
    y = pymc.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pymc.MCMC(model(x,f))
MDL.sample(1e4)
# extract and plot results
y_min = MDL.stats()['gauss']['quantiles'][2.5]
y_max = MDL.stats()['gauss']['quantiles'][97.5]
y_fit = MDL.stats()['gauss']['mean']
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=2, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
I realize that I might have to run more iterations, use burn in and thinning in the end. The figure plotting the data and the fit is seen here below.
The pymc.Matplot.plot(MDL) figures looks like this, showing nicely peaked distributions. This is good, right?
My first question is, am I doing it right?
Yes! You need to include a burn-in period, which you know. I like to throw out the first half of my samples. You don't need to do any thinning, but sometimes it will make your post-MCMC work faster to process and smaller to store.
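Done by hand on a stored trace, burn-in and thinning are just slicing. The sketch below uses random numbers as a stand-in for one parameter's samples; the names are placeholders and nothing here calls PyMC.

```python
import numpy as np

rng = np.random.default_rng(12345)
raw_trace = rng.normal(size=10000)   # stand-in for one parameter's MCMC samples

burn_in = len(raw_trace) // 2        # throw out the first half
thin = 10                            # keep every 10th remaining sample
kept = raw_trace[burn_in::thin]
print(len(kept))
```

This is the same bookkeeping that the burn and thin arguments of a call like MDL.sample(200000, 100000, 10) ask the sampler to do for you.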
The only other thing I advise is to set a random seed, so that your results are "reproducible": np.random.seed(12345) will do the trick.
Oh, and if I was really giving too much advice, I'd say import seaborn to make the matplotlib results a little more beautiful.
My second question is, how do I add an error in the x-direction, i.e. in the x-position of the observations/data?
One way is to include a latent variable for each error. This works in your example, but will not be feasible if you have many more observations. I'll give a little example to get you started down this road:
import pymc as pm  # this snippet uses the pm alias for PyMC2
# add noise to observed x values
x_obs = pm.rnormal(mu=x, tau=(1e4)**-2)
# define the model/function to be fitted.
def model(x_obs, f):
    amp = pm.Uniform('amp', 0.05, 0.4, value= 0.15)
    size = pm.Uniform('size', 0.5, 2.5, value= 1.0)
    ps = pm.Normal('ps', 0.13, 40, value=0.15)
    x_pred = pm.Normal('x', mu=x_obs, tau=(1e4)**-2) # this allows error in x_obs
    @pm.deterministic(plot=False)
    def gauss(x=x_pred, amp=amp, size=size, ps=ps):
        e = -1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.))
        return amp*np.exp(e)+ps
    y = pm.Normal('y', mu=gauss, tau=1.0/f_error**2, value=f, observed=True)
    return locals()
MDL = pm.MCMC(model(x_obs, f))
MDL.use_step_method(pm.AdaptiveMetropolis, MDL.x_pred) # use AdaptiveMetropolis to "learn" how to step
MDL.sample(200000, 100000, 10) # run chain longer since there are more dimensions
It looks like it may be hard to get good answers if you have noise in x and y:
Here is a notebook collecting this all up.
EDIT: Important note
This has been bothering me for a while now. The answers given by myself and Abraham here are correct in the sense that they add variability to x. HOWEVER: Note that you cannot simply add uncertainty in this way to cancel out the errors you have in your x-values, so that you regress against "true x". The methods in this answer can show you how adding errors to x affects your regression if you have the true x. If you have a mismeasured x, these answers will not help you. Having errors in the x-values is a very tricky problem to solve, as it leads to "attenuation" and an "errors-in-variables effect". The short version is: having unbiased, random errors in x leads to bias in your regression estimates. If you have this problem, check out Carroll, R.J., Ruppert, D., Crainiceanu, C.M. and Stefanski, L.A., 2006. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC., or for a Bayesian approach, Gustafson, P., 2003. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. CRC Press. I ended up solving my specific problem using Carroll et al.'s SIMEX method along with PyMC3. The details are in Carstens, H., Xia, X. and Yadavalli, S., 2017. Low-cost energy meter calibration method for measurement and verification. Applied energy, 188, pp.563-575. It is also available on ArXiv
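The attenuation effect is easy to see in the simplest possible case, a straight-line fit. The sketch below is a toy illustration with made-up numbers, not the Gaussian model from this thread and not the SIMEX method: with unbiased noise added to x, the least-squares slope shrinks by roughly var(x) / (var(x) + var(error)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
true_slope = 2.0

x_true = rng.normal(0.0, 1.0, n)                    # var(x) = 1
y = true_slope * x_true + rng.normal(0.0, 0.1, n)   # small noise in y
x_noisy = x_true + rng.normal(0.0, 1.0, n)          # unbiased error, var = 1

slope_clean = np.polyfit(x_true, y, 1)[0]           # regress on the true x
slope_noisy = np.polyfit(x_noisy, y, 1)[0]          # regress on the mismeasured x

expected_factor = 1.0 / (1.0 + 1.0)                 # var(x) / (var(x) + var(error))
print(slope_clean, slope_noisy, true_slope * expected_factor)
```

Regressing on the true x recovers a slope near 2, while regressing on the noisy x gives a slope near 1, even though the measurement error has zero mean.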
I converted Abraham Flaxman's answer above into PyMC3, in case someone needs it. Some very minor changes, but can be confusing nevertheless.
The first is that the deterministic decorator @deterministic is replaced by a distribution-like call, var=pymc3.Deterministic(). Second, when generating a vector of normally distributed random variables,
rvs = pymc2.rnormal(mu=mu, tau=tau)
is replaced by
rvs = pymc3.Normal('var_name', mu=mu, tau=tau,shape=size(var)).random()
The complete code is as follows:
import numpy as np
from pymc3 import *
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(12345)
x = np.arange(5,400,10)*1e3
# Parameters for gaussian
amp_true = 0.2
size_true = 1.8
ps_true = 0.1
#Gaussian function
gauss = lambda x,amp,size,ps: amp*np.exp(-1*(np.pi**2/(3600.*180.)*size*x)**2/(4.*np.log(2.)))+ps
f_true = gauss(x=x,amp=amp_true, size=size_true, ps=ps_true )
# add noise to the data points
noise = np.random.normal(size=len(x)) * .02
f = f_true + noise
f_error = np.ones_like(f_true)*0.05*f.max()
with Model() as model3:
    amp = Uniform('amp', 0.05, 0.4, testval= 0.15)
    size = Uniform('size', 0.5, 2.5, testval= 1.0)
    ps = Normal('ps', 0.13, 40, testval=0.15)
    gauss=Deterministic('gauss',amp*np.exp(-1*(np.pi**2*size*x/(3600.*180.))**2/(4.*np.log(2.)))+ps)
    y =Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start=find_MAP()
    step=NUTS()
    trace=sample(2000,start=start)
# extract and plot results
y_min = np.percentile(trace.gauss,2.5,axis=0)
y_max = np.percentile(trace.gauss,97.5,axis=0)
y_fit = np.percentile(trace.gauss,50,axis=0)
plt.plot(x,f_true,'b', marker='None', ls='-', lw=1, label='True')
plt.errorbar(x,f,yerr=f_error, color='r', marker='.', ls='None', label='Observed')
plt.plot(x,y_fit,'k', marker='+', ls='None', ms=5, mew=1, label='Fit')
plt.fill_between(x, y_min, y_max, color='0.5', alpha=0.5)
plt.legend()
Which results in
y_error
For errors in x (note the 'x' suffix to variables):
# define the model/function to be fitted in PyMC3:
with Model() as modelx:
    # Normal comes from `from pymc3 import *` above (the original used an undefined pm3 alias here)
    x_obsx = Normal('x_obsx',mu=x, tau=(1e4)**-2, shape=40)
    ampx = Uniform('ampx', 0.05, 0.4, testval=0.15)
    sizex = Uniform('sizex', 0.5, 2.5, testval=1.0)
    psx = Normal('psx', 0.13, 40, testval=0.15)
    x_pred = Normal('x_pred', mu=x_obsx, tau=(1e4)**-2*np.ones_like(x_obsx),testval=5*np.ones_like(x_obsx),shape=40) # this allows error in x_obs
    gauss=Deterministic('gauss',ampx*np.exp(-1*(np.pi**2*sizex*x_pred/(3600.*180.))**2/(4.*np.log(2.)))+psx)
    y = Normal('y', mu=gauss, tau=1.0/f_error**2, observed=f)
    start=find_MAP()
    step=NUTS()
    tracex=sample(20000,start=start)
Which results in:
x_error_graph
The last observation is that when doing
traceplot(tracex[100:])
plt.tight_layout();
(result not shown), we can see that sizex seems to be suffering from 'attenuation' or 'regression dilution' due to the error in the measurement of x.
