Python: MCMC fit to data with variable uncertainties (using lmfit)

Please bear with me as I try to explain what I'm trying to do.
I'm trying to fit an arctangent model to some data. I have two independent measurements in my dataset; one of these has unknown uncertainties.
The model I'm trying to fit has the form:
def model(x, s, d, c):
    return (s/np.pi) * np.arctan(x/d) + c
I can fit the model to the point-cloud data (the set with unknown uncertainties) using something like:
params = lmfit.Parameters()
params['s'] = lmfit.Parameter(name='s', value=-3, min=-10, max=10)
params['d'] = lmfit.Parameter(name='d', value=15, min=0, max=30)
params['c'] = lmfit.Parameter(name='c', value=5, min=-10, max=10)
emcee_kws = dict(steps=10000, burn=300, thin=20, progress=True)
m = lmfit.Model(model)
result_emcee = m.fit(data=y, x=x, params=params, method='emcee', fit_kws=emcee_kws)
But what I would really like to do is fit both of these datasets simultaneously while taking into account the variable data uncertainties.
Any help very much appreciated!

First, please give a more complete example. As the lmfit documentation shows, you can provide uncertainties for the data. If you think you don't know the uncertainties, try setting them to "infinity"; hopefully you will then realize that you do have some idea of the scale of the uncertainties.
Second, don't use the emcee method here: it samples the posterior distribution of the parameters rather than performing an optimization, so it is not appropriate as a way to fit data to a model.
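Here is a minimal sketch of how per-point uncertainties could be passed to lmfit's Model.fit through its weights argument (weights are typically 1/sigma); the data arrays and the 0.2 error level are placeholder assumptions, not values from the question:

import numpy as np
import lmfit

def model(x, s, d, c):
    return (s / np.pi) * np.arctan(x / d) + c

# placeholder data with an assumed per-point uncertainty of 0.2
x = np.linspace(-50, 50, 200)
y = model(x, -3, 15, 5) + np.random.normal(0, 0.2, x.size)
y_err = np.full_like(y, 0.2)

params = lmfit.Parameters()
params.add('s', value=-3, min=-10, max=10)
params.add('d', value=15, min=0, max=30)
params.add('c', value=5, min=-10, max=10)

m = lmfit.Model(model)
# weights = 1/sigma, so points with larger uncertainties pull the fit less
result = m.fit(data=y, x=x, params=params, weights=1.0 / y_err)
print(result.fit_report())

For fitting two datasets simultaneously, the usual lmfit pattern is the one used in the global-fit answer further down this page: write one objective function that returns the concatenated, weighted residuals of both datasets and pass it to lmfit.minimize with a single shared Parameters object.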

Related

Python code for curve fitting by convolution of a Gaussian and multi-exponential decay

I'm developing code to fit data with a model that is the convolution of two functions: a Gaussian with a multi-exponential decay, exp(A*x) + exp(B*x) + .... Fitting with only a Gaussian and/or an exponentially modified Gaussian (https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution) works perfectly fine in lmfit, but when I use the built-in convolution (i.e. np.convolve of the two functions), lmfit doesn't work.
I have tried many examples from the internet. So far I have realized that my functions return inf or nan values, and that my data is not equally spaced, which matters for the convolution. I found a workaround by using the analytical expression of the convolution together with scipy.optimize.curve_fit, but it is clumsy and time consuming. I would like to make it more general by convolving the two functions and using lmfit, where I can control the parameters much more easily.
The data set is also included in the comments for reference.
w = 0.1  # constant
def CONVSum(x, w, *p):
    n = int(len(p) / 3)
    A = p[:n]
    B = p[n:2*n]
    C = p[2*n:3*n]
    # =========================================================================
    # The formula below is the analytical expression for multi-exponential
    # components convolved with a Gaussian, derived following the instructions in
    # http://www.np.ph.bham.ac.uk/research_resources/programs/halflife/gauss_exp_conv.pdf
    # =========================================================================
    fnct = sum(np.float64([A[i] * np.exp(-B[i] * ((x - C[i]) - (0.5 * np.square(w) * B[i])))
                           * (1 + scipy.special.erf(((x - C[i]) - (np.square(w) * B[i])) / (np.sqrt(2) * w)))
                           for i in range(n)]))
    fnct[np.isnan(fnct)] = 0
    fnct[fnct < 1e-12] = 0
    return fnct

N = 4  # number of exponential functions to be fitted
params = np.linspace(1, 0.0001, N*3)  # initial parameters for the multi-exponential
popt, pcov = curve_fit(CONVSum, x, y, p0=params,
                       bounds=((0, 0, 0, 0, -np.inf, -np.inf, -np.inf, -np.inf, -3, -3, -3, -3),
                               (1, 1, 1, 1,  np.inf,  np.inf,  np.inf,  np.inf,  3,  3,  3,  3)),
                       maxfev=1000000)
(image: data fitted with curve_fit)
Any help or hint regarding fitting with a convolution of a Gaussian and multiple exponential decays is highly appreciated. I prefer using lmfit, since I can identify the parameters very nicely and relate them to each other.
Ideally I want to fit my data with parameters where some are shared among the data sets and some are delayed (+ offset).
Well, your script is a bit hard to read and follow closely, with lots of stuff that is not related to your question. Your exgauss function does not guard against infinities: np.exp(x) for x > ~710 will give Inf, and the fit will not be able to proceed.
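As a minimal sketch (not from the original answer), one way to guard an exponential term against overflow is to clip its argument before calling np.exp; the helper name safe_exp and the cutoff value are illustrative assumptions:

import numpy as np

def safe_exp(arg, cutoff=700.0):
    # clip the exponent so np.exp never overflows to Inf (np.exp(710) overflows)
    return np.exp(np.clip(arg, -cutoff, cutoff))

# example: an exponentially modified Gaussian term that stays finite
x = np.linspace(-5, 5, 1001)
amp, dec, cen, sig = 1.0, 2000.0, 0.0, 0.1   # deliberately extreme decay rate
term = amp * safe_exp(-dec * ((x - cen) - 0.5 * sig**2 * dec))
print(np.isfinite(term).all())   # True, even though the raw exponent exceeds 710

Applied inside the model function, this keeps the least-squares residuals finite so the fit can proceed.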
Here is the lmfit equivalent of the curve-fitting code given in the question. I managed to create this using the very helpful instructions and information here and here, but it still needs to be developed.
# =============================================================================
# The formula below is the analytical expression for multi-exponential components
# convolved with a Gaussian, derived following the instructions in
# http://www.np.ph.bham.ac.uk/research_resources/programs/halflife/gauss_exp_conv.pdf
# =============================================================================
def CONVSum(x, params):
    fnct = sum(
        np.float64([
            (params['amp%s_%s' % (n, i)].value)
            * np.exp(-(params['dec%s_%s' % (n, i)].value)
                     * ((x - (params['cen%s_%s' % (n, i)].value))
                        - (0.5 * np.square(params['sig%s_%s' % (n, i)].value)
                           * (params['dec%s_%s' % (n, i)].value))))
            * (1 + scipy.special.erf(
                ((x - (params['cen%s_%s' % (n, i)].value))
                 - (np.square(params['sig%s_%s' % (n, i)].value)
                    * (params['dec%s_%s' % (n, i)].value)))
                / (np.sqrt(2) * (params['sig%s_%s' % (n, i)].value))))
            for n in range(N) for i in wav
        ])
    )
    # normalise so the model peaks at 1
    fnct = fnct / fnct.max()
    return fnct
# =============================================================================
# this global fit was adapted from https://stackoverflow.com/questions/20339234/python-and-lmfit-how-to-fit-multiple-datasets-with-shared-parameters/20341726#20341726
# it is very important that we can identify the shared parameters across datasets
# =============================================================================
def objective(params, x, data):
    """Calculate the total residual for fits to several data sets."""
    ndata = data.shape[0]
    resid = 0.0 * data[:]
    # make residual per data set
    resid = data - CONVSum(x, params)
    # now flatten this to a 1D array, as minimize() needs
    return resid.flatten()
# select datasets
x = df[949].index
data = df[949].values

# create required sets of parameters, one per data set
N = 4        # number of exponential decays
wav = [949]  # the desired data to be fitted
fit_params = Parameters()
for i in wav:
    for n in range(N):
        fit_params.add('amp%s_%s' % (n, i), value=1, min=0.0, max=1)
        fit_params.add('dec%s_%s' % (n, i), value=0.5, min=-1e10, max=1e10)
        fit_params.add('cen%s_%s' % (n, i), value=0.1, min=-3.0, max=1000)
        fit_params.add('sig%s_%s' % (n, i), value=0.1, min=0.05, max=0.5)

# now we constrain some values to have the same value,
# for example assigning sig_1, sig_2 and sig_3 to be equal to sig_0
for i in wav:
    for n in (1, 2, 3):
        print(n, i)
        fit_params['sig%s_%s' % (n, i)].expr = 'sig0_949'
        fit_params['cen%s_%s' % (n, i)].expr = 'cen0_949'

# run the global fit to all the data sets
result = minimize(objective, fit_params, args=(x, data))
report_fit(result.params)

# plot the data sets and fits
plt.close('all')
plt.figure()
for i in wav:
    y_fit = CONVSum(x, result.params)
    plt.plot(x, data, 'o-', x, y_fit, '-')
plt.xscale('symlog')
plt.show()
(image: fitted data with the convolution of a multi-exponential and a Gaussian)
Unfortunately the fitted results are not very satisfying; I am still looking for advice on how to improve this.

PYMC3 Bayesian Prediction Cones

I'm still learning PYMC3, but I cannot find anything on the following problem in the docs. Consider the Bayesian Structural Time Series (BSTS) model from this question with no seasonality. This can be modeled in PYMC3 as follows:
import pymc3, numpy, matplotlib.pyplot

# generate some test data
t = numpy.linspace(0, 2*numpy.pi, 100)
y_full = numpy.cos(5*t)
y_train = y_full[:90]
y_test = y_full[90:]

# specify the model
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_train.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_train)
    trace = pymc3.sample(1000)

y_mean_pred = pymc3.sample_ppc(trace, samples=1000, model=model)['y'].mean(axis=0)

fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t, y_full, c='b')
ax.plot(t[:90], y_mean_pred, c='r')
matplotlib.pyplot.show()
Now I would like to predict the behavior for the next 10 time steps, i.e., y_test. I would also like to include credible regions over this area to produce a Bayesian cone, e.g., see here. Unfortunately the mechanism for producing the cones in the aforementioned link is a little vague. In a more conventional AR model one could learn the mean regression coefficients and manually extend the mean curve. However, in this BSTS model there is no obvious way to do this. Alternatively, if there were regressors, then I could use a theano.shared variable and update it with a finer/extended grid to impute and extrapolate with sample_ppc, but that's not really an option in this setting. Perhaps sample_ppc is a red herring here, but it's unclear how else to proceed. Any help would be welcome.
I think the following works. However, it's super clunky and requires that I know how far in advance I want to predict before I train (in particular it precludes streaming usage or simple EDA). I suspect there is a better way, and I would much rather accept a better solution from someone with more PyMC3 experience.
import numpy, pymc3, matplotlib.pyplot, seaborn

# generate some data
t = numpy.linspace(0, 2*numpy.pi, 100)
y_full = numpy.cos(5*t)

# mask the data that I want to predict (requires knowledge
# that one might not always have at training time).
cutoff_idx = 80
y_obs = numpy.ma.MaskedArray(y_full, numpy.arange(t.size) > cutoff_idx)

# specify and train the model, using the masked array to supply only
# the observed data
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_obs.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_obs)
    trace = pymc3.sample(5000)

y_pred = pymc3.sample_ppc(trace, samples=20000, model=model)['y']
y_pred_mean = y_pred.mean(axis=0)

# compute percentiles
dfp = numpy.percentile(y_pred, [2.5, 25, 50, 70, 97.5], axis=0)

# plot actual data and summary posterior information
pal = seaborn.color_palette('Purples')
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t, y_full, c='g', label='true value', alpha=0.5)
ax.plot(t, y_pred_mean, c=pal[5], label='posterior mean', alpha=0.5)
ax.plot(t, dfp[2, :], alpha=0.75, color=pal[3], label='posterior median')
ax.fill_between(t, dfp[0, :], dfp[4, :], alpha=0.5, color=pal[1], label='CR 95%')
ax.fill_between(t, dfp[1, :], dfp[3, :], alpha=0.4, color=pal[2], label='CR 50%')
ax.axvline(x=t[cutoff_idx], linestyle='--', color='r', alpha=0.25)
ax.legend()
matplotlib.pyplot.show()
This outputs a plot that seems like a really bad prediction, but at least the code is supplying out-of-sample values.

Possible to manually set parameters in linear model and get R Squared, etc?

I am solving a linear model with bounds on the parameters. The simple statsmodels OLS method doesn't allow for bounds on the fitted parameters, so to do this, I maximize a likelihood function using scipy.optimize.minimize. From this, I have my set of parameters for a linear model. All good so far.
All I need to achieve now is to be able to access statistics for my model, such as R^2, the F-statistic, etc. For an OLS, these all come with the object returned by model.fit(), along with other nice features.
I'm wondering if it is possible to create this object, manually assign my parameters from the bounded fit, and have it compute the data fields on the fit result object? Obviously, I could just manually compute these things but I want it such that whether I am calling for a bounded or unbounded fit, I get the same object type returned and life is easy downstream.
Pseudo code:
bounded_params = fitBoundedLinear(x, y) # solution to bounded problem - a list of floats
model = statsmodels.api.OLS(y, x)
unbounded_fitResult = model.fit() # solution to unbounded problem - a regression results object
I want to do something like:
aFitResult.params = bounded_params # manually set the parameters
aFitResult.calculate() # force it to compute data fields based on these params
rsq = aFitResult.rsquared # etc...
I have something that works - but it is probably not an ideal solution:
aFitResult = statsmodels.regression.linear_model.RegressionResultsWrapper(
    statsmodels.regression.linear_model.OLSResults(model, bounded_params)
)
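To illustrate that workaround, here is a minimal self-contained sketch (the toy data and bounded_params values are assumptions, not from the question) that wraps externally obtained parameters in an OLSResults object and reads R^2 from it:

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLSResults, RegressionResultsWrapper

# toy data, assumed purely for illustration
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=100)

model = sm.OLS(y, X)

# pretend these came from a bounded fit (e.g. scipy.optimize.minimize)
bounded_params = np.array([1.0, 0.5, -0.25])

aFitResult = RegressionResultsWrapper(OLSResults(model, bounded_params))
print(aFitResult.rsquared)  # R^2 evaluated at the manually supplied parameters

Statistics that depend only on the parameters and the data (rsquared, fittedvalues, resid) are computed from the supplied params; quantities that need a covariance estimate may additionally require passing normalized_cov_params.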
You can add upper_bound and lower_bound arguments to fit_elasticnet in statsmodels' elastic_net.py, like this:
def fit_elasticnet(model, method="coord_descent", maxiter=100,
                   alpha=0., L1_wt=1., start_params=None, cnvrg_tol=1e-7,
                   zero_tol=1e-8, refit=False, check_step=True,
                   loglike_kwds=None, score_kwds=None, hess_kwds=None,
                   upper_bound=None, lower_bound=None):
then inside that function after the following line:
params[k] = _opt_1d(func, grad, hess, model_1var, params[k], alpha[k]*L1_wt,
                    tol=btol, check_step=check_step)
add:
if upper_bound is not None:
    params[k] = min(params[k], upper_bound[k])
if lower_bound is not None:
    params[k] = max(params[k], lower_bound[k])
then call the function similar to:
model = lm.OLS(y, x)
results_fu = model.fit()
# results_fu.summary()
results_fr = model.fit_regularized(alpha=0.001,
                                   start_params=results_fu.params,
                                   upper_bound=(.60, 0, 0, 1, 1, 1, 1, 1),
                                   lower_bound=(-1, 0, 0, 0, -1, 1, 1, 1, -10))
Set the model's initial parameters to the desired values via start_params= and then call fit with maxiter=0 to do a fit with zero iterations (i.e. don't actually fit, but still run through all the initialization and metric computation).
result = model.fit(start_params=your_parameters_here, maxiter=0)
result.rsquared # or any other fit index

Problems with curve_fit from scipy.optimize

I know that there are some similar questions, but since none of them brought me any further, I decided to ask one of my own.
I am sorry, if the answer to my problem is already somewhere out there, but I really couldn't find it.
I tried fitting f(x) = a*x**b to rather linear data using curve_fit. It runs without errors, but the result is way off, as shown below:
The thing is that I don't really know what I am doing; on the other hand, fitting is always more of an art than a science, and there was at least one general bug in scipy.optimize.
My data looks like this:
x-values:
[16.8, 2.97, 0.157, 0.0394, 14.000000000000002, 8.03, 0.378, 0.192, 0.0428, 0.029799999999999997, 0.000781, 0.0007890000000000001]
y-values:
[14561.766666666666, 7154.7950000000001, 661.53750000000002, 104.51446666666668, 40307.949999999997, 15993.933333333332, 1798.1166666666666, 1015.0476666666667, 194.93800000000002, 136.82833333333332, 9.9531566666666684, 12.073133333333333]
That's my code (using a really nice example in the last answer to that question):
def func(x, p0, p1):  # here we define a function that we think will follow the data distribution
    return p0 * (x ** p1)

# Here you give the initial guesses for p0 and p1, which curve_fit then iterates over to find the best fit
popt, pcov = curve_fit(func, xvalues, yvalues, p0=(1.0, 1.0))  # alternatively p0=(3107, 0.944); these parameters are user defined
print(popt)  # this contains your two best-fit parameters

# Performing sum of squares
p0 = popt[0]
p1 = popt[1]
residuals = yvalues - func(xvalues, p0, p1)
fres = sum(residuals**2)
print('chi-square')
print(fres)  # this is your chi-square value

xaxis = np.linspace(5e-4, 20)  # we can plot with xdata, but the fit will not look good
curve_y = func(xaxis, p0, p1)
The starting values are from a fit with gnuplot, that is plausible but I need to cross-check.
This is printed output (first fitted p0, p1, then chi-square):
[ 4.67885857e+03 6.24149549e-01]
chi-square
424707043.407
I guess this is a difficult question, therefore much thanks in advance!
When fitting, curve_fit minimizes the sum of (data - model)^2 / error^2.
If you don't pass in errors (as you are doing here), curve_fit assumes that all of the points have an error of 1.
In this case, as your data spans many orders of magnitude, the points with the largest y values dominate the objective function, which causes curve_fit to fit them at the expense of the others.
The best way of fixing this would be to include the errors on your yvalues in the fit (it looks like you have them, since there are error bars in the plot you made). You can do this by passing them in as the sigma parameter of curve_fit.
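As a hedged sketch of that suggestion, using the (rounded) x and y values from the question and an assumed 10% relative error in place of the real error bars (the starting values are roughly the gnuplot estimates mentioned above):

import numpy as np
from scipy.optimize import curve_fit

def func(x, p0, p1):
    return p0 * (x ** p1)

xvalues = np.array([16.8, 2.97, 0.157, 0.0394, 14.0, 8.03, 0.378, 0.192,
                    0.0428, 0.0298, 0.000781, 0.000789])
yvalues = np.array([14561.77, 7154.80, 661.54, 104.51, 40307.95, 15993.93,
                    1798.12, 1015.05, 194.94, 136.83, 9.95, 12.07])
yerrors = 0.1 * yvalues  # assumed 10% relative errors, not the real error bars

popt, pcov = curve_fit(func, xvalues, yvalues, p0=(3000.0, 0.9),
                       sigma=yerrors, absolute_sigma=True)
print(popt)

Weighting by (relative) errors stops the largest y values from dominating the sum of squares.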
I would rethink the experimental part. Two datapoints are questionable:
The image you showed us looks pretty good because you took the log:
You could do a linear fit on log(x) and log(y); in this way you might limit the impact of the largest residuals (a quick sketch follows below). Another approach would be robust regression (RANSAC from scikit-learn or least_squares with a robust loss from scipy).
Nevertheless you should either gather more datapoints or repeat the measurements.
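For completeness, a minimal sketch of the log-log idea with np.polyfit, again using the rounded data from the question; the straight-line slope is the power-law exponent b and exp(intercept) is the prefactor a:

import numpy as np

xvalues = np.array([16.8, 2.97, 0.157, 0.0394, 14.0, 8.03, 0.378, 0.192,
                    0.0428, 0.0298, 0.000781, 0.000789])
yvalues = np.array([14561.77, 7154.80, 661.54, 104.51, 40307.95, 15993.93,
                    1798.12, 1015.05, 194.94, 136.83, 9.95, 12.07])

# fit log(y) = b*log(x) + log(a), i.e. y = a * x**b
b, log_a = np.polyfit(np.log(xvalues), np.log(yvalues), 1)
a = np.exp(log_a)
print(a, b)  # power-law prefactor and exponent

In log space every point carries comparable weight, so the largest residuals no longer dominate.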

Why does scipy.optimize.curve_fit produce parameters which are barely different from the guess?

I've been trying to fit some histogram data with scipy.optimize.curve_fit, but so far I haven't once been able to produce fit parameters that differ significantly from my guess parameters.
I wouldn't be terribly surprised to find that the more arcane parameters in my fit get stuck in local minima, but even linear coefficients won't move from my initial guesses!
If you've seen anything like this before, I'd love some advice. Do least-squared minimization routines just not work for certain classes of functions?
I tried this:
import numpy as np
from matplotlib.pyplot import *
from scipy.optimize import curve_fit

def grating_hist(x, frac, xmax, x0):
    # model data to be turned into a histogram
    dx = x[1] - x[0]
    z = np.linspace(0, 1, 20000, endpoint=True)
    grating = np.cos(frac*np.pi*z)
    norm_grating = xmax*(grating - grating[-1])/(1 - grating[-1]) + x0
    # produce the histogram
    bin_edges = np.append(x, x[-1] + x[1] - x[0])
    hist, bin_edges = np.histogram(norm_grating, bins=bin_edges)
    return hist

x = np.linspace(0, 5, 512)
p_data = [0.7, 1.1, 0.8]
pct = grating_hist(x, *p_data)
p_guess = [1, 1, 1]
p_fit, pcov = curve_fit(grating_hist, x, pct, p0=p_guess)

plot(x, pct, label='Data')
plot(x, grating_hist(x, *p_fit), label='Fit')
legend()
show()

print('Data Parameters:', p_data)
print('Guess Parameters:', p_guess)
print('Fit Parameters:', p_fit)
print('Covariance:', pcov)
and I see this: http://i.stack.imgur.com/GwXzJ.png (I'm new here, so I can't post images)
Data Parameters: [0.7, 1.1, 0.8]
Guess Parameters: [1, 1, 1]
Fit Parameters: [ 0.97600854 0.99458336 1.00366634]
Covariance: [[ 3.50047574e-06 -5.34574971e-07 2.99306123e-07]
[ -5.34574971e-07 9.78688795e-07 -6.94780671e-07]
[ 2.99306123e-07 -6.94780671e-07 7.17068753e-07]]
Whaaa? I'm pretty sure this isn't a local minimum for variations in xmax and x0, and it's a long way from the global minimum best fit. The fit parameters still don't change, even with better guesses. Different choices for curve functions (e.g. the sum of two normal distributions) do produce new parameters for the same data, so I know it's not the data itself. I also tried the same thing with scipy.optimize.leastsq itself just in case, but no dice; the parameters still don't move. If you have any thoughts on this, I'd love to hear them!
The problem you're facing is actually not due to curve_fit (or leastsq). It is due to the landscape of the objective of your optimisation problem. In your case the objective is the sum of squared residuals, which you are trying to minimise. Now, if you look closely at your objective in a small neighbourhood of your initial conditions, for example using the code below, which only focuses on the first parameter:
import numpy as np
import matplotlib.pyplot as py  # assuming py is a pyplot-style plotting alias

p_ind = 0
eps = 1e-6
n_points = 100
frac_surroundings = np.linspace(p_guess[p_ind] - eps, p_guess[p_ind] + eps, n_points)
obj = []
temp_guess = p_guess.copy()
for p in frac_surroundings:
    temp_guess[0] = p
    obj.append(((grating_hist(x, *p_data) - grating_hist(x, *temp_guess))**2.0).sum())
py.plot(frac_surroundings, obj)
py.show()
you will notice that the landscape is piecewise constant (you can easily check that the situation is the same for the other parameters). The problem is that these pieces are of the order of 10^-6 wide, whereas the initial step of the fitting procedure is somewhere around 10^-8, so the procedure ends quickly, concluding that it cannot improve on the given initial condition. You could try to fix this by changing the epsfcn parameter of curve_fit, but you would quickly notice that the landscape, on top of being piecewise constant, is also very "rugged". In other words, curve_fit is simply not well suited to such a problem, which is genuinely difficult for gradient-based methods, as it is highly non-convex. Some stochastic optimisation method could probably do a better job. That is, however, a different question/problem.
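For illustration only (reusing grating_hist, x, pct and p_guess from the question's code), a larger finite-difference step could be passed through curve_fit to the underlying MINPACK routine via epsfcn; as noted above, this changes the step size but does not cure the rugged landscape:

from scipy.optimize import curve_fit

# epsfcn sets the relative step used for finite-difference derivatives
# in the underlying leastsq/MINPACK routine (default 'lm' method only)
p_fit, pcov = curve_fit(grating_hist, x, pct, p0=p_guess, epsfcn=1e-4)
print(p_fit)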
I think it is a local minimum, or the algorithm fails for a non-trivial reason. It is far easier to fit the data to the input, instead of fitting the statistical description of the data to the statistical description of the input.
Here's a modified version of the code doing so:
z = np.linspace(0, 1, 20000, endpoint=True)

def grating_hist_indicator(x, frac, xmax, x0):
    # model data, before being turned into a histogram
    dx = x[1] - x[0]
    grating = np.cos(frac*np.pi*z)
    norm_grating = xmax*(grating - grating[-1])/(1 - grating[-1]) + x0
    return norm_grating

x = np.linspace(0, 5, 512)
p_data = [0.7, 1.1, 0.8]
pct = grating_hist(x, *p_data)
pct_indicator = grating_hist_indicator(x, *p_data)
p_guess = [1, 1, 1]
p_fit, pcov = curve_fit(grating_hist_indicator, x, pct_indicator, p0=p_guess)

plot(x, pct, label='Data')
plot(x, grating_hist(x, *p_fit), label='Fit')
legend()
show()
