Adding constraints to my fitting model using lmfit - python

I am trying to fit a complex conductivity model (the drude-smith-anderson model) using lmfit.minimize. In that fitting, I want constraints on my parameters c and c1 such that 0<c<1, -1<c1<0 and 0<1+c1-c<1. So, I am using the following code:
#reference: Juluri B.K. "Fitting Complex Metal Dielectric Functions with Differential Evolution Method". http://juluribk.com/?p=1597.
#reference: https://lmfit.github.io/lmfit-py/fitting.html
#import libraries (numdifftools needs to be installed but doesn't need to be imported)
import matplotlib.pyplot as plt
import numpy as np
import lmfit as lmf
import math as mt
#define the complex conductivity model
def model(params,w):
sigma0 = params["sigma0"].value
tau = params["tau"].value
c = params["c"].value
d = params["d"].value
c1 = params["c1"].value
druidanderson = (sigma0/(1-1j*2*mt.pi*w*tau))*(1 + c1/(1-1j*2*mt.pi*w*tau)) - sigma0*c/(1-1j*2*mt.pi*w*d*tau)
return druidanderson
#defining the complex residues (chi squared is sum of squares of residues)
def complex_residuals(params,w,exp_data):
delta = model(params,w)
residual = (abs((delta.real - exp_data.real) / exp_data.real) + abs(
(delta.imag - exp_data.imag) / exp_data.imag))
return residual
# importing data from CSV file
importpath = input("Path of CSV file: ") #Asking the location of where your data file is kept (give input in form of path\name.csv)
frequency = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(0)) #path to be changed to the file from which data is taken
conductivity = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(1)) + 1j*np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(2)) #path to be changed to the file from which data is taken
frequency = frequency[np.logical_not(np.isnan(frequency))]
conductivity = conductivity[np.logical_not(np.isnan(conductivity))]
w_for_fit = frequency
eps_for_fit = conductivity
#defining the bounds and initial guesses for the fitting parameters
params = lmf.Parameters()
params.add("sigma0", value = float(input("Guess for \u03C3\u2080: ")), min =10 , max = 5000) #bounds have to be changed manually
params.add("tau", value = float(input("Guess for \u03C4: ")), min = 0.0001, max =10) #bounds have to be changed manually
params.add("c1", value = float(input("Guess for c1: ")), min = -1 , max = 0) #bounds have to be changed manually
params.add("constraint", value = float(input("Guess for constraint: ")), min = 0, max=1)
params.add("c", expr="1+c1-constraint", min = 0, max = 1) #bounds have to be changed manually
params.add("d", value = float(input("Guess for \u03C4_1/\u03C4: ")),min = 100, max = 100000) #bounds have to be changed manually
# minimizing the chi square
minimizer_results = lmf.minimize(complex_residuals, params, args=(w_for_fit, eps_for_fit), method = 'differential_evolution', strategy='best1bin',
popsize=50, tol=0.01, mutation=(0, 1), recombination=0.9, seed=None, callback=None, disp=True, polish=True, init='latinhypercube')
lmf.printfuncs.report_fit(minimizer_results, show_correl=False)
As a result for the fit, I get the following output:
sigma0: 3489.38961 (init = 1000)
tau: 1.2456e-04 (init = 0.01)
c1: -0.99816132 (init = -1)
constraint: 0.98138820 (init = 1)
c: 0.00000000 == '1+c1-constraint'
d: 7333.82306 (init = 1000)
These values don't make any sense as 1+c1-c = -0.97954952 which is not 0 and is thus invalid. How to fix this issue?

Your code is not runnable. The use of input() is sort of stunning - please do not do that. Write code that is pleasant to read and separates i/o from logic.
To make a floating point residual from a complex array, use complex_array.view(float)
Guessing any parameter value to be at or very close to its limit (here, c) is a very bad idea, likely to make the fit harder.
More to your question, you defined c as "evaluate 1+c1-constant and then apply the bounds min=0, max=1". That is literally, precisely, and exactly what your
params.add("c", expr="1+c1-constraint", min = 0, max = 1)
means: calculate c as 1+c1-constraint, and then apply the bounds [0, 1]. The code is doing exactly what you told it to do.
Unless you know what you are doing (I suspect maybe not ;)), I would strongly advise doing a fit with the default leastsq method before trying to use differential_evolution. It turns out that differential_evolution is not a very good global fitting method (shgo is generally better, though no "global" solver should be considered as very reliable). But, unless you know that you need such a method, you probably do not.
I would also strongly advise you to plot your data and some models evaluated with what you think are reasonable parameters.

Related

lmfit - SineModel+ConstantModel appears inaccurate fit

I'm trying to fit a simple sine function to some experimental data using lmfit and I find that the SineModel with a constant model offset returns, what looks like an inaccurate fit to the data (to me). I suppose it may be helpful to highlight that I am most interested in the frequency of the peaks (and I appreciate that I can simply use a scipy.find_peaks() but would prefer to show a fit to the data).
I use the function below for lmfit model:
def Sine(self, x_axis, y_axis):
sine = SineModel()
const = ConstantModel()
x_fit = np.linspace(min(x_axis), max(x_axis), x_axis.size)
guess_sine = sine.guess(y_axis, x=x_fit)
pars = sine.guess(y_axis, x=x_fit)
sine_offset = SineModel() + ConstantModel()
pars.add('c', value=1, vary=True)
result = sine_offset.fit(y_axis, pars, x=x_fit)
return result
Sine function output (graph and report results) are provided here:
SineModel+ConstModel
I then tried to define my own function, defining my own parameters and evaluating in the same lmfit method, providing sensible "guess" initial values etc.
def Sine_User2(self, x_axis, y_axis):
def sine_func(x, amplitude, freq, shift, c):
return amplitude * np.sin(freq * x + shift) + c
sinemodel = Model(sine_func)
# Take a FFT of the data to provide a guess starting location for the curve fitting
x = np.array(x_axis)
y = np.array(y_axis)
ff = np.fft.fftfreq(len(x), (x[1] - x[0])) # assume uniform spacing
Fyy = abs(np.fft.fft(y))
guess_freq = abs(ff[np.argmax(Fyy[1:]) + 1]) * 2. * np.pi
guess_amp = np.std(y) * 2.**0.5
guess_offset = np.mean(y)
x_fit = np.linspace(min(x_axis), max(x_axis), x_axis.size)
params = sinemodel.make_params(amplitude = guess_amp, freq = guess_freq, shift = 0, c = guess_offset )
result = sinemodel.fit(y_axis, params, x = x_fit)
return result
The output of the user defined model appears to provide a much closer fit to the data, however, the report does not provide uncertainties citing a warning that the "Uncertainties could not be estimated":
SineUser2 function outputs (graph and report results) are provided here: User Defined Model
I then tried to include min/max values to the parameters by replacing the "sinmodel.make_params" line with:
params = Parameters()
params.add('amplitude', value=guess_amp, min = 0)
params.add('freq', value=guess_freq, min=0)
params.add('shift', value=0, min=-2*np.pi, max=2*np.pi)
params.add('c', value=guess_offset)
But the results resort back to the SineModel+ConstModel results seen in the first linked graph/report results. Therefore it must be something to do with the way I'm setting initial values.
The fit using the "SineUser2" function appears to be better. Is there a way to improve the fit for "Sine" function in the first block of code.
Why are the uncertainties not calculated in the second function "Sine_User2"?
Data (.csv):
Wavelength (nm),Power (dBm),,,,,
1549.9,-13.76008731,,,,,
1549.905,-13.69423162,,,,,
1549.91,-12.59004339,,,,,
1549.915,-11.31061848,,,,,
1549.92,-10.58731809,,,,,
1549.925,-10.19024329,,,,,
1549.93,-10.07301418,,,,,
1549.935,-10.19513172,,,,,
1549.94,-10.45582159,,,,,
1549.945,-11.15984161,,,,,
1549.95,-12.15876596,,,,,
1549.955,-13.44674933,,,,,
1549.96,-13.56388277,,,,,
1549.965,-12.2513065,,,,,
1549.97,-11.08699015,,,,,
1549.975,-10.43829185,,,,,
1549.98,-10.12861158,,,,,
1549.985,-10.0962929,,,,,
1549.99,-10.1852173,,,,,
1549.995,-10.55438183,,,,,
1550,-11.19555345,,,,,
1550.005,-12.28715299,,,,,
1550.01,-13.5153863,,,,,
1550.015,-13.47019261,,,,,
1550.02,-12.12394732,,,,,
1550.025,-11.01946751,,,,,
1550.03,-10.42138778,,,,,
1550.035,-10.14438079,,,,,
1550.04,-10.05681218,,,,,
1550.045,-10.17148605,,,,,
1550.05,-10.56046759,,,,,
1550.055,-11.11621478,,,,,
1550.06,-12.19930263,,,,,
1550.065,-13.48428349,,,,,
1550.07,-13.43424913,,,,,
1550.075,-12.08019952,,,,,
1550.08,-11.08731704,,,,,
1550.085,-10.45730899,,,,,
1550.09,-10.11278169,,,,,
1550.095,-10.00651194,,,,,
,,,,,,

How can I get sco.minimize to give me a solution that isn't the initial guesses?

I have a piece of code that worked well when I optimized advertising budget with 2 variables (channels) but when I added aditional channels, it stopped optimizing with no error messages.
import numpy as np
import scipy.optimize as sco
# setup variables
media_budget = 100000 # total media budget
media_labels = ['launchvideoviews', 'conversion', 'traffic', 'videoviews', 'reach'] # channel names
media_coefs = [0.3524764781, 5.606903166, -0.1761937775, 5.678596017, 10.50445914] #
# model coefficients
media_drs = [-1.15, 2.09, 6.7, -0.201, 1.21] # diminishing returns
const = -243.1018144
# the function for our model
def model_function(x, media_coefs, media_drs, const):
# transform variables and multiply them by coefficients to get contributions
channel_1_contrib = media_coefs[0] * x[0]**media_drs[0]
channel_2_contrib = media_coefs[1] * x[1]**media_drs[1]
channel_3_contrib = media_coefs[2] * x[2]**media_drs[2]
channel_4_contrib = media_coefs[3] * x[3]**media_drs[3]
channel_5_contrib = media_coefs[4] * x[4]**media_drs[4]
# sum contributions and add constant
y = channel_1_contrib + channel_2_contrib + channel_3_contrib + channel_4_contrib + channel_5_contrib + const
# return negative conversions for the minimize function to work
return -y
# set up guesses, constraints and bounds
num_media_vars = len(media_labels)
guesses = num_media_vars*[media_budget/num_media_vars,] # starting guesses: divide budget evenly
args = (media_coefs, media_drs, const) # pass non-optimized values into model_function
con_1 = {'type': 'eq', 'fun': lambda x: np.sum(x) - media_budget} # so we can't go over budget
constraints = (con_1)
bound = (0, media_budget) # spend for a channel can't be negative or higher than budget
bounds = tuple(bound for x in range(5))
# run the SciPy Optimizer
solution = sco.minimize(model_function, x0=guesses, args=args, method='SLSQP', constraints=constraints, bounds=bounds)
# print out the solution
print(f"Spend: ${round(float(media_budget),2)}\n")
print(f"Optimized CPA: ${round(media_budget/(-1 * solution.fun),2)}")
print("Allocation:")
for i in range(len(media_labels)):
print(f"-{media_labels[i]}: ${round(solution.x[i],2)} ({round(solution.x[i]/media_budget*100,2)}%)")
And the result is
Spend: $100000.0
Optimized CPA: $-0.0
Allocation:
-launchvideoviews: $20000.0 (20.0%)
-conversion: $20000.0 (20.0%)
-traffic: $20000.0 (20.0%)
-videoviews: $20000.0 (20.0%)
-reach: $20000.0 (20.0%)
Which is the same as the initial guesses argument.
Thank you very much!
Update: Following #joni comment, I passed the gradient function explicitly, but still no result.
I don't know how to change the constrains to test #chthonicdaemon
comment yet.
import numpy as np
import scipy.optimize as sco
# setup variables
media_budget = 100000 # total media budget
media_labels = ['launchvideoviews', 'conversion', 'traffic', 'videoviews', 'reach'] # channel names
media_coefs = [0.3524764781, 5.606903166, -0.1761937775, 5.678596017, 10.50445914] #
# model coefficients
media_drs = [-1.15, 2.09, 6.7, -0.201, 1.21] # diminishing returns
const = -243.1018144
# the function for our model
def model_function(x, media_coefs, media_drs, const):
# transform variables and multiply them by coefficients to get contributions
channel_1_contrib = media_coefs[0] * x[0]**media_drs[0]
channel_2_contrib = media_coefs[1] * x[1]**media_drs[1]
channel_3_contrib = media_coefs[2] * x[2]**media_drs[2]
channel_4_contrib = media_coefs[3] * x[3]**media_drs[3]
channel_5_contrib = media_coefs[4] * x[4]**media_drs[4]
# sum contributions and add constant (objetive function)
y = channel_1_contrib + channel_2_contrib + channel_3_contrib + channel_4_contrib + channel_5_contrib + const
# return negative conversions for the minimize function to work
return -y
# partial derivative of the objective function
def fun_der(x, media_coefs, media_drs, const):
d_chan1 = 1
d_chan2 = 1
d_chan3 = 1
d_chan4 = 1
d_chan5 = 1
return np.array([d_chan1, d_chan2, d_chan3, d_chan4, d_chan5])
# set up guesses, constraints and bounds
num_media_vars = len(media_labels)
guesses = num_media_vars*[media_budget/num_media_vars,] # starting guesses: divide budget evenly
args = (media_coefs, media_drs, const) # pass non-optimized values into model_function
con_1 = {'type': 'eq', 'fun': lambda x: np.sum(x) - media_budget} # so we can't go over budget
constraints = (con_1)
bound = (0, media_budget) # spend for a channel can't be negative or higher than budget
bounds = tuple(bound for x in range(5))
# run the SciPy Optimizer
solution = sco.minimize(model_function, x0=guesses, args=args, method='SLSQP', constraints=constraints, bounds=bounds, jac=fun_der)
# print out the solution
print(f"Spend: ${round(float(media_budget),2)}\n")
print(f"Optimized CPA: ${round(media_budget/(-1 * solution.fun),2)}")
print("Allocation:")
for i in range(len(media_labels)):
print(f"-{media_labels[i]}: ${round(solution.x[i],2)} ({round(solution.x[i]/media_budget*100,2)}%)")
The reason you are not able to solve this exact problem turns out to be all about the specific coefficients you have. For the problem as it is specified, the optimum appears to be near allocations where some spends are zero. However, at spends near zero, due to the negative coefficients in media_drs, the objective function rapidly becomes infinite. I believe this is what is causing the issues you are experiencing. I can get a solution with success = True by manipulating the 6.7 to be 0.7 in the coefficients and setting lower bound that is larger than 0 to stop the objective function from exploding. So this isn't so much of a programming issue as a problem formulation issue.
I cannot imagine it would be true that you would see more payoff when you reduce the budget on a particular item, so all the negative powers in media_dirs seem off to me.
I will also post here some improvements I made while debugging this issue. Notice that I'm using numpy arrays more to make some of the functions easier to read. Also notice how I have calculated a correct jacobian:
import numpy as np
import scipy.optimize as sco
# setup variables
media_budget = 100000 # total media budget
media_labels = ['launchvideoviews', 'conversion', 'traffic', 'videoviews', 'reach'] # channel names
media_coefs = np.array([0.3524764781, 5.606903166, -0.1761937775, 5.678596017, 10.50445914]) #
# model coefficients
media_drs = np.array([-1.15, 2.09, 1.7, -0.201, 1.21]) # diminishing returns
const = -243.1018144
# the function for our model
def model_function(x, media_coefs, media_drs, const):
# transform variables and multiply them by coefficients to get contributions
channel_contrib = media_coefs * x**media_drs
# sum contributions and add constant
y = channel_contrib.sum() + const
# return negative conversions for the minimize function to work
return -y
def model_function_jac(x, media_coefs, media_drs, const):
dy_dx = media_coefs * media_drs * x**(media_drs-1)
return -dy_dx
# set up guesses, constraints and bounds
num_media_vars = len(media_labels)
guesses = num_media_vars*[media_budget/num_media_vars,] # starting guesses: divide budget evenly
args = (media_coefs, media_drs, const) # pass non-optimized values into model_function
con_1 = {'type': 'ineq', 'fun': lambda x: media_budget - sum(x)} # so we can't go over budget
constraints = (con_1,)
bound = (10, media_budget) # spend for a channel can't be negative or higher than budget
bounds = tuple(bound for x in range(5))
# run the SciPy Optimizer
solution = sco.minimize(
model_function, x0=guesses, args=args,
method='SLSQP',
jac=model_function_jac,
constraints=constraints,
bounds=bounds
)
# print out the solution
print(solution)
print(f"Spend: ${round(float(media_budget),2)}\n")
print(f"Optimized CPA: ${round(media_budget/(-1 * solution.fun),2)}")
print("Allocation:")
for i in range(len(media_labels)):
print(f"-{media_labels[i]}: ${round(solution.x[i],2)} ({round(solution.x[i]/media_budget*100,2)}%)")
This solution at least "works" in the sense that it reports a successful solve and returns an answer different from the initial guess.

Checking if Frequentist approach is correct? Bayesian approach using MCMC for AB test. How to calculate Bayes Factors in Python?

I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me. I am struggling to understand the results, or whether I have computed them (in)correctly (which is probably likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes Factors. I've seen packages in R that make this look somewhat easy. Alas, I am not familiar with R and would prefer to be able to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis, to determine the number of required samples with a power of 0.8, alpha of 0.05 and a practical significance of 2%. I'm not sure whether expected conversion rates should be supplied, or the baseline + some proportion. Depending on the effect size, the required number of samples increases significantly.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance) # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance) # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size,
power = power,
alpha = alpha,
nobs1 = nobs1,
ratio = 1)
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (test_b_success / test_b_total) - (test_a_success / test_a_total)
# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement with regards to conversion rate over group A (control) which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of success and failures (and only tested on 100 observations) due to how long this thing takes, and ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_v = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
np.random.shuffle(observations_A)
np.random.shuffle(observations_B)
with pm.Model() as model:
p_A = pm.Beta("p_A", 1, 1)
p_B = pm.Beta("p_B", 1, 1)
delta = pm.Deterministic("delta", p_A - p_B)
obs_A = pm.Bernoulli("obs_A", p_A, observed = obs_a[:1000])
obs_B = pm.Bernoulli("obs_B", p_B, observed = obs_b[:1000])
step = pm.NUTS()
trace = pm.sample(1000, step = step, chains = 2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.

MCMC Sampling a Maxwellian Curve Using Python's emcee

I am trying to introduce myself to MCMC sampling with emcee. I want to simply take a sample from a Maxwell Boltzmann distribution using a set of example code on github, https://github.com/dfm/emcee/blob/master/examples/quickstart.py.
The example code is really excellent, but when I change the distribution from a Gaussian to a Maxwellian, I receive the error, TypeError: lnprob() takes exactly 2 arguments (3 given)
However it is not called anywhere where it is not given the appropriate parameters? In need of some guidance as to how to define a Maxwellian Curve and have it fit into this example code.
Here is what I have;
from __future__ import print_function
import numpy as np
import emcee
try:
xrange
except NameError:
xrange = range
def lnprob(x, a, icov):
pi = np.pi
return np.sqrt(2/pi)*x**2*np.exp(-x**2/(2.*a**2))/a**3
ndim = 2
means = np.random.rand(ndim)
cov = 0.5-np.random.rand(ndim**2).reshape((ndim, ndim))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov,cov)
icov = np.linalg.inv(cov)
nwalkers = 50
p0 = [np.random.rand(ndim) for i in xrange(nwalkers)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[means, icov])
pos, prob, state = sampler.run_mcmc(p0, 5000)
sampler.reset()
sampler.run_mcmc(pos, 100000, rstate0=state)
Thanks
I think there are a couple of problems that I see. The main one is that emcee wants you to give it the natural logarithm of the probability distribution function that you want to sample. So, rather than having:
def lnprob(x, a, icov):
pi = np.pi
return np.sqrt(2/pi)*x**2*np.exp(-x**2/(2.*a**2))/a**3
you would instead want, e.g.
def lnprob(x, a):
pi = np.pi
if x < 0:
return -np.inf
else:
return 0.5*np.log(2./pi) + 2.*np.log(x) - (x**2/(2.*a**2)) - 3.*np.log(a)
where the if...else... statement is to explicitly say that negative values of x have zero probability (or -infinity in log-space).
You also shouldn't have to calculate icov and pass it to lnprob as that's only needed for the Gaussian case in the example you link to.
When you call:
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[means, icov])
the args value should just be any additional arguments that your lnprob function requires, so in your case this would be the value of a that you want to set your Maxwell-Boltxmann distribution up with. This should be a single value rather than the two randomly initialised values you have set when creating mean.
Overall, the following should work for you:
from __future__ import print_function
import emcee
import numpy as np
from numpy import pi as pi
# define the natural log of the Maxwell-Boltzmann distribution
def lnprob(x, a):
if x < 0:
return -np.inf
else:
return 0.5*np.log(2./pi) + 2.*np.log(x) - (x**2/(2.*a**2)) - 3.*np.log(a)
# choose a value of 'a' for the distributions
a = 5. # I'm choosing 5!
# choose the number of walkers
nwalkers = 50
# set some initial points at which to calculate the lnprob
p0 = [np.random.rand(1) for i in xrange(nwalkers)]
# initialise the sampler
sampler = emcee.EnsembleSampler(nwalkers, 1, lnprob, args=[a])
# Run 5000 steps as a burn-in.
pos, prob, state = sampler.run_mcmc(p0, 5000)
# Reset the chain to remove the burn-in samples.
sampler.reset()
# Starting from the final position in the burn-in chain, sample for 100000 steps.
sampler.run_mcmc(pos, 100000, rstate0=state)
# lets check the samples look right
mbmean = 2.*a*np.sqrt(2./pi) # mean of Maxwell-Boltzmann distribution
print("Sample mean = {}, analytical mean = {}".format(np.mean(sampler.flatchain[:,0]), mbmean))
mbstd = np.sqrt(a**2*(3*np.pi-8.)/np.pi) # std. dev. of M-B distribution
print("Sample standard deviation = {}, analytical = {}".format(np.std(sampler.flatchain[:,0]), mbstd))

Fit two normal distributions (histograms) with MCMC using pymc?

I am trying to fit line profiles as detected with a spectrograph on a CCD. For ease of consideration, I have included a demonstration that, if solved, is very similar to the one I actually want to solve.
I've looked at this:
https://stats.stackexchange.com/questions/46626/fitting-model-for-two-normal-distributions-in-pymc
and various other questions and answers, but they are doing something fundamentally different than what I want to do.
import pymc as mc
import numpy as np
import pylab as pl
def GaussFunc(x, amplitude, centroid, sigma):
return amplitude * np.exp(-0.5 * ((x - centroid) / sigma)**2)
wavelength = np.arange(5000, 5050, 0.02)
# Profile 1
centroid_one = 5025.0
sigma_one = 2.2
height_one = 0.8
profile1 = GaussFunc(wavelength, height_one, centroid_one, sigma_one, )
# Profile 2
centroid_two = 5027.0
sigma_two = 1.2
height_two = 0.5
profile2 = GaussFunc(wavelength, height_two, centroid_two, sigma_two, )
# Measured values
noise = np.random.normal(0.0, 0.02, len(wavelength))
combined = profile1 + profile2 + noise
# If you want to plot what this looks like
pl.plot(wavelength, combined, label="Measured")
pl.plot(wavelength, profile1, color='red', linestyle='dashed', label="1")
pl.plot(wavelength, profile2, color='green', linestyle='dashed', label="2")
pl.title("Feature One and Two")
pl.legend()
My question: Can PyMC (or some variant) give me the mean, amplitude, and sigma for the two components used above?
Please note that the functions that I will actually fit on my real problem are NOT Gaussians -- so please provide the example using a generic function (like GaussFunc in my example), and not a "built-in" pymc.Normal() type function.
Also, I understand model selection is another issue: so with the current noise, 1 component (profile) might be all that is statistically justified. But I'd like to see what the best solution for 1, 2, 3, etc. components would be.
I'm also not wed to the idea of using PyMC -- if scikit-learn, astroML, or some other package seems perfect, please let me know!
EDIT:
I failed a number of ways, but one of the things that I think was on the right track was the following:
sigma_mc_one = mc.Uniform('sig', 0.01, 6.5)
height_mc_one = mc.Uniform('height', 0.1, 2.5)
centroid_mc_one = mc.Uniform('cen', 5015., 5040.)
But I could not construct a mc.model that worked.
Not the most concise PyMC code, but I made that decision to help the reader. This should run, and give (really) accurate results.
I made the decision to use Uniform priors, with liberal ranges, because I really have no idea what we are modelling. But probably one has an idea about the centroid locations, and can use a better priors there.
### Suggested one runs the above code first.
### Unknowns we are interested in
est_centroid_one = mc.Uniform("est_centroid_one", 5000, 5050 )
est_centroid_two = mc.Uniform("est_centroid_two", 5000, 5050 )
est_sigma_one = mc.Uniform( "est_sigma_one", 0, 5 )
est_sigma_two = mc.Uniform( "est_sigma_two", 0, 5 )
est_height_one = mc.Uniform( "est_height_one", 0, 5 )
est_height_two = mc.Uniform( "est_height_two", 0, 5 )
#std deviation of the noise, converted to precision by tau = 1/sigma**2
precision= 1./mc.Uniform("std", 0, 1)**2
#Set up the model's relationships.
#mc.deterministic( trace = False)
def est_profile_1(x = wavelength, centroid = est_centroid_one, sigma = est_sigma_one, height= est_height_one):
return GaussFunc( x, height, centroid, sigma )
#mc.deterministic( trace = False)
def est_profile_2(x = wavelength, centroid = est_centroid_two, sigma = est_sigma_two, height= est_height_two):
return GaussFunc( x, height, centroid, sigma )
#mc.deterministic( trace = False )
def mean( profile_1 = est_profile_1, profile_2 = est_profile_2 ):
return profile_1 + profile_2
observations = mc.Normal("obs", mean, precision, value = combined, observed = True)
model = mc.Model([est_centroid_one,
est_centroid_two,
est_height_one,
est_height_two,
est_sigma_one,
est_sigma_two,
precision])
#always a good idea to MAP it prior to MCMC, so as to start with good initial values
map_ = mc.MAP( model )
map_.fit()
mcmc = mc.MCMC( model )
mcmc.sample( 50000,40000 ) #try running for longer if not happy with convergence.
Important
Keep in mind the algorithm is agnostic to labeling, so the results might show profile1 with all the characteristics from profile2 and vice versa.

Categories

Resources