Checking if Frequentist approach is correct? Bayesian approach using MCMC for AB test. How to calculate Bayes Factors in Python? - python

I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me. I am struggling to understand the results, or whether I have computed them (in)correctly (which is probably likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes Factors. I've seen packages in R that make this look somewhat easy. Alas, I am not familiar with R and would prefer to be able to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis, to determine the number of required samples with a power of 0.8, alpha of 0.05 and a practical significance of 2%. I'm not sure whether expected conversion rates should be supplied, or the baseline + some proportion. Depending on the effect size, the required number of samples increases significantly.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance) # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance) # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size,
power = power,
alpha = alpha,
nobs1 = nobs1,
ratio = 1)
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (test_b_success / test_b_total) - (test_a_success / test_a_total)
# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement with regards to conversion rate over group A (control) which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of success and failures (and only tested on 100 observations) due to how long this thing takes, and ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_v = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
np.random.shuffle(observations_A)
np.random.shuffle(observations_B)
with pm.Model() as model:
p_A = pm.Beta("p_A", 1, 1)
p_B = pm.Beta("p_B", 1, 1)
delta = pm.Deterministic("delta", p_A - p_B)
obs_A = pm.Bernoulli("obs_A", p_A, observed = obs_a[:1000])
obs_B = pm.Bernoulli("obs_B", p_B, observed = obs_b[:1000])
step = pm.NUTS()
trace = pm.sample(1000, step = step, chains = 2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.

Related

Adding constraints to my fitting model using lmfit

I am trying to fit a complex conductivity model (the drude-smith-anderson model) using lmfit.minimize. In that fitting, I want constraints on my parameters c and c1 such that 0<c<1, -1<c1<0 and 0<1+c1-c<1. So, I am using the following code:
#reference: Juluri B.K. "Fitting Complex Metal Dielectric Functions with Differential Evolution Method". http://juluribk.com/?p=1597.
#reference: https://lmfit.github.io/lmfit-py/fitting.html
#import libraries (numdifftools needs to be installed but doesn't need to be imported)
import matplotlib.pyplot as plt
import numpy as np
import lmfit as lmf
import math as mt
#define the complex conductivity model
def model(params,w):
sigma0 = params["sigma0"].value
tau = params["tau"].value
c = params["c"].value
d = params["d"].value
c1 = params["c1"].value
druidanderson = (sigma0/(1-1j*2*mt.pi*w*tau))*(1 + c1/(1-1j*2*mt.pi*w*tau)) - sigma0*c/(1-1j*2*mt.pi*w*d*tau)
return druidanderson
#defining the complex residues (chi squared is sum of squares of residues)
def complex_residuals(params,w,exp_data):
delta = model(params,w)
residual = (abs((delta.real - exp_data.real) / exp_data.real) + abs(
(delta.imag - exp_data.imag) / exp_data.imag))
return residual
# importing data from CSV file
importpath = input("Path of CSV file: ") #Asking the location of where your data file is kept (give input in form of path\name.csv)
frequency = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(0)) #path to be changed to the file from which data is taken
conductivity = np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(1)) + 1j*np.genfromtxt(rf"{importpath}",delimiter=",", usecols=(2)) #path to be changed to the file from which data is taken
frequency = frequency[np.logical_not(np.isnan(frequency))]
conductivity = conductivity[np.logical_not(np.isnan(conductivity))]
w_for_fit = frequency
eps_for_fit = conductivity
#defining the bounds and initial guesses for the fitting parameters
params = lmf.Parameters()
params.add("sigma0", value = float(input("Guess for \u03C3\u2080: ")), min =10 , max = 5000) #bounds have to be changed manually
params.add("tau", value = float(input("Guess for \u03C4: ")), min = 0.0001, max =10) #bounds have to be changed manually
params.add("c1", value = float(input("Guess for c1: ")), min = -1 , max = 0) #bounds have to be changed manually
params.add("constraint", value = float(input("Guess for constraint: ")), min = 0, max=1)
params.add("c", expr="1+c1-constraint", min = 0, max = 1) #bounds have to be changed manually
params.add("d", value = float(input("Guess for \u03C4_1/\u03C4: ")),min = 100, max = 100000) #bounds have to be changed manually
# minimizing the chi square
minimizer_results = lmf.minimize(complex_residuals, params, args=(w_for_fit, eps_for_fit), method = 'differential_evolution', strategy='best1bin',
popsize=50, tol=0.01, mutation=(0, 1), recombination=0.9, seed=None, callback=None, disp=True, polish=True, init='latinhypercube')
lmf.printfuncs.report_fit(minimizer_results, show_correl=False)
As a result for the fit, I get the following output:
sigma0: 3489.38961 (init = 1000)
tau: 1.2456e-04 (init = 0.01)
c1: -0.99816132 (init = -1)
constraint: 0.98138820 (init = 1)
c: 0.00000000 == '1+c1-constraint'
d: 7333.82306 (init = 1000)
These values don't make any sense as 1+c1-c = -0.97954952 which is not 0 and is thus invalid. How to fix this issue?
Your code is not runnable. The use of input() is sort of stunning - please do not do that. Write code that is pleasant to read and separates i/o from logic.
To make a floating point residual from a complex array, use complex_array.view(float)
Guessing any parameter value to be at or very close to its limit (here, c) is a very bad idea, likely to make the fit harder.
More to your question, you defined c as "evaluate 1+c1-constant and then apply the bounds min=0, max=1". That is literally, precisely, and exactly what your
params.add("c", expr="1+c1-constraint", min = 0, max = 1)
means: calculate c as 1+c1-constraint, and then apply the bounds [0, 1]. The code is doing exactly what you told it to do.
Unless you know what you are doing (I suspect maybe not ;)), I would strongly advise doing a fit with the default leastsq method before trying to use differential_evolution. It turns out that differential_evolution is not a very good global fitting method (shgo is generally better, though no "global" solver should be considered as very reliable). But, unless you know that you need such a method, you probably do not.
I would also strongly advise you to plot your data and some models evaluated with what you think are reasonable parameters.

Monte Carlo in Python

Edited to include VBA code for comparison
Also, we know the analytical value, which is 8.021, towards which the Monte-Carlo should converge, which makes the comparison easier.
Excel VBA gives 8.067 based on averaging 5 Monte-Carlo simulations (7.989, 8.187, 8.045, 8.034, 8.075)
Python gives 7.973 based on 5 MCs (7.913, 7.915, 8.203, 7.739, 8.095) and a larger Variance!
The VBA code is not even "that good", using a rather bad way to produce samples from Standard Normal!
I am running a super simple code in Python to price European Call Option via Monte Carlo, and I am surprised at how "bad" the convergence is with 10,000 "simulated paths". Usually, when running a Monte-Carlo for this simple problem in C++ or even VBA, I get better convergence.
I show the code below (the code is taken from Textbook "Python for Finance" and I run in in Visual Studio Code under Python 3.7.7, 64-bit version): I get the following results, as an example: Run 1 = 7.913, Run 2 = 7.915, Run 3 = 8.203, Run 4 = 7.739, Run 5 = 8.095,
Results such as the above, that differ by so much, would be unacceptable. How can the convergence be improved??? (Obviously by running more paths, but as I said: for 10,000 paths, the result should already have converged much better):
#MonteCarlo valuation of European Call Option
import math
import numpy as np
#Parameter Values
S_0 = 100. # initial value
K = 105. # strike
T = 1.0 # time to maturity
r = 0.05 # short rate (constant)
sigma = 0.2 # vol
nr_simulations = 10000
#Valuation Algo:
# Notice the vectorization below, instead of a loop
z = np.random.standard_normal(nr_simulations)
# Notice that the S_T below is a VECTOR!
S_T = S_0 * np.exp((r-0.5*sigma**2)+math.sqrt(T)*sigma*z)
#Call option pay-off at maturity (Vector!)
C_T = np.maximum((S_T-K),0)
# C_0 is a scalar
C_0 = math.exp(-r*T)*np.average(C_T)
print('Value of the European Call is: ', C_0)
I also include VBA code, which produces slightly better results (in my opinion): with the VBA code below, I get 7.989, 8.187, 8.045, 8.034, 8.075.
Option Explicit
Sub monteCarlo()
' variable declaration
' stock initial & final values, option pay-off at maturity
Dim stockInitial, stockFinal, optionFinal As Double
' r = rate, sigma = volatility, strike = strike price
Dim r, sigma, strike As Double
'maturity of the option
Dim maturity As Double
' instatiate variables
stockInitial = 100#
r = 0.05
maturity = 1#
sigma = 0.2
strike = 105#
' normal is Standard Normal
Dim normal As Double
' randomNr is randomly generated nr via "rnd()" function, between 0 & 1
Dim randomNr As Double
' variable for storing the final result value
Dim result As Double
Dim i, j As Long, monteCarlo As Long
monteCarlo = 10000
For j = 1 To 5
result = 0#
For i = 1 To monteCarlo
' get random nr between 0 and 1
randomNr = Rnd()
'max(Rnd(), 0.000000001)
' standard Normal
normal = Application.WorksheetFunction.Norm_S_Inv(randomNr)
stockFinal = stockInitial * Exp((r - (0.5 * (sigma ^ 2))) + (sigma * Sqr(maturity) * normal))
optionFinal = max((stockFinal - strike), 0)
result = result + optionFinal
Next i
result = result / monteCarlo
result = result * Exp(-r * maturity)
Worksheets("sheet1").Cells(j, 1) = result
Next j
MsgBox "Done"
End Sub
Function max(ByVal number1 As Double, ByVal number2 As Double)
If number1 > number2 Then
max = number1
Else
max = number2
End If
End Function
I don't think there is anything wrong with Python or numpy internals, the convergence is definitely should be the same no matter what tool you're using. I ran a few simulations with different sample sizes and different sigma values. No surprise, it turns out the speed of convergence is heavily controlled by the sigma value, see the plot below. Note that x axis is on log-scale! After the bigger oscillations fade away there are more smaller waves before it stabilizes. The easiest to see at sigma=0.5.
I'm definitely not an expert, but I think the most obvious solution is to increase sample size, as you mentioned. It would be nice to see results and code from C++ or VBA, because I don't know how familiar you are with numpy and python functions. Maybe something is not doing what you think it's doing.
Code to generate the plot (let's not talk about efficiency, it's horrible):
import numpy as np
import matplotlib.pyplot as plt
S_0 = 100. # initial value
K = 105. # strike
T = 1.0 # time to maturity
r = 0.05 # short rate (constant)
fig = plt.figure()
ax = fig.add_subplot()
plt.xscale('log')
samplesize = np.geomspace(1000, 20000000, 64)
sigmas = np.arange(0, 0.7, 0.1)
for s in sigmas:
arr = []
for n in samplesize:
n = n.astype(int)
z = np.random.standard_normal(n)
S_T = S_0 * np.exp((r-0.5*s**2)+np.sqrt(T)*s*z)
C_T = np.maximum((S_T-K),0)
C_0 = np.exp(-r*T)*np.average(C_T)
arr.append(C_0)
ax.scatter(samplesize, arr, label=f'sigma={s:.2f}')
plt.tight_layout()
plt.xlabel('Sample size')
plt.ylabel('Value')
plt.grid()
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles[::-1], labels[::-1], loc='upper left')
plt.show()
Addition:
This time you got closer results to the real value using VBA. But sometimes you don't. The effect of randomness is too big here. The truth is averaging out only 5 results from a low sample number simulation is not meaningful. For example averaging out 50 different simulations in Python (with only n=10000, even though you shouldn't do that if you're willing to get the right answer) yields to 8.025167 (± 0.039717 with 95% confidence level), which is very close to the real solution.

Estimating the p value of the difference between two proportions using statsmodels and PyMC3 (MCMC simulation) in Python

In Probabilistic-Programming-and-Bayesian-Methods-for-Hackers, a method is proposed to compute the p value that two proportions are different.
(You can find the jupyter notebook here containing the entire chapter
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC2.ipynb)
The code is the following:
import pymc3 as pm
figsize(12, 4)
#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04
N_A = 1700
N_B = 1700
#generate some observations
observations_A = bernoulli.rvs(true_p_A, size=N_A)
observations_B = bernoulli.rvs(true_p_B, size=N_B)
print(np.mean(observations_A))
print(np.mean(observations_B))
0.04058823529411765
0.03411764705882353
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
p_A = pm.Uniform("p_A", 0, 1)
p_B = pm.Uniform("p_B", 0, 1)
# Define the deterministic delta function. This is our unknown of interest.
delta = pm.Deterministic("delta", p_A - p_B)
# Set of observations, in this case we have two observation datasets.
obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)
# To be explained in chapter 3.
step = pm.Metropolis()
trace = pm.sample(20000, step=step)
burned_trace=trace[1000:]
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, represent the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % \
np.mean(delta_samples > 0))
Probability site A is WORSE than site B: 0.167
Probability site A is BETTER than site B: 0.833
However, if we compute the p value using statsmodels, we get a very different result:
from scipy.stats import norm, chi2_contingency
import statsmodels.api as sm
s1 = int(1700 * 0.04058823529411765)
n1 = 1700
s2 = int(1700 * 0.03411764705882353)
n2 = 1700
p1 = s1/n1
p2 = s2/n2
p = (s1 + s2)/(n1+n2)
z = (p2-p1)/ ((p*(1-p)*((1/n1)+(1/n2)))**0.5)
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2])
print('z1 is {0} and p is {1}'.format(z1, p))
z1 is 0.9948492584166934 and p is 0.03735294117647059
With MCMC, the p value seems to be 0.167, but using statsmodels, we get a p value 0.037.
How can I understand this?
Looks like you printed the wrong value. Try this instead:
print('z1 is {0} and p is {1}'.format(z1, p_value1))
Also, if you want to test the hypothesis p_A > p_B then you should set the alternative parameter in the function call to larger like so:
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2], alternative='larger')
The docs have more examples on how to use it.

Modified BPMF in PyMC3 using `LKJCorr` priors: PositiveDefiniteError using `NUTS`

I previously implemented the original Bayesian Probabilistic Matrix Factorization (BPMF) model in pymc3. See my previous question for reference, data source, and problem setup. Per the answer to that question from #twiecki, I've implemented a variation of the model using LKJCorr priors for the correlation matrices and uniform priors for the standard deviations. In the original model, the covariance matrices are drawn from Wishart distributions, but due to current limitations of pymc3, the Wishart distribution cannot be sampled from properly. This answer to a loosely related question provides a succinct explanation for the choice of LKJCorr priors. The new model is below.
import pymc3 as pm
import numpy as np
import theano.tensor as t
n, m = train.shape
dim = 10 # dimensionality
beta_0 = 1 # scaling factor for lambdas; unclear on its use
alpha = 2 # fixed precision for likelihood function
std = .05 # how much noise to use for model initialization
# We will use separate priors for sigma and correlation matrix.
# In order to convert the upper triangular correlation values to a
# complete correlation matrix, we need to construct an index matrix:
n_elem = dim * (dim - 1) / 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
logging.info('building the BPMF model')
with pm.Model() as bpmf:
# Specify user feature matrix
sigma_u = pm.Uniform('sigma_u', shape=dim)
corr_triangle_u = pm.LKJCorr(
'corr_u', n=1, p=dim,
testval=np.random.randn(n_elem) * std)
corr_matrix_u = corr_triangle_u[tri_index]
corr_matrix_u = t.fill_diagonal(corr_matrix_u, 1)
cov_matrix_u = t.diag(sigma_u).dot(corr_matrix_u.dot(t.diag(sigma_u)))
lambda_u = t.nlinalg.matrix_inverse(cov_matrix_u)
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
testval=np.random.randn(dim) * std)
U = pm.MvNormal(
'U', mu=mu_u, tau=lambda_u,
shape=(n, dim), testval=np.random.randn(n, dim) * std)
# Specify item feature matrix
sigma_v = pm.Uniform('sigma_v', shape=dim)
corr_triangle_v = pm.LKJCorr(
'corr_v', n=1, p=dim,
testval=np.random.randn(n_elem) * std)
corr_matrix_v = corr_triangle_v[tri_index]
corr_matrix_v = t.fill_diagonal(corr_matrix_v, 1)
cov_matrix_v = t.diag(sigma_v).dot(corr_matrix_v.dot(t.diag(sigma_v)))
lambda_v = t.nlinalg.matrix_inverse(cov_matrix_v)
mu_v = pm.Normal(
'mu_v', mu=0, tau=beta_0 * lambda_v, shape=dim,
testval=np.random.randn(dim) * std)
V = pm.MvNormal(
'V', mu=mu_v, tau=lambda_v,
testval=np.random.randn(m, dim) * std)
# Specify rating likelihood function
R = pm.Normal(
'R', mu=t.dot(U, V.T), tau=alpha * np.ones((n, m)),
observed=train)
# `start` is the start dictionary obtained from running find_MAP for PMF.
# See the previous post for PMF code.
for key in bpmf.test_point:
if key not in start:
start[key] = bpmf.test_point[key]
with bpmf:
step = pm.NUTS(scaling=start)
The goal with this reimplementation was to produce a model that could be estimated using the NUTS sampler. Unfortunately, I'm still getting the same error at the last line:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 ... 1030 1031 1032 1033 1034 ]
I've made all the code for PMF, BPMF, and this modified BPMF available in this gist to make it simple to replicate the error. All you need to do is download the data (also referenced in the gist).
It looks like you are passing the complete precision matrix into the normal distribution:
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
testval=np.random.randn(dim) * std)
I assume you only want to pass the diagonal values:
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * t.diag(lambda_u), shape=dim,
testval=np.random.randn(dim) * std)
Does this change to mu_u and mu_v fix it for you?

Fit two normal distributions (histograms) with MCMC using pymc?

I am trying to fit line profiles as detected with a spectrograph on a CCD. For ease of consideration, I have included a demonstration that, if solved, is very similar to the one I actually want to solve.
I've looked at this:
https://stats.stackexchange.com/questions/46626/fitting-model-for-two-normal-distributions-in-pymc
and various other questions and answers, but they are doing something fundamentally different than what I want to do.
import pymc as mc
import numpy as np
import pylab as pl
def GaussFunc(x, amplitude, centroid, sigma):
return amplitude * np.exp(-0.5 * ((x - centroid) / sigma)**2)
wavelength = np.arange(5000, 5050, 0.02)
# Profile 1
centroid_one = 5025.0
sigma_one = 2.2
height_one = 0.8
profile1 = GaussFunc(wavelength, height_one, centroid_one, sigma_one, )
# Profile 2
centroid_two = 5027.0
sigma_two = 1.2
height_two = 0.5
profile2 = GaussFunc(wavelength, height_two, centroid_two, sigma_two, )
# Measured values
noise = np.random.normal(0.0, 0.02, len(wavelength))
combined = profile1 + profile2 + noise
# If you want to plot what this looks like
pl.plot(wavelength, combined, label="Measured")
pl.plot(wavelength, profile1, color='red', linestyle='dashed', label="1")
pl.plot(wavelength, profile2, color='green', linestyle='dashed', label="2")
pl.title("Feature One and Two")
pl.legend()
My question: Can PyMC (or some variant) give me the mean, amplitude, and sigma for the two components used above?
Please note that the functions that I will actually fit on my real problem are NOT Gaussians -- so please provide the example using a generic function (like GaussFunc in my example), and not a "built-in" pymc.Normal() type function.
Also, I understand model selection is another issue: so with the current noise, 1 component (profile) might be all that is statistically justified. But I'd like to see what the best solution for 1, 2, 3, etc. components would be.
I'm also not wed to the idea of using PyMC -- if scikit-learn, astroML, or some other package seems perfect, please let me know!
EDIT:
I failed a number of ways, but one of the things that I think was on the right track was the following:
sigma_mc_one = mc.Uniform('sig', 0.01, 6.5)
height_mc_one = mc.Uniform('height', 0.1, 2.5)
centroid_mc_one = mc.Uniform('cen', 5015., 5040.)
But I could not construct a mc.model that worked.
Not the most concise PyMC code, but I made that decision to help the reader. This should run, and give (really) accurate results.
I made the decision to use Uniform priors, with liberal ranges, because I really have no idea what we are modelling. But probably one has an idea about the centroid locations, and can use a better priors there.
### Suggested one runs the above code first.
### Unknowns we are interested in
est_centroid_one = mc.Uniform("est_centroid_one", 5000, 5050 )
est_centroid_two = mc.Uniform("est_centroid_two", 5000, 5050 )
est_sigma_one = mc.Uniform( "est_sigma_one", 0, 5 )
est_sigma_two = mc.Uniform( "est_sigma_two", 0, 5 )
est_height_one = mc.Uniform( "est_height_one", 0, 5 )
est_height_two = mc.Uniform( "est_height_two", 0, 5 )
#std deviation of the noise, converted to precision by tau = 1/sigma**2
precision= 1./mc.Uniform("std", 0, 1)**2
#Set up the model's relationships.
#mc.deterministic( trace = False)
def est_profile_1(x = wavelength, centroid = est_centroid_one, sigma = est_sigma_one, height= est_height_one):
return GaussFunc( x, height, centroid, sigma )
#mc.deterministic( trace = False)
def est_profile_2(x = wavelength, centroid = est_centroid_two, sigma = est_sigma_two, height= est_height_two):
return GaussFunc( x, height, centroid, sigma )
#mc.deterministic( trace = False )
def mean( profile_1 = est_profile_1, profile_2 = est_profile_2 ):
return profile_1 + profile_2
observations = mc.Normal("obs", mean, precision, value = combined, observed = True)
model = mc.Model([est_centroid_one,
est_centroid_two,
est_height_one,
est_height_two,
est_sigma_one,
est_sigma_two,
precision])
#always a good idea to MAP it prior to MCMC, so as to start with good initial values
map_ = mc.MAP( model )
map_.fit()
mcmc = mc.MCMC( model )
mcmc.sample( 50000,40000 ) #try running for longer if not happy with convergence.
Important
Keep in mind the algorithm is agnostic to labeling, so the results might show profile1 with all the characteristics from profile2 and vice versa.

Categories

Resources