Porting pyMC2 Bayesian A/B testing example to pyMC3 - python

I am working to learn pyMC 3 and having some trouble. Since there are limited tutorials for pyMC3 I am working from Bayesian Methods for Hackers. I'm trying to port the pyMC 2 code to pyMC 3 in the Bayesian A/B testing example, with no success. From what I can see the model isn't taking into account the observations at all.
I've had to make a few changes from the example, as pyMC 3 is quite different, so what should look like this:
import pymc as pm
# The parameters are the bounds of the Uniform.
p = pm.Uniform('p', lower=0, upper=1)
# set constants
p_true = 0.05 # remember, this is unknown.
N = 1500
# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = pm.rbernoulli(p_true, N)
print occurrences # Remember: Python treats True == 1, and False == 0
print occurrences.sum()
# Occurrences.mean is equal to n/N.
print "What is the observed frequency in Group A? %.4f" % occurrences.mean()
print "Does this equal the true frequency? %s" % (occurrences.mean() == p_true)
# include the observations, which are Bernoulli
obs = pm.Bernoulli("obs", p, value=occurrences, observed=True)
# To be explained in chapter 3
mcmc = pm.MCMC([p, obs])
mcmc.sample(18000, 1000)
figsize(12.5, 4)
plt.title("Posterior distribution of $p_A$, the true effectiveness of site A")
plt.vlines(p_true, 0, 90, linestyle="--", label="true $p_A$ (unknown)")
plt.hist(mcmc.trace("p")[:], bins=25, histtype="stepfilled", normed=True)
plt.legend()
instead looks like:
import pymc as pm
import random
import numpy as np
import matplotlib.pyplot as plt
with pm.Model() as model:
# Prior is uniform: all cases are equally likely
p = pm.Uniform('p', lower=0, upper=1)
# set constants
p_true = 0.05 # remember, this is unknown.
N = 1500
# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = [] # pm.rbernoulli(p_true, N)
for i in xrange(N):
occurrences.append((random.uniform(0.0, 1.0) <= p_true))
occurrences = np.array(occurrences)
obs = pm.Bernoulli('obs', p_true, observed=occurrences)
start = pm.find_MAP()
step = pm.Metropolis()
trace = pm.sample(18000, step, start)
pm.traceplot(trace);
plt.show()
Apologies for the lengthy post but in my adaptation there have been a number of small changes, e.g. manually generating the observations because pm.rbernoulli no longer exists. I'm also not sure if I should be finding the start prior to running the trace. How should I change my implementation to correctly run?

You were indeed close. However, this line:
obs = pm.Bernoulli('obs', p_true, observed=occurrences)
is wrong as you are just setting a constant value for p (p_true == 0.05). Thus, your random variable p defined above to have a uniform prior is not constrained by the likelihood and your plot shows that you are just sampling from the prior. If you replace p_true with p in your code it should work. Here is the fixed version:
import pymc as pm
import random
import numpy as np
import matplotlib.pyplot as plt
with pm.Model() as model:
# Prior is uniform: all cases are equally likely
p = pm.Uniform('p', lower=0, upper=1)
# set constants
p_true = 0.05 # remember, this is unknown.
N = 1500
# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = [] # pm.rbernoulli(p_true, N)
for i in xrange(N):
occurrences.append((random.uniform(0.0, 1.0) <= p_true))
occurrences = np.array(occurrences)
obs = pm.Bernoulli('obs', p, observed=occurrences)
start = pm.find_MAP()
step = pm.Metropolis()
trace = pm.sample(18000, step, start)
pm.traceplot(trace);

This worked for me. I generated the observations before initiating the model.
true_p_A = 0.05
true_p_B = 0.04
N_A = 1500
N_B = 750
obs_A = np.random.binomial(1, true_p_A, size=N_A)
obs_B = np.random.binomial(1, true_p_B, size=N_B)
with pm.Model() as ab_model:
p_A = pm.Uniform('p_A', 0, 1)
p_B = pm.Uniform('p_B', 0, 1)
delta = pm.Deterministic('delta',p_A - p_B)
obs_A = pm.Bernoulli('obs_A', p_A, observed=obs_A)
osb_B = pm.Bernoulli('obs_B', p_B, observed=obs_B)
with ab_model:
trace = pm.sample(2000)
pm.traceplot(trace)

You were very close - you just need to unindent the last two lines, which produce the traceplot. You can think of plotting the traceplot as a diagnostic that should occur after you finish sampling. The following works for me:
import pymc as pm
import random
import numpy as np
import matplotlib.pyplot as plt
with pm.Model() as model:
# Prior is uniform: all cases are equally likely
p = pm.Uniform('p', lower=0, upper=1)
# set constants
p_true = 0.05 # remember, this is unknown.
N = 1500
# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = [] # pm.rbernoulli(p_true, N)
for i in xrange(N):
occurrences.append((random.uniform(0.0, 1.0) <= p_true))
occurrences = np.array(occurrences)
obs = pm.Bernoulli('obs', p_true, observed=occurrences)
start = pm.find_MAP()
step = pm.Metropolis()
trace = pm.sample(18000, step, start)
#Now plot
pm.traceplot(trace)
plt.show()

Related

Constraining log-likelihood using pymc3.Potential?

Problem description
I am new to probabilistic programming and working through PyMC3's Gaussian Mixture Model sample notebook:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
# simulate data from a known mixture distribution
np.random.seed(12345) # set random seed for reproducibility
k = 3
ndata = 500
spread = 5
centers = np.array([-spread, 0, spread])
# simulate data from mixture distribution
v = np.random.randint(0, k, ndata)
data = centers[v] + np.random.randn(ndata)
plt.hist(data);
# setup model
model = pm.Model()
with model:
# cluster sizes
p = pm.Dirichlet("p", a=np.array([1.0, 1.0, 1.0]), shape=k)
# ensure all clusters have some points
p_min_potential = pm.Potential("p_min_potential", tt.switch(tt.min(p) < 0.1, -np.inf, 0))
# cluster centers
means = pm.Normal("means", mu=[0, 0, 0], sigma=15, shape=k)
# break symmetry
order_means_potential = pm.Potential(
"order_means_potential",
tt.switch(means[1] - means[0] < 0, -np.inf, 0)
+ tt.switch(means[2] - means[1] < 0, -np.inf, 0),
)
# measurement error
sd = pm.Uniform("sd", lower=0, upper=20)
# latent cluster of each observation
category = pm.Categorical("category", p=p, shape=ndata)
# likelihood for each observed value
points = pm.Normal("obs", mu=means[category], sigma=sd, observed=data)
# fit model
with model:
step1 = pm.Metropolis(vars=[p, sd, means])
step2 = pm.ElemwiseCategorical(vars=[category], values=[0, 1, 2])
tr = pm.sample(10000, step=[step1, step2], tune=5000)
I am struggling with the following expressions:
# ensure all clusters have some points
p_min_potential = pm.Potential("p_min_potential", tt.switch(tt.min(p) < 0.1, -np.inf, 0))
and
# break symmetry
order_means_potential = pm.Potential(
"order_means_potential",
tt.switch(means[1] - means[0] < 0, -np.inf, 0)
+ tt.switch(means[2] - means[1] < 0, -np.inf, 0),
)
What I researched
From looking at related questions and PyMC3's and Theano's documentation I think I understand that pm.Potential() is a way of setting the log-likelihood of an event in your model during sampling without providing observations to it, and that tt.switch() checks whether a certain condition is met and returns one of two values accordingly.
Thus p_min_potential ensures that all values in p are greater than 0.1 by setting the log-likelihood for an event where one value in p to negative infinite, similarly order_means_potential ensures the values in means differ from on another and their ordering stays the same during sampling.
Questions
Unfortunately neither related questions nor the documentations could answer the following questions:
How are the results of these expressions fed back into the model, as neither p_min_potential nor order_means_potential occur as input to any other expression?
Am I right in how tt.switch() works, thus if the condition tt.min(p) < 0.1 is met -np.inf is returned as that events log-likelihood, and 0 in any other case?
Any help would be greatly appreciated, I would like to understand how this example works to the degree where I'll be able to alter and expand it. Specifically I want to implement a mixture model for two or more beta distributions.

Recognition of a plateau with a slope close to zero

I am writing code to remove plateau outliers from time series data. I proceeded after receiving advice to use np.diff, but there was a problem that it could not be recognized if it was not the same value.
def find_plateaus(F, min_length=200, tolerance = 0.75, smoothing=15):
import numpy as np
from scipy.ndimage.filters import uniform_filter1d
# calculate smooth gradients
smoothF = uniform_filter1d(F, size = smoothing)
dF = uniform_filter1d(np.gradient(smoothF),size = smoothing)
d2F = uniform_filter1d(np.gradient(dF),size = smoothing)
def zero_runs(x):
iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
absdiff = np.abs(np.diff(iszero))
ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
return ranges
# Find ranges where second derivative is zero
# Values under eps are assumed to be zero.
eps = np.quantile(abs(d2F),tolerance)
smalld2F = (abs(d2F) <= eps)
# Find repititions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
p = zero_runs(np.diff(smalld2F))
# np.diff(p) gives the length of each range found.
# only accept plateaus of min_length
plateaus = p[(np.diff(p) > min_length).flatten()]
return (plateaus)
plateaus = find_plateaus(test, min_length=5, tolerance = 0.02, smoothing=11)
plateaus = np.ravel(plateaus, order = 'A')
plateaus = plateaus.tolist()
print(plateaus)
test2['T&F'] = np.nan
for i in test2.index:
if i in plateaus:
test2.loc[i,['T&F']] = test2.loc[i,'data']
else :
test2.loc[i,['T&F']] = 0
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(test2.index, test2['data'], color='black', label = 'time_series')
ax.scatter(test2.index,test2['T&F'], color='red', label = 'D910')
plt.legend()
plt.show();
Do you know any libraries or methods that can be used?
I want to recognize the parts marked in the picture below.
enter image description here
Still in progress, but found the answer.
First, make the np array multidimensional.
ex) time_step = 3
.....
Then, using np.std(), find the standard deviation,
After checking, you can set the standard deviation range to recognize the included range.

Checking if Frequentist approach is correct? Bayesian approach using MCMC for AB test. How to calculate Bayes Factors in Python?

I've been trying to get my head around Frequentist and Bayesian approaches for a toy data AB test problem.
The results don't really make sense to me. I am struggling to understand the results, or whether I have computed them (in)correctly (which is probably likely). Furthermore, after much research, I am still somewhat lost as to how to compute Bayes Factors. I've seen packages in R that make this look somewhat easy. Alas, I am not familiar with R and would prefer to be able to solve this problem in Python.
I would greatly appreciate any help and guidance regarding this!
Here is the data:
# imports
import pingouin as pg
import pymc3 as pm
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.stats.api as sms
import math
import matplotlib.pyplot as plt
# A = control -- B = treatment
a_success = 10730
a_failure = 61988
a_total = a_success + a_failure
a_cr = a_success / a_total
b_success = 10966
b_failure = 60738
b_total = b_success + b_failure
b_cr = b_success / b_total
I started by doing some power analysis, to determine the number of required samples with a power of 0.8, alpha of 0.05 and a practical significance of 2%. I'm not sure whether expected conversion rates should be supplied, or the baseline + some proportion. Depending on the effect size, the required number of samples increases significantly.
# determine required sample size
baseline_rate = a_cr
practical_significance = 0.02
alpha = 0.05
power = 0.8
nobs1 = None
# is this how to calculate effect size?
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance) # 5204
# # or this?
# effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + baseline_rate * practical_significance) # 228583
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size,
power = power,
alpha = alpha,
nobs1 = nobs1,
ratio = 1)
I continued trying to determine if the null hypothesis could be rejected:
# calculate pooled probability
pooled_probability = (a_success + b_success) / (a_total + b_total)
# calculate pooled standard error and margin of error
se_pooled = math.sqrt(pooled_probability * (1 - pooled_probability) * (1 / b_total + 1 / a_total))
z_score = scs.norm.ppf(1 - alpha / 2)
margin_of_error = se_pooled * z_score
# the estimated difference between probability of conversions of both groups
d_hat = (test_b_success / test_b_total) - (test_a_success / test_a_total)
# test if null hypothesis can be rejected
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error
if practical_significance < lower_bound:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
which evaluates to 'do not reject the null ...' despite group B (treatment) showing a 3.65% relative improvement with regards to conversion rate over group A (control) which seems... odd?
I tried a slightly different approach (I guess a slightly different hypothesis?):
successes = [a_success, b_success]
nobs = [a_total, b_total]
z_stat, p_value = sms.proportions_ztest(successes, nobs=nobs)
(lower_a, lower_b), (upper_a, upper_b) = sms.proportion_confint(successes, nobs=nobs, alpha=alpha)
if p_value < alpha:
print("reject null hypothesis -- groups do not have the same conversion rates")
else:
print("do not reject the null hypothesis -- groups have the same conversion rates")
Which evaluates to 'reject null hypothesis ... ' with p-value: 0.004236. This seems highly contradictory, especially since the p-value is < 0.01.
On to Bayes... I created some arrays of success and failures (and only tested on 100 observations) due to how long this thing takes, and ran the following:
# generate lists of 1, 0
obs_a = np.repeat([1, 0], [a_success, a_failure])
obs_v = np.repeat([1, 0], [b_success, b_failure])
for _ in range(10):
np.random.shuffle(observations_A)
np.random.shuffle(observations_B)
with pm.Model() as model:
p_A = pm.Beta("p_A", 1, 1)
p_B = pm.Beta("p_B", 1, 1)
delta = pm.Deterministic("delta", p_A - p_B)
obs_A = pm.Bernoulli("obs_A", p_A, observed = obs_a[:1000])
obs_B = pm.Bernoulli("obs_B", p_B, observed = obs_b[:1000])
step = pm.NUTS()
trace = pm.sample(1000, step = step, chains = 2)
Firstly, I understand that you are supposed to burn some proportion of the trace -- how do you determine an appropriate number of indices to burn?
In trying to evaluate the posterior probabilities, is the following code the correct way to do this?
b_lift = (trace['p_B'].mean() - trace['p_A'].mean()) / trace['p_A'].mean() * 100
b_prob = np.mean(trace["delta"] > 0)
a_lift = (trace['p_A'].mean() - trace['p_B'].mean()) / trace['p_B'].mean() * 100
a_prob = np.mean(trace["delta"] < 0)
# is the Bayes Factor just the ratio of the posterior probabilities for these two models?
BF = (trace['p_B'] / trace['p_A']).mean()
print(f'There is {b_prob} probability B outperforms A by a magnitude of {round(b_lift, 2)}%')
print(f'There is {a_prob} probability A outperforms B by a magnitude of {round(a_lift, 2)}%')
print('BF:', BF)
-- output:
There is 0.666 probability B outperforms A by a magnitude of 1.29%
There is 0.334 probability A outperforms B by a magnitude of -1.28%
BF: 1.013357654428127
I suspect that this is not the correct way to calculate Bayes Factors. How can the Bayes Factor be calculated?
I really hope you can help me understand all of the above... I realize it's an exceptionally long post. But I've tried every resource I can find and am still stuck!
Kind regards.

Estimating the p value of the difference between two proportions using statsmodels and PyMC3 (MCMC simulation) in Python

In Probabilistic-Programming-and-Bayesian-Methods-for-Hackers, a method is proposed to compute the p value that two proportions are different.
(You can find the jupyter notebook here containing the entire chapter
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC2.ipynb)
The code is the following:
import pymc3 as pm
figsize(12, 4)
#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04
N_A = 1700
N_B = 1700
#generate some observations
observations_A = bernoulli.rvs(true_p_A, size=N_A)
observations_B = bernoulli.rvs(true_p_B, size=N_B)
print(np.mean(observations_A))
print(np.mean(observations_B))
0.04058823529411765
0.03411764705882353
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
p_A = pm.Uniform("p_A", 0, 1)
p_B = pm.Uniform("p_B", 0, 1)
# Define the deterministic delta function. This is our unknown of interest.
delta = pm.Deterministic("delta", p_A - p_B)
# Set of observations, in this case we have two observation datasets.
obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)
# To be explained in chapter 3.
step = pm.Metropolis()
trace = pm.sample(20000, step=step)
burned_trace=trace[1000:]
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, represent the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % \
np.mean(delta_samples > 0))
Probability site A is WORSE than site B: 0.167
Probability site A is BETTER than site B: 0.833
However, if we compute the p value using statsmodels, we get a very different result:
from scipy.stats import norm, chi2_contingency
import statsmodels.api as sm
s1 = int(1700 * 0.04058823529411765)
n1 = 1700
s2 = int(1700 * 0.03411764705882353)
n2 = 1700
p1 = s1/n1
p2 = s2/n2
p = (s1 + s2)/(n1+n2)
z = (p2-p1)/ ((p*(1-p)*((1/n1)+(1/n2)))**0.5)
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2])
print('z1 is {0} and p is {1}'.format(z1, p))
z1 is 0.9948492584166934 and p is 0.03735294117647059
With MCMC, the p value seems to be 0.167, but using statsmodels, we get a p value 0.037.
How can I understand this?
Looks like you printed the wrong value. Try this instead:
print('z1 is {0} and p is {1}'.format(z1, p_value1))
Also, if you want to test the hypothesis p_A > p_B then you should set the alternative parameter in the function call to larger like so:
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2], alternative='larger')
The docs have more examples on how to use it.

MCMC Sampling a Maxwellian Curve Using Python's emcee

I am trying to introduce myself to MCMC sampling with emcee. I want to simply take a sample from a Maxwell Boltzmann distribution using a set of example code on github, https://github.com/dfm/emcee/blob/master/examples/quickstart.py.
The example code is really excellent, but when I change the distribution from a Gaussian to a Maxwellian, I receive the error, TypeError: lnprob() takes exactly 2 arguments (3 given)
However it is not called anywhere where it is not given the appropriate parameters? In need of some guidance as to how to define a Maxwellian Curve and have it fit into this example code.
Here is what I have;
from __future__ import print_function
import numpy as np
import emcee
try:
xrange
except NameError:
xrange = range
def lnprob(x, a, icov):
pi = np.pi
return np.sqrt(2/pi)*x**2*np.exp(-x**2/(2.*a**2))/a**3
ndim = 2
means = np.random.rand(ndim)
cov = 0.5-np.random.rand(ndim**2).reshape((ndim, ndim))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov,cov)
icov = np.linalg.inv(cov)
nwalkers = 50
p0 = [np.random.rand(ndim) for i in xrange(nwalkers)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[means, icov])
pos, prob, state = sampler.run_mcmc(p0, 5000)
sampler.reset()
sampler.run_mcmc(pos, 100000, rstate0=state)
Thanks
I think there are a couple of problems that I see. The main one is that emcee wants you to give it the natural logarithm of the probability distribution function that you want to sample. So, rather than having:
def lnprob(x, a, icov):
pi = np.pi
return np.sqrt(2/pi)*x**2*np.exp(-x**2/(2.*a**2))/a**3
you would instead want, e.g.
def lnprob(x, a):
pi = np.pi
if x < 0:
return -np.inf
else:
return 0.5*np.log(2./pi) + 2.*np.log(x) - (x**2/(2.*a**2)) - 3.*np.log(a)
where the if...else... statement is to explicitly say that negative values of x have zero probability (or -infinity in log-space).
You also shouldn't have to calculate icov and pass it to lnprob as that's only needed for the Gaussian case in the example you link to.
When you call:
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[means, icov])
the args value should just be any additional arguments that your lnprob function requires, so in your case this would be the value of a that you want to set your Maxwell-Boltxmann distribution up with. This should be a single value rather than the two randomly initialised values you have set when creating mean.
Overall, the following should work for you:
from __future__ import print_function
import emcee
import numpy as np
from numpy import pi as pi
# define the natural log of the Maxwell-Boltzmann distribution
def lnprob(x, a):
if x < 0:
return -np.inf
else:
return 0.5*np.log(2./pi) + 2.*np.log(x) - (x**2/(2.*a**2)) - 3.*np.log(a)
# choose a value of 'a' for the distributions
a = 5. # I'm choosing 5!
# choose the number of walkers
nwalkers = 50
# set some initial points at which to calculate the lnprob
p0 = [np.random.rand(1) for i in xrange(nwalkers)]
# initialise the sampler
sampler = emcee.EnsembleSampler(nwalkers, 1, lnprob, args=[a])
# Run 5000 steps as a burn-in.
pos, prob, state = sampler.run_mcmc(p0, 5000)
# Reset the chain to remove the burn-in samples.
sampler.reset()
# Starting from the final position in the burn-in chain, sample for 100000 steps.
sampler.run_mcmc(pos, 100000, rstate0=state)
# lets check the samples look right
mbmean = 2.*a*np.sqrt(2./pi) # mean of Maxwell-Boltzmann distribution
print("Sample mean = {}, analytical mean = {}".format(np.mean(sampler.flatchain[:,0]), mbmean))
mbstd = np.sqrt(a**2*(3*np.pi-8.)/np.pi) # std. dev. of M-B distribution
print("Sample standard deviation = {}, analytical = {}".format(np.std(sampler.flatchain[:,0]), mbstd))

Categories

Resources