Convert WINBUGS model to PyMC3 - python

I am currently taking a class on Bayesian statistics. We are allowed to use any package to computationally solve the models, but all of the examples are provided in WINBUGS. I would prefer to use Python and PyMC3. I don't have much experience with PyMC3 and could use some help converting this simple WINBUGS model into a PyMC3 model.
The example WINBUGS code is below. It is a simple binomial model comparing two options with a different number of observations per sample. The model also tests five different priors.
model{
    for(i in 1:5){
        n1[i] <- Tot1  # 100
        n2[i] <- Tot2  # 3
        y1[i] <- Positives1
        y2[i] <- Positives2
        y1[i] ~ dbin(p1[i], n1[i])
        y2[i] ~ dbin(p2[i], n2[i])
        diffps[i] <- p1[i] - p2[i]  # 100seller - 3seller
    }
    # Uniform priors
    p1[1] ~ dbeta(1, 1);  p2[1] ~ dbeta(1, 1)
    # Jeffreys' priors
    p1[2] ~ dbeta(0.5, 0.5);  p2[2] ~ dbeta(0.5, 0.5)
    # Informative priors centered at about 93% and 97%
    p1[3] ~ dbeta(30, 2);  p2[3] ~ dbeta(2.9, 0.1)
    # Zellner priors proportional to 1/(p * (1-p))
    logit(p1[4]) <- x[1]
    x[1] ~ dunif(-10000, 10000)  # as dflat()
    logit(p2[4]) <- x[2]
    x[2] ~ dunif(-10000, 10000)  # as dflat()
    # Logit centered at 3 gives mean probs close to 95%
    logit(p1[5]) <- x[3]
    x[3] ~ dnorm(3, 1)
    logit(p2[5]) <- x[4]
    x[4] ~ dnorm(3, 1)
}
DATA
list(Tot1=100, Tot2=3, Positives1=95, Positives2=3)

INITS
list(p1=c(0.9, 0.9, 0.9, NA, NA), p2=c(0.9, 0.9, 0.9, NA, NA), x=c(0, 0, 0, 0))
list(p1=c(0.5, 0.5, 0.5, NA, NA), p2=c(0.5, 0.5, 0.5, NA, NA), x=c(0, 0, 0, 0))
list(p1=c(0.3, 0.3, 0.3, NA, NA), p2=c(0.3, 0.3, 0.3, NA, NA), x=c(0, 0, 0, 0))
In PyMC3 I attempted to implement the first of the five priors on a single sample (I am not sure how to handle both samples) with the following code:
import numpy as np
import pymc3 as pm

sample2 = np.ones(3)  # the 3-observation sample: 3 positives out of 3

with pm.Model() as ebay_example:
    prior = pm.Beta('theta', alpha=1, beta=1)
    likelihood = pm.Bernoulli('y', p=prior, observed=sample2)
    trace = pm.sample(1000, tune=2000, target_accept=0.95)
The above model runs, but the results don't align with the BUGS results. I am not sure whether that's because I didn't do a burn-in or because of some other, larger issue. Any guidance would be great.

We are taking the same class right now. Below is my code; the difference of the posterior means is close to the BUGS result.
import pymc3 as pm

na = 100
nb = 3
pos_a = 95
pos_b = 3

with pm.Model() as model_a:
    # prior
    p0a = pm.Beta('p0a', 1, 1)
    # likelihood
    obs_a = pm.Binomial("obs_a", n=na, p=p0a, observed=pos_a)
    # sample
    trace1_a = pm.sample(1000)

with pm.Model() as model_b:
    # prior
    p0b = pm.Beta('p0b', 1, 1)
    # likelihood
    obs_b = pm.Binomial("obs_b", n=nb, p=p0b, observed=pos_b)
    # sample
    trace1_b = pm.sample(1000)

pm.summary(trace1_a)["mean"][0] - pm.summary(trace1_b)["mean"][0]
OUT: 0.1409999999999999
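For what it's worth, the two samples can also live in a single model, which mirrors the BUGS setup more directly: a pm.Deterministic node plays the role of diffps, so the posterior of the difference is sampled jointly rather than approximated from two separate runs. A minimal sketch, assuming the data from the question and only the first (uniform) prior:

import pymc3 as pm

with pm.Model() as both_sellers:
    # uniform Beta(1, 1) priors, the first of the five in the BUGS model
    p1 = pm.Beta('p1', alpha=1, beta=1)
    p2 = pm.Beta('p2', alpha=1, beta=1)

    # binomial likelihoods: 95/100 and 3/3 positives
    obs1 = pm.Binomial('obs1', n=100, p=p1, observed=95)
    obs2 = pm.Binomial('obs2', n=3, p=p2, observed=3)

    # tracked difference, like diffps in the BUGS model
    diffps = pm.Deterministic('diffps', p1 - p2)

    trace = pm.sample(2000, tune=2000)

pm.summary(trace)  # the 'diffps' row gives the posterior of p1 - p2

On the burn-in question: pm.sample's tune argument plays that role, and the tuning draws are discarded by default (discard_tuned_samples=True), so no manual burn-in slicing is needed.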

Old PyMC3 style grouping traceplot plotted with Arviz

I have an old blogpost where I am training a PyMC3 model. You can find the blogpost here, but the gist of the model is shown below.
with pm.Model() as model:
    mu_intercept = pm.Normal('mu_intercept', mu=40, sd=5)
    mu_slope = pm.HalfNormal('mu_slope', 10, shape=(n_diets,))
    mu = mu_intercept + mu_slope[df.diet - 1] * df.time

    sigma_intercept = pm.HalfNormal('sigma_intercept', sd=2)
    sigma_slope = pm.HalfNormal('sigma_slope', sd=2, shape=n_diets)
    sigma = sigma_intercept + sigma_slope[df.diet - 1] * df.time

    weight = pm.Normal('weight', mu=mu, sd=sigma, observed=df.weight)
    approx = pm.fit(20000, random_seed=42, method="fullrank_advi")
In this dataset I'm estimating the effect of Diet on the weight of chickens. This is what the traceplot looks like.
Look at how pretty it is! Each diet has its own line! Beautiful!
Arviz Changes
This traceplot was made using the older PyMC3 API. Nowadays this functionality has moved to arviz, so I tried redoing this work but ... the plot looks very different.
The code that I'm using here is slightly different. I'm using pm.Data now, but I doubt that's supposed to cause this difference.
with pm.Model() as mod:
    time_in = pm.Data("time_in", df['time'].astype(float))
    diet_in = pm.Data("diet_in", dummies)

    intercept = pm.Normal("intercept", 0, 2)
    time_effect = pm.Normal("time_weight_effect", 0, 2, shape=(4,))
    diet = pm.Categorical("diet", p=[0.25, 0.25, 0.25, 0.25], shape=(4,), observed=diet_in)

    sigma = pm.HalfNormal("sigma", 2)
    sigma_time_effect = pm.HalfNormal("time_sigma_effect", 2, shape=(4,))

    weight = pm.Normal("weight",
                       mu=intercept + time_effect.dot(diet_in.T) * time_in,
                       sd=sigma + sigma_time_effect.dot(diet_in.T) * time_in,
                       observed=df.weight)
    trace = pm.sample(5000, return_inferencedata=True)
What do I need to do to get the different colors per DIET back in?
There's a parameter for it in the new plot_trace function. This does the trick:
import arviz as az

az.plot_trace(trace, compact=True)
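compact=True overlays each dimension of a multi-dimensional variable on a single axis, each in its own color, which restores the one-line-per-diet look. If you want to focus on a single variable, plot_trace also accepts var_names; for example, using the slope variable from the model above:

import arviz as az

# one panel per variable; the four diet dimensions are overlaid in different colors
az.plot_trace(trace, var_names=["time_weight_effect"], compact=True)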

Estimating the p value of the difference between two proportions using statsmodels and PyMC3 (MCMC simulation) in Python

In Probabilistic-Programming-and-Bayesian-Methods-for-Hackers, a method is proposed to compute the p value that two proportions are different.
(You can find the Jupyter notebook containing the entire chapter here:
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC2.ipynb)
The code is the following:
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli
from IPython.core.pylabtools import figsize  # notebook helper used in the book's setup

figsize(12, 4)

# these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

N_A = 1700
N_B = 1700

# generate some observations
observations_A = bernoulli.rvs(true_p_A, size=N_A)
observations_B = bernoulli.rvs(true_p_B, size=N_B)
print(np.mean(observations_A))
print(np.mean(observations_B))
0.04058823529411765
0.03411764705882353
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
    p_A = pm.Uniform("p_A", 0, 1)
    p_B = pm.Uniform("p_B", 0, 1)

    # Define the deterministic delta function. This is our unknown of interest.
    delta = pm.Deterministic("delta", p_A - p_B)

    # Set of observations; in this case we have two observation datasets.
    obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
    obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

    # To be explained in chapter 3.
    step = pm.Metropolis()
    trace = pm.sample(20000, step=step)
    burned_trace = trace[1000:]

p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]

# Count the number of samples less than 0, i.e. the area under the curve
# before 0, representing the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % np.mean(delta_samples > 0))
Probability site A is WORSE than site B: 0.167
Probability site A is BETTER than site B: 0.833
However, if we compute the p value using statsmodels, we get a very different result:
from scipy.stats import norm, chi2_contingency
import statsmodels.api as sm

s1 = int(1700 * 0.04058823529411765)
n1 = 1700
s2 = int(1700 * 0.03411764705882353)
n2 = 1700

p1 = s1 / n1
p2 = s2 / n2
p = (s1 + s2) / (n1 + n2)
z = (p2 - p1) / ((p * (1 - p) * ((1 / n1) + (1 / n2))) ** 0.5)

z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2])
print('z1 is {0} and p is {1}'.format(z1, p))
z1 is 0.9948492584166934 and p is 0.03735294117647059
With MCMC, the p value seems to be 0.167, but using statsmodels we get a p value of 0.037.
How can I understand this?
Looks like you printed the wrong value. Try this instead:
print('z1 is {0} and p is {1}'.format(z1, p_value1))
Also, if you want to test the hypothesis p_A > p_B, then you should set the alternative parameter in the function call to 'larger', like so:
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2], alternative='larger')
The docs have more examples on how to use it.
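Putting the two fixes together, a corrected version of the statsmodels part (with the same counts as in the question) might look like this:

import statsmodels.api as sm

s1, n1 = 69, 1700   # int(1700 * 0.04058823529411765) successes for site A
s2, n2 = 58, 1700   # int(1700 * 0.03411764705882353) successes for site B

# one-sided test of the alternative hypothesis p_A > p_B
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2],
                                          alternative='larger')
print('z1 is {0} and p is {1}'.format(z1, p_value1))

With z1 around 0.99, the one-sided p-value comes out near 0.16, close to the 0.167 posterior probability from the MCMC run; the 0.037 you printed was the pooled proportion p, not a p-value.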

Fit negative exponential model in LMFIT

How does lmfit's exponential model work when approximating a (negative) exponential function?
The following attempt follows https://lmfit.github.io/lmfit-py/model.html but fails to produce the right results:
import lmfit

mod = lmfit.models.ExponentialModel()
pars = mod.guess([1, 0.5], x=[0, 1])
out = mod.fit([1, 0.5], pars, x=[0, 1])

out.eval(x=0)  # result is 0.74999998273811308, should be 1
out.eval(x=1)  # result is 0.75000001066995159, should be 0.5
You'll need more than two data points to fit the two-parameter exponential model to data. Lmfit Models are designed to do data fitting. Something like this will work:
import numpy as np
import lmfit
xdat = np.linspace(0, 2.0, 11)
ydat = 2.1 * np.exp(-xdat / 0.88) + np.random.normal(size=len(xdat), scale=0.06)
mod = lmfit.models.ExponentialModel()
pars = mod.guess(ydat, x=xdat)
out = mod.fit(ydat, pars, x=xdat)
print(out.fit_report())
With only two data points, you're instead getting amplitude = 0.75 and decay > 1e6, i.e. an essentially flat line through your two points.
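For reference, lmfit's ExponentialModel fits the form amplitude * exp(-x / decay), and the fitted values can be read back from the result, e.g.:

# parameter names 'amplitude' and 'decay' are fixed by ExponentialModel
print(out.params['amplitude'].value, out.params['decay'].value)
# with the synthetic data above, these should land near 2.1 and 0.88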

PyMC3 Multinomial Model doesn't work with non-integer observe data

I'm trying to use PyMC3 to fit a fairly simple multinomial model. It works perfectly if I have the 'noise' value set to 0.0. However, when I change it to anything else, for example 0.01, I get an error in the find_MAP() function, and it hangs if I don't use find_MAP().
Is there some reason that the multinomial has to be sparse?
import numpy as np
from pymc3 import *
import pymc3 as mc

print('pymc3 version: ' + mc.__version__)

sample_size = 10
number_of_experiments = 1
true_probs = [0.2, 0.1, 0.3, 0.4]
k = len(true_probs)
noise = 0.0

y = np.random.multinomial(n=number_of_experiments, pvals=true_probs, size=sample_size) + noise
y_denominator = np.sum(y, axis=1)
y = y / y_denominator[:, None]

with Model() as multinom_test:
    probs = Dirichlet('probs', a=np.ones(k), shape=k)
    for i in range(sample_size):
        data = Multinomial('data_%d' % i,
                           n=y[i].sum(),
                           p=probs,
                           observed=y[i])

with multinom_test:
    start = find_MAP()
    trace = sample(5000, Slice())

trace[probs].mean(0)
Error:
ValueError: Optimization error: max, logp or dlogp at max have non-finite values.
Some values may be outside of distribution support.
max: {'probs_stickbreaking_': array([ 0.00000000e+00, -4.47034834e-08, 0.00000000e+00])}
logp: array(-inf)
dlogp: array([ 0.00000000e+00, 2.98023221e-08, 0.00000000e+00])
Check that 1) you don't have hierarchical parameters, these will lead to points
with infinite density. 2) your distribution logp's are properly specified.
Specific issues:
This works for me:
import numpy as np
import pymc3 as pm

sample_size = 10
number_of_experiments = 100
true_probs = [0.2, 0.1, 0.3, 0.4]
k = len(true_probs)
noise = 0.01

y = np.random.multinomial(n=number_of_experiments, pvals=true_probs, size=sample_size) + noise

with pm.Model() as multinom_test:
    a = pm.Dirichlet('a', a=np.ones(k))
    for i in range(sample_size):
        data_pred = pm.Multinomial('data_pred_%s' % i, n=number_of_experiments, p=a, observed=y[i])
    trace = pm.sample(50000, pm.Metropolis())
    # trace = pm.sample(1000)  # also works with NUTS

pm.traceplot(trace[500:]);
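The underlying constraint is that pm.Multinomial is a distribution over integer counts whose rows sum to n, so normalized or noisy rows put the observed data outside its support. If your data arrive as proportions, one workaround (a sketch under that assumption, not the only option) is to convert them back to counts before modeling:

import numpy as np

n = 100                                    # assumed number of trials per row
counts = np.round(y * n).astype(int)       # y: rows of proportions
counts[:, -1] = n - counts[:, :-1].sum(1)  # force each row to sum exactly to n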

Supplying test values in pymc 3

I am exploring the use of bounded distributions in pymc. I am trying to bound a Gamma prior distribution between two values. The model specification seems to fail due to the absence of test values. How can I pass a testval argument so that I am able to specify these sorts of models?
For completeness I have included the error, as well as a minimal example below. Thank you!
AttributeError: <pymc.quickclass.Gamma object at 0x110a62890> has no default value to use, checked for: ['median', 'mean', 'mode'] pass testval argument or provide one of these.
import pymc as pm
import numpy as np

ndims = 2
nobs = 20
zdata = np.random.normal(loc=0, scale=0.75, size=(ndims, nobs))

BoundedGamma = pm.Bound(pm.Gamma, 0.5, 2)

with pm.Model() as model:
    xbound = BoundedGamma('xbound', alpha=1, beta=2)
    z = pm.Normal('z', mu=0, tau=xbound, shape=(ndims, 1), observed=zdata)
Edit: for reference, here is a simple working model using a bounded gamma prior distribution:
import pymc as pm
import numpy as np

ndims = 2
nobs = 20
zdata = np.random.normal(loc=0, scale=0.75, size=(ndims, nobs))

BoundedGamma = pm.Bound(pm.Gamma, 0.5, 2)

with pm.Model() as model:
    xbound = BoundedGamma('xbound', alpha=1, beta=2, testval=2)
    z = pm.Normal('z', mu=0, tau=xbound, shape=(ndims, 1), observed=zdata)

with model:
    start = pm.find_MAP()

with model:
    step = pm.NUTS()

with model:
    trace = pm.sample(3000, step, start)

pm.traceplot(trace);
Use this line, passing a testval that lies inside the bounds:
xbound = BoundedGamma('xbound', alpha=1, beta=2, testval=1)
(The bounded Gamma appears to have no usable default test value here; note that Gamma(alpha=1, beta=2) has mean alpha/beta = 0.5, which sits exactly on the lower bound, so an explicit testval is needed.)
