Difficulties on pymc3 vs. pymc2 when discrete variables are involved - python

I'm updating some calculations from pymc2 to pymc3, and I'm having problems with the samplers' behavior when my model contains discrete random variables. As an example, consider the following model using pymc2:
import pymc as pm
N = 100
data = 10
p = pm.Beta('p', alpha=1.0, beta=1.0)
q = pm.Beta('q', alpha=1.0, beta=1.0)
A = pm.Binomial('A', N, p)
X = pm.Binomial('x', A, q, observed=True, value=data)
It's not really representative of anything; it's just a model where one of the unobserved variables is discrete. When I sample this model with pymc2, I get the following results:
mcmc = pm.MCMC([p, q, A, X])
mcmc.sample(iter=100000, burn=50000, thin=100)
pm.Matplot.plot(mcmc)
But when I try the same with PyMC3, I get this:
with pm.Model() as model:
    N = 100
    p = pm.Beta('p', alpha=1.0, beta=1.0)
    q = pm.Beta('q', alpha=1.0, beta=1.0)
    A = pm.Binomial('A', N, p)
    X = pm.Binomial('x', A, q, observed=10)

with model:
    start = pm.find_MAP()

with model:
    step = pm.NUTS()
    trace = pm.sample(3000, step, start)

pm.traceplot(trace)
It looks like the variable A is not being sampled at all. I haven't read much about the sampling method used in pymc3, but I noticed it seems to be particularly aimed at continuous models. Does this mean it rules out discrete unobserved variables in the model, or is there some way to do what I'm trying to do?

The NUTS sampler does not work with discrete variables (though folks are working on generalizing it to do so). What you'd want to do is assign different step methods to different types of variables. For example:
step1 = pm.NUTS(vars=[p, q])
step2 = pm.Metropolis(vars=[A])
trace = pm.sample(3000, [step1, step2], start)
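For context, here is a minimal end-to-end sketch of how those step assignments could be plugged back into the model from the question (the sampler settings are arbitrary):
import pymc3 as pm

with pm.Model() as model:
    p = pm.Beta('p', alpha=1.0, beta=1.0)
    q = pm.Beta('q', alpha=1.0, beta=1.0)
    A = pm.Binomial('A', n=100, p=p)              # discrete, unobserved
    X = pm.Binomial('x', n=A, p=q, observed=10)

    step1 = pm.NUTS(vars=[p, q])                  # gradient-based sampler for the continuous variables
    step2 = pm.Metropolis(vars=[A])               # Metropolis handles the discrete variable
    trace = pm.sample(3000, step=[step1, step2])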

Related

PYMC3 Mixture model: help understanding multiple variables model

Let's say I have a dataframe with 4 variables. I want to see if I can generate a posterior of gamma mixtures over all the variables, with the goal of finding clusters for each observation. I'm guessing I will need some sort of multivariate gamma distribution? But how would I go about this?
Here is some pymc3 code as an example with one parameter, looking for a mixture of two gammas (I have chosen arbitrary parameters):
with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(2))
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=2)
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=2)
    comp_dist = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(2,))
    like = pm.Mixture('y', w=p, comp_dists=comp_dist, observed=data)
    trace = pm.sample(1000)
So my question is, how would I extend this basic example to multiple variables? I assume that I need to define relationships between the variables somehow to encode them in the model? I feel that I understand the basics of mixture modelling, but at the same time feel that I am missing something pretty fundamental.
Here's how the multidimensional case should work:
J = 4  # num dimensions
K = 2  # num clusters

with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(K))
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=(J, K))
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=(J, K))
    gamma = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(J, K))
    like = pm.Mixture('y', w=p, comp_dists=gamma, observed=X, shape=J)
    trace = pm.sample(1000)
where X.shape should be (N,J).
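If you want to try this out before plugging in real data, here is a rough sketch (not from the original answer) for simulating an X of the right (N, J) shape; the cluster parameters below are arbitrary:
import numpy as np

N, J, K = 500, 4, 2
rng = np.random.RandomState(0)

# Assign each of the N observations to one of K clusters.
assignments = rng.choice(K, size=N, p=[0.6, 0.4])

# Arbitrary shape/rate parameters per (dimension, cluster).
true_alpha = np.tile([2.0, 9.0], (J, 1))   # shape (J, K)
true_beta = np.tile([1.0, 3.0], (J, 1))    # shape (J, K)

# Draw each observation's J dimensions from its assigned cluster's gamma parameters.
X = rng.gamma(true_alpha[:, assignments].T,
              1.0 / true_beta[:, assignments].T)   # X.shape == (N, J)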
Note on Symmetry Breaking
The difficult part is going to be resolving identifiability issues, but I think that's beyond the scope of the question. Maybe have a look at how the GMM tutorial breaks symmetry using the pm.Potential function. I expect highly-correlated parameterizations of the likelihood function(s), like alpha and beta, would exacerbate the issue, so perhaps consider switching to the mu and sigma parameterization.
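For illustration only, here is a hedged sketch of the kind of ordering constraint the GMM tutorial builds with pm.Potential, applied to one slice of the alpha parameter from the model above (which parameter to order, and along which slice, is a modeling choice):
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with m:
    # Forbid orderings where cluster 1's alpha (first dimension) is below
    # cluster 0's, which discourages label switching between the clusters.
    order_alpha = pm.Potential(
        'order_alpha',
        tt.switch(alpha[0, 1] - alpha[0, 0] < 0, -np.inf, 0),
    )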

How to overplot fit results for discrete values in pymc3?

I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
import numpy as np
from scipy.stats import binom

mean_alpha = np.mean(trace['alpha'])
mean_beta = np.mean(trace['beta'])
pred_death = binom.rvs(n, 1. / (1. + np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi, a simple way to do it is as follows:
from pymc3 import *
import numpy as np
from numpy import ones, array

# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])

def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))

with Model() as model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, 0.01)
    beta = Normal('beta', 0, 0.01)
    # Calculate probabilities of death
    theta = Deterministic('theta', invlogit(alpha + beta * dose))
    # Data likelihood
    deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])

    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start, progressbar=True)

import matplotlib.pyplot as plt

death_fit = np.percentile(trace['theta'], 50, axis=0)
plt.plot(dose, death_fit, 'g', marker='.', lw=1.25, ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
Instead, if you want to plot dose vs pred_death while taking into account the uncertainty in the posterior for alpha and beta, then probably the easiest way is to use the function sample_ppc. Maybe something like:
ppc = pm.sample_ppc(trace, samples=100, model=model)
for i in range(100):
    plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using posterior predictive checks (PPC) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc.
Other options could be to plot the mean value plus some interval of interest.
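For instance, here is a rough sketch of that last option (assuming ppc['deaths'] from sample_ppc above, with one column per dose level; the 90% band is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt

deaths_ppc = np.asarray(ppc['deaths'])           # shape (samples, len(dose))
mean_deaths = deaths_ppc.mean(axis=0)
low, high = np.percentile(deaths_ppc, [5, 95], axis=0)

plt.plot(dose, mean_deaths, 'b-', label='posterior predictive mean')
plt.fill_between(dose, low, high, color='b', alpha=0.2, label='90% interval')
plt.plot(dose, [0, 1, 3, 5], 'ko', label='observed deaths')
plt.legend()
plt.show()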

How to code a hierarchical mixture model of multivariate normals using PYMC

I successfully implemented a mixture of 3 normals using PyMC (shown at https://drive.google.com/file/d/0Bwnmbh6ueWhqSkUtV1JFZDJwLWc, and similar to the question asked at How to model a mixture of 3 Normals in PyMC?)
My next step is to try and code mixtures of multivariate normals.
There is, however, an additional complexity to the data - a hierarchy, with sets of observations belonging to a parent observation. The clustering is done on the parent observations, not on the individual observations themselves. This first step generates the data (60 parents, with 50 observations per parent), and works fine.
import numpy as np
import pymc as mc
n = 3 #mixtures
B = 5 #Bias between those at different mixtures
tau = 3 #Variances
nprov = 60 #number of parent observations
mu = [[0,0],[0,B],[-B,0]]
true_cov0 = np.array([[1.,0.],[0.,1.]])
true_cov1 = np.array([[1.,0.],[0.,tau**(2)]])
true_cov2 = np.array([[tau**(-2),0],[0.,1.]])
trueprobs = [.4, .3, .3] #probability of being in each of the three mixtures
prov = np.random.multinomial(1, trueprobs, size=nprov)
v = prov[:,1] + (prov[:,2])*2
numtoeach = 50
n_obs = nprov*numtoeach
vAll = np.tile(v,numtoeach)
ndata = numtoeach*nprov
p1 = range(nprov)
prov1 = np.tile(p1,numtoeach)
data = (vAll==0)*(np.random.multivariate_normal(mu[0],true_cov0,ndata)).T \
+ (vAll==1)*(np.random.multivariate_normal(mu[1],true_cov1,ndata)).T \
+ (vAll==2)*(np.random.multivariate_normal(mu[2],true_cov2,ndata)).T
data=data.T
However, when I try to use PyMC to do the sampling, I run into trouble ('error: failed in converting 3rd argument `tau' of flib.prec_mvnorm to C/Fortran array'):
p = 2  # covariates
prior_mu1 = np.ones(p)
prior_mu2 = np.ones(p)
prior_mu3 = np.ones(p)
post_mu1 = mc.Normal("returns1", prior_mu1, 1, size=p)
post_mu2 = mc.Normal("returns2", prior_mu2, 1, size=p)
post_mu3 = mc.Normal("returns3", prior_mu3, 1, size=p)
post_cov_matrix_inv1 = mc.Wishart("cov_matrix_inv1", n_obs, np.eye(p))
post_cov_matrix_inv2 = mc.Wishart("cov_matrix_inv2", n_obs, np.eye(p))
post_cov_matrix_inv3 = mc.Wishart("cov_matrix_inv3", n_obs, np.eye(p))

# Combine prior means and precision matrices
meansAll = np.array([post_mu1, post_mu2, post_mu3])
precsAll = np.array([post_cov_matrix_inv1, post_cov_matrix_inv2, post_cov_matrix_inv3])

dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=nprov)

# This step accounts for the hierarchy: observations' means are equal to their parent's mean
# Parent is labeled prov1
@mc.deterministic
def mean(category=category, meansAll=meansAll):
    lat = category[prov1]
    new = meansAll[lat]
    return new

@mc.deterministic
def prec(category=category, precsAll=precsAll):
    lat = category[prov1]
    return precsAll[lat]

obs = mc.MvNormal("observed returns", mean, prec, observed=True, value=data)
I know the problem is not with the format of the simulated observed data, because this step would work fine, in place of the above:
obs = mc.MvNormal( "observed returns", post_mu3, post_cov_matrix_inv3, observed = True, value = data )
As a result, I think the issue is how the mean vector ('mean') and the covariance matrix ('prec') are entered; I just don't know how. Like I said, this worked fine with mixtures of normal distributions, but mixtures of multivariate normals adds a complexity I can't figure out.
This is a good example of the difficulty PyMC has with vectors of multivariate variables. Not that it's difficult; it's just not as elegant as it should be. You should create a list comprehension of the MVN nodes and wrap that as an observed stochastic.
@mc.observed
def obs(value=data, mean=mean, prec=prec):
    return sum(mc.mv_normal_like(v, m, T) for v, m, T in zip(data, mean, prec))
Here is the IPython notebook
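For completeness, here is a rough sketch (not from the original answer) of how these nodes might be collected and sampled with pymc2; the iteration counts are arbitrary:
model = mc.MCMC([dd, category,
                 post_mu1, post_mu2, post_mu3,
                 post_cov_matrix_inv1, post_cov_matrix_inv2, post_cov_matrix_inv3,
                 mean, prec, obs])
model.sample(iter=20000, burn=10000, thin=10)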

PyMC multiple linear regressions

I'm trying to fit several lines sharing the same intercept.
import numpy as np
import pymc

# Observations
a_actual = np.array([[2., 5., 7.]]).T
b_actual = 3.
t = np.arange(100)
obs = np.random.normal(a_actual * t + b_actual)

# PyMC Model
def model_linear():
    b = pymc.Uniform('b', value=1., lower=0, upper=200)
    a = []
    s = []
    r = []
    for i in range(len(a_actual)):
        s.append(pymc.Uniform('sigma_{}'.format(i), value=1., lower=0, upper=100))
        a.append(pymc.Uniform('a_{}'.format(i), value=1., lower=0, upper=200))
        r.append(pymc.Normal('r_{}'.format(i), mu=a[i] * t + b, tau=1/s[i]**2, value=obs[i], observed=True))
    return [pymc.Container(a), b, pymc.Container(s), pymc.Container(r)]

model = pymc.Model(model_linear())
map = pymc.MAP(model)
map.fit()
map.revert_to_max()
The computed MAP estimates are far from the actual values. Those estimates are also very sensitive to the lower and upper bounds of the sigmas and a, to the actual values of a (e.g. a = [.2, .5, .7] gives me good estimates), and to the number of lines the regression is done on.
Is this the right way of performing my linear regressions?
PS: I tried using an Exponential prior distribution for the sigmas, but the results were no better.
I think using MAP might not be your best bet. If you are able to do proper sampling, then consider replacing the last three lines of your example code with:
MCMClinear = pymc.MCMC(model)
MCMClinear.sample(10000, burn=5000, thin=5)
linear_output = MCMClinear.stats()
Printing the linear_output for this gives very accurate inferences for the parameters.
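As a rough usage sketch (the parameter names follow the model above; '95% HPD interval' is one of the keys pymc's stats() normally reports):
for name in ['b', 'a_0', 'a_1', 'a_2', 'sigma_0', 'sigma_1', 'sigma_2']:
    stats = linear_output[name]
    print(name, stats['mean'], stats['95% HPD interval'])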

Defining a custom PyMC distribution

This is perhaps a silly question.
I'm trying to fit data to a very strange PDF using MCMC evaluation in PyMC. For this example I just want to figure out how to fit to a normal distribution where I manually input the normal PDF. My code is:
import random
import numpy as np
import pymc as mc

data = []
for count in range(1000):
    data.append(random.gauss(-200, 15))

mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)

# @mc.potential
# def density(x = data, mu = mean, sigma = std_dev):
#     return (1./(sigma*np.sqrt(2*np.pi))*np.exp(-((x-mu)**2/(2*sigma**2))))

process = mc.Normal('process', mu=mean, tau=1./std_dev**2, value=data, observed=True)

model = mc.MCMC([mean, std_dev, process])
model.sample(iter=5000)

print("!")
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
The examples I've found all use something like mc.Normal, or mc.Poisson or whatnot, but I want to fit to the commented out density function.
Any help would be appreciated.
An easy way is to use the stochastic decorator:
import pymc as mc
import numpy as np

data = np.random.normal(-200, 15, size=1000)

mean = mc.Uniform('mean', lower=min(data), upper=max(data))
std_dev = mc.Uniform('std_dev', lower=0, upper=50)

@mc.stochastic(observed=True)
def custom_stochastic(value=data, mean=mean, std_dev=std_dev):
    return np.sum(-np.log(std_dev) - 0.5*np.log(2) -
                  0.5*np.log(np.pi) -
                  (value - mean)**2 / (2*(std_dev**2)))

model = mc.MCMC([mean, std_dev, custom_stochastic])
model.sample(iter=5000)

print("!")
print(model.stats()['mean']['mean'])
print(model.stats()['std_dev']['mean'])
Note that my custom_stochastic function returns the log likelihood, not the likelihood, and that it is the log likelihood for the entire sample.
There are a few other ways to create custom stochastic nodes. This doc gives more details, and this gist contains an example using pymc.Stochastic to create a node with a kernel density estimator.
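If the target PDF is something stranger than a normal, the same decorator pattern applies; here is a hedged sketch (my own illustration, not from the linked resources) that delegates the log-density to scipy.stats, with the normal standing in for whatever density you actually need:
from scipy import stats

@mc.stochastic(observed=True)
def strange_likelihood(value=data, mean=mean, std_dev=std_dev):
    # Return the total log-likelihood of the whole sample;
    # swap norm.logpdf for the log of your own density.
    return np.sum(stats.norm.logpdf(value, loc=mean, scale=std_dev))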
