I am trying to create a Labeled LDA model as described in this paper (section 3.2).
What I have so far is:
# settings
entityTypesSize = 100
minibatchSize = 10
entityStringsSize = 100
model = pm.Model()
with pm.Model() as model:
    alpha = pm.Gamma(alpha=0.1, beta=1, name='alpha')
    eta = pm.Gamma(alpha=0.1, beta=1, name='eta')
    beta = pm.Dirichlet('beta', a=eta * np.ones((entityTypesSize, entityStringsSize)),
                        shape=(entityTypesSize, entityStringsSize), transform=t_stick_breaking(1e-9))
    theta = pm.Dirichlet('theta', a=alpha * np.ones((minibatchSize, entityTypesSize)),
                         shape=(minibatchSize, entityTypesSize), transform=t_stick_breaking(1e-9))
    z = pm.Multinomial('z', n=, p=)
    w = pm.Multinomial('w', n=, p=)
The challenge I am having is with the z and w random variables. As stated in the paper, the number of draws (n-param) should not be fixed, but depends on the number of words in an entity string. Furthermore, I need to place different probabilities (p-param), since they are sampled from the beta and theta distributions. Is it possible to have them somehow chained? If yes, can someone assist with that, please?
The same model has an alternative implementation in HBC, which can be found here.
Thank you!!!
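To make the "chaining" concrete, here is a minimal NumPy sketch of the plain LDA generative process, without the label constraints of Labeled LDA and without PyMC3. The sizes, the document lengths, and the fixed values of alpha and eta are assumptions chosen for illustration; the point is only that the number of draws follows each document's length, z is drawn from that document's row of theta, and each w is drawn from the row of beta selected by z.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5             # stands in for entityTypesSize (kept small here)
vocab_size = 20          # stands in for entityStringsSize
doc_lengths = [3, 7, 4]  # hypothetical word counts; they differ per document

alpha, eta = 0.5, 0.5    # fixed for illustration; the model above uses Gamma priors

# One word distribution per topic, one topic distribution per document.
beta = rng.dirichlet(eta * np.ones(vocab_size), size=n_topics)
theta = rng.dirichlet(alpha * np.ones(n_topics), size=len(doc_lengths))

docs = []
for d, length in enumerate(doc_lengths):
    # z: one topic index per word, drawn from this document's theta.
    z = rng.choice(n_topics, size=length, p=theta[d])
    # w: one word index per position, drawn from the row of beta that z selects.
    w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
    docs.append((z, w))
```

This only simulates data forward; it does not show how to express the same dependency inside a PyMC3 model.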
Let's say I have a dataframe with 4 variables. I want to see if I can generate a posterior of gamma mixtures over all the variables, with the goal of finding clusters for each observation. I'm guessing I will need some sort of multivariate gamma distribution? But how would I go about this?
Here is some pymc3 code as an example with one parameter, looking for a mixture of two gammas (I have chosen arbitrary parameters):
with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(2))
    # Each random variable needs its own unique name
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=2)
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=2)
    # Component distributions of the mixture, one per cluster
    comp_dist = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(2,))
    like = pm.Mixture('y', w=p, comp_dists=comp_dist, observed=data)
    trace = pm.sample(1000)
So my question is, how would I extend this basic example to multiple variables? I assume that I need to define relationships between the variables somehow to encode them in the model? I feel that I understand the basics of mixture modelling, but at the same time feel that I am missing something pretty fundamental.
Here's how the multidimensional case should work:
J = 4  # num dimensions
K = 2  # num clusters
with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(K))
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=(J, K))
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=(J, K))
    gamma = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(J, K))
    like = pm.Mixture('y', w=p, comp_dists=gamma, observed=X, shape=J)
    trace = pm.sample(1000)
where X.shape should be (N,J).
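For trying the model out, it may help to have synthetic data with that (N, J) layout. The following is a sketch only: the "true" weights and Gamma parameters are made up, and it assumes each observation belongs to a single cluster across all J dimensions, which is what the question describes.

```python
import numpy as np

rng = np.random.default_rng(1)

N, J, K = 500, 4, 2                          # observations, dimensions, clusters
p_true = np.array([0.3, 0.7])                # hypothetical mixture weights
alpha_true = rng.uniform(1.0, 5.0, (J, K))   # hypothetical Gamma shape parameters
beta_true = rng.uniform(0.5, 2.0, (J, K))    # hypothetical Gamma rate parameters

# Pick one cluster per observation, then draw all J dimensions from that
# cluster's parameters (NumPy's gamma takes scale = 1 / rate).
labels = rng.choice(K, size=N, p=p_true)
X = rng.gamma(shape=alpha_true[:, labels].T, scale=1.0 / beta_true[:, labels].T)
```

Keeping labels around makes it possible to compare any recovered clustering against the assignment used to generate the data.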
Note on Symmetry Breaking
The difficult part is going to be resolving identifiability issues, but I think that's beyond the scope of the question. Maybe have a look at how the GMM tutorial breaks symmetry using the pm.Potential function. I expect highly-correlated parameterizations of the likelihood function(s), like alpha and beta, would exacerbate the issue, so perhaps consider switching to the mu and sigma parameterization.
Does anyone know how I can see the final acceptance rate in PyMC3 (Metropolis-Hastings)? Or, in general, how can I see all the information that pymc3.sample() returns?
Thanks
As an example, first set up the model:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pymc3 as pm3
sigma = 3  # Note this is the std of our data
data = norm(10, sigma).rvs(100)
mu_prior = 8
sigma_prior = 1.5  # Note this is our prior on the std of mu
plt.hist(data, bins=20)
plt.show()
basic_model = pm3.Model()
with basic_model:
    # Priors for unknown model parameters
    mu = pm3.Normal('Mean of Data', mu_prior, sigma_prior)
    # Likelihood (sampling distribution) of observations
    data_in = pm3.Normal('Y_obs', mu=mu, sd=sigma, observed=data)
Second, perform the simulation:
chain_length = 10000
with basic_model:
    # obtain starting values via MAP
    startvals = pm3.find_MAP(model=basic_model)
    # instantiate sampler
    step = pm3.Metropolis()
    # draw chain_length (10000) posterior samples
    trace = pm3.sample(chain_length, step=step, start=startvals)
Using the above example, the acceptance rate can be calculated this way:
accept = np.sum(trace['Mean of Data'][1:] != trace['Mean of Data'][:-1])
print("Acceptance Rate: ", accept/trace['Mean of Data'].shape[0])
(I found this solution in an online tutorial, but I don't quite understand it.)
Reference: Introduction to PyMC3
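Why comparing consecutive samples works: when a Metropolis proposal is rejected, the chain repeats its current value, so every position where two neighbouring samples differ corresponds to an accepted proposal. The hand-written sampler below is a sketch in plain NumPy (not PyMC3's implementation; the target and proposal scale are arbitrary choices) that tracks acceptances directly and checks them against the count recovered from the chain.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # Standard normal log-density up to a constant (illustrative target).
    return -0.5 * x ** 2

n_steps = 5000
chain = np.empty(n_steps + 1)
chain[0] = 0.0
n_accepted = 0

for i in range(n_steps):
    proposal = chain[i] + rng.normal(scale=1.0)
    # Accept with probability min(1, target(proposal) / target(current)).
    if np.log(rng.uniform()) < log_target(proposal) - log_target(chain[i]):
        chain[i + 1] = proposal
        n_accepted += 1
    else:
        # Rejected: the chain repeats the current value.
        chain[i + 1] = chain[i]

# Positions where consecutive samples differ are the accepted proposals.
changed = np.sum(chain[1:] != chain[:-1])
print("tracked:", n_accepted / n_steps, "from chain:", changed / (len(chain) - 1))
```

Note that a chain of length n has n - 1 transitions, so dividing by the number of samples rather than the number of transitions gives a very slightly smaller figure.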
For the NUTS algorithm, I found this solution on the PyMC3 forum:
trace.mean_tree_accept.mean()
If step = pymc3.Metropolis() is our sampler, we can get the final acceptance rate through
step.accepted
A tip for PyMC3 beginners like myself: type a "." after any variable or object and hit the Tab key; you will see some interesting suggestions ;)
I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
import numpy as np
from scipy.stats import binom
mean_alpha = np.mean(trace['alpha'])
mean_beta = np.mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi, a simple way to do it is as follows:
import numpy as np
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))
with Model() as model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, 0.01)
    beta = Normal('beta', 0, 0.01)
    # Calculate probabilities of death
    theta = Deterministic('theta', invlogit(alpha + beta * dose))
    # Data likelihood
    deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
# Posterior median of theta at each dose level
death_fit = np.percentile(trace.theta, 50, axis=0)
plt.plot(dose, death_fit, 'g', marker='.', lw=1.25, ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
If instead you want to plot dose vs. pred_death with pred_death computed taking into account the posterior uncertainty in alpha and beta, then probably the easiest way is to use the function sample_ppc.
Maybe something like:
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
    plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Posterior predictive checks (PPC) are a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.
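A sketch of that last option, computing the mean curve and a 95% interval directly from posterior samples of alpha and beta. The sample arrays below are made-up stand-ins for trace['alpha'] and trace['beta'], so the numbers themselves mean nothing; only the computation is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

dose = np.array([-0.86, -0.3, -0.05, 0.73])

# Stand-ins for trace['alpha'] and trace['beta']; the values are made up.
alpha_samples = rng.normal(0.8, 1.0, size=2000)
beta_samples = rng.normal(7.0, 4.0, size=2000)

# One probability curve per posterior sample: shape (n_samples, n_doses).
logits = alpha_samples[:, None] + beta_samples[:, None] * dose[None, :]
theta_samples = 1.0 / (1.0 + np.exp(-logits))

# Pointwise posterior mean and central 95% interval at each dose.
theta_mean = theta_samples.mean(axis=0)
theta_low, theta_high = np.percentile(theta_samples, [2.5, 97.5], axis=0)
```

With matplotlib, plt.plot(dose, theta_mean) together with plt.fill_between(dose, theta_low, theta_high, alpha=0.3) would draw the mean curve with a shaded band.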
I'm trying to fit several lines sharing the same intercept.
import numpy as np
import pymc
# Observations
a_actual = np.array([[2., 5., 7.]]).T
b_actual = 3.
t = np.arange(100)
obs = np.random.normal(a_actual * t + b_actual)
# PyMC Model
def model_linear():
    b = pymc.Uniform('b', value=1., lower=0, upper=200)
    a = []
    s = []
    r = []
    for i in range(len(a_actual)):
        s.append(pymc.Uniform('sigma_{}'.format(i), value=1., lower=0, upper=100))
        a.append(pymc.Uniform('a_{}'.format(i), value=1., lower=0, upper=200))
        r.append(pymc.Normal('r_{}'.format(i), mu=a[i] * t + b, tau=1/s[i]**2, value=obs[i], observed=True))
    return [pymc.Container(a), b, pymc.Container(s), pymc.Container(r)]
model = pymc.Model(model_linear())
map = pymc.MAP(model)
map.fit()
map.revert_to_max()
The computed MAP estimates are far from the actual values. Those values are also very sensitive to the lower and upper bounds of sigmas and a, to the actual values of a (e.g. a = [.2, .5, .7] will give me good estimates) or to the number of lines to do the regression on.
Is this the right way of performing my linear regressions?
P.S. I tried to use an Exponential prior distribution for the sigmas, but the results were no better.
I think using MAP might not be your best bet. If you are able to do a proper sampling then consider replacing the last 3 lines of your example code with
MCMClinear = pymc.MCMC(model)
MCMClinear.sample(10000, burn=5000, thin=5)
linear_output = MCMClinear.stats()
Printing the linear_output for this gives very accurate inferences for the parameters.
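As a sanity check on whatever the sampler or MAP returns, ordinary least squares can estimate the same slopes and shared intercept. This is a different technique from the PyMC model (no priors, no per-line sigma), shown here only as a baseline sketch; the design-matrix layout is my own choice, not something from the question.

```python
import numpy as np

rng = np.random.default_rng(4)

a_actual = np.array([[2.0, 5.0, 7.0]]).T
b_actual = 3.0
t = np.arange(100)
obs = rng.normal(a_actual * t + b_actual)   # shape (3, 100), unit noise

n_lines, n_points = obs.shape

# Design matrix: one slope column per line plus a shared intercept column.
design = np.zeros((n_lines * n_points, n_lines + 1))
for i in range(n_lines):
    rows = slice(i * n_points, (i + 1) * n_points)
    design[rows, i] = t
    design[rows, -1] = 1.0

# Solve for all slopes and the common intercept in one least-squares fit.
coef, *_ = np.linalg.lstsq(design, obs.ravel(), rcond=None)
slopes, intercept = coef[:-1], coef[-1]
```

If the Bayesian estimates land far from slopes and intercept on data like this, that points at the fitting procedure rather than at the model structure.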
I'm porting some calculations from pymc2 to pymc3, and I'm having problems with the sampler's behavior when my model contains discrete random variables. As an example, consider the following model using pymc2:
import pymc as pm
N = 100
data = 10
p = pm.Beta('p', alpha=1.0, beta=1.0)
q = pm.Beta('q', alpha=1.0, beta=1.0)
A = pm.Binomial('A', N, p)
X = pm.Binomial('x', A, q, observed=True, value=data)
It's not really representative of anything, it's just a model where one of the unobserved variables is discrete. When I sample this model with pymc2 I get the following results:
mcmc = pm.MCMC(model)
mcmc.sample(iter=100000, burn=50000, thin=100)
plot(mcmc)
But when I try the same with PyMC3, I get this:
with pm.Model() as model:
    N = 100
    p = pm.Beta('p', alpha=1.0, beta=1.0)
    q = pm.Beta('q', alpha=1.0, beta=1.0)
    A = pm.Binomial('A', N, p)
    X = pm.Binomial('x', A, q, observed=10)
with model:
    start = pm.find_MAP()
with model:
    step = pm.NUTS()
    trace = pm.sample(3000, step, start)
pm.traceplot(trace)
It looks like the variable A is not being sampled at all. I haven't read much about the sampling method used in pymc3, but I noticed it seems to be particularly aimed at continuous models. Does this mean it rules out discrete unobserved variables in the model, or is there some way to do what I'm trying to do?
The NUTS sampler does not work with discrete variables (though folks are working on generalizing it to do so). What you'd want to do is assign different step methods to different types of variables. For example:
with model:
    step1 = pm.NUTS(vars=[p, q])
    step2 = pm.Metropolis(vars=[A])
    trace = pm.sample(3000, [step1, step2], start)