I'am obviously doing something wrong here... Please have a look at the following program. It runs well but gives me a lambda parameter for an exponential distribution which is far away from the parameter I used for generating random observations:
import numpy as np
import arviz as az
import pymc as pm
lambda_param = 0.25
random_size = 1000
x = np.random.exponential(lambda_param, random_size)
basic_model = pm.Model()
with basic_model:
_lam_ = pm.HalfNormal("lambda", sigma = 1)
Y_obs = pm.Exponential("Y_obs", lam = _lam_, observed = x)
start = pm.find_MAP(model = basic_model)
idata = pm.sample(1000, start = start)
summary = az.summary(idata, round_to = 6)
summary
Following my last running of the program, I find in summary a mean lambda greater than 4..., where lambda=0.25 as I used it.
Pointing the finger at my programing errors would be highly appreciated.
I found the problem, the uncertainty on _lam_ was too large and given that the exponential probability distribution is not symmetric, the high uncertainty modified the result. The fix is simply to use a smaller standard deviation, I also used Normal rather than HalfNormal for simplicity:
import numpy as np
import pymc3 as pm
import arviz as az
lambda_param = 0.25
random_size = 1000
x = np.random.exponential(lambda_param, random_size)
with pm.Model() as basic_model:
lam = pm.Normal("lam", mu=lambda_param, sigma=0.0001)
Y_obs = pm.Exponential("Y_obs", lam=lam, observed=x)
trace = pm.sample(1000, tune=1000)
summary = az.summary(trace, round_to=6)
summary
This gives a mean of 0.25 for lambda, within a small margin of error.
Related
I'm having trouble obtaining the dispersion parameter of simulated data using statsmodels' GLM function.
import statsmodels.api as sm
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
np.random.seed(1)
# Generate data
x=np.random.uniform(0, 100,50000)
x2 = sm.add_constant(x)
a = 0.5
b = 0.2
y_true = 1/(a+(b*x))
# Add error
scale = 2 # the scale parameter I'm trying to obtain
shape = y_true/scale # given that, for Gamma, mu = scale*shape
y = np.random.gamma(shape=shape, scale=scale)
# Run model
model = sm.GLM(y, x2, family=sm.families.Gamma()).fit()
model.summary()
Here's the summary from above:
Note that the coefficient estimates are correct (0.5 and 0.2), but the scale (21.995) is way off the scale I set (2).
Can someone point out what it is I'm misunderstanding/doing wrong? Thanks!
As Josef noted in the comments, statsmodels uses a different kind of parameterization.
I am trying to define a multivariate custom distribution through pymc3.DensityDist(); however, I keep getting the following error that dimensions do not match:
"LinAlgError: 0-dimensional array given. Array must be two-dimensional"
I have already seen https://github.com/pymc-devs/pymc3/issues/535 but I could not find the answer to my question. Just for clarity, here is my simple example
import numpy as np
import pymc3 as pm
def pdf(x):
y = 0
print(x)
sigma = np.identity(2)
isigma = sigma
mu = np.array([[1,2],[3,4]])
for i in range(2):
x0 = x- mu[i,:]
xsinv = np.linalg.multi_dot([x0,isigma,x0])
y = y + np.exp(-0.5*xsinv)
return y
logp = lambda x: np.log(pdf(x))
with pm.Model() as model:
pm.DensityDist('x',logp, shape=2)
step = pm.Metropolis(tune=False, S=np.identity(2))
trace = pm.sample(100000, step=step, chain=1, tune=0,progressbar=False)
result = trace['x']
In this simple code I want to define an unnormilized pdf function, which is sum of two unnormalized normal distributions, and take samples from this pdf through Metropolis algorithm.
Thanks,
Try replacing numpy for theano in the following lines:
xsinv = tt.dot(tt.dot(x0, isigma), x0)
y = y + tt.exp(-0.5 * xsinv)
as a side note, try using NUTS instead of metropolis and let PyMC3 choose the sampling method for you, just do
trace = pm.sample(1000)
For future reference you can also ask questions here
Does anyone know how I can see the final acceptance-rate in PyMC3 (Metropolis-Hastings) ? Or in general, how can I see all the information that pymc3.sample() returns ?
Thanks
Given an example, first, set up the model:
import pymc3 as pm3
sigma = 3 # Note this is the std of our data
data = norm(10,sigma).rvs(100)
mu_prior = 8
sigma_prior = 1.5 # Note this is our prior on the std of mu
plt.hist(data,bins=20)
plt.show()
basic_model = pm3.Model()
with basic_model:
# Priors for unknown model parameters
mu = pm3.Normal('Mean of Data',mu_prior,sigma_prior)
# Likelihood (sampling distribution) of observations
data_in = pm3.Normal('Y_obs', mu=mu, sd=sigma, observed=data)
Second, perform the simulation:
chain_length = 10000
with basic_model:
# obtain starting values via MAP
startvals = pm3.find_MAP(model=basic_model)
# instantiate sampler
step = pm3.Metropolis()
# draw 5000 posterior samples
trace = pm3.sample(chain_length, step=step, start=startvals)
Using the above example, the acceptance rate can be calculated this way:
accept = np.sum(trace['Mean of Data'][1:] != trace['Mean of Data'][:-1])
print("Acceptance Rate: ", accept/trace['Mean of Data'].shape[0])
(I found this solution in an online tutorial, but I don't quite understand it.)
Reference: Introduction to PyMC3
I checked for the NUTS algorithm, and found the solution from here pymc3 forum.
trace.mean_tree_accept.mean()
Let step = pymc3.Metropolis() be our sampler, we can get the final acceptance-rate through
"step.accepted"
Just for beginners (pymc3) like myself, after each variable/obj. put a "." and hit the tab key; you will see some interesting suggestions ;)
So, let's say I have the following 2-dimensional target distribution that I would like to sample from (a mixture of bivariate normal distributions) -
import numba
import numpy as np
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
def targ_dist(x):
target = (stats.multivariate_normal.pdf(x,[0,0],[[1,0],[0,1]])+stats.multivariate_normal.pdf(x,[-6,-6],[[1,0.9],[0.9,1]])+stats.multivariate_normal.pdf(x,[4,4],[[1,-0.9],[-0.9,1]]))/3
return target
and the following proposal distribution (a bivariate random walk) -
def T(x,y,sigma):
return stats.multivariate_normal.pdf(y,x,[[sigma**2,0],[0,sigma**2]])
The following is the Metropolis Hastings code for updating the "entire" state in every iteration -
#Initialising
n_iter = 30000
# tuning parameter i.e. variance of proposal distribution
sigma = 2
# initial state
X = stats.uniform.rvs(loc=-5, scale=10, size=2, random_state=None)
# count number of acceptances
accept = 0
# store the samples
MHsamples = np.zeros((n_iter,2))
# MH sampler
for t in range(n_iter):
# proposals
Y = X+stats.norm.rvs(0,sigma,2)
# accept or reject
u = stats.uniform.rvs(loc=0, scale=1, size=1)
# acceptance probability
r = (targ_dist(Y)*T(Y,X,sigma))/(targ_dist(X)*T(X,Y,sigma))
if u < r:
X = Y
accept += 1
MHsamples[t] = X
However, I would like to update "per component" (i.e. component-wise updating) in every iteration. Is there a simple way of doing this?
Thank you for your help!
From the tone of your question I assume you are looking performance improvements.
MonteCarlo algorithms are quite compute intensive. You will get better results, if you perform in algorithms on a lower level than in an interpreted language like python, e.g. writing a c-extension.
There are also implementations available for python (PyStan, PyMC3).
I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
return np.exp(x) / (1 + np.exp(x))
with Model() as model:
# Logit-linear model parameters
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)
# Calculate probabilities of death
theta = Deterministic('theta', invlogit(alpha + beta * dose))
# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw='1.25', ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
instead if you want to plot dose vs pred_death, where pred_death is computed taking into account the uncertainty in posterior for alpha and beta. Then probably the easiest way is to use the function sample_ppc:
May be something like
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.