PyMC Bernoulli model checking

PyMC Bernoulli model checking - python

I am currently trying to do model checking with PyMC where my model is a Bernoulli model and I have a Beta prior. I want to do both a (i) gof plot as well as (ii) calculate the posterior predictive p-value.
I have got my code running with a Binomial model, but I am quite struggling to find the right way of making a Bernoulli model working. Unfortunately, there is no example anywhere that I can work with. My code looks like the following:
import pymc as mc
import numpy as np
alpha = 2
beta = 2
n = 13
yes = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,0,0])
p = mc.Beta('p',alpha,beta)
surv = mc.Bernoulli('surv',p=p,observed=True,value=yes)
surv_sim = mc.Bernoulli('surv_sim',p=p)
mc_est = mc.MCMC({'p':p,'surv':surv,'surv_sim':surv_sim})
mc_est.sample(10000,5000,2)
import matplotlib.pylab as plt
plt.hist(mc_est.surv_sim.trace(),bins=range(0,3),normed=True)
plt.figure()
plt.hist(mc_est.p.trace(),bins=100,normed=True)
mc.Matplot.gof_plot(mc_est.surv_sim.trace(), 10/13., name='surv')
#here I have issues
D = mc.discrepancy(yes, surv_sim, p.trace())
mc.Matplot.discrepancy_plot(D)
The main problem I am having is in determining the expected values for the discrepancy function. Just using p.trace() does not work here, as these are the probabilities. Somehow, I need to incorporate the sample size here, but I am struggling to do that in a similar way as I would do it for a Binomial model. I am also not quite sure, if I am doing the gof_plot correctly.
Hope someone can help me out here! Thanks!

Per the discrepancy function doc string, the parameters are:
observed : Iterable of observed values (size=(n,))
simulated : Iterable of simulated values (size=(r,n))
expected : Iterable of expected values (size=(r,) or (r,n))
So you need to correct two things:
1) modify your simulated results to have size n (e.g., 13 in your example):
surv_sim = mc.Bernoulli('surv_sim', p=p, size=n)
2) encapsulate your p.trace() with the bernoulli_expval method:
D = mc.discrepancy(yes, surv_sim.trace(), mc.bernoulli_expval(p.trace()))
(bernoulli_expval just spits back p.)
With those two changes, I get the following:

Related

How to declare a binomial distribution variable with changing parameters in pymc3

I would like to declare a binomial variable that is intermediate and not observed, used to compute another deterministic variable that is observed. If I had the observations, I could declare the following and get the model pictured below:
import pymc3
with pymc3.Model() as model:
rate = pymc3.Beta('rate', alpha=2, beta=2)
input_var = pymc3.Data('input_var', [1, 2, 3])
intermediate_var = pymc3.Binomial('intermediate_var', n=input_var, p=rate, observed=[1, 1, 1])
output_var = pymc3.Deterministic('output_var', intermediate_var)
pymc3.model_to_graphviz(model)
But, I don't have observations of the intermediate variable, only of the output variable, and this does not work, giving wrong number of dimensions error:
import pymc3
with pymc3.Model() as model:
rate = pymc3.Beta('rate', alpha=2, beta=2)
input_var = pymc3.Data('input_var', [1, 2, 3])
intermediate_var = pymc3.Binomial('intermediate_var', n=input_var, p=rate)
output_var = pymc3.Deterministic('output_var', intermediate_var)
Obviously, this is an innocent example, and someone could say that how is that I have observations of the output variable and not the intermediate one in this case, but in my model, the output variable depends on several intermediate variables that depend in other input variables as well. This is a minimal example trying to explain what I want to do.
The same happens if I change the Deterministic by a Normal with observed data.

If the plate diagram is to be taken literally, the dimensionality of the intermediate_var needs to be declared, since it is really three random variables each with a distinct n parameterized by the entries in input_var. This is done with the shape argument:
import pymc3
with pymc3.Model() as model:
rate = pymc3.Beta('rate', alpha=2, beta=2)
input_var = pymc3.Data('input_var', [1, 2, 3])
intermediate_var = pymc3.Binomial('intermediate_var', n=input_var, p=rate, shape=3)
output_var = pymc3.Deterministic('output_var', intermediate_var)

Multiple Linear Regression using Python

Firstly, there are a few topics on this but they involve deprecated packages with pandas etc. Suppose I'm trying to predict a variable w with variables x,y and z. I want to run a multiple linear regression to try and predict w. There are quite a few solutions that will produce the coefficients but I'm not sure how to use these. So, in pseudocode;
import numpy as np
from scipy import stats
w = np.array((1,2,3,4,5,6,7,8,9,10)) # Time series I'm trying to predict
x = np.array((1,3,6,1,4,6,8,9,2,2)) # The three variables to predict w
y = np.array((2,7,6,1,5,6,3,9,5,7))
z = np.array((1,3,4,7,4,8,5,1,8,2))
def model(w,x,y,z):
# do something!
return guess # where guess is some 10 element array formed
# using multiple linear regression of x,y,z
guess = model(w,x,y,z)
r = stats.pearsonr(w,guess) # To see how good guess is
Hopefully this makes sense as I'm new to MLR. There is probably a package in scipy that does all this so any help welcome!

You can use the normal equation method.
Let your equation be of the form : ax+by+cz +d =w
Then
import numpy as np
x = np.asarray([[1,3,6,1,4,6,8,9,2,2],
[2,7,6,1,5,6,3,9,5,7],
[1,3,4,7,4,8,5,1,8,2],
[1,1,1,1,1,1,1,1,1,1]]).T
y = numpy.asarray([1,2,3,4,5,6,7,8,9,10]).T
a,b,c,d = np.linalg.pinv((x.T).dot(x)).dot(x.T.dot(y))

Think I've found out now. If anyone could confirm that this produces the correct results that'd be great!
import numpy as np
from scipy import stats
# What I'm trying to predict
y = [-6,-5,-10,-5,-8,-3,-6,-8,-8]
# Array that stores two predictors in columns
x = np.array([[-4.95,-4.55],[-10.96,-1.08],[-6.52,-0.81],[-7.01,-4.46],[-11.54,-5.87],[-4.52,-11.64],[-3.36,-7.45],[-2.36,-7.33],[-7.65,-10.03]])
# Fit linear least squares and get regression coefficients
beta_hat = np.linalg.lstsq(x,y)[0]
print(beta_hat)
# To store my best guess
estimate = np.zeros((9))
for i in range(0,9):
# y = x1b1 + x2b2
estimate[i] = beta_hat[0]*x[i,0]+beta_hat[1]*x[i,1]
# Correlation between best guess and real values
print(stats.pearsonr(estimate,y))

scipy.stats.weibull_min.fit() - how to deal with right-censored data?

Non-Censored (Complete) Dataset
I am attempting to use the scipy.stats.weibull_min.fit() function to fit some life data. Example generated data is contained below within values.
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 20683.2,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
I attempt to fit using the function:
fit = scipy.stats.weibull_min.fit(values, loc=0)
The result:
(1.3392877335100251, -277.75467055900197, 9443.6312323849124)
Which isn't far from the nominal beta and eta values of 1.4 and 10000.
Right-Censored Data
The weibull distribution is well known for its ability to deal with right-censored data. This makes it incredibly useful for reliability analysis. How do I deal with right-censored data within scipy.stats? That is, curve fit for data that has not experienced failures yet?
The input form might look like:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, np.inf,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
or perhaps using np.nan or simply 0.
Both of the np solutions are throwing RunTimeWarnings and are definitely not coming close to the correct values. I using numeric values - such as 0 and -1 - removes the RunTimeWarning, but the returned parameters are obviously flawed.
Other Softwares
In some reliability or lifetime analysis softwares (minitab, lifelines), it is necessary to have two columns of data, one for the actual numbers and one to indicate if the item has failed or not yet. For instance:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 0,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
censored = np.array(
[True, True, True, True, False,
True, True, True, True, True]
)
I see no such paths within the documentation.

Old question but if anyone comes across this, there is a new survival analysis package for python, surpyval, that handles this, and other cases of censoring and truncation. For the example you provide above it would simply be:
import surpyval as surv
values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
# 0 = failed, 1 = right censored
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])
model = surv.Weibull.fit(values, c=censored)
print(model.params)
(10584.005910580288, 1.038163987652635)
You might also be interested in the Weibull plot:
model.plot(plot_bounds=False)
Weibull plot
Full disclosure, I am the creator of surpyval

GaussianMixture initialization using component parameters - sklearn

I want to use sklearn.mixture.GaussianMixture to store a gaussian mixture model so that I can later use it to generate samples or a value at a sample point using score_samples method. Here is an example where the components have the following weight, mean and covariances
import numpy as np
weights = np.array([0.6322941277066596, 0.3677058722933399])
mu = np.array([[0.9148052872961359, 1.9792961751316835],
[-1.0917396392992502, -0.9304220945910037]])
sigma = np.array([[[2.267889129267119, 0.6553245618368836],
[0.6553245618368835, 0.6571014653342457]],
[[0.9516607767206848, -0.7445831474157608],
[-0.7445831474157608, 1.006599716443763]]])
Then I initialised the mixture as follow
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
Finally I tried to generate a sample based on the parameters which resulted in an error:
x = gmix.sample(1000)
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
As I understand GaussianMixture is intended to fit a sample using a mixture of Gaussian but is there a way to provide it with the final values and continue from there?

You rock, J.P.Petersen!
After seeing your answer I compared the change introduced by using fit method. It seems the initial instantiation does not create all the attributes of gmix. Specifically it is missing the following attributes,
covariances_
means_
weights_
converged_
lower_bound_
n_iter_
precisions_
precisions_cholesky_
The first three are introduced when the given inputs are assigned. Among the rest, for my application the only attribute that I need is precisions_cholesky_ which is cholesky decomposition of the inverse covarinace matrices. As a minimum requirement I added it as follow,
gmix.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(sigma)).transpose((0, 2, 1))

It seems that it has a check that makes sure that the model has been trained. You could trick it by training the GMM on a very small data set before setting the parameters. Like this:
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(rand(10, 2)) # Now it thinks it is trained
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
x = gmix.sample(1000) # Should work now

To understand what is happening, what GaussianMixture first checks that it has been fitted:
self._check_is_fitted()
Which triggers the following check:
def _check_is_fitted(self):
check_is_fitted(self, ['weights_', 'means_', 'precisions_cholesky_'])
And finally the last function call:
def check_is_fitted(estimator, attributes, msg=None, all_or_any=all):
which only checks that the classifier already has the attributes.
So in short, the only thing you have missing to have it working (without having to fit it) is to set precisions_cholesky_ attribute:
gmix.precisions_cholesky_ = 0
should do the trick (can't try it so not 100% sure :P)
However, if you want to play safe and have a consistent solution in case scikit-learn updates its contrains, the solution of #J.P.Petersen is probably the best way to go.

As a slight alternative to #hashmuke's answer, you can use the precision computation that is used inside GaussianMixture directly:
import numpy as np
from scipy.stats import invwishart as IW
from sklearn.mixture import GaussianMixture as GMM
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky
n_dims = 5
mu1 = np.random.randn(n_dims)
mu2 = np.random.randn(n_dims)
Sigma1 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
Sigma2 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
gmm = GMM(n_components=2)
gmm.weights_ = np.array([0.2, 0.8])
gmm.means_ = np.stack([mu1, mu2])
gmm.covariances_ = np.stack([Sigma1, Sigma2])
gmm.precisions_cholesky_ = _compute_precision_cholesky(gmm.covariances_, 'full')
X, y = gmm.sample(1000)
And depending on your covariance type you should change full accordingly as input to _compute_precision_cholesky (will be one of full, diag, tied, spherical).

PyMC2 and PyMC3 give different results...?

I'm trying to get a simple PyMC2 model working in PyMC3. I've gotten the model to run but the models give very different MAP estimates for the variables. Here is my PyMC2 model:
import pymc
theta = pymc.Normal('theta', 0, .88)
X1 = pymc.Bernoulli('X2', p=pymc.Lambda('a', lambda theta=theta:1./(1+np.exp(-(theta-(-0.75))))), value=[1],observed=True)
X2 = pymc.Bernoulli('X3', p=pymc.Lambda('b', lambda theta=theta:1./(1+np.exp(-(theta-0)))), value=[1],observed=True)
model = pymc.Model([theta, X1, X2])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=25000, burn=5000)
trace = (mcmc.trace('theta')[:])
print "\nThe MAP value for theta is", trace.sum()/len(trace)
That seems to work as expected. I had all sorts of trouble figuring out how to use the equivalent of the pymc.Lambda object in PyMC3. I eventually came across the Deterministic object. The following is my code:
import pymc3
with pymc3.Model() as model:
theta = pymc3.Normal('theta', 0, 0.88)
X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+np.exp(-(theta-(-0.75))))), observed=[1])
X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+np.exp(-(theta-(0))))), observed=[1])
start=pymc3.find_MAP()
step=pymc3.NUTS(state=start)
trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
pymc3.traceplot(trace)
The problem I'm having is that my MAP estimate for theta using PyMC2 is ~0.68 (correct), while the estimate PyMC3 gives is ~0.26 (incorrect). I suspect this has something to do with the way I'm defining the deterministic function. PyMC3 won't let me use a lambda function, so I just have to write the expression in-line. When I try to use lambda theta=theta:... I get this error:
AsTensorError: ('Cannot convert <function <lambda> at 0x157323e60> to TensorType', <type 'function'>)
Something to do with Theano?? Any suggestions would be greatly appreciated!

It works when you use a theano tensor instead of a numpy function in your Deterministic.
import pymc3
import theano.tensor as tt
with pymc3.Model() as model:
theta = pymc3.Normal('theta', 0, 0.88)
X1 = pymc3.Bernoulli('X1', p=pymc3.Deterministic('b', 1./(1+tt.exp(-(theta-(-0.75))))), observed=[1])
X2 = pymc3.Bernoulli('X2', p=pymc3.Deterministic('c', 1./(1+tt.exp(-(theta-(0))))), observed=[1])
start=pymc3.find_MAP()
step=pymc3.NUTS(state=start)
trace = pymc3.sample(20000, step, njobs=1, progressbar=True)
print "\nThe MAP value for theta is", np.median(trace['theta'])
pymc3.traceplot(trace);
Here's the output:

Just in case someone else has the same problem, I think I found an answer. After trying different sampling algorithms I found that:
find_MAP gave the incorrect answer
the NUTS sampler gave the incorrect answer
the Metropolis sampler gave the correct answer, yay!
I read somewhere else that the NUTS sampler doesn't work with Deterministic. I don't know why. Maybe that's the case with find_MAP too? But for now I'll stick with Metropolis.

Also, NUTS doesn't handle discrete variables. If you want to use NUTS, you have to split up the samplers:
step1 = pymc3.NUTS([theta])
step2 = pymc3.BinaryMetropolis([X1,X2])
trace = pymc3.sample(10000, [step1, step2], start)
EDIT:
Missed that 'b' and 'c' were defined inline. Removed them from the NUTS function call

The MAP value is not defined as the mean of a distribution, but as its maximum. With pymc2 you can find it with:
M = pymc.MAP(model)
M.fit()
theta.value
which returns array(0.6253614422469552)
This agrees with the MAP that you find with find_MAP in pymc3, which you call start:
{'theta': array(0.6253614811102668)}
The issue of which is a better sampler is a different one, and does not depend on the calculation of the MAP. The MAP calculation is an optimization.
See: https://pymc-devs.github.io/pymc/modelfitting.html#maximum-a-posteriori-estimates for pymc2.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PyMC Bernoulli model checking - python

Related

How to declare a binomial distribution variable with changing parameters in pymc3

Multiple Linear Regression using Python

scipy.stats.weibull_min.fit() - how to deal with right-censored data?

GaussianMixture initialization using component parameters - sklearn

PyMC2 and PyMC3 give different results...?

Categories

Resources