Posterior probability with pymc - python

(This question was originally posted on stats.O. I moved it here because it relates to pymc and to more general matters within it: in fact the main aim is to get a better understanding of how pymc works. If any of the moderators believe it is not suitable for SO, I'll remove it from here.)
I've been reading the pymc tutorial and many other questions, both here and on SO.
I am trying to understand how to apply Bayes' theorem to compute the posterior probability using certain data. In particular, I have a tuple of independent parameters theta = (theta1, ..., thetaN).
From the data I'd like to infer the likelihood P(E | theta), where E is a certain event. Then the aim is to compute the posterior P(theta | E) = P(E | theta) P(theta) / P(E).
Some additional comments:
This is a sort of unsupervised learning: I know that E happened, and I want to find out the parameters that maximise the probability P(theta | E). (*)
I'd also like to have a parallel procedure where I let pymc compute the likelihood given the data, and then for each tuple of parameters I get the posterior probability.
In the following I'll assume theta = (theta1, theta2) and that the likelihood is a multi-dimensional normal distribution with a diagonal covariance matrix (because of the independence).
The following is the code I'm using (for simplicity, assume there are only two parameters). The code is still under development (I know it can't work!), but I believe it is useful to include it and then refine it following comments and answers, so that it provides a skeleton for future reference.
import numpy as np
import pymc as pm

class ObsData(object):
    def __init__(self, params):
        self.theta1 = params[0]
        self.theta2 = params[1]

class Model(object):
    def __init__(self, data):
        # Priors
        self.theta1 = pm.Uniform('theta1', 0, 100)
        self.theta2 = pm.Normal('theta2', 0, 0.0001)

        @pm.deterministic
        def model(theta1=self.theta1, theta2=self.theta2):
            return (theta1, theta2)

        # Is this the actual likelihood?
        self.likelihood = pm.MvNormal(
            'likelihood',
            mu=model,
            tau=np.identity(2),
            value=data,  # is it correct to put the data here?
            observed=True,
        )
def mcmc(observed_data):
    data = ObsData(observed_data)
    pymc_obj = Model(data)
    model = pm.MCMC(pymc_obj)
    model.sample(10000, verbose=0)  # Does this line compute the likelihood and the normalisation factor?
    # How do I get the posterior distribution?
The questions are the following:
Does self.likelihood represent the Bayesian likelihood?
How do I make use of the data? (I suspect value=data is incorrect..)
Does .sample() actually compute the posterior probability?
How do I get information from the posterior?
(*) Should I include anything that relates to the fact that E happened at some point?
As a general question: is there any way to use pymc to compute only the likelihood, given the data and the priors?
Any comments, and also reference to other question or tutorial are welcome!

For starters, I think you want to return (theta1*theta2) from your definition of model.
model.sample draws samples from (rather than analytically computing) the posterior distributions of your parameters given your data (assuming sufficient burn-in, etc.), and the probability of specific values for each tuple of parameters can then be estimated from the joint posterior samples.
I think you've got some fundamental misunderstanding of MCMC at the moment. I can't think of a better way to answer your questions than to point you to the wonderful Bayesian Methods for Hackers.
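To make the last point concrete, here is a minimal sketch, under the assumption that the question's pymc_obj has been built as shown above, of how posterior samples are typically pulled out of a PyMC 2 MCMC object after sampling; the variable name mcmc_model, the burn-in value and the probability query are purely illustrative.
mcmc_model = pm.MCMC(pymc_obj)
mcmc_model.sample(10000, burn=2000, verbose=0)

# Each stochastic's posterior samples live in the trace
theta1_samples = mcmc_model.trace('theta1')[:]
theta2_samples = mcmc_model.trace('theta2')[:]

# Point estimates and uncertainty straight from the samples
print(theta1_samples.mean(), theta1_samples.std())

# Posterior probability of a region, estimated as the fraction of samples falling in it
print((theta1_samples > 50).mean())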

Related

Tensorflow MCMC doesn't evolve chain states

I'm fairly new to tensorflow and MCMC in general. I'm doing a few basic calculations with different models; the most basic model converges without problem and gives good results from the MCMC calculation. However, when I use a more advanced model, I have a problem where the chain states never evolve from the initial state.
I'm calling the sampler via this code:
nkernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=_tf_lnlike,
    num_leapfrog_steps=5,
    step_size=0.1)
adapt_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=nkernel,
    num_adaptation_steps=num_burnin_steps,
    target_accept_prob=0.75)
chains_states = tfp.mcmc.sample_chain(
    num_results=nresults,
    num_burnin_steps=num_burnin_steps,
    current_state=initial_state,
    kernel=adapt_kernel,
    trace_fn=None)
The likelihood function looks like this:
@tf.function
def _tf_lnlike(theta):
    y0 = tf.tensordot(tf.ones(theta.shape[0], dtype=dtype), data, axes=0)
    y0_err = tf.tensordot(tf.ones(theta.shape[0], dtype=dtype), data_err, axes=0)
    y_model = _tf_model(theta)
    return tf.math.reduce_sum(-0.5*((y_model-y0)/y0_err)**2, axis=1)
where _tf_model is a rather complex function (so I won't post it here). This is essentially trying to fit some input data (which are tf.constant). The first thing I checked was the gradients, which had inf or nan values from _tf_model. The simplest way I could think of to solve that was to write a very simple numerical gradient function into the likelihood function, since the model is not analytically differentiable. _tf_lnlike now returns some reasonable gradients, but I still have the same problem with the sampler. Honestly I'm not familiar enough with tf to even diagnose why it's not working, so some suggestions for troubleshooting would be appreciated!
Edit: after some playing around it seems to be related to whether or not the model function calls tf.reduce_sum at any point.
It's hard to say much without understanding what's inside _tf_model. If it has inf or nan values or gradients, that can be troublesome as you've already seen. But also if the curvature (second order derivatives) of the likelihood is very sharp, the log-likelihood can be extremely sensitive to any move so that any proposal would be rejected. Are there any constraints on theta (must be positive, etc)? If so, you may want to use TransformedTransitionKernel to enforce those.
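If theta does have to stay in a constrained region (for example strictly positive), here is a minimal sketch of wrapping the HMC kernel in a TransformedTransitionKernel; the Softplus bijector is only an illustrative choice, and the other arguments mirror the ones from the question.
import tensorflow_probability as tfp

# Assumes _tf_lnlike, nresults, num_burnin_steps and initial_state are defined as in the question.
hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=_tf_lnlike,
    num_leapfrog_steps=5,
    step_size=0.1)
kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=hmc,
    bijector=tfp.bijectors.Softplus())  # sampler moves in unconstrained space, theta stays > 0
chains_states = tfp.mcmc.sample_chain(
    num_results=nresults,
    num_burnin_steps=num_burnin_steps,
    current_state=initial_state,
    kernel=kernel,
    trace_fn=None)
Step-size adaptation can still be added by wrapping this kernel in SimpleStepSizeAdaptation as before, although depending on the TFP version you may need custom step-size getter/setter functions when kernels are nested like this.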

Can someone give a good math/stats explanation as to what the parameter var_smoothing does for GaussianNB in scikit learn?

I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.
A Gaussian curve can serve as a "low pass" filter, allowing only the samples close to its mean to "pass." In the context of Naive Bayes, assuming a Gaussian distribution is essentially giving more weights to the samples closer to the distribution mean. This might or might not be appropriate depending if what you want to predict follows a normal distribution.
The variable, var_smoothing, artificially adds a user-defined value to the distribution's variance (whose default value is derived from the training data set). This essentially widens (or "smooths") the curve and accounts for more samples that are further away from the distribution mean.
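As a rough illustration of that effect (synthetic data and arbitrary values, purely for demonstration), fitting the same data with two settings of var_smoothing shows how the added epsilon_ grows with the parameter:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic dataset, purely illustrative
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

for vs in (1e-9, 1e-1):
    gnb = GaussianNB(var_smoothing=vs).fit(X, y)
    # epsilon_ = var_smoothing * max per-feature variance; it is added to every class variance
    print(vs, gnb.epsilon_)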
I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
In statistics, a probability distribution function such as the Gaussian depends on sigma^2 (the variance); and the more the variance differs between two features, the less correlated they are and the better the estimator, since naive Bayes is an i.i.d. model (basically, it assumes the features are independent).
However, computationally it is very common in machine learning that very large or very small values in vectors, or floating-point operations on them, produce errors such as "ValueError: math domain error". This extra variable can serve as an adjustable safeguard in case some kind of numerical error occurs.
It would also be interesting to explore whether this value can be used for further control, such as avoiding over-fitting, since this new epsilon_ is added to the variance (sigma^2) or standard deviation (sigma).

sklearn logistic regression gives biased results?

I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).
I run LR as follows: I have a matrix called “success_fail” that has, for each setting (row of the design matrix), the number of successes and the number of failures. I run LR as:
skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(success_fail.shape[0]),
                     np.zeros(success_fail.shape[0])))
skweight = np.hstack((success_fail['success'], success_fail['fail']))

logregN = linear_model.LogisticRegression(C=1,
                                          solver='lbfgs',
                                          fit_intercept=False)
logregN.fit(skdesign, sklabel, sample_weight=skweight)
(sklearn version 0.18)
I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better.
Below, I plot the results for the 1000 different regressions for 2 different values of C:
I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model.
Why is this happening? How can I fix it? Can I make sklearn regularize the intercept less?
Thanks to the lovely folks on the sklearn mailing list I found out the answer. As you can see in the question, I built a design matrix that already includes an intercept column and then fit the model with fit_intercept=False. This resulted in the intercept being regularized along with the other coefficients. Very stupid on my part! All I needed to do was remove the intercept column from the design matrix and drop fit_intercept=False.
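For reference, a minimal sketch of the corrected call, assuming design no longer contains a column of ones and the other variables are built as in the question:
# Let sklearn add and estimate the intercept itself instead of treating it as
# just another (penalized) column of the design matrix.
logregN = linear_model.LogisticRegression(C=1, solver='lbfgs', fit_intercept=True)
logregN.fit(skdesign, sklabel, sample_weight=skweight)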

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into ordinary linear regression; alpha is basically the strength of the penalty that pushes the weights toward smaller values. You can also force the weights to be strictly positive. Check this out here.
Run it with a small degree and perform a cross-validation to check how good it fits.
Increasing the degree of the polynomial generally leads to over-fitting, so if you find yourself forced to use degree 4 or 5, you should look for other models.
You should also take a look at this question. This explains how you can curve fit.
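Putting the pieces above together, here is a minimal sketch (illustrative names input_data and Y_, a small degree, default scoring) of scoring one candidate degree with cross-validation:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)           # expand features into degree-2 terms

clf = linear_model.Lasso(alpha=0.5, positive=True)
scores = cross_val_score(clf, X_, Y_, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())            # compare these across candidate degrees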
ANOVA (analysis of variance) uses covariance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.

Unbalanced classification using RandomForestClassifier in sklearn

I have a dataset where the classes are unbalanced. The classes are either '1' or '0' where the ratio of class '1':'0' is 5:1. How do you calculate the prediction error for each class and the rebalance weights accordingly in sklearn with Random Forest, kind of like in the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any
single class carrying a negative weight in either child node.
In older versions there was a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
Update
Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the sample_weight 1-d array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0, and you want to balance the class distributions, you could use the simple
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the link above for a slightly more crafty balance_weights weight-evaluation function.
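For completeness, a minimal sketch of passing those weights to the forest (X and y as above; the hyperparameters are arbitrary):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

sample_weight = np.array([5 if label == 0 else 1 for label in y])  # up-weight the minority class

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y, sample_weight=sample_weight)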
It is really a shame that sklearn's fit method does not allow specifying a performance measure to be optimized. No one around seems to understand, question, or be interested in what's actually going on when one calls the fit method on a data sample when solving a classification task.
We (users of the scikit-learn package) are silently left with the suggestion to indirectly use cross-validated grid search with a specific scoring method suitable for unbalanced datasets, in the hope of stumbling upon a parameter/metaparameter set which produces an appropriate AUC or F1 score.
But think about it: it looks like the fit method called under the hood always optimizes accuracy. So in the end, if we aim to maximize the F1 score, GridSearchCV gives us "the model with the best F1 among all models with the best accuracy". Is that not silly? Would it not be better to directly optimize the model's parameters for maximal F1 score?
Remember the good old Matlab ANN package, where you can set the desired performance metric to RMSE, MAE, or whatever you want, given that a gradient-calculating algorithm is defined. Why is the choice of performance metric silently omitted from sklearn?
At least, why is there no simple option to assign class instance weights automatically to remedy unbalanced-dataset issues? Why do we have to calculate the weights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome, if not the best, source of information on the topic. No, really? Why is the unbalanced-dataset problem (which is obviously of utter importance to data scientists) not covered anywhere in the docs then?
I address these questions to the contributors of sklearn, should they read this. Anyone who knows the reasons for this is welcome to comment and clear things up.
UPDATE
Since scikit-learn 0.17, there is class_weight='balanced' option which you can pass at least to some classifiers:
The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y)).
Use the parameter class_weight='balanced'
From sklearn documentation: The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
If the majority class is 1, and the minority class is 0, and they are in the ratio 5:1, the sample_weight array should be:
sample_weight = np.array([5 if i == 0 else 1 for i in y])
Note that the ratios are inverted: the larger weight is associated with the minority class. The same applies to class_weight, consistent with the balanced formula quoted above.
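A minimal sketch of the class_weight route, which avoids computing the weights by hand (hyperparameters arbitrary):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced')
clf.fit(X, y)  # weights derived internally as n_samples / (n_classes * np.bincount(y))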
