TensorFlow MCMC doesn't evolve chain states

I'm fairly new to TensorFlow and MCMC in general. I'm running a few basic calculations with different models; the most basic model converges without problems and gives good results from the MCMC calculation. However, when I use a more advanced model, the chain states are never evolved from the initial state.
I'm calling the sampler via this code:
nkernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=_tf_lnlike,
    num_leapfrog_steps=5,
    step_size=0.1)
adapt_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=nkernel,
    num_adaptation_steps=num_burnin_steps,
    target_accept_prob=0.75)
chains_states = tfp.mcmc.sample_chain(
    num_results=nresults,
    num_burnin_steps=num_burnin_steps,
    current_state=initial_state,
    kernel=adapt_kernel,
    trace_fn=None)
The likelihood function looks like this:
@tf.function
def _tf_lnlike(theta):
    y0 = tf.tensordot(tf.ones(theta.shape[0], dtype=dtype), data, axes=0)
    y0_err = tf.tensordot(tf.ones(theta.shape[0], dtype=dtype), data_err, axes=0)
    y_model = _tf_model(theta)
    return tf.math.reduce_sum(-0.5*((y_model - y0)/y0_err)**2, axis=1)
where _tf_model is a rather complex function (so I won't post it here). This is essentially trying to fit some input data (which are tf.constant). The first thing I checked was the gradients, which had inf or nan values coming from _tf_model. Since the model is not analytically differentiable, the simplest fix I could think of was to write a very simple numerical gradient into the likelihood function. _tf_lnlike now returns reasonable gradients, but I still have the same problem with the sampler. Honestly, I'm not familiar enough with tf to even diagnose why it's not working, so some suggestions for troubleshooting would be appreciated!
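For reference, a minimal version of the gradient check I ran looks like this (a sketch; it assumes theta_test matches the shape the sampler passes to _tf_lnlike):

import tensorflow as tf

# Evaluate the gradient of the log-likelihood at the starting point and
# make sure every entry is finite.
theta_test = tf.constant(initial_state)
with tf.GradientTape() as tape:
    tape.watch(theta_test)
    lp = _tf_lnlike(theta_test)
grads = tape.gradient(lp, theta_test)
tf.debugging.check_numerics(grads, "grad of _tf_lnlike")  # raises on inf/nan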
Edit: after some playing around it seems to be related to whether or not the model function calls tf.reduce_sum at any point.

It's hard to say much without knowing what's inside _tf_model. If it produces inf or nan values or gradients, that can be troublesome, as you've already seen. But also, if the curvature (second-order derivatives) of the likelihood is very sharp, the log-likelihood can be extremely sensitive to any move, so that every proposal gets rejected. Are there any constraints on theta (must be positive, etc.)? If so, you may want to use TransformedTransitionKernel to enforce them.
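If, for example, theta must be positive, a sketch could look like this (names reused from the question; the Softplus bijector is just one possible choice, Sigmoid would suit a (0, 1) constraint):

import tensorflow_probability as tfp
tfb = tfp.bijectors

# HMC proposes in an unconstrained space; the bijector maps states back
# into the constrained space, so theta can never leave its support.
tkernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=nkernel,
    bijector=tfb.Softplus())
adapt_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=tkernel,
    num_adaptation_steps=num_burnin_steps,
    target_accept_prob=0.75)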


ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

I'm running the local binary pattern code from Adrian. The program runs, but it gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning
I am running Python 2.7 with OpenCV 3.7. What should I do?
Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to poor scaling of the decision variables. There are a few things you can try.
1. Normalize your training data so that the problem hopefully becomes more well-conditioned, which in turn can speed up convergence. One possibility is to scale your data to zero mean and unit standard deviation using Scikit-Learn's StandardScaler. Note that you have to apply the StandardScaler fitted on the training data to the test data (see the sketch after this list). Also, if you have discrete features, make sure they are transformed in a way that makes scaling them meaningful.
2. Related to 1), make sure the other arguments, such as the regularization weight C, are set appropriately. C has to be > 0. Typically one tries various values of C on a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before fine-tuning it at finer granularity within a particular interval. These days it probably makes more sense to tune parameters using, for example, Bayesian optimization with a package such as Scikit-Optimize.
3. Set max_iter to a larger value. The default is 1000. This should be your last resort: if the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems, such as those described in 1) and 2). It might even indicate that you have some inappropriate features or strong correlations between features. Debug those first before taking this easy way out.
4. Set dual=True if the number of features is greater than the number of examples, and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks to @Nino van Hooff for pointing this out, and to @JamesKo for spotting my mistake.
5. Use a different solver, for example the L-BFGS solver if you are using logistic regression. See @5ervant's answer.
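A minimal sketch of points 1) and 5) combined (X_train, y_train, X_test, y_test are hypothetical arrays standing in for your data; the pipeline guarantees that the scaler fitted on the training data is the one applied at prediction time):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Fit the scaler on the training data only; the pipeline reuses the same
# fitted transform when predicting on the test data.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', C=1.0, max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))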
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters is not guaranteed to be any good, and hence the predictions can be complete garbage.
Edit
In addition, consider the comment by @Nino van Hooff and @5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is greater than the number of training examples N. This is what the dual formulation of the SVM is particularly designed for, and it helps with the conditioning of the optimization problem. Credit to @5ervant for noticing and pointing this out.
Furthermore, @5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick, rough explanation for those who are interested (I am :)) of why this matters in this case. Second-order methods, and in particular approximate second-order methods like the L-BFGS solver, help with ill-conditioned problems because they approximate the Hessian at each iteration and use it to scale the gradient direction. This allows them to achieve a better convergence rate, but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish, but each iteration is slower than in a typical first-order method like gradient descent or its variants.
For example, a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning-rate schedule.
A second-order method, for example Newton's method, has the update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information about the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will point in less-than-ideal directions, and the inverse-Hessian scaling will help correct this.
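A toy illustration of the point (all numbers here are made up for illustration; H plays the role of the Hessian of an ill-conditioned quadratic f(x) = 0.5 * x'Hx):

import numpy as np

H = np.diag([1.0, 1e4])      # condition number 1e4
x_gd = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])
alpha = 1e-4                 # roughly the largest step size gradient descent tolerates

for k in range(100):
    x_gd = x_gd - alpha * (H @ x_gd)                      # first-order update
x_newton = x_newton - np.linalg.inv(H) @ (H @ x_newton)  # one Newton step

print(x_gd)      # ~[0.99, 0.]: barely moved along the badly scaled direction
print(x_newton)  # [0., 0.]: at the optimum after a single iteration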
In particular, the L-BFGS method mentioned in @5ervant's answer is a way to approximate the inverse of the Hessian, since computing it exactly can be an expensive operation.
However, second-order methods might converge much faster (i.e., require fewer iterations) than first-order methods like the usual gradient-descent-based solvers, which, as you know by now, sometimes fail to converge at all. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days, with many models optimizing non-convex problems (e.g., those in DL models), second-order methods such as L-BFGS play a different role there, and there is evidence to suggest they can sometimes find better solutions than first-order methods. But that is another story.
I got to the point of setting max_iter=1200000 on my LinearSVC classifier, but the ConvergenceWarning was still present. I fixed the issue by just setting dual=False and leaving max_iter at its default.
With a LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine reached max_iter=7600 before the ConvergenceWarning disappeared when training on a large dataset's features.
Explicitly specifying max_iter resolves the warning, as the default max_iter is 100 (for logistic regression).
logreg = LogisticRegression(max_iter=1000)
Please increase max_iter to 10000, as the default value (100 for LogisticRegression) may be too small. Increasing the number of iterations will possibly help the algorithm converge. For me it converged, and the solver was 'lbfgs'.
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

GAN Generator output layer is [way] out of target range - potentially due to loss func

The TL;DR is that my G(z) (z is normally distributed noise) is way out of the expected range. (Picture: comparison of real against generated data.)
(It's also a terrible model, being just a few dense layers - but I can sort that later!)
I've thrown s**t at the wall for the past week to see what sticks and now it's time to ask for an informed opinion! :)
Has anyone else encountered this problem before? Suggested fixes?
What is your generic loss function for your regressive Generator (with a categorical discriminator)?
As it's a regression problem (with target data normalised between 0 and 1), my G output has a linear activation. However, I'm trying to follow Soumith's GAN Hacks to get it working, and we conflict over the activation: he says it should be tanh. Given my expected target range one could assume sigmoid, but, as I say, I infer it should be linear since this is a regression problem. Opinions?
Plus, there isn't much clarity on the loss function. He says:
In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D
Does this mean one should aim to minimise the error log(1 - D), or pick the minimum of log(1 - D)? Similarly with max log D. (I'm using class definitions of true/false, so a maximum and minimum can be chosen from the two classes.)
In the Goodfellow et al. 2014 paper on GANs, I can see how part of the discriminator's loss function resolves to 0 when applied to the generator. However, I'm obviously missing something, as the equation says I should take log(D(x)), which will sooner or later hit a NaN.
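For concreteness, here are the two readings of that line written as code (a hedged sketch; d_on_fake is a made-up name for D(G(z)) as a probability in (0, 1), and eps is a purely numerical guard):

import tensorflow as tf

eps = 1e-8

# "min log(1 - D)": minimise the batch average of log(1 - D(G(z))), the
# saturating loss from the 2014 paper; its gradient vanishes once the
# discriminator confidently rejects fakes (D(G(z)) -> 0).
g_loss_saturating = tf.reduce_mean(tf.math.log(1.0 - d_on_fake + eps))

# "max log D": maximise the batch average of log(D(G(z))), written here as
# a minimisation; same fixed point, much stronger gradient early on.
g_loss_nonsaturating = -tf.reduce_mean(tf.math.log(d_on_fake + eps))

So it is the batch expectation that is minimised or maximised, not a minimum element picked from the two classes.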
Many thanks in advance for any clarification of my confused state.
Update:
This tutorial on O'Reilly says
Take a look at the last line of our discriminator: there's no softmax or sigmoid layer at the end. GANs can fail if their discriminators "saturate," or become confident enough to return exactly 0 when they're given a generated image; that leaves the discriminator without a useful gradient to descend.
That sounds a little like my problem, especially as I was using softmax. Using no activation doesn't solve the problem, but it does make my GAN compete in interesting new ways!
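On that saturation/NaN point, one common workaround is to keep the discriminator's final layer linear and let the loss work on raw logits (a sketch; d_logits_fake is an assumed name for the discriminator's raw output on generated samples):

import tensorflow as tf

# sigmoid_cross_entropy_with_logits evaluates log-sigmoid in a numerically
# stable way, so neither log(0) NaNs nor exactly saturated outputs arise.
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(d_logits_fake),  # generator wants fakes labelled real
        logits=d_logits_fake))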

changing the bias parameter b in scikit SVM after training, before prediction

I'm using sklearn.svm.SVC for a classification problem. After having trained on my data, I would like to loop the bias (i.e. the term b in the usual sign(w.x + b) SVM equation) through a number of values, so as to produce an ROC curve. (I've already performed cross-validation and chosen my hyperparameters, so this is for testing).
I tried playing with the .intercept_ attribute, but this doesn't change what I get out from .predict()... Is there an alternative method for altering the bias term?
I could potentially recover the support vectors, and then implement my own .predict() function, with an altered bias, but this seems like a rather heavy-handed approach.
I had the very same problem two years ago. Unfortunately, the only solution is to do this yourself. Implementing predict is pretty straightforward; it is a one-liner in Python. Unfortunately, .intercept_ is actually a copy of the intercept used internally (the libsvm one). A quite confusing thing is that for LinearSVC from the very same library this is not true, and you can actually alter the bias (however, without access to kernels, obviously).
Obviously you do not have to go as deep as computing kernel values yourself. You still have access to decision_function, which already has the bias inside. Simply remove the old bias from the decision function, add the new one, and take the sign:
import numpy as np

def new_predict(clf, new_bias, X):
    # decision_function(X) = w.x + b, so subtract the old bias (intercept_)
    # and add the new one before taking the sign
    return np.sign(clf.decision_function(X) - clf.intercept_ + new_bias)
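For the ROC use case specifically, you may not even need to move the bias: sweeping a threshold over the decision values is exactly what sklearn.metrics.roc_curve does (a sketch, assuming a fitted binary clf and held-out X_test, y_test):

from sklearn.metrics import roc_curve

# Varying the threshold on w.x + b is equivalent to varying b itself.
fpr, tpr, thresholds = roc_curve(y_test, clf.decision_function(X_test))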

Posterior probability with pymc

(This question was originally posted on stats.O. I moved it here because it does relate to pymc and more general matters within it: in fact, the main aim is to get a better understanding of how pymc works. If any of the moderators believes it is not suitable for SO, I'll remove it from here.)
I've been reading the pymc tutorial and many other questions both here and on SO.
I am trying to understand how to apply Bayes' theorem to compute the posterior probability using certain data. In particular, I have a tuple of independent parameters theta = (theta1, theta2, ...).
From the data I'd like to infer the likelihood P(D | theta), where D is a certain event (here, the observed data). The aim is then to compute the posterior P(theta | D).
Some additional comments:
This is a sort of unsupervised learning: I know that D happened, and I want to find the parameters theta that maximise the probability P(theta | D). (*)
I'd also like to have a parallel procedure where I let pymc compute the likelihood given the data, and then for each tuple of parameters I want to get the posterior probability.
In the following I'll assume theta = (theta1, theta2) and that the likelihood is a multivariate normal distribution with a diagonal covariance matrix (because of the independence).
The following is the code I'm using (for simplicity assume there are only two parameters). The code is still under development (I know it can't work!). But I believe it is useful to include it, and then refine it following comments and answers to provide a skeleton for future reference.
import numpy as np
import pymc as pm

class ObsData(object):
    def __init__(self, params):
        self.theta1 = params[0]
        self.theta2 = params[1]

class Model(object):
    def __init__(self, data):
        # Priors
        self.theta1 = pm.Uniform('theta1', 0, 100)
        self.theta2 = pm.Normal('theta2', 0, 0.0001)

        @pm.deterministic
        def model(theta1=self.theta1, theta2=self.theta2):
            return (theta1, theta2)

        # Is this the actual likelihood?
        self.likelihood = pm.MvNormal(
            'likelihood',
            mu=model,
            tau=np.identity(2),
            value=data,  # is it correct to put the data here?
            observed=True)

def mcmc(observed_data):
    data = ObsData(observed_data)
    pymc_obj = Model(data)
    model = pm.MCMC(pymc_obj)
    model.sample(10000, verbose=0)  # Does this line compute the likelihood and the normalisation factor?
    # How do I get the posterior distribution?
The questions are the following:
Does self.likelihood represent the Bayesian likelihood?
How do I make use of the data? (I suspect value=data is incorrect..)
Does .sample() actually compute the posterior probability?
How do I get information from the posterior?
(*) Should I include anything relating to the fact that D happened at some point?
As a general question: is there any way to use pymc to only compute the likelihood, given the data and the priors?
Any comments, and also reference to other question or tutorial are welcome!
For starters, I think you want to return (theta1*theta2) from your definition of model.
model.sample is sampling, not computing, the posterior distributions (given sufficient burn-in, etc) of your parameters given your data, and the likelihood of specific values for each tuple of parameters can be determined from the joint posterior after sampling.
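To the last two questions: after sampling, the posterior draws live in the trace (a sketch against the PyMC 2 API used in the question):

# after model.sample(10000, verbose=0):
theta1_samples = model.trace('theta1')[:]  # posterior draws for theta1
theta2_samples = model.trace('theta2')[:]
print(model.stats())    # summary statistics for each stochastic
pm.Matplot.plot(model)  # trace, autocorrelation and histogram plots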
I think you've got some fundamental misunderstanding of MCMC at the moment. I can't think of a better way to answer your questions than to point you to the wonderful Bayesian Methods for Hackers.

Bayesian Stochastic Optimal Control, MCMC

I have a Stochastic Optimal Control problem that I wish to solve, using some type of Bayesian Simulation based framework. My problem has the following general structure:
s_{t+1} = r * s_t * (1 - s_t) - x_{t+1} + epsilon_{t+1}
x_{t+1} ~ Beta(u_{t+1}, w_{t+1})
u_{t+1} = f_1(u_t, w_t, s_t, x_t)
w_{t+1} = f_2(u_t, w_t, s_t, x_t)
epsilon_t ~ Normal(0, sigma)
objective function: max_{x_t} E( sum_{t=0}^{T} V(s_t, x_t, c) * rho^t )
My goal is to explore different functional forms of f_1, f_2, and V to determine how this model differs with respect to a non-stochastic model and another, simpler stochastic model.
The state variable is s_t and the control variable is x_t, with u_t and w_t representing a belief about the current state. The objective is to maximise the expected discounted gains (function V) over the period t = 0 to t = T.
I was thinking of using Python, specifically PyMC, to solve this, though I am not sure how to proceed, specifically how to optimize the control variables. I found a book, published in 1967, Optimization of Stochastic Systems by Masanao Aoki, that references some Bayesian techniques that may be useful. Is there a current Python implementation that may help? Or is there a much better way to simulate an optimal path using Python?
The first guess coming to my mind is to try neural-network packages like Chainer or Theano, which can track the derivative of your cost function with respect to the control-function parameters; they also have a bunch of optimization plug-in routines. You can use numpy.random to generate samples (particles), compose your control functions from the libraries' components, and run them through an explicit Euler scheme for a first try. This will give you the cost function on your particles and its derivative with respect to the parameters, which can be fed to the optimizers.
One issue that can arise here is that the solver's iterations will create a host of derivative-tracking objects.
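As a concrete starting point, the "particles through a forward simulation" idea might look like this (a rough sketch of a Monte Carlo estimate of the objective; the constants and the signatures of f1, f2 and V are placeholders for your own choices, with c folded into V, and timing conventions are simplified):

import numpy as np

rng = np.random.default_rng(0)
n_particles, T = 1000, 50
r, sigma, rho = 3.7, 0.02, 0.95   # placeholder constants

def estimate_objective(f1, f2, V, u0=2.0, w0=2.0, s0=0.5):
    """Monte Carlo estimate of E[ sum_t rho^t * V(s_t, x_t) ]."""
    s = np.full(n_particles, s0)
    u = np.full(n_particles, u0)
    w = np.full(n_particles, w0)
    total = np.zeros(n_particles)
    for t in range(T):
        x = rng.beta(u, w)                     # draw control from the belief
        total += rho ** t * V(s, x)            # accumulate discounted gains
        eps = rng.normal(0.0, sigma, n_particles)
        u, w = f1(u, w, s, x), f2(u, w, s, x)  # belief updates
        s = r * s * (1.0 - s) - x + eps        # state transition
    return total.mean()

With a fixed seed the estimate is deterministic in the parameters of f_1 and f_2, so it could be handed to a black-box optimizer as a first, crude approach.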
Update: please see this example on GitHub.
There are also a number of hits on GitHub for the keywords particle filter python:
https://github.com/strohel/PyBayes
https://github.com/jerkern/pyParticleEst
There is also a manuscript around that mentions the author implemented filters in Python, so you might want to contact them.
