Advice on how to create a custom tf.keras optimizer (optimizer_v2) - python

I want to make an accumulated SGD optimizer for tf.keras (not standalone Keras). I have found a couple of implementations of accumulated SGD optimizers for standalone Keras, including this one on PyPI. However, the project I am working on uses tf.keras, and from what I have seen it is not a good idea to mix the two.
The problem is that the documentation for writing such a custom optimizer is not really straightforward. The base class (which I should inherit from) is OptimizerV2 in optimizer_v2.py, which contains some information about the task in its comment section.
The required methods that should be overridden are (a minimal skeleton covering these overrides is sketched after the list):
- resource_apply_dense (update variable given gradient tensor is dense)
- resource_apply_sparse (update variable given gradient tensor is sparse)
- create_slots (if your optimizer algorithm requires additional variables)
- get_config (serialization of the optimizer; include all hyperparameters)
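For reference, here is a minimal sketch of what I understand such a subclass should look like, based on the comments in optimizer_v2.py (the class name AccumSGD and the accum_steps parameter are just placeholders; the accumulation logic is only indicated, not fully implemented):

import tensorflow as tf

class AccumSGD(tf.keras.optimizers.Optimizer):
    """Minimal sketch of a custom OptimizerV2 subclass; not a full accumulated SGD."""

    def __init__(self, learning_rate=0.01, accum_steps=4, name="AccumSGD", **kwargs):
        super(AccumSGD, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)
        self.accum_steps = accum_steps

    def _create_slots(self, var_list):
        # one accumulator slot per trainable variable
        for var in var_list:
            self.add_slot(var, "accum")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        accum = self.get_slot(var, "accum")
        accum_t = accum.assign_add(grad)
        # A real accumulator would only apply (and reset) every `accum_steps`
        # iterations; here the averaged gradient is applied on every step.
        return var.assign_sub(lr * accum_t / self.accum_steps)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        raise NotImplementedError("sparse updates not sketched here")

    def get_config(self):
        config = super(AccumSGD, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "accum_steps": self.accum_steps,
        })
        return config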
Of course, of the methods listed above only get_config() actually exists under that name in the base class: resource_apply_dense is actually _resource_apply_dense, resource_apply_sparse is _resource_apply_sparse, and create_slots does not exist in the base class at all. In subclasses such as SGD in gradient_descent.py, create_slots appears as _create_slots.
Anyway, the documentation is apparently out of date (there is also a GitHub issue pointing out this inconsistency, but I don't remember the link), which makes the whole procedure difficult. For example, in SGD I have to override the _resource_apply_dense() method, but I cannot understand where the gradients are calculated and where they are applied.
The actual code is given below:
def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))

    if self._momentum:
        momentum_var = self.get_slot(var, "momentum")
        return training_ops.resource_apply_keras_momentum(
            var.handle,
            momentum_var.handle,
            coefficients["lr_t"],
            grad,
            coefficients["momentum"],
            use_locking=self._use_locking,
            use_nesterov=self.nesterov)
    else:
        return training_ops.resource_apply_gradient_descent(
            var.handle, coefficients["lr_t"], grad, use_locking=self._use_locking)
which obviously relies on training_ops.resource_apply_keras_momentum and training_ops.resource_apply_gradient_descent to do the actual job. How can I separate the two parts mentioned in the minimize() method of OptimizerV2, namely _compute_gradients() and apply_gradients(), given the code above?
There are a lot of confusing parts in these comments, for example this one in the base class:
Many optimizer subclasses, such as Adam and Adagrad allocate and
manage additional variables associated with the variables to train.
These are called Slots. Slots have names and you can ask the
optimizer for the names of the slots that it uses.
although if I declare an Adam optimizer and ask for its slot names, I get an empty list (?):
optimizer = Adam(lr=1e-3)
optimizer.get_slot_names()
[]
Another confusing issue is the use of private methods: it is not clear when they are called or what their purpose is. For example, _prepare_local() is defined in SGD and includes the line:
apply_state[(var_device, var_dtype)]["momentum"] = array_ops.identity(self._get_hyper("momentum", var_dtype))
Anyway, the problem is that I do not know exactly which approach to follow to create a custom tf.keras optimizer. The instructions in the comments seem to contradict the actual implemented subclasses, and the subclasses in turn delegate the dirty work to C++ functions, without it being clear how this is done or how (in my case) to separate the different actions (like gradient calculation and gradient application). So, is there any advice on how to proceed and which steps to follow to accomplish this (relatively) simple task?
I am using tf 1.15 by the way (so the links are from there).

Reference for an optimizer implementation: DiffGrad (Adam-like)
https://github.com/evanatyourservice/diffGrad-tf/blob/master/diffgrad.py
It is based on a paper called DiffGrad; the paper has good explanations and is generally a good read.
First of all, good question; secondly, the TensorFlow documentation could be a lot better. Answers to the various questions, in no particular order:
Regarding the empty slot list for Adam: as far as I have seen, you have to run model.fit once on a model for the slots to be initialized. I remember reading about this while looking up how to save and load optimizer states (check whether model.compile is enough).
As for _prepare_local, that line creates the momentum coefficient from the hyperparameter you set on creation. I suppose it makes it accessible to all the weights the optimizer is trying to update; why they use identity is deep TensorFlow graph stuff.
_prepare_local is generally used to create values that are common across all weights being updated, such as decays, learning rates, or time steps. Within one iteration these values are shared across all variables tracked in the optimizer's var_list.
Unlike _prepare_local above, slots are separate variables for each weight tracked by the optimizer, so you might keep moments, history, or a cumulative sum there: anything tied to that specific individual weight.
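To make the difference concrete, here is a rough sketch (the class, coefficient, and slot names are made up for illustration, not taken from the actual SGD source):

import tensorflow as tf

class MySGD(tf.keras.optimizers.SGD):
    """Illustrative only: shows where shared vs. per-variable state lives."""

    def _prepare_local(self, var_device, var_dtype, apply_state):
        # shared values: computed once per (device, dtype) and reused for
        # every variable updated in this step (learning rate, momentum, ...)
        super(MySGD, self)._prepare_local(var_device, var_dtype, apply_state)
        apply_state[(var_device, var_dtype)]["my_coefficient"] = tf.identity(
            self._get_hyper("momentum", var_dtype))

    def _create_slots(self, var_list):
        # per-variable state: one extra slot variable for each model weight
        super(MySGD, self)._create_slots(var_list)
        for var in var_list:
            self.add_slot(var, "my_history")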
Gradient compute and apply: if I understand this correctly, computing the gradients takes the loss, does backpropagation / automatic differentiation, and gets you the gradients for each weight. Applying them is when the optimizer comes into play with its slots and variables; finally, the optimizer performs the update using the computed gradients as inputs.
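If it helps, here is roughly what minimize() does internally, written out with a GradientTape (eager-style sketch; model, loss_fn, x and y stand for your own objects):

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

with tf.GradientTape() as tape:
    predictions = model(x, training=True)   # forward pass
    loss = loss_fn(y, predictions)

# part 1: compute the gradients (backprop / automatic differentiation)
grads = tape.gradient(loss, model.trainable_variables)

# part 2: apply them; this is where slots, _prepare_local and
# _resource_apply_dense / _resource_apply_sparse come into play
optimizer.apply_gradients(zip(grads, model.trainable_variables))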

Related

Smartest way to add KL Divergence into (Variational) Auto Encoder

I have an autoencoder model with multiple outputs and loss weighting which I want to extend into a variational autoencoder.
I followed this official Keras tutorial: https://keras.io/examples/generative/vae/
But if I manually adapt the train_step function, I lose the majority of my original implementation details:
I have two weighted optimization goals: reconstruction (decoder) and classification (softmax)
accuracy metrics for the classification
the original fit method also takes care of the validation data and corresponding metrics
Adding the suggested sampling layer from the Keras link is no problem, but correctly implementing the Kullback-Leibler loss is, since it depends on the additional parameters z_mu and z_log_var, which is not supported by standard Keras losses.
I searched for workarounds to solve this issue, but none of them was successful:
re-writing the train_step: it's hard to fully re-implement all the details (weighting, multiple losses with different inputs -> decoder: data, classifier: labels, etc.)
adding a pseudo layer to the encoder that calculates the loss, as done here: https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/. The problem here is that add_loss does not specify under which key and how the KL loss is added to the model's total loss
adding everything as global/top-level elements to make z_mu and z_log_var accessible for the loss calculation, like here: https://www.machinecurve.com/index.php/2019/12/30/how-to-create-a-variational-autoencoder-with-keras/. This is the approach I like the least, as my current architecture is parametrized so that it can, e.g., be tuned with hyperopt
I was not able to find a satisfying solution to this problem. Since VAEs are more and more popular, I am surprised that there is no extended tutorial on this, especially for models with multiple inputs and outputs; or maybe I am just unable to find the right answers with my queries.
Any opinions welcome!
After a couple of re-designs and some bug-ticket tracing, I found this recent example:
here
The VAE examples can be found at the very bottom of the post.
Solution: write your own train_step: the cleanest but also the hardest option, depending on how complex your loss calculation is.
Solution: use a functional approach to access the necessary variables and add the loss with .add_loss: not very clean, but straightforward to implement (you lose a dedicated loss tracker for the KL loss).
To achieve my weighting, I scaled the KL loss by the weight of my decoder loss before adding it via .add_loss.
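For illustration, here is a rough sketch of that functional + .add_loss approach with made-up layer sizes, names, and weights (only the KL handling matters here):

import tensorflow as tf
from tensorflow import keras

latent_dim = 2
kl_weight = 0.5  # scale the KL term in line with the decoder loss weight

inputs = keras.Input(shape=(784,))
h = keras.layers.Dense(64, activation="relu")(inputs)
z_mu = keras.layers.Dense(latent_dim, name="z_mu")(h)
z_log_var = keras.layers.Dense(latent_dim, name="z_log_var")(h)

def sample(args):
    # reparameterization trick
    mu, log_var = args
    eps = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = keras.layers.Lambda(sample)([z_mu, z_log_var])

reconstruction = keras.layers.Dense(784, activation="sigmoid", name="reconstruction")(z)
classification = keras.layers.Dense(10, activation="softmax", name="classifier")(z)

vae = keras.Model(inputs, [reconstruction, classification])

# KL divergence added directly to the model's total loss via add_loss
kl = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1.0 + z_log_var - tf.square(z_mu) - tf.exp(z_log_var), axis=-1))
vae.add_loss(kl_weight * kl)

vae.compile(
    optimizer="adam",
    loss={"reconstruction": "mse", "classifier": "categorical_crossentropy"},
    loss_weights={"reconstruction": 1.0, "classifier": 0.5},
    metrics={"classifier": "accuracy"})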
Note: The first thing I tested was to define a custom loss function for the MSE + KL loss and add it to my functionally designed model. This works if one turns TF eager evaluation off, but be careful: it really slows down your network, and you lose the ability to monitor your training via TensorBoard if you don't have admin rights for your NVIDIA GPU (profile_batch=0 does not turn off profiling when eager mode is switched off, so you run into INSUFFICIENT_PRIVILEGES errors from the CUPTI driver).

Potential bug in tensorflow CNN tutorial code using ExponentialMovingAverage?

I'm studying the code of the TensorFlow convolutional neural network tutorial, which trains a CNN on the CIFAR-10 dataset. The source code is here on GitHub, and the documentation is in Document.
My question is specifically about the use of ExponentialMovingAverage (doc here) in cifar10.py, lines 375-378, which is:
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')

return train_op
Here, variables_averages_op is an operation that updates all the shadow variables, and apply_gradient_op is an operation that applies the computed gradients to all original variables (i.e., it updates the original variables, a.k.a. the model weights).
Since control_dependencies doesn't guarantee the execution order of the operations passed to it, the execution order of apply_gradient_op and variables_averages_op is arbitrary in this example. This means that when running train_op, we could end up either updating the original variables first and then the corresponding shadow variables, or updating the shadow variables before the original variables. The latter seems unreasonable to me.
According to the official doc of ExponentialMovingAverage(link above), the update of the shadow variable relies on the original variable:
shadow_variable = decay * shadow_variable + (1 - decay) * variable
The update of the original variable should be before the update of the shadow ones, which is not the case in the tutorial code.
Can anyone help me clear that? Thanks.
I believe you are right. It does seem like a bug in the example. It is probably not important in practice as the order of variable update and moving average update is likely to be stable. Even if it is the "wrong" order, in the worst case, your moving average will be "one step ahead of the variable". Which is likely to have a less significant effect than changing your decay from 0.999 to 0.998 or something like that.
Just created a pull request to fix this: https://github.com/tensorflow/models/pull/3946
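For reference, the intended ordering can be expressed by nesting the control dependency instead of listing both ops together (variable names follow the tutorial's cifar10.train(); this is a sketch of the idea, not the exact patch):

# opt, grads, global_step and variable_averages are defined earlier in cifar10.train()

# apply the gradients first ...
apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

# ... and only then update the shadow (moving-average) variables
with tf.control_dependencies([apply_gradient_op]):
    variables_averages_op = variable_averages.apply(tf.trainable_variables())

with tf.control_dependencies([variables_averages_op]):
    train_op = tf.no_op(name='train')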

Why should I build separated graph for training and validation in tensorflow?

I've been using tensorflow for a while now. At first I had stuff like this:
def myModel(training):
    with tf.variable_scope('model', reuse=not training):
        # ... build the model ...
        return model

training_model = myModel(True)
validation_model = myModel(False)
Mostly because I started with some MOOCs that taught me to do that. But they also didn't use TFRecords or queues, and I didn't know why I was using two separate models. I tried building only one and feeding the data with a feed_dict: everything worked.
Ever since, I've usually been using only one model. My inputs are always placeholders and I just feed in either training or validation data.
Lately, I've noticed some weird behavior in models that use tf.layers.dropout and tf.layers.batch_normalization. Both functions have a 'training' parameter that I use with a tf.bool placeholder. I've seen tf.layers used mostly with a tf.estimator.Estimator, but I'm not using one. I've read the Estimator code and it appears to create two different graphs for training and validation. Maybe those issues arise from not having two separate models, but I'm still skeptical.
Is there a clear reason I'm not seeing that implies that two separate-equivalent models have to be used?
You do not have to use two neural nets for training and validation. After all, as you noticed, TensorFlow helps you have a monolithic train-and-validate net by allowing the training parameter of some layers to be a placeholder.
However, why wouldn't you? By having separate nets for training and validation, you set yourself on the right path and future-proof your code. Your training and validation nets might be identical today, but you might later see a benefit to having distinct nets, such as different inputs, different outputs, removed intermediate layers, etc.
Also, because variables are shared between them, having distinct training and validation nets comes at almost no penalty.
So, keeping a single net is fine; in my experience though, any project other than playful experimentation is likely to implement a distinct validation net at some point, and tensorflow makes it easy to do just that with minimal penalty.
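For illustration, a small TF1-style sketch of that shared-variable pattern (layer sizes are arbitrary): the second call reuses the variables created by the first, so the validation net adds essentially no extra parameters.

import tensorflow as tf

def my_model(inputs, training):
    # AUTO_REUSE: create variables on the first call, reuse them afterwards
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(inputs, 128, activation=tf.nn.relu)
        h = tf.layers.dropout(h, rate=0.5, training=training)
        return tf.layers.dense(h, 10)

train_inputs = tf.placeholder(tf.float32, [None, 784])
valid_inputs = tf.placeholder(tf.float32, [None, 784])

train_logits = my_model(train_inputs, training=True)   # dropout active
valid_logits = my_model(valid_inputs, training=False)  # dropout disabled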
tf.estimator.Estimator classes indeed create a new graph for each invocation and this has been the subject of furious debates, see this issue on GitHub. Their approach is to build the graph from scratch on each train, evaluate and predict invocations and restore the model from the last checkpoint. There are clear downsides of this approach, for example:
A loop that calls train and evaluate will create two new graphs on every iteration.
One can't easily evaluate while training (there are workarounds such as train_and_evaluate, but this doesn't look very nice).
I tend to agree that having the same graph and model for all actions is convenient and I usually go with this solution. But in a lot of cases when using a high-level API like tf.estimator.Estimator, you don't deal with the graph and variables directly, so you shouldn't care how exactly the model is organized.

Gradient Boosting with a OLS Base Learner

I've been playing with the boosting functions in sklearn and I've noticed a key difference between sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.AdaBoostRegressor. While the latter allows the user to specify the base learner, the former does not; specifically, sklearn.ensemble.GradientBoostingRegressor only utilizes trees. This is a bit annoying, as it would be nice to use OLS and spline base learners within gradient boosting. Am I missing something? Does another function within the sklearn library or a different Python library offer this functionality?
Here's one way of doing it.
Replace sklearn/ensemble/gradient_boosting.py with this script.
You'll then be able to pass base_estimator to GradientBoostingRegressor.
If you're satisfied, please enjoy. If not, please see below for a discussion.
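Assuming the replacement worked, usage looks roughly like this (base_estimator is the new argument added by the modified script; everything else is standard scikit-learn):

from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = load_boston(return_X_y=True)

# OLS base learner instead of the default regression trees
gbr = GradientBoostingRegressor(base_estimator=LinearRegression(),
                                n_estimators=100, learning_rate=0.1)
gbr.fit(X, y)
print(gbr.predict(X[:5]))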
Demonstration
Before we begin, I'd like to mention a few things. All of the plots in this post can be reproduced from the scripts available in my repo for this project. Simply replace the contents of restore.sh and run.sh with the appropriate directories for your local installation of Sklearn. Just keep in mind that the two bash scripts will permanently overwrite your existing Sklearn file (gradient_boosting.py). You may recover the file by simply copying it from the Sklearn repo. I don't claim that anything demonstrated in this post is good practice for developing new features for Sklearn. I'm not affiliated with Sklearn nor am I an experienced software developer. All scripts were tested on Sklearn 18.2.
Now, using the new gradient_boosting.py script, we may specify the linear base learner (as you typically would for AdaBoostRegressor). As a sanity check, we may fit the linear-base GBR on a single variable and ensure that it produces a linear visual when plotted against that variable. Here's the result for four (arbitrary) variables from the Boston Housing dataset.
As another sanity check, we may ensure that the results of the original tree-base GBR can be reproduced by passing a tree base_estimator to the new GBR. The following plot is an MSE 10-fold cross-validation profile with respect to the boosting iterations. That is, for each number of boosting iterations, I cross-validate to obtain a vector of 10 MSE scores and compute the mean/min of this vector. The left plot is the original GBR, while the right plot is the one using the new gradient_boosting.py.
What exactly has been changed?
We can use any diffchecker to compare the original gradient_boosting.py to the new one. This will show you all the steps I took to create this new script, but the main step is to modify the _fit_stage() and _decision_function() methods. Within _fit_stage(), which is responsible for fitting the learner at each boosting iteration, we notice that the base learner (named tree) is instantiated with DecisionTreeRegressor, so we simply need to add the following conditions in order for this method to use the custom learner specified by the base_estimator argument instead:
if (self.base_estimator is None or
        isinstance(self.base_estimator,
                   (BaseDecisionTree, BaseForest))):
    # Original code for decision trees will go here.
else:
    base_learner = self.base_estimator
    if X_csc is not None:
        base_learner.fit(X_csc, residual, sample_weight=sample_weight)
    else:
        base_learner.fit(X, residual, sample_weight=sample_weight)
    if X_csr is not None:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X_csr).ravel()
    else:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X).ravel()
    self.estimators_[i, k] = base_learner
Next, we may inspect _decision_function(), which is responsible for computing the boosted result when the top-level predict() function is called from GradientBoostingRegressor. Here, the heart of the calculation lies in a function called predict_stages(), which is the rather low-level Cython (.pyx) implementation of the boosting step intended for the tree-based ensemble only. In order to compute this step for an arbitrary base learner instead, we circumvent the entire predict_stages() calculation and enter it manually at the Python level:
def _decision_function(self, X):
    score = self._init_decision_function(X)
    score += self.learning_rate * sum(
        estimator[0].predict(X) for estimator in self.estimators_).reshape((-1, 1))
    # predict_stages(self.estimators_, X, self.learning_rate, score)
    return score
The above two changes are what I consider to be the main changes that need to be made. The rest are minor technical details, and they include:
Showing an error message if feature importances are requested when using a base learner that doesn't include a feature importance method
Including the base estimator in parameter checking and estimator validation (refer to the original script to see details on these, as I'm not entirely familiar)
Including the base estimator in the class headers and interfaces (e.g., including the super text in the class definitions for BaseGradientBoosting and GradientBoostingRegressor)
What can't be done (yet)?
The base learner that you choose must support a sample_weight parameter in its fit method. For this reason, I was unable to plug in a spline-like regressor (such as PyEarth). Please let me know if you or anyone else manages to achieve this.
Also note that when using non-tree base learners, you may still pass tree-related arguments without error, but of course, they will be entirely ignored.
Here are things my script can't (necessarily) do:
Work with other losses besides the default, 'ls'. (Perhaps it can, but I haven't tried.)
Allow base_estimator to be passed to GradientBoostingClassifier
Overall, this was only a modest beginning toward truly custom base learners. Hope it helps.
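As an aside, if you only need the basic behaviour (squared-error loss) and would rather not patch scikit-learn, the residual-fitting idea that the modified _fit_stage() implements can also be written out by hand. A rough sketch:

import numpy as np
from sklearn.linear_model import LinearRegression

def ols_boost_fit(X, y, n_stages=100, learning_rate=0.1):
    """Hand-rolled gradient boosting with OLS base learners (squared loss)."""
    stages = []
    prediction = np.full(len(y), np.mean(y))   # constant initial model
    for _ in range(n_stages):
        residual = y - prediction              # negative gradient for squared loss
        learner = LinearRegression().fit(X, residual)
        prediction += learning_rate * learner.predict(X)
        stages.append(learner)
    return np.mean(y), stages

def ols_boost_predict(init, stages, X, learning_rate=0.1):
    score = np.full(X.shape[0], init)
    for learner in stages:
        score += learning_rate * learner.predict(X)
    return score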

How to get predictions out of tensorflow model after you've used tf.group on your optimizers

I'm trying to write something similar to Google's wide and deep learning model after running into difficulties doing multi-class classification (12 classes) with the sklearn API. I've tried to follow the advice in a couple of posts and used tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work, but I was trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model is learning to weight the logistic and deep models differently, but I don't know how to get these weights out so I can get the right combination of the two models' predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?
tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., weights), you should pass it to session.run() explicitly, either in the same call as the training step, or in a separate session.run() invocation. You can pass a list of values to session.run(), for example, your tf.group() expression, as well as a Tensor whose value you would like to compute.
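A minimal sketch of that pattern (the op and tensor names here are illustrative, not from your graph):

import tensorflow as tf

# assume these already exist in your graph:
#   wide_train_op, deep_train_op  - the two optimizers' minimize ops
#   predictions                   - the tensor you want to inspect
#   x, y                          - input placeholders
train_op = tf.group(wide_train_op, deep_train_op)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # run the grouped training step and fetch the predictions in one call
    _, preds = sess.run([train_op, predictions],
                        feed_dict={x: batch_x, y: batch_y})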
Hope that helps!
