I'm studying the code of the TensorFlow convolutional neural network tutorial, which trains a CNN on the CIFAR-10 dataset. The source code is here on GitHub and the documentation is here.
My question is specifically about the use of ExponentialMovingAverage (doc here) in cifar10.py, lines 375-378, which are:
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')

return train_op
Here, variables_averages_op is an operation that updates all the shadow variables, and apply_gradient_op is an operation that applies the computed gradients to all the original variables (i.e., it updates the model weights).
Since control_dependencies does not guarantee the execution order of the operations passed to it, apply_gradient_op and variables_averages_op may run in either order here. That means that when train_op is run, we could either update the original variables first and then the corresponding shadow variables, or update the shadow variables before the original ones. The latter seems unreasonable to me.
According to the official doc of ExponentialMovingAverage (link above), the update of a shadow variable relies on the original variable:
shadow_variable = decay * shadow_variable + (1 - decay) * variable
The update of the original variable should therefore happen before the update of its shadow, which is not guaranteed by the tutorial code.
Can anyone help me clear this up? Thanks.
I believe you are right. It does seem like a bug in the example. It is probably not important in practice, as the order of the variable update and the moving-average update is likely to be stable. Even if it is the "wrong" order, in the worst case your moving average will be "one step ahead of the variable", which is likely to have a smaller effect than changing your decay from 0.999 to 0.998 or something like that.
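If you want the order to be explicit, nesting the dependencies is enough. Here is a minimal sketch in the tutorial's TF1 style, reusing the names variable_averages and apply_gradient_op from cifar10.py (this is just one way to force the order, not necessarily what the pull request below does):

with tf.control_dependencies([apply_gradient_op]):
    # The moving-average update now always sees the freshly updated weights.
    variables_averages_op = variable_averages.apply(tf.trainable_variables())

with tf.control_dependencies([variables_averages_op]):
    train_op = tf.no_op(name='train')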
Just created a pull request to fix this: https://github.com/tensorflow/models/pull/3946
I want to make a gradient-accumulating SGD optimizer for tf.keras (not standalone Keras). I have found a couple of implementations of accumulated SGD optimizers for standalone Keras, including this one on PyPI. However, the project I am working on uses tf.keras, and from what I have seen it is not a good idea to mix the two.
The problem is that the documentation for writing such a custom optimizer is not really straightforward. The base class (which I should inherit from) lives in optimizer_v2.py, which contains some information about the task in its comments.
The required methods that should be overridden are:
- resource_apply_dense (update variable given gradient tensor is dense)
- resource_apply_sparse (update variable given gradient tensor is sparse)
- create_slots (if your optimizer algorithm requires additional variables)
- get_config (serialization of the optimizer, include all hyper parameters)
Of course, of these, only get_config() actually exists in the base class under that name. resource_apply_dense is actually _resource_apply_dense, resource_apply_sparse is _resource_apply_sparse, and create_slots does not even exist in the base class. In subclasses such as SGD in gradient_descent.py, it appears as _create_slots.
Anyway, the documentation is apparently out of date (there is also a GitHub issue pointing out this inconsistency, but I don't remember the link), and this makes the whole procedure difficult. For example, for SGD I have to override the _resource_apply_dense() method, but I cannot understand where the gradients are calculated and where they are applied.
The actual code is given below:
def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))

    if self._momentum:
        momentum_var = self.get_slot(var, "momentum")
        return training_ops.resource_apply_keras_momentum(
            var.handle,
            momentum_var.handle,
            coefficients["lr_t"],
            grad,
            coefficients["momentum"],
            use_locking=self._use_locking,
            use_nesterov=self.nesterov)
    else:
        return training_ops.resource_apply_gradient_descent(
            var.handle, coefficients["lr_t"], grad, use_locking=self._use_locking)
This obviously relies on training_ops.resource_apply_keras_momentum and training_ops.resource_apply_gradient_descent to do the actual work. How can I separate, starting from the code above, the two parts mentioned in the minimize() method of OptimizerV2, namely _compute_gradients() and apply_gradients()?
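For reference, here is a hedged sketch of how those two halves can be driven manually through the public API, in eager mode with a GradientTape (model, loss_fn, x, y and opt are placeholder names, not from the question):

import tensorflow as tf

# Assumes: `model` is a tf.keras model, `opt` a tf.keras optimizer,
# `loss_fn` a loss callable, and (x, y) one batch of data.
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))

grads = tape.gradient(loss, model.trainable_variables)       # the "compute gradients" half
opt.apply_gradients(zip(grads, model.trainable_variables))   # the "apply gradients" half

Internally, _resource_apply_dense is called from inside apply_gradients, once per variable, which is why it only ever sees an already-computed gradient.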
There are a lot of confusing parts in these comments, for example this passage from the base class:
Many optimizer subclasses, such as Adam and Adagrad, allocate and manage additional variables associated with the variables to train. These are called Slots. Slots have names and you can ask the optimizer for the names of the slots that it uses.
although if I declare an Adam optimizer and ask for its slot names, I get an empty list (?):
optimizer = Adam(lr=1e-3)
optimizer.get_slot_names()
[]
Another confusing issue is the use of private methods: it is not clear when they are called and what their purpose is. For example, _prepare_local() exists in SGD and includes the line:
apply_state[(var_device, var_dtype)]["momentum"] = array_ops.identity(
    self._get_hyper("momentum", var_dtype))
Anyway, the problem is that I do not know exactly which approach to follow to create a custom tf.keras optimizer. The instructions in the comments seem to contradict the actual implemented subclasses, and those subclasses in turn seem to delegate the real work to C++ ops without making it clear how that is done or how (in my case) to separate the individual steps, such as computing the gradients and applying them. So, is there any advice on how to proceed, and which steps to follow, to accomplish this (relatively) simple task?
I am using tf 1.15 by the way (so the links are from there).
Reference for the optimizer: DiffGrad (Adam-like)
https://github.com/evanatyourservice/diffGrad-tf/blob/master/diffgrad.py
It is based on a paper called DiffGrad; the paper has good explanations and is generally a good read.
First of all, good question; secondly, the TensorFlow documentation could do a lot better. Answers to your various questions, in no particular order:
Regarding the empty slot list for Adam: as far as I have seen, you have to run model.fit once on a model for the slots to be initialized. I remember reading about this while looking up how to save and load optimizer states (check whether model.compile is already enough).
As for _prepare_local, that line creates the momentum variable from the hyperparameter you set at creation time. I suppose it makes it accessible to all the weights the optimizer is trying to update; why they use identity is deep TensorFlow graph stuff.
_prepare_local is generally used to create values that are common across all the weights being updated, such as decays, learning rates, or time steps. On every iteration these values are shared by all the variables tracked in the optimizer's var_list.
Unlike the above _prepare_local, slots are separate variables for each weight tracked by the optimizer, so you might have moments, history, or cumulative sums there: anything tied to that specific individual weight.
Gradient compute and apply: if I understand this correctly, computing the gradients takes the loss, does backpropagation and automatic differentiation, and gets you the "gradients" for each weight. When you go to apply them is when the optimizer comes into play with its slots and variables. Finally, the optimizer performs the update with the computed gradients as inputs.
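To pull these pieces together, here is a hedged skeleton of an OptimizerV2 subclass (names like MySGD and the grad_sum slot are mine; it is only a plain-SGD sketch with an illustrative accumulator slot, not a full accumulated-SGD implementation, and the signatures follow the TF 1.15 sources quoted above):

import tensorflow as tf
from tensorflow.python.keras.optimizer_v2 import optimizer_v2

class MySGD(optimizer_v2.OptimizerV2):
    """Plain SGD that also keeps a running sum of gradients per weight."""

    def __init__(self, learning_rate=0.01, name="MySGD", **kwargs):
        super(MySGD, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _create_slots(self, var_list):
        # One extra ("slot") variable per model weight.
        for var in var_list:
            self.add_slot(var, "grad_sum")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        grad_sum = self.get_slot(var, "grad_sum")
        acc_update = grad_sum.assign_add(grad)    # update this weight's slot
        var_update = var.assign_sub(lr * grad)    # the actual SGD step
        return tf.group(var_update, acc_update)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        raise NotImplementedError("Sparse updates are not covered in this sketch.")

    def get_config(self):
        config = super(MySGD, self).get_config()
        config.update(
            {"learning_rate": self._serialize_hyperparameter("learning_rate")})
        return config

An instance of this can then be passed to model.compile(optimizer=MySGD(1e-3), ...) like any built-in optimizer.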
I have a simple question. I have created my own weights using tf.get_variable. For debugging purposes I need to check how many times the weights have been updated, i.e. how many times the optimizer has actually updated them.
How can this be done? If you require my code I can provide it.
That is usually called the "global step" in TensorFlow, and there are some helper functions for it:
global_step = tf.train.create_global_step()
optimizer.minimize(loss, global_step=global_step)
tf.train.create_global_step really just creates a variable, while making sure it is not trainable and adding it to the right collections. It also returns the created variable, so you can write:
optimizer.minimize(loss, global_step=tf.train.create_global_step())
And at some other point in the code retrieve the tensor like:
global_step = tf.train.get_global_step()
Moreover, you can use tf.train.get_or_create_global_step if you are not sure which part of the graph definition will be built first.
There is one more function, tf.train.global_step, but I don't think it serves much purpose nowadays, since it just runs the given tensor in the given session and returns its value as an integer.
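To actually read the counter as an integer during training, something like this works (a small sketch assuming a TF1 Session-based loop; loss and optimizer are placeholders from your own code):

global_step = tf.train.get_or_create_global_step()
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    # Each successful run of train_op increments the counter by one.
    print("weights updated %d times" % sess.run(global_step))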
I would like to run the same RNN cell over two inputs in Tensorflow.
My code:
def lstm_cell():
    return tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True)

self.forward_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(layers)], state_is_tuple=True)
self.initial_state = self.forward_cell.zero_state(self.batch_size, tf.float32)

outputs1, state1 = tf.nn.dynamic_rnn(self.forward_cell, input1, initial_state=self.initial_state)
outputs2, state2 = tf.nn.dynamic_rnn(self.forward_cell, input2, initial_state=self.initial_state)
My question now is: is this the correct code to do what I want (use the SAME RNN on both inputs, i.e. share the weights)?
In a similar post I found a solution using reuse_variables(): Running the same RNN over two tensors in tensorflow.
I would go for that, but with my current solution I do not get a reuse error, which confuses me. When I print my variables it seems to be fine, too.
Could you explain why there is no reuse error in my case, and if this is correct?
Update:
After I double-checked the source code of 1.6, I found that my memories of earlier versions are no longer accurate (so thanks for bringing this up!). Your code indeed reuses the cell variables, because cells are initialized lazily and only once (see the RNNCell.build() method, which actually creates the kernel and bias). Once a cell is built, it is not rebuilt on the next call. This means that a single cell instance always holds the same variables, no matter how often it is used in different networks, until you manually reset its built state. That's why the reuse parameter no longer matters.
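A quick way to convince yourself of the sharing is to list the trainable variables after both dynamic_rnn calls have been built (a small check using the code from the question):

for v in tf.trainable_variables():
    print(v.name)
# With a shared MultiRNNCell you see one kernel/bias pair per layer;
# if the cell were duplicated, a second set of them would show up.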
Original answer (no longer valid):
Your current code creates two independent RNN layers (each one deep), with the same initial state. This means they have different weight matrices, different nodes in the graph, etc. TensorFlow has nothing to complain about, because it doesn't know they are intended to be shared. That's why you should specify reuse=True before calling tf.nn.dynamic_rnn, as the question you refer to suggests; this will make TensorFlow share the kernels of all cells.
I've been playing with the boosting functions in sklearn and I've noticed a key difference between sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.AdaBoostRegressor. While the latter allows the user to specify the base learner, the former does not: sklearn.ensemble.GradientBoostingRegressor only uses trees. This is a bit annoying, as it would be nice to use OLS and spline base learners within gradient boosting. Am I missing something? Does another function within the sklearn library, or a different Python library, offer this functionality?
Here's one way of doing it.
Replace sklearn/ensemble/gradient_boosting.py with this script.
You'll then be able to pass base_estimator to GradientBoostingRegressor.
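For concreteness, usage with the patched file would look something like this (hypothetical in the sense that base_estimator only exists after replacing gradient_boosting.py as described; the linear base learner works because LinearRegression.fit accepts sample_weight):

from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = load_boston(return_X_y=True)

# base_estimator is the argument added by the modified script.
gbr = GradientBoostingRegressor(base_estimator=LinearRegression(),
                                n_estimators=100, learning_rate=0.1)
gbr.fit(X, y)
print(gbr.predict(X[:5]))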
If you're satisfied, please enjoy. If not, please see below for a discussion.
Demonstration
Before we begin, I'd like to mention a few things. All of the plots in this post can be reproduced from the scripts available in my repo for this project. Simply replace the contents of restore.sh and run.sh with the appropriate directories for your local installation of Sklearn. Just keep in mind that the two bash scripts will permanently overwrite your existing Sklearn file (gradient_boosting.py). You may recover the file by simply copying it from the Sklearn repo. I don't claim that anything demonstrated in this post is good practice for developing new features for Sklearn. I'm not affiliated with Sklearn, nor am I an experienced software developer. All scripts were tested on sklearn 0.18.2.
Now, using the new gradient_boosting.py script, we may specify the linear base learner (as you typically would for AdaBoostRegressor). As a sanity check, we may fit the linear-base GBR on a single variable and ensure that it produces a linear visual when plotted against that variable. Here's the result for four (arbitrary) variables from the Boston Housing dataset.
As another sanity check, we may ensure that the results of the original tree-base GBR can be reproduced by passing a tree base_estimator to the new GBR. The following plot is an MSE 10-fold cross-validation profile with respect to the boosting iterations. That is, for each number of boosting iterations, I cross-validate to obtain a vector of 10 MSE scores and compute the mean/min of this vector. The left plot is the original GBR, while the right plot is the one using the new gradient_boosting.py.
What exactly has been changed?
We can use any diffchecker to compare the original gradient_boosting.py to the new one. This will show you all the steps I took to create this new script, but the main step is to modify the _fit_stage() and _decision_function() methods. Within _fit_stage(), which is responsible for fitting the learner at each boosting iteration, we notice that the base learner (named tree) is instantiated with DecisionTreeRegressor, so we simply need to add the following conditions in order for this method to use the custom learner specified by the base_estimator argument instead:
if (self.base_estimator is None or
        isinstance(self.base_estimator,
                   (BaseDecisionTree, BaseForest))):
    # Original code for decision trees will go here.
else:
    base_learner = self.base_estimator
    if X_csc is not None:
        base_learner.fit(X_csc, residual, sample_weight=sample_weight)
    else:
        base_learner.fit(X, residual, sample_weight=sample_weight)
    if X_csr is not None:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X_csr).ravel()
    else:
        y_pred[:, k] += self.learning_rate * base_learner.predict(X).ravel()
    self.estimators_[i, k] = base_learner
Next, we may inspect _decision_function(), which is responsible for computing the boosted result when the top-level predict() function is called from GradientBoostingRegressor. Here, the heart of the calculation lies in a function called predict_stages(), which is the rather low-level Cython (.pyx) implementation of the boosting step intended for the tree-based ensemble only. In order to compute this step for an arbitrary base learner instead, we circumvent the entire predict_stages() calculation and enter it manually at the Python level:
def _decision_function(self, X):
    score = self._init_decision_function(X)
    score += self.learning_rate * sum(
        estimator[0].predict(X) for estimator in self.estimators_).reshape((-1, 1))
    # predict_stages(self.estimators_, X, self.learning_rate, score)
    return score
The above two changes are what I consider to be the main changes that need to be made. The rest are minor technical details, and they include:
- Showing an error message if feature importances are requested when using a base learner that does not provide a feature importance method
- Including the base estimator in parameter checking and estimator validation (refer to the original script for details on these, as I'm not entirely familiar with them)
- Including the base estimator in the class headers and interfaces (e.g., in the super calls of the class definitions for BaseGradientBoosting and GradientBoostingRegressor)
What can't be done (yet)?
The base learner that you choose must support a sample_weight parameter in its fit method. For this reason, I was unable to plug in a spline-like regressor (such as PyEarth). Please let me know if you or anyone else manages to achieve this.
Also note that when using non-tree base learners, you may still pass tree-related arguments without error, but of course, they will be entirely ignored.
Here are things my script can't (necessarily) do:
- Work with losses other than the default, 'ls'. (Perhaps it can, but I haven't tried.)
- Allow base_estimator to be passed to GradientBoostingClassifier.
Overall, this was only a modest beginning toward truly custom base learners. Hope it helps.
I was checking the Caffe LeNet Tutorial here and a question came to mind:
What's the difference between these two pieces of code:
self.solver.step(1)
and
self.solver.net.forward() # train net
They both seem to train the network, at least according to the comment.
Personally, I think the first one trains the network on the training data and updates the weights of both net and test_net, while the second one seems to only forward a batch of data and apply the weights learned in the previous step.
If what I think is right, then what is the purpose of the second call in the tutorial? Why does the code do a net.forward()? Can't solver.step(1) do this by itself?
Thanks for your time
step does one full iteration, covering all three phases: forward evaluation, backward propagation, and update. The call to forward does only the first of these. There are also differences in the signature (parameter list).
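To make that concrete, here is a rough pycaffe sketch (solver is assumed to be an already-initialized solver object; the comments describe what happens conceptually, not the exact internals):

# One call to solver.step(1) is roughly equivalent to:
solver.net.forward()     # forward pass on the current training batch
solver.net.backward()    # backpropagation: fills the parameter diffs
# ... plus the weight update (using the learning rate/momentum from the
# solver prototxt) and the iteration-counter increment, which the solver
# applies internally.

# solver.net.forward() on its own only runs the forward pass;
# the weights stay unchanged.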
I discovered some strange behavior with solver.step(1) and solver.net.forward(). When I use a custom layer as the input of the network, my layer instance needs a variable to be set before use:
solver.net.layers[0].mySet(variable)
That value was stored in a local variable of my layer. But when I call solver.step, the variable does not appear; it does, however, when I use solver.net.forward(). I am not certain, but maybe solver.step is instantiating a new variable for the layer.