As the question says: is there any way to change the learning rate of an Optimizer object? Say I continue some training procedure using a snapshot of a prior model, but since creating that snapshot I realise that my model is skipping around the energy landscape instead of settling down and producing some pretty output. Is there any way to load the Optimizer object and set a new learning rate post hoc?
There's the object field _lr in the Python memory space and _lr_t (in the TF memory space) that are numerically equal; however, assigning to _lr doesn't change the value of _lr_t, so presumably it does not affect computations on the graph. So, how can that learning rate be changed? Do I have to construct a new Optimizer object, which then must also be attached to the network output, and just ignore the "old" Optimizer object restored from the snapshot?
If so, that seems a bit wasteful in terms of memory, and messy in terms of code, rather than just providing some setters.
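For context, here is a minimal sketch of the behaviour described above (TF 1.x API; the optimizer class and values are purely illustrative): reassigning the Python-side _lr attribute leaves the already-built graph tensor _lr_t untouched.

import tensorflow as tf  # TF 1.x style API assumed

x = tf.Variable(1.0)
loss = tf.square(x)

opt = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = opt.minimize(loss)    # _lr_t is created from _lr during minimize()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(opt._lr)               # 0.01, a plain Python float
    print(sess.run(opt._lr_t))   # 0.01, a tensor in the graph
    opt._lr = 0.001              # only changes the Python attribute...
    print(sess.run(opt._lr_t))   # ...the graph tensor still evaluates to 0.01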
I'm integrating LightGBM (Python) into a continuous-learning pipeline. My goal is to train an initial model and then update it (e.g. every day) with newly available data.
Most examples load an already trained model and apply train() once again:
updated_model = lightgbm.train(params=last_model_params, train_set=new_data, init_model = last_model)
However, I'm wondering if this is actually the correct way to approach continuous learning within the LightGBM library, since the number of fitted trees (num_trees()) grows by n_estimators with every application of train(). To my understanding, a model update should take an initial model definition (under a given set of model parameters) and refine it without ever growing the number of trees / the size of the model definition.
I find the documentation regarding train(), update() and refit() not particularly helpful. What would be considered the right approach to implement continuous learning with LightGBM?
In lightgbm (the Python package for LightGBM), the entrypoints you've mentioned do have different purposes.
The main lightgbm model object is a Booster. A fitted Booster is produced by training on input data. Given an initial trained Booster...
Booster.refit() does not change the structure of an already-trained model. It just updates the leaf counts and leaf values based on the new data. It will not add any trees to the model.
Booster.update() will perform exactly 1 additional round of gradient boosting on an existing Booster. It will add at most 1 tree to the model.
train() with an init_model will perform gradient boosting for num_iterations additional rounds. It also allows for lots of other functionality, like custom callbacks (e.g. to change the learning rate from iteration-to-iteration) and early stopping (to stop adding trees if performance on a validation set fails to improve). It will add up to num_iterations trees to the model.
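A rough sketch of the three options side by side (synthetic regression data, illustrative parameters; exact behaviour may vary slightly between lightgbm versions):

import numpy as np
import lightgbm as lgb

# Synthetic stand-ins for the original and the newly-arrived data
X_old, y_old = np.random.rand(1000, 10), np.random.rand(1000)
X_new, y_new = np.random.rand(100, 10), np.random.rand(100)

params = {"objective": "regression", "learning_rate": 0.05, "num_leaves": 31}

# Initial model
booster = lgb.train(params, lgb.Dataset(X_old, label=y_old), num_boost_round=100)

# refit(): same tree structure, leaf values re-estimated from the new data
refit_booster = booster.refit(X_new, y_new)
print(refit_booster.num_trees())   # still 100

# update(): exactly one extra boosting round
booster.update(train_set=lgb.Dataset(X_new, label=y_new))
print(booster.num_trees())         # grows by at most 1

# train(init_model=...): continue boosting for several more rounds
continued = lgb.train(params, lgb.Dataset(X_new, label=y_new),
                      num_boost_round=10, init_model=booster)
print(continued.num_trees())       # grows by up to num_boost_round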
What would be considered the right approach to implement continuous learning with LightGBM?
There are trade-offs involved in this choice, and none of these is the globally "right" way to achieve the goal "modify an existing model based on newly-arrived data".
Booster.refit() is the only one of these approaches that meets your definition of "refine [the model] without ever growing the amount of trees/size of the model definition". But it could lead to drastic changes in the predictions produced by the model, especially if the batch of newly-arrived data is much smaller than the original training data, or if the distribution of the target is very different.
Booster.update() is the simplest interface for this, but a single iteration might not be enough to get most of the information from the newly-arrived data into the model. For example, if you're using fairly shallow trees (say, num_leaves=7) and a very small learning rate, even newly-arrived data that is very different from the original training data might not change the model's predictions by much.
train(init_model=previous_model) is the most flexible and powerful option, but it also introduces more parameters and choices. If you choose to use train(init_model=previous_model), pay attention to parameters num_iterations and learning_rate. Lower values of these parameters will decrease the impact of newly-arrived data on the trained model, higher values will allow a larger change to the model. Finding the right balance between those is a concern for your evaluation framework.
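Continuing the sketch above, a conservative way to apply daily updates with train(init_model=...) is to add only a few rounds with a small (or decaying) learning rate; the values here are purely illustrative.

# `params`, `booster`, `X_new`, `y_new` as in the sketch above
continued = lgb.train(
    params,
    lgb.Dataset(X_new, label=y_new),
    num_boost_round=5,          # only a few extra trees per update
    init_model=booster,
    # optional: decay the learning rate over the extra rounds via a callback
    callbacks=[lgb.reset_parameter(learning_rate=lambda i: 0.01 * (0.9 ** i))],
)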
If I need to backpropagate through a neural network twice and I don't use retain_graph=True, I get an error.
Why? I realize it is nice to keep the intermediate variables used in the first backpropagation so they can be reused for the second one. However, why aren't they simply recalculated, the way they were originally calculated in the first backpropagation?
By default, PyTorch doesn't keep the intermediate buffers, because PyTorch's main feature is dynamic computational graphs: after backpropagation the graph is freed and all the intermediate buffers are destroyed.
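A minimal illustration of that behaviour:

import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()

loss.backward(retain_graph=True)   # graph is kept alive, so a second pass is allowed
loss.backward()                    # works

loss = (x ** 2).sum()
loss.backward()                    # graph is freed here by default
# loss.backward()                  # a second call would raise a RuntimeError (buffers already freed)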
I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Are the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?
The source of the issue:
You are right that, no matter how many times we call the backward function, the memory should not increase theoretically.
Yet your issue is not caused by the backpropagation itself, but by the retain_graph argument that you have set to True when calling the backward function.
When you run your network by passing a set of input data, you call the forward function, which will create a "computation graph".
A computation graph contains all the operations that your network has performed.
Then when you call the backward function, the saved computation graph is "basically" run backwards to work out which weights should be adjusted in which direction (i.e. the gradients).
So PyTorch keeps the computation graph in memory in order to be able to call the backward function.
After the backward function has been called and the gradients have been calculated, we free the graph from memory, as explained in the docs (https://pytorch.org/docs/stable/autograd.html):
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then usually during training we apply the gradients to the network in order to minimise the loss, then we re-run the network, and so we create a new computation graph. Yet we only have one graph in memory at any given time.
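A typical loop therefore looks like this (toy model, purely illustrative): each iteration builds a fresh graph in the forward pass, backward() consumes and frees it, and only the parameters and their gradients persist.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)

    optimizer.zero_grad()
    out = model(x)                               # builds a new computation graph
    loss = torch.nn.functional.mse_loss(out, y)
    loss.backward()                              # uses the graph, then frees it
    optimizer.step()                             # applies the gradients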
The issue:
If you set retain_graph to True when you call the backward function, you will keep in memory the computation graphs of ALL the previous runs of your network.
And since on every run of your network, you create a new computation graph, if you store them all in memory, you can and will eventually run out of memory.
On the first iteration and run of your network, you will have only one graph in memory. Yet on the 10th run of the network you have 10 graphs in memory, and on the 10,000th run you have 10,000. That is not sustainable, and it is understandable why it is not recommended in the docs.
So even though it may seem that the issue is the backpropagation, it is actually the storing of the computation graphs; and since we usually call the forward and backward functions once per iteration or network run, the confusion is understandable.
Solution:
What you need to do is find a way to make your network and architecture work without using retain_graph. Using it makes it almost impossible to train your network: each iteration increases your memory usage and decreases the speed of training, and in your case it even causes you to run out of memory.
You did not mention why you need to backpropagate multiple times, but it is rarely needed, and I do not know of a case where it cannot be "worked around". For example, if you need to access variables or weights from previous runs, you could save them in variables and access them later, instead of performing a new backpropagation.
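For example (sketch only): keep the detached value you need rather than the graph that produced it.

import torch

x = torch.randn(5, requires_grad=True)
loss = (x ** 2).sum()
loss.backward()              # graph is freed here

saved = loss.detach()        # the value survives without any graph attached
print(saved.item())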
You likely need to backpropagate multiple times for another reason, but believe me, having been in this situation myself, there is likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and I could help you more.
More about the backward process:
If you want to learn more about the backward process, it is called the "Jacobian-vector product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, but this resource seems like a good starting point, as it seems less intimidating than the PyTorch documentation (in terms of algebra): https://mc.ai/how-pytorch-backward-function-works/
I'm studying the code of the TensorFlow convolutional neural network tutorial, which trains a CNN on the CIFAR-10 dataset. The source code is here on GitHub and the documentation is here.
My question is specifically about the use of ExponentialMovingAverage (doc here) in cifar10.py, lines 375-378, which are:
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
train_op = tf.no_op(name='train')
return train_op
Here, the variables_averages_op is an operation that updates all the shadow variables and apply_gradient_op is an operation that applies computed gradients to all original variables(which updates the original variables, a.k.a. model weights).
Since control_dependencies doesn't guarantee the execution order of the operations passed to it, the execution order of apply_gradient_op and variables_averages_op is arbitrary in this example. This means that, upon running the train_op, we could end up first updating the original variables and then the corresponding shadow variables, or updating the shadow variables before the original variables. The latter seems unreasonable to me.
According to the official doc of ExponentialMovingAverage(link above), the update of the shadow variable relies on the original variable:
shadow_variable = decay * shadow_variable + (1 - decay) * variable
The update of the original variables should therefore happen before the update of the shadow ones, which is not guaranteed in the tutorial code.
Can anyone help me clear that? Thanks.
I believe you are right; it does seem like a bug in the example. It is probably not important in practice, as the order of the variable update and the moving-average update is likely to be stable. Even if it is the "wrong" order, in the worst case your moving average will be "one step ahead of the variable", which is likely to have a less significant effect than changing your decay from 0.999 to 0.998 or something like that.
Just created a pull request to fix this: https://github.com/tensorflow/models/pull/3946
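For reference, one way to enforce the ordering is to nest the dependencies so that the moving-average update can only run after the gradients have been applied (this is just a sketch, assuming variable_averages is the ExponentialMovingAverage object from the tutorial; the linked pull request may fix it differently):

with tf.control_dependencies([apply_gradient_op]):
    variables_averages_op = variable_averages.apply(tf.trainable_variables())

with tf.control_dependencies([variables_averages_op]):
    train_op = tf.no_op(name='train')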
I'm adapting the Tensorflow tutorial for sequence to sequence modeling for my project. Specifically, I am basing my code off of translate.py.
The tutorial computes the perplexity on the dev set every n training steps. I'd instead like to calculate BLEU score on the dev set.
The problem I'm facing is that when creating a model, you specify whether it is forward-only or not. Looking through the code, it seems that if it is not forward-only (which happens when training), at each step the network will not calculate the final output for the input sequence, but will calculate gradients. When it is forward-only (as in the decoding function later in the tutorial), it applies the loop function which feeds the output back into the input of the RNN, allowing the output sequence to be generated; however, it doesn't compute the gradients. So as far as I understand it, you can construct a model either for training (i.e. gradients) or for testing (i.e. performing full inference on it).
Since I want to compute the BLEU score, I need some sequence produced by the model which corresponds to an input sequence in the dev set. Because of how the models are constructed, I would need both types of models (forward-only and not forward-only). However, trying to do this (even with a new session and a new variable scope), I can't seem to load the model for inference while I also have a model created for training. Without a new session/variable scope, I get errors about duplicated variables. It would be nice if there were a way to switch the model from not forward-only to forward-only.
In this case, is there any way to perform inference (run the full RNN) while I am also in the scope of training it?
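One pattern that might help (sketch only, with a hypothetical build_model standing in for the tutorial's model construction) is to build the training and inference versions inside the same variable scope and reuse the variables for the second construction, which avoids the "duplicate variable" errors:

import tensorflow as tf  # TF 1.x style API assumed

def build_model(forward_only):
    # Hypothetical stand-in for constructing the seq2seq model in the given mode
    w = tf.get_variable("w", shape=[10, 10])
    return w if forward_only else tf.reduce_sum(w)

with tf.variable_scope("seq2seq") as scope:
    train_output = build_model(forward_only=False)
    scope.reuse_variables()                        # second construction reuses "seq2seq/w"
    eval_output = build_model(forward_only=True)   # no duplicate-variable error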