Tensorflow: global_step not incremented; hence exponentialDecay not working - python

I'm trying to learn TensorFlow, and I wanted to use TensorFlow's cifar10 tutorial framework and train it on MNIST (combining the two tutorials).
In cifar10.py's train method:
cifar10.train(total_loss, global_step):
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                global_step,
                                100,
                                0.1,
                                staircase=True)
tf.scalar_summary('learning_rate', lr)
tf.scalar_summary('global_step', global_step)
Here, global_step is initialized and passed in, it increases by 1 per step, and the learning rate decays properly. The source code can be found in TensorFlow's cifar10 tutorial.
However, when I tried to do the same in my revised mnist.py's training method:
mnist.training(loss, batch_size, global_step):
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(0.1,
                                global_step,
                                100,
                                0.1,
                                staircase=True)
tf.scalar_summary('learning_rate1', lr)
tf.scalar_summary('global_step1', global_step)
# Create the gradient descent optimizer with the given learning rate.
optimizer = tf.train.GradientDescentOptimizer(lr)
# Create a variable to track the global step.
global_step = tf.Variable(0, name='global_step', trainable=False)
# Use the optimizer to apply the gradients that minimize the loss
# (and also increment the global step counter) as a single training step.
train_op = optimizer.minimize(loss, global_step=global_step)
tf.scalar_summary('global_step2', global_step)
tf.scalar_summary('learning_rate2', lr)
return train_op
The global step is initialized (in both cifar10 and my mnist file) as:
with tf.Graph().as_default():
    global_step = tf.Variable(0, trainable=False)
    ...
    # Build a Graph that trains the model with one batch of examples and
    # updates the model parameters.
    train_op = mnist10.training(loss, batch_size=100,
                                global_step=global_step)
Here, I record the scalar_summary of global step and learning rate twice:
learning_rate1 and learning_rate2 are both the same and constant at 0.1 (initial learning rate).
global_step1 is also constant at 0 across 2000 steps.
global_step2 is increasing linearly 1 per step.
The more detailed code structure can be found at:
https://bitbucket.org/jackywang529/tesorflow-sandbox/src
It's confusing to me why this happens: I thought everything was set up symbolically, so once the program starts running, the global step should be incremented no matter where I write the summary. I think this is also why my learning rate stays constant. Of course I might have made a simple mistake, and I would be glad for any help or explanation.
[Plot: global_step summaries recorded before and after the minimize call]

You are passing an argument called global_step to mnist.training, AND also creating a variable called global_step inside mnist.training. The one used for tracking the exponential_decay is the variable that is passed in, but the one that is actually incremented (by passing it to optimizer.minimize) is the newly created variable. Simply remove the following statement from mnist.training and things should work:
global_step = tf.Variable(0, name='global_step', trainable=False)
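For reference, a minimal sketch of the corrected mnist.training (keeping the API used in the question; the summary name is illustrative) might look like:
def training(loss, batch_size, global_step):
    # Decay the learning rate based on the global_step that is passed in.
    lr = tf.train.exponential_decay(0.1, global_step, 100, 0.1, staircase=True)
    tf.scalar_summary('learning_rate', lr)
    optimizer = tf.train.GradientDescentOptimizer(lr)
    # Reuse the caller's global_step so that minimize() increments the same
    # variable that exponential_decay reads; do not create a new one here.
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op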

Related

Continue training of a custom tf.Estimator with AdamOptimizer

I created a custom tf.Estimator whose weights I'm training using the tf.train.AdamOptimizer. When I continue training of an existing model, I observe a steep change in the metrics at the start of the continued training in Tensorboard. After a few steps, the metrics stabilise. The behaviour looks similar to the initial transients when training a model. The behaviour is the same if I continue training on the same Estimator instance, or if I recreate the estimator from a checkpoint. I suspect that the moving averages and/or the bias correction factor are reset when restarting the training. The model weights themselves seem to be properly restored, as the metrics do continue from where they settled before, only the effective learning rate seems to be too high.
Previous Stack Overflow answers seem to suggest that these auxiliary learning parameters should be stored in the checkpoints together with the model weights. So what am I doing wrong here? How can I control the restoring of these auxiliary variables? I would like to be able to continue training as if it had never been stopped. However, other people sometimes seem to look for the opposite control, to completely reset the optimizer without resetting the model weights. An answer that shows how both effects can be achieved would probably be most helpful.
Here is a sketch of my model_fn:
def model_fn(features, labels, mode, params):
    inputs = features['inputs']
    logits = create_model(inputs, training=mode == tf.estimator.ModeKeys.TRAIN)
    if mode == tf.estimator.ModeKeys.PREDICT:
        ...
    if mode == tf.estimator.ModeKeys.TRAIN:
        outputs = labels['outputs']
        loss = tf.losses.softmax_cross_entropy(
            tf.one_hot(outputs, tf.shape(inputs)[-1]),
            logits,
            # reduction=tf.losses.Reduction.MEAN,
        )
        optimizer = tf.train.AdamOptimizer(learning_rate=params.learning_rate)
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())
        accuracy = tf.metrics.accuracy(
            labels=outputs,
            predictions=tf.argmax(logits, axis=-1),
        )
        tf.summary.histogram('logits', logits)
        tf.summary.scalar('accuracy', accuracy[1])
        tf.summary.scalar('loss', loss)
        return tf.estimator.EstimatorSpec(
            mode=tf.estimator.ModeKeys.TRAIN,
            loss=loss,
            train_op=train_op)
    if mode == tf.estimator.ModeKeys.EVAL:
        ...
    raise ValueError(mode)
The training step is called as follows:
cfg = tf.estimator.RunConfig(
    save_checkpoints_secs = 5*60,  # Save checkpoints every 5 minutes.
    keep_checkpoint_max = 10,      # Retain the 10 most recent checkpoints.
    save_summary_steps = 10,
    log_step_count_steps = 100,
)
estimator = tf.estimator.Estimator(
    model_fn = model_fn,
    params = dict(
        learning_rate = 1e-3,
    ),
    model_dir = model_dir,
    config=cfg,
)
# train for the first time
estimator.train(
    input_fn=train_input_fn,
)
# ... at some later time, train again
estimator.train(
    input_fn=train_input_fn,
)
EDIT:
The documentation of the warm_start_from argument of tf.estimator.Estimator and of tf.estimator.WarmStartSettings is not entirely clear about what exactly happens in the default case, which is what I am using in the example above. However, the documentation of tf.train.warm_start (https://www.tensorflow.org/api_docs/python/tf/train/warm_start) seems to suggest that in the default case, all TRAINABLE_VARIABLES will be warm-started, which
excludes variables such as accumulators and moving statistics from batch norm
Indeed, I find Adam's accumulator variables in VARIABLES, but not in TRAINABLE_VARIABLES. These documentation pages also state how to change the list of warm-started variables, to either a list of tf.Variable instances, or a list of their names. However, one question remains: How do I create one of those lists in advance, given that with tf.Estimator, I have no graph to collect those variables/their names from?
EDIT2:
The source-code of warm_start highlights an undocumented feature: The list of variable names is in fact a list of regexes, to be matched against GLOBAL_VARIABLES. Thus, one may use
warm_start_from=tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from=str(model_dir),
    # vars_to_warm_start=".*",   # everything in TRAINABLE_VARIABLES - excluding optimiser params
    vars_to_warm_start=[".*"],   # everything in GLOBAL_VARIABLES - including optimiser params
),
to load all variables. However, even with that, the spikes in the summary statistics remain, and I am now completely at a loss as to what is going on.
By default metrics are added to the local variables and metric variables collections, and these are not checkpointed by default.
If you want to include them in checkpoints, you can either append metric variables to the global variables collection:
for var in tf.get_collection(tf.GraphKeys.METRIC_VARIABLES):
    tf.add_to_collection(tf.GraphKeys.GLOBAL_VARIABLES, var)
Or you can return a Scaffold with a custom Saver set, passing the variables to checkpoint to Saver's var_list argument. This defaults to the global variables collection.
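A minimal sketch of the Scaffold approach, assumed to be returned from the TRAIN branch of the model_fn above (variable names are illustrative), might look like:
# Checkpoint all global variables plus the metric variables that are
# normally left out of checkpoints.
vars_to_save = (tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES) +
                tf.get_collection(tf.GraphKeys.METRIC_VARIABLES))
saver = tf.train.Saver(var_list=vars_to_save)
return tf.estimator.EstimatorSpec(
    mode=tf.estimator.ModeKeys.TRAIN,
    loss=loss,
    train_op=train_op,
    scaffold=tf.train.Scaffold(saver=saver))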

Adjust number of samples in Tensorflow nce_loss while training

I'd like to adjust the sampling rate during training of my neural network to test some stuff and see what happens. To achieve that, my idea was to create a new loss and optimizer for every iteration, using the same computation graph.
def optimize(self, negative_sampling_rate):
    return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(
        self.calc_loss(negative_sampling_rate))

def calc_loss(self, negative_sampling_rate):
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=self.graph.prediction_weights,
        biases=self.graph.prediction_bias,
        labels=self.graph.labels,
        inputs=self.graph.hidden,
        num_sampled=negative_sampling_rate,
        num_classes=self.graph.prediction_weights.shape[1])
    )

def train(self, batch_inputs, batch_labels, negative_sampling_rate):
    feed_dict = {self.graph.X: batch_inputs, self.graph.labels: batch_labels}
    _, loss_val = self.session.run(
        [self.optimize(negative_sampling_rate), self.calc_loss(negative_sampling_rate)],
        feed_dict=feed_dict
    )
    return loss_val
But I'm a little bit worried about the optimizer. I've heard that optimizers have internal variables which change on every training iteration. Is that true for all optimizers or only for some, and if so, which ones are usable for this approach?
The neural network should then be trained like:
for step in range(training_steps):
    NN.train(inputs, labels, compute_sampling_rate(step))
First of all, it should be okay to change the number of samples in the nce loss without causing problems for the optimizer. The internal variables stored by some optimizers relate to the historical gradients of the trainable variables in your graph.
Secondly, if you do want to reset the state of your optimizer for some reason, the way I do it is by putting the optimizer in a variable scope. Whenever I want to reset it, I run the reset_optimizer op.
reset_optimizer = tf.no_op()
with tf.variable_scope('optimizer'):
    train_op = optimizer(learning_rate).minimize(loss)
opt_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "optimizer")
if len(opt_vars):  # check if optimizer state needs resetting
    reset_optimizer = tf.variables_initializer(opt_vars)
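As a usage sketch (assuming a running session sess), the optimizer state can then be cleared between training phases with:
# Re-initialize only the optimizer's slot variables (e.g. Adam moments),
# leaving the model weights untouched.
sess.run(reset_optimizer)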

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., updating the weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # sum to get a scalar loss, since backward() needs a scalar
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (they are None until the first backward pass)
    if W.grad is not None:
        W.grad.data.zero_()
    if b.grad is not None:
        b.grad.data.zero_()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update the raw data so autograd does not track the update step
    W.data -= learning_rate * W.grad.data
    b.data -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
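As a minimal sketch, the only change to the training loop above would be the zeroing call (assuming PyTorch >= 1.7):
# set gradients to None instead of filling them with zeroes
optimizer.zero_grad(set_to_none=True)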
Although the idea can be derived from the chosen answer, I feel it is worth stating explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom in how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit into GPU memory.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass and the loss.backward() that follows it. This is limited by GPU memory.
gradient_accumulation_steps determines the effective training batch size, because the loss from multiple forward passes is accumulated before each weight update. This is NOT limited by GPU memory.
From this example, you can see that optimizer.zero_grad() and optimizer.step() do not have to follow every loss.backward(): loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block at line 219). A sketch of this pattern follows the link below.
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
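A minimal sketch of this accumulation pattern (model, loader, loss_fn, and the concrete value of gradient_accumulation_steps are illustrative placeholders, not taken from the linked script) might look like:
gradient_accumulation_steps = 4  # effective batch = 4 * the loader's batch size

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = loss_fn(model(inputs), labels)
    # scale so the accumulated gradient matches the average over the large batch
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # start accumulating the next large batch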
Also, someone asked about an equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.
(I am still new to these libraries, so please correct me if anything I said is wrong.)
zero_grad() restarts the loop without carrying over gradients from the last step when you use a gradient-based method for decreasing the error (or loss).
If you do not use zero_grad(), the loss may increase instead of decreasing as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learnt from the samples (inputs) it has seen. When we start backpropagation, we want to update the weights so as to minimise the loss of our cost function. So we clear the previous gradients before computing new ones, in order to obtain better weight updates. We keep doing this during training, and we do not do it during testing, because at test time we simply use the weights that were fitted to our data during training. Hope this clears things up!
In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients or past results to interfere with the current results. PyTorch collects/accumulates the gradients during backpropagation, so past results could mix in and give us wrong results; therefore we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients via the optimizer every time, the past values may get added up and change the result.
So we use zero_grad() to avoid wrongly accumulated results.

TensorFlow why my cost function doesn't decrease?

I'm using a very simple NN with a normalized word2vec as input.
When running my training (based on mini-batches), the training cost starts around 1020 and decreases to around 1000, but never goes below that, and my accuracy is around 50%.
Why doesn't the cost decrease? How can I verify that the weight matrix is updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")

cost_OP = tf.nn.l2_loss(activation_OP - yGold, name="squared_error_cost")

optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)

correct_predictions_OP = tf.equal(
    tf.argmax(activation_OP, 0),
    tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))

newCost, train_accuracy, _ = sess.run(
    [cost_OP, accuracy_OP, training_OP],
    feed_dict={
        X: trainX[indice_bas: indice_haut],
        yGold: trainY[indice_bas: indice_haut]
    }
)
Thanks
Try using cross entropy instead of the L2 loss; also, there is no real point in having an activation function on your output layer.
The examples that ship with TensorFlow actually have a basic model that is very similar to what you are trying to do.
By the way, it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (which is what you are using), so try using a deeper model. Here is an example of a 2-layer multilayer perceptron.
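Separately, as a sketch of the cross-entropy suggestion above (reusing add_bias_OP and yGold from the question, and assuming yGold is one-hot encoded), the loss could be computed from the raw logits:
# Cross-entropy computed from the logits; no sigmoid/softmax is needed on
# the output layer, the loss op applies it internally.
cost_OP = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=add_bias_OP, labels=yGold),
    name="cross_entropy_cost")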

Train a tensorflow model minimizing the loss of several batches

I would like to train the weights of a model based on the sum of the loss values of several batches. However, it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try to use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example TensorFlow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])
loss_op = some_loss_function(weights, feature_values, labels)

with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        # Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        # Adds it to total loss
        total_loss += loss

# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)
with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around the problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that the total_loss is a function of the weights?
The thing you optimize with any of the trainers must be part of the graph; here what you train on is the actual realized (numpy) result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches, e.g.
input = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3d tensor which can be trained on.
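A minimal sketch of that idea (the names mirror the question; it assumes some_loss_function can be applied to one whole batch at a time) could look like:
# Inputs and targets for all batches at once, as suggested above.
inputs = tf.placeholder(tf.float32, (number_of_batches, batch_size, input_size))
targets = tf.placeholder(tf.int32, (number_of_batches, batch_size, 1))

# Keep the total loss symbolic: sum the per-batch loss ops inside the graph,
# so the optimizer can still trace gradients back to the weights.
per_batch_losses = [some_loss_function(weights, inputs[i], targets[i])
                    for i in range(number_of_batches)]
total_loss_op = tf.add_n(per_batch_losses)
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss_op)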
