Due to limited GPU memory, I want to update my weights only after every two training steps. Specifically, the network first processes the first batch of inputs and saves the loss. It then processes the next batch, averages the two losses, and updates the weights once. This is like the average_loss op in Caffe (for example, in fcn-berkeley). Also, how should the batchnorm update ops be handled?
Easy, just use tf.reduce_mean(input_tensor).
See the TensorFlow documentation for reduce_mean.
In your case, it would be:
loss = tf.concat([loss1,loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)
Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook in a way like
import logging  # assumption: standard-library logging; tf.logging is the TF 1.x alternative
import numpy as np
import tensorflow as tf

class MyLoggingTensorHook(tf.train.LoggingTensorHook):
    # set every_n_iter to 2 if you want to average the last 2 losses
    def __init__(self, tensors, every_n_iter):
        super().__init__(tensors=tensors, every_n_iter=every_n_iter)
        # keep track of previous losses
        self.losses = []

    def after_run(self, run_context, run_values):
        _ = run_context
        # assuming you have a tag like 'average_loss'
        # as the name of your loss tensor
        for tag in self._tag_order:
            if 'average_loss' in tag:
                self.losses.append(run_values.results[tag])
        if self._should_trigger:
            self._log_tensors(run_values.results)
        self._iter_count += 1

    def _log_tensors(self, tensor_values):
        original = np.get_printoptions()
        np.set_printoptions(suppress=True)
        logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
        np.set_printoptions(**original)
        # start a fresh window of losses after logging the average
        self.losses = []
Then attach it to an estimator's train method or pass it through a TrainSpec.
You should be able to compute gradients of your variables normally in every step, but apply them only every N steps by conditioning on the global_step variable that defines your current iteration or step (you should have initialized this variable in your graph with something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this.
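The idea of computing gradients on every step but applying them only every N steps can be sketched without any framework. The NumPy snippet below is only an illustration under stated assumptions: grad_mse, accum_steps and the toy linear model are hypothetical names that do not come from TensorFlow. It suggests that, for equally sized batches, averaging the two accumulated gradients gives the same update as one step on the combined batch.

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

accum_steps = 2        # hypothetical name: apply an update every N = 2 steps
learning_rate = 0.1
global_step = 0
accumulated = np.zeros_like(w)
w_accum = w.copy()

for x_batch, y_batch in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    # compute gradients on every step ...
    accumulated += grad_mse(w_accum, x_batch, y_batch)
    global_step += 1
    # ... but apply them only on every N-th step
    if global_step % accum_steps == 0:
        w_accum -= learning_rate * accumulated / accum_steps
        accumulated[:] = 0.0

# Reference: a single update computed on all 8 samples at once
w_full = w - learning_rate * grad_mse(w, x, y)
```

Here w_accum and w_full should agree up to floating-point error. The equivalence relies on both micro-batches having the same size and on the weights staying fixed between the two gradient computations.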
Long story short: How do I fix my backpropagation code so that the weights and biases are changed effectively by my evaluate() function and the predictions end up closer to the target values, rather than being no better than guessing?
Details below:
I've currently got the backbone of this neural network from scratch which I'm creating using techniques gleaned from Sentdex's Neural Networks From Scratch series on YouTube and a Towards Data Science article for the backpropagation part specifically. It works by creating a large class called Neural Network which would have several LayerDense objects associated by composition, which would act as each layer within the neural network.
As my inputs to the neural network, I pass in a batch of 8 records from a Pandas DataFrame, each containing 100 values of 0 or 1, depending on their preferred options. As target values, I pass in another DataFrame containing the actual genders of each participant, with 0 being male and 1 being female.
These LayerDense objects would deal with the forward and backward passes of each layer. Prior to implementing the softmax function and backpropagation, this all worked as expected.
My current issue is getting the evaluate() function within the program to run as expected & getting the run() function to handle this information correctly.
In theory, the evaluate function should return the loss of each neuron, and the run function should handle this and run the backward pass through each neuron, adjusting its weights & biases appropriately.
What actually happens is that my final classification outputs, which are the confidence levels of the predictions (values closer to 0 represent a male prediction and values closer to 1 a female prediction), are no better than guessing.
Using categorical cross-entropy as my loss function, how would I properly implement backpropagation in this situation? What may I be doing wrong here?
All resource links used to get this far and the whole source code will be linked below.
Current evaluation code
def evaluate(self):
    #Target values are the y values that we want to be predicting correctly
    #You can calculate the loss of a categorical neural network (basically most NN) by using
    #categorical cross-entropy
    #Using one-hot encoding to calculate the categorical cross-entropy of data (loss)
    #In one-hot encoding, we assign the target class position we want in our array of outputs
    #Then make an array of 0s of the same length as outputs but put a 1 in the target class position
    #This basically simplifies to just the negative natural logarithm of the predicted target value
    #The following code will represent the confidence values in the predictions made by the NN
    #For this to work, if categorical, the number of outputs must equal the number of possible class targets
    #E.g for gender, there's two possible class targets (0 and 1), so two output neurons
    #The string can be changed to the attribute in the table that you shall be predicting
    #A short but ugly way of getting a start to complete this task
    '''
    loss = -np.log(self._network[-1].output[range(len(self._network[-1].output)),target_values.loc[:,"gender"]])
    average_loss = np.mean(loss)
    '''
    #A nicer way to accomplish the same thing
    samples = len(self._network[-1].output)
    #Clip the values so we don't get any infinity errors if a confidence level happens to be spot on
    y_pred_clipped = np.clip(self._network[-1].output, 1e-7, 1-1e-7)
    #If one-hot encoding has not been passed in
    if len(self._target_values.shape) == 1:
        #Selecting the largest confidences based on their position
        correct_confidences = y_pred_clipped[range(samples),self._target_values[:samples]]
    elif len(self._target_values.shape) == 2:
        #One-hot encoding has been used in this scenario
        correct_confidences = np.sum(y_pred_clipped*self._target_values[:samples], axis=1)
    #Calculate the loss and return
    loss = -np.log(correct_confidences)
    return loss, correct_confidences
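As a small standalone check of the two label formats handled above, the following NumPy sketch computes the per-sample categorical cross-entropy for class-index labels and for one-hot labels. The function name categorical_cross_entropy and the sample values are made up for illustration and are not part of the linked source code.

```python
import numpy as np

def categorical_cross_entropy(y_pred, y_true):
    """Per-sample loss for class-index labels (1-D) or one-hot labels (2-D)."""
    # clip so that log(0) can never occur
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    y_true = np.asarray(y_true)
    if y_true.ndim == 1:
        # pick the predicted probability of the correct class in each row
        correct = y_pred[np.arange(len(y_pred)), y_true]
    else:
        # the one-hot mask zeroes out every column except the correct class
        correct = np.sum(y_pred * y_true, axis=1)
    return -np.log(correct)

probs = np.array([[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]])
labels = np.array([0, 1, 1])
one_hot = np.eye(2)[labels]

loss_sparse = categorical_cross_entropy(probs, labels)
loss_one_hot = categorical_cross_entropy(probs, one_hot)
```

Both calls should return the same three values, namely the negative logarithms of 0.7, 0.9 and 0.5.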
Current run() code
def run(self, **kwargs):
    epochs = kwargs['epochs']
    #Start by putting initial inputs into the input layer and generating the network
    self._network[0].forward(self._inputs)
    for i in range(len(self._network)-1):
        #Using the previous layer's outputs as the next layer's inputs
        self._network[i+1].forward(self._network[i].output)
    for i in range(epochs):
        #Forward pass
        self._network[0].forward_pass(self._inputs)
        for i in range(len(self._network)-1):
            output = self._network[i+1].forward_pass(self._network[i].output)
        #Generates the values for loss function, used for training in multiple passes
        #Backbone of backpropagation
        loss = neural.evaluate()
        #Backward pass
        #Somehow find a way to derive the evaluate function on predicted values and target values
        error, confidences = [np.e**-x for x in loss]
        confidences = [np.e**-x for x in confidences]
        error = confidences
        for i in range(len(self._network)-1,-1):
            error = self._network[i-1].backward(error, self._learning_rate)
        print('Epoch %d/%d' % (i+1, epochs))
    #Start by putting initial inputs into the input layer
    self._network[0].forward(self._testing_data)
    for i in range(len(self._network)-1):
        #Using the previous layer's outputs as the next layer's inputs
        self._network[i+1].forward(self._network[i].output)
    print("The network's testing outputs were:", self._network[-1].output)
Backward pass code which runs for each layer
def backward(self, output_error, learning_rate):
    #The error of this layer's inputs is equal to its output error multiplied by the
    #transposed weights of the layer
    input_error = np.dot(output_error, self.weights.T)
    #The error of the weights in this layer is equal to the transposed matrix of inputs fed into the layer
    #multiplied by the error of the output from this layer
    weights_error = np.dot(self.inputs.T, output_error)
    # dBias = output_error
    # update parameters
    self.weights -= learning_rate * weights_error
    self.biases -= learning_rate * output_error
    return input_error
Aforementioned softmax function within forward() function of LayerDense
elif self._activation_function.lower() == 'softmax':
    #Exponentiate (e to the power of x) values and subtract largest value of layer to prevent overflow
    #Afterwards, normalise (put as relative fractions) the output values
    #In theory, to get the max value out of each batch, axis should be set to 1 and keepdims should be True
    neuron_output = np.exp(neuron_output - np.max(layer_output,axis=0)) / np.sum(np.exp(layer_output),axis=0)
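For comparison, here is a minimal sketch of a row-wise softmax (axis=1 with keepdims=True, as the comment above suggests) together with the commonly used combined gradient of softmax plus categorical cross-entropy with respect to the pre-softmax outputs. The names softmax, mean_cross_entropy and grad_logits are illustrative assumptions rather than functions from the linked repository, and the sketch assumes integer class labels and a loss averaged over the batch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting each row's max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def mean_cross_entropy(logits, labels):
    """Mean categorical cross-entropy for integer class labels."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def grad_logits(logits, labels):
    """Gradient of the mean cross-entropy with respect to the logits."""
    grad = softmax(logits)
    grad[np.arange(len(labels)), labels] -= 1.0   # probabilities minus one-hot targets
    return grad / len(labels)

logits = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]])
labels = np.array([0, 1, 1])
analytic = grad_logits(logits, labels)

# Finite-difference check of the analytic gradient
numeric = np.zeros_like(logits)
eps = 1e-6
base = mean_cross_entropy(logits, labels)
for idx in np.ndindex(*logits.shape):
    bumped = logits.copy()
    bumped[idx] += eps
    numeric[idx] = (mean_cross_entropy(bumped, labels) - base) / eps
```

The array returned by grad_logits is what would be passed as the output error into the last layer's backward() call, instead of values derived from the loss itself.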
Mentioned SentDex tutorial: https://www.youtube.com/playlist?list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3
Mentioned TDS article: https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65
Source code: https://github.com/NewDeveloper911/Python-Collection/blob/master/neural%20network/nn
I'd like to adjust the sampling rate during training of my neural network to test some stuff and see what happens. To achieve that, my idea was to create a new loss and optimizer for every iteration using the same computation graph.
def optimize(self, negative_sampling_rate):
    return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.calc_loss(negative_sampling_rate))

def calc_loss(self, negative_sampling_rate):
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=self.graph.prediction_weights,
        biases=self.graph.prediction_bias,
        labels=self.graph.labels,
        inputs=self.graph.hidden,
        num_sampled=negative_sampling_rate,
        num_classes=self.graph.prediction_weights.shape[1])
    )

def train(self, batch_inputs, batch_labels, negative_sampling_rate):
    feed_dict = {self.graph.X: batch_inputs, self.graph.labels: batch_labels}
    _, loss_val = self.session.run(
        [self.optimize(negative_sampling_rate), self.calc_loss(negative_sampling_rate)], feed_dict=feed_dict
    )
    return loss_val
But I'm a little bit worried about the optimizer. I've heard that optimizers have internal variables, which change on every training iteration. Is that true for all optimizers or only for some, and if so, which ones are usable for this approach?
The Neural network should then be trained like:
for step in range(training_steps):
    NN.train(inputs, labels, compute_sampling_rate(step))
First of all, it should be okay to change the number of samples in the nce loss without causing problems for the optimizer. The internal variables stored by some optimizers relate to the historical gradients of the trainable variables in your graph.
Secondly, if you do want to reset the state of your optimizer for some reason, the way I do it is by putting the optimizer in a variable scope. Whenever I want to reset it then I run the reset_optimizer op.
reset_optimizer = tf.no_op()
with tf.variable_scope('optimizer'):
    train_op = optimizer(learning_rate).minimize(loss)
opt_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "optimizer")
if len(opt_vars):  # check if optimizer state needs resetting
    reset_optimizer = tf.variables_initializer(opt_vars)
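To make the point about internal optimizer variables concrete, here is a framework-free sketch. PlainSGD and MomentumSGD are toy classes written for this illustration, not TensorFlow APIs. Plain gradient descent keeps nothing between steps, while a momentum-style optimizer keeps a running summary of past gradients that can be reset.

```python
class PlainSGD:
    """Gradient descent: the update depends only on the current gradient."""
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def step(self, weight, grad):
        return weight - self.learning_rate * grad


class MomentumSGD:
    """Keeps a velocity that summarises past gradients (internal state)."""
    def __init__(self, learning_rate, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = 0.0

    def step(self, weight, grad):
        # the velocity carries information from earlier steps into this one
        self.velocity = self.momentum * self.velocity + grad
        return weight - self.learning_rate * self.velocity

    def reset(self):
        # analogous in spirit to re-initialising the optimizer's variables
        self.velocity = 0.0


plain, momentum = PlainSGD(0.1), MomentumSGD(0.1)
w_plain = w_mom = 1.0
for grad in (1.0, 1.0):
    w_plain = plain.step(w_plain, grad)
    w_mom = momentum.step(w_mom, grad)
```

After two identical gradients the momentum variant has moved further than plain gradient descent because of its stored velocity; calling reset() discards that history.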
Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different amount of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only possibility to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in / time step only without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in TensorFlow, based on this and this. I found this out the hard way when I was trying to build conjugate gradient descent entirely inside the TensorFlow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in tf.nn.dynamic_rnn, though the actual cell implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at TensorFlow's dynamic_rnn code here.
Alternatively, dynamic graphs have never been TensorFlow's strong suit, so consider using other frameworks like PyTorch, or try out eager execution and see if that helps.
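To illustrate what per-time-step gradients through a loop of varying length look like outside a static graph, here is a plain-Python sketch. It is not TensorFlow code and swaps in a different technique: the derivative is carried forward through the loop together with the value (forward-mode differentiation). It also assumes a single weight shared by all iterations, whereas the question creates weight_1 inside the loop body; run_loop is a hypothetical helper name.

```python
def run_loop(x0, weight, max_steps):
    """Repeats x <- x * weight and records d x_t / d weight at every step.

    The derivative is carried through the loop alongside x, so the number
    of iterations may differ from call to call.
    """
    x, dx_dw = x0, 0.0
    per_step_grads = []
    steps = 0
    while steps < max_steps:
        # product rule: d(x * w)/dw = (dx/dw) * w + x
        dx_dw = dx_dw * weight + x
        x = x * weight
        steps += 1
        per_step_grads.append(dx_dw)
    return x, per_step_grads

x_final, grads = run_loop(x0=2.0, weight=3.0, max_steps=3)
```

Since x_t = x0 * weight ** t in this toy loop, the recorded values can be compared against the closed form x0 * t * weight ** (t - 1).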
Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., computing the gradients used to update the weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so that backward() needs no explicit gradient argument
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
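W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
learning_rate = 0.01  # assumed value; not given in the original
for sample, target in zip(data, targets):
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update the underlying tensors directly so autograd does not track the step
    W.data -= learning_rate * W.grad.data
    b.data -= learning_rate * b.grad.data
    # clear out the gradients of Variables (i.e. W, b)
    # before the next backward pass; .grad is None until backward() has run once
    W.grad.data.zero_()
    b.grad.data.zero_()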
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
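The effect of accumulation described in this answer can be imitated without PyTorch. The sketch below uses a toy Param class (hypothetical, written only for this illustration) whose grad buffer sums across backward calls, and compares training with and without zeroing the buffer before each backward pass.

```python
class Param:
    """Toy parameter whose grad buffer sums across backward calls."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, x, target):
        # loss = (value * x - target) ** 2
        # d loss / d value = 2 * (value * x - target) * x
        self.grad += 2.0 * (self.value * x - target) * x

    def zero_grad(self):
        self.grad = 0.0


def train(zero_each_step, steps=3, learning_rate=0.1):
    p = Param(0.0)
    used = []  # gradient value actually applied at each step
    for _ in range(steps):
        if zero_each_step:
            p.zero_grad()
        p.backward(x=1.0, target=1.0)
        used.append(p.grad)
        p.value -= learning_rate * p.grad
    return p.value, used

value_clean, grads_clean = train(zero_each_step=True)
value_stale, grads_stale = train(zero_each_step=False)
```

Without zeroing, the second applied gradient is the sum of the first two fresh gradients, and in this toy setup the parameter ends up overshooting the target value of 1.0, while the zeroed run approaches it steadily.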
Although the idea can be derived from the chosen answer, I feel it is worth writing out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or input data is big and one actual training batch does not fit on the GPU.
In this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for each forward pass and the loss.backward() that follows it. This is limited by the GPU memory.
gradient_accumulation_steps is the number of forward/backward passes whose gradients are accumulated before one update, so the actual training batch size is train_batch_size * gradient_accumulation_steps. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() is tied to optimizer.step() but NOT to loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
Also, someone asked about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)
zero_grad() starts each iteration without the gradients left over from the last step when you use a gradient method to decrease the error (or loss).
If you do not call zero_grad(), the loss may increase instead of decreasing as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
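What halving the gradients instead of zeroing them amounts to can be sketched in plain Python: the buffer the optimizer sees becomes an exponentially weighted sum of past gradients. The accumulate helper and the constant gradient sequence below are illustrative assumptions, not PyTorch code.

```python
def accumulate(grads, decay):
    """Buffer seen by the optimizer when the stored gradient is scaled by
    `decay` (instead of being zeroed) before each new gradient is added."""
    buffer, seen = 0.0, []
    for g in grads:
        buffer = decay * buffer + g
        seen.append(buffer)
    return seen

step_grads = [1.0, 1.0, 1.0]
zeroed = accumulate(step_grads, decay=0.0)   # same as calling zero_grad()
halved = accumulate(step_grads, decay=0.5)   # p.grad = p.grad / 2
summed = accumulate(step_grads, decay=1.0)   # never resetting at all
```

With a constant gradient the halved buffer approaches twice the fresh gradient, so the effective step size grows accordingly; this behaves somewhat like a momentum term and may call for a smaller learning rate.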
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs) seen so far. When we start backpropagation we want to update the weights in order to minimize the loss of our cost function. So we clear the gradients left over from the previous step (not the weights themselves) in order to obtain better weights. We keep doing this during training and do not do it during testing, because by then we already have the weights that best fit our data. Hope this clears things up!
In simple terms, we need zero_grad()
because when we start a training loop we do not want past gradients or past results to interfere with our current results. PyTorch collects (accumulates) the gradients on backpropagation, so past results may get mixed in and give us wrong results; that is why we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred,y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values may add up and change the result.
So we use zero_grad() to avoid wrong accumulated results.
I'm searching for a way to compute the weight-update-ratio for optimizer steps in Tensorflow. The weight-update-ratio is defined as the update-scale divided by the variable scale in each step and can be used for inspecting network training.
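As a point of reference, one common reading of this definition is the norm of the parameter update divided by the norm of the parameters before the update. The NumPy sketch below follows that reading; weight_update_ratio here is a hypothetical helper, unrelated to any TensorFlow API, and the numbers are made up.

```python
import numpy as np

def weight_update_ratio(before, after):
    """Norm of the update divided by the norm of the weights before the update."""
    update = after - before
    return np.linalg.norm(update) / np.linalg.norm(before)

weights = np.array([3.0, 4.0])             # norm 5
grad = np.array([0.6, 0.8])                # norm 1
learning_rate = 0.05
updated = weights - learning_rate * grad   # plain SGD step

ratio = weight_update_ratio(weights, updated)
```

Note that this differs in general from taking the absolute difference of the two norms and dividing by the norm before the update; the two only coincide when the update is parallel to the weights, as it happens to be in this toy example.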
Ideally I want a non-intrusive way to compute it in a single session run, but couldn't accomplish quite what I was looking for. Since the update-scale and parameter scale are independent of the train step, one needs to add explicit dependencies to the graph in order to graph variable-scale before and after the update step. Unfortunately, it seems that in TF dependencies can only be defined for new nodes, which further complicates the issue.
So far, the best I've come up with is a context manager for defining the necessary operations. It's used as follows:
opt = tf.train.AdamOptimizer(1e0)
grads = tf.gradients(loss, tf.trainable_variables())
grads = list(zip(grads, tf.trainable_variables()))
with compute_weight_update_ratio('wur') as wur:
    train = opt.apply_gradients(grads_and_vars=grads)
# ...
with tf.Session() as sess:
    sess.run(wur.ratio)
The full code of compute_weight_update_ratio can be found below. What bugs me is that in the current state the weight-update-ratio (at least norm_before) is computed with every training step, but for performance reasons I'd prefer to do it selectively (e.g. only when summaries are computed).
Any ideas on how to improve?
import contextlib

@contextlib.contextmanager
def compute_weight_update_ratio(name, var_scope=None):
    '''Injects training to compute weight-update-ratio.

    The weight-update-ratio is computed as the update scale divided
    by the variable scale before the update and should be somewhere in the
    range 1e-2 or 1e-3.

    Params
    ------
    name : str
        Operation name

    Kwargs
    ------
    var_scope : str, optional
        Name selection of variables to compute weight-update-ratio for. Defaults to all. Regex supported.
    '''
    class WeightUpdateRatio:
        def __init__(self):
            self.num_train = len(tf.get_collection(tf.GraphKeys.TRAIN_OP))
            self.variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=var_scope)
            self.norm_before = tf.norm(self.variables, name='norm_before')

        def compute_ratio(self,):
            train_ops = tf.get_collection(tf.GraphKeys.TRAIN_OP)
            assert len(train_ops) > self.num_train, 'Missing training op'
            with tf.control_dependencies(train_ops[self.num_train:]):
                self.norm_after = tf.norm(self.variables, name='norm_after')
                absdiff = tf.abs(tf.subtract(self.norm_after, self.norm_before), name='absdiff')
                self.ratio = tf.divide(absdiff, self.norm_before, name=name)

    with tf.name_scope(name) as scope:
        try:
            wur = WeightUpdateRatio()
            with tf.control_dependencies([wur.norm_before]):
                yield wur
        finally:
            wur.compute_ratio()
You don't need to worry about performance too much. Tensorflow only executes the subgraph necessary to produce the output.
So, in your training loop, if wur.ratio is not called during an iteration, none of the extra nodes created to compute it will be executed.