I am training an autoencoder whose input is a matrix P with entries in [0,1], using the following weighted binary cross-entropy loss (written out from the code below):

L = -Σ_j [ p_j · log(p_pred_j + 1e-10) + 0.55 · (1 − p_j) · log(1 − p_pred_j + 1e-10) ]
And here is my code:
# Define loss and optimizer, minimize the weighted cross-entropy error
with tf.device("/device:GPU:0"):
    L = -tf.reduce_sum(self.p * tf.math.log(self.p_pred + 1e-10)
                       + 0.55 * (1 - self.p) * tf.math.log(1 - self.p_pred + 1e-10), axis=1)
    self.loss = tf.reduce_mean(L)
    self.optimizer = tf.compat.v1.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
In main.py, I run session like this:
_, l, y_pred, y = sess.run([model.optimizer, model.loss, model.y_pred, model.y], feed_dict=...)
But the loss becomes NaN at a different epoch every time I train. The activation function is sigmoid and learning_rate = 0.01. The number of epochs is 20. I tried saving p and p_pred at the moment the loss became NaN, and when I run the same loss function on them in Google Colab the result is not NaN! I don't understand.
Any ideas for what I did wrong?
The loss function you're using is binary cross-entropy (BCE), which produces NaN when the value inside the log is less than or equal to 0. Sometimes the term inside the log may go <= 0 and sometimes it may not, which is why you don't always get NaN in Colab.
Try applying a sigmoid inside the loss function (i.e. sigmoid(self.p_pred) instead of self.p_pred). This will keep the term in the (0, 1) range.
Otherwise, try increasing the epsilon value (the 1e-10 in your code) to something like 1e-5.
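For illustration, here is a minimal sketch combining both suggestions (this assumes self.p_pred holds raw, unbounded outputs; if your last layer already applies a sigmoid, keep only the clipping):
# squash predictions into (0, 1) and clip them away from exactly 0 and 1
p_clipped = tf.clip_by_value(tf.math.sigmoid(self.p_pred), 1e-5, 1 - 1e-5)
L = -tf.reduce_sum(self.p * tf.math.log(p_clipped)
                   + 0.55 * (1 - self.p) * tf.math.log(1 - p_clipped), axis=1)
self.loss = tf.reduce_mean(L)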
I'm working on a detector with Keras, where the output y_true consists of a vector "y" with 500 values, which contains a pulse indicating the time of the event detected within 500 samples of a signal.
Ex: y=[0, 0, 0,....,0,1,1,1,1,1,1,1,1,1,1,1,0,....0,0,0]
I've worked before with the 'mse' for the loss, and it works, but I want to use a loss function that considers the distance between the middle value from the pulse in y_true and the max value in y_pred. Later I use the max value in the y_pred to normalize it and define the pulse around it.
Since I can't work with just the distance and make it differentiable, I defined this custom loss function, which weights the mean square error with the estimated distance.
import tensorflow as tf
import keras.backend as kb
def custom_loss_function(y_true, y_pred):
    # indices where the pulse in y_true equals 1
    t_label = tf.where(y_true == 1)[:, 0]
    # index (or indices) where y_pred reaches its maximum
    mayor = tf.reduce_max(y_pred)
    t_picking = tf.where(y_pred == mayor)[:, 0]
    # distance between the middle of the pulse and the predicted maximum
    d = tf.cast(abs(t_label[5] - t_picking) / 50, tf.float32)
    loss = kb.mean(kb.square(y_true - y_pred)) * d
    return loss
Here t_label[5] and t_picking are the middle of the pulse in y_true and the index of the max value in y_pred, respectively, and d is the distance between them.
I compiled the model with this loss function, using Adam optimizer and a batch size of 64.
Everything works, and the model can be compiled, but I get this error in the middle of the training:
InvalidArgumentError: Incompatible shapes: [64] vs. [2]
[[node Adam/gradients/gradients/loss/dense_1_loss/custom_loss_function/weighted_loss/mul_grad/BroadcastGradientArgs (defined at C:\Users\Maca\anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_2220]
I've tried other custom loss functions before and didn't have this problem, but I can't see where the error is coming from.
Do you know why I am getting this error and how I can fix it?
There are two equal max values in a particular batch, so your t_picking sometimes (rarely) contains two (or even more) indices instead of one, which makes the shapes incompatible.
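One possible workaround (a sketch of my own, not from the original answer) is to keep only the first index at which the maximum occurs, so t_picking always contains exactly one value:
t_picking = tf.where(y_pred == tf.reduce_max(y_pred))[:, 0][:1]  # keep only the first max index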
I want to calculate the L1 loss in a neural network. I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in this code.
Is this really how to calculate L1 Loss in a NN or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)

factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simply doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. I'm just checking whether there is an existing function to do this.
If I am understanding correctly, you want to compute the L1 loss of your model (as you say at the beginning). However, I think you might have gotten confused by the discussion in the PyTorch forum.
From what I understand of the PyTorch forum thread and the code you posted, the author is trying to regularize the network weights with an L1 penalty, i.e. to encourage the weight values to stay in a sensible range (not too big, not too small). That is weight regularization using the L1 norm (which is why it iterates over model.parameters()). The regularization term takes the weights as input and produces a penalty value that is added to the loss.
Check this for weights normalization: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, the L1 loss is just a way to measure how much two values differ from each other, so the "loss" is just a measure of this difference. In the case of the L1 loss, this error is computed as the mean absolute error loss = |x - y|, where x and y are the values to compare. So the loss computation takes two values as input and produces a value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the above snippets are not equivalent, since the first is trying to penalize the weights while in the second you are computing a loss between a prediction and a target. This is what the loss computation would look like with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.L1Loss()
loss_value = loss(target, target_predicted)
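And, for completeness, here is a minimal sketch of what the forum snippet was presumably aiming at (my own reading, not part of the answer above): adding an L1 penalty on the parameters to the data loss.
data_loss = torch.nn.functional.mse_loss(target_predicted, target)
l1_penalty = sum(p.abs().sum() for p in model.parameters())  # L1 norm of all weights
factor = 0.0005
loss_value = data_loss + factor * l1_penalty
loss_value.backward()  # backpropagate through both terms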
I am building an optimizer (https://github.com/keras-team/keras/blob/master/keras/optimizers.py) which calculates a search direction and then tries a few different step lengths to find which one gives the lowest loss. However, I am running into problems when trying to change the step length depending on the value of the loss itself. It appears that the loss (which is a tensor dependent on the weights of the network and the data) cannot be updated/recalculated more than once during each training loop, which I find very odd.
This is the relevant code I have in get_updates(self, loss, params):
L1 = loss
for p, direction in zip(params, directions):
    self.updates.append(K.update(p, p + length * direction))

L2 = loss
for p, direction in zip(params, directions):
    self.updates.append(K.update(p, tf.cond(L2 < L1, lambda: p + 0.5 * length * direction, lambda: p)))
The problem is that L1 and L2 are the same and no matter what I try I can't get the loss to update after I've updated the weights. I've also tried just p = p+length*direction and p.assign() but the loss doesn't update. Does anyone know how I can get an updated value of the loss?
Note, I am able to get the loss from the previous batch/epoch if I save a loss value and update using self.updates.append(K.update(self.prev_loss,loss)), however since the data will change between batches I am no longer working on the same loss function and thus my comparison between the losses to determine if the step length should be lower is not valid.
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, TensorFlow (and Keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0.0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize: your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000 * norm_loss. Be very careful with this, as it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
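As an illustration only, here is a sketch of how this penalty could be packaged as a Keras custom loss (the function name and the penalty_weight factor are my own choices, not part of the answer above):
import tensorflow as tf

def mse_with_norm_penalty(y_true, y_pred, penalty_weight=1.0):
    mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
    y_norm = tf.norm(y_pred, axis=-1)
    # the norm contributes to the loss only where it exceeds 1
    norm_loss = tf.where(y_norm > 1.0, y_norm, tf.zeros_like(y_norm))
    return mse + penalty_weight * norm_loss

# model.compile(optimizer="adam", loss=mse_with_norm_penalty)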
In your specific case, you could get creative. Why not normalize both your predictions and the targets to norm one (just a suggestion, it might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim
def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update the parameters through .data so autograd does not track the update
    W.data -= learning_rate * W.grad.data
    b.data -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
Although the idea can be derived from the chosen answer, I feel it is worth writing out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives you more freedom over how the gradient is accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit into the GPU card.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, followed by the loss.backward(). This is limited by the GPU memory.
gradient_accumulation_steps determines the actual training batch size, as the loss from multiple forward passes is accumulated. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() may be followed by optimizer.step(), but neither has to accompany every loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated training batches equals gradient_accumulation_steps (line 227, inside the if block at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
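As a minimal sketch of this accumulation pattern (variable and loader names here are my own, not taken from the linked script):
accumulation_steps = 4  # plays the role of gradient_accumulation_steps

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()  # gradients are summed across steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradient
        optimizer.zero_grad()  # reset for the next accumulation window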
Someone also asked about an equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.
(I am still new to AI libraries, please correct me if anything I said is wrong.)
zero_grad() clears the gradients accumulated in the last step, so each step starts fresh when you use gradient descent to decrease the error (or loss).
If you do not use zero_grad(), the loss may increase rather than decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
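To see the accumulation effect concretely, here is a tiny self-contained sketch (my own, not taken from the answer above):
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
print(w.grad)              # tensor([2.])

(2 * w).sum().backward()   # no zeroing in between, so gradients add up
print(w.grad)              # tensor([4.])

w.grad.zero_()             # what optimizer.zero_grad() does for each parameter
print(w.grad)              # tensor([0.])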
You don't have to call zero_grad(); alternatively, you can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learnt from the samples (inputs). When we start backpropagation we want to update the weights in order to minimize the loss of our cost function, so we clear out the previously accumulated gradients to obtain better weights. We keep doing this during training, and we do not perform it at test time because the weights obtained during training are the ones that best fit our data. Hope this makes it clearer!
In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients or past results to interfere with our current results. Because of how PyTorch works, gradients are collected/accumulated on backpropagation, and if past results mix in they may give us the wrong results, so we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values may get added up and change the result. So we use zero_grad() to avoid these wrong accumulated results.