PyTorch: Loss remains constant

I've written code in PyTorch with my own loss function, focal_loss_fixed. But the loss value stays fixed after every epoch; it looks like the weights are not being updated. Here is my code snippet:
optimizer = optim.SGD(net.parameters(),
                      lr=lr,
                      momentum=0.9,
                      weight_decay=0.0005)

for epoch in T(range(20)):
    net.train()
    epoch_loss = 0
    for n in range(len(x_train)//batch_size):
        (imgs, true_masks) = data_gen_small(x_train, y_train, iter_num=n, batch_size=batch_size)
        temp = []
        for tt in true_masks:
            temp.append(tt.reshape(128, 128, 1))
        true_masks = np.copy(np.array(temp))
        del temp
        imgs = np.swapaxes(imgs, 1, 3)
        imgs = torch.from_numpy(imgs).float().cuda()
        true_masks = torch.from_numpy(true_masks).float().cuda()
        masks_pred = net(imgs)
        masks_probs = F.sigmoid(masks_pred)
        masks_probs_flat = masks_probs.view(-1)
        true_masks_flat = true_masks.view(-1)
        print(focal_loss_fixed(tf.convert_to_tensor(true_masks_flat.data.cpu().numpy()), tf.convert_to_tensor(masks_probs_flat.data.cpu().numpy())))
        loss = torch.from_numpy(np.array(focal_loss_fixed(tf.convert_to_tensor(true_masks_flat.data.cpu().numpy()), tf.convert_to_tensor(masks_probs_flat.data.cpu().numpy())))).float().cuda()
        loss = Variable(loss.data, requires_grad=True)
        epoch_loss *= (n/(n+1))
        epoch_loss += loss.item()*(1/(n+1))
        print('Step: {0:.2f}% --- loss: {1:.6f}'.format(n * batch_size * 100.0 / len(x_train), epoch_loss), end='\r')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Epoch finished ! Loss: {}'.format(epoch_loss))
And this is my `focal_loss_fixed` function:
def focal_loss_fixed(true_data, pred_data):
    gamma = 2.
    alpha = .25
    eps = 1e-7
    # print(type(y_true), type(y_pred))
    pred_data = K.clip(pred_data, eps, 1-eps)
    pt_1 = tf.where(tf.equal(true_data, 1), pred_data, tf.ones_like(pred_data))
    pt_0 = tf.where(tf.equal(true_data, 0), pred_data, tf.zeros_like(pred_data))
    with tf.Session() as sess:
        return sess.run(-K.sum(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) - K.sum((1-alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0)))
After each epoch the loss value stays constant (5589.60328). What's wrong with it?

When computing the loss you call focal_loss_fixed(), which uses TensorFlow to compute the loss value. focal_loss_fixed() builds a graph and runs it in a session to get the value, so PyTorch has no record of the sequence of operations that produced the loss; they were executed by the TensorFlow backend. All PyTorch sees in loss is a constant, as if you had written
loss = 3
So the gradient will be zero, and the parameters will never be updated. I suggest you rewrite your loss function using PyTorch operations so that the gradient with respect to its inputs can be computed.
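For example, a direct translation into PyTorch might look like this (a sketch mirroring the TensorFlow version above, with gamma, alpha, and eps taken from the question):

import torch

def focal_loss_fixed(true_data, pred_data, gamma=2., alpha=.25, eps=1e-7):
    # pred_data holds probabilities; keep them away from 0 and 1 for numerical stability
    pred_data = pred_data.clamp(eps, 1. - eps)
    # same pt_1 / pt_0 construction as the tf.where version
    pt_1 = torch.where(true_data == 1, pred_data, torch.ones_like(pred_data))
    pt_0 = torch.where(true_data == 0, pred_data, torch.zeros_like(pred_data))
    return (-torch.sum(alpha * (1. - pt_1) ** gamma * torch.log(pt_1))
            - torch.sum((1. - alpha) * pt_0 ** gamma * torch.log(1. - pt_0)))

The training loop can then use loss = focal_loss_fixed(true_masks_flat, masks_probs_flat) and call loss.backward() directly, with no numpy round-trip and no Variable re-wrapping.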

I think the problem lies in your heavy weight decay.
Essentially, you are not reducing the weights by x, but rather multiplying them by x, which means you are only making very small increments at each step, leading to a (seemingly) plateauing loss function.
More explanation of this can be found in the PyTorch discussion forum (e.g., here, or here).
Unfortunately, the source for SGD alone also does not tell you much about its implementation.
Simply setting it to a larger value should result in better updates. You can start by leaving it out completely, and then iteratively reducing it (from 1.0) until you get more decent results.
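For instance, a quick sanity check (reusing the optimizer call from the question) is to drop the weight_decay argument entirely and compare:

optimizer = optim.SGD(net.parameters(),
                      lr=lr,
                      momentum=0.9)  # weight_decay omitted for comparison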

Related

Control flow in Tensorflow 2 - gradients are None

I have a Tensorflow 2.x model whose purpose is to dynamically choose a computational path. Here's a schematic drawing of this model:
The only trainable block is the Decision Module (DM), which is essentially a fully connected layer with a single binary output (0 or 1; it's differentiable using a technique called Improved Semantic Hashing). Nets A & B have the same network architecture.
During training, I feed forward a batch of images until the output of the DM, and then process the decision image-by-image, directing each image to the decided net (A or B). The predictions are concatenated into a single tensor, which is used to evaluate the performance. Here's the training code (sigma is the output of the DM; model includes the feature extractor and the DM):
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        # training=True is only needed if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        _, sigma = model(images, training=True)
        out = []
        for img, s in zip(images, sigma):
            if s == 0:
                o = binary_classifier_model_a(tf.expand_dims(img, axis=0), training=False)
            else:
                o = binary_classifier_model_b(tf.expand_dims(img, axis=0), training=False)
            out.append(o)
        predictions = tf.concat(out, axis=0)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)
The problem: when running this code, gradients returns [None, None].
What I know so far:
The first part of the model (up to the DM's output) is differentiable; I tested it by running only this section, applying a loss function (MSE), and then calling tape.gradient - I got actual gradients.
I tried choosing a single (constant) net - e.g., net A - and simply multiplying its output by s (which is either 0 or 1); this replaces the if-else block in the code. In this case I also got gradients.
My concern is that such a thing might not be possible - quoting from the official docs:

x = tf.constant(1.0)
v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        result = v0
    else:
        result = v1**2

Depending on the value of x in the above example, the tape either
records result = v0 or result = v1**2. The gradient with respect to
x is always None.

dx = tape.gradient(result, x)
print(dx)
>> None
I'm not 100% sure that this is my case, but I wanted to ask here for the experts' opinion.
Is what I'm trying to do possible? And if yes - what should I change in order for this to work?
Thanks
You correctly identified the issue. The control statement of the conditional is not differentiable, so you lose your link to the model variables that produced sigma.
In your case, because you state that sigma is either 1 or 0, you can use the value of sigma as a mask, and skip the conditional statement (and even the loop).
with tf.GradientTape() as tape:
    _, sigma = model(images, training=True)
    predictions = (1.0 - sigma) * binary_classifier_model_a(images, training=False) \
                + sigma * binary_classifier_model_b(images, training=False)
    loss = loss_object(labels, predictions)
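Note that this evaluates both classifiers on the full batch and blends their outputs with the 0/1 mask, so gradients can flow back to the variables that produced sigma through the multiplications, at the cost of running both nets on every image.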
It seems the solution to your problem is to use control flow operations. Try using tf.where. You can implement your condition by doing something like this:
a = tf.constant([1, 1])
b = tf.constant([2, 2])
p = tf.constant([True, False])
x = tf.where(p, a + b, a * b)
For more information, please refer to this.
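One caveat: gradients flow through tf.where's two value arguments, but not through the boolean condition itself, so the condition should be built from labels or constants (as in the snippet above), not from a tensor you need gradients for.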

Gradient Accumulation with Custom model.fit in TF.Keras?

Please add a minimal comment on your thoughts so that I can improve my query. Thank you. :-)
I'm trying to train a tf.keras model with Gradient Accumulation (GA). But I don't want to use it in a custom training loop; instead I want to customize the .fit() method by overriding train_step. Is it possible? How can this be accomplished? The reason is that if we want to get the benefit of keras built-in functionality like fit and callbacks, we don't want to use a custom training loop, but at the same time, if we want to override train_step for some reason (like GA or something else), we can customize the fit method and still get the leverage of those built-in functions.
Also, I know the pros of using GA, but what are the major cons of using it? Why doesn't it come as a default, rather than an optional, feature of the framework?
# overriding train step
# my attempt
# it's not appropriately implemented
# and need to fix
class CustomTrainStep(keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [
            tf.zeros_like(this_var) for this_var in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(
                y, y_pred, regularization_losses=self.losses
            )
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [
            (acum_grad + grad) for acum_grad, grad in
            zip(self.gradient_accumulation, gradients)
        ]
        accum_gradient = [
            this_grad / batch_size for this_grad in accum_gradient
        ]
        # apply accumulated gradients
        self.optimizer.apply_gradients(
            zip(accum_gradient, self.trainable_variables)
        )
        # TODO: reset self.gradient_accumulation
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
Please run and check with the following toy setup.
# Model
size = 32
input = keras.Input(shape=(size, size, 3))
efnet = keras.applications.DenseNet121(
    weights=None,
    include_top=False,
    input_tensor=input
)
base_maps = keras.layers.GlobalAveragePooling2D()(efnet.output)
base_maps = keras.layers.Dense(
    units=10, activation='softmax',
    name='primary'
)(base_maps)
custom_model = CustomTrainStep(
    n_gradients=10, inputs=[input], outputs=[base_maps]
)

# bind all
custom_model.compile(
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=keras.optimizers.Adam()
)

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size, size])  # if we want to resize
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose=1)
Update
I've found that some others have also tried to achieve this and ended up with the same issue. One of them found a workaround, here, but it's too messy, and I think there should be a better approach.
Update 2
The accepted answer (by Mr.For Example) is fine and works well with a single strategy. Now I'd like to start a 2nd bounty to extend it to support multi-GPU, TPU, and mixed-precision techniques. There are some complications; see the details.
Yes, it is possible to customize the .fit() method by overriding train_step without a custom training loop. The following simple example shows how to train a simple mnist classifier with gradient accumulation:
import tensorflow as tf

class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)

        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])

        # If n_acum_step reaches n_gradients, apply the accumulated gradients to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)

        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))

        # reset
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3)
)

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose=1)
Outputs:
Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
Pros:
Gradient accumulation is a mechanism to split the batch of samples —
used for training a neural network — into several mini-batches of
samples that will be run sequentially
Because GA calculates the loss and gradients after each mini-batch, but instead of updating the model parameters it waits and accumulates the gradients over consecutive batches, it can overcome memory constraints, i.e., train the model with less memory as if it were using a large batch size.
Example: If you run a gradient accumulation with steps of 5 and batch
size of 4 images, it serves almost the same purpose of running with a
batch size of 20 images.
We could also parallelize the training when using GA, i.e., aggregate gradients from multiple machines.
Things to consider:
This technique works so well that it is widely used. There are a few things to consider before using it, but I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.
If your machine has sufficient memory for a batch size that is already large enough, there is no need to use it: it is well known that an overly large batch size will lead to poor generalization, and it will certainly run slower if you use GA to reach a batch size that your machine's memory can already handle.
Reference:
What is Gradient Accumulation in Deep Learning?
Thanks to @Mr.For Example for his convenient answer.
Usually, I've also observed that using Gradient Accumulation won't speed up training, since we are doing n_gradients forward passes and computing all the gradients. But it will speed up the convergence of our model. And I found that using the mixed_precision technique can be really helpful here. Details here.
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)
Here is a complete gist.
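For reference, a rough sketch of how loss scaling typically enters a custom train step under mixed precision (assuming TF 2.4+; model, x, y, and loss_fn here are placeholders, not names from the gist):

optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))

with tf.GradientTape() as tape:
    y_pred = model(x, training=True)  # computed in float16 under the policy
    loss = loss_fn(y, y_pred)
    scaled_loss = optimizer.get_scaled_loss(loss)  # scale up to avoid float16 underflow
scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
gradients = optimizer.get_unscaled_gradients(scaled_gradients)  # undo the scaling
optimizer.apply_gradients(zip(gradients, model.trainable_variables))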

Matrix Factorization with PyTorch using GPU

The code below does run, but it's very slow as it's using for loops. At my university, servers with GPU resources are available. Likewise, I'd like to understand how to use batches to train the model more effectively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_items=len(movie_ids), n_factors=300):
        super().__init__()
        self.vectors = nn.Embedding(n_items, n_factors, sparse=True)

    def forward(self, i, j):
        feat_i = self.vectors(i)
        feat_j = self.vectors(j)
        result = (feat_i * feat_j).sum(-1)
        return result

model = MatrixFactorization(n_items=len(movie_ids), n_factors=300)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 100
for epoch in range(epochs):
    loss = 0
    for r, c in zip(r_index, c_index):
        i = torch.LongTensor([int(r)])
        j = torch.LongTensor([int(c)])
        rating = torch.FloatTensor([Xij[i, j]])
        # predict
        prediction = model(i, j)
        loss += loss_fn(prediction, rating)
    # Reset the gradients to 0
    optimizer.zero_grad()
    # backpropagate
    loss.backward()
    # update weights
    optimizer.step()
    print(loss)
I've tried the alteration below, but it produced a warning. I'm not sure why my target sizes are mismatched, but that appears to be the cause of the issue.
epochs = 50
for epoch in range(epochs):
    loss = 0
    # predict
    i = torch.LongTensor(r_index)
    j = torch.LongTensor(c_index)
    ratings = Xij[i, j]
    prediction = model(i, j)
    loss += loss_fn(prediction, rating)
    # Reset the gradients to 0
    optimizer.zero_grad()
    # backpropagate
    loss.backward()
    # update weights
    optimizer.step()
    print(loss)
And the warning (not sure where I went wrong):
/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py:431: UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([5931640])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
There is a typo in your second code snippet:
loss += loss_fn(prediction, ratings) # instead of rating
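Beyond the typo, here is a rough sketch of mini-batch training (reusing model, loss_fn, optimizer, epochs, r_index, and c_index from the question, and assuming Xij is a dense numpy array; batch_size is an arbitrary choice):

import numpy as np
import torch

batch_size = 1024
# pair up the row/column indices so they can be shuffled together
pairs = np.stack([np.asarray(r_index, dtype=np.int64),
                  np.asarray(c_index, dtype=np.int64)], axis=1)

for epoch in range(epochs):
    np.random.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        i = torch.LongTensor(batch[:, 0])
        j = torch.LongTensor(batch[:, 1])
        ratings = torch.FloatTensor(Xij[batch[:, 0], batch[:, 1]])
        prediction = model(i, j)  # one vectorized forward pass per batch
        loss = loss_fn(prediction, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

To run on a GPU, move the model once with model.cuda() and each batch tensor with .cuda() before the forward pass.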

How to accumulate my loss over mini batches then calculate my gradient

My main question is: is averaging the loss the same thing as averaging the gradient, and how do I accumulate my loss over mini-batches and then calculate my gradient?
I have been trying to implement policy gradient in Tensorflow and ran into the issue where I cannot feed all my game states into my network at once and then update. The problem is, if I lower my network size, then train on all frames at once and take the mean of the loss, it begins to converge nicely. But if I accumulate the gradients over mini-batches and then average them, my gradients explode and I overflow my weights.
Any help or insight will be very much appreciated.
Keep in mind also, this is my first time asking a question here.
What you can do is accumulate gradients after each mini-batch and then update the weights based on gradient averages. Consider the following simple case of fitting 50 Gaussian blobs with a single-layered perceptron:
from sklearn.datasets import make_blobs
import tensorflow as tf
import numpy as np

x_train, y_train = make_blobs(n_samples=50,
                              n_features=2,
                              centers=[[1, 1], [-1, -1]],
                              cluster_std=0.5)

with tf.name_scope('x'):
    x = tf.placeholder(tf.float32, [None, 2])
    y = tf.placeholder(tf.int32, [None])

with tf.name_scope('layer'):
    logits = tf.layers.dense(x,
                             units=2,
                             kernel_initializer=tf.contrib.layers.xavier_initializer())

with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss_op = tf.reduce_mean(xentropy)
The minimize() method of the tensorflow optimizers calls compute_gradients() and then apply_gradients(). Instead of calling minimize(), I'm going to call both methods directly. First, to get the gradients, we call compute_gradients() (which returns a list of tuples grads_and_vars), and for apply_gradients(), instead of gradients, I'm going to feed placeholders for the future gradients' averages:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_and_vars = optimizer.compute_gradients(loss_op)
grads = [g for g, v in grads_and_vars]

# placeholders for gradients averages
placeholder_grads = [tf.placeholder(tf.float32, [None] + g.get_shape().as_list())
                     for g in grads]
new_grads_and_vars = [(tf.reduce_mean(p, axis=0), gv[1])
                      for p, gv in zip(placeholder_grads, grads_and_vars)]
apply_grads_op = optimizer.apply_gradients(new_grads_and_vars)
During the mini-batches we only compute losses (you can accumulate losses as well: append them to a list and then compute the average) and gradients, without applying the gradients to the weights. At the end of each epoch we execute the apply_grads_op operation while feeding the accumulated gradients to its placeholders:
data = tf.data.Dataset.from_tensor_slices({'x': x_train, 'y': y_train}).batch(10)
iterator = data.make_initializable_iterator()
n_epochs = 2
with tf.Session() as sess:
    _ = sess.run([tf.global_variables_initializer(), iterator.initializer])
    next_batch = iterator.get_next()
    for epoch in range(n_epochs):
        epoch_grads = []
        while True:
            try:
                batch = sess.run(next_batch)
                evaled = sess.run([loss_op] + grads,
                                  feed_dict={x: batch['x'], y: batch['y']})
                epoch_grads.append(evaled[1:])
                print('batch loss:', evaled[0])
            except tf.errors.OutOfRangeError:
                _ = sess.run(iterator.initializer)
                feed_dict = {p: [g[i] for g in epoch_grads]
                             for i, p in enumerate(placeholder_grads)}
                _ = sess.run(apply_grads_op, feed_dict=feed_dict)
                break
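As for the main question: since differentiation is linear, the gradient of the average loss equals the average of the per-mini-batch gradients (when the mini-batches have equal size), so both approaches should give the same update. If gradients explode when accumulating, one common culprit is summing instead of averaging, which effectively multiplies the learning rate by the number of mini-batches.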

How to use a tensorflow tensor value in a formula?

I have a quick question. I am developing a model in tensorflow, and need to use the iteration number in a formula during the construction phase. I know how to use global_step, but I am not using an already existing optimizer.
I am calculating my own gradients with
grad_W, grad_b = tf.gradients(xs=[W, b], ys=cost)
grad_W = grad_W + rnd.normal(0, 1.0/(1+epoch)**0.55)
and then using
new_W = W.assign(W - learning_rate * (grad_W))
new_b = b.assign(b - learning_rate * (grad_b))
and I would like to use the epoch value in the formula before updating my weights. What is the best way to do this? I have a sess.run() part and would like to pass the epoch number to the model, but I cannot directly use a tensor.
From my run call
_, _, cost_ = sess.run([new_W, new_b, cost],
                       feed_dict={X_: X_train_tr, Y: labels_, learning_rate: learning_r})
I would like to pass the epoch number. How do you usually do it?
Thanks in advance, Umberto
EDIT:
Thanks for the hints. The following seems to work:
grad_W = grad_W + tf.random_normal(grad_W.shape,
                                   0.0, 1.0/tf.pow(0.01+tf.cast(epochv, tf.float32), 0.55))
but I still have to check whether that is what I need and whether it works as intended. Ideas and feedback would be great!
You can define epoch as a non-trainable tf.Variable in your graph and increment it at the end of each epoch, using an operation defined with tf.assign_add that you run at that point. Instead of rnd.normal you will also need to use tf.random_normal.
Example:
epoch = tf.Variable(0, trainable=False)  # 0 is the initial value

# increment by 1 when the next op is run
epoch_incr_op = tf.assign_add(epoch, 1, name='incr_epoch')

# Define any operations that depend on 'epoch'
# Note we need to cast the integer 'epoch' to float to use it in tf.pow
grad_W = grad_W + tf.random_normal(grad_W.shape, 0.0,
                                   1.0/tf.pow(1+tf.cast(epoch, tf.float32), 0.55))

# Training loop
while running_epoch:
    _, _, cost_ = sess.run([new_W, new_b, cost],
                           feed_dict={X_: X_train_tr, Y: labels_, learning_rate: learning_r})

# At end of epoch, increment epoch counter
sess.run(epoch_incr_op)
