Different loss values for test_on_batch and train_on_batch

While trying to train a GAN for image generation I ran into a problem which I cannot explain.
When training the generator, the loss which is returned by train_on_batch after just 2 or 3 iterations directly drops to zero. After investigating I realized some strange behavior of the train_on_batch method:
When I check the following:
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
predictions = GAN.stackedModel.predict(noise)
This returns values all close to zero as I would expect since the generator is not trained yet.
However:
y = np.ones([batch_size, 1])
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
loss = GAN.stackedModel.train_on_batch(noise, y)
Here the loss is almost zero, even though my targets are all ones.
When I run:
y = np.ones([batch_size, 1])
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, gen_noise_length])
loss = GAN.stackedModel.test_on_batch(noise, y)
the returned loss is high as I would expect.
What is going on with the train_on_batch method? I'm really clueless here...
edit
My loss is binary-crossentropy and I build the model like:
def createStackedModel(self):
    # Build stacked GAN model
    gan_in = Input([self.noise_length])
    H = self.genModel(gan_in)
    gan_V = self.disModel(H)
    GAN = Model(gan_in, gan_V)
    opt = RMSprop(lr=0.0001, decay=3e-8)
    GAN.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    return GAN
edit 2
The generator is constructed by stacking some of those blocks each containing a BatchNormalization:
self.G.add(UpSampling2D())
self.G.add(Conv2DTranspose(int(depth/8), 5, padding='same'))
self.G.add(BatchNormalization(momentum=0.5))
self.G.add(Activation('relu'))
edit 3
I loaded my code to https://gitlab.com/benjamingraf24/DCGAN/
Apparently the problem results from the way I build the GAN network, so there must be something wrong in GANBuilder.py. However, I can't find it...

BatchNormalization layers behave differently during training and testing phase.
During training phase they will use the current batch mean and variance of the activations to normalize.
However, during testing phase they use the moving mean and moving variance that they collected during training. Without enough previous training these collected values can be far from the actual batch statistics, resulting in significant loss value differences.
Refer to the Keras documentation for BatchNormalization. The momentum argument defines how fast the moving mean and moving variance adapt to the statistics of freshly seen batches during training.
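The difference between the two modes can be illustrated with a small NumPy sketch. This is not the Keras implementation; it only mimics the documented behaviour (moving mean starting at 0, moving variance starting at 1, exponential moving-average updates), and the activation statistics (mean 5, standard deviation 3) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Moving statistics start at mean 0 / variance 1, like the Keras defaults.
moving_mean, moving_var = 0.0, 1.0
momentum = 0.5  # value used in the question's generator blocks
eps = 1e-3

def batch_norm(x, training):
    """Normalize x with batch statistics (training) or moving statistics (inference)."""
    global moving_mean, moving_var
    if training:
        mean, var = x.mean(), x.var()
        # Exponential moving-average update of the stored statistics.
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps)

# Made-up activations whose statistics are far from the initial moving values.
x = rng.normal(loc=5.0, scale=3.0, size=1000)

out_test_before = batch_norm(x, training=False)  # moving stats (0, 1): not normalized
out_train = batch_norm(x, training=True)         # batch stats: mean ~0, std ~1

# After more training steps the moving statistics catch up with the data.
for _ in range(20):
    batch_norm(rng.normal(loc=5.0, scale=3.0, size=1000), training=True)
out_test_after = batch_norm(x, training=False)

print(out_test_before.mean(), out_train.mean(), out_test_after.mean())
```

With an untrained layer, the inference-mode output is far from normalized, which is one way train_on_batch and test_on_batch can report very different losses for the same input.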

Related

Model not improving with GradientTape but with model.fit()

I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown here:
[Figure: custom training]
[Figure: keras training]
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.

I have found the solution: it was a simple shape mismatch, which was somehow not caught by any error check and ran without complaint both with my custom loss function and with MSE. Using x = Reshape(())(x) as the final layer did the trick.
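The answer does not state which shapes were involved. One plausible way such a mismatch stays silent, assuming labels of shape (batch,) and predictions of shape (batch, 1), is NumPy-style broadcasting: the subtraction produces a (batch, batch) matrix instead of failing. A minimal sketch with made-up values:

```python
import numpy as np

# Assumed shapes: labels (4,), predictions (4, 1). The values are made up
# and the predictions are deliberately perfect.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([[1.0], [2.0], [3.0], [4.0]])

# Broadcasting silently expands (4,) - (4, 1) into a (4, 4) matrix of all
# pairwise differences instead of raising an error.
diff = y_true - y_pred
mse_mismatched = np.mean(diff ** 2)

# Dropping the trailing axis (what Reshape(()) does per sample) restores
# the intended element-wise comparison.
mse_matched = np.mean((y_true - y_pred.reshape(-1)) ** 2)

print(diff.shape, mse_mismatched, mse_matched)  # (4, 4) 2.5 0.0
```

Even with perfect predictions the mismatched loss is non-zero, so an optimizer minimizing it would be pulled towards predicting something like the batch mean rather than the individual targets.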

How to throw away some of the samples within a mini-batch during training

I'm using AdamOptimizer to train a simple DNN (with TensorFlow 1.4).
I want to throw away some bad samples within a batch during training. Say I have 4096 samples in a batch, and I want to throw 96 of them away and use only the remaining 4000 samples to calculate the loss and do backpropagation.
How can I achieve this?
The code set up is very straightforward like below:
labels = tf.reshape(labels, [batch_size, 1])
logits = tf.reshape(logits, [batch_size, 1])
loss_vector = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels,
                                                      logits=logits)
loss_scalar = tf.reduce_mean(loss_vector)
opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = opt.minimize(loss_scalar, global_step=global_step)
One possible solution is to apply a mask after loss_vector and before reduce_mean. But I'm not sure whether that is the right solution, and I have some questions about what is going on under the hood:
In the minimize() operation, since the input parameter is a scalar, how will it know that some of the input samples are masked out?
In the minimize() operation, how many backpropagation passes will happen?
During training, samples are fed into the graph one by one; how can TF know which should be kept and which should be thrown away?
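The masking idea mentioned in the question can be sketched in NumPy. This is only an illustration of the arithmetic, not TensorFlow code: the data is random, and the rule for picking "bad" samples (the 96 largest losses) is a hypothetical choice, since the question does not say how bad samples are identified.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_drop = 4096, 96

# Random stand-ins for the labels and logits of one batch.
labels = rng.integers(0, 2, size=batch_size).astype(np.float64)
logits = rng.normal(size=batch_size)

# Numerically stable per-sample sigmoid cross-entropy:
# max(x, 0) - x * z + log(1 + exp(-|x|))
loss_vector = (np.maximum(logits, 0) - logits * labels
               + np.log1p(np.exp(-np.abs(logits))))

# Hypothetical rule for "bad" samples: the n_drop largest losses.
drop_idx = np.argsort(loss_vector)[-n_drop:]
mask = np.ones(batch_size)
mask[drop_idx] = 0.0

# Masked mean: dropped samples add nothing to the scalar, so the scalar's
# gradient with respect to their logits is zero as well.
loss_scalar = np.sum(loss_vector * mask) / np.sum(mask)

print(int(mask.sum()), loss_scalar)
```

Because the scalar is built only from the kept entries, a single backward pass through it carries no gradient from the masked samples.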

Keras multiple input, output, loss model

I am working on super-resolution GAN and having some doubts about the code I found on Github. In particular, I have multiple inputs, multiple outputs in the model. Also, I have two different loss functions.
In the following code will the mse loss be applied to img_hr and fake_features?
# Build and compile the discriminator
self.discriminator = self.build_discriminator()
self.discriminator.compile(loss='mse',
                           optimizer=optimizer,
                           metrics=['accuracy'])
# Build the generator
self.generator = self.build_generator()
# High res. and low res. images
img_hr = Input(shape=self.hr_shape)
img_lr = Input(shape=self.lr_shape)
# Generate high res. version from low res.
fake_hr = self.generator(img_lr)
# Extract image features of the generated img
fake_features = self.vgg(fake_hr)
# For the combined model we will only train the generator
self.discriminator.trainable = False
# Discriminator determines validity of generated high res. images
validity = self.discriminator(fake_hr)
self.combined = Model([img_lr, img_hr], [validity, fake_features])
self.combined.compile(loss=['binary_crossentropy', 'mse'],
                      loss_weights=[1e-3, 1],
                      optimizer=optimizer)

In the following code will the mse loss be applied to img_hr and
fake_features?
From the documentation, https://keras.io/models/model/#compile
"If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses."
In this case, the mse loss will be applied to fake_features and the corresponding y_true passed as part of self.combined.fit().

In neural networks, a loss is applied to the outputs of a network to measure "how wrong is this output?", so that you can take this value and minimize it via gradient descent and backpropagation.
Following this intuition, the losses in Keras are a list with the same length as the outputs of your model. Each loss is applied to the output with the same index.
self.combined = Model([img_lr, img_hr], [validity, fake_features])
This gives you a model with 2 inputs (img_lr, img_hr) and 2 outputs (validity, fake_features). So combined.compile(loss=['binary_crossentropy', 'mse']... uses the binary_crossentropy loss for validity and the mean squared error for fake_features.
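How the per-output losses and loss_weights combine into the single value being minimized can be sketched in NumPy. This is not Keras code: the two loss functions are simplified re-implementations, and all arrays are made-up stand-ins for the validity and fake_features outputs and their targets.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); simplified stand-in for the Keras loss.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up stand-ins for the two model outputs and their targets.
validity_pred = np.array([0.9, 0.2, 0.7])
validity_true = np.array([1.0, 1.0, 1.0])
features_pred = np.array([[0.5, 1.0], [0.0, 2.0]])
features_true = np.array([[1.0, 1.0], [0.0, 1.0]])

losses = [binary_crossentropy, mse]   # one loss per output, same order
loss_weights = [1e-3, 1]
outputs = [(validity_true, validity_pred), (features_true, features_pred)]

# Loss i is applied to output i; the total is the weighted sum.
per_output = [fn(t, p) for fn, (t, p) in zip(losses, outputs)]
total = sum(w * l for w, l in zip(loss_weights, per_output))

print(per_output, total)
```

With loss_weights=[1e-3, 1], the feature loss dominates the total and the adversarial term acts as a small correction.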

What should I do to get low average loss?

I'm a student in hydraulic engineering, working on a neural network during my internship, so this is something new for me.
I created my neural network, but it gives me a high loss and I don't know what the problem is. Here is the code:
def create_model():
    model = Sequential()
    # Adding the input layer
    model.add(Dense(26,activation='relu',input_shape=(n_cols,)))
    # Adding the hidden layer
    model.add(Dense(60,activation='relu'))
    model.add(Dense(60,activation='relu'))
    model.add(Dense(60,activation='relu'))
    # Adding the output layer
    model.add(Dense(2))
    # Compiling the RNN
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
    return model
kf = KFold(n_splits = 5, shuffle = True)
model = create_model()
scores = []
for i in range(5):
    result = next(kf.split(data_input), None)
    input_train = data_input[result[0]]
    input_test = data_input[result[1]]
    output_train = data_output[result[0]]
    output_test = data_output[result[1]]
    # Fitting the RNN to the Training set
    model.fit(input_train, output_train, epochs=5000, batch_size=200 ,verbose=2)
    predictions = model.predict(input_test)
    scores.append(model.evaluate(input_test, output_test))
print('Scores from each Iteration: ', scores)
print('Average K-Fold Score :' , np.mean(scores))
When I execute my code, the result looks like this:
Scores from each Iteration: [[93.90406122928908, 0.8907562990148529], [89.5892979597845, 0.8907563030218878], [81.26530176050522, 0.9327731132507324], [56.46526102659081, 0.9495798339362905], [54.314151876112994, 0.9579831877676379]]
Average K-Fold Score : 38.0159922589274
Can anyone help me, please? What can I do to make the loss low?

There are several issues, both with your questions and with your code...
To start with, in general we cannot say that an MSE loss of X value is low or high. Unlike the accuracy in classification problems which is by definition in [0, 1], the loss is not similarly bounded, so there is no general way of saying that a particular value is low or high, as you imply here (it always depends on the specific problem).
Having clarified this, let's go to your code.
First, judging from your loss='mean_squared_error', it would seem that you are in a regression setting, in which accuracy is meaningless; see What function defines accuracy in Keras when the loss is mean squared error (MSE)?. You have not shared what exact problem you are trying to solve here, but if it is indeed a regression one (i.e. prediction of some numeric value), you should get rid of metrics=['accuracy'] in your model compilation, and possibly change your last layer to a single unit, i.e. model.add(Dense(1)).
Second, as your code currently is, you don't actually fit independent models from scratch in each of your CV folds (which is the very essence of CV); in Keras, model.fit works cumulatively, i.e. it does not "reset" the model each time it is called, but it continues fitting from the previous call. That's exactly why if you see your scores, it is evident that the model is significantly better in the later folds (which already gives a hint for improving: add more epochs). To fit independent models as you should do for a proper CV, you should move create_model() inside the for loop.
Third, your usage of np.mean() here is again meaningless, as you average both the loss and the accuracy (i.e. apples with oranges) together; the fact that from 5 values of loss between 54 and 94 you end up with an "average" of 38 should have already alerted you that you are attempting something wrong. Truth is, if you dismiss the accuracy metric, as argued above, you would not have this problem here.
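The effect of averaging loss and accuracy together can be reproduced directly from the numbers in the question. The sketch below only uses NumPy; the variable names are chosen for the illustration.

```python
import numpy as np

# [loss, accuracy] pairs copied from the question's output.
scores = np.array([
    [93.90406122928908, 0.8907562990148529],
    [89.5892979597845, 0.8907563030218878],
    [81.26530176050522, 0.9327731132507324],
    [56.46526102659081, 0.9495798339362905],
    [54.314151876112994, 0.9579831877676379],
])

# Averages all ten numbers together, mixing losses with accuracies.
mixed = np.mean(scores)

# Averages each column separately: mean loss and mean accuracy.
mean_loss, mean_acc = np.mean(scores, axis=0)

print(mixed, mean_loss, mean_acc)
```

The mixed value matches the 38.0159922589274 reported in the question, while the per-column mean loss is about 75.1.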
All in all, here is how it seems that your code should be in principle (but again, I have not the slightest idea of the exact problem you are trying to solve, so some details might be different):
def create_model():
    model = Sequential()
    # Adding the input layer
    model.add(Dense(26, activation='relu', input_shape=(n_cols,)))
    # Adding the hidden layers
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    model.add(Dense(60, activation='relu'))
    # Adding the output layer
    model.add(Dense(1))  # change to 1 unit
    # Compiling the model
    model.compile(optimizer='adam', loss='mean_squared_error')  # dismiss accuracy
    return model
kf = KFold(n_splits=5, shuffle=True)
scores = []
# iterate over the folds directly, so that each sample is held out exactly once
for train_idx, test_idx in kf.split(data_input):
    input_train = data_input[train_idx]
    input_test = data_input[test_idx]
    output_train = data_output[train_idx]
    output_test = data_output[test_idx]
    # Fitting a fresh model to the training set of this fold
    model = create_model()  # move create_model here
    model.fit(input_train, output_train, epochs=10000, batch_size=200, verbose=2)  # increase the epochs
    predictions = model.predict(input_test)
    scores.append(model.evaluate(input_test, output_test))
print('Loss from each Iteration: ', scores)
print('Average K-Fold Loss :', np.mean(scores))

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., computing the gradients that are used to update the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
import torch.optim as optim
def linear_model(x, W, b):
    return torch.matmul(x, W) + b
data, targets = ...
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so that backward() can be called on it
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
learning_rate = 0.01  # assumed value; not given in the original
for sample, target in zip(data, targets):
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update in-place without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
        # clear out the gradients of W and b so that they
        # do not accumulate into the next iteration
        W.grad.zero_()
        b.grad.zero_()
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None via optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but might be error-prone if not handled carefully.
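To see why stale gradients matter without depending on PyTorch, the accumulation behaviour can be mimicked in plain Python. This is only a toy sketch: the objective f(w) = w**2, the learning rate, and the function name run are all made up for the illustration, and the stored gradient is a simple float that gets added to on each "backward" step.

```python
def run(reset_grad, steps=50, lr=0.1):
    """Minimize f(w) = w**2, optionally clearing the stored gradient each step."""
    w, grad = 1.0, 0.0
    history = []
    for _ in range(steps):
        if reset_grad:
            grad = 0.0       # plays the role of optimizer.zero_grad()
        grad += 2.0 * w      # like backward(): adds the new gradient to the stored one
        w -= lr * grad       # like optimizer.step() for plain SGD
        history.append(w)
    return history

with_reset = run(reset_grad=True)
without_reset = run(reset_grad=False)

# With resetting, w shrinks towards the minimum at 0. Without it, the stale
# gradients keep pushing w past the minimum, so it keeps oscillating.
print(abs(with_reset[-1]), max(abs(w) for w in without_reset[-20:]))
```

In this toy setting the run without resetting never settles, which matches the description above of an update direction that mixes old and new gradients.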

Although the idea can be derived from the chosen answer, I want to state it explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or input data is big and one actual training batch does not fit on the GPU.
In this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for each forward pass and the loss.backward() that follows it. This is limited by the GPU memory.
gradient_accumulation_steps is the number of such batches whose gradients are accumulated before a single optimizer update, so the actual training batch size is train_batch_size multiplied by gradient_accumulation_steps. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() is paired with optimizer.step() but NOT with loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227, inside the if block at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
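The arithmetic behind gradient accumulation can be checked with a small NumPy sketch. It is not taken from the linked script: the linear model, the random data, and the helper grad_mse are made up, and it assumes the common convention of dividing each micro-batch loss by the number of accumulation steps.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # made-up inputs
y = rng.normal(size=8)        # made-up targets
w = rng.normal(size=3)        # made-up weights of a linear model

def grad_mse(Xb, yb, w):
    """Gradient of mean((Xb @ w - yb) ** 2) with respect to w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# One large batch of 8 samples.
full_grad = grad_mse(X, y, w)

# Four micro-batches of 2 samples each. Every micro-batch gradient is
# scaled by 1 / accumulation_steps and added to the running sum, like
# calling backward() on loss / accumulation_steps without zero_grad().
accumulation_steps = 4
accumulated = np.zeros_like(w)  # the state right after zero_grad()
for Xb, yb in zip(np.split(X, accumulation_steps),
                  np.split(y, accumulation_steps)):
    accumulated += grad_mse(Xb, yb, w) / accumulation_steps
# optimizer.step() would use `accumulated` here, then zero_grad() again.

print(np.allclose(full_grad, accumulated))  # True
```

Under these assumptions the accumulated gradient equals the gradient of the full batch, which is why the effective batch size is no longer bounded by what fits in memory at once.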
Someone also asked about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.
(I am still new to AI libraries; please correct me if anything I said is wrong.)

zero_grad() clears the gradients left over from the last step, so that each iteration of the loop starts fresh when you use gradient descent to decrease the error (or loss).
If you do not use zero_grad(), the loss may increase instead of decreasing as required.
For example:
If you use zero_grad(), you might get output like the following:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad(), you might get output like the following:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer
# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.

During the forward pass the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs) it has seen. When we start backpropagation, we want to update the weights so as to minimize the loss of our cost function. So we clear the gradients from the previous step (not the weights themselves) in order to obtain better weights from the current gradients alone. We keep doing this during training, and we do not do it during testing, because at test time we already have the weights that were fitted to our data during training. Hope this clears things up!

In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients or past results to interfere with our current results. PyTorch collects (accumulates) the gradients on backpropagation, so past results may get mixed in and give us wrong results. That is why we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values get added up and change the result.
So we use zero_grad() to avoid wrong accumulated results.
