I am working on a very sparse dataset with the goal of predicting 6 classes.
I have tried working with a lot of models and architectures, but the problem remains the same.
When I start training, the training accuracy slowly starts to increase and the training loss decreases, whereas the validation metrics do the exact opposite.
I have really tried to deal with overfitting, and I still cannot believe that this is what is causing the issue.
What I have tried
Transfer learning on VGG16:
exclude the top layers and add a dense layer with 256 units plus a 6-unit softmax output layer
finetune the top CNN block
finetune the top 3-4 CNN blocks
To deal with overfitting I use heavy augmentation in Keras and dropout after the 256 dense layer with p=0.5.
Creating my own CNN with a VGG16-ish architecture:
including batch normalization wherever possible
L2 regularization on each CNN+dense layer
Dropout from anywhere between 0.5-0.8 after each CNN+dense+pooling layer
Heavy data augmentation "on the fly" in Keras
Realising that perhaps I have too many free parameters:
decreasing the network to only contain 2 CNN blocks + dense + output.
dealing with overfitting in the same manner as above.
Without exception all training sessions are looking like this:
Training & Validation loss+accuracy
The last mentioned architecture looks like this:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Activation, Dropout, Flatten, Dense
from keras import regularizers

reg = 0.0001

model = Sequential()

model.add(Conv2D(8, (3, 3), input_shape=input_shape, padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

model.add(Conv2D(16, (3, 3), padding='same',
                 kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.7))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

model.add(Flatten())
model.add(Dense(16, kernel_regularizer=regularizers.l2(reg)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(6))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
And the data is augmented by the generator in Keras and is loaded with flow_from_directory:
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rotation_range=10,
                                   width_shift_range=0.05,
                                   height_shift_range=0.05,
                                   shear_range=0.05,
                                   zoom_range=0.05,
                                   rescale=1/255.,
                                   fill_mode='nearest',
                                   channel_shift_range=0.2*255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=True,
    class_mode='categorical')

validation_datagen = ImageDataGenerator(rescale=1/255.)

validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=1,
    shuffle=True,
    class_mode='categorical')
What I can think of by analyzing your metric outputs (from the link you provided):
It seems to me that approximately around epoch 30 your model starts to overfit. Therefore you can try stopping your training at that point, or just train it for ~30 epochs (or the exact number). The Keras Callbacks may be useful here, especially ModelCheckpoint, which saves your model as training progresses so that you can stop when desired (Ctrl+C) or when certain criteria are met. Here is an example of basic ModelCheckpoint use:
from keras.callbacks import ModelCheckpoint

# set save_best_only=True to save only when the monitored metric improves
chk = ModelCheckpoint("myModel.h5", monitor='val_loss', save_best_only=False)
callbacks_list = [chk]

# pass the callbacks to fit
history = model.fit(X, Y, ... , callbacks=callbacks_list)
(Edit:) As suggested in the comments, another option you have available is the EarlyStopping callback, where you can specify the minimum change tolerated and the 'patience' (number of epochs without such improvement) before stopping the training. If you use it, pass it to the callbacks argument as explained before.
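A minimal sketch of how EarlyStopping could be combined with ModelCheckpoint (the min_delta and patience values here are only placeholders to tune; model, X and Y are assumed from your own code):

from keras.callbacks import EarlyStopping, ModelCheckpoint

# stop if val_loss has not improved by at least 0.001 for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5)
# keep the best model seen so far on disk
chk = ModelCheckpoint("myModel.h5", monitor='val_loss', save_best_only=True)

history = model.fit(X, Y, epochs=100, validation_split=0.2,
                    callbacks=[early_stop, chk])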
With the current setup your model has (and with the modifications you have tried), that point in your training seems to be the optimal training time for your case; training further will bring no benefit to your model (in fact, it will make it generalize worse).
Given that you have tried several modifications, one thing you can do is try to increase your network depth, to give it more capacity. Try adding more layers, one at a time, and check for improvements. Also, you usually want to start with simpler models first, before attempting a multi-layer solution.
If a simple model doesn't work, add one layer and test again, repeating until you are satisfied or run out of options. And by simple I mean really simple: have you tried a non-convolutional approach? Although CNNs are great for images, maybe you are overkilling it here.
If nothing seems to work, maybe it is time to get more data, or to generate more data from what you have by sampling or other techniques. For that last suggestion, try checking this Keras blog post, which I have found really useful. Deep learning algorithms usually require a substantial amount of training data, especially for complex inputs like images, so be aware this may not be an easy task. Hope this helps.
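For example, a really simple non-convolutional baseline might be no more than this (a sketch; input_shape is assumed to be the same image shape used above):

from keras.models import Sequential
from keras.layers import Flatten, Dense

baseline = Sequential()
baseline.add(Flatten(input_shape=input_shape))  # e.g. (img_width, img_height, 3)
baseline.add(Dense(32, activation='relu'))
baseline.add(Dense(6, activation='softmax'))
baseline.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])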
IMHO, this is just a normal situation for DL. In Keras you can set up a callback that will save the best model (depending on the evaluation metric that you provide), and a callback that will stop training if the model isn't improving.
See the ModelCheckpoint & EarlyStopping callbacks respectively.
P.S. Sorry, maybe I misunderstood the question - is your validation loss decreasing from the first step?
Validation loss is increasing. This means you need more data or more regularization. It is a standard situation here, and nothing to be worried about. By the way, more parameters (a bigger model) will only make this problem worse unless you fix it.
So you can now profitably investigate by introducing more examples, L2, L1, or dropout.
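In Keras those options look roughly like this on any given layer (a sketch; the coefficients are placeholders you would tune, and model is an existing Sequential):

from keras import regularizers
from keras.layers import Dense, Dropout

# L1 and/or L2 penalties on the layer weights, plus dropout on its activations
model.add(Dense(64, activation='relu',
                kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.5))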
I faced a similar problem and managed to fix it by removing the Batch Normalisation layer that is just before the output dense layer. This made a ton of difference. Also, one of the suggestions I was given was to remove the Dropout layer, as it might be causing variance shift. Check this paper.
I got part of the solution from this thread.
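For the architecture posted in the question, that suggestion would amount to changing the tail of the model to something like this (a sketch; only the layers after Flatten are shown):

model.add(Flatten())
model.add(Dense(16, kernel_regularizer=regularizers.l2(reg)))
model.add(Activation('relu'))
# BatchNormalization (and optionally Dropout) removed just before the output layer
model.add(Dense(6))
model.add(Activation('softmax'))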
Related
I'm training some CNN networks on proprietary data using Tensorflow. We have boatloads of data, and it seems that these models are capable of learning a great deal of information about classifying data (all binary classifications so far).
Sometimes, the train/test accuracy curves can be remarkably good, upwards of 95% in some cases. However, the loss values are suspicious in terms of scale. Visually they look alright, about how I'd expect for something performing well, but they aren't at the order of magnitude I'd expect.
Can anyone tell me how this scaling is usually appropriately done in TF/Keras? I'm confident in these models, as they've been tested on other datasets and generalized very well, but the screwy loss function isn't very nice to report.
The learning rate is on the order of 0.0001. L1 and L2 are using the same lambda value, which I've had the most success with when providing to the model as somewhere between 0.01 and 0.03. I'm currently not using any dropout.
I'm including photos of a particularly highly variant accuracy run. This isn't always the case, but it does happen sometimes. I suspect that this problem is partly due to outlier data, or possibly the regularization values.
Here are relevant code snippets.
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense

# snippet from a larger class; depth, dropout, regularizer_param, learning_rate,
# logistic_regression and self.groups come from the surrounding code
model = tf.keras.models.Sequential()

if logistic_regression is not True:
    for i in range(depth):
        # convolutional block
        model.add(Conv2D(
            15,
            kernel_size=(10, 3),
            strides=1,
            padding='same',
            activation='relu',
            data_format='channels_last',
            kernel_regularizer=tf.keras.regularizers.l1_l2(
                l1=regularizer_param,
                l2=regularizer_param)
        ))
        model.add(MaxPooling2D(
            pool_size=(3, 3),
            strides=1,
            padding='valid',
            data_format='channels_last'))
        model.add(BatchNormalization())
        if dropout is not None:
            model.add(Dropout(dropout))

# flatten
model.add(Flatten(data_format='channels_last'))

model.add(Dense(
    len(self.groups),
    # use_bias=True if initial_bias is not None else False,
    # bias_initializer=initial_bias
    # if initial_bias is not None
    # else None,
    kernel_regularizer=tf.keras.regularizers.l1_l2(
        l1=regularizer_param,
        l2=regularizer_param)
))

model.compile(
    optimizer=tf.keras.optimizers.Adagrad(
        learning_rate=learning_rate,
        initial_accumulator_value=0.1,
        epsilon=1e-07,
        name='Adagrad'),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
You should not worry about the scale of your loss function values. Remember, the loss function is simply a measure of how far off your network's predictions are; you can always scale it any way you like. What does matter is the trend in the loss over epochs: you want it to be a smooth decrease, which is what your second figure shows.
Losses are just that: an arbitrary number that is only meaningful in a relative sense, for the same network, on the same dataset. They have no other meaning. In fact, losses do not correspond well with metrics either: see Huang et al., 2019.
as they've been tested on other datasets and generalized very well,
That's what matters.
but the screwy loss function isn't very nice to report.
You could scale these losses by 1,000. They're only meaningful in a relative sense.
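If you just want a nicer number to report, one option (a sketch, not something the answer above requires) is to wrap the built-in loss and multiply it by a constant; the factor of 1000 is arbitrary and model is assumed from the question:

import tensorflow as tf

base_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def scaled_loss(y_true, y_pred):
    # purely cosmetic rescaling; note it also scales gradients, so with a
    # non-adaptive optimizer you would compensate via the learning rate
    return 1000.0 * base_loss(y_true, y_pred)

model.compile(optimizer='adam', loss=scaled_loss, metrics=['accuracy'])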
References:
Huang et al., 2019. Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
I am using this model. While using it, validation accuracy is increasing, but at the same time validation loss is also increasing. What is happening here?
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

model_alpha1 = Sequential()
model_alpha1.add(Dense(64, input_dim=96, activation='relu'))
model_alpha1.add(Dense(2, activation='softmax'))

opt_alpha1 = Adam(lr=0.001)
model_alpha1.compile(loss='sparse_categorical_crossentropy', optimizer=opt_alpha1,
                     metrics=['accuracy'])

history = model_alpha1.fit(x_train, y_train, validation_data=(x_test, y_test),
                           epochs=200, verbose=1)
If you need any more details, just comment and I will provide them. Thank you.
It seems that your model is overfitting. In other words, your model is fitting the training data too closely, and that is why it is no longer performing as well on the validation data.
Typical way to prevent overfitting is to use regularization techniques. For example:
Dropout layers https://keras.io/api/layers/regularization_layers/dropout/
Early stopping https://keras.io/api/callbacks/early_stopping/
Noise https://keras.io/api/layers/regularization_layers/gaussian_noise/
Try training a less deep NN for your problem, or try dropout layers (or both, depending on how each affects your results). From your figure, we can see that the overfitting starts after ~25 epochs.
Overfitting may be caused, for example, by using a model that is too complex for a data set that is not large enough. Or you simply train the model too long! (Here early stopping will fix the issue.)
Here some regularization examples with TF: https://tensorflow.rstudio.com/tutorials/beginners/basic-ml/tutorial_overfit_underfit/
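A sketch of how the first and third suggestions (dropout and input noise), plus early stopping, could be applied to the model from the question; the layer sizes, noise level, and patience are placeholder values:

from keras.models import Sequential
from keras.layers import Dense, Dropout, GaussianNoise
from keras.callbacks import EarlyStopping

model_reg = Sequential()
model_reg.add(GaussianNoise(0.1, input_shape=(96,)))  # add noise to the 96 input features
model_reg.add(Dense(64, activation='relu'))
model_reg.add(Dropout(0.3))                           # randomly drop 30% of activations
model_reg.add(Dense(2, activation='softmax'))
model_reg.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model_reg.fit(x_train, y_train, validation_data=(x_test, y_test),
                        epochs=200, callbacks=[early_stop], verbose=1)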
When training a classification model, it is not possible to optimize for accuracy directly since it is not a differentiable function. Therefore, we use cross-entropy as our loss function, which is highly correlated with accuracy. When inspecting our metrics it is important to remember that these are still two different metrics.
In terms of CE loss, your model is exhibiting textbook overfitting. However, in terms of accuracy, which is what you are actually interested in, it has simply "finished training". This is why we track not only the loss but also the metrics we actually care about, so that we make our decisions based on them.
I want to design three models that have the same structure, but in the end one of them should have serious overfitting, another should have less overfitting, and the last should have no overfitting.
The idea is that I want to see how much information exists in the last layer of each model for some test data. Let's say I'm using the MNIST dataset as the training and testing set, and the structure of all models should be like this.
from keras.models import Sequential
from keras.layers import Dense

# Network architecture
network = Sequential()
# input layer
network.add(Dense(512, activation='relu', input_shape=(28*28,)))
# Hidden layers
network.add(Dense(64, activation='relu', name='features'))
# Output layer
network.add(Dense(10, activation='softmax'))

network.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
history = network.fit(train_img, train_label, epochs=50, batch_size=256, validation_split=0.2)
So now the question is how to change this training setup so that it fulfills my needs for three models with different amounts of overfitting.
I'm new to machine learning and I hope I have explained my question as well as possible.
Thanks in advance
Overfit:
The MNIST dataset is rather simple, therefore it should be easy to overfit with the model you are suggesting. Increase the number of epochs: eventually, your model will memorize the training data very well. If you struggle to overfit the data, you might need a more complex network - but I doubt that this will be the case.
Just right:
Probably the easiest way to obtain the model which is just right (no overfitting or underfitting) is to use a callback. Specifically, we can use early stopping. The callback will stop training if the validation loss stops improving. For your code, all you have to do is modify the training as follows:
First define a callback
callback_es = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss')
Add the callback to your training
history = network.fit(train_img, train_label, epochs=50, batch_size=256, validation_split=0.2, callbacks=[callback_es])
Underfit
Similar idea as with overfitting. In this case, you want to stop your training early on: train your model for a limited number of epochs only. If you find that your model overfits too quickly, try lowering the learning rate.
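Putting the three cases together, a sketch of how the training calls could differ (the epoch counts and patience are placeholders; each case assumes you rebuild network from scratch first so the runs don't share weights):

import tensorflow as tf

# 1) overfit: many epochs, no early stopping
hist_overfit = network.fit(train_img, train_label, epochs=200,
                           batch_size=256, validation_split=0.2)

# 2) just right: early stopping on validation loss (rebuild network first)
callback_es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                               restore_best_weights=True)
hist_right = network.fit(train_img, train_label, epochs=200, batch_size=256,
                         validation_split=0.2, callbacks=[callback_es])

# 3) underfit: stop after only a couple of epochs (rebuild network first)
hist_underfit = network.fit(train_img, train_label, epochs=2,
                            batch_size=256, validation_split=0.2)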
I am training an image classifier with 2 classes and 53k images, and validating it with 1.3k images using keras. Here is the structure of the neural network :
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy', metrics=['accuracy'])
Training accuracy increases from ~50% to ~85% in the first epoch, with 85% validation accuracy. Subsequent epochs increase the training accuracy consistently; however, validation accuracy stays in the 80-90% region.
I'm curious: is it possible to get high validation and training accuracy in the first epoch? If my understanding is correct, accuracy should start small and increase steadily with each passing epoch.
Thanks
EDIT : The image size is 150x150 after rescaling and the mini-batch size is 16.
Yes, it is entirely possible to get high accuracy in the first epoch and then only modest improvements.
If there is enough redundancy in the data and you make enough updates in the first epoch (relative to the complexity of your model, which seems fairly easy to optimize), i.e. you use small minibatches, it is entirely possible that you learn most of the important stuff during the first epoch. When you show the data again, the model will start overfitting to peculiarities introduced by the specific images in your training set (thus you get increasing training accuracy), but since you do not provide any novel samples, it will not learn anything new about the underlying properties of your classes.
You can think of your training data as an infinite stream (which is actually how SGD would like to see it in order to enjoy all of its convergence theorems). Do you think that you need more than 50k samples to learn what is important? You can actually test the data-hunger of your model by providing less data, or by reporting performance after some sub-epoch number of updates.
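A sketch of that last idea: train on increasingly large subsets and watch when validation accuracy stops improving (build_model is a hypothetical helper that returns a freshly initialized copy of the model above; train_labels, val_data and val_labels are assumed names for your labels and held-out set):

# assumes train_data/train_labels are already shuffled
for n in [5000, 10000, 20000, 40000]:
    m = build_model()  # hypothetical: rebuilds the model from scratch
    m.fit(train_data[:n], train_labels[:n], epochs=1, batch_size=16,
          validation_data=(val_data, val_labels), verbose=0)
    loss, acc = m.evaluate(val_data, val_labels, verbose=0)
    print(n, acc)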
You cannot expect to get an accuracy over 90-95% for image classification using feed-forward neural networks.
You need to use another architecture called a Convolutional Neural Network (CNN), the state of the art in image recognition.
It is also very easy to build one using Keras, but it is computationally more intensive than this.
If you want to stick with feed-forward layers, the best thing you can do is early stopping, but even that wouldn't give you accuracy over 90%.
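For reference, a minimal convolutional version of the classifier might look like this (a sketch; the 150x150x3 input shape and filter counts are assumptions based on the question):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

cnn = Sequential()
cnn.add(Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
cnn.add(MaxPooling2D((2, 2)))
cnn.add(Conv2D(64, (3, 3), activation='relu'))
cnn.add(MaxPooling2D((2, 2)))
cnn.add(Flatten())
cnn.add(Dense(256, activation='relu'))
cnn.add(Dropout(0.5))
cnn.add(Dense(1, activation='sigmoid'))
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])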
Yes, epochs are supposed to fit the data to the model.
Try using 2 neurons at the end and one-hot encoding your class labels!
I have seen a case where I got better results doing that, instead of a single binary output.
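A sketch of that change against the model in the question (train_labels and val_labels are hypothetical names for the integer 0/1 labels):

from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout
from keras.utils import to_categorical

# one-hot encode the binary labels: 0 -> [1, 0], 1 -> [0, 1]
train_labels_oh = to_categorical(train_labels, num_classes=2)
val_labels_oh = to_categorical(val_labels, num_classes=2)

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))  # 2 units instead of a single sigmoid
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])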
I am trying to find the cost function in Keras. I am running an LSTM with the categorical_crossentropy loss function and I added a regularizer. How do I output what the cost function looks like after adding my regularizer, for my own analysis?
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation
from keras.optimizers import RMSprop
from keras import regularizers

model = Sequential()
model.add(LSTM(
    NUM_HIDDEN_UNITS,
    return_sequences=True,
    input_shape=(PHRASE_LEN, SYMBOL_DIM),
    kernel_regularizer=regularizers.l2(0.01)
))
model.add(Dropout(0.3))
model.add(LSTM(NUM_HIDDEN_UNITS, return_sequences=False))
model.add(Dropout(0.3))
model.add(Dense(SYMBOL_DIM))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=1e-03, rho=0.9, epsilon=1e-08))
How do I output what the cost function looks like after my regularizer, for my own analysis?
Surely you can achieve this by obtaining the output (yourlayer.output) of the layer you want to see and printing it (see here). However, there are better ways to visualize these things.
Meet Tensorboard.
This is a powerful visualization tool that enables you to track and visualize your metrics, outputs, architecture, kernel initializations, etc. The good news is that there is already a Tensorboard Keras Callback that you can use for this purpose; you just have to import it. To use it, just pass an instance of the callback to your fit method, something like this:
from keras.callbacks import TensorBoard

# indicate the folder to save to, plus other options
tensorboard = TensorBoard(log_dir='./logs/run1', histogram_freq=1,
                          write_graph=True, write_images=False)

# save it in your callback list
callbacks_list = [tensorboard]

# then pass to fit as a callback; remember to use validation_data also
model.fit(X, Y, callbacks=callbacks_list, epochs=64,
          validation_data=(X_test, Y_test), shuffle=True)
After that, start your Tensorboard server (it runs locally on your PC) by executing:
tensorboard --logdir=logs/run1
For example, this is what my Kernels look like on two different models I tested (to compare them you have to save separate runs and then start Tensorboard on the parent directory instead). This is on the Histograms tab, on my second layer:
The model on the left I initialized with kernel_initializer='random_uniform', thus its shape is that of a uniform distribution. The model on the right I initialized with kernel_initializer='normal', which is why it appears as a Gaussian distribution throughout my epochs (about 30).
This way you could visualize how your kernels and layers "look like", in a more interactive and understandable way than printing outputs. This is just one of the great features Tensorboard has, and it can help you develop your Deep Learning models faster and better.
Of course there are more options to the Tensorboard Callback and for Tensorboard in general, so I do suggest you thoroughly read the links provided if you decide to attempt this. For more information you can check this and also this questions.
Edit: So, you commented that you want to know how your regularized loss "looks" analytically. Let's remember that by adding a regularizer to a loss function we are basically extending the loss function to include some "penalty" or preference in it. So, if you are using cross-entropy as your loss function and adding an l2 regularizer (that is, a squared Euclidean norm penalty) with a weight of 0.01, your whole loss function would look something like: