Keras: freezing layers during training does not give consistent output - python

I am trying to fine-tune a model using keras, according to this description:
However, during training I discovered that the output of the network does not remain constant after training when using the same input (while all relevant layers were frozen), which I do not want.
I constructed the following toy example to investigate this:
import keras.applications.resnet50 as resnet50
from keras.layers import Dense, Flatten, Input
from keras.models import Model
from keras.utils import to_categorical
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
# data
i = np.random.rand(1,224,224,3)
X = np.random.rand(32,224,224,3)
y = to_categorical(np.random.randint(751, size=32), num_classes=751)
# model
base_model = resnet50.ResNet50(weights='imagenet', include_top=False, input_tensor=Input(shape=(224,224,3)))
layer = base_model.output
layer = Flatten(name='myflatten')(layer)
layer = Dense(751, activation='softmax', name='fc751')(layer)
model = Model(inputs=base_model.input, outputs=layer)
# freeze all layers
for layer in model.layers:
    layer.trainable = False
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# features and predictions before training
feat0 = base_model.predict(i)
pred0 = model.predict(i)
weights0 = model.layers[-1].get_weights()
# before training output is consistent
feat00 = base_model.predict(i)
pred00 = model.predict(i)
print(np.allclose(feat0, feat00)) # True
print(np.allclose(pred0, pred00)) # True
# train
model.fit(X, y, batch_size=2, epochs=3, shuffle=False)
# features and predictions after training
feat1 = base_model.predict(i)
pred1 = model.predict(i)
weights1 = model.layers[-1].get_weights()
# these are not the same
print(np.allclose(feat0, feat1)) # False
# Optionally: printing shows they are in fact very different
# print(feat0)
# print(feat1)
# these are not the same
print(np.allclose(pred0, pred1)) # False
# Optionally: printing shows they are in fact very different
# print(pred0)
# print(pred1)
# these are the same and loss does not change during training
# so layers were actually frozen
print(np.allclose(weights0[0], weights1[0])) # True
# Check again if all layers were in fact untrainable
for layer in model.layers:
    assert layer.trainable == False # All succeed
# Being overly cautious also checking base_model
for layer in base_model.layers:
    assert layer.trainable == False # All succeed
Since I froze all layers, I fully expected both the predictions and the features to be equal, but surprisingly they aren't.
So I am probably making some kind of mistake, but I can't figure out what. Any suggestions would be greatly appreciated!

The problem seems to be that the model uses batch normalization layers, which update their internal state (i.e. their moving mean and variance, which are stored as weights) based on the data seen during training. This happens even when their trainable flag has been set to False. As these weights are updated, the output changes as well. You can check this by taking the code in the question and changing the following lines:
This weights0 = model.layers[-1].get_weights()
to weights0 = model.layers[2].get_weights()
and this weights1 = model.layers[-1].get_weights()
to weights1 = model.layers[2].get_weights()
or the index of any other batch normalization layer.
Because then the following assertion will no longer hold:
print(np.allclose(weights0, weights1)) # Now this is False
As far as I am aware, there is currently no solution for this.
See also my issue on Keras' Github page.
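To make the mechanism concrete, here is a minimal pure-Python sketch (not Keras code) of how a batch normalization layer's moving statistics can drift during training-mode forward passes, and how that changes the inference output for a fixed input. The function names are hypothetical; the momentum of 0.99 and epsilon of 1e-3 mirror the Keras BatchNormalization defaults, and gamma and beta are assumed to be 1 and 0.

```python
import math

def bn_inference(x, moving_mean, moving_var, eps=1e-3):
    # Inference-mode batch norm for a single scalar feature (gamma=1, beta=0).
    return (x - moving_mean) / math.sqrt(moving_var + eps)

def update_moving_stats(batch, moving_mean, moving_var, momentum=0.99):
    # Exponential moving average update applied on each training batch.
    batch_mean = sum(batch) / len(batch)
    batch_var = sum((v - batch_mean) ** 2 for v in batch) / len(batch)
    new_mean = momentum * moving_mean + (1 - momentum) * batch_mean
    new_var = momentum * moving_var + (1 - momentum) * batch_var
    return new_mean, new_var

moving_mean, moving_var = 0.0, 1.0   # assumed initial statistics
x = 0.5                              # the fixed test input

before = bn_inference(x, moving_mean, moving_var)

# "Training" on three tiny batches: no gradient step is taken anywhere,
# yet the stored statistics still change.
for batch in ([2.0, 4.0], [3.0, 5.0], [1.0, 3.0]):
    moving_mean, moving_var = update_moving_stats(batch, moving_mean, moving_var)

after = bn_inference(x, moving_mean, moving_var)
print(before, after)  # the two values differ
```

No learnable parameter is touched in this sketch, which matches the observation in the question: the Dense weights stay identical while the features and predictions move.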

Another reason for unstable training could be your very small batch size, batch_size=2. Use at least batch_size=32. A batch of 2 is too small for batch normalization to reliably estimate the statistics (mean and variance) of the training distribution. These mean and variance values are first used to normalize the activations, after which the learned beta and gamma parameters shift and scale them (restoring the actual distribution).
Check the following links for more details:
In the introduction and related works, the authors criticized BatchNorm and do check figure 1:
Nice article on "Curse of Batch Norm":
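As a rough illustration of the batch-size point, the following standard-library sketch draws batches from a unit Gaussian and measures how much the per-batch mean fluctuates for batch sizes 2 and 32. The function name, the number of batches and the seed are assumptions made for the example; it says nothing about any particular network.

```python
import random
import statistics

def spread_of_batch_means(batch_size, n_batches=500, seed=0):
    # Standard deviation of the per-batch mean over many random batches.
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.mean(batch))
    return statistics.pstdev(means)

spread_small = spread_of_batch_means(2)
spread_large = spread_of_batch_means(32)
print(spread_small, spread_large)
```

In theory the spread scales as 1/sqrt(batch_size), so the estimate from batches of 2 should be about four times noisier than the one from batches of 32.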


Adding a rescaling layer (or any layer for that matter) to a trained tensorflow keras model

I have a tensorflow keras model trained with tensorflow 2.3. The model takes an image as input; however, the model was trained with scaled inputs, and therefore we have to scale the image by 255 before passing it to the model.
As we use this model across a variety of platforms, I am trying to simplify this by modifying the model to insert a rescale layer at the start of the keras model (i.e. immediately after the input). That way, any future consumer of this model can simply pass an image without having to scale it.
I am having a lot of trouble getting this to work. I understand I need to use the following function to create a rescaling layer:
tf.keras.layers.experimental.preprocessing.Rescaling(255, 0.0, "rescaling")
But I am unsure how to insert this to the start of the model.
Thank you in advance
You can insert this layer at the top of your trained model. Below is an example where we first train a model while scaling the input manually, and then reuse the same trained model with a Rescaling layer added at the top:
import numpy as np
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.layers.experimental.preprocessing import Rescaling
# generate dummy data
input_dim = (28,28,3)
n_sample = 10
X = np.random.randint(0,255, (n_sample,)+input_dim)
y = np.random.uniform(0,1, (n_sample,))
# create base model
inp = Input(input_dim)
x = Conv2D(8, (3,3))(inp)
x = Flatten()(x)
out = Dense(1)(x)
# fit base model with manual scaling
model = Model(inp, out)
model.compile('adam', 'mse')
model.fit(X/255, y, epochs=3)
# create new model with pretrained weight + rescaling at the top
inp = Input(input_dim)
scaled_input = Rescaling(1/255, 0.0, "rescaling")(inp)
out = model(scaled_input)
scaled_model = Model(inp, out)
# compare prediction with manual scaling vs layer scaling
pred = model.predict(X/255)
pred_scaled = scaled_model.predict(X)
(pred.round(5) == pred_scaled.round(5)).all() # True
Rescaling the images is part of data preprocessing; it is also called image normalization. It provides a uniform scale for the dataset or numerical values you are using before building your model. In Keras you can do this in several ways, depending on your goal:
If you are training a neural network model, you can use
a "Batch normalization" layer, a "Layer Normalization" layer, or the rescaling layer you mentioned. You can look at this resource for more information about normalization.
To use the rescaling layer you mentioned:
# import your libraries first
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization
# if you are loading the dataset from a directory
import pathlib
Then import your dataset:
Dataset_Dir = '/Dataset/ path'
image_size = (256,256) # the image size in your dataset
image_shape = (96,96,3) # the shape you want for your images in the network
Then split your dataset into training and test sets; I use a 70/30 split:
Training_set = tf.keras.preprocessing.image_dataset_from_directory(Dataset_Dir, batch_size=32,
    image_size=image_size,
    validation_split=0.3, subset="training", seed=123)
Test set:
Testing_set = tf.keras.preprocessing.image_dataset_from_directory(Dataset_Dir, image_size=image_size,
    validation_split=0.3, seed=123, subset="validation")
Normalization layer:
normalization_layer = tf.keras.layers.experimental.preprocessing.Rescaling(1./255)
normalized_training_set = Training_set.map(lambda x, y: (normalization_layer(x), y))
training_image_batch,training_labels_batch = next(iter(normalized_training_set))
For more about this method, see the TensorFlow tutorial:

How can I improve my CNN's accuracy evolution?

So, I'm trying to create a CNN that can predict whether there are any "support devices" in a thorax X-ray image, but when training my model it seems it's not learning anything.
I'm using a dataset called "CheXpert", which has over 200,000 images. After doing some "cleaning", the final dataset ended up with 100,000 images.
As far as the model is concerned, I imported the convolutional base of the pretrained VGG16 model and added two fully connected layers myself. Then I froze the whole convolutional base and made only the fully connected layers trainable. Here's the code:
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Dropout, GlobalAveragePooling2D
from keras.models import Model
pretrained_model = VGG16(weights='imagenet', include_top=False)
for layer in pretrained_model.layers:
    layer.trainable = False
x = pretrained_model.output
x = GlobalAveragePooling2D()(x)
dropout = Dropout(0.25)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
x = dropout(x)
x = Dense(1024, activation = 'relu')(x)
predictions = Dense(1, activation='sigmoid')(x)
final_model = Model(inputs=pretrained_model.input, outputs=predictions)
final_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
As far as I know, the normal behavior is that the accuracy starts low and then grows with the epochs. But here it only oscillates between the same values (0.93 and 0.95). I'm sorry I cannot upload images to show you the graphs.
To sum up, I want to know whether that small variance in the accuracy means that the model is not learning anything.
I have a hypothesis: of the 100,000 images in the dataset, 95,000 have the label "1" and only 5,000 have the label "0". I think that if I reduce the number of images labeled "1" to match the number labeled "0", the results would change.
The lack of images labeled "0" certainly doesn't help the CNN. I also suggest lowering the learning rate and playing around with the batch size to see if anything changes.
I hope it helps.
Because the training data is imbalanced, I suggest setting "class_weight" during training. The more data a class has, the lower the class_weight you should give it.
class_weight = {0: 1.5, 1: 0.5}
model.fit(X, Y, class_weight=class_weight)
You can check the class_weight argument in the Keras documentation:
class_weight: Optional dictionary mapping class indices (integers) to
a weight (float) value, used for weighting the loss function (during
training only). This can be useful to tell the model to "pay more
attention" to samples from an under-represented class.
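As a worked example using the label counts from the question (95,000 images labeled "1" and 5,000 labeled "0"), the sketch below computes inverse-frequency class weights and the accuracy of a model that always predicts the majority class. The helper name is hypothetical, and the formula n_samples / (n_classes * count) is the common "balanced" heuristic, not the specific values suggested above.

```python
def balanced_class_weights(counts):
    # Inverse-frequency weighting: n_samples / (n_classes * count_of_class).
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

label_counts = {0: 5000, 1: 95000}  # counts taken from the question

class_weight = balanced_class_weights(label_counts)
majority_baseline = max(label_counts.values()) / sum(label_counts.values())

print(class_weight)        # class 0 gets weight 10.0, class 1 roughly 0.53
print(majority_baseline)   # 0.95
```

The baseline of 0.95 is worth noting: an accuracy that hovers between 0.93 and 0.95 is about what a model achieves by always answering "1", so accuracy alone cannot show whether the network has learned anything on this dataset.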

Different results when training a model with same initial weights and same data

I'm trying to do some transfer learning to adapt ResNet50 to my data set.
The problem is that when I run the training again with the same parameters, I get a different result (loss and accuracy for the train and val sets, so I guess also different weights and, as a result, a different error rate on the test set).
Here is my model:
The weights parameter is 'imagenet'; the other parameter values aren't really important, the important thing is that they are the same for each run...
import math
import numpy as np
from keras import applications, optimizers
from keras.layers import Dense
from keras.models import Model
from sklearn.utils import class_weight

def ImageNet_model(train_data, train_labels, param_dict, num_classes):
    X_datagen = get_train_augmented()  # the asker's own augmentation helper (not shown)
    validatin_cut_point = math.ceil(len(train_data)*(1-param_dict["validation_split"]))
    base_model = applications.resnet50.ResNet50(weights=param_dict["weights"], include_top=False, pooling=param_dict["pooling"],
                                                input_shape=(param_dict["image_size"], param_dict["image_size"], 3))
    # Define the layers in the new classification prediction
    x = base_model.output
    x = Dense(num_classes, activation='relu')(x) # new FC layer, random init
    predictions = Dense(num_classes, activation='softmax')(x) # new softmax layer
    model = Model(inputs=base_model.input, outputs=predictions)
    # Freeze layers
    layers_to_freeze = param_dict["freeze"]
    for layer in model.layers[:layers_to_freeze]:
        layer.trainable = False
    for layer in model.layers[layers_to_freeze:]:
        layer.trainable = True
    sgd = optimizers.SGD(lr=param_dict["lr"], momentum=param_dict["momentum"], decay=param_dict["decay"])
    model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
    lables_ints = [y.argmax() for y in np.array(train_labels)]
    class_weights = class_weight.compute_class_weight('balanced',
                                                      classes=np.unique(lables_ints),
                                                      y=lables_ints)
    train_generator = X_datagen.flow(np.array(train_data)[0:validatin_cut_point], np.array(train_labels)[0:validatin_cut_point], batch_size=param_dict['batch_size'])
    validation_generator = X_datagen.flow(np.array(train_data)[validatin_cut_point:len(train_data)],
                                          np.array(train_labels)[validatin_cut_point:len(train_data)],
                                          batch_size=param_dict['batch_size'])
    # Any further arguments (e.g. the number of epochs) are not shown in the post
    history = model.fit_generator(
        train_generator,
        steps_per_epoch=validatin_cut_point // param_dict['batch_size'],
        validation_data=validation_generator,
        validation_steps=(len(train_data)-validatin_cut_point) // param_dict['batch_size'],
        class_weight=class_weights)
    return model
What can make the output of each run different?
Since the initial weights are the same, they can't explain the difference (I also tried freezing some layers, which didn't help). Any ideas?
When you initialize the weights of a Dense layer randomly, they are initialized differently across runs, and the model also converges to different local minima.
x = Dense(num_classes, activation='relu')(x) # new FC layer, random init
If you want the output to be the same, you need to initialize the weights with the same values across runs. You can read the details on how to obtain reproducible results in Keras here. These are the steps you need to follow:
Set the PYTHONHASHSEED environment variable to 0
Set the random seed for NumPy-generated random numbers: np.random.seed(SEED)
Set the random seed for Python-generated random numbers: random.seed(SEED)
Set the random state for the TensorFlow backend: tf.set_random_seed(SEED)
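The steps above can be collected into one helper, sketched here under a few assumptions: the name set_seeds and the value of SEED are hypothetical, NumPy and TensorFlow are seeded only if they are installed, and tf.random.set_seed is used when available because TensorFlow 2.x replaced tf.set_random_seed with it.

```python
import os
import random

SEED = 42  # hypothetical value; any fixed integer works

def set_seeds(seed):
    # Setting PYTHONHASHSEED here only affects processes started afterwards;
    # to fix hashing in this interpreter it must be set before Python starts.
    os.environ["PYTHONHASHSEED"] = "0"
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
    except ImportError:
        return
    tf_random = getattr(tf, "random", None)
    if getattr(tf_random, "set_seed", None) is not None:
        tf_random.set_seed(seed)      # TensorFlow 2.x
    else:
        tf.set_random_seed(seed)      # TensorFlow 1.x, as in the list above

set_seeds(SEED)
first = random.random()
set_seeds(SEED)
second = random.random()
print(first == second)  # True
```

Call the helper once at the very top of the training script, before the model is built, so that the randomly initialized Dense layers receive the same starting weights on every run. GPU kernels and multi-threading can still introduce small differences that seeding alone does not remove.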

Adding Dropout to testing/inference phase

I've trained the following model for some timeseries in Keras:
input_layer = Input(batch_shape=(56, 3864))
first_layer = Dense(24, input_dim=28, activation='relu')(input_layer)
first_layer = Dropout(0.3)(first_layer)
second_layer = Dense(12, activation='relu')(first_layer)
second_layer = Dropout(0.3)(second_layer)
out = Dense(56)(second_layer)
model_1 = Model(input_layer, out)
Then I defined a new model with the trained layers of model_1 and added dropout layers with a different rate, drp, to it:
input_2 = Input(batch_shape=(56, 3864))
first_dense_layer = model_1.layers[1](input_2)
first_dropout_layer = model_1.layers[2](first_dense_layer)
new_dropout = Dropout(drp)(first_dropout_layer)
snd_dense_layer = model_1.layers[3](new_dropout)
snd_dropout_layer = model_1.layers[4](snd_dense_layer)
new_dropout_2 = Dropout(drp)(snd_dropout_layer)
output = model_1.layers[5](new_dropout_2)
model_2 = Model(input_2, output)
Then I'm getting the prediction results of these two models as follow:
result_1 = model_1.predict(test_data, batch_size=56)
result_2 = model_2.predict(test_data, batch_size=56)
I was expecting to get completely different results because the second model has new dropout layers and these two models are different (IMO), but that's not the case. Both are generating the same result. Why is that happening?
As I mentioned in the comments, the Dropout layer is turned off in the inference phase (i.e. test mode), so when you use model.predict() the Dropout layers are not active. However, if you would like to have a model that uses Dropout in both the training and inference phases, you can pass the training argument when calling it, as suggested by François Chollet:
# ...
new_dropout = Dropout(drp)(first_dropout_layer, training=True)
# ...
Alternatively, if you have already trained your model and now want to use it in inference mode while keeping the Dropout layers (and possibly other layers that behave differently in the training and inference phases, such as BatchNormalization) active, you can define a backend function that takes the model's inputs as well as the Keras learning phase:
from keras import backend as K
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
# to use it pass 1 to set the learning phase to training mode
outputs = func([input_arrays] + [1.])
Your question has a simple solution in recent versions of TensorFlow: set the training argument of the model's call method to True, e.g. model(inputs, training=True).
With training=True, TensorFlow keeps the Dropout layers active even when the model is used for inference.
As there are already some working code solutions above, I will simply add a few more details regarding dropout during inference to prevent confusion.
Based on the original paper, Dropout layers turn off neurons (set their activations to zero) during training to reduce overfitting. However, once training is finished and we start testing the model, we do not 'touch' any neurons; all units are used to make the decision at inference time. Because every unit is now active, the inputs to the next layer would be larger than what the network saw during training. To prevent this, a scaling factor is applied to balance the network. To be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p during the prediction stage.
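The scaling rule can be checked numerically with a small standard-library sketch. The function names are hypothetical, and scaling the activations by p stands in for scaling the outgoing weights, which has the same effect on the next layer's input. The third function shows the "inverted" variant that Keras uses, where the rescaling happens during training and nothing is done at test time.

```python
import random

def dropout_train(values, keep_prob, rng):
    # Paper-style dropout at training time: keep each unit with probability p.
    return [v if rng.random() < keep_prob else 0.0 for v in values]

def dropout_test(values, keep_prob):
    # Paper-style dropout at test time: keep everything, scale by p.
    return [v * keep_prob for v in values]

def inverted_dropout_train(values, keep_prob, rng):
    # Inverted dropout: scale kept units by 1/p during training instead,
    # so the test-time pass needs no scaling at all.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 100_000
keep_prob = 0.7

train_mean = sum(dropout_train(activations, keep_prob, rng)) / len(activations)
test_mean = sum(dropout_test(activations, keep_prob)) / len(activations)
inverted_mean = sum(inverted_dropout_train(activations, keep_prob, rng)) / len(activations)

print(train_mean, test_mean, inverted_mean)
```

Both means in the paper-style scheme come out near 0.7, so the next layer sees inputs of the same expected size in training and at test time. In the inverted scheme the training mean is near 1.0, matching an unscaled test pass. Note that Keras's Dropout(rate) takes the drop probability, i.e. rate = 1 - p.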

ResNet always predicting one class

I am trying to do transfer learning in Keras + Tensorflow on a selected subset of the Places-205 dataset, containing only 27 categories. I am using InceptionV3, DenseNet121 and ResNet50, pre-trained on ImageNet, and add a couple of extra layers to adapt to my classes. If the model is ResNet, I add Flatten + Dense for classification, and if it is DenseNet or InceptionV3, I add Global Avg Pool + Dense (relu) + Dense (classification).
This is the code snippet:
x = base_model.output
if FLAGS.model in 'resnet50':
    x = Flatten(name="flatten")(x)
else:
    x = GlobalAveragePooling2D()(x)
    # Let's add a fully-connected layer
    x = Dense(1024, activation = 'relu')(x)
# And a logistic layer
predictions = Dense(classes, activation = 'softmax')(x)
For DenseNet and Inceptionv3 the training is ok, and the validation accuracy hits 70%, but for ResNet the validation accuracy stays fixed at 0.0369/0.037 (which is 1/27, my number of classes). It seems like it always predicts one class, but it's weird because its training progresses ok and the unspecific model code is exactly the same as for DenseNet and InceptionV3, which do work as expected.
Do you have any idea why it happens?
Thanks a lot!
I had a similar issue to yours, @Ciprian Andrei Focsaneanu, and what worked for me was making the earlier layers (before the fully connected layers) trainable, as the filters/features of ResNet50 were not suitable for my application.
Strangely enough, I also trained VGG16 models, which were initially trained on the same images (ImageNet), and their filters did work for my application, but I digress.
Here's the link to a page that inspired me to do this:
Hope this helps!

