Tensorflow model performing significantly worse than Keras model

Tensorflow model performing significantly worse than Keras model - python

I was having an issue with my Tensorflow model and decided to try Keras. It appears to me at least that I am creating the same model with the same parameters, but the Tensorflow model just outputs the mean value of train_y while the Keras model actually varies according the input. Am I missing something in my tf.Session? I usually use Tensorflow and have never had a problem like this.
Tensorflow Code:
score_inputs = tf.placeholder(np.float32, shape=(None, 100))
targets = tf.placeholder(np.float32, shape=(None), name="targets")
l2 = tf.contrib.layers.l2_regularizer(0.01)
first_layer = tf.layers.dense(score_inputs, 100, activation=tf.nn.relu, kernel_regularizer=l2)
outputs = tf.layers.dense(first_layer, 1, activation = None, kernel_regularizer=l2)
optimizer = tf.train.AdamOptimizer(0.001)
l2_loss = tf.losses.get_regularization_loss()
loss = tf.reduce_mean(tf.square(tf.subtract(targets, outputs)))
loss += l2_loss
rmse = tf.sqrt(tf.reduce_mean(tf.square(outputs - targets)))
mae = tf.reduce_mean(tf.sqrt(tf.square(outputs - targets)))
training_op = optimizer.minimize(loss)
batch_size = 32
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(10):
avg_train_error = []
for i in range(len(train_x) // batch_size):
batch_x = train_x[i*batch_size: (i+1)*batch_size]
batch_y = train_y[i*batch_size: (i+1)*batch_size]
_, train_loss = sess.run([training_op, loss], {score_inputs: batch_x, targets: batch_y})
feed = {score_inputs: test_x, targets: test_y}
test_loss, test_mae, test_rmse, test_ouputs = sess.run([loss, mae, rmse, outputs], feed)
This has a mean absolute error of 0.682 and root mean squared error of 0.891.
The Keras Code:
inputs = Input(shape=(100,))
hidden = Dense(100, activation="relu", kernel_regularizer = regularizers.l2(0.01))(inputs)
outputs = Dense(1, activation=None, kernel_regularizer = regularizers.l2(0.01))(hidden)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(lr=0.001), loss='mse', metrics=['mae'])
model.fit(train_x, train_y, batch_size=32, epochs=10, shuffle=False)
keras_pred = model.predict(test_x)
This has a mean absolute error of 0.601 and root mean square error of 0.753.
It appears to me that I am defining the same network in both instances, yet as I said the Tensorflow model only outputs the mean value of train_y, while the Keras model performs a lot better. Any suggestions?

I'm going to try to point out the differences between the two codes.
Keras documentation here shows that the weights are initialized by 'glorot_uniform' whereas your weights are initialized by default, most probably at random as the documentation doesn't clearly specify what it is tensorflow intialization. So initialization is most probably different and it definitely
matters.
The second difference most probably is because of the difference in the data type of input, one being numpy.float32 and other being keras default input type, which again hasn't been specified by the documentation

#Priyank Pathak and #lehiester have given some valid points. Taking their suggestions into account, I can suggest you to change the following things and check again:
Use same kernel_initializer and data_type
Use more epochs for better generalisation
Seed your random, numpy and tensorflow functions

There isn't any obvious difference in the models, but the different results could possibly be explained due to random variation in training. Especially since you're only training for 10 epochs, the results could be fairly sensitive to the randomly chosen initial weights for the models.
Try running with more epochs (e.g. 1000) and running each one several times (e.g. 5)--the average results should be fairly close.

Related

Model not improving with GradientTape but with model.fit()

I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
x1 = Input(62)
x2 = Input((62, 3))
x = Embedding(30, 100, mask_zero=True)(x1)
x = Concatenate()([x, x2])
x = Bidirectional(LSTM(500,
return_sequences=True,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Bidirectional(LSTM(500,
return_sequences=False,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Activation('softmax')(x)
x = Dense(1000)(x)
x = Dense(500)(x)
x = Dense(250)(x)
x = Dense(1, bias_initializer='ones')(x)
x = tf.math.abs(x)
return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
generator= lambda: <load_function()>
output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
for (x1, x2), y in dat_train:
with tf.GradientTape() as tape:
y_pred = model((x1, x2), training=True)
loss = model.loss(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs is shown here:
custom training
keras training
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.

I have found the solution, it was a simple shape mismatch, which was somehow not picked up by any error check and worked both with my custom loss function and MSE. Using x = Reshape(())(x) as final layer did the trick.

Validation generator accuracy drop near to chance when I change its properties-- keras ImageDataGenerator, model.evaluate

I am reading images from a directory hierarchy (flow_from_directory using generators from the ImageDataGenerator class). The model is a fixed parameter mobilenetv2 + a trainable softmax layer. When I fit the model to training data, accuracy levels are comparable for training and validation. If I play with the validation parameters or reset the generator, accuracy for the validation generator drops significantly using model.evaluate or if I restart fitting the model with model.fit. The database is a 3D view database.
Relevant code:
'''
batch_size=16
rescaled3D_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, zoom_range=0.2,
shear_range=0.2,
horizontal_flip=True)
train_gen =rescaled3D_gen.flow_from_directory(data_directory + '/train/', seed=3,
target_size = (pixels, pixels), shuffle=True,
batch_size = batch_size, class_mode='binary')
val_gen =rescaled3D_gen.flow_from_directory(data_directory + '/test/', seed=3,
target_size = (pixels, pixels), shuffle=True,
batch_size = batch_size, class_mode='binary')
#MODEL
inputs = tf.keras.Input(shape=(None, None, 3), batch_size=batch_size)
x = tf.keras.layers.Lambda(lambda img: tf.image.resize(img, (pixels,pixels)))(inputs)
x = tf.keras.layers.Lambda(tf.keras.applications.mobilenet_v2.preprocess_input)(x)
mobilev2 = tf.keras.applications.mobilenet_v2.MobileNetV2(weights = 'imagenet', input_tensor = x,
input_shape=(pixels,pixels,3),
include_top=True, pooling = 'avg')
#add a dense layer for task-specific categorization.
full_model = tf.keras.Sequential([mobilev2,
tf.keras.layers.Dense(train_gen.num_classes, activation='softmax')])
for idx, layers in enumerate(mobilev2.layers):
layers.trainable = False
mobilev2.layers[-1].trainable=True
full_model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0001),
loss = 'sparse_categorical_crossentropy',
metrics=['accuracy'])
#start fitting
val_gen.reset()
train_gen.reset()
full_model.fit(train_gen,
steps_per_epoch = samples_per_epoch,
epochs=30,
validation_data=val_gen,
validation_steps = int(np.floor(val_gen.samples/val_gen.batch_size)))
good_acc_score = full_model.evaluate(val_gen, steps=val_gen.n//val_gen.batch_size)
'''
reproduce strangeness by doing something like this:
'''
val_gen.batch_size=4
val_gen.reset()
val_gen.batch_size=batch_size
'''
Then validation accuracy is automatically lower (perhaps to chance) during fit or evaluation
'''
bad_acc_score = full_model.evaluate(val_gen, steps=val_gen.n//val_gen.batch_size)
#or
full_model.fit(train_gen,
steps_per_epoch = samples_per_epoch,
epochs=1,
validation_data=val_gen,
validation_steps = int(np.floor(val_gen.samples/val_gen.batch_size)))
'''

here area few things you might try. You can eliminate the Lamda layers by changing the train_gen as follows
rescaled3D_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, zoom_range=0.2,shear_range=0.2, horizontal_flip=True,
preprocessing_function=tf.keras.applications.mobilenet_v2.preprocess_input)
You do not need the Lamda resize layer since you specify the target size in flow from directory.
In the val_gen you have shuffle=True. This will shuffle the validation image order for each epoch. Better to set it to False for consistancy.
In the code for mobilenet you have include_top=True and pooling='avg' When include_top is True the pooling parameter is ignored.
Setting include_top=True leaves the top layer of your model with a dense layer of 1000 nodes and a softmax activation function.
I would set include_top=False. That way the mobilenet output is a global pooling layer that can directly feed your dense categorization layer. In the generators you set the class_mode='binary'. but in model.compile you set the loss as
sparse_categorical_crossentropy. This will work but better to compile with loss=BinaryCrossentropy.
It is preferable to go through the validation samples exactly one time per epoch for consistancy. To do that the the batch size should be selected such that validation samples/batch_size is an integer and use that integer as the number of validation steps.
The code below will do that for you.
b_max=80 # set this to the maximum batch size you will allow based on memory capacity
length=val_gen.samples
batch_size=sorted([int(length/n) for n in range(1,length+1) if length % n ==0 and length/n<=b_max],reverse=True)[0]
val_steps=int(length/batch_size)
Changing validation batch size can change the results of validation loss and accuracy. Generally a larger batch size will result in less fluctuation of the loss but can lead to a higher probability of getting stuck in a local minimum. Try these changes and see if there is less variance in the results.

I think the lead answer from this board solves the problem I observed
Keras: Accuracy Drops While Finetuning Inception

Transfer learning with pretrained model by tf.GradientTape can't converge

I would like to perform transfer learning with pretrained model of keras
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with keras compile/fit functions can converge
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I wanna train with tf.GradientTape, but it can't converge
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
with tf.GradientTape() as gt:
pred = model(data)
loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
grads = gt.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_loss(loss)
train_acc(labels, pred)
for xs, ys in train_data:
train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the codes? How should I modify it? Thanks

Try RELU as activation function. It may be Vanishing Gradient issue which occurs if you use activation function other than RELU.

Following my comment, the reason why it didn't converge is because you picked a learning rate that was too big. This causes the weight to change too much and the loss to explode. When setting base_model.trainable to False, most of the weight in the networks were fixed and the learning rate was a good fit for your last layers. Here's a picture :
As a general rule, your learning rate should always be chosen for each experiments.
Edit : Following Wilson's comment, I'm not sure this is the reason you have different results but this could be it :
When you specify your loss, your loss is computed on each element of the batch, then to get the loss of the batch, you can take the sum or the mean of the losses, depending on which one you chose, you get a different magnitude. For example, if your batch size is 64, summing the loss will yield you a 64 times bigger loss which will yield 64 times bigger gradient, so choosing sum over mean with a batch size 64 is like picking a 64 times bigger learning rate.
So maybe the reason you have different results is that by default a keras.losses wrapped in a model.compile has a different reduction method. In the same vein, if the loss is reduced by a sum method, the magnitude of the loss depends on the batch size, if you have twice the batch size, you get (on average) twice the loss, and twice the gradient and so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to be sure it's the same in both case, and if it's sum, to check that the batch size is the same. I would advise to use mean reduction in general since it's not influenced by batch size.

Different results when training a model with same initial weights and same data

I'm trying to make some transfer learning to adjust the ResNet50 to my data set.
the problem is when I run the training again with the same parameters, I get a different result (loss and accuracy for train and val sets, so I guess also different weights and as a result different error rate for the test set)
here is my model:
the weights parameter is 'imagenet', all other parameter value isn't really important, the important thing is they are the same for each run...
def ImageNet_model(train_data, train_labels, param_dict, num_classes):
X_datagen = get_train_augmented()
validatin_cut_point= math.ceil(len(train_data)*(1-param_dict["validation_split"]))
base_model = applications.resnet50.ResNet50(weights=param_dict["weights"], include_top=False, pooling=param_dict["pooling"],
input_shape=(param_dict["image_size"], param_dict["image_size"],3))
# Define the layers in the new classification prediction
x = base_model.output
x = Dense(num_classes, activation='relu')(x) # new FC layer, random init
predictions = Dense(num_classes, activation='softmax')(x) # new softmax layer
model = Model(inputs=base_model.input, outputs=predictions)
# Freeze layers
layers_to_freeze = param_dict["freeze"]
for layer in model.layers[:layers_to_freeze]:
layer.trainable = False
for layer in model.layers[layers_to_freeze:]:
layer.trainable = True
sgd = optimizers.SGD(lr=param_dict["lr"], momentum=param_dict["momentum"], decay=param_dict["decay"])
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
lables_ints = [y.argmax() for y in np.array(train_labels)]
class_weights = class_weight.compute_class_weight('balanced',
np.unique(lables_ints),
np.array(lables_ints))
train_generator = X_datagen.flow(np.array(train_data)[0:validatin_cut_point],np.array(train_labels)[0:validatin_cut_point], batch_size=param_dict['batch_size'])
validation_generator = X_datagen.flow(np.array(train_data)[validatin_cut_point:len(train_data)],
np.array(train_labels)[validatin_cut_point:len(train_data)],
batch_size=param_dict['batch_size'])
history= model.fit_generator(
train_generator,
epochs=param_dict['epochs'],
steps_per_epoch=validatin_cut_point // param_dict['batch_size'],
validation_data=validation_generator,
validation_steps=(len(train_data)-validatin_cut_point) // param_dict['batch_size'],
class_weight=class_weights)
shuffle=False,class_weight=class_weights)
graph_of_loss_and_acc(history)
model.save(param_dict['model_file_name'])
return model
what can make the output of each run different?
Since the initial weights are the same, it can't explain the difference ( I also tried to freeze some layers, didn't help). any ideas?
Thank!

When you initialize the weights randomly in Dense layer, weights are initialized differently across runs and also converge to different local minima.
x = Dense(num_classes, activation='relu')(x) # new FC layer, random init
If you want the output to be same you need to initialize weights with same value across runs. You can read the details on how to obtain reproducible results on Keras here. These are the steps you need to follow
Set the PYTHONHASHSEED environment variable to 0
Set random seed for numpy generated random numbers np.random.seed(SEED)
Set random seed for Python generated random numbers random.seed(SEED)
Set random state for tensorflow backend tf.set_random_seed(SEED)

Extracting weights from best Neural Network in Tensorflow/Keras - multiple epochs

I am working on a 1 - hidden - layer Neural Network with 2000 neurons and 8 + constant input neurons for a regression problem.
In particular, as optimizer I am using RMSprop with learning parameter = 0.001, ReLU activation from input to hidden layer and linear from hidden to output. I am also using a mini-batch-gradient-descent (32 observations) and running the model 2000 times, that is epochs = 2000.
My goal is, after the training, to extract the weights from the best Neural Network out of the 2000 run (where, after many trials, the best one is never the last, and with best I mean the one that leads to the smallest MSE).
Using save_weights('my_model_2.h5', save_format='h5') actually works, but at my understanding it extract the weights from the last epoch, while I want those from the epoch in which the NN has perfomed the best. Please find the code I have written:
def build_first_NN():
model = keras.Sequential([
layers.Dense(2000, activation=tf.nn.relu, input_shape=[len(X_34.keys())]),
layers.Dense(1)
])
optimizer = tf.keras.optimizers.RMSprop(0.001)
model.compile(loss='mean_squared_error',
optimizer=optimizer,
metrics=['mean_absolute_error', 'mean_squared_error']
)
return model
first_NN = build_first_NN()
history_firstNN_all_nocv = first_NN.fit(X_34,
y_34,
epochs = 2000)
first_NN.save_weights('my_model_2.h5', save_format='h5')
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['dense_1/dense_1/kernel:0'][:])
The then extracted weights should be those from the last of the 2000 epochs: how can I get those from, instead, the one in which the MSE was the smallest?
Looking forward for any comment.
EDIT: SOLVED
Building on the received suggestions, as for general interest, that's how I have updated my code, meeting my scope:
# build_first_NN() as defined before
first_NN = build_first_NN()
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
checkpoint = ModelCheckpoint(trained_weights_path,
monitor='mean_squared_error',
verbose=1,
save_best_only=True,
mode='min')
history_firstNN_all_nocv = first_NN.fit(X_34,
y_34,
epochs = 2000,
callbacks = [checkpoint])
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['model_weights/dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['model_weights/dense_1/dense_1/kernel:0'][:])

Use ModelCheckpoint callback from Keras.
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='max')
use this as a callback in your model.fit() . This will always save the model with the highest validation accuracy (lowest MSE on validation) at the location specified by filepath.
You can find the documentation here.
Of course, you need validation data during training for this. Otherwise I think you can probably be able to check on lowest training MSE by writing a callback function yourself.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.