Keras training crashes mid epoch after multiple correct executions

Keras training crashes mid epoch after multiple correct executions - python

I am trying create a Cudgru based model that predicts sequence of 7 features that are interrelated. Here's my keras model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnngru_1 (CuDNNGRU) (None, 49, 100) 32700
_________________________________________________________________
dropout_1 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_2 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_2 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_3 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_3 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_4 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_4 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_5 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_5 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_6 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_6 (Dropout) (None, 49, 100) 0
_________________________________________________________________
cu_dnngru_7 (CuDNNGRU) (None, 49, 100) 60600
_________________________________________________________________
dropout_7 (Dropout) (None, 49, 100) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 4900) 0
_________________________________________________________________
dense_1 (Dense) (None, 7) 34307
=================================================================
Total params: 430,607
Trainable params: 430,607
Non-trainable params: 0
I'm trying to run this model for higher number of epochs. First few epochs are fine, but then it errors out:
Model] Model Compiled
Time taken: 0:00:02.314468
[Model] Training Started
[Model] 100 epochs, 1000 batch size, 20.0 batches per epoch
Epoch 1/100
20/20 [==============================] - 5s 240ms/step - loss: 0.1631 - acc: 0.2905
Epoch 2/100
20/20 [==============================] - 2s 81ms/step - loss: 0.1288 - acc: 0.2455
Epoch 3/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0952 - acc: 0.5058
Epoch 4/100
20/20 [==============================] - 2s 76ms/step - loss: 0.1141 - acc: 0.3288
Epoch 5/100
20/20 [==============================] - 2s 75ms/step - loss: 0.1064 - acc: 0.3425
Epoch 6/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0767 - acc: 0.4213
Epoch 7/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0635 - acc: 0.4764
Epoch 8/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0555 - acc: 0.5274
Epoch 9/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0544 - acc: 0.5141
Epoch 10/100
...
Epoch 61/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0506 - acc: 0.3925
Epoch 62/100
20/20 [==============================] - 1s 72ms/step - loss: 0.0495 - acc: 0.4323
Epoch 63/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0495 - acc: 0.4118
Epoch 64/100
2/20 [==>...........................] - ETA: 1s - loss: 0.0495 - acc: 0.4885Traceback (most recent call last):
File "./run.py", line 118, in <module>
main()
File "./run.py", line 92, in main
steps_per_epoch=steps_per_epoch)
File "/home/sridhar/PE_CSV/alarmProj/rnn/lstm/core/model.py", line 149, in train_generator
workers=70)
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1415, in fit_generator
initial_epoch=initial_epoch)
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_generator.py", line 213, in fit_generator
class_weight=class_weight)
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1209, in train_on_batch
class_weight=class_weight)
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 749, in _standardize_user_data
exception_prefix='input')
File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_utils.py", line 127, in standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected cu_dnngru_1_input to have 3 dimensions, but got array with shape (380, 1)
If I reduce the number of epochs to less than that value (say epoch 64 here), I don't have any issues, but increasing the number of epochs causes the above error at some point. The exact number of epochs where it crashes seem to vary with any change to the configuration. The same issue is seen with vanilla GRU/LSTM layers.
This is keras-2.2.2 and the model is being compiled with 70 worker threads.
Is there something I could do to avoid this issue?
Edit:
Here's the relevant approximate code used:
session_conf = tf.ConfigProto(
inter_op_parallelism_threads=multiprocessing.cpu_count(),
intra_op_parallelism_threads=multiprocessing.cpu_count())
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
self.model.add(CuDNNGRU(
100,
input_shape=(49,7),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
self.model.add(CuDNNGRU(
100,
input_shape=(None,None),
kernel_initializer='orthogonal',
return_sequences=true))
self.model.add(Dropout(0.4))
elf.model.add(Flatten())
self.model.add(Dense(7, activation='relu'))
sgd = SGD(lr=0.1, decay=1e-2, clipnorm=5.0)
self.model.compile(
loss='mse',
metrics=["accuracy"],
optimizer=sgd)
===================
def train_generator(self, data_gen, epochs, batch_size, steps_per_epoch):
timer = Timer()
timer.start()
print('[Model] Training Started')
print('[Model] %s epochs, %s batch size, %s batches per epoch' %
(epochs, batch_size, steps_per_epoch))
save_fname = '%s/%s-e%s.h5' % (self.model_dir, dt.datetime.now()
.strftime('%d%m%Y-%H%M%S'), str(epochs))
callbacks = [
ModelCheckpoint(
filepath=save_fname, monitor='loss', save_best_only=True)
]
try:
self.model.fit_generator(
data_gen,
steps_per_epoch=steps_per_epoch,
epochs=epochs,
callbacks=callbacks)
except:
pdb.set_trace()
)
print('[Model] Training Completed. Model saved as %s' % save_fname)
timer.stop()
=============
#invoked from main function
model.train_generator(
data_gen=data.generate_train_batch(
seq_len=50,
batch_size=1000,
normalise=false),
epochs=100,
batch_size=1000,
steps_per_epoch=steps_per_epoch)
=============
def generate_train_batch(self, seq_len, batch_size, normalise):
'''Yield a generator of training data from filename on given list of cols split for train/test'''
i = 0
while i < (self.len_train - seq_len):
x_batch = []
y_batch = []
for b in range(batch_size):
if i >= (self.len_train - seq_len):
# stop-condition for a smaller final batch if data doesn't divide evenly
yield np.array(x_batch), np.array(y_batch)
x, y = self._next_window(i, seq_len, normalise)
x_batch.append(x)
y_batch.append(y)
i += 1
yield np.array(x_batch), np.array(y_batch)
=======================

The generator was wrong. It incorrectly assumes a finite generator whilst keras expects an infinite one.

Related

how can i continue training from last epoch?

I saved the history of training by
history = model.fit(train_generator, epochs=epochs, steps_per_epoch=train_steps,
verbose=1, callbacks=callbacks, validation_data=val_generator,
validation_steps=val_steps,batch_size=16)
with open('history_epochs.pkl', 'wb') as f:
dump(history.history, f)
Can I use the file of history to continue from the last epoch? and how please

Below applies to any deep learning library …
Build model
Train model.
Save model (should be saving parameters/weights as well).
Load model from the saved file (any time any where).
Continue with more training.

You can use the pickle file to save and load your model and continue training:
Create your model
Train your model
Save your model as a pickle file
Code for the above steps:
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import joblib
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
fig, axes = plt.subplots(2,5,figsize=(15,6))
for idx, axe in enumerate(axes.flatten()):
axe.axis('off')
idx_img = np.argwhere(y_train==idx)[0][0]
axe.imshow(X_train[idx_img], cmap=plt.cm.binary)
axe.set_title(class_names[y_train[idx_img]])
X_train = X_train.astype('float32') / 255.0
X_train = tf.expand_dims(X_train, axis=-1)
X_test = X_test.astype('float32') / 255.0
X_test = tf.expand_dims(X_test, axis=-1)
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(X_train.shape[1], X_train.shape[1], 1)))
model.add(tf.keras.layers.Conv2D(128, (3,3), activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Conv2D(64, (3,3), activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Conv2D(128, (3,3), activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, batch_size=256, epochs=3, verbose=1, validation_split=.2)
model.evaluate(X_test, y_test, verbose=1)
joblib.dump(model, 'model.pkl')
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 128) 1280
batch_normalization (BatchN (None, 26, 26, 128) 512
ormalization)
dropout (Dropout) (None, 26, 26, 128) 0
conv2d_1 (Conv2D) (None, 24, 24, 64) 73792
batch_normalization_1 (Batc (None, 24, 24, 64) 256
hNormalization)
dropout_1 (Dropout) (None, 24, 24, 64) 0
conv2d_2 (Conv2D) (None, 22, 22, 128) 73856
batch_normalization_2 (Batc (None, 22, 22, 128) 512
hNormalization)
dropout_2 (Dropout) (None, 22, 22, 128) 0
flatten (Flatten) (None, 61952) 0
dense (Dense) (None, 512) 31719936
dropout_3 (Dropout) (None, 512) 0
dense_1 (Dense) (None, 128) 65664
dropout_4 (Dropout) (None, 128) 0
dense_2 (Dense) (None, 10) 1290
=================================================================
Total params: 31,937,098
Trainable params: 31,936,458
Non-trainable params: 640
_________________________________________________________________
Epoch 1/3
188/188 [==============================] - 19s 81ms/step - loss: 0.8264 - accuracy: 0.7398 - val_loss: 3.4644 - val_accuracy: 0.1245
Epoch 2/3
188/188 [==============================] - 14s 75ms/step - loss: 0.4896 - accuracy: 0.8283 - val_loss: 1.2240 - val_accuracy: 0.5802
Epoch 3/3
188/188 [==============================] - 14s 77ms/step - loss: 0.4055 - accuracy: 0.8544 - val_loss: 0.3711 - val_accuracy: 0.8675
313/313 [==============================] - 2s 5ms/step - loss: 0.3850 - accuracy: 0.8591
[0.3849639296531677, 0.8590999841690063]
INFO:tensorflow:Assets written to: ram://****/assets
['model.pkl']
Load your model
Continue Training
Code for the above steps:
model = joblib.load("/content/model.pkl")
model.fit(X_train, y_train, batch_size=256, epochs=2, verbose=1, validation_split=.2)
model.evaluate(X_test, y_test, verbose=1)
Output:
Epoch 1/2
188/188 [==============================] - 17s 84ms/step - loss: 0.4414 - accuracy: 0.8496 - val_loss: 0.3449 - val_accuracy: 0.8697
Epoch 2/2
188/188 [==============================] - 15s 82ms/step - loss: 0.3704 - accuracy: 0.8708 - val_loss: 0.2884 - val_accuracy: 0.8965
313/313 [==============================] - 1s 5ms/step - loss: 0.3114 - accuracy: 0.8938
[0.31136029958724976, 0.8938000202178955]

Tensorflow save and load_model not working but save and load_weights does

I am using tensorflow version 2.8.0:
I have seen this issue from multiple sources all over forums, githubs, and even some here for the past 5 years with no definitive answer that has worked for me... For some reason, in certain situations, a loaded model from a previous save yields very different results from the original model evaluation. I haven't seen any well documented and investigative questions about this so I thought I'd show my full code below (simple illustration of the issue).
This is an application of transfer learning from a pre-trained tensorflow model. The model is first trained through 5 epochs on train_data, then fine tuned (with more trainable params) for 5 more. Evaluating the model on test_data shows an accuracy of 0.5671. The model is then saved and loaded in .h5 format (I have also tried the tf SavedModel format and the result is the same). The resultant loaded_model yields an evaluation accuracy on the same, unaltered test_data of 0.4535.
The result should be the same (0.5671)... so to further investigate I decided to save the fine tuned model's weights independently, construct and compile the same model architecture in new_model, and load the saved model's weights into new_model. Evaluating new_model yields the correct result, an accuracy of 0.5671. ----- Okay, so it must be the weights not saving properly right? I pulled the weights from each of these three models (model, loaded_model, new_model) and compared their flattened results. They are all the same. I really have no idea what's going on here but I'm assuming it is not random initialization, because the loaded_model evaluation results really did not perform anywhere near the fine tuned model - I would assume they would converge much closer.
import tensorflow as tf
tf.random.set_seed(42)
import pandas as pd
import numpy as np
import os
import pathlib
data_dir = pathlib.Path("101_food_classes_10_percent/train")
class_names = np.array(sorted([item.name for item in data_dir.glob('*')]))
train_dir = './101_food_classes_10_percent/train/'
test_dir = './101_food_classes_10_percent/test/'
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen=ImageDataGenerator()
train_data = datagen.flow_from_directory(directory = train_dir,
target_size = (224,224),
batch_size = 32,
class_mode='categorical')
test_data = datagen.flow_from_directory(directory = test_dir,
target_size = (224,224),
batch_size = 32,
class_mode='categorical')
from tensorflow.keras.layers.experimental import preprocessing
data_augmentation = tf.keras.Sequential([
preprocessing.RandomFlip('horizontal'),
preprocessing.RandomRotation(0.2),
preprocessing.RandomZoom(0.2),
preprocessing.RandomHeight(0.2),
preprocessing.RandomWidth(0.2)
#preprocessing.Rescaling(1/255.) in EfficientNet it's already scaled but could use this for non-scaled
], name = 'data_augmentation')
Found 7575 images belonging to 101 classes.
Found 25250 images belonging to 101 classes.
# Build headless model - Feature Extraction
# Setup base with frozen layers
base_model = tf.keras.applications.EfficientNetB0(include_top=False)
base_model.trainable=False
inputs = tf.keras.layers.Input(shape = (224,224,3))
x = data_augmentation(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x) # Pool base_model's outputs into a feature vector
outputs = tf.keras.layers.Dense(len(class_names), activation='softmax')(x)
model = tf.keras.Model(inputs,outputs)
model.compile('Adam', 'categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
data_augmentation (Sequentia (None, None, None, 3) 0
_________________________________________________________________
efficientnetb0 (Functional) (None, None, None, 1280) 4049571
_________________________________________________________________
global_average_pooling2d_1 ( (None, 1280) 0
_________________________________________________________________
dense_1 (Dense) (None, 101) 129381
=================================================================
Total params: 4,178,952
Trainable params: 129,381
Non-trainable params: 4,049,571
_________________________________________________________________
history = model.fit(train_data, validation_data=test_data,
validation_steps=int(0.15*len(test_data)),
epochs=5, callbacks = [checkpoint_callback])
Epoch 1/5
237/237 [==============================] - 63s 230ms/step - loss: 3.4712 - accuracy: 0.2482 - val_loss: 2.4446 - val_accuracy: 0.4497
Epoch 2/5
237/237 [==============================] - 52s 221ms/step - loss: 2.3575 - accuracy: 0.4561 - val_loss: 2.0051 - val_accuracy: 0.5093
Epoch 3/5
237/237 [==============================] - 51s 216ms/step - loss: 1.9838 - accuracy: 0.5265 - val_loss: 1.8313 - val_accuracy: 0.5360
Epoch 4/5
237/237 [==============================] - 51s 212ms/step - loss: 1.7497 - accuracy: 0.5761 - val_loss: 1.7417 - val_accuracy: 0.5461
Epoch 5/5
237/237 [==============================] - 53s 221ms/step - loss: 1.6035 - accuracy: 0.6141 - val_loss: 1.7012 - val_accuracy: 0.5601
model.evaluate(test_data)
790/790 [==============================] - 87s 110ms/step - loss: 1.7294 - accuracy: 0.5481
[1.7294203042984009, 0.5480791926383972]
# Fine tuning: unfreeze some layers, lower leaning rate by 10x
base_model.trainable=True
# Refreeze every layer except last 5, adjust tiner tuned features down the model
for layer in base_model.layers[:-5]:
layer.trainable=False
# recompile and lower learning rate by 10x
model.compile(tf.keras.optimizers.Adam(learning_rate=0.0001), 'categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
data_augmentation (Sequentia (None, None, None, 3) 0
_________________________________________________________________
efficientnetb0 (Functional) (None, None, None, 1280) 4049571
_________________________________________________________________
global_average_pooling2d_1 ( (None, 1280) 0
_________________________________________________________________
dense_1 (Dense) (None, 101) 129381
=================================================================
Total params: 4,178,952
Trainable params: 910,821
Non-trainable params: 3,268,131
_________________________________________________________________
# Fine Tune for 5 more epochs starting with last epoch left off at:
fine_tune_epochs=10 # Total number of epochs we're after: 5 feature extraction, 5 fine tuning
history_fine_tune = model.fit(train_data,
validation_data = test_data,
validation_steps=int(0.15*len(test_data)),
epochs = fine_tune_epochs,
initial_epoch = history.epoch[-1])
Epoch 5/10
237/237 [==============================] - 59s 220ms/step - loss: 1.3571 - accuracy: 0.6543 - val_loss: 1.6403 - val_accuracy: 0.5567
Epoch 6/10
237/237 [==============================] - 51s 213ms/step - loss: 1.2478 - accuracy: 0.6688 - val_loss: 1.6805 - val_accuracy: 0.5596
Epoch 7/10
237/237 [==============================] - 46s 193ms/step - loss: 1.1424 - accuracy: 0.6964 - val_loss: 1.6352 - val_accuracy: 0.5736
Epoch 8/10
237/237 [==============================] - 45s 191ms/step - loss: 1.0902 - accuracy: 0.7065 - val_loss: 1.6494 - val_accuracy: 0.5657
Epoch 9/10
237/237 [==============================] - 46s 193ms/step - loss: 1.0229 - accuracy: 0.7275 - val_loss: 1.6348 - val_accuracy: 0.5633
Epoch 10/10
237/237 [==============================] - 45s 191ms/step - loss: 0.9704 - accuracy: 0.7434 - val_loss: 1.6990 - val_accuracy: 0.5670
model.evaluate(test_data)
790/790 [==============================] - 83s 105ms/step - loss: 1.6578 - accuracy: 0.5671
[1.657836675643921, 0.5670890808105469]
model.save("./101_food_classes_10_percent/big_modelh5")
loaded_model = tf.keras.models.load_model("./101_food_classes_10_percent/big_modelh5.h5")
loaded_model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
data_augmentation (Sequentia (None, None, None, 3) 0
_________________________________________________________________
efficientnetb0 (Functional) (None, None, None, 1280) 4049571
_________________________________________________________________
global_average_pooling2d_1 ( (None, 1280) 0
_________________________________________________________________
dense_1 (Dense) (None, 101) 129381
=================================================================
Total params: 4,178,952
Trainable params: 910,821
Non-trainable params: 3,268,131
_________________________________________________________________
loaded_model.evaluate(test_data)
790/790 [==============================] - 85s 104ms/step - loss: 2.1780 - accuracy: 0.4535 - loss: 2.1790 - accuracy
[2.1780412197113037, 0.4534653425216675]
# Try save_weights to another model
model.save_weights('my_model_weights.h5')
inputs = tf.keras.layers.Input(shape = (224,224,3))
x = data_augmentation(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x) # Pool base_model's outputs into a feature vector
outputs = tf.keras.layers.Dense(len(class_names), activation='softmax')(x)
new_model = tf.keras.Model(inputs,outputs)
new_model.compile('Adam', 'categorical_crossentropy', metrics=['accuracy'])
new_model.summary()
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
data_augmentation (Sequentia (None, None, None, 3) 0
_________________________________________________________________
efficientnetb0 (Functional) (None, None, None, 1280) 4049571
_________________________________________________________________
global_average_pooling2d_2 ( (None, 1280) 0
_________________________________________________________________
dense_2 (Dense) (None, 101) 129381
=================================================================
Total params: 4,178,952
Trainable params: 910,821
Non-trainable params: 3,268,131
_________________________________________________________________
new_model.load_weights('my_model_weights.h5')
# Saving weights works... but not save and load_model
new_model.evaluate(test_data)
790/790 [==============================] - 88s 109ms/step - loss: 1.6578 - accuracy: 0.5671
[1.6578353643417358, 0.5670890808105469]
# Check if weights are the same?
m1 = model.get_weights()
m2 = new_model.get_weights()
m3 = loaded_model.get_weights()
len(m1)==len(m2)==len(m3)
True
from collections.abc import Iterable
def flatten(l):
for el in l:
if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
yield from flatten(el)
else:
yield el
m1 = flatten(m1)
m2 = flatten(m2)
m3 = flatten(m3)
print(list(m1)==list(m2))
print(list(m1)==list(m3))
True
True

This is because you have not saved your entire model using .h5 extension, but you are using .h5 for saving the weights. Please check below code section:
model.save("./101_food_classes_10_percent/big_modelh5") # add .h5
loaded_model = tf.keras.models.load_model("./101_food_classes_10_percent/big_modelh5.h5")
loaded_model.summary()
Use this code to save the entire model to a HDF5 file format and try again loading it:
model.save("./101_food_classes_10_percent/big_modelh5.h5")
Check this for more details on saving model in .hdf5 format.

Validation accuracy not improving imbalanced data

Attempting to make predictions using Kaggle Diabetic retinopathy data set and a CNN model. There are five classes to be predicted. Distribution % of the data label wise is as below.
0 0.73
2 0.15
1 0.07
3 0.02
4 0.02
Name: level, dtype: float64
The relevant important code blocks are furnished below.
# Network training parameters
EPOCHS = 25
BATCH_SIZE =50
VERBOSE = 1
lr=0.0001
OPTIMIZER = tf.keras.optimizers.Adam(lr)
target_size =(256, 256)
NB_CLASSES = 5
THe Image generator class and the preprocessing codes as below.
data_gen=tf.keras.preprocessing.image.ImageDataGenerator(rotation_range=45,
horizontal_flip=True,
vertical_flip=True,
rescale=1./255,
validation_split=0.2)
train_gen=data_gen.flow_from_dataframe(
dataframe=label_csv, directory=IMAGE_FOLDER_PATH,
x_col='image', y_col='level',
target_size=target_size,
class_mode='categorical',
batch_size=BATCH_SIZE, shuffle=True,
subset='training',
validate_filenames=True
)
Found 28101 validated image filenames belonging to 5 classes.
validation_gen=data_gen.flow_from_dataframe(
dataframe=label_csv, directory=IMAGE_FOLDER_PATH,
x_col='image', y_col='level',
target_size=target_size,
class_mode='categorical',
batch_size=BATCH_SIZE, shuffle=True,
subset='validation',
validate_filenames=True
)
Found 7025 validated image filenames belonging to 5 classes.
train_gen.image_shape
(256, 256, 3)
Model building code blocks as below.
# Architect your CNN model1
model1=tf.keras.models.Sequential()
model1.add(tf.keras.layers.Conv2D(256,(3,3),input_shape=INPUT_SHAPE,activation='relu'))
model1.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model1.add(tf.keras.layers.Conv2D(128,(3,3),activation='relu'))
model1.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model1.add(tf.keras.layers.Conv2D(64,(3,3),activation='relu'))
model1.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model1.add(tf.keras.layers.Conv2D(32,(3,3),activation='relu'))
model1.add(tf.keras.layers.MaxPool2D(pool_size=(2,2)))
model1.add(tf.keras.layers.Flatten())
model1.add(tf.keras.layers.Dense(units=512,activation='relu'))
model1.add(tf.keras.layers.Dense(units=256,activation='relu'))
model1.add(tf.keras.layers.Dense(units=128,activation='relu'))
model1.add(tf.keras.layers.Dense(units=64,activation='relu'))
model1.add(tf.keras.layers.Dense(units=32,activation='relu'))
model1.add(tf.keras.layers.Dense(units=NB_CLASSES,activation='softmax'))
model1.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 254, 254, 256) 7168
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 127, 127, 256) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 125, 125, 128) 295040
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 62, 62, 128) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 60, 60, 64) 73792
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 30, 30, 64) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 28, 28, 32) 18464
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 6272) 0
_________________________________________________________________
dense (Dense) (None, 512) 3211776
_________________________________________________________________
dense_1 (Dense) (None, 256) 131328
_________________________________________________________________
dense_2 (Dense) (None, 128) 32896
_________________________________________________________________
dense_3 (Dense) (None, 64) 8256
_________________________________________________________________
dense_4 (Dense) (None, 32) 2080
_________________________________________________________________
dense_5 (Dense) (None, 5) 165
=================================================================
Total params: 3,780,965
Trainable params: 3,780,965
Non-trainable params: 0
# Compile model1
model1.compile(optimizer=OPTIMIZER,metrics=['accuracy'],loss='categorical_crossentropy')
print (train_gen.n,train_gen.batch_size)
28101 50
STEP_SIZE_TRAIN=train_gen.n//train_gen.batch_size
STEP_SIZE_VALID=validation_gen.n//validation_gen.batch_size
print(STEP_SIZE_TRAIN)
print(STEP_SIZE_VALID)
562
140
# Fit the model1
history1=model1.fit(train_gen,
steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=validation_gen,
validation_steps=STEP_SIZE_VALID,
epochs=EPOCHS,verbose=1)
History of the epoch as below and trained stopped at epoch -14 as no improvement observed.
Epoch 1/25
562/562 [==============================] - 1484s 3s/step - loss: 0.9437 - accuracy: 0.7290 - val_loss: 0.8678 - val_accuracy: 0.7309
Epoch 2/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8748 - accuracy: 0.7337 - val_loss: 0.8673 - val_accuracy: 0.7309
Epoch 3/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8681 - accuracy: 0.7367 - val_loss: 0.8614 - val_accuracy: 0.7306
Epoch 4/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8619 - accuracy: 0.7333 - val_loss: 0.8592 - val_accuracy: 0.7306
Epoch 5/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8565 - accuracy: 0.7375 - val_loss: 0.8625 - val_accuracy: 0.7304
Epoch 6/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8608 - accuracy: 0.7357 - val_loss: 0.8556 - val_accuracy: 0.7310
Epoch 7/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8568 - accuracy: 0.7335 - val_loss: 0.8614 - val_accuracy: 0.7304
Epoch 8/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8541 - accuracy: 0.7349 - val_loss: 0.8591 - val_accuracy: 0.7301
Epoch 9/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8582 - accuracy: 0.7321 - val_loss: 0.8583 - val_accuracy: 0.7303
Epoch 10/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8509 - accuracy: 0.7354 - val_loss: 0.8599 - val_accuracy: 0.7311
Epoch 11/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8521 - accuracy: 0.7325 - val_loss: 0.8584 - val_accuracy: 0.7304
Epoch 12/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8422 - accuracy: 0.7352 - val_loss: 0.8481 - val_accuracy: 0.7307
Epoch 13/25
562/562 [==============================] - 1463s 3s/step - loss: 0.8511 - accuracy: 0.7345 - val_loss: 0.8477 - val_accuracy: 0.7307
Epoch 14/25
562/562 [==============================] - 1462s 3s/step - loss: 0.8314 - accuracy: 0.7387 - val_loss: 0.8528 - val_accuracy: 0.7300
Epoch 15/25
73/562 [==>...........................] - ETA: 17:12 - loss: 0.8388 - accuracy: 0.7344
Validation accuracy not improving more than 73 % even after several epochs.In the earlier trial i tried the learning rate 0.001 but the case was same with no improvements.
Request suggestions to improve the model accuracy.
Also how can we use Grid search when we use the Image generator for preprocessing and would invite suggestions for the same
Many thanks in advance

your problem is most likely due to overfitting. your data is quite unbalanced and in addition to finding a better model, a better learning rate or a better optimizer. you could also create a custom generator to augment and select your data in a more balanced way.
I use custom generators for most of the models at work, I can't share the full code of generators but I'll show you a pseudocode example of how to create one. it's actually quite fun to play around and add more steps to it. you can -and you probably should- add pre-processing and post-processing steps but I hope this code gives you an overall idea of the process.
import random
import numpy as np
class myCostumGenerator:
def __init__(self) -> None:
# load dataset into a dict, if it's too big then just load filenames and load them at runtime
# each dict key is a class name, and each value is a list of images or filenames
self.dataSet, self.imageHeight, self.imageWidth, self.imageChannels = loadData()
def labelBinarizer(self, label):
# this is how you convert class names into target Y
pass
def augment(self, image):
# this is how you augment your images
pass
def yeildData(self):
while True:#keras generators need to run infinitly
for className, data in self.dataSet.items():
yield self.augment(random.choice(data)), self.labelBinarizer(className)
def getEmptyBatch(self, batchSize):
return (
np.empty([batchSize, self.imageHeight, self.imageWidth, self.imageChannels]),
np.empty([batchSize, len(self.dataset.keys())]), 0)
def getBatches(self, batchSize):
X, Y, i = self.getEmptyBatch(batchSize)
for image, label in self.yieldData():
X[i, ...] = image
Y[i, ...] = label
i += 1
if i== batchSize:
yield X, Y
X, Y, i = self.getEmptyBatch(batchSize)
# your model definition and other stuff
# ...
# ...
# ...
# with this method of defining a generator, you have to set number of steps per epoch
generator = myCostumGenerator()
model.fit(
generator.getBatches(batchSize=256),
steps_per_epoch = 500
# other params
)

TensorFlow not running correct number of epochs with no errors

I am very much novice at neural networks / machine learning. I am trying to learn more by using RotNet, a NN that will classify rotation angles in images. I am trying to train my network using the MNIST dataset, and have changed only one line of the repo (a log directory file path) but other than that have been able to run it successfully.
Here is how I am running it based on the README:
& .../Anaconda3/envs/tflow/python.exe .../RotNet/train/train_mnist.py
and then the output:
Using TensorFlow backend.
Input shape: (28, 28, 1)
60000 train samples
10000 test samples
2020-10-16 12:18:17.031214: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 26, 26, 64) 640
_________________________________________________________________
conv2d_2 (Conv2D) (None, 24, 24, 64) 36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 64) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 12, 12, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 9216) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 1179776
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 360) 46440
=================================================================
Total params: 1,263,784
Trainable params: 1,263,784
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
1/468 [..............................] - ETA: 2:21 - loss: 5.8862 - angle_error: 87.14062020-10-16 12:18:18.337183: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
469/468 [==============================] - 61s 130ms/step - loss: 5.0338 - angle_error: 81.4492 - val_loss: 4.1144 - val_angle_error: 65.9470
Epoch 2/50
469/468 [==============================] - 61s 131ms/step - loss: 4.3072 - angle_error: 64.7485 - val_loss: 3.4630 - val_angle_error: 53.0140
Epoch 3/50
469/468 [==============================] - 63s 134ms/step - loss: 4.0303 - angle_error: 56.3245 - val_loss: 3.2241 - val_angle_error: 47.0283
Epoch 4/50
469/468 [==============================] - 63s 134ms/step - loss: 3.8824 - angle_error: 52.2043 - val_loss: 3.3227 - val_angle_error: 43.2439
Epoch 5/50
469/468 [==============================] - 63s 135ms/step - loss: 3.7982 - angle_error: 49.9996 - val_loss: 3.1930 - val_angle_error: 41.1242
Epoch 6/50
469/468 [==============================] - 73s 155ms/step - loss: 3.7288 - angle_error: 48.4027 - val_loss: 2.9600 - val_angle_error: 39.9322
Epoch 7/50
469/468 [==============================] - 63s 133ms/step - loss: 3.6781 - angle_error: 46.5616 - val_loss: 3.2243 - val_angle_error: 38.6193
Epoch 8/50
469/468 [==============================] - 62s 132ms/step - loss: 3.6439 - angle_error: 45.2133 - val_loss: 2.8629 - val_angle_error: 38.0046
Epoch 9/50
469/468 [==============================] - 62s 132ms/step - loss: 3.6132 - angle_error: 44.7204 - val_loss: 3.0085 - val_angle_error: 37.4514
Epoch 10/50
469/468 [==============================] - 62s 132ms/step - loss: 3.5817 - angle_error: 43.8439 - val_loss: 3.0073 - val_angle_error: 35.8109
The script train_mnist.py is located here and it specifies 50 epochs. I am getting no error, the program simply stops after the 8th or 10th epoch. I am at a loss for how to fix this issue. Any advice would be appreciated!

I took a quick look at the code. In it there is this line:
callbacks=[checkpointer, early_stopping, tensorboard]
The call back early_stopping by default monitors the validation loss. The code used for early stopping is set such that if the validation loss fails to improve for more than 2 consecutive epochs training will halt. That is why it does not train for 50 epochs. If you want it to continue training for the full 50 remove early_stopping from the line of code above. You can see that early_stopping is causing the training to terminate by changing the code in the script from
early_stopping = EarlyStopping(patience=2)
# change code to
early_stopping = EarlyStopping(patience=2, verbose=1)
From the training data this model does not appear to be training very well. I suggest you try transfer learning with MobileNet. Code below shows how to use it,
mobile = tf.keras.applications.mobilenet.MobileNet( include_top=False, input_shape=(img_size, img_size,3), pooling='max', weights='imagenet', dropout=.5)
x=mobile.layers[-1].output # this is the last layer in the mobilenet model the global max pooling layer
x=keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001 )(x)
x=Dense(126, activation='relu')(x)
x=Dropout(rate=.3, seed = 123)(x)
predictions=Dense (len(classes), activation='softmax')(x)
model = Model(inputs=mobile.input, outputs=predictions)
Adapt the above to your situation it should work much better
for layer in model.layers:
layer.trainable=True
model.compile(Adamax(lr=lr), loss='categorical_crossentropy', metrics=['accuracy'])

My validation accuracy is stuck and training accuracy is decreased continuously

I am new to working with LSTM models, but I have a small network. I have extracted MFCC features from my audio files and have flattened it and given as input. But the validation accuracy is stuck between 2 values and my accuracy is decreasing continuously.
I have used RMSprop with a learning rate of 0.001.
I have tried changing Optimizer, adding dropout, and batch normalization.
The dataset is evenly balanced also.
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 3460, 1) 0
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM) (None, 3460, 1024) 4206592
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM) (None, 1024) 8396800
_________________________________________________________________
dense_1 (Dense) (None, 512) 524800
_________________________________________________________________
batch_normalization_1 (Batch (None, 512) 2048
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 256) 131328
_________________________________________________________________
batch_normalization_2 (Batch (None, 256) 1024
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 257
=================================================================
Total params: 13,262,849
Trainable params: 13,261,313
Non-trainable params: 1,536
_________________________________________________________________
Train on 385 samples, validate on 165 samples
Epoch 1/10
385/385 [==============================] - 61s 160ms/step - loss: 1.0811 - accuracy: 0.5143 - val_loss: 0.6917 - val_accuracy: 0.5273
Epoch 2/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7536 - accuracy: 0.5169 - val_loss: 0.6980 - val_accuracy: 0.4727
Epoch 3/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7484 - accuracy: 0.5039 - val_loss: 0.7002 - val_accuracy: 0.4727
Epoch 4/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7333 - accuracy: 0.5091 - val_loss: 0.7030 - val_accuracy: 0.5273
Epoch 5/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7486 - accuracy: 0.4675 - val_loss: 0.6917 - val_accuracy: 0.5273
Epoch 6/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7222 - accuracy: 0.4935 - val_loss: 0.6917 - val_accuracy: 0.5273
Epoch 7/10
385/385 [==============================] - 55s 143ms/step - loss: 0.7208 - accuracy: 0.4883 - val_loss: 0.6919 - val_accuracy: 0.5273
Epoch 8/10
385/385 [==============================] - 55s 142ms/step - loss: 0.7134 - accuracy: 0.4805 - val_loss: 0.6919 - val_accuracy: 0.5273
Epoch 9/10
385/385 [==============================] - 55s 143ms/step - loss: 0.7168 - accuracy: 0.4987 - val_loss: 0.6927 - val_accuracy: 0.5273
Epoch 10/10
385/385 [==============================] - 55s 143ms/step - loss: 0.7089 - accuracy: 0.4909 - val_loss: 0.6926 - val_accuracy: 0.5273
Here is my code:
def build_model():
input = Input((20*173,1))
x = Conv1D(filters=16, kernel_size=4, activation='relu')(input)
x = AveragePooling1D(pool_size=2)(x)
x = Conv1D(filters=16, kernel_size=3, activation='relu')(x)
x = AveragePooling1D(pool_size=2)(x)
x = Flatten()(x)
x = keras.layers.Reshape((13808, 1))(x)
x = CuDNNLSTM(1024, return_sequences=True)(x)
x = CuDNNLSTM(512)(x)
x = Dense(256,activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(128,activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(1,activation='sigmoid')(x)
model = Model(inputs=input, outputs=x)
return model
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.2,patience=3, min_lr=0.001)
opt = RMSprop(lr=0.0001)
m2 = build_model()
m2.compile(loss = "binary_crossentropy", metrics=['accuracy'],optimizer = opt)
m2.fit(X, y, batch_size=16, epochs=10, validation_split=0.3,callbacks = [reduce_lr])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Keras training crashes mid epoch after multiple correct executions - python

The generator was wrong. It incorrectly assumes a finite generator whilst keras expects an infinite one.

Related

how can i continue training from last epoch?

Tensorflow save and load_model not working but save and load_weights does

Validation accuracy not improving imbalanced data

TensorFlow not running correct number of epochs with no errors

My validation accuracy is stuck and training accuracy is decreased continuously

Categories

Resources