So I am doing binary image classification on a small dataset with 250 images per class. I am using transfer learning with ResNet50 as the base architecture, on top of which I've added two hidden layers and a final output layer. After training for 20 epochs, I see that the loss suddenly increases in an early epoch, and I can't understand the reason behind it.
Network architecture -
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.applications import ResNet50

image_input = Input(shape=(224, 224, 3))
model = ResNet50(input_tensor=image_input, include_top=True, weights='imagenet')
last_layer = model.get_layer('avg_pool').output
x = Flatten(name='flatten')(last_layer)
x = Dense(1000, activation='relu', name='fc1000')(x)
x = Dropout(0.5)(x)
x = Dense(200, activation='relu', name='fc200')(x)
x = Dropout(0.5)(x)
out = Dense(num_classes, activation='softmax', name='output')(x)
custom_model = Model(image_input, out)
I am using binary_crossentropy and Adam with default parameters.
(Loss and accuracy plots omitted.)
With such a small amount of data per class there is definitely a risk of overfitting. Try increasing your dataset size and use data augmentation if possible; a rough sketch follows.
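For illustration only (a sketch, not the asker's code; the augmentation choices, frozen backbone, and single-unit sigmoid head are assumptions to adapt), augmentation can be placed in front of the base network like this:

import tensorflow as tf
from tensorflow.keras import layers

# Simple augmentation block; in older TF 2.x these layers live under
# layers.experimental.preprocessing instead.
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

image_input = layers.Input(shape=(224, 224, 3))
x = data_augmentation(image_input)                       # only active during training
x = tf.keras.applications.resnet50.preprocess_input(x)   # ResNet50's expected preprocessing
base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet', pooling='avg')
base.trainable = False                                    # freeze the pretrained backbone
x = base(x, training=False)
x = layers.Dense(200, activation='relu')(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(1, activation='sigmoid')(x)            # single-unit binary head
custom_model = tf.keras.Model(image_input, out)
custom_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])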
I want to use the pretrained VGG19 (with ImageNet weights) to build a two-class classifier using a dataset of about 2.5k images that I've curated and split into two classes. Not only is training taking a very long time, but accuracy does not increase in the slightest.
Here's my implementation:
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

def transferVGG19(train_dataset, val_dataset):
    # conv_model = VGG19(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    conv_model = VGG19(
        include_top=True,
        weights="imagenet",
        input_tensor=None,
        input_shape=(224, 224, 3),
        pooling=None,
        classes=1000,
        classifier_activation="softmax",
    )
    for layer in conv_model.layers:
        layer.trainable = False

    input = layers.Input(shape=(224, 224, 3))
    scale_layer = layers.Rescaling(scale=1 / 127.5, offset=-1)
    x = scale_layer(input)
    x = conv_model(x, training=False)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation='relu')(x)
    predictions = layers.Dense(1, activation='softmax')(x)
    full_model = models.Model(inputs=input, outputs=predictions)
    full_model.summary()

    full_model.compile(loss='binary_crossentropy',
                       optimizer='adam',
                       metrics=['acc'])

    history = full_model.fit(
        train_dataset,
        epochs=10,
        validation_data=val_dataset,
        workers=10,
    )
Model performance seems to be awful...
I imagine this behaviour comes from my rudimentary understanding of how layers work and how to best design the new model's architecture. As VGG19 is trained on 1000 classes, I saw it best to add a couple of dense layers to the output to reduce the size of the feature maps, as well as a dropout layer in between to randomly discard neurons and help reduce the risk of overfitting. At first I suspected I might have dropped too many neurons, but I was expecting my network to learn slower rather than not at all.
Is there something obviously wrong in my implementation that would cause such poor performance? Any explanation is welcome. Just to mention, I would rule out the dataset as an issue because I've implemented transfer learning on Xception and managed to get 98% validation accuracy that was monotonically increasing over 20 epochs. That implementation used different layers (I can provide it if necessary) because I was experimenting with different network layouts.
TL;DR: Change include_top=True to False.
Explanation:
Model graphs are drawn in an inverted manner, i.e. the last layers are shown at the top and the initial layers at the bottom.
With include_top=False, the top dense layers, which are used for classification rather than for representing the data, are removed from the pretrained VGG model; only the layers up to the last Conv2D layer are preserved.
During transfer learning, you keep the learned representation layers intact and only learn the classification part for your data. Hence you add your own stack of classification layers, i.e.:
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation='relu')(x)
predictions = layers.Dense(1, activation='softmax')(x)
If you keep the top classification layers of VGG, it will output 1000 probabilities for 1000 classes, because of the softmax activation at its top layer in the model graph. This activation is not ReLU. We don't need softmax in an intermediate layer, as softmax "squishes" the unscaled inputs so that sum(input) = 1; effectively it produces a smooth, differentiable approximation of argmax. Hence your accuracy is suffering.
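To make the suggested fix concrete, a minimal sketch (my adaptation, not the asker's exact code; it also swaps the final softmax for a sigmoid, the usual pairing with binary_crossentropy on a single output unit) could look like:

import tensorflow as tf
from tensorflow.keras import layers, models

# Feature extractor only: include_top=False drops the 1000-way classifier head.
conv_model = tf.keras.applications.VGG19(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3))
conv_model.trainable = False

inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.vgg19.preprocess_input(inputs)  # VGG19's expected preprocessing
x = conv_model(x, training=False)
x = layers.GlobalAveragePooling2D()(x)                     # or Flatten()
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)         # sigmoid for a 1-unit binary head
full_model = models.Model(inputs, outputs)
full_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])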
I am new to machine learning and I am currently trying to create a siamese network that can predict the similarity of brand logos.
I have a dataset with ~210,000 brand logos.
The CNN for the siamese network looks like the following:
import os
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     GlobalAveragePooling2D, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

def build_cnn(inputShape, embeddingDim=48):
    # specify the inputs for the feature extractor network
    inputs = Input(shape=inputShape)
    # define the first set of CONV => RELU => POOL => DROPOUT layers
    x = Conv2D(64, (2, 2), padding="same", activation="relu")(inputs)
    x = MaxPooling2D(pool_size=(5, 5))(x)
    x = Dropout(0.3)(x)
    # second set of CONV => RELU => POOL => DROPOUT layers
    x = Conv2D(64, (2, 2), padding="same", activation="relu")(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    pooledOutput = GlobalAveragePooling2D()(x)
    outputs = Dense(embeddingDim)(pooledOutput)
    # build the model
    model = Model(inputs, outputs)
    model.summary()
    plot_model(model, to_file=os.path.sep.join([config.BASE_OUTPUT, 'model_cnn.png']))
    # return the model to the calling function
    return model
The siamese network looks like this (the model here is the CNN described above):
imgA = Input(shape=config.IMG_SHAPE)
imgB = Input(shape=config.IMG_SHAPE)
featureExtractor = siamese_network.build_cnn(config.IMG_SHAPE)
featsA = featureExtractor(imgA)
featsB = featureExtractor(imgB)
distance = Lambda(euclidean_distance)([featsA, featsB])
outputs = Dense(1, activation="sigmoid")(distance)
model = Model(inputs=[imgA, imgB], outputs=outputs)
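euclidean_distance is a small helper defined elsewhere in my code; a typical implementation (shown here only as an assumed sketch, it may differ from the actual one) looks like:

from tensorflow.keras import backend as K

def euclidean_distance(vectors):
    # unpack the two embedding vectors and compute their Euclidean distance,
    # clamping with epsilon to avoid a zero gradient from sqrt(0)
    featsA, featsB = vectors
    sum_squared = K.sum(K.square(featsA - featsB), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_squared, K.epsilon()))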
My first test was with 800 positive and 800 negative pairs, and the accuracy and loss look like this:
My thought was that some overfitting was happening, so my approach was to create more training data (2000 positive and negative pairs) and train the model again, but unfortunately the model did not improve at all, even after 20+ epochs.
For both cases I used the following to train my network:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
print("[INFO] training model...")
history = model.fit(
[pairTrain[:, 0], pairTrain[:, 1]], labelTrain[:],
validation_data=([pairTest[:, 0], pairTest[:, 1]], labelTest[:]),
batch_size=10,
shuffle=True,
epochs=50)
I cannot figure out what is happening here, so I would be really grateful for any help.
My question is: why does the siamese network learn (or at least appear to learn) with less training data, but as soon as I add more, the accuracy stays constant and does not improve at all?
EDIT
Following Alberto's comment, I tried it with selu (still not working):
EDIT2
With LeakyReLU it looks like this:
My latest training result with 10k Pairs looks like this:
I've seen this happen other times; I don't know if it's the same case, but this GitHub issue shows literally the same loss.
As in that case, I think this is more a problem of initialization, so at this point you should use He initialization, or a non-saturating activation function (for example, try tf.keras.layers.LeakyReLU or tf.keras.activations.selu). A small sketch follows.
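As a rough sketch of how that suggestion could be applied to the build_cnn feature extractor above (my adaptation, not the answerer's code; the alpha value is an assumption to tune):

from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     GlobalAveragePooling2D, Dense, LeakyReLU)
from tensorflow.keras.models import Model

def build_cnn(inputShape, embeddingDim=48):
    inputs = Input(shape=inputShape)
    # He initialization pairs well with ReLU-family activations
    x = Conv2D(64, (2, 2), padding="same", kernel_initializer="he_normal")(inputs)
    x = LeakyReLU(alpha=0.1)(x)          # non-saturating activation
    x = MaxPooling2D(pool_size=(5, 5))(x)
    x = Dropout(0.3)(x)
    x = Conv2D(64, (2, 2), padding="same", kernel_initializer="he_normal")(x)
    x = LeakyReLU(alpha=0.1)(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    pooledOutput = GlobalAveragePooling2D()(x)
    outputs = Dense(embeddingDim)(pooledOutput)
    return Model(inputs, outputs)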
I've been trying to train an audio classification model. When I used SGD with learning_rate=0.01, momentum=0.0 and nesterov=False, I got the following loss and accuracy graphs:
I can't figure out what causes the sudden decrease in loss at around epoch 750. I tried different learning rates, momentum values and their combinations, different batch sizes, initial layer weights etc. to get a more typical curve, but with no luck at all. So if you have any idea what causes this, please let me know.
The code I used for this training is below:
import tensorflow as tf
from tensorflow.keras.optimizers import SGD

# (mfcc_inputs, spec_inputs and chroma_inputs are Keras Input layers defined elsewhere)
# MFCCs Model
x = tf.keras.layers.Dense(units=512, activation="sigmoid")(mfcc_inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(units=256, activation="sigmoid")(x)
x = tf.keras.layers.Dropout(0.5)(x)
# Spectrograms Model
y = tf.keras.layers.Conv2D(32, kernel_size=(3,3), strides=(2,2))(spec_inputs)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.layers.Dense(units=256, activation="sigmoid")(y)
y = tf.keras.layers.Dropout(0.5)(y)
# Chroma Model
t = tf.keras.layers.Dense(units=512, activation="sigmoid")(chroma_inputs)
t = tf.keras.layers.Dropout(0.5)(t)
t = tf.keras.layers.Dense(units=256, activation="sigmoid")(t)
t = tf.keras.layers.Dropout(0.5)(t)
# Merge Models
concated = tf.keras.layers.concatenate([x, y, t])
# Dense and Output Layers
z = tf.keras.layers.Dense(64, activation="sigmoid")(concated)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(64, activation="sigmoid")(z)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(1, activation="sigmoid")(z)
mdl = tf.keras.Model(inputs=[mfcc_inputs, spec_inputs, chroma_inputs], outputs=z)
mdl.compile(optimizer=SGD(), loss="binary_crossentropy", metrics=["accuracy"])
mdl.fit([M_train, X_train, C_train], y_train, batch_size=8, epochs=1000, validation_data=([M_val, X_val, C_val], y_val), callbacks=[tensorboard_cb])
I'm not too sure myself, but as Frightera said, sigmoid activations in hidden layers can cause trouble since they are more sensitive to weight initialization; if the weights aren't set well, the gradients can be very small. Perhaps the model eventually works through the small sigmoid gradients and the loss finally decreases around epoch 750, but that's just my hypothesis. If ReLU doesn't work, try LeakyReLU, since it doesn't have the dead-neuron effect that ReLU does. A sketch of the swap is below.
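For illustration (my adaptation, not part of the original answer), the hidden sigmoid activations in, say, the MFCC branch could be replaced with ReLU-family activations plus He initialization, while the final binary output keeps its sigmoid:

import tensorflow as tf

# Hypothetical MFCC branch with LeakyReLU hidden activations instead of sigmoid;
# mfcc_inputs is the same Input layer used in the question's code.
x = tf.keras.layers.Dense(units=512, kernel_initializer="he_normal")(mfcc_inputs)
x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(units=256, kernel_initializer="he_normal")(x)
x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
x = tf.keras.layers.Dropout(0.5)(x)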
Hi, I am trying to do transfer learning in Keras, loading weights into a new model from a model I trained myself on a different task.
I have trained my own set of weights on another task. That other task, however, is a binary classification problem, while my new problem is a multi-label classification problem.
I got my first set of weights doing this:
n_classes = 1
epochs = 100
batch_size = 32
input_shape = (224, 224, 3)
base_model = MobileNetV2(input_shape=input_shape, weights= None, include_top=False)
x = GlobalAveragePooling2D()(base_model.output)
output = Dense(n_classes, activation='sigmoid')(x)
model = tf.keras.models.Model(inputs=[base_model.input], outputs=[output])
opt = optimizers.Adam(lr = 0.001)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
...
...
history = model.fit(train_generator, epochs=epochs,
                    steps_per_epoch=step_size_train, verbose=1,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    class_weight=class_weights,
                    )
model.save_weights("initial-weights.h5")
But when I try to load these weights into my new model:
weights_path = 'initial-weights.h5'
n_classes = 14
epochs = 1000
batch_size = 32
input_shape = (224, 224, 3)
base_model = MobileNetV2(input_shape=input_shape, weights= None, include_top=False)
x = GlobalAveragePooling2D()(base_model.output)
output = Dense(n_classes, activation='sigmoid')(x)
model = tf.keras.models.Model(inputs=[base_model.input], outputs=[output])
opt = optimizers.Adam(lr = 0.001)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.load_weights(weights_path)
I get the following error:
ValueError: Shapes (1280, 14) and (1280, 1) are incompatible
I understand, based on the error, that this is very likely due to the difference in the number of classes, but from what I know about transfer learning, it is possible to transfer weights between tasks even if the number of classes is different (like how ImageNet weights are used for tasks with a different number of classes).
How do I initialize my own set of custom weights that are trained from a different task that has a different number of classes?
I think the best approach is to transfer the weights for all layers except the last one (i.e. the feature-extraction part). You can then freeze all the transferred weights and train the model again, so that only the weights of the last (classification) layer are trained. A sketch is below.
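One way to do this (a sketch under the assumption that both models are built from the same MobileNetV2 code, so layer names match) is to load the saved weights by name, skip the mismatched final layer, and then freeze the backbone:

# Load every layer whose name and shape match; the final Dense (1 vs 14 units)
# is skipped instead of raising a shape error. skip_mismatch requires by_name=True.
model.load_weights('initial-weights.h5', by_name=True, skip_mismatch=True)

# Freeze the transferred feature extractor and train only the new classification head.
for layer in base_model.layers:
    layer.trainable = False
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])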
I am trying to use a CNN architecture to classify text sentences. The architecture of the network is as follows:
from tensorflow.keras.layers import (Input, Conv1D, Dropout, MaxPooling1D,
                                     Dense, Flatten)
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name = "Text_input")
conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
pool1 = MaxPooling1D(pool_size=2)(drop21)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
drop22 = Dropout(0.5)(conv22)
pool2 = MaxPooling1D(pool_size=2)(drop22)
dense = Dense(16, activation='relu')(pool2)
flat = Flatten()(dense)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I use some callbacks, such as EarlyStopping and ReduceLROnPlateau, to stop the training and to reduce the learning rate when the validation loss stops improving.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=5)
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath,
                                   save_weights_only=False,
                                   monitor='val_loss',
                                   mode="auto",
                                   save_best_only=True)
learning_rate_decay = ReduceLROnPlateau(monitor='val_loss',
                                        factor=0.1,
                                        patience=2,
                                        verbose=1,
                                        mode='auto',
                                        min_delta=0.0001,
                                        cooldown=0,
                                        min_lr=0)
Once the model is trained, the training history looks as follows:
We can observe that the validation loss stops improving from epoch 5 on, while the training loss keeps decreasing, i.e. the model is overfitting the training data with each epoch.
I would like to know if I'm doing something wrong in the architecture of the CNN. Aren't the dropout layers enough to avoid overfitting? What other ways are there to reduce overfitting?
Any suggestions?
Thanks in advance.
Edit:
I have also tried regularization, and the results were even worse:
kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)
Edit 2:
I have tried applying BatchNormalization layers after each convolution, and the result is as follows:
norm = BatchNormalization()(conv2)
Edit 3:
After applying the LSTM architecture:
text_input = Input(shape=X_train_vec.shape[1:], name = "Text_input")
conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(drop21)
drop22 = Dropout(0.5)(conv22)
lstm1 = Bidirectional(LSTM(128, return_sequences = True))(drop22)
lstm2 = Bidirectional(LSTM(64, return_sequences = True))(lstm1)
flat = Flatten()(lstm2)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Overfitting can be caused by many factors; it happens when your model fits the training set too well.
To handle it, you can try several things:
Add more data
Use data augmentation
Use architectures that generalize well
Add regularization (mostly dropout, L1/L2 regularization are also possible)
Reduce architecture complexity.
For more detail, see https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d. A small sketch combining a couple of these ideas follows.
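For illustration only (my sketch, not the answerer's code), the "add regularization" and "reduce architecture complexity" points could look like a single, slimmer convolutional block with a mild L2 penalty; the filter counts and the 1e-4 factor are assumptions to tune:

from tensorflow.keras.layers import Input, Conv1D, Dropout, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")
# Single, smaller Conv1D block instead of two, with L2 regularization
x = Conv1D(filters=64, kernel_size=5, activation='relu',
           kernel_regularizer=l2(1e-4))(text_input)
x = Dropout(0.5)(x)
x = GlobalMaxPooling1D()(x)
x = Dense(64, activation='relu', kernel_regularizer=l2(1e-4))(x)
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inputs=text_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])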
This is screaming transfer learning. google-universal-sentence-encoder is perfect for this use case. Replace your model with:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the multilingual encoder
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")
# this next layer might need some tweaking dimension wise, to correctly fit
# X_train in the model
squeezed = tf.keras.layers.Lambda(lambda x: tf.squeeze(x))(text_input)
# conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
# drop21 = Dropout(0.5)(conv2)
# pool1 = MaxPooling1D(pool_size=2)(drop21)
# conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
# drop22 = Dropout(0.5)(conv22)
# pool2 = MaxPooling1D(pool_size=2)(drop22)
# 1) you might need `squeezed = tf.expand_dims(squeezed, axis=0)` here
# 2) If you're classifying English only, you can use the link to the normal `google-universal-sentence-encoder`, not the multilingual one
# 3) both the English and multilingual versions have a `-large` variant: more accurate but slower to train and infer
embedded = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')(squeezed)
# this layer seems out of place,
# dense = Dense(16, activation='relu')(embedded)
# you don't need to flatten after a dense layer (in your case) or a backbone (in my case (google-universal-sentence-encoder))
# flat = Flatten()(dense)
dense = Dense(128, activation='relu')(embedded)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
I think that since you are doing text classification, adding 1 or 2 LSTM layers might help the network learn better, since it will be able to better associate with the context of the data. I suggest adding the following code before the flatten layer:
lstm1 = Bidirectional(LSTM(128, return_sequences=True))(drop22)
lstm2 = Bidirectional(LSTM(64))(lstm1)
LSTM layers can help a neural network learn associations between certain words and might improve the accuracy of your network.
I also suggest dropping the max pooling layers, since max pooling, especially in text classification, can lead the network to drop some useful features.
Just keep the convolutional layers and the dropout. Also remove the Dense layer before the flatten and add the aforementioned LSTMs.
It is unclear how you feed the text into your model. I am assuming that you tokenize the text to represent it as a sequence of integers, but do you use any word embedding prior to feeding it into your model? If not, I suggest you add a trainable TensorFlow Embedding layer at the start of your model. There is a clever technique called embedding lookup to speed up its training, but you can save that for later. Try adding this layer to your model; then your Conv1D layer would have a much easier time working on a sequence of floats. I also suggest adding BatchNormalization after each Conv1D, which should help to speed up convergence and training. A rough sketch is below.
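As a rough sketch of that suggestion (my illustration; the vocabulary size, embedding dimension, and sequence length are placeholder assumptions):

from tensorflow.keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                                     Dropout, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

vocab_size = 20000   # assumed size of the tokenizer vocabulary
max_len = 200        # assumed padded sequence length

text_input = Input(shape=(max_len,), name="Text_input")               # integer token ids
x = Embedding(input_dim=vocab_size, output_dim=128)(text_input)       # trainable embeddings
x = Conv1D(filters=128, kernel_size=5, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Conv1D(filters=64, kernel_size=5, activation='relu')(x)
x = BatchNormalization()(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inputs=text_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])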