Overfitting - huge difference between training and validation accuracy - python

I have a dataset of 180k images on which I try to recognize the characters (license plate recognition). Every license plate contains seven characters, and 35 characters are possible, so the output y has shape (7, 35). I therefore one-hot encoded every license plate label.
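For reference, the encoding is essentially the following (a minimal sketch; ALPHABET is just a placeholder for the 35 allowed characters):
import string
import numpy as np

# Placeholder alphabet: 26 letters + digits 1-9 = 35 symbols.
ALPHABET = string.ascii_uppercase + string.digits[1:]

def encode_plate(plate):
    # One row per character position, one column per possible character.
    y = np.zeros((7, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(plate):
        y[pos, ALPHABET.index(char)] = 1.0
    return y

print(encode_plate("AB123CD").shape)  # (7, 35)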
I use the bottom of the EfficientNet-B0 model (https://keras.io/api/applications/efficientnet/#efficientnetb0-function) together with a customized top, which is split into 7 branches (one per license plate character). I used the ImageNet weights and froze the bottom layers of efnB0_model:
def create_model(input_shape=(224, 224, 3)):
    input_img = Input(shape=input_shape)
    model = efnB0_model(input_img)
    model = GlobalAveragePooling2D(name='avg_pool')(model)
    model = Dropout(0.2)(model)
    backbone = model

    branches = []
    for i in range(7):
        branches.append(backbone)
        branches[i] = Dense(360, name="branch_"+str(i)+"_Dense_16000")(branches[i])
        branches[i] = BatchNormalization()(branches[i])
        branches[i] = Activation("relu")(branches[i])
        branches[i] = Dropout(0.2)(branches[i])
        branches[i] = Dense(35, activation="softmax", name="branch_"+str(i)+"_output")(branches[i])

    output = Concatenate(axis=1)(branches)
    output = Reshape((7, 35))(output)
    model = Model(input_img, output)
    return model

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
For training and validating the model I only use 10,000 training images and 3,000 validation images, because the size of the model and the huge amount of data would otherwise make training very slow.
I use this DataGenerator to feed batches to my model:
import math
import numpy as np
from skimage.io import imread          # assuming scikit-image for reading and resizing
from skimage.transform import resize
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1. / 255
        batch_y = self.y[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_y = np.array(batch_y)
        return batch_x, batch_y
I fit the model using this code:
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 32,
                    validation_steps=num_val_samples // 32,
                    epochs=10, workers=6, use_multiprocessing=True)
Now, after several epochs of training, I observe big differences between training accuracy and validation accuracy. I think one reason for that is the small amount of data. Which other factors influence this overfitting in my model? Do you think there is something completely wrong with my code/model? Do you think the model is too big and complex, or is it maybe due to the preprocessing of the data?
Note: I already experimented with data augmentation and tried the model without transfer learning. That leads to poor results on training AND validation data. So, is there anything else I could do?

First, a disclaimer
Are you sure this is the right approach to follow? EfficientNet is a model created for image classification, while your task demands correct localization of 7 characters in one image, recognition of each one of them, and also keeping the order of the characters. Maybe an approach of detection + segmentation followed by recognition, like in this Medium post, is more efficient (pun intended). Even though I think this is very likely your real problem, I will try to answer your original question.
Now some general tips regarding overfitting
There's a very good guide in the Keras documentation on how to use EfficientNet for transfer learning. I will try to summarize some tips here.
From your question it seems that you are not even doing the fine-tuning step, which is essential for the network to learn the task better (see the sketch after these tips).
Now, after several epochs of training, I observed big differences regarding training accuracy and validation accuracy.
By several, how many epochs do you mean? From the image you put in the question, I think the second complete epoch is too soon to infer that your model is overfitting. Also, from the code (10 epochs) and from the image you posted (20 epochs), I would train for more epochs, like 40.
Increase the dropout. Try some configurations like 30%, 40%, 50%.
Data augmentation effectively increases the number of samples you have. However, you have 180K images and are only using 10K of them; augmentation is good, but when you have more images available, try using them first. From the guide I mentioned, it seems feasible to train this model with more images on Google Colab, so try to increase the training set size. Still on the topic of augmentation, some transformations may be harmful to your task, like too much rotation or reflection, since you are trying to recognize numbers and letters.
Reducing the batch size to 16 may provide more regularization, which helps to fight overfitting. Speaking of regularization, try applying regularization to the dense layers that you are adding.
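For illustration, a rough sketch of the fine-tuning step and of weight regularization on the custom dense layers could look like the following (this reuses efnB0_model and model from the question; the number of unfrozen layers, the learning rate and the L2 factor are example values to tune, not prescriptions):
from tensorflow import keras
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras import regularizers

# Unfreeze only the last ~20 layers of the EfficientNet base for fine-tuning.
efnB0_model.trainable = True
for layer in efnB0_model.layers[:-20]:
    layer.trainable = False
for layer in efnB0_model.layers:
    if isinstance(layer, BatchNormalization):
        layer.trainable = False  # keep BatchNorm layers frozen during fine-tuning

# Recompile with a much lower learning rate before continuing training.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])

# L2 regularization on the custom dense branches (when building the head):
# branches[i] = Dense(360, kernel_regularizer=regularizers.l2(1e-4),
#                     name="branch_"+str(i)+"_Dense_16000")(branches[i])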
EDIT:
After quickly reading the paper you linked, I reaffirm my point about the epochs, since the results in the paper are shown for 100 epochs. Also, from the charts in the paper, it is not possible to confirm that the author did not have overfitting too. Additionally, the changes to the Xception network are not clear at all. Changing the input layer impacts the dimensions of all other layers because of the way the convolution operation works, and this is not discussed in the paper. The operations performed to achieve that output dimension are not clear either. Besides what you did, I would suggest using a pooling layer to get the output dimensions you want. Finally, the paper doesn't explain how the positioning of the plate is guaranteed. I would try to get more details about the paper you are trying to reproduce, to be sure you are not missing anything in your model.

I have been working on a character detection + recognition problem for an industrial application. From my experience, using only a deep CNN and dense layers to predict character classes is not the best approach to this problem. There are good research papers on scene text recognition; one common way to design a character recognition pipeline is:
Any deep CNN model, like VGG, ResNet or EfficientNet, to extract the image features.
Then some RNN layers on top of the CNN backbone to turn the extracted features into a character sequence. This is a great plus if you want to predict a variable number of characters.
After getting the character sequence from the RNN layers, the next step is to decode it. For this you can use either a CTC-based method or an attention mechanism. Both have their own pros and cons: CTC-based methods are fast but perform a bit worse, while attention-based models give good results but are very slow. So the choice of method depends on your requirements.
The image below, from the very famous text recognition paper CRNN, gives a general idea of the above steps; a minimal Keras sketch follows it.
[CRNN architecture figure: convolutional feature extraction, recurrent layers, transcription layer]
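A minimal Keras sketch of the CNN → RNN → per-timestep softmax part of such a pipeline (input shape and NUM_CLASSES are made-up example values; the CTC loss wiring, e.g. via keras.backend.ctc_batch_cost, is left out):
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 36  # hypothetical: 35 characters + 1 blank for CTC

inputs = layers.Input(shape=(32, 128, 1))                      # (height, width, channels)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)                             # -> (16, 64, 64)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)                             # -> (8, 32, 128)
# Use the width axis as the time axis: (batch, width, height * channels).
x = layers.Permute((2, 1, 3))(x)
x = layers.Reshape((32, 8 * 128))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)   # per-timestep character distribution

model = keras.Model(inputs, outputs)
model.summary()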
For training the model, @Hemerson has given good suggestions. Try to build and train this type of model in multiple stages, and I am sure you will get better results. :)
Best regards!

Related

Getting Validation Accuracy of 99% with MNIST with less than 10000 parameters CNN

Given the MNIST dataset in Keras, the challenge is to develop a CNN model with fewer than 10k parameters that reaches 99% validation accuracy.
I tried building such a model but am only getting 98.71% accuracy.
import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, concatenate
from keras.models import Model

def create_model():
    inputs = Input(shape=(28, 28, 1))  # added here; the original snippet assumes `inputs` is defined elsewhere

    lay1 = Conv2D(2, kernel_size=(1, 1), activation='relu', padding='same')(inputs)
    lay1 = Conv2D(2, kernel_size=(7, 7), strides=(2, 2), activation='relu', padding='same')(lay1)
    lay1 = Conv2D(2, kernel_size=(1, 1), activation='relu', padding='same')(lay1)
    lay1 = MaxPooling2D(pool_size=(7, 7), strides=(2, 2), padding='same')(lay1)

    lay2 = Conv2D(4, kernel_size=(1, 1), activation='relu', padding='same')(inputs)
    lay2 = Conv2D(4, kernel_size=(7, 7), strides=(2, 2), activation='relu', padding='same')(lay2)
    lay2 = Conv2D(4, kernel_size=(1, 1), activation='relu', padding='same')(lay2)
    lay2 = MaxPooling2D(pool_size=(7, 7), strides=(2, 2), padding='same')(lay2)

    lay3 = Conv2D(6, kernel_size=(1, 1), activation='relu', padding='same')(inputs)
    lay3 = Conv2D(6, kernel_size=(7, 7), strides=(2, 2), activation='relu', padding='same')(lay3)
    lay3 = Conv2D(6, kernel_size=(1, 1), activation='relu', padding='same')(lay3)
    lay3 = MaxPooling2D(pool_size=(7, 7), strides=(2, 2), padding='same')(lay3)

    fc = concatenate([lay1, lay2, lay3])
    fc = Flatten()(fc)
    fc = Dense(10, activation='relu')(fc)
    outputs = Dense(10, activation='softmax')(fc)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])
    return model
The total number of parameters is 8,862, the batch size is 32, and the number of epochs is 10.
Can you please suggest ways to improve the model, within the constraint on the number of parameters, so that the validation accuracy reaches 99% or above?
You should be able to add another Conv/Pool block if you reduce the kernel_size in the previous ones to 3 or 5 instead of 7. If research has told us one thing recently, it is that deeper is (mostly) better.
Otherwise, play with the hyperparameters (learning rate, batch size, ...) or augment the input data randomly. Batch or layer normalization have proven very useful lately (and they only introduce a handful of parameters).
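As a rough illustration of that direction (small 3x3 kernels, more depth, batch normalization), a sketch like the one below should come in around 6-7k parameters (verify with model.summary()); whether it actually reaches 99% still depends on the training setup (augmentation, learning-rate schedule, number of epochs):
from tensorflow import keras
from tensorflow.keras import layers

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.summary()  # check that the parameter count stays below 10k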
Here is a blog post which aims to train a CIFAR10 ResNet classifier to 94% accuracy in a few seconds. Although the goal is a little different, some tricks might be useful for you, such as the CELU activation or label smoothing.

Reducing the number of output features of a pretrained model in Keras

I want to extract 1000 image features using the pretrained Xception model,
but the Xception model's last layer (avg_pool) gives 2048 features.
Can I reduce the number of final output features without additional training?
I want the image features before the softmax, not the prediction result.
base_model = xception.Xception(include_top=True, weights='imagenet')
base_model.summary()
self.model = Model(inputs = base_model.input, outputs=base_model.get_layer('avg_pool').output)
This model was trained to produce embeddings in a 2048-dimensional space for the classifier after it. There is no point in trying to reduce the dimensionality of the embedding space unless you are combining very complex and inflexible models. If you are just doing simple transfer learning with no memory constraints, just snap your new classifier (extra layers) on top of it, and retrain after freezing (or not) all layers in the original Xception. That should work regardless of the Xception output_shape. See the Keras docs.
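For instance, snapping a small classifier on top of a frozen Xception could look roughly like this sketch (NUM_CLASSES is a hypothetical value for your task):
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10  # hypothetical number of target classes

base = keras.applications.Xception(include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the pretrained backbone

inputs = layers.Input(shape=(299, 299, 3))
x = base(inputs, training=False)  # 2048-d embedding from global average pooling
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])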
That said, if you REALLY need to reduce the dimensionality to 1000-d, you will need a method that preserves (or at least tries to preserve) the original topology of the embedding space, otherwise your model will not benefit at all from transfer learning. Take a look at PCA, SVD, or t-SNE.
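If you do go down that road, a rough PCA sketch on precomputed features (random placeholder data standing in for your real 2048-d embeddings) would be:
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for self.model.predict(...) output of shape (num_images, 2048).
features = np.random.rand(2000, 2048).astype("float32")

pca = PCA(n_components=1000)           # requires num_images >= 1000
reduced = pca.fit_transform(features)  # -> (num_images, 1000)
print(reduced.shape)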

CTC model does not learn

I am trying to program a Keras model for audio transcription using connectionist temporal classification. Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.
My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%).
There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL class, and added a CTC loss function.
This CTC model does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 in a dozen epochs of 80 sentences each, but then it went back up to 430), and the maximum likelihood output it produces creeps around [nh for each of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn] – the [ and ] are part of the sequence, representing the pause at the start and end of the audio.
I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful about how I train the model, starting maybe with audio snippets of one, two or three sounds each before giving the model complete sentences? In short,
How do I get this CTC model to learn?
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)
    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])

output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation=softmax)(connector)

loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])

ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))
ctc_lambda_func and the code to generate sequences from the predictions are from the OCR example.
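For reference, the function in that example is essentially a thin wrapper around Keras' built-in CTC batch cost (reproduced here slightly simplified):
from keras import backend as K

def ctc_lambda_func(args):
    # y_pred: softmax outputs of shape (batch, time, classes); the remaining
    # arguments are the label tensor and the per-sample input/label lengths.
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)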
It is entirely invisible from the code given here, but elsewhere the OP links to their GitHub repository. The error actually lies in the data preparation:
The data are log spectrograms. They are unnormalized and mostly highly negative. The CTC loss picks up on the general distribution of labels much faster than the LSTM layers can adapt their input bias and input weights, so all variation in the input is flattened out. The local minimum of the loss might then come from epochs when the marginalized distribution of labels is not yet assumed globally.
The solution to this is to scale the input spectrograms such that they contain both positive and negative values:
for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1

Neural network classifies everything into one class, recall=1 on imbalanced dataset

I'm trying to do binary classification with a deep neural network (specifically VGG16) in Keras. Unfortunately, I have a very imbalanced dataset (15,000/1,800 images) and just can't find a way to circumvent that.
Results that I'm seeing
(on training and validation data)
Recall = 1
Precision = 0.1208 (which is exactly the ratio between class 0 and class 1 samples)
AUC = 0.88 (after ~30 epochs with SGD, which seems to be 1 - Precision)
What I've done
Switching from loss/accuracy metrics to AUC with this little helper
Utilizing class_weight as described here, which doesn't seem to help
Trying different optimizers (SGD, Adam, RMSProp)
Adding BatchNormalization layers to my (untrained) VGG16 and setting use_bias to False on the convolutional layers. See my whole network in a gist here.
Doing augmentation to enlarge the dataset with Keras' built-in ImageDataGenerator.
What I think could help further (but did not try yet)
Doing more data augmentation for one class than for the other. Unfortunately, I'm using one ImageDataGenerator for all my training data, and I don't know how to augment one class more than the other.
Maybe a custom loss function which penalises false decisions more heavily? How would I implement that? Currently I'm just using binary_crossentropy. (A rough sketch of what I have in mind follows this list.)
Theoretically I could adjust the class-membership-threshold for prediction but that doesn't help with training and would not improve the result, right?
Maybe decrease the batch size as suggested here. But I don't really see why that should help. Currently I'm determining the batch size programmatically to show all the training and validation data to the network in one epoch:
steps_per_epoch = int(len(train_gen.filenames) / args.batch_size)
validation_steps = int(len(val_gen.filenames) / args.batch_size)
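A rough, untested sketch of the kind of weighted loss I mean (the pos_weight value is just a guess based on the class ratio and would need tuning):
import tensorflow as tf

def weighted_binary_crossentropy(pos_weight=8.0):
    # Penalize errors on the positive (minority) class more heavily.
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        bce = -(pos_weight * y_true * tf.math.log(y_pred)
                + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(bce)
    return loss

# usage: model.compile(optimizer="adam", loss=weighted_binary_crossentropy(8.0), metrics=["accuracy"])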
What do you think I should tackle first, or do you have a better idea? I'm also grateful for any help with implementation details.
Thank you so much in advance!
Maybe try to prepare class-balanced batches (which involves duplicating class 1 samples) as described in https://community.rstudio.com/t/ensure-balanced-mini-batches-while-training/7505 (RStudio). Also read Neural Network - Working with an imbalanced dataset and balancing an imbalanced dataset with keras image generator.
Another possibility is to perform feature extraction during preprocessing, i.e. running image processing algorithms over the images to highlight characteristic features.
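A simple sketch of such oversampling before handing the file list to a generator (assuming paths and labels are NumPy arrays of file names and 0/1 class labels):
import numpy as np

rng = np.random.default_rng(42)
minority_idx = np.where(labels == 1)[0]
majority_idx = np.where(labels == 0)[0]

# Sample the minority class with replacement until both classes are the same size.
oversampled = rng.choice(minority_idx, size=len(majority_idx), replace=True)
balanced_idx = rng.permutation(np.concatenate([majority_idx, oversampled]))

balanced_paths, balanced_labels = paths[balanced_idx], labels[balanced_idx]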

gensim Doc2Vec vs tensorflow Doc2Vec

I'm trying to compare my implementation of Doc2Vec (via TensorFlow) with gensim's implementation. At least visually, the gensim one seems to perform better.
I ran the following code to train the gensim model, and the one below that for the TensorFlow model. My questions are as follows:
Is my TF implementation of Doc2Vec correct? Basically, is it supposed to concatenate the word vectors and the document vector to predict the middle word in a certain context?
Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one, or is it 5 on either side? The thing is, there are quite a few documents that are shorter than 10 words.
Any insights as to why gensim is performing better? Is my model any different from how they implement it?
Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinitely many solutions since it is a rank-deficient problem. (This last question is simply a bonus.)
Gensim
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)
TF
batch_size = 512
embedding_size = 100  # Dimension of the embedding vector.
num_sampled = 10      # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size/context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size/context_window, 1])

    # The variables
    word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
                                                      stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                             shape=[int(batch_size/context_window), -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                                     train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
Update:
Check out the Jupyter notebook here (I have both models working and tested there). It still feels like the gensim model performs better in this initial analysis.
Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:
window is 1-side size, so window=5 would be 5*2+1 = 11 words.
Note that with PV-DM version of doc2vec, the batch_size would be the number of documents. So train_word_dataset shape would be batch_size * context_window, while train_doc_dataset and train_labels shapes would be batch_size.
More importantly, sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss.
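To make the difference concrete, here is a sketch contrasting the two ops (modern tf.nn names; shapes are illustrative, and gensim's negative sampling is closer in spirit to NCE than to sampled softmax):
import tensorflow as tf

batch, dim, num_classes, num_sampled = 32, 100, 5000, 10
weights = tf.Variable(tf.random.truncated_normal([num_classes, dim], stddev=0.1))
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.random.normal([batch, dim])  # e.g. concatenated word + doc vectors
labels = tf.random.uniform([batch, 1], maxval=num_classes, dtype=tf.int64)

# Sampled softmax: approximates the full softmax cross-entropy.
softmax_approx = tf.nn.sampled_softmax_loss(weights, biases, labels, inputs,
                                            num_sampled, num_classes)

# NCE: a different objective, conceptually closer to word2vec-style negative sampling.
nce_approx = tf.nn.nce_loss(weights, biases, labels, inputs,
                            num_sampled, num_classes)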
So for the OP's listed questions:
This implementation of doc2vec in tensorflow is working and correct in its own way, but it is different from both the gensim implementation and the paper.
window is the one-side size, as said above. If the document size is smaller than the context size, then the smaller one is used.
There are many reasons why the gensim implementation is faster. First, gensim is heavily optimized; all operations are faster than naive Python operations, especially the data I/O. Second, some preprocessing steps, such as min_count filtering in gensim, reduce the dataset size. More importantly, gensim uses negative sampling, which is much faster than sampled_softmax_loss; I guess this is the main reason.
Is it easier to find something when there are many of them? Just kidding ;-)
It's true that there are many solutions to this non-convex optimization problem, so the model will just find a local optimum. Interestingly, in neural networks most local optima are "good enough". It has been observed that stochastic gradient descent seems to find better local optima than large-batch gradient descent, although this is still a riddle in current research.
