CNN on small dataset is overfitting - python

I want to classify patterns in images. My original images are 200 000*200 000 pixels; I resize them to 96*96, and the patterns are still recognizable to the human eye. Pixel values are 0 or 1.
I'm using the following neural network.
train_X, test_X, train_Y, test_Y = train_test_split(cnn_mat, img_bin["Classification"], test_size = 0.2, random_state = 0)
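class_weights = class_weight.compute_class_weight(class_weight='balanced',
classes=np.unique(train_Y),
y=train_Y)
# Keras expects class_weight as a dict mapping each class label to its weight
class_weights = dict(zip(np.unique(train_Y), class_weights))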
train_Y_one_hot = to_categorical(train_Y)
test_Y_one_hot = to_categorical(test_Y)
train_X,valid_X,train_label,valid_label = train_test_split(train_X, train_Y_one_hot, test_size=0.2, random_state=13)
model = Sequential()
model.add(Conv2D(24,kernel_size=3,padding='same',activation='relu',
input_shape=(96,96,1)))
model.add(MaxPool2D())
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
train = model.fit(train_X, train_label, batch_size=80,epochs=20,verbose=1,validation_data=(valid_X, valid_label),class_weight=class_weights)
I have already run some experiments to find a "good" number of hidden layers and fully connected layers. It's probably not the optimal architecture: since my computer is slow, I ran each model only once and selected the best one using the confusion matrix, without cross-validation. I didn't try more complex architectures because my dataset is small and I have read that small architectures work best in that case. Is it worth trying a more complex architecture?
Here are the results with 5 and 12 epochs, batch size 80. These are the confusion matrices for my test set.
As you can see, it looks like I'm overfitting. When I only run 5 epochs, most samples are assigned to class 0. With more epochs, class 0 is less dominant, but the classification is still bad.
I added dropout with a rate of 0.8 after each convolutional layer, e.g.:
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
With dropout, 95% of my images are classified as class 0.
I tried image augmentation: I rotated all my training images and still used class weights, but the results didn't improve. Should I try to augment only the classes with few images? Most of what I have read says to augment the whole dataset...
To summarize, my questions are:
Should I try a more complex model?
Is it useful to do image augmentation only on underrepresented classes? If so, should I still use class weights (I guess not)?
Given the size of my dataset, is there any hope of finding a "good" model with a CNN?

Given the imbalanced data, I think it is better to create a custom data generator for your model so that each generated batch contains at least one sample from each class. It is also better to put the Dropout layers after the dense layers rather than after the conv layers. For data augmentation, use at least a combination of rotation, horizontal flip and vertical flip. There are other approaches to data augmentation, such as using a GAN or random pixel replacement.
For GANs, you can check this SO post.
For using a GAN as a data augmenter, you can read this article.
For a combination of pixel-level augmentation and GAN, see: pixel level data augmentation
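The idea of a generator whose batches always contain every class can be sketched roughly as follows. This is only an illustration in plain Python: the function name balanced_batches, the toy label list and the fill-up strategy (uniform sampling with replacement) are assumptions, not something given in the question. The generator yields index lists, which could then be used to slice the image and label arrays.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size, seed=0):
    """Yield lists of sample indices; every batch holds at least one index per class.

    Hypothetical helper for illustration. It samples with replacement, so an
    index may appear more than once in a batch, and it never stops on its own.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    if batch_size < len(classes):
        raise ValueError("batch_size must be at least the number of classes")
    all_indices = list(range(len(labels)))
    while True:
        # One guaranteed sample per class ...
        batch = [rng.choice(by_class[c]) for c in classes]
        # ... then fill the rest of the batch uniformly at random.
        batch += rng.choices(all_indices, k=batch_size - len(classes))
        rng.shuffle(batch)
        yield batch


# Toy, made-up label distribution: class 0 dominates.
labels = [0] * 90 + [1] * 7 + [2] * 3
gen = balanced_batches(labels, batch_size=8)
first_batch = next(gen)
print(sorted(labels[i] for i in first_batch))
```

Because rare classes are forced into every batch, they are effectively oversampled, so combining this with class weights would probably over-correct; that trade-off would need to be checked on the real data.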

What I used - in a different setting - was to upsample my data with ADASYN. This algorithm calculates how much new data is required to balance your classes and then synthesizes new examples from the available data.
There is an implementation for Python. Apart from that, you also have very little data. SVMs perform well even with little data. You might want to try them or other image classification algorithms, depending on whether the expected pattern is always at the same position or varies. You could also try the Viola–Jones object detection framework.

Related

What can I do to help make my TensorFlow network overfit a large dataset?

The reason I am trying to overfit specifically is that I am following the steps for designing a network from "Deep Learning with Python" by François Chollet. This is important as this is for the final project of my degree.
At this stage, I need to make a network large enough to overfit my data in order to determine a maximal capacity, an upper bound for the size of the networks that I will optimise.
However, as the title suggests, I am struggling to make my network overfit. Perhaps my approach is naïve, but let me explain my model:
I am using this dataset to train a model to classify stars. There are two classes that a star must be classified by (into both of them): its spectral class (100 classes) and its luminosity class (10 classes).
For example, our sun is a 'G2V': its spectral class is 'G2' and its luminosity class is 'V'.
To this end, I have built a double-headed network. It takes this input data:
DataFrame containing input data
It then splits into two parallel networks.
# Create our input layer:
input = keras.Input(shape=(3,), name='observation_data')  # shape must be a tuple: (3,), not (3)
# Build our spectral class
s_class_branch = layers.Dense(100000, activation='relu', name = 's_class_branch_dense_1')(input)
s_class_branch = layers.Dense(500, activation='relu', name = 's_class_branch_dense_2')(s_class_branch)
# Spectral class prediction
s_class_prediction = layers.Dense(100,
activation='softmax',
name='s_class_prediction')(s_class_branch)
# Build our luminosity class
l_class_branch = layers.Dense(100000, activation='relu', name = 'l_class_branch_dense_1')(input)
l_class_branch = layers.Dense(500, activation='relu', name = 'l_class_branch_dense_2')(l_class_branch)
# Luminosity class prediction
l_class_prediction = layers.Dense(10,
activation='softmax',
name='l_class_prediction')(l_class_branch)
# Now we instantiate our model using the layer setup above
scaled_model = Model(input, [s_class_prediction, l_class_prediction])
optimizer = keras.optimizers.RMSprop(learning_rate=0.004)
scaled_model.compile(optimizer=optimizer,
loss={'s_class_prediction':'categorical_crossentropy',
'l_class_prediction':'categorical_crossentropy'},
metrics=['accuracy'])
logdir = os.path.join("logs", "2raw100k")
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
scaled_model.fit(
input_data,{
's_class_prediction':spectral_targets,
'l_class_prediction':luminosity_targets
},
epochs=20,
batch_size=1000,
validation_split=0.0,
callbacks=[tensorboard_callback])
In the code above you can see me attempting a model with two hidden layers in both branches, one layer with a shape of 100 000, following into another layer with 500, before going to the output layer. The training targets are one-hot encoded, so there is one node for every class.
I have tried a wide range of sizes with one to four hidden layers, ranging from a shape of 500 to 100 000, only stopping because I ran out of RAM. I have only used dense layers, with the exception of trying a normalisation layer, to no effect.
Graph of losses
They will all happily train and slowly lower the loss, but they never seem to overfit. I have run networks out to 100 epochs and they still will not overfit.
What can I do to make my network fit the data better? I am fairly new to machine learning, having only been doing this for a year now, so I am sure there is something that I am missing. I really appreciate any help and would be happy to provide the logs shown in the graph.
After a lot more training I think I have this answered. Basically, the network did not have adequate capacity and needed more layers. I had tried more layers earlier but because I was not comparing it to validation data the overfitting was not apparent!
The proof is in the pudding:
So thank you to @Aryagm for their comment, because that let me work it out. As you can see, the validation curves (grey and blue) clearly show the overfitting, while the training curves (green and orange) do not.
If anything, this goes to show why a separate validation set is so important and I am a fool for not having used it in the first place! Lesson learned.
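To make the point about validation data concrete, here is a small sketch of how one might locate the onset of overfitting from a list of per-epoch validation losses. The function name overfit_start, the patience value and the loss numbers are all made up for illustration; they are not taken from the logs above. The logic is the same idea that early stopping relies on: remember the best validation loss and flag the run once it has not improved for a few epochs.

```python
def overfit_start(val_loss, patience=2):
    """Return the epoch with the best validation loss once the loss has
    failed to improve for `patience` consecutive epochs, else None.

    Hypothetical helper; `val_loss` is a list with one value per epoch.
    """
    best_epoch = 0
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] < val_loss[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Made-up numbers: validation loss falls, then rises again.
toy_val_loss = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1]
best = overfit_start(toy_val_loss)
print(best)
```

With validation_split=0.0, as in the fit call in the question, no such list exists at all, which is why the overfitting could not be seen.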

How to reduce model file size in ".h5"

I'm using TensorFlow and Keras version 2.8.0.
I have the following network:
#defining model
model=Sequential()
#adding convolution layer
model.add(Conv2D(256,(3,3),activation='relu',input_shape=(256,256,3)))
#adding pooling layer
model.add(MaxPool2D(2,2))
#adding fully connected layer
model.add(Flatten())
model.add(Dense(100,activation='relu'))
#adding output layer
model.add(Dense(len(classes),activation='softmax'))
#compiling the model
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
#fitting the model
model.fit(x_tr,y_tr,epochs=epochs, )
# At the 12th epoch, it converges to 1
# batch size is 125, I think; I don't know why
#evaluating the model
loss_value, accuracy = model.evaluate(x_te, y_te)
#loss_value, accuracy, top_k_accuracy = model.evaluate(x_te, y_te, batch_size=batch_size)
print("loss_value: " + str(loss_value))
print("accuracy: " + str(accuracy))
#predict on the test set
ypred = model.predict(x_te)
The point is that I'm now trying to save the model in ".h5" format, but whether I train it for 100 epochs or for 1 epoch, I get a 4.61 GB model file.
Why is this file so big?
How can I reduce the model size?
General reason: The size of your h5 file is determined mainly by the number of parameters your model has (plus the optimizer state, if it is saved together with the model).
After constructing the model, add the line model.summary() and look at the total number of parameters the model has.
Steps to reduce model size: You have a LOT of filters in your conv layer. Since I don't know what you want to achieve with your model, I would still advise you to split the filters across several conv layers and add pooling layers in between. These will scale down the image and will especially reduce the number of parameters after the Flatten layer.
More information on Pooling layers can be found here.
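As a rough check of where the size comes from, the parameter count of the network in the question can be worked out by hand. The sketch below is plain arithmetic, with several assumptions stated in the comments: the number of classes (len(classes) is not given, so 10 is a placeholder), 4-byte float32 weights, and two extra per-weight arrays kept by Adam and saved alongside the weights.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # weights plus one bias per filter
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    # weight matrix plus one bias per unit
    return inputs * units + units


n_classes = 10  # ASSUMPTION: len(classes) is not stated in the question

conv = conv2d_params(3, 3, 3, 256)

# 256x256 input, 3x3 'valid' convolution -> 254x254, then 2x2 pooling -> 127x127
side = (256 - 3 + 1) // 2
flat = side * side * 256            # units coming out of Flatten

hidden = dense_params(flat, 100)    # Dense(100) right after Flatten
out = dense_params(100, n_classes)
total = conv + hidden + out

bytes_per_weight = 4                # ASSUMPTION: float32
weights_gib = total * bytes_per_weight / 2**30
# ASSUMPTION: Adam stores two extra arrays per weight, saved with the model
with_adam_gib = 3 * weights_gib

print(f"Flatten units:       {flat:,}")
print(f"Dense(100) params:   {hidden:,}")
print(f"Total params:        {total:,}")
print(f"Weights only:        {weights_gib:.2f} GiB")
print(f"Weights + Adam state: {with_adam_gib:.2f} GiB")
```

Under these assumptions almost all parameters sit in the Dense(100) layer that follows Flatten, and the estimate with optimizer state lands close to the reported 4.61 GB. That suggests two levers: shrinking the tensor before Flatten (more pooling, fewer filters), and saving without the optimizer state, for example with model.save(path, include_optimizer=False), if training does not need to be resumed.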
What I found, after 5 months of experience, is that the steps to reduce the model size, improve the accuracy and reduce the loss are the following:
Categorize the labels and then change the loss function of the model accordingly.
Normalize the data to values in the range [-1,1].
Making the dense layers larger increases the number of parameters and therefore the size of the model, and sometimes does not even help: having more parameters does not mean having higher accuracy. To find a solution you have to make several attempts, changing the network and trying different activation functions and optimizers such as SGD or Adam.
Choose good values for learning_rate, decay_rate, decay_values and so on. These parameters can make the result better or worse.
Use batch_size = 32 or 64.
Use a function that loads the dataset step by step rather than all at once into RAM, because loading everything makes the process slower and is not even needed. If you are using Keras, you can use tf.data.Dataset.from_tensor_slices((x, y)).batch(32, drop_remainder=True); of course this should be done for the train, test and validation sets.
Hope that helps.

How can I call model.fit() for different data sets in a for loop?

I am developing a convolutional neural network in Python. I create a Sequential model and want to fit it on different data sets, so I call fit in a for loop. But fitting on a single data set and fitting inside the for loop give different results. How can I reset the model parameters?
My code is below:
def fit_model_cnn(model, train_x_df, train_x_df_reshaped, train_y_df, validation_data_x_df_reshaped,
                  validation_data_y_df, timesteps, epoch_size, batch_size):
    # integer division: input_shape must contain ints, not floats
    model.add(
        Conv1D(filters=filter_size, kernel_size=kernel_size, activation=activation_func, padding='same',
               input_shape=(timesteps, train_x_df.shape[1] // timesteps)))
    model.add(MaxPooling1D(pool_size=1))
    model.add(Flatten())
    model.add(Dense(node_count, activation=activation_func, kernel_initializer='he_uniform'))
    model.add(Dense(1))
    model.compile(optimizer=optimizer_type, loss='mse')
    # fit model
    history = model.fit(train_x_df_reshaped, train_y_df.values,
                        validation_data=(validation_data_x_df_reshaped, validation_data_y_df),
                        batch_size=batch_size, epochs=epoch_size)
    return history

for tr_ind in range(len(train_set_month_list)):
    test_dataset_month_info = test_set_month_list[tr_ind]
    train_dataset_month_info = train_set_month_list[tr_ind]
    # a new, empty model is created for every data set
    model = Sequential()
    history = fit_model_cnn(model, train_x_df, train_x_df_reshaped, train_y_df, validation_data_x_df_reshaped,
                            validation_data_y_df, timesteps, epoch_size, batch_size)
I think you should clear the previous model inside the loop; you can do that with keras.backend.clear_session(). From https://keras.io/backend/:
This should solve your problem.
From a very simplistic point of view, the data is fed in sequentially, which suggests that at the very least, it's possible for the data order to have an effect on the output. If the order doesn't matter, randomization certainly won't hurt. If the order does matter, randomization will help to smooth out those random effects so that they don't become systematic bias. In short, randomization is cheap and never hurts, and will often minimize data-ordering effects.
In other words, when you feed your neural network with different datasets, your model can get biased towards the latest dataset it was trained on.
You should always make sure that you are randomly sampling from all the datasets you have.

Could not increase accuracy from a fixed threshold using Keras Dense layer ANN

I'm learning the simplest neural networks, built from Dense layers, with Keras. I'm trying to implement face recognition on a relatively small dataset (~250 images in total, with 50 images per class).
I've downloaded the images from google images and resized them to 100 * 100 png files. Then I've read those files into a numpy array and also created a one hot label array for training my model.
Here is my code for processing the training data:
X, Y = [], []
feature_map = {
'Alia Bhatt': 0,
'Dipika Padukon': 1,
'Shahrukh khan': 2,
'amitabh bachchan': 3,
'ayushmann khurrana': 4
}
for each_dir in os.listdir('.'):
    if os.path.isdir(each_dir):
        for each_file in os.listdir(each_dir):
            X.append(cv2.imread(os.path.join(each_dir, each_file), -1).reshape(1, -1))
            Y.append(feature_map[os.path.basename(each_file).split('-')[0]])
X = np.squeeze(X)
X = X / 255.0 # normalize the training data
Y = np.array(Y)
Y = np.eye(5)[Y]
print (X.shape)
print (Y.shape)
This is printing (244, 40000) and (244, 5). Here is my model:
model = Sequential()
model.add(Dense(8000, input_dim = 40000, activation = 'relu'))
model.add(Dense(1200, activation = 'relu'))
model.add(Dense(700, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dense(5, activation = 'softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, epochs=25, batch_size=15)
When I train the model, it gets stuck at an accuracy of 0.2172, which is almost the same as random predictions (0.20).
I've also tried to train the model with grayscale images, but I'm still not getting the expected accuracy. I also tried different network architectures by changing the number of hidden layers and the number of neurons in them.
What am I missing here? Is my dataset too small? or am I missing any other technical detail?
For more details of code, here is my notebook: https://colab.research.google.com/drive/1hSVirKYO5NFH3VWtXfr1h6y0sxHjI5Ey
Two suggestions I can make:
Your data set is probably too small. If you are splitting training and validation at 80/20, that means you are only training on 200 images, which is probably too small. Try increasing your data set to see if results improve.
I would recommend adding Dropout to each layer of your network as your training set is so small. Your network is most likely over-fitting your training data set since it is so small, and Dropout is an easy way to help avoid this problem.
Let me know if these suggestions make a difference!
I agree that the dataset is too small, 50 instances of each person is probably not enough. You can use data augmentation with the keras ImageDataGenerator method to increase the number of images, and rewrite your numpy reshaping code as a pre-processing function for the generator. I also noticed that you haven't shuffled the data, so the network is likely predicting the first class for everything (which is maybe why the accuracy is near random chance).
If increasing the dataset size doesn't help, you'll probably have to play around with the learning rate for the Adam optimizer.
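The shuffling mentioned above has to keep every image paired with its label. A minimal sketch of that, in plain Python with made-up file names and labels (the helper name shuffle_in_unison is also hypothetical), could look like this:

```python
import random


def shuffle_in_unison(samples, labels, seed=0):
    """Return shuffled copies of samples and labels using one shared order."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    return [samples[i] for i in order], [labels[i] for i in order]


# Toy data: files are read directory by directory, so labels arrive sorted.
samples = [f"img_{i}" for i in range(10)]
labels = [i // 2 for i in range(10)]

shuffled_samples, shuffled_labels = shuffle_in_unison(samples, labels)
print(list(zip(shuffled_samples, shuffled_labels)))
```

With NumPy arrays such as X and Y from the question, the same idea is usually written as a single permutation index applied to both arrays.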

Acc decreasing to zero in LSTM Keras Training

While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly, this can also be seen in tensorboard:
Training accuracy:
This is my model:
model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8,return_sequences=True))
model1.add(LSTM(8,return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the data input and output shape, since the model itself seems to be fine. The Data input has (2000,40,2) shape and the output has (2000,1) shape.
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
The TimeDistributed applies the same Dense layer (same weights) to the LSTMs outputs for one time step at a time.
I recommend this tutorial as well https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/ .
I was able to increase the accuracy to 97% with a few adjustments that were data related. The main obstacle was an unbalanced dataset split for the training and validation set. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.
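One way to avoid an unbalanced training/validation split is to split each class separately. Keras' validation_split takes the last fraction of the samples as they are ordered, so data sorted by class can end up with a lopsided validation set. The sketch below is a hypothetical plain-Python helper; the label counts are invented and are not the real trajectory data.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.25, seed=0):
    """Split indices so each class contributes the same fraction to validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    # Mix the classes again so batches are not ordered by class.
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx


# Made-up binary labels, sorted by class as they might come out of a loader.
labels = [0] * 1500 + [1] * 500
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), sum(labels[i] for i in val_idx))
```

The resulting index lists can be used to slice the inputs and targets, which are then passed to fit via validation_data instead of validation_split.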
