Keras - text classification, overfitting, and how to improve my model?

I am developing a text classification neural network based on these two articles:
https://github.com/jiegzhan/multi-class-text-classification-cnn-rnn
https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
For training I am using text data in Russian (the language essentially doesn't matter, because the text contains a lot of specialized professional terms, so sadly using an existing word2vec model is not an option).
The training data has the following parameters:
Maximum length of an article - 969 words
Size of vocabulary - 53886
Number of labels - 12 (sadly they are distributed quite unevenly; for instance, the first label has around 5000 examples, while the second has only 1500).
Size of the training data set - only 9876 entries. This is the biggest problem, because sadly I can't increase the size of the training set by any means (the only way out is to wait another year ☻, but even that will only double the size of the training data, and even double the amount is not enough).
Here is my code:
x, x_test, y, y_test = train_test_split(x_, y_, test_size=0.1)
x_train, x_dev, y_train, y_dev = train_test_split(x, y, test_size=0.1)
embedding_vecor_length = 100
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=7, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=9, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=12, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=15, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(LSTM(200,dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(labels_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, epochs=25, batch_size=30)
scores = model.evaluate(x_, y_)
I tried different parameters and it gets really high accuracy in training (up to 98%).
But it performs really badly on the test set. The maximum I managed to achieve was around 74%; the usual result is around 64%.
The best result was achieved with a small embedding_vecor_length and a small batch_size.
I know that my test set is only 10 percent of the training set, and that the overall size of the data set is the biggest problem, but I want to find a way around it.
So my questions are:
1) Is this a correctly built model for text classification? (It works.)
Do I need to use simultaneous convolutions and merge the results instead?
I just don't get how the text information doesn't get lost in the process of convolution with different filter sizes (as in my example).
Can you explain how convolution works with text data?
There are mainly articles about image recognition.
2) I obviously have a problem with overfitting. How can I make the performance better?
I have already added Dropout layers. What can I do next?
3) Maybe I need something different? I mean a pure RNN without convolution?
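As a side calculation on the model above, the sequence length that actually reaches the LSTM can be worked out by hand. The sketch below assumes max_review_length equals the maximum article length of 969 (the question does not state this), that Conv1D with padding='same' and stride 1 leaves the length unchanged, and that MaxPooling1D(pool_size=2) uses its default stride and 'valid' padding, which halves the length and rounds down. The function name is hypothetical.

```python
def length_after_pooling(length, n_blocks, pool_size=2):
    """Sequence length after each Conv1D(padding='same') + MaxPooling1D block."""
    lengths = [length]
    for _ in range(n_blocks):
        # The 'same'-padded convolution keeps the length; pooling floors the division.
        length = length // pool_size
        lengths.append(length)
    return lengths


# Seven conv/pool blocks, as in the model above; 969 is an assumed input length.
print(length_after_pooling(969, 7))
# [969, 484, 242, 121, 60, 30, 15, 7]
```

Under these assumptions the LSTM would see only 7 time steps, each summarizing roughly 128 input positions.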

Related

CNN model did not learn anything from the training data. Where are the mistakes I made?

The shape of the train/test data is (samples, 256, 256, 1). The training dataset has around 1400 samples, the validation dataset has 150 samples, and the test dataset has 250 samples. I built a CNN model for a six-object classification task. However, no matter how much I tune the parameters or add/remove layers (conv and dense), I get chance-level accuracy all the time (around 16.5%). Thus, I would like to know whether I made some fatal mistakes while building the model, or whether there is something wrong with the data itself rather than the CNN model.
Code:
def build_cnn_model(input_shape, activation='relu'):
    model = Sequential()
    # 3 convolution layers with max pooling
    model.add(Conv2D(64, (5, 5), activation=activation, padding = 'same', input_shape=input_shape))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (5, 5), activation=activation, padding = 'same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(256, (5, 5), activation=activation, padding = 'same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    # 3 fully connected layers
    model.add(Dense(1024, activation = activation))
    model.add(Dropout(0.5))
    model.add(Dense(512, activation = activation))
    model.add(Dropout(0.5))
    model.add(Dense(6, activation = 'softmax')) # 6 classes
    # summarize the model
    print(model.summary())
    return model

def compile_and_fit_model(model, X_train, y_train, X_vali, y_vali, batch_size, n_epochs, LR=0.01):
    # compile the model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy'])
    # fit the model
    history = model.fit(x=X_train,
                        y=y_train,
                        batch_size=batch_size,
                        epochs=n_epochs,
                        verbose=1,
                        validation_data=(X_vali, y_vali))
    return model, history
I transformed the MEG data my professor recorded into magnitude scalograms using CWT, via pywt.cwt(data, scales, wavelet). If I plot the coefficients I got from cwt, I get a graph like this (I merged 62 channels into one graph).
I used the coefficients as train/test data for the CNN model. However, when I tuned the parameters and tried adding/removing layers in the CNN model, the classification accuracy was unchanged. Thus, I want to know where I made mistakes. Did I make mistakes in building the CNN model, or in the CWT step (the way I handled the data)?
Please give me some advice, thank you.

How is the accuracy of the training data? If you have a small dataset and the model does not overfit after training for a while, then something is wrong with the model. You can also test with existing datasets, which the model should be able to handle (like Fashion MNIST).
Testing if you handled the data correctly is harder. Did you write unit tests for the different steps in the preprocessing pipeline?
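A minimal sketch of the kind of check such a unit test could run on the arrays right before they are passed to model.fit. The function name and the specific checks are illustrative assumptions, not something from the question; they cover only problems that commonly leave a classifier stuck at chance level (mismatched sample counts, NaN/inf values, constant input, labels outside the range expected by sparse_categorical_crossentropy).

```python
import numpy as np


def check_dataset(X, y, n_classes):
    """Return a list of human-readable problems found in (X, y)."""
    problems = []
    X = np.asarray(X)
    y = np.asarray(y)
    if X.shape[0] != y.shape[0]:
        problems.append("sample count mismatch")
    if not np.all(np.isfinite(X)):
        problems.append("X contains NaN or inf")
    if np.std(X) == 0:
        problems.append("X is constant")
    if not np.issubdtype(y.dtype, np.integer):
        problems.append("labels are not integers")
    elif y.min() < 0 or y.max() >= n_classes:
        problems.append("labels outside 0..n_classes-1")
    return problems


# Small random stand-in for the (samples, 256, 256, 1) scalogram data.
X_demo = np.random.default_rng(0).normal(size=(12, 4, 4, 1))
y_demo = np.arange(12) % 6
print(check_dataset(X_demo, y_demo, n_classes=6))      # no problems expected
print(check_dataset(X_demo, y_demo + 1, n_classes=6))  # labels 1..6 are out of range
```

An empty list does not prove the preprocessing is right, but a non-empty one points at a concrete step to inspect.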

Is passing activity_regularizer as argument to Conv2D() the same as passing it separately right after Conv2D()? (Tensorflow)

I was wondering whether creating the model by passing activity_regularizer='l1_l2' as an argument to Conv2D()
model = keras.Sequential()
model.add(Conv2D(filters=16, kernel_size=(6, 6), strides=3, padding='valid', activation='relu',
activity_regularizer='l1_l2', input_shape=X_train[0].shape))
model.add(Dropout(0.2))
model.add(MaxPooling2D(pool_size=(3, 1), strides=3, padding='valid'))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(optimizer=Adam(learning_rate = 0.001), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs = 10, validation_data = (X_val, y_val), verbose=0)
will make a mathematical difference compared to creating the model by adding model.add(ActivityRegularization(l1=..., l2=...)) separately?
model = keras.Sequential()
model.add(Conv2D(filters=16, kernel_size=(6, 6), strides=3, padding='valid', activation='relu',
input_shape=X_train[0].shape))
model.add(Dropout(0.2))
model.add(ActivityRegularization(l1=some_number, l2=some_number))
model.add(MaxPooling2D(pool_size=(3, 1), strides=3, padding='valid'))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(optimizer=Adam(learning_rate = 0.001), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs = 10, validation_data = (X_val, y_val), verbose=0)
For me, it is hard to tell, as training always involves some randomness. But the results seem similar.
One additional question I have is: I accidentally passed the activity_regularizer='l1_l2' argument to the MaxPooling2D() layer before, and the code ran. How can that be, considering that activity_regularizer is not given as a possible argument for MaxPooling2D() in tensorflow?

Technically, if you are not applying any other constraint on the layer output, applying the activity regularizer inside the convolution layer is the same as applying it outside, right after the layer. However, applying it outside the convolution layer gives the user more flexibility. For instance, the user might want to regularize the output units after the skip connections are set up instead of after the convolution. It is just like having an activation function inside the convolution layer versus using keras.activations to apply the activation after the convolution layer. Sometimes this is done after batch normalization.
For your second question, the MaxPool2D layer accepts the activity_regularizer argument. Even though this is not mentioned in its documentation, it makes sense intuitively, since the user might want to regularize the outputs after max-pooling. You can check that activity_regularizer works not only with the MaxPool2D layer but also with other layers, such as the BatchNormalization layer, for the same reason.

tf.keras.layers.Conv2D(64 , 2 , padding='same', activity_regularizer='l1_l2')
and this code,
tf.keras.layers.Conv2D(64 , 2 , padding='same')
tf.keras.layers.ActivityRegularization(l1=0.01, l2=0.01)
Both do the same job, provided the regularization factors match (ActivityRegularization defaults to l1=0.0 and l2=0.0, so they are set explicitly here to the defaults the Keras documentation gives for 'l1_l2'); applying the regularizer inside or outside the layer has the same impact. Moreover, TensorFlow builds a graph on the backend that first applies the convolution layer and then the activity regularization, so in both cases the computation is done the same way.
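To make the comparison concrete, here is a small plain-Python sketch of the L1 + L2 activity penalty. It is not Keras code: the helper name and the activation values are made up, and the factors 0.01/0.01 are the defaults the Keras documentation gives for the 'l1_l2' string identifier (treated here as an assumption). The point is that the penalty depends only on the tensor being regularized and on the two factors, so the two model definitions agree only when the factors agree.

```python
def l1_l2_penalty(activations, l1, l2):
    """L1 + L2 penalty on a flat list of activation values."""
    return l1 * sum(abs(a) for a in activations) + l2 * sum(a * a for a in activations)


conv_output = [1.0, 2.0, 0.5]  # made-up ReLU outputs, for illustration only

# Factors assumed for activity_regularizer='l1_l2'.
with_string_defaults = l1_l2_penalty(conv_output, l1=0.01, l2=0.01)
# A separate regularization layer whose factors are left at 0.0 adds nothing.
with_zero_factors = l1_l2_penalty(conv_output, l1=0.0, l2=0.0)

print(with_string_defaults)  # 0.01 * 3.5 + 0.01 * 5.25
print(with_zero_factors)
```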

LSTM for 30 classes, badly overfitting, cannot go over 76% test accuracy

How to classify job descriptions into their respective industries?
I'm trying to classify text using LSTM, in particular converting job descriptions into industry categories. Unfortunately, the things I've tried so far have only resulted in 76% accuracy.
What is an effective method to classify text for more than 30 classes using LSTM?
I have tried three alternatives
Model_1
Model_1 achieves test accuracy of 65%
embedding_dimension = 80
max_sequence_length = 3000
epochs = 50
batch_size = 100
model = Sequential()
model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(output_dim, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Model_2
Model_2 achieves test accuracy of 64%
model = Sequential()
model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
model.add(LSTM(100))
model.add(Dropout(rate=0.5))
model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(rate=0.5))
model.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
model.add(Dropout(rate=0.5))
model.add(Dense(output_dim, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
Model_3
Model_3 achieves test accuracy of 76%
model.add(Embedding(max_words, embedding_dimension, input_length= x_shape, trainable=False))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(100, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(128, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.039, seed=None)))
model.add(BatchNormalization())
model.add(Dense(64, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.55, seed=None)) )
model.add(BatchNormalization())
model.add(Dense(32, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.55, seed=None)) )
model.add(BatchNormalization())
model.add(Dense(output_dim, activation='softmax'))
model.compile(optimizer= "adam" , loss='categorical_crossentropy', metrics=['acc'])
I'd like to know how to improve the accuracy of the network.

Start with a minimal base line
You have a simple network at the top of your code, but try this one as your baseline
model = Sequential()
model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
model.add(LSTM(output_dim//4))
model.add(Dense(output_dim, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The intuition here is to see how much work the LSTM can do. We don't need it to output the full 30 output_dims (the number of classes), but instead a smaller set of features on which to base the class decision.
Your larger networks have layers like Dense(128) with 100 inputs. That's 100x128 = 12,800 connections to learn.
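The same kind of count can be done for the other layers. The sketch below uses the standard parameter formulas for Dense and LSTM layers (including biases, so the Dense figure is slightly above 12,800). The input size of 80 is embedding_dimension from Model_1, and output_dim = 30 is an assumption based on the "30 classes" in the question; the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def lstm_params(n_inputs, n_units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_units * (n_inputs + n_units) + n_units)


output_dim = 30  # assumed number of classes

print(dense_params(100, 128))             # Dense(128) fed by LSTM(100): 12928
print(lstm_params(80, 100))               # LSTM(100) on 80-dim embeddings: 72400
print(lstm_params(80, output_dim // 4))   # baseline LSTM(7): 2464
```

Under these assumptions the baseline LSTM has roughly 30 times fewer recurrent parameters than LSTM(100).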
Improving imbalance right away
Your data may have a lot of imbalance so for the next step, let's address that with a loss function called the top_k_loss. This loss function will make your network only train on the training examples that it is having the most trouble on. This does a great job of handling class imbalance without any other plumbing
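def top_k_loss(k=16):
    @tf.function
    def loss(y_true, y_pred):
        # per-example cross-entropy, shape (batch_size,)
        y_error_of_true = tf.keras.losses.categorical_crossentropy(y_true=y_true, y_pred=y_pred)
        # keep only the k largest losses; tf.shape also works when the batch size is unknown at trace time
        topk, indexs = tf.math.top_k(y_error_of_true, k=tf.minimum(k, tf.shape(y_true)[0]))
        return topk
    return loss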
Use this with a batch size of 128 to 512. You add it to your model compile like so
model.compile(loss=top_k_loss(16), optimizer='adam', metrics=['accuracy'])
Now, you'll see that using model.fit on this will return some disappointing numbers. That's because it is only reporting THE WORST 16 out of each training batch. Recompile with your regular loss and run model.evaluate to find out how it does on the training set and again on the test set.
Train for 100 epochs, and at this point you should already see some good results.
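To see what the top-k selection does to a batch, here is a plain-Python sketch with made-up loss values; it mirrors the idea of the loss above but is not the TensorFlow implementation, and the helper names are hypothetical. It also shows why the loss reported during training looks worse than the ordinary mean loss.

```python
import math


def per_example_cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy for each (one-hot target, predicted distribution) pair."""
    return [
        -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(y_true, y_pred)
    ]


def top_k_losses(losses, k):
    """Keep the k largest per-example losses (or all of them if the batch is smaller)."""
    return sorted(losses, reverse=True)[:min(k, len(losses))]


batch_losses = [0.1, 2.3, 0.05, 1.2]       # made-up per-example losses
hardest = top_k_losses(batch_losses, k=2)  # only these examples drive the update

print(hardest)                                 # [2.3, 1.2]
print(sum(hardest) / len(hardest))             # mean over the hardest examples
print(sum(batch_losses) / len(batch_losses))   # ordinary mean over the batch
```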
Next Steps
Turn the whole model construction and testing into a function like so
def run_experiment(lstm_layers=1, lstm_size=output_dim//4, dense_layers=0, dense_size=output_dim//4):
    model = Sequential()
    model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
    # all but the last LSTM layer must return the full sequence
    for i in range(lstm_layers-1):
        model.add(LSTM(lstm_size, return_sequences=True))
    model.add(LSTM(lstm_size))
    for i in range(dense_layers):
        model.add(Dense(dense_size, activation='tanh'))
    model.add(Dense(output_dim, activation='softmax'))
    # train with the top-k loss
    model.compile(loss=top_k_loss(16), optimizer='adam', metrics=['accuracy'])
    model.fit(x=x,y=y,epochs=100)
    # recompile with the regular loss so evaluate reports comparable numbers
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    loss, accuracy = model.evaluate(x=x_test, y=y_test)
    return loss
that can run a whole experiment for you. Now it is a matter of finding a better architecture by searching. One way to search is random. Random is actually really good. If you want to get fancy, I recommend hyperopt. Don't bother with grid search, random usually beats it for large search spaces.
from random import randint

best_loss = 10**10
best_config = []
for trial in range(100):
    config = [
        randint(1,4), # lstm layers
        randint(8,64), # lstm_size
        randint(0,8), # dense_layers
        randint(8,64) # dense_size
    ]
    result = run_experiment(*config)
    if result < best_loss:
        best_loss = result  # remember the best loss so later trials are compared against it
        best_config = config
        print('Found a better loss ',result,' from config ',config)

CNN (VGG-16) strange behaviour on validation accuracy

I have built and tested two convolutional Neural Network models (VGG-16 and 3-layer CNN) to predict classification of lung CT scans for COVID-19.
Prior to the classification, I've performed image segmentation via k-means clustering on images to try to improve the classification performance.
The segmented images look like below.
I then trained and evaluated the VGG-16 model on the segmented images and on the raw images separately. Lastly, I trained and evaluated a 3-layer CNN on the segmented images only. Below are the results for their train/validation loss and accuracy.
For the simple 3-layer CNN model, I can clearly see that the model trains well and starts to overfit after about 2 epochs. But I don't understand why the validation accuracy of the VGG model doesn't look like an exponential curve; instead it looks like a horizontal straight line or a fluctuating horizontal line.
Besides, the simple 3-layer CNN model seems to perform better. Is this due to vanishing gradients in the VGG model? Or are the images so simple that a deep architecture doesn't benefit?
I'd appreciate if you could share your knowledge on such learning behaviour of the models.
This is the code for the VGG-16 model:
# build model
img_height = 256
img_width = 256
model = Sequential()
model.add(Conv2D(input_shape=(img_height,img_width,1),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Flatten())
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=4096,activation="relu"))
model.add(Dense(units=1, activation="sigmoid"))
opt = Adam(lr=0.001)
model.compile(optimizer=opt, loss=keras.losses.binary_crossentropy, metrics=['accuracy'])
And this is the code for the 3-layer CNN.
# build model
model2 = Sequential()
model2.add(Conv2D(32, 3, padding='same', activation='relu',input_shape=(img_height, img_width, 1)))
model2.add(MaxPool2D())
model2.add(Conv2D(64, 5, padding='same', activation='relu'))
model2.add(MaxPool2D())
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
opt = Adam(lr=0.001)
model2.compile(optimizer=opt, loss=keras.losses.binary_crossentropy, metrics=['accuracy'])
Thank you!

Looking at the accuracies for what I assume is a binary problem, you can observe that the model is just guessing randomly (acc ~ 0.5).
The fact that your 3-layer model gives much better results on the train set indicates that you are not training the VGG model long enough for it to overfit.
In addition, you do not seem to use a proper initialization of the network. Note: at the beginning of an implementation, overfitting indicates that the training pipeline works. Hence it is a good thing in this phase.
Therefore, the first step would be to get the model to overfit. You seem to train from scratch. In that case it can take a few hundred epochs until the gradients impact the first convolutions of a complex model like VGG16.
As the 3-layer CNN seems to overfit quite heavily, I conclude that your dataset is rather small.
Hence, I would recommend starting from a pre-trained model (VGG16) and just re-training the last two layers. This should give much better results.
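For a sense of scale, the size of the fully connected heads in the two models from the question can be worked out by hand. The sketch below assumes the 256x256 input from the question, 'same'-padded convolutions (which keep the spatial size) and 2x2 pooling with stride 2 (which halves it); the helper names are hypothetical.

```python
def flattened_size(height, width, channels, n_pools, pool=2):
    """Number of values after n_pools pooling layers followed by Flatten."""
    for _ in range(n_pools):
        height //= pool
        width //= pool
    return height * width * channels


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# VGG-16 as defined in the question: five pooling layers, 512 channels at the end.
vgg_flat = flattened_size(256, 256, 512, n_pools=5)
vgg_head = dense_params(vgg_flat, 4096) + dense_params(4096, 4096) + dense_params(4096, 1)

# 3-layer CNN: two pooling layers, 64 channels at the end.
small_flat = flattened_size(256, 256, 64, n_pools=2)
small_head = dense_params(small_flat, 128) + dense_params(128, 1)

print(vgg_flat, vgg_head)      # 32768 151007233
print(small_flat, small_head)  # 262144 33554689
```

Both heads carry tens of millions of weights, which is a lot to fit from scratch on a small dataset.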

As per what @CAFEBABE suggested, I have tried two approaches. First, I increased the number of epochs to 200, changed the optimiser to SGD and reduced the learning rate to 1e-5.
Second, I loaded pre-trained weights for the VGG-16 model and only trained the last two convolutional layers. Below is the plot displaying the tuned VGG-16 model, the pre-trained VGG-16 model and the 3-layer CNN model (from top to bottom).
Certainly, tuning had an effect on the performance, but it was very marginal. I guess the learnable features in a dataset of ~600 images were not sufficient to train the model. The pre-trained weights helped significantly, with the model reaching overfitting at ~25 epochs. However, in comparison with the 3-layer CNN model, the testing accuracies of these two models are similar, ranging between 0.7 and 0.8. I guess this is again due to the limitations of the dataset.
Thanks again to @CAFEBABE for helping with my problem, and I hope this can help other people who face a similar problem.

Keras conv nn predicting only one class?

So I've been building a convolutional neural network. I'm trying to predict whether a boardgame state (10x10 matrix) will lead to a win (binary 0 or 1) or not.
I have six million examples, which you would think would be enough, but clearly not, as my network is predicting all of one class...
Is there something obvious I'm missing? I tried giving it even 10 examples and it still predicts them all as the same class.
The input matrices are 10x10 of integers.
Input reshaping:
x_train = x_train.reshape(len(x_train),10,10,1)
Actual model building:
model = Sequential()
model.add(Conv2D(3, kernel_size=(1, 1), strides=(1, 1), activation='relu', input_shape=(10,10,1)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(500, activation='tanh'))
model.add(Dropout(0.5))
model.add(keras.layers.Dense(75, activation='relu'))
model.add(BatchNormalization())
model.add(keras.layers.Dense(10, activation='sigmoid'))
model.add(keras.layers.Dense(1,kernel_initializer='normal',activation='sigmoid'))
optimizerr = keras.optimizers.SGD(lr=0.001, momentum=0.9, decay=0.01, nesterov=True)
model.compile(optimizer=optimizerr, loss='binary_crossentropy', metrics=[metrics.binary_accuracy])
model.fit(x_train, y_train,epochs = 100, batch_size = 128, verbose=1)
I've tried modifying the learning rate, momentum, decay, the kernel_sizes, layer types, sizes... I checked for dying relu and that didn't seem to be the problem. Removing the dropout/batch normalization layers (or various random layers) didn't do anything either.
The data have roughly 53/47% split across the labels, so it's not that either.
I'm more confused because even when I ask it to predict the train set, it STILL insists on only labeling things one class, even if there are only ~20 samples or fewer.
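One number that can be worked out from the optimizer settings above is the effective learning rate. The sketch assumes the legacy Keras SGD `decay` argument, which applies time-based decay per batch update as lr / (1 + decay * iterations), and a training set of exactly six million examples. It does not by itself explain the behaviour on ~10 to 20 samples, where there is only one update per epoch.

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay (assumed legacy Keras SGD behaviour): lr / (1 + decay * iterations)."""
    return initial_lr / (1.0 + decay * iterations)


steps_per_epoch = 6_000_000 // 128  # assumed dataset size, batch size from the question

print(steps_per_epoch)                            # 46875
print(decayed_lr(0.001, 0.01, steps_per_epoch))   # learning rate after one epoch, about 2.1e-06
print(decayed_lr(0.001, 0.01, 100))               # after 100 updates (100 epochs on a tiny set): 0.0005
```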
