Text classification with LSTM Network and Keras - python

I'm currently using a Naive Bayes algorithm to do my text classification.
My end goal is to highlight parts of a large text document when the algorithm decides that a sentence belongs to a category.
Naive Bayes results are good, but I would like to train a NN for this problem, so I've followed this tutorial:
http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ to build my LSTM network on Keras.
All these notions are quite difficult for me to understand right now, so excuse me if you see some really stupid things in my code.
1/ Preparation of the training data
I have 155 sentences of different sizes that have been tagged to a label.
All these tagged sentences are in a training.csv file:
8,9,1,2,3,4,5,6,7
16,15,4,6,10,11,12,13,14
17,18
22,19,20,21
24,20,21,23
(each integer representing a word)
The labels are in a separate labels.csv file:
6,7,17,15,16,18,4,27,30,30,29,14,16,20,21 ...
I have 155 lines in training.csv and, of course, 155 integers in labels.csv.
My dictionary has 1038 words.
2/ The code
Here is my current code:
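import csv
import numpy
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence
total_words = 1039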
## fix random seed for reproducibility
numpy.random.seed(7)
datafile = open('training.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)
X = data
Y = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")
max_sentence_length = 500
X_train = sequence.pad_sequences(X, maxlen=max_sentence_length)
X_test = sequence.pad_sequences(X, maxlen=max_sentence_length)
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(total_words, embedding_vecor_length, input_length=max_sentence_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_train, Y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
This model is never converging:
155/155 [==============================] - 4s - loss: 0.5694 - acc: 0.0000e+00
Epoch 2/3
155/155 [==============================] - 3s - loss: -0.2561 - acc: 0.0000e+00
Epoch 3/3
155/155 [==============================] - 3s - loss: -1.7268 - acc: 0.0000e+00
I would like to have one of the 24 labels as a result, or a list of probabilities for each label.
What am I doing wrong here?
Thanks for your help!
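One likely reason for the negative loss values in the log is that binary cross-entropy expects targets of 0 or 1, while the labels here are class ids such as 6 or 30. The sketch below writes out the textbook formula by hand to show the effect. The function name and the sample probability 0.9 are made up for illustration, and Keras applies extra clipping that is not reproduced here.

```python
import math

def binary_crossentropy(y_true, p):
    """Textbook binary cross-entropy for a single prediction p in (0, 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Target in {0, 1}: the loss is positive, as expected.
valid_loss = binary_crossentropy(1, 0.9)

# Class id 6 used as if it were a binary target: (1 - y_true) becomes -5,
# so the second term flips sign and the loss drops below zero.
invalid_loss = binary_crossentropy(6, 0.9)

print(valid_loss, invalid_loss)
```

With more than two classes, the usual setup is one output unit per class with a softmax activation and a categorical cross-entropy loss.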

I've updated my code thanks to the great comments posted to my question.
Y_train = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")
Y_test = numpy.genfromtxt("labels_test.csv", dtype="int", delimiter=",")
Y_train = np_utils.to_categorical(Y_train)
Y_test = np_utils.to_categorical(Y_test)
max_review_length = 50
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(31, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.fit(X_train, Y_train, epochs=100, batch_size=30)
I think I can still tune the LSTM size (10 or 100), the number of epochs and the batch size.
The model has very poor accuracy (40%), but I currently think that is because I don't have enough data (150 sentences for 24 labels).
I will put this project on hold until I get more data.
If someone has ideas to improve this code, feel free to comment!
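For readers wondering why the updated model ends in Dense(31) although there are only 24 labels: one-hot encoding creates one column per integer from 0 up to the largest label id, and the sample labels above go up to 30. The sketch below is a plain-Python stand-in for that encoding under this assumption, not the Keras implementation; `one_hot` is a hypothetical helper and the label list is just the first few values shown in the question.

```python
def one_hot(labels, num_classes=None):
    """Encode integer class ids as rows with a single 1.0 at the class index."""
    if num_classes is None:
        # Assumed default: width is the largest label id plus one.
        num_classes = max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

# First few labels from the question; the largest id is 30.
sample_labels = [6, 7, 17, 15, 16, 18, 4, 27, 30]
encoded = one_hot(sample_labels)

print(len(encoded[0]))  # width of each encoded row
```

Remapping the label ids to a contiguous range 0..23 before encoding would let the output layer have exactly 24 units.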

Related

Loss is NAN using Keras on the MNIST digit set

I am following an example from a data science textbook and have run into an issue where I am getting NaN values for the loss when running simple Keras neural networks to find the optimal learning rate.
# Get data and split into test/train/valid and normalize
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.
# Callback to grow the learning rate at each iteration.
# Also record learning rate and loss at each iteration.
K = keras.backend
class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []
    def on_batch_end(self, batch, logs):
        self.rates.append(K.get_value(self.model.optimizer.lr))
        self.losses.append(logs["loss"])
        K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)
# Define the model and compile/fit.
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])
expon_lr = ExponentialLearningRate(factor=1.005)
history = model.fit(X_train, y_train, epochs=1,
                    validation_data=(X_valid, y_valid),
                    callbacks=[expon_lr])
Running this gives an output of:
1719/1719 [==============================] - 6s 4ms/step - loss: nan - accuracy: 0.6030 - val_loss: nan - val_accuracy: 0.0958
Plotting the loss vs learning rate gives (top is my result, bottom is the expected result from the example I am following):
Notably, the example loss is much noisier than mine and ranges from ~2.5 to ~0.25. My loss only ranges from ~2.5 to exactly 1, at which point the loss goes NaN.
Perhaps something with keras/tf has been updated since this example was written, but as I am new to keras I am wondering what might be the issue here.
Your problem is the ExponentialLearningRate callback: it grows your learning rate from 0.0010150751 to 5.237502 within the epoch, which is why your loss explodes. Change the optimizer like this:
optimizer=tf.keras.optimizers.Adam(0.001)
and remove the callback. Your loss will be fine then.
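The growth of the learning rate can be checked with plain arithmetic. This is a minimal sketch assuming the initial rate of 1e-3 and factor=1.005 from the question and the 1719 steps shown in the log; `exponential_schedule` is a made-up helper name, not part of Keras.

```python
def exponential_schedule(initial_rate, factor, steps):
    """Learning rate seen at each step when it is multiplied by `factor` after every batch."""
    return [initial_rate * factor ** step for step in range(steps)]

rates = exponential_schedule(initial_rate=1e-3, factor=1.005, steps=1719)

print(rates[0], rates[-1])
print(rates[-1] / rates[0])  # overall growth over one epoch
```

The last rate ends up a little above 5, several thousand times the starting value, which is the same range the answer quotes.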

Keras neural network takes only few samples to train

data = np.random.random((10000, 150))
labels = np.random.randint(10, size=(10000, 1))
labels = to_categorical(labels, num_classes=10)
model = Sequential()
model.add(Dense(units=32, activation='relu', input_shape=(150,)))
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels, epochs=30, validation_split=0.2)
I created 10000 random samples to train my net, but it seems to use only a few of them (250/10000).
Example of the 1st epoch:
Epoch 1/30
250/250 [==============================] - 0s 2ms/step - loss: 2.1110 - accuracy: 0.2389 - val_loss: 2.2142 - val_accuracy: 0.1800
Your data is split into training and validation subsets (validation_split=0.2).
Training subset has size 8000 and validation 2000.
Training goes in batches, each batch has size 32 samples by default.
So one epoch should take 8000/32=250 batches, as it shows in the progress.
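The arithmetic above can be written as a small helper. This is a sketch of the calculation only, assuming the default batch size of 32 mentioned in the answer; `steps_per_epoch` is a hypothetical function name and does not call Keras.

```python
import math

def steps_per_epoch(n_samples, validation_split=0.0, batch_size=32):
    """Number of batches shown in the progress bar for one epoch."""
    # Samples left for training after the validation part is held out.
    n_train = int(n_samples * (1 - validation_split))
    # A final partial batch still counts as one step.
    return math.ceil(n_train / batch_size)

steps = steps_per_epoch(10000, validation_split=0.2, batch_size=32)
print(steps)
```

So the 250 in the progress bar counts batches, not samples.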
Try code like the following example, which sets batch_size explicitly:
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(10, size=(1000, 1))
# Convert labels to categorical one-hot encoding
one_hot_labels = keras.utils.to_categorical(labels, num_classes=10)
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, one_hot_labels, epochs=10, batch_size=32)

Keras [Text multi-classification] - Good accuracy in training and test but bad in prediction

I have been facing several problems when trying to predict topics based on news articles. The news articles have been cleaned (no punctuation, numbers, ...). There are 6 possible classes and I have a dataset of 13000 news articles per class (the dataset is uniformly distributed).
Pre-processing:
stop_words = set(stopwords.words('english'))
for index, row in data.iterrows():
    print ("Index: ", index)
    txt_clean = ' '.join(re.sub("([^a-zA-Z ])", " ", data.loc[index,'txt_clean']).split()).lower()
    word_tokens = word_tokenize(txt_clean)
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    cleaned_text = ''
    for w in filtered_sentence:
        cleaned_text = cleaned_text + ' ' + w
    data.loc[index,'txt_clean'] = cleaned_text
I implemented a RNN using LSTM as the following:
model = Sequential()
model.add(Embedding(50000, 100, input_length=500))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1)
accr = model.evaluate(X_test,Y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0],accr[1]))
Prediction:
model = load_model('model.h5')
data = data.sample(n=15000)
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])
tokenizer = Tokenizer(num_words=50000)
tokenizer.fit_on_texts(data['txt_clean'].values)  # prediction data sample, not the same values as in training
CATEGORIES = ['A','B','C','D','E','F']
for index, row in data.iterrows():
    seq = tokenizer.texts_to_sequences([data.loc[index,'txt_clean']])
    padded = pad_sequences(seq, maxlen=500)
    pred = model.predict(padded)
    pred = pred[0]
    print (pred, pred[np.argmax(pred)])
For example after 10 epochs and a batch_size of 500:
Training acc: 0.831
Training loss: 0.513
Test acc: 0.714
Test loss: 0.907
I also tried reducing batch_size to 64:
Training acc: 0.859
Training loss: 0.415
Test acc: 0.771
Test loss: 0.679
The results with a batch size of 64 seem better to me, BUT when I predict news articles one by one I get an accuracy of 15.97%. This prediction accuracy is much lower than the training and test accuracy.
What could be the problem?
Thanks!
This is a classic problem in ML and DL. There can be several reasons for it:
Overfitting: try making the model deeper or adding some normalization.
The train and test datasets are dissimilar.
The preprocessing differs: use the same preprocessing steps at prediction time as during training.
Class imbalance, i.e. the training dataset contains more data for a particular class than the test dataset does.
The model architecture: try a Bidirectional LSTM or a GRU.
In your prediction code you fit a new Tokenizer on the prediction data, so words get different integer indices than they had during training. Save the fitted Keras tokenizer with pickle or joblib and use that same tokenizer for both training and prediction.
Here is sample code to save and load the Keras tokenizer:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
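To illustrate why reusing the training tokenizer matters, here is a small sketch of what goes wrong when a second vocabulary is built from different texts. It uses a simplified stand-in that numbers words by frequency; it is not the Keras Tokenizer itself. The helper names and the example sentences are invented for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Simplified stand-in for a fitted tokenizer: most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index):
    """Turn a text into a list of integer ids, skipping unknown words."""
    return [word_index[w] for w in text.split() if w in word_index]

# Hypothetical corpora: one used for training, one seen only at prediction time.
train_texts = ["market stocks fall", "market rally", "team wins match"]
predict_texts = ["team wins match", "team loses match", "market rally"]

train_index = build_word_index(train_texts)
predict_index = build_word_index(predict_texts)

sentence = "market rally"
train_ids = encode(sentence, train_index)      # ids the model was trained on
predict_ids = encode(sentence, predict_index)  # ids from a re-fitted vocabulary

print(train_ids, predict_ids)
```

The same sentence maps to different integers under the two vocabularies, so the embedding layer looks up vectors for unrelated words and the predictions fall to roughly chance level.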

Keras Binary Classifier Tutorial Example gives only 50% validation accuracy

The Keras binary classifier tutorial example gives only about 50% validation accuracy.
An untrained classifier would already reach roughly 50% accuracy on a binary classification problem.
This example is straight from https://keras.io/getting-started/sequential-model-guide/
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
np.random.seed(10)
# Generate dummy data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((800, 20))
y_test = np.random.randint(2, size=(800, 1))
model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=50,
          batch_size=128,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, batch_size=128)
Accuracy output (I ran multiple trials and also increased the number of hidden layers):
Epoch 50/50
1000/1000 [==============================] - 0s 211us/sample - loss: 0.6905 - accuracy: 0.5410 - val_loss: 0.6959 - val_accuracy: 0.4812
Could someone help me understand if anything is wrong here?
How to increase the accuracy for this "example" problem presented in the tutorial?
If you train a classifier on random examples, you will always get approx. 50% accuracy on the validation data, here represented by x_test. That is because the training samples are assigned random classes, and the validation or test set is assigned random classes as well, so nothing the model learns carries over. This is why chance-level accuracy, i.e. 50%, occurs.
The more epochs you train on the training set, the higher the accuracy on the training set becomes, as an effect of overfitting.
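The chance-level result can be reproduced without any network. This is a minimal sketch assuming two equally likely classes, where random guesses stand in for any model whose output is independent of the labels; the sample size and seed are arbitrary.

```python
import random

random.seed(10)
n = 10000

# Random binary labels, as in the tutorial's dummy data.
labels = [random.randint(0, 1) for _ in range(n)]

# Stand-in for a model that has learned nothing useful about these labels.
predictions = [random.randint(0, 1) for _ in range(n)]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
print(accuracy)
```

Accuracy stays close to 0.5 however the guesses are produced, because the labels carry no relationship to the inputs.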

My keras neural network does not predict handwritten numbers

I trained a feedforward neural network with Keras, but there is a problem with it: it predicts images of my own handwritten digits incorrectly.
(X_train, y_train), (X_test, y_test) = mnist.load_data()
RESHAPED = 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
Y_train = np_utils.to_categorical(y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(y_test, NB_CLASSES)
model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=OPTIMIZER,
              metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
//output
....
Epoch 20/20
48000/48000 [==============================] - 2s 49us/step - loss: 0.0713 - acc: 0.9795 - val_loss: 0.1004 - val_acc: 0.9782
10000/10000 [==============================] - 1s 52us/step
Test score: 0.09572517980986121
Test accuracy: 0.9781
Next, I loaded my own 28x28 image of a handwritten digit (for example, a 2) and executed this script:
img_array = imageio.imread('2.png', as_gray=True)
predictions=model.predict(img_array.reshape(1,784))
print (np.argmax(predictions))
//output: for example 3, but I expect 2
I tried other pictures with different digits, and they were also predicted incorrectly. What is wrong? The model reports a test accuracy of 0.9781. Please help.
You are using the MNIST dataset. Its digits were written by American Census Bureau employees and high-school students, so you'll have to deal with regional handwriting variation; make sure your digits are written in the American style.
On the other hand, some errors are normal. Make a bigger set of handwritten digits and run more tests.
This is a common data-mismatch problem. To get good accuracy on your own images, you need to preprocess them the same way as the training data. For example, all MNIST images show a white digit on a black background, so that is one modification you should make to your own images. For more information, check this Medium article:
https://medium.com/#o.kroeger/tensorflow-mnist-and-your-own-handwritten-digits-4d1cd32bbab4
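As a rough sketch of that kind of preprocessing: the training code divides pixel values by 255, while the prediction script passes the raw image, so the helper below applies the same scaling and inverts the image when the background is light. The function name `prepare_digit`, the mean-brightness test for a light background, and the synthetic test image are all assumptions for illustration. Centering and cropping of the digit are not covered.

```python
import numpy as np

def prepare_digit(img):
    """Convert a 28x28 grayscale image (values 0-255) into a (1, 784) float input."""
    x = np.asarray(img, dtype="float32") / 255.0  # same scaling as X_train /= 255
    if x.mean() > 0.5:
        # Mostly bright pixels: assume a dark digit on a light background
        # and invert it to the MNIST style (light digit on dark background).
        x = 1.0 - x
    return x.reshape(1, 784)

# Hypothetical stand-in for a scanned image: white page with a dark vertical stroke.
fake_scan = np.full((28, 28), 255, dtype="uint8")
fake_scan[4:24, 13:15] = 0

model_input = prepare_digit(fake_scan)
print(model_input.shape, model_input.min(), model_input.max())
```

The result could then be passed to model.predict in place of img_array.reshape(1,784).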
