Making a prediction after training and fitting an RNN Sequential model - python

I am trying to get predictions from my sentiment analysis models, which classify 500-word news articles. The models' validation loss and training loss are about the same, and their scores are relatively high. However, when I try to make predictions with them, I get the same classification result every time, regardless of the text input.
I believe the problem might be in the way I am trying to make a prediction (I pad my string with space-separated filler characters). I was hoping that someone here could shed some light on this issue (my code is below). Thank you for your help.
apad = ''  # assumed to start empty; the original snippet does not show its initialization
for i in range(300 - len(comment.split(' '))):
    apad += ' A'
comment = comment + apad
X = tokenizer.texts_to_sequences([comment])
X = preprocessing.sequence.pad_sequences(X)
yhat = b.predict_classes(X)
prediction = b.predict(X, batch_size=None, verbose=0, steps=None)
The output of this script is below. Regardless of the text input, the predicted class is always 0 and the prediction is always close to 0 for some reason:
[[0]] [[0.00645966]]

The problem seems to be with the tokenizer.
You can't fit the tokenizer again at prediction time, because refitting assigns the words different integer indices from the ones the model was trained on. Fit the tokenizer only once, before training, then save it and reuse the same word-to-index mapping for all new text.
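As a rough illustration of why refitting breaks things, here is a minimal sketch in plain Python. The helpers `fit_word_index` and `to_sequence` are hypothetical stand-ins for a frequency-ranked tokenizer (they are not the Keras API), and the example texts are made up. It shows that a mapping fitted on new text encodes the same sentence differently, while a mapping saved after training and reloaded reproduces the training-time indices:

```python
import json
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def to_sequence(text, word_index):
    """Encode text as integers, skipping words not seen during fitting."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


# Made-up example data (assumption, not from the question).
train_texts = ["the market fell sharply", "the market rose", "the outlook is good"]
new_text = "good news for the market"

train_index = fit_word_index(train_texts)   # done once, before training
refit_index = fit_word_index([new_text])    # what NOT to do at prediction time

# Persist the training-time mapping and load it back for prediction.
saved = json.dumps(train_index)
restored_index = json.loads(saved)

seq_restored = to_sequence(new_text, restored_index)
seq_refit = to_sequence(new_text, refit_index)

print("restored mapping:", seq_restored)
print("refitted mapping:", seq_refit)
```

With a real tokenizer object the same idea applies: persist the fitted object (or its word index) once after training, and load it at prediction time instead of fitting again.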


Tensorflow BERT for token-classification - exclude pad-tokens from accuracy while training and testing

I'm doing token-based classification using the pre-trained BERT model for TensorFlow to automatically label causes and effects in sentences.
To access BERT, I'm using the TFBertForTokenClassification interface from Hugging Face.
The sentences I use for training are all converted to tokens (basically a mapping of words to numbers) by the BERT tokenizer and then padded to a fixed length before training. So when one sentence has only 50 tokens and another has only 30, the first is filled up with 50 pad tokens and the second with 70, giving a uniform input length of 100.
I then train my model to predict, for every token, which label it belongs to: part of the cause, part of the effect, or neither.
However, during training and evaluation, my model also makes predictions on the PAD tokens, and these are included in the model's accuracy. Because PAD tokens are very easy for the model to predict (they always have the same token ID and they all have the "none" label, meaning they belong to neither the cause nor the effect of the sentence), they really distort my model's accuracy.
For example, if a sentence has 30 words -> 30 tokens and all sentences are padded to a length of 100, then this sentence would get a score of 70% even if the model predicted none of the "real" tokens correctly.
This way I'm getting training and validation accuracy of 90+% really quickly, although the model performs poorly on the real (non-pad) tokens.
I thought the attention mask was there to solve this problem, but this doesn't seem to be the case.
The input-datasets are created as follows:
def example_to_features(input_ids, attention_masks, token_type_ids, label_ids):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks}, label_ids
train_ds =,attention_masks_train,token_ids_train,label_ids_train)).map(example_to_features).shuffle(buffer_size=1000).batch(32)
Model creation:
import tensorflow as tf
from transformers import TFBertForTokenClassification
num_epochs = 30
model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
model.layers[-1].activation = tf.keras.activations.softmax
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
And then I train it like this:
history =, epochs=num_epochs, validation_data=validate_ds)
Has anyone encountered this problem, or does anyone know how to exclude the predictions on pad tokens from the model's accuracy during training and evaluation?
Yes, this is normal.
The output of BERT, with shape [batch_size, max_seq_len = 100, hidden_size], will include embeddings for the [PAD] tokens as well. However, you also provide attention_masks to the BERT model so that it does not take these [PAD] tokens into consideration.
Similarly, you need to mask these [PAD] tokens before passing the BERT outputs to the final fully-connected layer, mask them when calculating the loss, and mask them when calculating metrics such as accuracy, precision and recall.
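To make the masking of the metric concrete, here is a minimal sketch in plain Python (no TensorFlow). It assumes labels and predictions are lists of label IDs per sentence and that the attention mask is 1 for real tokens and 0 for [PAD]. The function names and the label IDs (0 = none, 1 = cause, 2 = effect) are assumptions for illustration. The data reproduces the 30-token example from the question:

```python
def plain_accuracy(true_labels, predicted_labels):
    """Accuracy over every position, padding included."""
    total = 0
    correct = 0
    for true_row, pred_row in zip(true_labels, predicted_labels):
        for true, pred in zip(true_row, pred_row):
            total += 1
            correct += int(true == pred)
    return correct / total


def masked_accuracy(true_labels, predicted_labels, attention_mask):
    """Accuracy over positions where the attention mask is 1 (real tokens)."""
    counted = 0
    correct = 0
    for true_row, pred_row, mask_row in zip(true_labels, predicted_labels, attention_mask):
        for true, pred, keep in zip(true_row, pred_row, mask_row):
            if keep:
                counted += 1
                correct += int(true == pred)
    return correct / counted if counted else 0.0


# One sentence: 30 real tokens (labels 1 or 2) followed by 70 [PAD] tokens (label 0).
true_labels = [[1] * 15 + [2] * 15 + [0] * 70]
attention_mask = [[1] * 30 + [0] * 70]
# A model that predicts "none" (0) everywhere gets every real token wrong.
predicted_labels = [[0] * 100]

plain = plain_accuracy(true_labels, predicted_labels)
masked = masked_accuracy(true_labels, predicted_labels, attention_mask)
print("accuracy including padding:", plain)
print("accuracy on real tokens only:", masked)
```

In a training loop the same idea is usually expressed by weighting each position with the mask, so that padded positions contribute nothing to the loss or to the metric.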

BERT sentence classification

I am using the Python BERT models:
My goal is to build a binary classification model to predict if a news headline is relevant to a specific category. I have a training set of data which has news headline sentences as well as binary values to indicate if the headline is valid or invalid.
I tried to run the script and the results I obtained do not seem to make sense. The test results file has two columns, with the same two numbers repeated on each row:
Also, in the model parameters I have task_name set to cola. After reading the academic paper for BERT, I feel that this is not an appropriate task name. The paper lists several other tasks on pages 14 and 15, but none of them seem appropriate for the binary categorization of sentences based on content.
How can I properly use BERT to classify sentences? I tried using this guide.
But it did not yield the results I had expected.
For a binary classification task (I assume you used the cola processor), BERT's predictions on the test set are written to the test_results.tsv file.
In order to interpret test_results.tsv, you must know its structure.
The file contains as many rows as there are inputs in the test set, and as many columns as there are labels. (Since your task is binary classification, there will be two columns: one for label 0 and one for label 1.)
The value in each column is the softmax output, i.e. the probability of the corresponding class (or label); the values across all columns of a given row must sum to 1.
In your case, note that 0.9999991 and 9.12E-6 (9.12*10^(-6)) are not the same number. Their sum is ~1. (This means the test input is predicted to belong to the class indicated by label 0.)
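As a small sketch of how such a file can be read, the snippet below parses tab-separated softmax rows and picks the most probable label for each one. The helper name `read_predictions` is made up for illustration, and the in-memory text simply repeats the two values quoted above; in practice you would open your own test_results.tsv instead.

```python
import csv
import io

# Two rows with the values quoted above (assumption: the file is tab-separated
# with one probability per label and no header row).
tsv_text = "0.9999991\t9.12E-6\n0.9999991\t9.12E-6\n"


def read_predictions(handle):
    """Return (predicted_label, probabilities) for each row of a TSV file."""
    results = []
    for row in csv.reader(handle, delimiter="\t"):
        probabilities = [float(value) for value in row]
        # The column index with the highest probability is the predicted label.
        predicted_label = probabilities.index(max(probabilities))
        results.append((predicted_label, probabilities))
    return results


predictions = read_predictions(io.StringIO(tsv_text))
for label, probs in predictions:
    print("label:", label, "probabilities:", probs, "sum:", round(sum(probs), 4))
```

If every row yields the same label, as here, that reflects what the model predicted for those inputs; it is not a formatting problem in the file.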
How can I properly use BERT to classify sentences?
Take a look at this complete working code for sentence classification, using IMDB Sentiment Analysis (Binary text classification on Google Colab using GPU)
Basically, you can use TensorFlow and keras-bert to do that. The steps involved are:
1. Load and transform your custom data.
2. Load the pre-trained model and define the network for fine-tuning.
3. Train/fine-tune the model using the custom data.
4. Classify using the trained model.
Here is a brief snippet to help.
model = load_trained_model_from_checkpoint(
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)
history =
predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
texts = [
    "It's a must watch",
    "Can't wait for it's next part!",
    'It fell short of expectations.',
]
for text in texts:
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    inpu = np.array(ids).reshape([1, SEQ_LEN])
    predicted_id = model.predict([inpu, np.zeros_like(inpu)]).argmax(axis=-1)[0]
    print("%s: %s" % (id_to_labels[predicted_id], text))
positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
Hope that helps.

Keras Validation Accuracy is Zero but other metrics are normal

I am working on a computer vision problem in Keras and I have run into an interesting problem. My val_acc is 0.0000e+00. This is especially interesting because my other metrics, such as loss, acc, and val_loss, are all behaving normally.
This started happening when I switched from the Sequence data_generator to a custom one that I'm pretty sure is working as intended. My issue is very similar to the one in "validation accuracy is 0 with Keras fit_generator", but no answer was reached in that thread.
I have checked to make sure my activations and loss metrics are appropriate for my particular problem. I am using loss='categorical_crossentropy' and metrics=['accuracy'], and am attempting to predict the month that a certain spectrogram comes from. The validation data is loaded in exactly the same way as the training data, so I really can't figure out what's happening. Also, even random guessing should give a val_acc of about 1/12, right? It can't be zero.
Here is my model architecture:
x = (Convolution2D(32,5,5,activation='relu',input_shape=(501,501,1)))(input_img)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Convolution2D(32,5,5,activation='relu'))(x)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Dropout(0.25))(x)
x = (Flatten())(x)
x = (Dense(128,activation='relu'))(x)
x = (Dropout(0.5))(x)
classify = (Dense(12,activation='softmax', kernel_regularizer=regularizers.l1_l2(l1 = 0.001,l2 = 0.001)))(x)
model = Model(input_img,classify)
and here is my call to fit_generator:
model.fit_generator(generator = pd.data_generator(folder,'train'),
validation_data = pd.data_generator(folder,'test'),
and finally here is the important part of my data generator:
if mode == 'test':
    while True:
        for things in up.unpickle_batch(folder, 50, 6000, 7200):  # The last 1200 things in batches of 50
            test_spect = []
            test_months = []
            for thing in things:
                test_spect.append(thing.spect)  # GET BATCH DATA
                test_months.append(thing.month - 1)  # months go from 1-12 but should go from 0-11 for to_categorical
            x_test = np.asarray(test_spect)  # PREPARE BATCH DATA
            x_test = x_test.astype('float32')
            x_test /= np.amax(x_test)  # - 0.5
            X_test = np.reshape(x_test, (-1, 501, 501, 1))
            Y_test = np_utils.to_categorical(test_months, 12)
            yield X_test, Y_test  # RETURN BATCH DATA
Check for bad data.
Make sure your data is what you think it is -- shuffled, distributed the same as your validation and/or test set, free of misleading/erroneous/contradictory samples. You can probably generate a failproof dataset (e.g. distinguish dark images from light ones, or sharp versus blurry) and prove that everything but the data is OK. If you can't, then look more closely at your code. This, however, sounds like a data problem.
I just fixed a similar problem in a simple 3-layer MLP network for which training loss & accuracy were heading in reasonable directions, validation loss was following training loss (but lagging) yet validation accuracy hovered at zero. There was an off-by-one error in my training dataset generation (a sampling script from a larger set) that meant that 1 sample in the entire block of samples for one type had the label for the next block for a different type. 499 correct samples out of 500 was insufficient to keep the training on track.
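One quick data check along these lines is to compare which labels actually occur in the training slice and in the validation slice. The sketch below is plain Python with made-up data: it assumes, purely for illustration, 7200 records sorted by month and split by position (first 6000 for training, last 1200 for validation). The helper names are hypothetical, and nothing here is known about the real dataset's ordering.

```python
from collections import Counter


def label_distribution(labels, num_classes):
    """Fraction of samples per class, indexed by class ID."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def missing_classes(labels, num_classes):
    """Class IDs that never occur in the given labels."""
    present = set(labels)
    return [c for c in range(num_classes) if c not in present]


# Hypothetical data: 600 records per month, stored in month order (0..11).
all_months = [month for month in range(12) for _ in range(600)]
train_months = all_months[:6000]       # positional split, as an assumption
val_months = all_months[6000:7200]

train_missing = missing_classes(train_months, 12)
val_missing = missing_classes(val_months, 12)
val_dist = label_distribution(val_months, 12)

print("classes never seen in training:", train_missing)
print("classes never seen in validation:", val_missing)
print("validation distribution:", val_dist)
```

In this made-up case the validation slice contains only classes the model never saw during training, which would produce a validation accuracy of exactly zero while the training metrics look normal. Whether anything similar applies to a given dataset has to be checked on the real labels.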

keras - seq2seq model predicting same output for all test inputs

I am trying to build a seq2seq model using LSTM in Keras. I am currently working on the English-to-French pairs dataset, using 10k pairs (the original dataset has 147k pairs). After training is completed, when I try to predict the output for a given input sequence, the model predicts the same output irrespective of the input sequence. I am also using separate embedding layers for the encoder and the decoder.
What I observe is that the predicted words are simply the most frequent words in the dataset, displayed in decreasing order of frequency.
For the input sequences 'I know you', 'Can we go ?', and 'snap out of it', the output is 'je suis en train' (the same output for all three).
Can anyone help me understand why the model is behaving like this? Am I missing something basic?
I tried the following with batchsize=32, epoch=50, maxinp=8, maxout=8, embeddingsize=100.
encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
encoder_lstm1 = LSTM(units=HIDDEN_UNITS, return_state=True, name="encoder_lstm1" , stateful=False, dropout=0.2)
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm1(encoder_inputs)
encoder_states = [encoder_state_h, encoder_state_c]
decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
decoder_lstm = LSTM(units=HIDDEN_UNITS, return_sequences=True, return_state=True, stateful=False,
                    name='decoder_lstm', dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)
self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Xtrain, Xtest, Ytrain, Ytest = train_test_split(input_texts_word2em, self.target_texts, test_size=0.2, random_state=42)
train_gen = generate_batch(Xtrain, Ytrain, self)
test_gen = generate_batch(Xtest, Ytest, self)
train_num_batches = len(Xtrain) // BATCH_SIZE
test_num_batches = len(Xtest) // BATCH_SIZE
self.model.fit_generator(generator=train_gen, steps_per_epoch=train_num_batches,
                         verbose=1, validation_data=test_gen, validation_steps=test_num_batches)  # , callbacks=[checkpoint])
self.encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_inputs = [Input(shape=(HIDDEN_UNITS,)), Input(shape=(HIDDEN_UNITS,))]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
self.decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
Update: I have run the model on the 147k dataset with 5 epochs, and the results now vary for every input. Thanks for the help.
However, I am now running the same model on another dataset, in which the input and output sequences contain on average 170 and 100 words respectively after cleaning (removing stopwords and so on).
This dataset has around 30k records, and even if I run 50 epochs, the results are again the same for every test sentence. What are my next options to try? I was expecting at least different outputs for different inputs (even if they are wrong), but getting the same output makes me wonder whether the model is learning properly at all.
Any answers?
An LSTM-based encoder-decoder (Seq2Seq) that is correctly set up may produce the same output for any input when the net has not trained for enough epochs. I can reliably reproduce the "same stupid output no matter what the input" result simply by reducing my number of epochs from 200 to 30.
One thing that may be confusing is that default accuracy measures may not seem to improve much, as you go from 30 to 150 epochs. However, in cases where you are using Seq2Seq for chatbot or translation tasks, something like the BLEU score is more relevant. Even more relevant is your own evaluation of the 'realism' of the responses. These evaluations are done at inference time, not during training - but they can influence your evaluation of whether or not the model has trained sufficiently.
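As a rough idea of what a BLEU-style score measures, here is a simplified sketch in plain Python. It is not a full BLEU implementation: it uses a single reference, n-grams up to bigrams, and no smoothing. The function names are made up, and the reference translation 'je te connais' for 'I know you' is an assumption for illustration; the candidate is the repeated output quoted in the question.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "je te connais".split()       # assumed reference translation
candidate = "je suis en train".split()    # the repeated output from the question

unigram_precision = clipped_precision(candidate, reference, 1)
score_bad = simple_bleu(candidate, reference)
score_perfect = simple_bleu(reference, reference)
print("unigram precision:", unigram_precision)
print("score for repeated output:", score_bad)
print("score for exact match:", score_perfect)
```

A score like this is computed on decoded sentences at inference time, which is why it can separate a model that repeats one phrase from a model that responds to its input even when the per-token training accuracy barely moves.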
Another thing to consider is that, in my case, the training data was movie dialogue from the Cornell set. Chatbot dialogue is supposed to be helpful - movie dialogue has no such mandate - it's just supposed to be interesting. So we do have a mismatch here between training and use.
In any event, here are examples of the exact same net trained after 30 vs. 150 epochs, responding to the same inputs.
30 epochs
150 epochs
In this case, the default accuracy reported went from .2118 (after 30 epochs) to .2243 (after 150), but clearly the inputs are now getting differentiated. If this were a classifier and we were just looking at the training accuracy (and didn't look at sample inferences) we might reasonably assume that all those additional training epochs were pointless.
But think of it this way - evaluating the ability of a model to, for example, classify a picture of a bird as a bird with labeled data is quite different than evaluating the ability of a model to encapsulate the idea of a sentence or phrase and respond appropriately with a sequence of characters that forms a coherent thought, so the default training metrics aren't as useful.
Another thing we might suspect when we see an oscillating accuracy is that our learning rate is too high or not sufficiently adaptive. I was using rmsprop - maybe Adam would address this problem? But after 30 epochs with Adam, acc was .2085 and starting to oscillate.
At this point, looking at my results, it's clear that training on movie dialogue is just going to produce movie-dialogue-ish text which isn't inherently helpful and not that easy to assess in terms of 'accuracy'. You would expect movie dialogue to have a 'spark of life' - originality, the unexpected, a difference in tone between characters, etc. So at this point, if I want a better chatbot, I need more applicable training data.

How to get my class prediction LSTM Keras

I have recently been trying to build a program that classifies the Quora question-pair dataset, i.e. whether a pair is a duplicate or not. I get the accuracy and loss based on the real y, but I don't know how to obtain the output (predicted y). Can anyone help me?
The output should be 1 or 0 (binary class).
This is the sentence-merging code; the training process uses an LSTM.
merged = RNN(EMBED_HIDDEN_SIZE)(merged)
merged = layers.Dropout(dropoutp)(merged)
preds = layers.Dense(answer_size, activation='sigmoid')(merged)
model = Model([questiona, questionb], preds)
rmsprop = keras.optimizers.rmsprop(lr=lrn)
You can get the predictions by passing the test data to the model's predict function.
See the Keras docs for details.
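With a sigmoid output layer, the values returned by predict are probabilities between 0 and 1, so one more step is needed to get 0/1 classes. A minimal sketch, assuming a single output unit and a 0.5 threshold; the probabilities below are made-up stand-ins for what `model.predict` would return, and `to_class_labels` is a hypothetical helper name:

```python
def to_class_labels(probabilities, threshold=0.5):
    """Turn per-sample sigmoid outputs (one value per row) into 0/1 labels."""
    return [int(row[0] > threshold) for row in probabilities]


# Made-up values standing in for model.predict([test_questiona, test_questionb]).
predicted_probabilities = [[0.91], [0.08], [0.47]]

predicted_classes = to_class_labels(predicted_probabilities)
print(predicted_classes)
```

The threshold is a design choice: 0.5 is the usual default, but it can be moved if false positives and false negatives have different costs.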

