Keras seq2seq model predicting the same output for all test inputs (Python)

I am trying to build a seq2seq model using LSTMs in Keras. I am currently working with a 10k-pair subset of the English-to-French pairs dataset (the original dataset has 147k pairs). After training completes, the model predicts the same output regardless of the input sequence. I am also using separate embedding layers for the encoder and the decoder.
What I observe is that the predicted words are simply the most frequent words in the dataset, and they are output in decreasing order of their frequency.
For example:
'I know you', 'Can we go ?', 'snap out of it' -- for all of these input sequences the output is 'je suis en train' (the same output for all three).
Can anyone help me understand why the model is behaving like this? Am I missing something basic?
I tried the following with batch_size=32, epochs=50, max_inp=8, max_out=8, embedding_size=100.
encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
encoder_lstm1 = LSTM(units=HIDDEN_UNITS, return_state=True, name="encoder_lstm1" , stateful=False, dropout=0.2)
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm1(encoder_inputs)
encoder_states = [encoder_state_h, encoder_state_c]
decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
decoder_lstm = LSTM(units=HIDDEN_UNITS, return_sequences=True, return_state=True,
                    stateful=False, name='decoder_lstm', dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)
self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(self.model.summary())
self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Xtrain, Xtest, Ytrain, Ytest = train_test_split(input_texts_word2em, self.target_texts, test_size=0.2, random_state=42)
train_gen = generate_batch(Xtrain, Ytrain, self)
test_gen = generate_batch(Xtest, Ytest, self)
train_num_batches = len(Xtrain) // BATCH_SIZE
test_num_batches = len(Xtest) // BATCH_SIZE
self.model.fit_generator(generator=train_gen, steps_per_epoch=train_num_batches,
                         epochs=NUM_EPOCHS, verbose=1,
                         validation_data=test_gen, validation_steps=test_num_batches)  # , callbacks=[checkpoint])
self.encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_inputs = [Input(shape=(HIDDEN_UNITS,)), Input(shape=(HIDDEN_UNITS,))]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
self.decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
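For completeness, inference with the two models above is typically a step-by-step greedy loop. The sketch below is not from the question: target_word2em (word-to-GloVe-vector lookup for the target language), target_idx2word, the 'SOS'/'EOS' markers and MAX_DECODER_SEQ_LENGTH are assumed names.
import numpy as np

def decode_sequence(input_seq, encoder_model, decoder_model,
                    target_word2em, target_idx2word, max_len=MAX_DECODER_SEQ_LENGTH):
    # Encode the source sentence into the decoder's initial states.
    states = encoder_model.predict(input_seq)
    # Seed the decoder with the start-of-sequence embedding (an assumed marker).
    target_seq = np.zeros((1, 1, GLOVE_EMBEDDING_SIZE))
    target_seq[0, 0, :] = target_word2em['SOS']
    decoded_words = []
    for _ in range(max_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)
        token_index = int(np.argmax(output_tokens[0, -1, :]))
        word = target_idx2word[token_index]
        if word == 'EOS':
            break
        decoded_words.append(word)
        # Feed the predicted word back in and carry the *updated* states forward;
        # reusing the initial states at every step is a common cause of identical outputs.
        target_seq[0, 0, :] = target_word2em.get(word, np.zeros(GLOVE_EMBEDDING_SIZE))
        states = [h, c]
    return ' '.join(decoded_words)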
Update: I have run the model on the full 147k dataset for 5 epochs and the results now vary for every input. Thanks for the help.
However, I am now running the same model on another dataset whose input and output sequences contain on average 170 and 100 words respectively after cleaning (removing stopwords and so on).
This dataset has around 30k records, and even if I run for 50 epochs, the results are again the same for every test sentence. So what are my next options to try? I was expecting at least a different output for different inputs (even if it is wrong); getting the same output makes it even more frustrating to tell whether the model is learning properly at all.
Any answers?

An LSTM-based encoder-decoder (seq2seq) that is correctly set up may still produce the same output for every input when the net has not been trained for enough epochs. I can reliably reproduce the "same stupid output no matter what the input" result simply by reducing my number of epochs from 200 to 30.
One thing that may be confusing is that the default accuracy measures may not seem to improve much as you go from 30 to 150 epochs. However, when you are using seq2seq for chatbot or translation tasks, something like the BLEU score is more relevant. Even more relevant is your own evaluation of the 'realism' of the responses. These evaluations are done at inference time, not during training, but they can still inform your judgment of whether the model has trained sufficiently.
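If you want a number to track alongside the default accuracy, BLEU is straightforward to compute at inference time. A minimal sketch using NLTK (the tokens shown are made up for illustration):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['je', 'sais', 'que', 'tu', 'sais']]   # list of reference token lists
hypothesis = ['je', 'suis', 'en', 'train']          # tokens produced by the model
smooth = SmoothingFunction().method1                # avoids zero scores on short sentences
print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))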
Another thing to consider is that, in my case, the training data was movie dialogue from the Cornell set. Chatbot dialogue is supposed to be helpful - movie dialogue has no such mandate - it's just supposed to be interesting. So we do have a mismatch here between training and use.
In any event, here are examples of the exact same net trained after 30 vs. 150 epochs, responding to the same inputs.
(sample responses after 30 epochs)
(sample responses after 150 epochs)
In this case, the default keras.Model.fit() accuracy reported went from .2118 (after 30 epochs) to .2243 (after 150), but clearly the inputs are now getting differentiated. If this were a classifier and we were just looking at the training accuracy (and didn't look at sample inferences) we might reasonably assume that all those additional training epochs were pointless.
But think of it this way - evaluating the ability of a model to, for example, classify a picture of a bird as a bird with labeled data is quite different than evaluating the ability of a model to encapsulate the idea of a sentence or phrase and respond appropriately with a sequence of characters that forms a coherent thought, so the default training metrics aren't as useful.
Another thing we might suspect when we see an oscillating accuracy is that our learning rate is too high or not sufficiently adaptive. I was using rmsprop; maybe Adam would address this problem? But after 30 epochs with Adam, acc was .2085 and starting to oscillate.
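For reference, switching optimizers or lowering the learning rate only requires passing an optimizer instance to compile(); a minimal sketch, where the 1e-4 value is an arbitrary assumption for illustration:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss='categorical_crossentropy')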
At this point, looking at my results, it's clear that training on movie dialogue is just going to produce movie-dialogue-ish text which isn't inherently helpful and not that easy to assess in terms of 'accuracy'. You would expect movie dialogue to have a 'spark of life' - originality, the unexpected, a difference in tone between characters, etc. So at this point, if I want a better chatbot, I need more applicable training data.

Related

What can I do to help make my TensorFlow network overfit a large dataset?

The reason I am trying to overfit specifically is that I am following the steps for designing a network from "Deep Learning with Python" by François Chollet. This is important, as it is for the final project of my degree.
At this stage, I need to make a network large enough to overfit my data in order to determine a maximal capacity: an upper bound for the size of the networks that I will then optimise.
However, as the title suggests, I am struggling to make my network overfit. Perhaps my approach is naïve, but let me explain my model:
I am using this dataset to train a model to classify stars. Each star must be classified along two dimensions (into both of them): its spectral class (100 classes) and its luminosity class (10 classes).
For example, our sun is a 'G2V': its spectral class is 'G2' and its luminosity class is 'V'.
To this end, I have built a double-headed network, it takes this input data:
DataFrame containing input data
It then splits into two parallel networks.
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

# Create our input layer:
input = keras.Input(shape=(3,), name='observation_data')

# Build our spectral class branch
s_class_branch = layers.Dense(100000, activation='relu', name='s_class_branch_dense_1')(input)
s_class_branch = layers.Dense(500, activation='relu', name='s_class_branch_dense_2')(s_class_branch)
# Spectral class prediction
s_class_prediction = layers.Dense(100,
                                  activation='softmax',
                                  name='s_class_prediction')(s_class_branch)

# Build our luminosity class branch
l_class_branch = layers.Dense(100000, activation='relu', name='l_class_branch_dense_1')(input)
l_class_branch = layers.Dense(500, activation='relu', name='l_class_branch_dense_2')(l_class_branch)
# Luminosity class prediction
l_class_prediction = layers.Dense(10,
                                  activation='softmax',
                                  name='l_class_prediction')(l_class_branch)

# Now we instantiate our model using the layer setup above
scaled_model = Model(input, [s_class_prediction, l_class_prediction])

optimizer = keras.optimizers.RMSprop(learning_rate=0.004)
scaled_model.compile(optimizer=optimizer,
                     loss={'s_class_prediction': 'categorical_crossentropy',
                           'l_class_prediction': 'categorical_crossentropy'},
                     metrics=['accuracy'])

logdir = os.path.join("logs", "2raw100k")
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

scaled_model.fit(
    input_data,
    {'s_class_prediction': spectral_targets,
     'l_class_prediction': luminosity_targets},
    epochs=20,
    batch_size=1000,
    validation_split=0.0,
    callbacks=[tensorboard_callback])
In the code above you can see me attempting a model with two hidden layers in each branch: one layer of 100,000 units followed by another of 500, before going to the output layer. The training targets are one-hot encoded, so there is one output node for every class.
I have tried a wide range of sizes, with one to four hidden layers of anywhere from 500 to 100,000 units, only stopping because I ran out of RAM. I have only used dense layers, apart from trying a normalisation layer to no effect.
Graph of losses
They will all happily train and slowly lower the loss, but they never seem to overfit. I have run networks out to 100 epochs and they still will not overfit.
What can I do to make my network fit the data better? I am fairly new to machine learning, having only been doing this for a year now, so I am sure there is something that I am missing. I really appreciate any help and would be happy to provide the logs shown in the graph.
After a lot more training I think I have this answered. Basically, the network did not have adequate capacity and needed more layers. I had tried more layers earlier, but because I was not comparing against validation data the overfitting was not apparent!
The proof is in the pudding:
So thank you to @Aryagm for their comment, because that let me work it out. As you can see, the validation data (grey and blue) clearly overfits, while the training data (green and orange) does not show it.
If anything, this goes to show why a separate validation set is so important, and I am a fool for not having used one in the first place! Lesson learned.
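For reference, a sketch of the two adjustments combined (deeper branches and a held-out validation split), reusing the names from the snippet above; the new layer widths and the 100-epoch count are arbitrary assumptions:
s_class_branch = layers.Dense(2048, activation='relu', name='s_class_branch_dense_1')(input)
s_class_branch = layers.Dense(1024, activation='relu', name='s_class_branch_dense_2')(s_class_branch)
s_class_branch = layers.Dense(512, activation='relu', name='s_class_branch_dense_3')(s_class_branch)
s_class_prediction = layers.Dense(100, activation='softmax', name='s_class_prediction')(s_class_branch)

l_class_branch = layers.Dense(2048, activation='relu', name='l_class_branch_dense_1')(input)
l_class_branch = layers.Dense(1024, activation='relu', name='l_class_branch_dense_2')(l_class_branch)
l_class_branch = layers.Dense(512, activation='relu', name='l_class_branch_dense_3')(l_class_branch)
l_class_prediction = layers.Dense(10, activation='softmax', name='l_class_prediction')(l_class_branch)

scaled_model = Model(input, [s_class_prediction, l_class_prediction])
scaled_model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.004),
                     loss={'s_class_prediction': 'categorical_crossentropy',
                           'l_class_prediction': 'categorical_crossentropy'},
                     metrics=['accuracy'])
scaled_model.fit(input_data,
                 {'s_class_prediction': spectral_targets,
                  'l_class_prediction': luminosity_targets},
                 epochs=100,
                 batch_size=1000,
                 validation_split=0.2,  # monitoring validation curves is what exposes the overfit
                 callbacks=[tensorboard_callback])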

Tensorflow BERT for token-classification - exclude pad-tokens from accuracy while training and testing

I'm doing token-based classification using the pre-trained BERT model for TensorFlow to automatically label causes and effects in sentences.
To access BERT, I'm using the TFBertForTokenClassification interface from huggingface: https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification
The sentences I use for training are all converted to tokens (basically a mapping of words to numbers) by the BERT tokenizer and then padded to a fixed length before training. So when one sentence has only 50 tokens and another one only 30, the first one is filled up with 50 pad-tokens and the second one with 70 of them, to get a universal input sentence-length of 100.
I then train my model to predict, for every token, which label it belongs to: whether it is part of the cause, the effect, or neither.
However, during training and evaluation, my model makes predictions for the PAD-tokens as well, and they are also included in the model's accuracy. As PAD-tokens are very easy for the model to predict (they always have the same token id and they all carry the "none" label, i.e. they belong to neither the cause nor the effect of the sentence), they really distort my model's accuracy.
For example, if a sentence has 30 words -> 30 tokens and all sentences are padded to a length of 100, then this sentence would get a score of 70% even if the model predicted none of the "real" tokens correctly.
This way I'm getting training and validation accuracies of 90+% really quickly, although the model performs poorly on the real tokens.
I thought the attention mask was there to solve this problem, but this doesn't seem to be the case.
The input-datasets are created as follows:
def example_to_features(input_ids, attention_masks, token_type_ids, label_ids):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks}, label_ids

train_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_train, attention_masks_train, token_ids_train, label_ids_train)
).map(example_to_features).shuffle(buffer_size=1000).batch(32)
Model creation:
from transformers import TFBertForTokenClassification
num_epochs = 30
model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
model.layers[-1].activation = tf.keras.activations.softmax
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.summary()
And then I train it like this:
history = model.fit(train_ds, epochs=num_epochs, validation_data=validate_ds)
Has anyone encountered this problem so far or does know how to exclude the predictions on pad-tokens from the model's accuracy during training and evaluation?
Yes, this is normal.
The output of BERT, of shape [batch_size, max_seq_len = 100, hidden_size], will include values or embeddings for [PAD] tokens as well. However, you also provide attention_masks to the BERT model so that it does not take these [PAD] tokens into consideration.
Similarly, you need to MASK these [PAD] tokens before passing the BERT results to the final fully-connected layer, mask them when you are calculating the loss, and also when calculating metrics like precision and recall.
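A minimal sketch of one way to do that in Keras is shown below. It assumes the padded positions in label_ids are marked with a sentinel id (here -100, a common convention); that sentinel is an assumption about your preprocessing, not something the code above already does.
import tensorflow as tf

PAD_LABEL_ID = -100  # assumed sentinel for padded positions in label_ids

def masked_loss(y_true, y_pred):
    y_true = tf.cast(y_true, tf.int32)
    mask = tf.cast(tf.not_equal(y_true, PAD_LABEL_ID), tf.float32)
    # Replace sentinel labels with a valid class id before computing the per-token loss,
    # then zero those positions out with the mask.
    safe_labels = tf.where(mask > 0, y_true, tf.zeros_like(y_true))
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        safe_labels, y_pred, from_logits=True)  # flip from_logits if your final layer applies softmax
    return tf.reduce_sum(per_token * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

def masked_accuracy(y_true, y_pred):
    y_true = tf.cast(y_true, tf.int32)
    mask = tf.cast(tf.not_equal(y_true, PAD_LABEL_ID), tf.float32)
    preds = tf.cast(tf.argmax(y_pred, axis=-1), tf.int32)
    matches = tf.cast(tf.equal(y_true, preds), tf.float32) * mask
    return tf.reduce_sum(matches) / tf.maximum(tf.reduce_sum(mask), 1.0)

model.compile(optimizer=optimizer, loss=masked_loss, metrics=[masked_accuracy])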

Supervised Extractive Text Summarization

I want to extract potential sentences from news articles which can be part of article summary.
Upon spending some time on this, I found out that it can be achieved in two ways:
Extractive Summarization (Extracting sentences from text and clubbing them)
Abstractive Summarization (internal language representation to generate more human-like summaries)
Reference: rare-technologies.com
I followed abigailsee's Get To The Point: Summarization with Pointer-Generator Networks, which produced good results with the pre-trained model, but it is abstractive.
The Problem:
Most of the extractive summarizers that I have looked at so far (PyTeaser, PyTextRank and Gensim) are not based on supervised learning but on Naive Bayes classifiers, tf-idf, POS-tagging, sentence ranking by keyword frequency, position, etc., which don't require any training.
A few things that I have tried so far to extract potential summary sentences:
Get all sentences of articles and label summary sentences as 1 and 0 for all others
Clean up the text and apply stop word filters
Vectorize the text corpus using Tokenizer (from keras.preprocessing.text import Tokenizer) with a vocabulary size of 20000, and pad all sequences to the average length of all sentences.
Build a Sequential Keras model and train it.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=sentence_avg_length))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
Any guidance on approach to solve this problem would be appreciated.
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
That's right. The above model is used for binary classification, not text summarization. If you notice, the output (Dense(1, activation='sigmoid')) only gives you a score between 0 and 1, while in text summarization we need a model that generates a sequence of tokens.
What should I do?
The dominant idea for tackling this problem is the encoder-decoder (also known as seq2seq) model. There is a nice tutorial in the Keras repository that is used for machine translation, but it is fairly easy to adapt it for text summarization.
The main part of the code is:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
Based on the above implementation, it is necessary to pass encoder_input_data, decoder_input_data and decoder_target_data to model.fit(); these are, respectively, the input text and the summarized version of the text.
Note that decoder_input_data and decoder_target_data are the same thing, except that decoder_target_data is one token ahead of decoder_input_data.
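A small illustration of that one-token offset, using made-up token ids (the BOS/EOS markers are assumptions):
BOS, EOS = 1, 2
target_tokens = [BOS, 17, 42, 8, EOS]      # one target sentence, already tokenized
decoder_input_data = target_tokens[:-1]    # [BOS, 17, 42, 8] -- fed to the decoder
decoder_target_data = target_tokens[1:]    # [17, 42, 8, EOS] -- what it must predict at each step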
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
The low accuracy can be caused by various factors, including a small training set, overfitting, underfitting, etc.

Keras RNN (GRU, LSTM) produces plateau and then improvement

I'm new to Keras (with TensorFlow backend) and am using it to do some simple sentiment analysis on user reviews. For some reason, my recurrent neural network is producing some unusual results that I do not understand.
First, my data is a straightforward sentiment-analysis training and test set from the UCI ML archive. There are 2061 training instances, which is small. The data looks like this:
text label
0 So there is no way for me to plug it in here i... 0
1 Good case, Excellent value. 1
2 Great for the jawbone. 1
3 Tied to charger for conversations lasting more... 0
4 The mic is great. 1
Second, here is an FFNN implementation that produces good results.
# FFNN model.
# Build the model.
model_ffnn = Sequential()
model_ffnn.add(layers.Embedding(input_dim=V, output_dim=32))
model_ffnn.add(layers.GlobalMaxPool1D())
model_ffnn.add(layers.Dense(10, activation='relu'))
model_ffnn.add(layers.Dense(1, activation='sigmoid'))
model_ffnn.summary()
# Compile and train.
model_ffnn.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_ffnn = model_ffnn.fit(x_train, y_train, epochs=EPOCHS,
                              batch_size=128, validation_split=0.2, verbose=3)
As you can see, the learning curves produce a smooth improvement as the number of epochs increases.
Third, here is the problem. I trained a recurrent neural network with a GRU, as shown below. I also tried an LSTM and saw the same results.
# GRU model.
# Build the model.
model_gru = Sequential()
model_gru.add(layers.Embedding(input_dim=V, output_dim=32))
model_gru.add(layers.GRU(units=32))
model_gru.add(layers.Dense(units=1, activation='sigmoid'))
model_gru.summary()
# Compile and train.
model_gru.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_gru = model_gru.fit(x_train, y_train, epochs=EPOCHS,
                            batch_size=128, validation_split=0.2, verbose=3)
However, the learning curves are quite unusual. You can see a plateau where neither the loss nor the accuracy improve up to about epoch 17, and then the model starts learning and improving. I have never seen this type of plateau at the start of training before.
Can anyone explain why this plateau is occurring, why it stops and gives way to gradual learning, and how I can avoid it?
Following the comment by @Gerges Dib, I tried out different learning rates in increasing order.
lr = 0.0001
lr = 0.001 (the default learning rate for RMSprop)
lr = 0.01
lr = 0.05
lr = 0.1
This is very interesting. It looks like the plateau was caused by the optimizer's learning rate being too low. The parameters were stuck in a local optimum until they could break out. I have not seen this pattern before.
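For anyone wanting to reproduce the sweep: the learning rate can be set by passing an optimizer instance instead of the 'rmsprop' string; a minimal sketch (0.01 is just one of the values tried above):
from tensorflow.keras.optimizers import RMSprop

# Pass an explicit optimizer to control the learning rate (older Keras versions use `lr=` instead).
model_gru.compile(optimizer=RMSprop(learning_rate=0.01),
                  loss='binary_crossentropy', metrics=['acc'])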

Acc decreasing to zero in LSTM Keras Training

While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly; this can also be seen in TensorBoard:
Training accuracy:
This is my model:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8, return_sequences=True))
model1.add(LSTM(8, return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the input and output data shapes, since the model itself seems to be fine. The input data has shape (2000, 40, 2) and the output has shape (2000, 1).
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
The TimeDistributed wrapper applies the same Dense layer (same weights) to the LSTM's output for one time step at a time.
I also recommend this tutorial: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/
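Note that TimeDistributed(Dense) expects a sequence, so the last LSTM would need return_sequences=True and the targets would need one label per time step. A minimal sketch under those assumptions (labels of shape (2000, 40, 1) rather than (2000, 1)):
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40, 2)))
model1.add(LSTM(8, return_sequences=True))
model1.add(LSTM(8, return_sequences=True))   # keep the sequence for TimeDistributed
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
model1.compile(optimizer='adagrad', loss='binary_crossentropy', metrics=['accuracy'])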
I was able to increase the accuracy to 97% with a few data-related adjustments. The main obstacle was an unbalanced dataset split between the training and validation sets. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.
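A sketch of the data-side adjustments described above (per-coordinate normalization and a class-balanced split); dataScatter and outputScatter come from the question, the rest is assumed for illustration:
import numpy as np
from sklearn.model_selection import train_test_split

X = dataScatter[:, 70:110, :].astype('float32')
# Normalize each coordinate over the whole dataset.
X = (X - X.mean(axis=(0, 1))) / (X.std(axis=(0, 1)) + 1e-8)

# Stratify the split so both classes are represented in the training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, outputScatter, test_size=0.25, stratify=outputScatter, random_state=42)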
