Supervised Extractive Text Summarization - python

I want to extract potential sentences from news articles which can be part of article summary.
After spending some time on this, I found that it can be achieved in two ways:
Extractive Summarization (Extracting sentences from text and clubbing them)
Abstractive Summarization (internal language representation to generate more human-like summaries)
Reference: rare-technologies.com
I followed abigailsee's Get To The Point: Summarization with Pointer-Generator Networks, which was producing good results with the pre-trained model, but it is abstractive.
The Problem:
Most of the extractive summarizers that I have looked at so far (PyTeaser, PyTextRank and Gensim) are not based on supervised learning but on a Naive Bayes classifier, tf–idf, POS tagging, sentence ranking based on keyword frequency, position, etc., which don't require any training.
A few things that I have tried so far to extract potential summary sentences:
Get all sentences of articles and label summary sentences as 1 and 0 for all others
Clean up the text and apply stop word filters
Vectorize the text corpus using Tokenizer (from keras.preprocessing.text import Tokenizer) with a vocabulary size of 20000, and pad all sequences to the average sentence length (a rough sketch of this step follows the model code below)
Build a Sequential Keras model and train it.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=sentence_avg_length))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
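As a side note, here is a rough sketch of what step 3 above (tokenizing and padding) might look like; this is an assumption, not the original code, and it presumes a hypothetical list of cleaned sentence strings called sentences:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Fit a 20000-word vocabulary on the cleaned sentences (`sentences` is a placeholder).
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Pad every sequence to the average sentence length.
sentence_avg_length = int(sum(len(s) for s in sequences) / len(sequences))
X = pad_sequences(sequences, maxlen=sentence_avg_length)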
Training this model is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
Any guidance on approach to solve this problem would be appreciated.

I think this is because the above model is more suitable for
positive/negative sentences rather than summary/non-summary sentences
classification.
That's right. The above model is set up for binary classification, not text summarization. If you notice, the output layer (Dense(1, activation='sigmoid')) only gives you a score between 0 and 1, while in text summarization we need a model that generates a sequence of tokens.
What should I do?
The dominant approach to this problem is the encoder-decoder (also known as seq2seq) model. There is a nice tutorial in the Keras repository which is used for machine translation, but it is fairly easy to adapt it for text summarization.
The main part of the code is:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
Based on the above implementation, it is necessary to pass encoder_input_data, decoder_input_data and decoder_target_data to model.fit(), which are, respectively, the input text and the summarized version of the text.
Note that decoder_input_data and decoder_target_data are the same thing, except that decoder_target_data is one token ahead of decoder_input_data.
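As a hedged illustration (not part of the tutorial itself), the one-token shift used for teacher forcing could be prepared roughly like this; START_ID, END_ID and token_ids are hypothetical placeholders:
# Hypothetical sketch of the one-token shift (teacher forcing).
summary_ids = [START_ID] + token_ids + [END_ID]   # e.g. [1, 57, 9, 304, 2]
decoder_input_ids  = summary_ids[:-1]             # [1, 57, 9, 304]
decoder_target_ids = summary_ids[1:]              # [57, 9, 304, 2]  (one token ahead)
In the tutorial, such sequences are additionally one-hot encoded into the 3D decoder_input_data and decoder_target_data arrays.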
Training this model is giving very low accuracy ~0.2
I think this is because the above model is more suitable for
positive/negative sentences rather than summary/non-summary sentences
classification.
Low accuracy can be caused by various factors, including a small training set, overfitting, underfitting, etc.

Related

Analysing features of text in tensorflow

I am relatively new to TensorFlow and the idea of auto-encoders, but I was trying to create an auto-encoder for text, and I didn't want to use this code because it measures the error token by token.
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

array = np.zeros((2000, 20))   # placeholder for the actual data
epochs = 10                    # arbitrary; not specified in the original
model = keras.Sequential()
model.add(Dense(20))
model.add(Dense(5))
model.add(Dense(20))
model.compile(optimizer='Adam', loss='mse')
model.fit(array, array, epochs=epochs)
This code isn't ideal because it tries to reconstruct the text piece by piece, which would be fine for images but not for text, because text relies on what came before it. Is there a way to have a neural network compare the conceptual similarities of the output and input, similar to PCA with the components being the encoded layer? Or is there a way to store components of a text without an auto-encoder, like in this video? https://www.youtube.com/watch?v=_ry6S-Dc2X8 http://codeparade.net/classes/
I want the loss to be the conceptual difference between the output and the input, instead of token-level accuracy. He mentions something about PCA (principal component analysis) in the video.
I realized that instead of defining the loss based on whether it got each token correct, I could have it guess a one-hot encoded version of the sample, which shows that it can identify the sample without trying to spell. Like this:
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

array = np.zeros((2000, 20))   # placeholder for the actual data
onehot = np.eye(2000)          # one unique one-hot label per sample, shape (2000, 2000)
epochs = 10                    # arbitrary; not specified in the original
model = keras.Sequential()
model.add(Dense(20))
model.add(Dense(5))
model.add(Dense(20))
model.add(Dense(2000))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(array, onehot, epochs=epochs)
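As a side note, the 5-unit middle layer in the sketch above is what would play the role of the PCA-like "components" mentioned earlier; a minimal, hypothetical way to read it out with Keras after training (reusing model and array from the snippet above):
# Hypothetical: build a sub-model that stops at the 5-unit bottleneck layer.
encoder = keras.Model(inputs=model.input, outputs=model.layers[1].output)
components = encoder.predict(array)   # shape (2000, 5): one 5-d code per sample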

Fine tuning of Bert word embeddings

I would like to load a pre-trained BERT model and fine-tune it, in particular the word embeddings of the model, using a custom dataset.
The task is to use the word embeddings of chosen words for further analysis.
It is important to mention that the dataset consists of tweets and there are no labels.
Therefore, I used the BertForMaskedLM model.
Is it OK for this task to use the input ids (the tokenized tweets) as the labels?
I have no labels. There are just tweets in randomized order.
From this point on, I present the code I wrote.
First, I cleaned the dataset of emojis, non-ASCII characters, etc., as described in the following link (Section 2.3):
https://www.kaggle.com/jaskaransingh/bert-fine-tuning-with-pytorch
Second, the code for the fine-tuning process:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForMaskedLM, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.to(device)
model.train()

lr = 1e-2
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)
max_grad_norm = 1.0  # not shown in the original snippet; 1.0 is a common default
max_len = 82
chunk_size = 20
epochs = 20

for epoch in range(epochs):
    epoch_losses = []
    for j, batch in enumerate(pd.read_csv(path + file_name, chunksize=chunk_size)):
        tweets = batch['content_cleaned'].tolist()
        encoded_dict = tokenizer.batch_encode_plus(
            tweets,                      # Sentences to encode.
            add_special_tokens=True,     # Add '[CLS]' and '[SEP]'
            max_length=max_len,          # Pad & truncate all sentences.
            pad_to_max_length=True,
            truncation=True,
            return_attention_mask=True,  # Construct attn. masks.
            return_tensors='pt',         # Return pytorch tensors.
        )
        input_ids = encoded_dict['input_ids'].to(device)
        # Is it correct? or should I train it in another way?
        loss, _ = model(input_ids, labels=input_ids)
        loss_score = loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained(path + "Fine_Tuned_BertForMaskedLM")
The loss starts at around 50 and decreases to about 2.3.
Since the objective of the masked language model is to predict the masked token, the label and the inputs are the same. So, whatever you have written is correct.
However, I would like to add a note on comparing word embeddings. BERT is not a static word-embedding model; it is contextual, in the sense that the same word can have different embeddings in different contexts. Example: the word 'talk' will have different embeddings in the sentences "I want to talk" and "I will attend a talk". So there is no single embedding vector per word (which makes BERT different from word2vec or fastText).
Masked language modelling (MLM) on a pre-trained BERT is usually performed when you have a small new corpus and want your BERT model to adapt to it. However, I am not sure about the performance gain you would get by using MLM and then fine-tuning on a specific task, compared to directly fine-tuning the pre-trained model with a task-specific corpus on a downstream task.
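To illustrate the contextual-embedding point, here is a minimal sketch (not from the original answer) that extracts the last-layer vector of a chosen word in two different sentences; it assumes a recent transformers version where the model returns an output object with last_hidden_state:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def word_embedding(sentence, word):
    # Return the last-layer hidden state of the (first) sub-token matching `word`.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    idx = tokens.index(word)  # assumes `word` is a single WordPiece token
    return outputs.last_hidden_state[0, idx]

e1 = word_embedding("I want to talk", "talk")
e2 = word_embedding("I will attend a talk", "talk")
print(torch.cosine_similarity(e1, e2, dim=0))  # < 1.0: the two 'talk' vectors differ by context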

Tensorflow BERT for token-classification - exclude pad-tokens from accuracy while training and testing

I'm doing token-based classification using the pre-trained BERT-model for tensorflow to automatically label cause and effects in sentences.
To access BERT, I'm using the TFBertForTokenClassification-Interface from huggingface: https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification
The sentences I use for training are all converted to tokens (basically a mapping of words to numbers) according to the BERT tokenizer and then padded to a certain length before training. So when one sentence has 50 tokens and another one has only 30, the first one is filled up with 50 pad tokens and the second one with 70 of them, to get a universal input sentence length of 100.
I then train my model to predict on every token which label this token belongs to; whether it is part of the cause, the effect or none of them.
However, during training and evaluation, my model does predictions on the PAD-tokens as well and they are also included in the accuracy of the model. As PAD-tokens are very easy to predict for the model (they always have the same token and they all have the "none" label which means they neither belong to the cause nor the effect of the sentence), they really distort my model's accuracy.
For example, if you have a sentence which has 30 words -> 30 tokens and you pad all sentences to a length of 100, then this sentence would get a score of 70% even if the model predicted none of the "real" tokens correctly.
This way I'm getting training and validation accuracies of 90+% really quickly, although the model performs poorly on the real (non-pad) tokens.
I thought the attention mask was there to solve this problem, but this doesn't seem to be the case.
The input-datasets are created as follows:
def example_to_features(input_ids, attention_masks, token_type_ids, label_ids):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks}, label_ids

train_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_train, attention_masks_train, token_ids_train, label_ids_train)
).map(example_to_features).shuffle(buffer_size=1000).batch(32)
Model creation:
from transformers import TFBertForTokenClassification
num_epochs = 30
model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
model.layers[-1].activation = tf.keras.activations.softmax
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.summary()
And then I train it like this:
history = model.fit(train_ds, epochs=num_epochs, validation_data=validate_ds)
Has anyone encountered this problem so far, or does anyone know how to exclude the predictions on pad tokens from the model's accuracy during training and evaluation?
Yes, this is normal.
The output of BERT [batch_size, max_seq_len = 100, hidden_size] will include values or embeddings for [PAD] tokens as well. However, you also provide attention_masks to the BERT model so that it does not take into consideration these [PAD] tokens.
Similarly, you need to MASK these [PAD] tokens before passing the BERT results to the final fully-connected layer, mask them when you are calculating loss, and also for calculating metrics like precision and recall.
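One way to do that masking for the loss and the accuracy is with custom functions that ignore padded positions. The following is a rough sketch, not a drop-in fix for the code above: it assumes the [PAD] positions in label_ids are re-labelled with a sentinel value (here -100, a common convention) so they can be masked out:
import tensorflow as tf

PAD_LABEL = -100  # sentinel used for [PAD] positions in label_ids (an assumption)

def masked_loss(y_true, y_pred):
    # from_logits mirrors the compile() call above; adjust if your final layer applies softmax.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    mask = tf.cast(tf.not_equal(y_true, PAD_LABEL), tf.float32)
    y_true_safe = tf.where(tf.equal(y_true, PAD_LABEL), tf.zeros_like(y_true), y_true)
    per_token = loss_fn(y_true_safe, y_pred)
    return tf.reduce_sum(per_token * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

def masked_accuracy(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, PAD_LABEL), tf.float32)
    preds = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
    matches = tf.cast(tf.equal(tf.cast(y_true, tf.int32), preds), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

model.compile(optimizer=optimizer, loss=masked_loss, metrics=[masked_accuracy])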

Keras: good result with MLP but bad with Bidirectional LSTM

I trained two neural networks with Keras: a MLP and a Bidirectional LSTM.
My task is to predict the word order in a sentence, so for each word the neural network has to output a real number. When a sentence with N words is processed, the N real numbers in the output are ranked to obtain integers representing the word positions (a tiny sketch of this ranking step follows below).
I'm using the same dataset and the same preprocessing for both networks. The only difference is that for the LSTM dataset I added padding to get sequences of the same length.
In the prediction phase, with the LSTM, I exclude the predictions produced from padding vectors, since I masked them in the training phase.
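A tiny illustration of that ranking step, with made-up numbers (one possible way to turn the predicted reals into integer positions):
import numpy as np

preds = np.array([0.7, 0.1, 0.4])          # one real number per word
positions = np.argsort(np.argsort(preds))  # -> array([2, 0, 1]): each word's predicted position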
MLP architecture:
mlp = keras.models.Sequential()

# add input layer
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1],
        input_shape=(training_dataset.shape[1],),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        activation='relu')
)

# add hidden layer
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1] + 10,
        input_shape=(training_dataset.shape[1] + 10,),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='relu')
)

# add output layer
mlp.add(
    keras.layers.Dense(
        units=1,
        input_shape=(1,),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='linear')
)
Bidirection LSTM architecture:
model = tf.keras.Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(Bidirectional(LSTM(units=20, return_sequences=True), input_shape=(timesteps, features)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
In theory the task should be better suited to an LSTM, which should capture dependencies between words well.
However, with the MLP I achieve good results, while with the LSTM the results are very bad.
Since I'm a beginner, could someone help me understand what is wrong with my LSTM architecture? I'm going out of my mind.
Thanks in advance.
For this problem, I am actually not surprised that MLP performs better.
The architecture of LSTM, bi-directional or not, assumes that location is very important to the structure. Words next to each other are more likely to be related than words farther away.
But for your problem you have removed the locality and are trying to restore it. For that problem, an MLP which has global information can do a better job at the sorting.
That said, I think there is still something to be done to improve the LSTM model.
One thing you can do is ensure that the complexity of each model is similar. You can do this easily with count_params.
mlp.count_params()
model.count_params()
If I had to guess, your LSTM is much smaller. There are only 20 units, which seems small for an NLP problem. I used 512 for a Product Classification problem to process character-level information (vocabulary of size 128, embedding of size 50). Word-level models trained on bigger data sets, like AWD-LSTM, get into the thousands of units.
So you probably want to increase that number. You can get an apples-to-apples comparison between the two models by increasing the number of units in the LSTM until the parameter counts are similar. But you don't have to stop there, you can keep increasing the size until you start to overfit or your training starts taking too long.
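For example, a quick (hypothetical) sweep reusing the layers from the model above shows how the parameter count grows with the number of units; timesteps and features are assumed to be defined as in the original model:
import tensorflow as tf
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, Dense

# Grow the BiLSTM until count_params() is in the same ballpark as mlp.count_params().
for units in (20, 64, 128, 256, 512):
    m = tf.keras.Sequential([
        Masking(mask_value=0., input_shape=(timesteps, features)),
        Bidirectional(LSTM(units=units, return_sequences=True)),
        Dropout(0.2),
        Dense(1, activation='linear'),
    ])
    print(units, m.count_params())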

keras - seq2seq model predicting same output for all test inputs

I am trying to build a seq2seq model using LSTMs in Keras. I am currently working on the English-French pairs dataset, 10k pairs (the original dataset has 147k pairs). After training is completed, when trying to predict the output for a given input sequence, the model predicts the same output irrespective of the input sequence. I am also using a separate embedding layer for both the encoder and the decoder.
What I observe is that the predicted words are nothing but the most frequent words in the dataset, and they are displayed in decreasing order of their frequency.
e.g.:
'I know you', 'Can we go?', 'snap out of it' -- for all these input sequences the output is 'je suis en train' (the same output for all three).
Can anyone help me understand why the model is behaving like this? Am I missing something basic?
I tried the following with batchsize=32, epochs=50, maxinp=8, maxout=8, embeddingsize=100.
encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
encoder_lstm1 = LSTM(units=HIDDEN_UNITS, return_state=True, name="encoder_lstm1" , stateful=False, dropout=0.2)
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm1(encoder_inputs)
encoder_states = [encoder_state_h, encoder_state_c]
decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
decoder_lstm = LSTM(units=HIDDEN_UNITS, return_sequences=True, return_state=True, stateful=False,
                    name='decoder_lstm', dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)
self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(self.model.summary())
self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Xtrain, Xtest, Ytrain, Ytest = train_test_split(input_texts_word2em, self.target_texts, test_size=0.2, random_state=42)
train_gen = generate_batch(Xtrain, Ytrain, self)
test_gen = generate_batch(Xtest, Ytest, self)
train_num_batches = len(Xtrain) // BATCH_SIZE
test_num_batches = len(Xtest) // BATCH_SIZE
self.model.fit_generator(generator=train_gen, steps_per_epoch=train_num_batches,
                         epochs=NUM_EPOCHS,
                         verbose=1, validation_data=test_gen, validation_steps=test_num_batches)  # , callbacks=[checkpoint])
self.encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_inputs = [Input(shape=(HIDDEN_UNITS,)), Input(shape=(HIDDEN_UNITS,))]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
self.decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
Update: I have run the model on the 147k dataset with 5 epochs and the results vary for every input. Thanks for the help.
However, now I am running the same model with another dataset whose input and output sequences contain, on average, 170 and 100 words respectively after cleaning (removing stopwords and such).
This dataset has around 30k records, and even if I run for 50 epochs, the results are again the same for every test sentence. So what are my next options to try? I was expecting at least different outputs for different inputs (even if wrong), but getting the same output adds to the frustration of wondering whether the model is learning properly at all.
Any answers?
An LSTM-based encoder-decoder (seq2seq) that is correctly set up may produce the same output for any input when the net has not been trained for enough epochs. I can reliably reproduce the "same stupid output no matter what the input" result simply by reducing my number of epochs from 200 to 30.
One thing that may be confusing is that default accuracy measures may not seem to improve much, as you go from 30 to 150 epochs. However, in cases where you are using Seq2Seq for chatbot or translation tasks, something like the BLEU score is more relevant. Even more relevant is your own evaluation of the 'realism' of the responses. These evaluations are done at inference time, not during training - but they can influence your evaluation of whether or not the model has trained sufficiently.
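For reference, here is a minimal sketch of computing a BLEU score with NLTK (an assumption on tooling, not part of the original answer); it expects tokenized reference and candidate sentences:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['je', 'suis', 'en', 'train', 'de', 'partir']]  # list of reference token lists (made-up example)
candidate = ['je', 'suis', 'en', 'train']                    # model output tokens

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))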
Another thing to consider is that, in my case, the training data was movie dialogue from the Cornell set. Chatbot dialogue is supposed to be helpful - movie dialogue has no such mandate - it's just supposed to be interesting. So we do have a mismatch here between training and use.
In any event, here are examples of the exact same net trained after 30 vs. 150 epochs, responding to the same inputs.
[Screenshots: sample responses after 30 epochs vs. 150 epochs]
In this case, the default keras.Model.fit() accuracy reported went from .2118 (after 30 epochs) to .2243 (after 150), but clearly the inputs are now getting differentiated. If this were a classifier and we were just looking at the training accuracy (and didn't look at sample inferences) we might reasonably assume that all those additional training epochs were pointless.
But think of it this way - evaluating the ability of a model to, for example, classify a picture of a bird as a bird with labeled data is quite different than evaluating the ability of a model to encapsulate the idea of a sentence or phrase and respond appropriately with a sequence of characters that forms a coherent thought, so the default training metrics aren't as useful.
Another thing we might suspect when we see an oscillating accuracy is that our learning rate is too high or not sufficiently adaptive. I was using rmsprop - maybe Adam would address this problem? But after 30 epochs with Adam, acc was .2085 and starting to oscillate.
At this point, looking at my results, it's clear that training on movie dialogue is just going to produce movie-dialogue-ish text which isn't inherently helpful and not that easy to assess in terms of 'accuracy'. You would expect movie dialogue to have a 'spark of life' - originality, the unexpected, a difference in tone between characters, etc. So at this point, if I want a better chatbot, I need more applicable training data.
