I have created an NLP classification model with Keras without any problems, and it shows 83.5% accuracy upon evaluation. However, when I use the model to predict a new set of tokenized words, it returns x arrays, where x is the number of tokens in the tokenized sentence I give it to predict.
Here is a code example:
toPredict = np.array([1, 2])
prediction = self.model.predict(toPredict)
print(prediction)
The values 1 and 2 are obviously just token values, but this returns the output
[[0.24091144 0.20921658 0.3415633  0.20830865]
 [0.20159791 0.46421158 0.19968869 0.13450184]]
I may be missing something, but I thought the output would be a single array classifying the whole tokenized sentence, not one per word. Am I feeding the model a badly formatted input? Please help!
To predict, you should feed the model input in the same shape as the training data: the sequence must be 2-dimensional and padded to the same length you used when padding the training sequences. You can call tf.expand_dims(toPredict, 0) and then feed that into the model.
For instance, here I define a function for prediction:
# prediction helper
def predict_text(
        input_text, tokenizer, model,   # input text, fitted tokenizer and trained model
        maxlen_seq,                     # maximum sequence length used during training
        padding='post', truncating='post'):
    text = str(input_text)
    sequence = tokenizer.texts_to_sequences([text])
    sequence = keras.preprocessing.sequence.pad_sequences(
        sequence, maxlen=maxlen_seq, padding=padding, truncating=truncating)
    predict = model.predict(sequence)
    return predict
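For example (assuming tokenizer, model and maxlen_seq are the fitted tokenizer, the trained model and the padding length used during training; the input text is made up):
result = predict_text("this movie was great", tokenizer, model, maxlen_seq=50)
print(result)   # one row of class probabilities for the whole sentence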
I would like to load a pre-trained BERT model and fine-tune it, particularly its word embeddings, using a custom dataset.
The task is to use the word embeddings of chosen words for further analysis.
It is important to mention that the dataset consists of tweets and there are no labels.
Therefore, I used the BertForMaskedLM model.
Is it OK for this task to use the input ids (the tokenized tweets) as the labels?
I have no labels. There are just tweets in randomized order.
From this point, I present the code I wrote:
First, I cleaned the dataset of emojis, non-ASCII characters, etc., as described in the following link (Section 2.3):
https://www.kaggle.com/jaskaransingh/bert-fine-tuning-with-pytorch
Second, the code for the fine-tuning process:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForMaskedLM, AdamW  # or pytorch_transformers, depending on your version

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.to(device)
model.train()

lr = 1e-2
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)
max_len = 82
chunk_size = 20
epochs = 20
max_grad_norm = 1.0  # not shown in the original snippet; assumed gradient-clipping threshold

for epoch in range(epochs):
    epoch_losses = []
    for j, batch in enumerate(pd.read_csv(path + file_name, chunksize=chunk_size)):
        tweets = batch['content_cleaned'].tolist()
        encoded_dict = tokenizer.batch_encode_plus(
            tweets,                       # Sentences to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'
            max_length=max_len,           # Pad & truncate all sentences.
            pad_to_max_length=True,
            truncation=True,
            return_attention_mask=True,   # Construct attn. masks.
            return_tensors='pt',          # Return pytorch tensors.
        )
        input_ids = encoded_dict['input_ids'].to(device)

        # Is it correct? or should I train it in another way?
        loss, _ = model(input_ids, labels=input_ids)
        loss_score = loss.item()
        epoch_losses.append(loss_score)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained(path + "Fine_Tuned_BertForMaskedLM")
The loss starts at around 50 and decreases to 2.3.
Since the objective of the masked language model is to predict the masked token, the labels and the inputs are the same. So, whatever you have written is correct.
However, I would like to add something on the concept of comparing word embeddings. BERT is not a word-embeddings model; it is contextual, in the sense that the same word can have different embeddings in different contexts. For example, the word 'talk' will have different embeddings in the sentences "I want to talk" and "I will attend a talk". So there is no single embedding vector per word (which is what makes BERT different from word2vec or fastText).
Masked language modelling (MLM) on a pre-trained BERT is usually performed when you have a small new corpus and want your BERT model to adapt to it. However, I am not sure how much of a performance gain you would get from running MLM and then fine-tuning on a specific task, compared to directly fine-tuning the pre-trained model on the downstream task with the task-specific corpus.
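If the goal is to use the embeddings of chosen words for further analysis, a minimal sketch of extracting contextual vectors from the saved model could look like this (assuming a recent transformers version, roughly 4.x, where the tokenizer is callable and model outputs expose last_hidden_state; the sentence here is just an illustration):
import torch
from transformers import BertTokenizer, BertModel

# Load the fine-tuned encoder weights; the unused MLM head is simply discarded.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained(path + "Fine_Tuned_BertForMaskedLM")
model.eval()

sentence = "I want to talk"
inputs = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, seq_len, hidden_size); position 0 is [CLS]
token_vectors = outputs.last_hidden_state[0]

# Contextual vector of the word "talk" at its token position
talk_id = tokenizer.convert_tokens_to_ids('talk')
position = (inputs['input_ids'][0] == talk_id).nonzero()[0].item()
talk_vector = token_vectors[position]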
I am using the GPT-2 pre-trained model. The code I am working on takes a sentence and generates the next word for that sentence. I want to print multiple predictions, for example the three predictions with the best probabilities.
For example, if I put in the sentence "It's an interesting ...."
predictions: "Books" "story" "news"
Is there a way I can modify this code to show these predictions instead of just one?
Also, there are two parts of the code I do not understand: what is the meaning of the numbers in predictions[0, -1, :], and why do we take [0] in predictions = outputs[0]?
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Encode a text inputs
text = "The fastest car in the "
indexed_tokens = tokenizer.encode(text)
# Convert indexed tokens in a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()
# If you have a GPU, put everything on cuda
#tokens_tensor = tokens_tensor.to('cuda')
#model.to('cuda')
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]
#print(predictions)
# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
# Print the predicted word
#print(predicted_index)
print(predicted_text)
The result for the above code will be :
The fastest car in the world.
You can use torch.topk as follows:
predicted_indices = [x.item() for x in torch.topk(predictions[0, -1, :], k=3)[1]]
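As for the two parts you asked about: outputs[0] is the logits tensor of shape (batch_size, sequence_length, vocab_size), so predictions[0, -1, :] selects the vocabulary scores for the last position of the first (and only) sequence in the batch. To show the top three predictions as words, a small follow-up sketch using the tokenizer and predicted_indices from above:
# Decode each of the top-3 indices back into its sub-word string
predicted_words = [tokenizer.decode([idx]) for idx in predicted_indices]
print(predicted_words)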
I am trying to train my model to generate sentences no longer than 210 characters. From what I have read, I have only seen training on 'continuous' text, like a book. However, I am trying to train my model on single sentences.
I'm pretty new to TensorFlow and ML, so right now I am able to train my model, but it generates garbage, seemingly random text. I have 10,000 sentences, so I think I have sufficient data.
Overview of my data
Structure [['SENTENCE'], ['SENTENCE2']...]
Data Prep
tokenizer = keras.preprocessing.text.Tokenizer(num_words=209, lower=False, char_level=True, filters='#$%&()*+-<=>#[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(df['title'].values)
df['encoded_with_keras'] = tokenizer.texts_to_sequences(df['title'].values)
dataset = df['encoded_with_keras'].values
dataset = tf.keras.preprocessing.sequence.pad_sequences(dataset, padding='post')
dataset = dataset.flatten()
dataset = tf.data.Dataset.from_tensor_slices(dataset)
sequences = dataset.batch(seq_len+1, drop_remainder=True)
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt
dataset = sequences.map(create_seq_targets)
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
Model
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None],
                        input_length=209, mask_zero=True))
    model.add(LSTM(rnn_neurons, return_sequences=True, stateful=True))
    model.add(Dropout(0.2))
    model.add(Dense(258, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(optimizer='adam', loss="sparse_categorical_crossentropy")
    return model
When I give the model a sequence to start from I get back absolute nonsense and eventually the model predicts a 0 which is not in the char_index mapping.
Edit
Text Generation
epochs = 2

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

model = create_model(vocab_size=vocab_size,
                     embed_dim=embed_dim,
                     rnn_neurons=rnn_neurons,
                     batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string):
    num_generate = 200
    input_eval = [char_2_index[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1

    # model.reset_states()
    for i in range(num_generate):
        print(text_generated)
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        print(predicted_id)
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(index_2_char[predicted_id])

    return (start_string + ''.join(text_generated))
There are a few things that must be changed at first sight.
The Tokenizer must have num_words = vocab_size.
At first glance (I didn't analyse it deeply), I can't see why you're flattening your dataset and slicing it if it's probably already correctly structured.
You cannot use stateful=True unless you want "batch 2 to be a sequel of batch 1". You have individual sentences, so use stateful=False. (Unless you are training correctly with manual training loops and resetting states for each batch, which is unnecessary trouble in the training phase.)
What you need to check visually:
Input data must have format like:
[
[1,2,3,6,10,4,10, ...up to sentence length - 1...],
[5,6,3,6,7,3,11,... up to sentence length - 1...],
.... up to number of sentences ...
]
Output data must then be:
[
[2,3,6,10,4,10,15 ...], #equal to input data, shifted by 1
[6,3,6,7,3,11,13, ...],
...
]
Print a few rows of them to check if they're correctly preprocessed as intended.
Training will then be easy:
model.fit(input_data, output_data, epochs=....)
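A minimal sketch of building those two arrays, assuming padded is the (num_sentences, max_len) integer array returned by pad_sequences above:
input_data = padded[:, :-1]    # each sentence minus its last character
output_data = padded[:, 1:]    # the same sentence shifted left by one character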
Yes, your model will predict zeros, as you have zeros in your data, that's not weird: you did a pad_sequences.
You can interpret a zero as a "sentence end" in this case, since you did 'post' padding. When your model gives you a zero, it has decided that the sentence it's generating should end at that point; if it was well trained, it will probably keep outputting zeros for that sentence from then on.
Generating new sentences
This part is more complex: you need to rebuild the model, now with stateful=True, and transfer the weights from the trained model to this new model.
Before anything, call model.reset_states().
You will need to manually feed a batch with shape (number_of_sentences=batch_size, 1). This will be the "first character" of each of the sentences it will generate. The output will be the "second character" of each sentence.
Get this output and feed the model with it. It will generate the "third character" of each sentence. And so on.
When all outputs are zero, all sentences are fully generated and you can stop the loop.
Call model.reset_states() again before trying to generate a new batch of sentences.
You can find examples of this kind of predicting here: https://stackoverflow.com/a/50235563/2097240
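A minimal sketch of that loop, assuming gen_model is the stateful copy with the trained weights already transferred, first_chars is an integer array of shape (batch_size, 1) holding each sentence's first character index, and max_len is a hard cap on sentence length:
import numpy as np

gen_model.reset_states()

current = first_chars                    # shape (batch_size, 1)
generated = [current]

for _ in range(max_len):
    probs = gen_model.predict(current)   # shape (batch_size, 1, vocab_size)
    current = np.argmax(probs, axis=-1)  # next character index for every sentence
    generated.append(current)
    if np.all(current == 0):             # zero == padding == "sentence end"
        break

sentences = np.concatenate(generated, axis=1)   # (batch_size, generated_length)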
I have trained a Bi-LSTM model to do NER on a set of sentences. For this I took the different words present, created a mapping between each word and a number, and then built the Bi-LSTM model using those numbers. I then created and pickled that model object.
Now I get a set of new sentences containing words that the model has not seen during training, so these words do not yet have a numeric value. When I test on my previously trained model, it gives an error: it cannot find the words or features because their numeric values do not exist.
To circumvent this error, I gave a new integer value to each new word I see.
However, when I load the model and test it, it gives the error that:
InvalidArgumentError: indices[0,24] = 5444 is not in [0, 5442) [[Node: embedding_14_16/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_14_16/embeddings/read, embedding_14_16/Cast)]]
The training data contains 5445 words including the padding word, i.e. indices [0, 5444].
5444 is the index value I have given to the padding in the test sentences. It is not clear to me why it assumes the index values range over [0, 5442).
I have used the base code available on the following link: https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm
The code:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer
model = Model(input, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

# number of epochs - also used for output file naming
epoch_num = 20
domain = "../data/Laptop_Prediction_Corrected"
output_file_name = domain + "_E" + str(epoch_num) + ".xlsx"
model_name = "../models/Laptop_Prediction_Corrected"
output_model_filename = model_name + "_E" + str(epoch_num) + ".sav"

history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=epoch_num,
                    validation_split=0.1, verbose=1)
max_len is the total number of words in a sentence and n_words is the vocab size. In the model the padding has been done using the following code where n_words=5441:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words)
The padding in the new dataset:
max_len = 50
# this is to pad sentences to the maximum length possible
#-> so all records of X will be of the same length
#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=res_new_word2idx["pad_blank"])
#X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=5441)
I am not sure which of these paddings is correct.
However, the vocab only includes the words in the training data. When I say:
p = loaded_model.predict(X)
How to use predict for text sentences which contain words that are not present in the initial vocab?
You can use the Keras Tokenizer class and its methods to easily tokenize and preprocess the input data. Specify the vocab size when instantiating it, and then call its fit_on_texts() method on the training data to construct a vocabulary from the given texts. After that you can use its texts_to_sequences() method to convert each text string into a list of word indices. The good thing is that only the words in the vocabulary are considered; all the other words are ignored (alternatively, you can map them to a reserved index 1 by passing an oov_token to the Tokenizer):
from keras.preprocessing.text import Tokenizer
# set num_words to limit the vocabulary to the most frequent words
tok = Tokenizer(num_words=n_words)
# you can also pass an arbitrary token as `oov_token` argument
# which will represent out-of-vocabulary words and its index would be 1
# tok = Tokenizer(num_words=n_words, oov_token='[unk]')
tok.fit_on_texts(X_train)
X_train = tok.texts_to_sequences(X_train)
X_test = tok.texts_to_sequences(X_test)   # use the same vocab to convert test data to sequences
You can optionally use pad_sequences function to pad them with zeros or truncate them to make them all have the same length:
from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)
Now, the vocab size would be equal to n_words + 1 if you have not used an oov token, or n_words + 2 if you have used it. Then you can pass the correct number to the embedding layer as its input_dim argument (the first positional argument):
Embedding(correct_num_words, embd_size, ...)
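For example, a minimal sketch of predicting on new sentences (assuming tok, max_len and loaded_model are the tokenizer, padding length and trained model from above; the sentence text is made up):
new_sentences = ["the laptop has an unseenword battery"]

# Unknown words are either dropped or mapped to the oov index (1),
# so the resulting indices always stay inside the embedding's range.
X_new = tok.texts_to_sequences(new_sentences)
X_new = pad_sequences(X_new, maxlen=max_len)

p = loaded_model.predict(X_new)   # per-word tag probabilities, shape (1, max_len, n_tags)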
I am having trouble implementing a model in tensorflow.
I want to program a model that predicts the polarity of a sentiment. To do this I first have to train the model. This is where an error occurs.
I use three variables: sentence (or sentiment), target (the word in the sentence on which the polarity applies) and the polarity itself. The first two variables are strings and the polarity is a vector ([1,0,0] when positive, [0,0,1] when negative).
To make these variables I use tf.placeholder(tf.string). When I run the model I get the following error: "You must feed a value for placeholder tensor 'Placeholder_1' with dtype string"
I also think I have an issue when I want to do operations on the sentence variable: I need to split the sentence into its words, but since sentence is a placeholder, I first have to convert it back to a string.
The input of the model are three vectors: sVec (a vector of all sentences as strings), tVec (a vector of all targets as strings) and pVec (a vector of all polarities as vectors)
I have been dealing with this error for quite some time now, so any help is appreciated. Below you find the code. Thanks in advance.
polarity = tf.placeholder(tf.float32, shape=[1, c_cardi])  # label
sentence = tf.placeholder(tf.string)                       # input data
target = tf.placeholder(tf.string)

def multilayer_perceptron(mi, target, weights, biases):
    # This method works correctly
    ...

def modelA(sentence, target):
    # The error might refer to this:
    sess2 = tf.Session()
    sentence = sess2.run(sentence)  # I need to get the value of the tensor here
    sess2.close()
    words = sentence.split()
    # Code goes on, no error here
    return prediction

def trainModelA(sVec, tVec, pVec):
    prediction = modelA(sentence, target)
    cost_function = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=polarity))
    optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost_function)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(epochs):
            # The error might refer to this part
            sess.run(optimizer, feed_dict={sentence: sVec, target: tVec, polarity: pVec})
            # Code goes on

trainModelA(sVec, tVec, pVec)
EDIT:
It seems to me that the error occurs at the line sentence = sess2.run(sentence). Because I call modelA(sentence, target), I have sentence as an input, but it has not yet been fed a value. However, this seemed like the way to go, so I still don't know what is going on.