How to pad sequences during training for an encoder-decoder model - python

I've got an encoder-decoder model for character-level English spelling correction; it's pretty basic stuff, with a two-LSTM encoder and another LSTM decoder.
However, up until now I have been pre-padding the encoder input sequences, like below:
    abc  -> -abc
    defg -> defg
    ad   -> --ad
Next, I have been splitting the data into several groups with the same decoder input length, e.g.
    train_data = {'15': [...], '16': [...], ...}
where the key is the length of the decoder input data, and I have been training the model once for each length in a loop.
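The grouping-by-length scheme described above could look something like this (a minimal sketch with toy data, not the asker's actual code):

```python
from collections import defaultdict

def bucket_by_length(pairs):
    """Group (encoder_input, decoder_input) pairs by decoder input length."""
    buckets = defaultdict(list)
    for enc, dec in pairs:
        buckets[len(dec)].append((enc, dec))
    return dict(buckets)

# made-up toy data: (misspelled, corrected) character sequences
data = [("abc", "abcd"), ("ad", "add"), ("defg", "defg")]
train_data = bucket_by_length(data)
print(sorted(train_data))  # decoder lengths present: [3, 4]
```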
However, there has to be a better way to do this, such as padding after the EOS or before the SOS characters, etc. But if that is the case, how would I change the loss function so that the padding isn't counted towards the loss?

The standard way of doing padding is putting it after the end-of-sequence token, but it shouldn't really matter where the padding goes.
The trick for keeping the padded positions out of the loss is masking them out before reducing the loss. Assuming the PAD_ID variable contains the index of the symbol you use for padding:
def custom_loss(y_true, y_pred):
    mask = 1 - K.cast(K.equal(y_true, PAD_ID), K.floatx())
    loss = K.categorical_crossentropy(y_true, y_pred) * mask
    return K.sum(loss) / K.sum(mask)
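The same masking arithmetic can be checked outside Keras. A minimal NumPy sketch, assuming PAD_ID = 0 and per-token losses that have already been computed:

```python
import numpy as np

PAD_ID = 0  # assumed padding index for this sketch

def masked_mean_loss(per_token_loss, y_true_ids):
    """Average per-token losses, ignoring positions whose label is PAD_ID."""
    mask = (y_true_ids != PAD_ID).astype(float)
    return float((per_token_loss * mask).sum() / mask.sum())

# two sequences of length 4; the second ends with two padded positions
losses = np.array([[0.5, 0.5, 0.5, 0.5],
                   [1.0, 1.0, 9.9, 9.9]])  # the 9.9s sit on padding and must not count
labels = np.array([[3, 7, 2, 9],
                   [4, 6, PAD_ID, PAD_ID]])

print(masked_mean_loss(losses, labels))  # (4 * 0.5 + 2 * 1.0) / 6 = 0.666...
```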


Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (the idea comes mainly from this article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ but I wanted to do most of the stuff myself).
My idea is this: I take a batch of B input sequences of size n (a NumPy array of n integers), one-hot encode them, and pass them through my network composed of several LSTM layers, one fully connected layer, and one softmax unit.
I then compare the output to the target sequences which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following :
class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers,
                            batch_first=True, dropout=dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        # dim=1 as the tensor is of shape (batch_size*seq_length, one_hot_length)
        # when entering the softmax unit
        self.softmax = nn.Softmax(dim=1)

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector),
num_units is the number of hidden units in an LSTM cell, and num_layers is the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows:
    input_seq, target = next_batches(data, batch_pointer)
    input_seq = nn.functional.one_hot(input_seq, num_classes=one_hot_length).float()
    for state in hc_states:
        state.detach_()
    optimizer.zero_grad()
    output, hc_states = net.forward_pass(input_seq, hc_states)
    loss = nn.CrossEntropyLoss()(output, target.view(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
    optimizer.step()
Here hc_states is a tuple with the hidden-states tensor and the cell-states tensor, input_seq is a tensor of size (B, n, one_hot_length), and target is (B, n).
I'm training on a really small dataset (sentences in a .txt of ~400 KB) just to tune my code, and did 4 different runs with different parameters; each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensor shapes, as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual approach is to put a softmax unit at the end to get "probabilities" of each character appearing, but clearly something isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.
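One numeric property worth noting when debugging a setup like this (an observation, not an answer from the thread): PyTorch's nn.CrossEntropyLoss already applies log-softmax to its input internally, so feeding it already-softmaxed outputs squashes the logits into [0, 1] and nearly flattens the resulting distribution. A NumPy illustration of how a second softmax flattens a sharply peaked one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

logits = np.array([0.0, 1.0, 10.0])
once = softmax(logits)   # sharply peaked on the last class
twice = softmax(once)    # softmax of probabilities: almost uniform

print(once.max(), twice.max())  # the peak collapses from ~1.0 to ~0.58
```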

Understanding CTC loss for speech recognition in Keras

I am trying to understand how CTC loss is working for speech recognition and how it can be implemented in Keras.
What I think I understood (please correct me if I'm wrong!)
Roughly, CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech) rather than decoding a block of elements directly (a word, for example).
Let's say we're feeding utterances of some sentences as MFCCs.
The goal in using CTC-loss is to learn how to make each letter match the MFCC at each time step. Thus, the Dense+softmax output layer is composed by as many neurons as the number of elements needed for the composition of the sentences:
alphabet (a, b, ..., z)
a blank token (-)
a space (_) and an end-character (>)
Then, the softmax layer has 29 neurons (26 for the alphabet + 3 special characters).
To implement it, I found that I can do something like this:
# CTC implementation from Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

input_data = Input(shape=(1000, 20))  # let's say each MFCC is (1000 timesteps x 20 features)
x = Bidirectional(LSTM(..., return_sequences=True))(input_data)
x = Bidirectional(LSTM(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)
loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])
model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)
With ALPHABET_LENGTH = 29 (alphabet length + special characters)
And:
y_true: tensor (samples, max_string_length) containing the truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.
(source)
Now, I'm facing some problems:
What I don't understand
Is this implementation the right way to code and use CTC loss?
I do not understand what y_true, input_length, and label_length concretely are. Any examples?
In what form should I give the labels to the network? Again, any examples?
What are these?
y_true: your ground truth data — the data you are going to compare with the model's outputs in training. (On the other hand, y_pred is the model's calculated output.)
input_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_pred tensor (as said here).
label_length: the length (in steps, or chars in this case) of each sample (sentence) in the y_true (or labels) tensor.
It seems this loss expects that your model's outputs (y_pred) have different lengths, as well as your ground truth data (y_true). This is probably to avoid calculating the loss for garbage characters after the end of the sentences (since you need a fixed-size tensor to work with lots of sentences at once).
Form of the labels:
Since the function's documentation asks for shape (samples, length), the format is just that: the char index for each char in each sentence.
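Encoding labels into that format could look something like this (a hypothetical character vocabulary; the index values are illustrative only — CTC uses label_length to know where the real labels end, so the pad value can safely overlap a real index):

```python
# hypothetical character vocabulary: alphabet + blank, space, end-character
CHARS = "abcdefghijklmnopqrstuvwxyz-_>"
CHAR_TO_IDX = {c: i for i, c in enumerate(CHARS)}

def encode_labels(sentences, max_string_length, pad_value=0):
    """Turn sentences into a (samples, max_string_length) list of char indices."""
    batch = []
    for s in sentences:
        idx = [CHAR_TO_IDX[c] for c in s]
        idx += [pad_value] * (max_string_length - len(idx))  # pad to fixed width
        batch.append(idx)
    return batch

labels = encode_labels(["cab", "go>"], max_string_length=5)
print(labels)  # [[2, 0, 1, 0, 0], [6, 14, 28, 0, 0]]
# the matching label_length tensor would be [[3], [3]]
```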
How to use this?
There are some possibilities.
1- If you don't care about lengths:
If all lengths are the same, you can easily use it as a regular loss:
def ctc_loss(y_true, y_pred):
    # where input_length and label_length are constants you created previously
    # the easiest way here is to have a fixed batch size in training
    # the lengths should have the same batch size (see shapes in the link for ctc_cost)
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)

model.compile(loss=ctc_loss, ...)

# here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ...)
2 - If you care about the lengths.
This is a little more complicated: you need your model to somehow tell you the length of each output sentence.
There are again several creative forms of doing this:
Have an "end_of_sentence" char and detect where in the sentence it is.
Have a branch of your model to calculate this number and round it to integer.
(Hardcore) If you are using a stateful manual training loop, get the index of the iteration at which you decided to finish a sentence.
I like the first idea, and will exemplify it here.
def ctc_find_eos(y_true, y_pred):
    # convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)

    # make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
        y_pred_ind[:, :-1],
        eos_index * K.ones_like(y_pred_ind[:, -1:])
    ], axis=1)

    # make sure the first occurrence of the char is more important than subsequent ones
    occurrence_weights = K.arange(start=max_length, stop=0, dtype=K.floatx())

    # is eos?
    is_eos_true = K.cast_to_floatx(K.equal(y_true, eos_index))
    is_eos_pred = K.cast_to_floatx(K.equal(y_pred_end, eos_index))

    # lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

    # reshape
    true_lengths = K.reshape(true_lengths, (-1, 1))
    pred_lengths = K.reshape(pred_lengths, (-1, 1))

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

model.compile(loss=ctc_find_eos, ...)
If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground truth data. Then, in the loss function, just take the section for the lengths:
def ctc_concatenated_length(y_true, y_pred):
    # assuming you concatenated the length in the first step
    true_lengths = y_true[:, :1]  # may need to cast to int
    y_true = y_true[:, 1:]

    # since y_pred uses one-hot, you will need to concatenate to the full size
    # of the last axis, thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

Tensorflow seq2seq chatbot always give the same outputs

I'm trying to make a seq2seq chatbot with TensorFlow, but it seems to converge to the same outputs despite different inputs. The model gives different outputs when first initialized, but quickly converges to the same outputs after a few epochs. This is still an issue even after many epochs with low costs. However, the model seems to do fine when trained with smaller datasets (say 20 samples) but fails with larger ones.
I'm training on the Cornell Movie Dialogs Corpus with a 100-dimensional, 50,000-word-vocabulary GloVe pretrained embedding.
The encoder seems to have very close final states (in ranges of around 0.01) when given totally different inputs. I've tried using a simple LSTM/GRU, a bidirectional LSTM/GRU, a multi-layer/stacked LSTM/GRU, and a multi-layer bidirectional LSTM/GRU. The RNN cells have been tested with 16 to 2048 hidden units. The only difference is that the model tends to output only the start and end tokens (GO and EOS) when it has fewer hidden units.
For multi-layer GRU, here's my code:
cell_encode_0 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_1 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_2 = tf.contrib.rnn.GRUCell(self.n_hidden)
self.cell_encode = tf.contrib.rnn.MultiRNNCell([cell_encode_0, cell_encode_1, cell_encode_2])
# identical decoder
...
embedded_x = tf.nn.embedding_lookup(self.word_embedding, self.x)
embedded_y = tf.nn.embedding_lookup(self.word_embedding, self.y)

_, self.encoder_state = tf.nn.dynamic_rnn(
    self.cell_encode,
    inputs=embedded_x,
    dtype=tf.float32,
    sequence_length=self.x_length
)

# decoder for training
helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=embedded_y,
    sequence_length=self.y_length
)
decoder = tf.contrib.seq2seq.BasicDecoder(
    self.cell_decode,
    helper,
    self.encoder_state,
    output_layer=self.projection_layer
)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=self.max_sequence, swap_memory=True)
return outputs.rnn_output
...
# Optimization
dynamic_max_sequence = tf.reduce_max(self.y_length)
mask = tf.sequence_mask(self.y_length, maxlen=dynamic_max_sequence, dtype=tf.float32)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=self.y[:, :dynamic_max_sequence], logits=self.network())
self.cost = tf.reduce_sum(crossent * mask) / batch_size
self.train_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cost)
For the full code, please see GitHub. (In case you want to test it out, run train.py.)
As for hyperparameters, I've tried learning rates from 0.1 all the way down to 0.0001, and batch sizes from 1 to 32. Other than the regular and expected effects, they do not help with the problem.
After digging around for months, I've finally found the issue. It seems that the RNN requires a GO token in the decoder inputs but not in the outputs (the ones you use for the cost). Basically, the RNN expects its data as follows:
    Encoder input:              GO foo foo foo EOS
    Decoder input/ground truth: GO bar bar bar EOS
    Decoder output:             bar bar bar EOS EOS/PAD
In my code, I included the GO token in both the decoder input and the output, causing the RNN to repeat the same tokens (GO -> GO, bar -> bar). This can easily be fixed by creating an additional variable that drops the first column (the GO tokens) of the ground truth. In NumPy, this looks something like:
    # y is the ground truth, with shape[0] = batch and shape[1] = token index
    np.concatenate([y[:, 1:], np.full([y.shape[0], 1], EOS)], axis=1)
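A quick check of that shift with toy token ids (the GO and EOS values here are assumed for the sketch):

```python
import numpy as np

GO, EOS = 1, 2  # assumed special-token ids for this sketch
y = np.array([[GO, 5, 6, 7, EOS]])  # decoder input / ground truth

# drop the leading GO column and append one more EOS
decoder_output = np.concatenate([y[:, 1:], np.full([y.shape[0], 1], EOS)], axis=1)
print(decoder_output)  # [[5 6 7 2 2]]
```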

Understanding tensorflow RNN encoder input for translation task

I'm following this tutorial (https://theneuralperspective.com/2016/11/20/recurrent-neural-networks-rnn-part-3-encoder-decoder/) to implement an RNN using TensorFlow.
I have prepared the input data for both the encoder and decoder as described, but I'm having trouble understanding how TensorFlow's dynamic_rnn function expects the input for this particular task, so I'm going to describe the data I have in hopes someone has an idea.
I have 20 sentences in English and in French. As described in the link above, each sentence is padded according to the longest sentence in each language, and GO and EOS tokens are added where expected.
Since not all the code is available at the link, I tried writing some of the missing code myself. This is what my encoder code looks like:
with tf.variable_scope('encoder') as scope:
    # Encoder RNN cell
    self.encoder_lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(
        hidden_units, forget_bias=0.0, state_is_tuple=True)
    self.encoder_cell = tf.nn.rnn_cell.MultiRNNCell(
        [self.encoder_lstm_cell] * num_layers, state_is_tuple=True)

    # Embed encoder RNN inputs
    with tf.device("/cpu:0"):
        embedding = tf.get_variable(
            "embedding", [self.vocab_size_en, hidden_units], dtype=data_type)
        self.embedded_encoder_inputs = tf.nn.embedding_lookup(embedding, self.encoder_inputs)

    # Outputs from encoder RNN
    self.encoder_outputs, self.encoder_state = tf.nn.dynamic_rnn(
        cell=self.encoder_cell,
        inputs=self.embedded_encoder_inputs,
        sequence_length=self.seq_lens_en, time_major=False, dtype=tf.float32)
My embeddings are of shape [20, 60, 60], but I think this is incorrect. If I look at the documentation for dynamic_rnn (https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn), the inputs are expected to be of size [batch, max_time, ...]. If I look at this example (https://medium.com/@erikhallstrm/using-the-dynamicrnn-api-in-tensorflow-7237aba7f7ea#.4hrf7z1pd), the inputs are supposed to be [batch, truncated_backprop_length, state_size].
The max length of a sentence for the encoder is 60. This is where I am confused about what exactly the last two numbers of the input shape should be. Should max_time also be the max length of my sentence sequences, or should it be another quantity? And should the last number also be the max length of the sentences' sequences, or the length of the current input without the padding?
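For reference, in dynamic_rnn's [batch, max_time, ...] layout, max_time is the padded sentence length and the last axis is the per-token feature (embedding) width; the unpadded lengths go into the separate sequence_length argument. A NumPy sketch of such a batch, with sizes taken from the question (the random lengths are illustrative):

```python
import numpy as np

batch_size = 20   # 20 sentences
max_time = 60     # longest (padded) sentence length
embed_size = 60   # per-token embedding width (hidden_units in the question)

# each row is one sentence: max_time token embeddings, zero-padded at the end
inputs = np.zeros((batch_size, max_time, embed_size), dtype=np.float32)

# the true (unpadded) length of each sentence goes into sequence_length
seq_lens = np.random.randint(1, max_time + 1, size=batch_size)

print(inputs.shape)  # (20, 60, 60): [batch, max_time, features]
```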

Understanding `tf.nn.nce_loss()` in tensorflow

I am trying to understand the NCE loss function in TensorFlow. NCE loss is employed for word2vec tasks, for instance:
# Look up embeddings for inputs.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
# tf.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
For more details, please see TensorFlow's word2vec_basic.py.
What are the input and output matrices in the NCE function?
In a word2vec model, we are interested in building representations for words. In the training process, given a sliding window, every word has two embeddings: 1) when the word is a centre word; 2) when the word is a context word. These two embeddings are called the input and output vectors, respectively. (More explanations of input and output matrices.)
In my opinion, the input matrix is embeddings and the output matrix is nce_weights. Is it right?
What is the final embedding?
According to a post by s0urcer, also relating to NCE, the final embedding matrix is just the input matrix, while others say final_embedding = input_matrix + output_matrix. Which is right/more common?
Let's look at the relative code in word2vec example (examples/tutorials/word2vec).
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
These two lines create the embedding representations. embeddings is a matrix where each row represents a word vector. embedding_lookup is a quick way to get the vectors corresponding to train_inputs. In the word2vec example, train_inputs consists of int32 numbers representing the ids of the target words. Basically, embed can also be replaced by any hidden-layer feature.
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
These two lines create the parameters. They will be updated by the optimizer during training. We can use tf.matmul(embed, tf.transpose(nce_weights)) + nce_biases to get the final output scores. In other words, the last inner-product layer in a classification network can be replaced by it.
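Shape-wise, that score computation is a matmul against the transposed output weights. A NumPy sketch with made-up sizes:

```python
import numpy as np

batch_size, embedding_size, vocabulary_size = 4, 8, 30  # made-up sizes

rng = np.random.default_rng(0)
embed = rng.standard_normal((batch_size, embedding_size))            # hidden features
nce_weights = rng.standard_normal((vocabulary_size, embedding_size))
nce_biases = np.zeros(vocabulary_size)

# equivalent of tf.matmul(embed, tf.transpose(nce_weights)) + nce_biases
scores = embed @ nce_weights.T + nce_biases
print(scores.shape)  # one score per vocabulary word: (4, 30)
```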
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,   # [vocab_size, embed_size]
                   biases=nce_biases,     # [vocab_size]
                   labels=train_labels,   # [bs, 1]
                   inputs=embed,          # [bs, embed_size]
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
These lines create the NCE loss; #garej has given a very good explanation. num_sampled refers to the number of negative samples in the NCE algorithm.
To illustrate the usage of NCE, we can apply it to the MNIST example (examples/tutorials/mnist/mnist_deep.py) with the following 2 steps:
1. Replace embed with the hidden-layer output. The dimension of the hidden layer is 1024 and num_output is 10. The minimum value of num_sampled is 1. Remember to remove the last inner-product layer in deepnn().
y_conv, keep_prob = deepnn(x)

num_sampled = 1
vocabulary_size = 10
embedding_size = 1024
with tf.device('/cpu:0'):
    embed = y_conv
    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
2. Create the loss and compute the output. After computing the output, we can use it to calculate accuracy. Note that the label here is not a one-hot vector as used in softmax. Labels are the original labels of the training samples.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=y_idx,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

output = tf.matmul(y_conv, tf.transpose(nce_weights)) + nce_biases
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y_, 1))
When we set num_sampled=1, the validation accuracy ends up around 98.8%, and if we set num_sampled=9, we get almost the same validation accuracy as when training with softmax. But note that NCE is different from softmax.
Full code of training mnist by nce can be found here. Hope it is helpful.
The embeddings Tensor is your final output matrix. It maps words to vectors. Use this in your word prediction graph.
The input matrix is a batch of centre-word : context-word pairs (train_input and train_label respectively) generated from the training text.
While the exact workings of the nce_loss op are not yet known to me, the basic idea is that it uses a single-layer network (parameters nce_weights and nce_biases) to map an input vector (selected from embeddings using the embed op) to an output word, then compares the output to the training label (an adjacent word in the training text) and to a random sub-sample (num_sampled) of all other words in the vocab, and then modifies the input vector (stored in embeddings) and the network parameters to minimise the error.
What are the input and output matrices in the NCE function?
Take, for example, the skip-gram model, with this sentence:
the quick brown fox jumped over the lazy dog
the input and output pairs are:
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
For more information please refer to the tutorial.
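The pair generation above can be sketched as follows (a hypothetical helper; window size 1 to match the example pairs):

```python
def skipgram_pairs(tokens, window=1):
    """Yield (centre, context) pairs within the given window."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence)[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```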
What is the final embedding?
The final embedding you should extract is usually the W between the input and hidden layer.
To illustrate more intuitively, take a look at the following picture:
The one-hot vector [0, 0, 0, 1, 0] is the input layer in the graph above, the output is the word embedding [10, 12, 19], and W (in the graph above) is the matrix in between.
For detailed explanation please read this tutorial.
1) In short, it is right in general, but just partly right for the function in question. See the tutorial:
    The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, we need to define the weights and biases for each word in the vocabulary (also called the output weights as opposed to the input embeddings).
So the inputs to the function nce_loss are the output weights and a small part of the input embeddings, among other things.
2) The 'final' embedding (aka word vectors, aka vector representations of words) is what you call the input matrix. Embeddings are the rows (vectors) of that matrix, one per word.
Warning: in fact, this terminology is confusing because of how the input and output concepts are used in an NN environment. The embeddings matrix is not an input to the NN, as the input to the NN is technically the input layer. You obtain the final state of this matrix during the training process. Nonetheless, this matrix has to be initialised in the program, because the algorithm has to start from some random state of the matrix and gradually update it during training.
The same is true for the weights: they also need to be initialised. It happens in the following line:
    nce_weights = tf.Variable(
        tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
Each embedding vector can be multiplied by a vector from the weights matrix (row by column). So we get a scalar in the NN output layer. This scalar is interpreted as the probability that the target word (from the input layer) will be accompanied by the label [or context] word corresponding to the place of the scalar in the output layer.
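That per-word scalar is turned into a binary "is this the real context word?" probability with a sigmoid, as in logistic regression. A sketch with toy numbers (the vectors and bias are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

embedding = np.array([0.2, -0.1, 0.4])   # hypothetical input word vector
weight_row = np.array([1.0, 0.5, -0.3])  # hypothetical output-weight row for one word
bias = 0.1

score = embedding @ weight_row + bias  # the scalar from the text
prob = sigmoid(score)                  # probability the (target, context) pair is real
print(float(prob))
```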
So, if we are talking about the inputs (arguments) to the function, then both matrices are among them: the weights and a batch-sized extraction from the embeddings:
    tf.nn.nce_loss(weights=nce_weights,     # Tensor of shape (50000, 128)
                   biases=nce_biases,       # vector of zeros; len 50000
                   labels=train_labels,     # labels == context word ids
                   inputs=embed,            # Tensor of shape (128, 128): batch_size x embed_size
                   num_sampled=num_sampled, # 64: randomly chosen negative (rare) words
                   num_classes=vocabulary_size)  # 50000: by construction
This nce_loss function outputs a vector of batch_size — in the TensorFlow example, a shape (128,) tensor.
Then reduce_mean() reduces this result to a scalar, taking the mean of those 128 values, which is in fact the objective for further minimization.
Hope this helps.
From the paper Learning word embeddings efficiently with noise-contrastive estimation:
    NCE is based on the reduction of density estimation to probabilistic binary classification. The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some "noise" distribution.
We can see that in word embedding, NCE is actually negative sampling. (For the difference between the two, see the paper Notes on Noise Contrastive Estimation and Negative Sampling.)
Therefore, you do not need to supply the noise distribution. And also from the quote, you can see that it is actually logistic regression: the weights and biases are the ones needed for the logistic regression. If you are familiar with word2vec, it is just adding a bias.
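Under that logistic-regression view, the per-example negative-sampling objective is log σ(score of the real pair) plus Σ log σ(−score of each noise pair), negated to get a loss. A minimal sketch (the scores below are made up to show the behaviour):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(pos_score, neg_scores):
    """Negative-sampling loss: push the real pair's score up, the noise pairs' down."""
    neg_scores = np.asarray(neg_scores, dtype=float)
    return float(-(np.log(sigmoid(pos_score)) + np.log(sigmoid(-neg_scores)).sum()))

# a confident model: high score for the real pair, low for sampled noise words
low = neg_sampling_loss(5.0, [-4.0, -6.0])
# a confused model: a noise word scored higher than the real pair
high = neg_sampling_loss(-1.0, [3.0])
print(low, high)  # the confident model's loss is much smaller
```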
