I am trying to implement a seq2seq model in PyTorch and I am having some problems with batching.
For example I have a batch of data whose dimensions are
[batch_size, sequence_lengths, encoding_dimension]
where the sequence lengths are different for each example in the batch.
Now, I managed to do the encoding part by padding each element in the batch to the length of the longest sequence.
This way, if I give my net a batch with that shape as input, I get the following outputs:
output, of shape [batch_size, sequence_lengths, hidden_layer_dimension]
hidden state, of shape [batch_size, hidden_layer_dimension]
cell state, of shape [batch_size, hidden_layer_dimension]
Now, from the output, I take for each sequence the last relevant element, that is, the element along the sequence_lengths dimension corresponding to the last non-padded element of the sequence. Thus the final output I get is of shape [batch_size, hidden_layer_dimension].
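In case it is relevant, this is roughly how I do that gathering step (a simplified sketch; the function name and variables are just for illustration):
# output: [batch_size, max_len, hidden_dim]; lengths: LongTensor [batch_size] with the true lengths
def last_relevant(output, lengths):
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, output.size(2))  # [batch, 1, hidden]
    return output.gather(1, idx).squeeze(1)                           # [batch, hidden]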
But now I have the problem of decoding it from this vector. How do I handle a decoding of sequences of different lengths in the same batch? I tried to google it and found this, but they don't seem to address the problem. I thought of going element by element for the whole batch, but then I have the problem of passing the initial hidden states, given that the ones from the encoder will be of shape [batch_size, hidden_layer_dimension], while the ones for the decoder will be of shape [1, hidden_layer_dimension].
Am I missing something? Thanks for the help!
You are not missing anything. I can help you, since I have worked on several sequence-to-sequence applications using PyTorch. I am giving you a simple example below.
class Seq2Seq(nn.Module):
    """A Seq2seq network trained on predicting the next query."""

    def __init__(self, dictionary, embedding_index, args):
        super(Seq2Seq, self).__init__()
        self.config = args
        self.num_directions = 2 if self.config.bidirection else 1

        self.embedding = EmbeddingLayer(len(dictionary), self.config)
        self.embedding.init_embedding_weights(dictionary, embedding_index, self.config.emsize)

        self.encoder = Encoder(self.config.emsize, self.config.nhid_enc, self.config.bidirection, self.config)
        self.decoder = Decoder(self.config.emsize, self.config.nhid_enc * self.num_directions, len(dictionary),
                               self.config)

    @staticmethod
    def compute_decoding_loss(logits, target, seq_idx, length):
        losses = -torch.gather(logits, dim=1, index=target.unsqueeze(1)).squeeze()
        mask = helper.mask(length, seq_idx)  # mask: batch x 1
        losses = losses * mask.float()
        num_non_zero_elem = torch.nonzero(mask.data).size()
        if not num_non_zero_elem:
            return losses.sum(), 0
        else:
            return losses.sum(), num_non_zero_elem[0]

    def forward(self, q1_var, q1_len, q2_var, q2_len):
        # encode the query
        embedded_q1 = self.embedding(q1_var)
        encoded_q1, hidden = self.encoder(embedded_q1, q1_len)

        # turn the (possibly bidirectional) encoder state into the initial decoder state
        if self.config.bidirection:
            if self.config.model == 'LSTM':
                h_t, c_t = hidden[0][-2:], hidden[1][-2:]
                decoder_hidden = torch.cat((h_t[0].unsqueeze(0), h_t[1].unsqueeze(0)), 2), torch.cat(
                    (c_t[0].unsqueeze(0), c_t[1].unsqueeze(0)), 2)
            else:
                h_t = hidden[0][-2:]
                decoder_hidden = torch.cat((h_t[0].unsqueeze(0), h_t[1].unsqueeze(0)), 2)
        else:
            if self.config.model == 'LSTM':
                decoder_hidden = hidden[0][-1], hidden[1][-1]
            else:
                decoder_hidden = hidden[-1]

        # decode step by step, accumulating the masked loss
        decoding_loss, total_local_decoding_loss_element = 0, 0
        for idx in range(q2_var.size(1) - 1):
            input_variable = q2_var[:, idx]
            embedded_decoder_input = self.embedding(input_variable).unsqueeze(1)
            decoder_output, decoder_hidden = self.decoder(embedded_decoder_input, decoder_hidden)
            local_loss, num_local_loss = self.compute_decoding_loss(decoder_output, q2_var[:, idx + 1], idx, q2_len)
            decoding_loss += local_loss
            total_local_decoding_loss_element += num_local_loss

        if total_local_decoding_loss_element > 0:
            decoding_loss = decoding_loss / total_local_decoding_loss_element

        return decoding_loss
You can see the complete source code here. This application is about predicting users' next web-search query given the current web-search query.
The answer to your question:
How do I handle a decoding of sequences of different lengths in the same batch?
You have padded sequences, so you can treat all the sequences as having the same length. But when you compute the loss, you need to ignore the loss for the padded terms using masking.
I have used a masking technique to achieve this in the above example.
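The helper.mask call above comes from a utility module that is not shown here; a minimal stand-in for it could look like this (an assumption based on how it is called, not the repository's actual code):
def mask(length, seq_idx):
    # length: LongTensor of shape [batch] holding the true target lengths
    # seq_idx: int, the current decoding step
    # returns 1 for positions that are real tokens at step seq_idx, 0 for padding
    return (length > seq_idx).long()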
Also, you are absolutely correct: you need to decode element by element for the mini-batches. The initial decoder state of shape [batch_size, hidden_layer_dimension] is also fine. You just need to unsqueeze it at dimension 0 to make it [1, batch_size, hidden_layer_dimension].
Please note that you do not need to loop over each example in the batch; you can process the whole batch at once, but you need to loop over the elements of the sequences.
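To make that concrete, here is a minimal, self-contained sketch of such a decoding loop (the GRU decoder, teacher forcing on the target tokens, and all the names are illustrative stand-ins, not the code from the repository above):
import torch
import torch.nn as nn

batch_size, hidden_dim, vocab_size, max_tgt_len = 4, 16, 50, 7

# stand-ins for the real tensors, just for illustration
encoder_final = torch.randn(batch_size, hidden_dim)                 # last relevant encoder output
targets = torch.randint(0, vocab_size, (batch_size, max_tgt_len))   # padded target sequences
lengths = torch.tensor([7, 5, 3, 6])                                # true target lengths

embedding = nn.Embedding(vocab_size, hidden_dim)
decoder_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss(reduction='none')

decoder_hidden = encoder_final.unsqueeze(0)    # [1, batch, hidden]; for an LSTM, do this for h and c
loss, denom = 0., 0.
for t in range(max_tgt_len - 1):               # loop over time steps, not over examples
    inp = embedding(targets[:, t]).unsqueeze(1)              # [batch, 1, hidden]
    out, decoder_hidden = decoder_rnn(inp, decoder_hidden)   # out: [batch, 1, hidden]
    logits = out_proj(out.squeeze(1))                        # [batch, vocab]
    step_loss = criterion(logits, targets[:, t + 1])         # per-example loss, [batch]
    step_mask = (lengths > t + 1).float()                    # 0 where position t+1 is padding
    loss = loss + (step_loss * step_mask).sum()
    denom = denom + step_mask.sum()
loss = loss / denom.clamp(min=1)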
Related
I have a GRU network, which is manually built (i.e. no nn.GRU) with 2 vertical layers, 128-dim hidden layer, and sequence length of 64 chars.
I'm trying to overfit a small corpus taken out of Shakespeare:
I ran training for 500+ epochs, and after every 25 epochs I generate a sample (with very low temperature) by giving only the first letter "B". At first it's gibberish, but at the end it does get close to the actual text. The first version was:
This was without passing the hidden state between batches. I thought maybe this was my problem, so I passed the hidden state, detached, between batches. But I still don't get perfect overfit, and it actually seems to worsen it:
Here's how I implemented the GRU:
for i in range(self.n_layers):
    layer_input = layer_middle
    layer_middle = torch.zeros((batch_size, seq_len, self.h_dim))
    params = self.layer_params[i]
    for j in range(seq_len):
        x = layer_input[:, j, :].to(torch.device('cuda'))
        z = F.sigmoid(params['W1'](x) + params['W2'](layer_states[i]))
        r = F.sigmoid(params['W3'](x) + params['W4'](layer_states[i]))
        g = F.tanh(params['W5'](x) + params['W6'](r*layer_states[i]))
        layer_states[i] = z*layer_states[i] + (1-z)*g
        layer_middle[:, j, :] = layer_states[i]
    layer_middle = nn.Dropout(self.dropout)(layer_middle)

layer_output = torch.zeros((batch_size, seq_len, self.out_dim))
for j in range(seq_len):
    x = layer_middle[:, j, :].to(torch.device('cuda'))
    layer_output[:, j, :] = self.layer_params[-1]['W7'](x)
The params are nn.Linear with the correct shapes.
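For reference, they are constructed roughly along these lines (a simplified sketch consistent with the usage above; the container types and the in_dim name are placeholders, not my exact code):
# simplified sketch of how self.layer_params is built (in __init__); the shapes follow the usage above
self.layer_params = nn.ModuleList()
for i in range(self.n_layers):
    d_in = self.in_dim if i == 0 else self.h_dim  # layer 0 sees the input, later layers see h_dim
    self.layer_params.append(nn.ModuleDict({
        'W1': nn.Linear(d_in, self.h_dim),          # update gate z, input part
        'W2': nn.Linear(self.h_dim, self.h_dim),    # update gate z, state part
        'W3': nn.Linear(d_in, self.h_dim),          # reset gate r, input part
        'W4': nn.Linear(self.h_dim, self.h_dim),    # reset gate r, state part
        'W5': nn.Linear(d_in, self.h_dim),          # candidate g, input part
        'W6': nn.Linear(self.h_dim, self.h_dim),    # candidate g, state part
    }))
self.layer_params.append(nn.ModuleDict({'W7': nn.Linear(self.h_dim, self.out_dim)}))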
Any idea what might prevent me from overfitting to this tiny corpus?
I am trying to implement an attention network with TensorFlow 2. For every image, I want to take only a few glimpses, i.e. small parts of the image. For this I have implemented a subclass of tensorflow.keras.models.Model; here is a snippet from it.
class RecurrentAttentionModel(models.Model):
    # ...

    def call(self, inputs):
        l = tf.random.uniform((40, 2), minval=0, maxval=1)

        for _ in range(0, self.glimpses):
            glimpse = tf.image.extract_glimpse(inputs, size=(self.retina_size, self.retina_size), offsets=l, centered=False, normalized=True)
            # some other code...
            # update l to take a glimpse somewhere else

        return result
Now, the code above works and trains perfectly, but my issue is that it has the hardcoded 40 in it, which is the batch_size I have defined for my dataset. I am not able to read/get the batch_size in the call method, since the variable "inputs" is of the form Tensor("input_1_77:0", shape=(None, 250, 500, 1), dtype=float32), where the None for the batch_size seems to be expected behavior.
When I just initialize l with the following code (without the batch_size)
l = tf.random.uniform((2,), minval=0, maxval=1)
it throws this error
ValueError: Shape must be rank 2 but is rank 1 for 'recurrent_attention_model_86/ExtractGlimpse' (op: 'ExtractGlimpse') with input shapes: [?,250,500,1], [2], [2]
which I totally understand, but I have no idea how I could initialize the values according to the batch_size.
You can extract the batch size dimension dynamically by using tf.shape.
l = tf.random.uniform(tf.stack([tf.shape(inputs)[0], 2]), minval=0, maxval=1)
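For example, inside call this would look like (a sketch based on the snippet in the question):
    def call(self, inputs):
        # the batch size is resolved at run time, so the 40 no longer needs to be hardcoded
        l = tf.random.uniform(tf.stack([tf.shape(inputs)[0], 2]), minval=0, maxval=1)
        for _ in range(0, self.glimpses):
            glimpse = tf.image.extract_glimpse(inputs, size=(self.retina_size, self.retina_size), offsets=l, centered=False, normalized=True)
            # ...
        return result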
I have a model which takes three variable-length inputs with the same label. Is there a way I could use pack_padded_sequence somehow? If so, how should I sort my sequences?
For example,
a = (([0,1,2], [3,4], [5,6,7,8]), 1) # training data has lengths 3, 2, 4; label is 1
b = (([0,1], [2], [6,7,8,9,10]), 1)
Both a and b will be fed into three separated LSTMs and the result will be merged to predict the target.
Let's do it step by step.
Input Data Processing
a = (([0, 1, 2], [3, 4], [5, 6, 7, 8]), 1)

# store the length of each sequence in an array
len_a = np.array([len(seq) for seq in a[0]])
variable_a = np.zeros((len(len_a), np.amax(len_a)))
for i, seq in enumerate(a[0]):
    variable_a[i, 0:len(seq)] = seq

vocab_size = len(np.unique(variable_a))
variable_a = Variable(torch.from_numpy(variable_a).long())
print(variable_a)
It prints:
Variable containing:
0 1 2 0
3 4 0 0
5 6 7 8
[torch.LongTensor of size 3x4]
Defining embedding and RNN layer
Now, let's say, we have an Embedding and RNN layer class as follows.
class EmbeddingLayer(nn.Module):
    def __init__(self, input_size, emsize):
        super(EmbeddingLayer, self).__init__()
        self.embedding = nn.Embedding(input_size, emsize)

    def forward(self, input_variable):
        return self.embedding(input_variable)


class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, bidirection):
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirection = bidirection
        self.rnn = nn.LSTM(self.input_size, self.hidden_size, batch_first=True,
                           bidirectional=self.bidirection)

    def forward(self, sent_variable, sent_len):
        # Sort by length (keep idx)
        sent_len, idx_sort = np.sort(sent_len)[::-1], np.argsort(-sent_len)
        idx_unsort = np.argsort(idx_sort)

        idx_sort = torch.from_numpy(idx_sort)
        sent_variable = sent_variable.index_select(0, Variable(idx_sort))

        # Handling padding in Recurrent Networks
        sent_packed = nn.utils.rnn.pack_padded_sequence(sent_variable, sent_len, batch_first=True)
        sent_output = self.rnn(sent_packed)[0]
        sent_output = nn.utils.rnn.pad_packed_sequence(sent_output, batch_first=True)[0]

        # Un-sort by length
        idx_unsort = torch.from_numpy(idx_unsort)
        sent_output = sent_output.index_select(0, Variable(idx_unsort))

        return sent_output
Embed and encode the processed input data
We can embed and encode our input as follows.
emb = EmbeddingLayer(vocab_size, 50)
enc = Encoder(50, 100, False)
emb_a = emb(variable_a)
enc_a = enc(emb_a, len_a)
If you print the size of enc_a, you will get torch.Size([3, 4, 100]). I hope you understand the meaning of this shape.
Please note, the above code runs only on CPU.
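To merge the three encoded sequences of one example, you could, for instance, take the last non-padded output of each and concatenate them (just a sketch; in a real setup you would batch over training examples for each of the three inputs and keep a separate encoder per input):
idx = torch.from_numpy(len_a - 1).view(-1, 1, 1).expand(len(len_a), 1, enc_a.size(2))
last_states = enc_a.gather(1, Variable(idx)).squeeze(1)   # [3, 100], last relevant output of each sequence
merged = last_states.view(1, -1)                          # [1, 300], one merged vector for example a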
The answer above is already quite informative, though I often find myself having problems understanding PyTorch's documentation. I created these two functions to help me with the pack_padded/pad_packed thing.
def batch_to_sequence(x, len_x, batch_first):
    """helpful function to do the pack padding stuff
    returns the pack_padded sequence, whatever that is.
    The data does NOT have to be sorted by sentence length, we do that for you!

    Input:
        x: (torch.tensor[max_len, batch, embedding_dim]) tensor containing the
            padded data. It expects the embeddings of the words in the sequence
            they happen in the sentence. If batch_first == True, then the
            max_len and batch dimensions are transposed.
        len_x: (torch.tensor[batch]) a tensor containing the length of each
            sentence in x.
        batch_first: (bool), indicates whether batch or sentence length is
            indexed in the first dimension of the tensor.
    Output:
        x: (torch pack padded sequence) the pack padded sequence containing
            the data. (The documentation is horrible, I don't know what a
            pack padded sequence really is.)
        idx: (torch.tensor[batch]), the indexes used to sort x; this index is
            necessary in sequence_to_batch.
        len_x: (torch.tensor[batch]) the sorted lengths, also needed for
            sequence_to_batch."""
    # sort data because pack_padded_sequence is too stupid to do it itself
    len_x, idx = len_x.sort(0, descending=True)
    # sort along the batch dimension (dim 0 if batch_first, else dim 1)
    x = x[idx] if batch_first else x[:, idx]

    # remove paddings before feeding it to the LSTM
    x = torch.nn.utils.rnn.pack_padded_sequence(x,
                                                len_x,
                                                batch_first=batch_first)
    return x, len_x, idx
and
def sequence_to_batch(x, len_x, idx, output_size, batch_first, all_hidden=False):
    """helpful function for the pad packed stuff.

    Input:
        x: (packed pad sequence) the output of the lstm or of pack_padded_sequence().
        len_x: (torch.tensor[batch]), the sorted lengths that come out of
            batch_to_sequence().
        idx: (torch.tensor[batch]), the indexes used to sort len_x.
        output_size: (int), the expected dimension of the output embeddings.
        batch_first: (bool), indicates whether batch or sentence length is
            indexed in the first dimension of the tensor.
        all_hidden: (bool), if False returns the last relevant hidden state - it
            ignores the hidden states produced by the padding. If True, returns
            all hidden states.
    Output:
        x: (torch.tensor[batch, embedding_dim]) tensor containing the
            padded data.
    """
    # re-introduce the paddings
    # doc pad_packed_sequence:
    # https://pytorch.org/docs/master/nn.html#torch.nn.utils.rnn.pad_packed_sequence
    x, _ = torch.nn.utils.rnn.pad_packed_sequence(x,
                                                  batch_first=batch_first)
    if all_hidden:
        return x

    # get the indexes of the last token of each sentence (where the lstm should stop)
    longest_sentence = max(len_x)
    if batch_first:
        # x is [batch, max_len, dim]: the last token of sentence i sits at i*max_len + len_x[i] - 1
        last_word = torch.stack([i * longest_sentence + len_x[i] - 1 for i in range(len(len_x))])
    else:
        # x is [max_len, batch, dim]: the last token of sentence i sits at (len_x[i] - 1)*batch + i
        last_word = torch.stack([(len_x[i] - 1) * len(len_x) + i for i in range(len(len_x))])

    # get the relevant hidden states
    x = x.view(-1, output_size)
    x = x[last_word, :]

    # unsort the batch!
    _, idx = idx.sort(0, descending=False)
    x = x[idx, :]

    return x
You can use them inside the forward pass of your LSTM:
def forward(self, x, len_x):
    # convert the batch into a packed padded sequence
    x, len_x, idx = batch_to_sequence(x, len_x, self.batch_first)
    # run the LSTM
    x, (_, _) = self.uni_lstm(x)
    # take the pad_packed_sequence and give you back the embedding vectors
    x = sequence_to_batch(x, len_x, idx, self.output_size, self.batch_first)
    return x
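For instance, wrapped into a minimal module and called on a toy batch (the class name, sizes, and batch_first=True here are illustrative assumptions, not part of the original answer):
import torch
import torch.nn as nn

class WrappedLSTM(nn.Module):
    """Minimal wrapper around the two helpers above."""
    def __init__(self, input_size, output_size, batch_first=True):
        super(WrappedLSTM, self).__init__()
        self.output_size = output_size
        self.batch_first = batch_first
        self.uni_lstm = nn.LSTM(input_size, output_size, batch_first=batch_first)

    def forward(self, x, len_x):
        x, len_x, idx = batch_to_sequence(x, len_x, self.batch_first)
        x, (_, _) = self.uni_lstm(x)
        return sequence_to_batch(x, len_x, idx, self.output_size, self.batch_first)

model = WrappedLSTM(input_size=32, output_size=64)
x = torch.randn(4, 10, 32)           # padded batch: [batch, max_len, input_size]
len_x = torch.tensor([10, 7, 5, 3])  # true length of each sequence
out = model(x, len_x)                # [batch, output_size]: last relevant hidden state per sequence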
When I use a BiLSTM model for an NLP problem, I get an error while calling session.run(). Searching on Google, it seems that a bad feed_dict causes this kind of error. I printed the input x shape: it is (100,), but I defined it as [100, None, 256].
How can I solve the error?
This is my environment:
python: 3.6
tensorflow: 1.0.0
Task: every description has some tags (like on Stack Overflow, where one question has some tags). I need to build a model to predict tags for questions. My training input x is [batch_size, None, word_embedding_size]: a batch of question descriptions, where each description has some words and each word is expressed as a vector of length 256. Input y is [batch_size, n_classes].
this is my model code:
self.X_inputs = tf.placeholder(tf.float32, [self.n_steps, None, self.n_inputs])
self.targets = tf.placeholder(tf.float32, [None, self.n_classes])

# transpose the input x
x = tf.transpose(self.X_inputs, [1, 0, 2])
x = tf.reshape(x, [-1, self.n_inputs])
x = tf.split(x, self.n_steps)

# lstm cell
lstm_cell_fw = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)
lstm_cell_bw = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)

# dropout
if is_training:
    lstm_cell_fw = tf.contrib.rnn.DropoutWrapper(lstm_cell_fw, output_keep_prob=(1 - self.dropout_rate))
    lstm_cell_bw = tf.contrib.rnn.DropoutWrapper(lstm_cell_bw, output_keep_prob=(1 - self.dropout_rate))

lstm_cell_fw = tf.contrib.rnn.MultiRNNCell([lstm_cell_fw] * self.num_layers)
lstm_cell_bw = tf.contrib.rnn.MultiRNNCell([lstm_cell_bw] * self.num_layers)

# forward and backward
self.outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(
    lstm_cell_fw,
    lstm_cell_bw,
    x,
    dtype=tf.float32
)
The feed_dict is like this:
feed_dict = {
    self.X_inputs: X_train_batch,
    self.targets: y_train_batch
}
X_train_batch is a batch of sentences with shape [100, None, 256]; the 'None' means the input sentences do not have the same length (anywhere from 10 to 1500), so I just keep the real lengths. Maybe that causes the error?
My question is: should I pad the sentences to the same length, or reshape the inputs, when doing this kind of NLP work?
Suppose I have a trained RNN (e.g. a language model) and I want to see what it would generate on its own; how should I feed its output back into its input?
I read the following related questions:
TensorFlow using LSTMs for generating text
TensorFlow LSTM Generative Model
Theoretically it is clear to me that in TensorFlow we use truncated backpropagation, so we have to define the maximum number of steps we would like to "trace". We also reserve a dimension for batches, so if I'd like to train on a sine wave, I have to feed inputs of shape [None, num_steps, 1].
The following code works:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

tf.reset_default_graph()

n_samples = 100
state_size = 5

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)

def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]

X = tf.placeholder_with_default(zero_x, [None, n_samples, 1])
output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64)

pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)

Y = np.roll(def_x, 1)
loss = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
opt = tf.train.AdamOptimizer().minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Initial state run
plt.show(plt.plot(output.eval()[0]))
plt.plot(def_x.squeeze())
plt.show(plt.plot(pred.eval().squeeze()))

steps = 1001
for i in range(steps):
    p, l, _ = sess.run([pred, loss, opt])
The state size of the LSTM can be varied; I also experimented with feeding the sine wave into the network versus feeding zeros, and in both cases it converged in ~500 iterations. So far I have understood that in this case the graph consists of n_samples LSTM cells sharing their parameters, and it is up to me to feed input to them as a time series. However, when generating samples the network explicitly depends on its previous output - meaning that I cannot feed the unrolled model all at once. I tried to compute the state and output at every step:
with tf.variable_scope('sine', reuse=True):
    X_test = tf.placeholder(tf.float64)
    X_reshaped = tf.reshape(X_test, [1, -1, 1])
    output, last_states = tf.nn.dynamic_rnn(lstm_cell, X_reshaped, dtype=tf.float64)
    pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)

test_vals = [0.]
for i in range(1000):
    val = pred.eval({X_test: np.array(test_vals)[None, :, None]})
    test_vals.append(val)
However in this model it seems that there is no continuity between the LSTM cells. What is going on here?
Do I have to initialize a zero array with, e.g., 100 time steps, and assign each run's result into the array? Like feeding the network with this:
run 0: input_feed = [0, 0, 0 ... 0]; res1 = result
run 1: input_feed = [res1, 0, 0 ... 0]; res2 = result
run 2: input_feed = [res1, res2, 0 ... 0]; res3 = result
etc...
What to do if I want to use this trained network to use its own output as its input in the following time step?
If I understood you correctly, you want to find a way to feed the output of time step t as input to time step t+1, right? To do so, there is a relatively easy workaround that you can use at test time:
Make sure your input placeholders can accept a dynamic sequence length, i.e. the size of the time dimension is None.
Make sure you are using tf.nn.dynamic_rnn (which you do in the posted example).
Pass the initial state into dynamic_rnn.
Then, at test time, you can loop through your sequence and feed each time step individually (i.e. max sequence length is 1). Additionally, you just have to carry over the internal state of the RNN. See pseudo code below (the variable names refer to your code snippet).
I.e., change the definition of the model to something like this:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(state_size, forget_bias=1.)

def_x = np.sin(np.linspace(0, 10, n_samples))[None, :, None]
zero_x = np.zeros(n_samples)[None, :, None]

X = tf.placeholder_with_default(zero_x, [None, None, 1])  # [batch_size, seq_length, dimension of input]
batch_size = tf.shape(X)[0]
initial_state = lstm_cell.zero_state(batch_size, dtype=tf.float64)

output, last_states = tf.nn.dynamic_rnn(inputs=X, cell=lstm_cell, dtype=tf.float64,
                                        initial_state=initial_state)
pred = tf.contrib.layers.fully_connected(output, 1, activation_fn=tf.tanh)
Then you can perform inference like so:
fetches = {'final_state': last_states,
           'prediction': pred}

toy_initial_input = np.array([[[1]]])  # put suitable data here
seq_length = 20  # put whatever is reasonable here for you

# get the output for the first time step
feed_dict = {X: toy_initial_input}
eval_out = sess.run(fetches, feed_dict)
outputs = [eval_out['prediction']]
next_state = eval_out['final_state']

for i in range(1, seq_length):
    feed_dict = {X: outputs[-1],
                 initial_state: next_state}
    eval_out = sess.run(fetches, feed_dict)
    outputs.append(eval_out['prediction'])
    next_state = eval_out['final_state']

# outputs now contains the sequence you want
Note that this can also work for batches; however, it can be a bit more complicated if you have sequences of different lengths in the same batch.
If you want to perform this kind of prediction not only at test time, but also at training time, it is also possible to do, but a bit more complicated to implement.
You can use its own output (last state) as the next-step input (initial state).
One way to do this is to:
use zero-initialized variables as the input state at every time step
each time you complete a truncated sequence and get some output state, update the state variables with the output state you just got.
The second step can be done by either:
fetching the states to Python and feeding them back next time, as done in the ptb example in tensorflow/models
building an update op in the graph and adding a dependency, as done in the ptb example in tensorpack (a rough sketch of this is shown below).
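A rough sketch of that second variant, under the assumptions of the snippets above (lstm_cell, X, a fixed Python-int batch_size and state_size, float64 inputs; none of this is the exact ptb code):
# Keep the RNN state in non-trainable variables so it persists across session.run calls.
state_c = tf.get_variable('state_c', [batch_size, state_size], dtype=tf.float64,
                          trainable=False, initializer=tf.zeros_initializer())
state_h = tf.get_variable('state_h', [batch_size, state_size], dtype=tf.float64,
                          trainable=False, initializer=tf.zeros_initializer())
initial_state = tf.nn.rnn_cell.LSTMStateTuple(state_c, state_h)

output, last_states = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=X, initial_state=initial_state)

# After each truncated sequence, copy the final state back into the variables,
# and make the output depend on that update so it is not skipped.
update_state = tf.group(tf.assign(state_c, last_states.c),
                        tf.assign(state_h, last_states.h))
with tf.control_dependencies([update_state]):
    output = tf.identity(output)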
I know I'm a bit late to the party but I think this gist could be useful:
https://gist.github.com/CharlieCodex/f494b27698157ec9a802bc231d8dcf31
It lets you automatically feed the output through a filter and back into the network as input. To make the shapes match up, the processing can be set as a tf.layers.Dense layer.
Please ask any questions!
Edit:
In your particular case, create a lambda which performs the processing of the dynamic_rnn outputs into your character vector space. Ex:
# if you have:
W = tf.Variable( ... )
B = tf.Variable( ... )

Yo, Ho = tf.nn.dynamic_rnn(cell, inputs, initial_state=state)
logits = tf.matmul(Yo, W) + B
...

# use self_feeding_rnn as
process_yo = lambda Yo: tf.matmul(Yo, W) + B
Yo, Ho = self_feeding_rnn(cell, seed, initial_state, processing=process_yo)