Feeding LSTMCell with whole sentences using embeddings gives dimensionality error - python

I'm currently working on a text-classification problem, but I can't even set up my model in TensorFlow. I have a batch of sentences of length 70 (using padding) and I'm using embedding_lookup with an embedding size of 300. Here is the code for the embedding:
embedding = tf.constant(embedding_matrix, name="embedding")
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
So now inputs should be of shape [batch_size, sentence_length, embedding_size], which is not surprising. Sadly, I'm getting a ValueError from my LSTMCell, since it expects ndim=2 and inputs obviously has ndim=3. I have not found a way to change the expected input shape of the LSTM layer. Here is the code for my LSTMCell init:
for i in range(num_layers):
    cells.append(LSTMCell(num_units, forget_bias, state_is_tuple, reuse=reuse, name='lstm_{}'.format(i)))
cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)
The error is triggered in the call function of the cell, which looks like this:
for step in range(self.num_steps):
    if step > 0: tf.get_variable_scope().reuse_variables()
    (cell_output, state) = cell(inputs[:, step, :], state)
Similar question but not helping: Understanding Tensorflow LSTM Input shape

I was able to solve the problem myself. It seems the LSTMCell implementation is more hands-on and low-level with respect to how an LSTM actually works. The Keras LSTM layers took care of things I have to handle myself when using TensorFlow directly. The example I'm using is from the following official TensorFlow example:
https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
Since we want to feed our LSTM layers with a sequence, we need to feed the cells one word after another. Calling the cell produces two outputs (the cell output and the cell state), so we loop over the time steps, feed the cell the current word of every sentence in the batch, and pass the returned state into the next call. This way we build the output of our layers, which we can then use for further operations. The code looks like this:
self._initial_state = cell.zero_state(config.batch_size, data_type())
state = self._initial_state
outputs = []
with tf.variable_scope("RNN"):
    for time_step in range(self.num_steps):
        if time_step > 0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
output = tf.reshape(tf.concat(outputs, 1), [-1, config.hidden_size])
num_steps is the number of words per sentence that we are going to use.
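To make the shapes concrete, here is a framework-free sketch of the same unrolling pattern. toy_cell is a hypothetical stand-in for an RNN cell (it is not the TensorFlow LSTMCell and does no real LSTM math); the only point is that each call receives a 2-D slice [batch_size, embedding_size] taken from the 3-D inputs, and that the state is carried from one step to the next.

```python
# Hypothetical stand-in for an RNN cell: it takes one time step of the batch
# (a 2-D list [batch_size, embedding_size]) plus the previous state and
# returns (output, new_state). A real LSTMCell does far more than this.
def toy_cell(step_inputs, state):
    new_state = [s + sum(x) for s, x in zip(state, step_inputs)]
    return new_state, new_state


def unroll(inputs, num_steps):
    """Feed a [batch, time, embedding] nested list to toy_cell step by step."""
    batch_size = len(inputs)
    state = [0.0] * batch_size  # plays the role of cell.zero_state(...)
    outputs = []
    for time_step in range(num_steps):
        # Equivalent of inputs[:, time_step, :] -> shape [batch, embedding]
        step_inputs = [sentence[time_step] for sentence in inputs]
        cell_output, state = toy_cell(step_inputs, state)
        outputs.append(cell_output)
    return outputs, state


# 2 sentences, 3 time steps, embedding size 2
inputs = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0], [2.0, 0.0]],
]
outputs, final_state = unroll(inputs, num_steps=3)
print(len(outputs), len(outputs[0]))  # one output per time step, one entry per sentence
print(final_state)
```

The list of per-step outputs corresponds to the outputs list in the TensorFlow code above, before it is concatenated and reshaped.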

Related

Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to build a simple network that can predict the next character in a sequence (the idea is mainly from this article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ but I wanted to do most of the work myself).
My idea is this: I take a batch of B input sequences of size n (NumPy array of n integers), one-hot encode them, and pass them through my network, which is composed of several LSTM layers, one fully connected layer and one softmax unit.
I then compare the output to the target sequences, which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following:
class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units  # stored because forward_pass reads self.num_units
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first = True, dropout = dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim = 1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200, also the size of a one-hot encoded vector).
num_units is the number of hidden units in an LSTM cell, and num_layers is the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows:
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input, num_classes = one_hot_length).float()
for state in hc_states:
    state.detach_()
optimizer.zero_grad()
output, states = net.forward_pass(input, hc_states)
# CrossEntropyLoss is a module: instantiate it, then call it on (N, C) outputs and (N,) targets
loss = nn.CrossEntropyLoss()(output, target.reshape(-1))
loss.backward()
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
optimizer.step()
Here hc_states is a tuple containing the hidden-state tensor and the cell-state tensor, input is a tensor of size (B, n, one_hot_length), and target is of size (B, n).
I'm training on a really small dataset (sentences in a .txt of ~400 KB) just to tune my code. I did 4 different runs with different parameters, and each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensor shapes, as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual approach is to put a softmax unit at the end to get the "probability" of each character appearing, but clearly this isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.
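One thing worth checking here, offered as a pointer rather than a verified diagnosis of this exact code: PyTorch's nn.CrossEntropyLoss expects raw, unnormalized logits and applies log-softmax internally, so an explicit nn.Softmax before it means softmax is effectively applied twice. The plain-Python sketch below does not use PyTorch; softmax and cross_entropy are illustrative helpers, and the 200 classes come from the question. It shows how the second softmax squashes the output, so that even a very confident, correct prediction cannot push the loss far below that of a uniform guess.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross-entropy that applies softmax internally, as nn.CrossEntropyLoss does."""
    return -math.log(softmax(scores)[target])


num_classes = 200
target = 3

# Very confident, correct logits for class `target`.
logits = [0.0] * num_classes
logits[target] = 20.0

loss_on_logits = cross_entropy(logits, target)          # softmax applied once
loss_on_probs = cross_entropy(softmax(logits), target)  # softmax applied twice
loss_uniform = math.log(num_classes)                    # loss of a uniform guess

print(f"logits fed to the loss:        {loss_on_logits:.4f}")
print(f"probabilities fed to the loss: {loss_on_probs:.4f}")
print(f"uniform guess:                 {loss_uniform:.4f}")
```

If this is what is happening, the usual arrangement is to return the logits from the model during training and apply softmax only when actual probabilities are needed, for example when sampling characters.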

Tensorflow: Understanding the layer structure of LSTM model

I'm new to TensorFlow and LSTMs, and I'm having some trouble understanding the shape and structure of the network (weights, biases, shape of inputs and logs).
In this specific piece of code, taken from here:
def recurrent_neural_network(x):
    layer = {'weights':tf.Variable(tf.random_normal([rnn_size,n_classes])),
             'biases':tf.Variable(tf.random_normal([n_classes]))}
    x = tf.transpose(x, [1,0,2])
    x = tf.reshape(x, [-1, chunk_size])
    x = tf.split(x, n_chunks, 0)
    lstm_cell = rnn_cell.BasicLSTMCell(rnn_size,state_is_tuple=True)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    output = tf.matmul(outputs[-1],layer['weights']) + layer['biases']
    return output
Can someone please explain why we need to convert x to this specific format (transpose -> reshape -> split)?
Why are the weights defined as [rnn_size, n_classes] and the biases as [n_classes]?
What is the exact structure of the network that is being formed, and how are the weights connected? I don't understand that properly.
Is there any site or reference that I could read that would help?
Thanks.
For the general network structure: LSTMs are an extension of RNNs. For an explanation of the RNN structure, take a look at this classic blog post.
For the actual LSTMs, try this post (which also has an RNN explanation).
These are not very formal, but they should be much easier to read and understand than academic papers.
Once you have read these, the rest should not be very hard. The transformations of x are there because that is the format static_rnn expects: a Python list with one tensor per time step (n_chunks of them), each of shape [batch_size, chunk_size]. rnn_size is the size of the LSTM cell's output, so the weights of shape [rnn_size, n_classes] and the biases of shape [n_classes] map the last output to one score per class.
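As a rough illustration of what transpose -> reshape -> split does to the data layout, here is a plain-Python sketch on nested lists. The helper name to_static_rnn_inputs is made up for this example, and it only mimics the three TensorFlow calls; it does not call TensorFlow.

```python
def to_static_rnn_inputs(x, n_chunks):
    """Mimic transpose -> reshape -> split on a [batch, n_chunks, chunk_size] nested list."""
    batch_size = len(x)
    # transpose(x, [1, 0, 2]): make time the leading axis -> [n_chunks, batch, chunk_size]
    time_major = [[x[b][t] for b in range(batch_size)] for t in range(n_chunks)]
    # reshape(x, [-1, chunk_size]): flatten to [n_chunks * batch, chunk_size]
    flat = [row for step in time_major for row in step]
    # split(x, n_chunks, 0): cut back into n_chunks pieces of [batch, chunk_size]
    return [flat[t * batch_size:(t + 1) * batch_size] for t in range(n_chunks)]


# batch of 2 sequences, 3 time steps (chunks), 2 features per step
x = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
steps = to_static_rnn_inputs(x, n_chunks=3)
print(len(steps))  # number of time steps
print(steps[0])    # first time step of every sequence in the batch
```

Each element of steps holds one time step for the whole batch, which is the per-step input the cell consumes; outputs[-1] in the original code is then the cell output for the last of these steps.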

Output of Tensorflow LSTM-Cell

I've got a question on Tensorflow LSTM-Implementation. There are currently several implementations in TF, but I use:
cell = tf.contrib.rnn.BasicLSTMCell(n_units)
where n_units is the number of 'parallel' LSTM cells.
Then to get my output I call:
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(cell, x,
                                            initial_state=initial_state, time_major=False)
where (as time_major=False) x is of shape (batch_size, time_steps, input_length), and
batch_size is my batch size,
time_steps is the number of time steps my RNN will go through,
input_length is the length of one of my input vectors (the vector fed into the network at one specific time step for one specific batch element).
I expect rnn_outputs to be of shape (batch_size, time_steps, n_units, input_length), as I have not specified another output size.
The documentation of nn.dynamic_rnn tells me that the output is of shape (batch_size, max_time, cell.output_size).
tf.contrib.rnn.BasicLSTMCell has a property output_size, which defaults to n_units (the number of LSTM cells I use).
So does each LSTM cell only output a scalar for every given time step? I would expect it to output a vector of the length of the input vector. This seems not to be the case as I understand it right now, so I am confused. Can you tell me whether that's the case, or how I could change it to output a vector the size of the input vector per single LSTM cell?
I think the primary confusion is about the terminology of the LSTM cell's argument num_units. Unfortunately it does not mean, as the name suggests, "the number of LSTM cells" that should be equal to your time steps. It is the dimensionality of the hidden state (the cell state and the hidden state vector each have this size).
The call to dynamic_rnn() returns a tensor of shape [batch_size, time_steps, output_size], where (please note this)
output_size = num_units if num_proj is None in the LSTM cell,
output_size = num_proj if it is defined.
Now, typically, you will extract the last time step's result and project it to the size of your output dimensions manually with a matmul + biases operation, or use the num_proj argument of the LSTM cell.
I have been through the same confusion and had to look really deep to get it cleared. Hope this answer clears some of it.
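A small sketch of the "take the last time step and project it" step, written with plain Python lists so the shapes are easy to follow. The function name project_last_step and all numbers are made up for illustration; in TensorFlow this would be a slice followed by tf.matmul plus a bias.

```python
def project_last_step(rnn_outputs, weights, biases):
    """rnn_outputs: [batch, time_steps, num_units]; weights: [num_units, n_out]; biases: [n_out]."""
    projected = []
    for sequence in rnn_outputs:
        last = sequence[-1]  # output of the final time step, length num_units
        row = [
            sum(h * weights[i][j] for i, h in enumerate(last)) + biases[j]
            for j in range(len(biases))
        ]
        projected.append(row)
    return projected


# batch_size=2, time_steps=3, num_units=2 (values are made up for illustration)
rnn_outputs = [
    [[0.1, 0.2], [0.3, 0.4], [1.0, 2.0]],
    [[0.0, 0.0], [0.5, 0.5], [3.0, -1.0]],
]
weights = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 1.0]]  # [num_units=2, n_out=3]
biases = [0.5, 0.5, 0.0]

result = project_last_step(rnn_outputs, weights, biases)
print(result)  # shape [batch_size, n_out]
```

The last dimension of the RNN output is num_units regardless of the input vector length; the projection is what brings it to whatever output size the task needs.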

Understanding tensorflow RNN encoder input for translation task

I'm following this tutorial (https://theneuralperspective.com/2016/11/20/recurrent-neural-networks-rnn-part-3-encoder-decoder/) to implement an RNN using TensorFlow.
I have prepared the input data for both encoder and decoder as described, but I'm having trouble understanding how TensorFlow's dynamic_rnn function expects the input for this particular task, so I'm going to describe the data I have in the hope that someone has an idea.
I have 20 sentences in English and in French. As described in the link above, each sentence is padded to the length of the longest sentence in its language, and GO and EOS tokens are added where expected.
Since not all the code is available at the link, I tried writing some of the missing code myself. This is what my encoder code looks like:
with tf.variable_scope('encoder') as scope:
    # Encoder RNN cell
    self.encoder_lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_units, forget_bias=0.0, state_is_tuple=True)
    self.encoder_cell = tf.nn.rnn_cell.MultiRNNCell([self.encoder_lstm_cell] * num_layers, state_is_tuple=True)
    # Embed encoder RNN inputs
    with tf.device("/cpu:0"):
        embedding = tf.get_variable(
            "embedding", [self.vocab_size_en, hidden_units], dtype=data_type)
        self.embedded_encoder_inputs = tf.nn.embedding_lookup(embedding, self.encoder_inputs)
    # Outputs from encoder RNN
    self.encoder_outputs, self.encoder_state = tf.nn.dynamic_rnn(
        cell=self.encoder_cell,
        inputs=self.embedded_encoder_inputs,
        sequence_length=self.seq_lens_en, time_major=False, dtype=tf.float32)
My embeddings are of shape [20, 60, 60], but I think this is incorrect. If I look at the documentation for dynamic_rnn (https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn), the inputs are expected to be of size [batch, max_time, ...]. If I look at this example (https://medium.com/#erikhallstrm/using-the-dynamicrnn-api-in-tensorflow-7237aba7f7ea#.4hrf7z1pd), the inputs are supposed to be [batch, truncated_backprop_length, state_size].
The max length of a sentence for the encoder is 60. This is where I am confused about what exactly the last two numbers of the input shape should be. Should max_time also be the max length of my sentence sequences, or should it be another quantity? And should the last number also be the max length of the sentences' sequences, or the length of the current input without the padding?
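Going by the dynamic_rnn documentation, with time_major=False the inputs are [batch_size, max_time, depth]: max_time is the padded sentence length and the last dimension is the embedding width (hidden_units in the code above), so [20, 60, 60] would be consistent if hidden_units happens to be 60. The unpadded lengths are what goes into sequence_length. A small sketch of computing those lengths, assuming right-padded rows and a hypothetical padding id PAD_ID = 0 (neither is stated in the post):

```python
PAD_ID = 0  # assumed padding token id, chosen for illustration only


def sequence_lengths(padded_batch, pad_id=PAD_ID):
    """Return the number of non-padding tokens in each right-padded row."""
    return [sum(1 for token in row if token != pad_id) for row in padded_batch]


# 3 sentences padded to max_time = 5
encoder_inputs = [
    [4, 9, 2, 0, 0],
    [7, 3, 8, 5, 2],
    [6, 2, 0, 0, 0],
]
seq_lens = sequence_lengths(encoder_inputs)
print(seq_lens)                                     # per-sentence lengths without padding
print(len(encoder_inputs), len(encoder_inputs[0]))  # batch_size, max_time
```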

What does embedding do in tensorflow

I am reading an example of using RNN with tensorflow here: ptb_word_lm.py
I can't figure out what the embedding and embedding_lookup are doing here. How can it add another dimension to the tensor, going from (20, 25) to (20, 25, 200)? In this case (20, 25) is a batch size of 20 with 25 time steps. I can't understand how/why you can add the hidden_size of the cell as a dimension of the input data. Typically the input data would be a matrix of size [batch_size, num_features], and the model would map num_features ---> hidden_dims with a matrix of size [num_features, hidden_dims], yielding an output of size [batch_size, hidden_dims]. So how can hidden_dims be a dimension of the input tensor?
input_data, targets = reader.ptb_producer(train_data, 20, 25)
cell = tf.nn.rnn_cell.BasicLSTMCell(200, forget_bias=1.0, state_is_tuple=True)
initial_state = cell.zero_state(20, tf.float32)
embedding = tf.get_variable("embedding", [10000, 200], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_data)
input_data # <tf.Tensor 'PTBProducer/Slice:0' shape=(20, 25) dtype=int32>
inputs # <tf.Tensor 'embedding_lookup:0' shape=(20, 25, 200) dtype=float32>
outputs = []
state = initial_state
for time_step in range(25):
    if time_step > 0:
        tf.get_variable_scope().reuse_variables()
    cell_output, state = cell(inputs[:, time_step, :], state)
    outputs.append(cell_output)
output = tf.reshape(tf.concat(1, outputs), [-1, 200])
outputs # list of 25: <tf.Tensor 'BasicLSTMCell/mul_2:0' shape=(20, 200) dtype=float32>
output # <tf.Tensor 'Reshape_2:0' shape=(500, 200) dtype=float32>
softmax_w = tf.get_variable("softmax_w", [config.hidden_size, config.vocab_size], dtype=tf.float32)
softmax_b = tf.get_variable("softmax_b", [config.vocab_size], dtype=tf.float32)
logits = tf.matmul(output, softmax_w) + softmax_b
loss = tf.nn.seq2seq.sequence_loss_by_example([logits], [tf.reshape(targets, [-1])],[tf.ones([20*25], dtype=tf.float32)])
cost = tf.reduce_sum(loss) / batch_size
OK, I'm not going to try to explain this specific code, but I will try to answer the "what is an embedding?" part of the title.
Basically, it's a mapping of the original input data into some set of real-valued dimensions, and the "position" of the original input data in those dimensions is organized to improve the task.
In TensorFlow, imagine some text input field has the values "king", "queen", "girl" and "boy", and you have 2 embedding dimensions. Hopefully the backprop will train the embedding to put the concept of royalty on one axis and gender on the other. So in this case, what was a categorical feature with 4 values gets "boiled down" to a floating-point embedding feature with 2 dimensions.
Embeddings are implemented using a lookup table, indexed either by a hash of the original value or by a dictionary ordering. For a fully trained one, you might put in "Queen" and get out, say, [1.0, 1.0], or put in "Boy" and get out [0.0, 0.0].
TensorFlow does backprop of the error INTO this lookup table, and hopefully what starts off as a randomly initialized dictionary will gradually become like what we see above.
Hope this helps. If not, look at: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
At its simplest:
input_data: batch of sequences of word IDs (with shape (20, 25))
inputs: batch of sequences of word embeddings (with shape (20, 25, 200))
How does input_data become inputs, you might ask? This is what word embeddings do. The easiest way to imagine it is:
1. Unwrap input_data into a single batch of shape (20*25,).
2. Assign a vector of size 200 to each element of that unwrapped input_data, which gives you a matrix of shape (20*25, 200).
3. Reshape the matrix to shape (20, 25, 200).
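The steps above amount to replacing every word ID by the corresponding row of the embedding table. A minimal plain-Python sketch, using a tiny made-up vocabulary and embedding size instead of the 10000 x 200 table from the question; the embedding_lookup function here is a stand-in written for this example, not the TensorFlow one:

```python
import random


def embedding_lookup(embedding, ids):
    """Replace every word id in a [batch, time] nested list by its embedding row."""
    return [[embedding[word_id] for word_id in sentence] for sentence in ids]


vocab_size, embedding_size = 10, 4
random.seed(0)
# Randomly initialised table of shape (vocab_size, embedding_size)
embedding = [[random.uniform(-1, 1) for _ in range(embedding_size)]
             for _ in range(vocab_size)]

input_data = [[1, 5, 5], [0, 9, 2]]  # shape (2, 3): batch of word ids
inputs = embedding_lookup(embedding, input_data)

print(len(inputs), len(inputs[0]), len(inputs[0][0]))  # prints: 2 3 4
print(inputs[0][1] == inputs[0][2])                    # same id -> same vector
```

The extra dimension is therefore just the width of the table: an ID array of shape (2, 3) becomes (2, 3, 4) here, in the same way that (20, 25) becomes (20, 25, 200) in the question.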
This works because embedding learning is not a time-series process. You learn word embeddings with a feed-forward network. The next important question is how you learn the word embeddings:
1. Initialise a huge TensorFlow variable of size (vocabulary_size, 200) (i.e. embedding in the code).
2. Optimise the embedding so that a given word is able to predict any word from its context (e.g. in "dog barked at the mailman", if "at" is the target word, then "dog", "barked", "the" and "mailman" are context words).
This process gives you a vector (200 long in this example) for each word, such that semantics are preserved (i.e. the vector of "dog" is close to "cat", but far away from "pen").
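The target/context idea can be sketched as follows. This assumes a skip-gram-style setup with a window of 2 words on each side; the answer does not name a specific algorithm or window size, so both are assumptions made for this example, as is the helper name context_pairs.

```python
def context_pairs(sentence, window):
    """Return (target, context) word pairs within `window` positions of each target."""
    words = sentence.split()
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs


pairs = context_pairs("dog barked at the mailman", window=2)
at_context = [context for target, context in pairs if target == "at"]
print(at_context)
```

Pairs like these would be the training examples: the embedding row of the target word is adjusted so that it becomes better at predicting its context words.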
