Tensorflow: Understanding the layer structure of LSTM model - python

I'm new to TensorFlow and LSTMs and I'm having some trouble understanding the shape and structure of the network (weights, biases, shape of inputs and logits).
In this specific piece of code taken from here:
def recurrent_neural_network(x):
    layer = {'weights': tf.Variable(tf.random_normal([rnn_size, n_classes])),
             'biases': tf.Variable(tf.random_normal([n_classes]))}

    x = tf.transpose(x, [1, 0, 2])
    x = tf.reshape(x, [-1, chunk_size])
    x = tf.split(x, n_chunks, 0)

    lstm_cell = rnn_cell.BasicLSTMCell(rnn_size, state_is_tuple=True)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    output = tf.matmul(outputs[-1], layer['weights']) + layer['biases']
    return output
Can someone please explain why we need to convert x to this specific format (transpose -> reshape -> split)?
Why are the weights defined as [rnn_size, n_classes] and the biases as [n_classes]?
What is the exact structure of the network that is being formed, and how are the weights connected? I don't understand that properly.
Is there any site or reference that I could read that would help?
Thanks.

For the general network structure: LSTMs are an extension of RNNs. For an explanation of RNN network structure, take a look at this classic blog post.
For the actual LSTMs, try this post (which also has an RNN explanation).
These are not very formal, but they should be much easier to read and understand than academic papers.
Once you read these, the rest should not be very hard. The reason for the transformations of x is that this is the format static_rnn expects: a Python list of n_chunks tensors, one per time step, each of shape [batch_size, chunk_size]. And rnn_size is the size of the LSTM cell's output, which is why the output weights are shaped [rnn_size, n_classes] and the biases [n_classes].
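As a rough illustration of those shapes (a minimal sketch; the batch size of 128 and the MNIST-style 28x28 chunking are my own assumptions, not part of the code above):
import tensorflow as tf

n_chunks, chunk_size, rnn_size = 28, 28, 128
x = tf.zeros([128, n_chunks, chunk_size])   # [batch_size, n_chunks, chunk_size]

x = tf.transpose(x, [1, 0, 2])              # [n_chunks, batch_size, chunk_size]
x = tf.reshape(x, [-1, chunk_size])         # [n_chunks * batch_size, chunk_size]
x = tf.split(x, n_chunks, 0)                # list of n_chunks tensors,
                                            # each of shape [batch_size, chunk_size]
# static_rnn consumes exactly this: one [batch_size, input_size] tensor per
# time step. Each per-step output has shape [batch_size, rnn_size], so the
# final matmul with the [rnn_size, n_classes] weights maps it to class scores.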

Related

How are keras tensors connected to layers that create them

In the book "Machine Learning with scikit-learn and Tensorflow" there's a code fragment I can't wrap my head around. Until that chapter, their models were only explicitly using layers - be it in a sequential fashion, or functional. But in the chapter 16, there's this:
import tensorflow_addons as tfa
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
And then he just runs the model in a standard way:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)
I have trouble understanding the code starting at line 7. The author creates an Embedding layer which he immediately calls on encoder_inputs and decoder_inputs; then he does basically the same with the LSTM layer, which he calls on the previously created encoder_embeddings, and the tensors returned by this operation are used in the code slightly below. What I don't get here is how those tensors are trained. It looks like he's not using the layers that create them in the model, but if so, then how come the embeddings are learned and the whole model converges?
To understand this overall flow, you must understand how things work under the hood. TensorFlow uses graph execution when building the model. When you pass [encoder_inputs, decoder_inputs, sequence_lengths] as the inputs and [Y_proba] as the output, the model doesn't immediately start training; first it builds the model. Building here means constructing a computational graph and storing it. model.compile() does this for you: it builds the computational graph.
Let me explain further. Suppose I want to compute a + b = c, b + d = 2, and finally c * d = 6 using a computational graph. TensorFlow will first create three nodes for it, as shown in the picture below.
As you can see in that picture, TensorFlow does exactly the same thing when you pass your inputs and outputs. The picture depicts the forward pass. The same graph is then used by TensorFlow to do the backward pass, as in the figure below.
So first the computational graph is built, and then that same graph is used to compute the forward pass and the backward pass.
The same idea applies to the computational graph of your complete model. But how does it work in your case specifically? The model asks how Y_proba is produced. The graph consists of the operations and tensors created between the inputs and outputs. The Embedding layer is created and applied to the inputs encoder_inputs and decoder_inputs to obtain encoder_embeddings and decoder_embeddings, respectively. The LSTM layer is applied to encoder_embeddings to produce encoder_outputs, state_h, and state_c. These tensors are then passed as inputs to the BasicDecoder layer, which combines the decoder_embeddings, encoder_state (constructed from state_h and state_c), and sequence_lengths to produce final_outputs, final_state, and final_sequence_lengths. Finally, the softmax function is applied to the rnn_output of final_outputs to produce the final output Y_proba.
All the tensors mentioned in the paragraph above are intermediate nodes in the computational graph.
So the graph starts from the inputs and flows down to Y_proba. While the graph is being built, the weights of the model and other parameters are also initialized. The graph is built once, and it is then easy to compute the forward pass and backward pass with it.
How are these layers trained and optimized for convergence?
When you specify inputs=[encoder_inputs, decoder_inputs, sequence_lengths] and outputs=[Y_proba], the model knows the intermediate layers that are used to compute Y_proba from encoder_inputs, decoder_inputs, and sequence_lengths. These intermediate layers are the Embedding and LSTM layers, as well as the TrainingSampler, LSTMCell, Dense, and BasicDecoder layers. They are automatically included in the computation graph of the model, which allows the optimizer to update their parameters during training.
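You can see this tracking with a much smaller model (a minimal sketch with made-up sizes, not the book's code): every layer called on the path from the inputs to the output shows up in model.layers, and its weights appear in model.trainable_variables, which is what the optimizer updates.
import numpy as np
from tensorflow import keras

inp = keras.layers.Input(shape=[None], dtype=np.int32)
emb = keras.layers.Embedding(100, 8)       # created as a standalone object...
x = emb(inp)                               # ...then wired to the input tensor
out = keras.layers.LSTM(16)(x)
model = keras.models.Model(inputs=[inp], outputs=[out])

print([layer.name for layer in model.layers])   # the embedding and lstm layers are listed
print(len(model.trainable_variables))           # their weights are trainable parameters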
This is an example of the Keras Functional API. In this style of defining a model, you write the blueprints first, and then use it later. Think of it like wiring a circuit: while you're connecting things, there's no electricity flowing through them (the electricity corresponds to data in our metaphor). Later, you turn on the power source, and electricity flows through.
This is how the Functional API works as well. First, let's read the last line:
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])
This says "Hey Keras, I need a model whose inputs are encoder_inputs, decoder_inputs, and sequence_lengths, and they will eventually produce Y_proba. The details of how they will produce this output is defined above. Let's look at the specific lines you're having trouble with:
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
The first of these says, "Keras, give me a layer that will produce embeddings." embeddings is a layer object. The next two lines are the wiring we talked about: you can connect layers preemptively, before data flows through them; that's the crux of the Functional API. So the second line says, "Keras, encoder_inputs, which is an Input, will connect to (and go through) the Embedding layer I just created, and the result of that will be in a variable I call encoder_embeddings."
The rest of the code follows the same logic: you're connecting the wires together, before you compile your model and eventually fit it with the data.
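To make the wiring metaphor concrete, here is a tiny sketch (arbitrary shapes and layer sizes, not from the book): calling a layer on an Input only produces a symbolic tensor; data actually flows through the wires only when you call fit or predict on the finished model.
import numpy as np
from tensorflow import keras

inp = keras.layers.Input(shape=(10,))
dense = keras.layers.Dense(4)          # the blueprint of a layer
out = dense(inp)                       # wiring: no data has flowed yet
print(out)                             # a symbolic tensor, no values

model = keras.models.Model(inputs=inp, outputs=out)
print(model.predict(np.zeros((2, 10))).shape)   # power on: data flows, prints (2, 4)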

Adding softmax layer to LSTM network "freezes" output

I've been trying to teach myself the basics of RNNs with a personal project in PyTorch. I want to produce a simple network that is able to predict the next character in a sequence (the idea comes mainly from this article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ but I wanted to do most of the stuff myself).
My idea is this: I take a batch of B input sequences of size n (a NumPy array of n integers), one-hot encode them, and pass them through my network composed of several LSTM layers, one fully connected layer, and one softmax unit.
I then compare the output to the target sequences, which are the input sequences shifted one step ahead.
My issue is that when I include the softmax layer, the output is the same every single epoch for every single batch. When I don't include it, the network seems to learn appropriately. I can't figure out what's wrong.
My implementation is the following:
class Model(nn.Module):
    def __init__(self, one_hot_length, dropout_prob, num_units, num_layers):
        super().__init__()
        self.num_units = num_units
        self.LSTM = nn.LSTM(one_hot_length, num_units, num_layers, batch_first=True, dropout=dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)
        self.fully_connected = nn.Linear(num_units, one_hot_length)
        self.softmax = nn.Softmax(dim=1)
        # dim = 1 as the tensor is of shape (batch_size*seq_length, one_hot_length) when entering the softmax unit

    def forward_pass(self, input_seq, hc_states):
        output, hc_states = self.LSTM(input_seq, hc_states)
        output = output.view(-1, self.num_units)
        output = self.fully_connected(output)
        # I simply comment out the next line when I run the network without the softmax layer
        output = self.softmax(output)
        return output, hc_states
one_hot_length is the size of my character dictionary (~200; it's also the size of a one-hot encoded vector).
num_units is the number of hidden units in an LSTM cell, and num_layers is the number of LSTM layers in the network.
The inside of the training loop (simplified) goes as follows:
input, target = next_batches(data, batch_pointer)
input = nn.functional.one_hot(input, num_classes=one_hot_length).float()
for state in hc_states:
    state.detach_()
optimizer.zero_grad()
output, hc_states = net.forward_pass(input, hc_states)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()
nn.utils.clip_grad_norm_(net.parameters(), MaxGradNorm)
optimizer.step()
Here hc_states is a tuple of the hidden states tensor and the cell states tensor, input is a tensor of size (B, n, one_hot_length), and target is (B, n).
I'm training on a really small dataset (sentences in a .txt of ~400 KB) just to tune my code, and I did 4 different runs with different parameters; each time the outcome was the same: the network doesn't learn at all when it has the softmax layer, and trains somewhat appropriately without it.
I don't think it is an issue with tensor shapes, as I'm almost sure I checked everything.
My understanding of my problem is that I'm trying to do classification, and that the usual thing is to put a softmax unit at the end to get "probabilities" of each character appearing, but clearly this isn't right.
Any ideas to help me?
I'm also fairly new to PyTorch and RNNs, so I apologize in advance if my architecture/implementation is some kind of monstrosity to a knowledgeable person. Feel free to correct me, and thanks in advance.

Can I use loops inside a model using functional API?

I have a trained Keras model which takes inputs of size (batchSize, 2). This works well and gives good results.
My main problem is to have a model which takes as input a vector of size (batchSize, 2, 16), slices it inside the model into 16 vectors of size (batchSize, 2), and concatenates the outputs together.
I have used this code for this:
y = layers.Input(shape=(2,16,))
model_x = load_model('saved_model')

for i in range(16):
    x_input = Lambda(lambda x: x[:, :, i])(y)
    if i == 0:
        x_output = model_x(x_input)
    else:
        x_output = layers.concatenate([x_output,
                                       model_x(x_input)])

x_output = Lambda(lambda x: x[:, :tf.cast(N, tf.int32)])(x_output)
final_model = Model(y, x_output)
Although the saved model gives me good performance, this code does not train well and doesn't give the intended performance.
What can I do to get better results?
I can't say anything about the bad performance of your final model because it might be due to various reasons and this is not readily evident from the content of your question. But to answer your original question: yes, you can use for loops that way, because you are essentially creating layers/tensors and connecting them to each other (i.e. building the graph of the model). So it's a valid thing to do. The problem might be somewhere else, e.g. a wrong indexing, a wrong loss function, etc.
Further, you can build your final model in a much simpler approach. You already have a trained model which gets inputs of shape (batch_size, 2) and gives outputs of shape (batch_size, 8). Now you want to build a model which takes inputs of shape (batch_size, 2, 16), apply the already trained model on each of the 16 (batch_size, 2) segments and then concatenate the results. You can easily do that with a TimeDistributed wrapper:
from keras import layers
from keras.models import Model, load_model

# load your already trained model
model_x = load_model('saved_model')

inp = layers.Input(shape=(2, 16))
# this makes the input shape (None, 16, 2)
x = layers.Permute((2, 1))(inp)
# this applies `model_x` on each of the 16 segments; the output shape would be (None, 16, 8)
x = layers.TimeDistributed(model_x)(x)
# flatten to make it have a shape of (None, 128)
out = layers.Flatten()(x)

final_model = Model(inp, out)
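As a quick, hypothetical sanity check of the shapes (assuming, as above, that the saved model maps (batch_size, 2) to (batch_size, 8)):
import numpy as np

dummy = np.random.rand(4, 2, 16).astype('float32')
print(final_model.predict(dummy).shape)   # expected: (4, 128), i.e. 16 segments * 8 outputs each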

Feeding LSTMCell with whole sentences using embeddings gives dimensionality error

So currently I'm sitting on a text-classification problem, but I can't even set up my model in TensorFlow. I have a batch of sentences of length 70 (using padding) and I'm using an embedding_lookup with an embedding size of 300. Here is the code for the embedding:
embedding = tf.constant(embedding_matrix, name="embedding")
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
So now inputs should be of shape [batch_size, sentence_length, embedding_size], which is not surprising. Sadly, I'm now getting a ValueError for my LSTMCell since it expects ndim=2 and obviously inputs has ndim=3. I have not found a way to change the expected input shape of the LSTM layer. Here is the code for my LSTMCell initialization:
cells = []
for i in range(num_layers):
    cells.append(LSTMCell(num_units, forget_bias, state_is_tuple, reuse=reuse, name='lstm_{}'.format(i)))
cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)
The error is triggered in the call to the cell, which looks like this:
for step in range(self.num_steps):
    if step > 0: tf.get_variable_scope().reuse_variables()
    (cell_output, state) = cell(inputs[:, step, :], state)
Similar question but not helping: Understanding Tensorflow LSTM Input shape
I could solve the problem myself. As it turns out, the LSTMCell implementation is more hands-on and low-level with respect to how an LSTM actually works. The Keras LSTM layers take care of things that I need to handle myself when using TensorFlow at this level. The example I'm using is from the following official TensorFlow example:
https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
As we want to feed our LSTM layers with a sequence, we need to feed the cell one word after another. Since a call to the cell produces two outputs (cell output and cell state), we use a loop over all word positions in the sentences to feed the cell and reuse our cell state. This way we create the outputs of our layers, which we can then use for further operations. The code for this looks like this:
self._initial_state = cell.zero_state(config.batch_size, data_type())
state = self._initial_state

outputs = []
with tf.variable_scope("RNN"):
    for time_step in range(self.num_steps):
        if time_step > 0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)

output = tf.reshape(tf.concat(outputs, 1), [-1, config.hidden_size])
num_steps represents the number of words per sentence that we are going to use.

Keras RNN Regression Input dimensions and architecture

I have recently constructed a CNN in Keras (with TensorFlow as the backend) that takes stellar spectra as an input and predicts three stellar parameters as outputs: Temperature, Surface Gravity, and Metallicity. I am now trying to create an RNN that does the same thing in order to compare the two models.
After searching through examples and forums I haven't come across many applications that are similar enough to my project. I have tried implementing a simple RNN to see if I can come up with sensible results, but no luck so far: the networks don't seem to be learning at all.
I could really use some guidance to get me started. Specifically:
Is an RNN an appropriate network for this type of problem?
What is the correct input shape for the model? I know this depends on the architecture of the network, so I guess that my next question would be: what is a simple architecture to start with that is capable of computing regression predictions?
My input data is such that I have m=50,000 spectra each with n=7000 data points, and L=3 output labels that I am trying to learn. I also have test sets and cross-validation sets with the same n & L dimensions.
When structuring my input data as (m,n,1) and my output targets as (m,L) and using the following architecture, the loss doesn't seem to decrease.
n = 7000
L = 3

## train_X.shape = (50000, n, 1)
## train_Y.shape = (50000, L)
## cv_X.shape = (10000, n, 1)
## cv_Y.shape = (10000, L)

batch_size = 32
lstm_layers = [16, 32]
input_shape = (None, n, 1)

model = Sequential([
    InputLayer(batch_input_shape=input_shape),
    LSTM(lstm_layers[0], return_sequences=True, dropout_W=0.2, dropout_U=0.2),
    LSTM(lstm_layers[1], return_sequences=False),
    Dense(L),
    Activation('linear')
])

model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_X, train_Y, batch_size=batch_size, nb_epoch=20,
          validation_data=(cv_X, cv_Y), verbose=2)
I have also tried changing my input shape to (m, 1, n) and still haven't had any success. I am not looking for an optimal network, just something that trains so that I can take it from there. My input data isn't a time series, but there are relationships between one section of the spectrum and the previous section, so is there a way I can structure each spectrum into a 2D array that will allow an RNN to learn stellar parameters from the spectra?
First you set
train_X.shape = (50000, n, 1)
Then you write
input_shape = (None, 1, n)
Why don't you try
input_shape = (None, n, 1) ?
It would make more sense for your RNN to receive a sequence of n timesteps and 1 value per time step than the other way around.
Does it help? :)
EDIT:
Alright, after re-reading, here are my two cents on your questions: the LSTM isn't a good idea here.
1) Because there is no "temporal" information, there isn't a "direction" in the spectrum. An LSTM is good at capturing a changing state of the world, for example; it's not the best at combining information at the beginning of your spectrum with information at the end. It will "read" starting from the beginning, and that information will fade as the state is updated. You could try a bidirectional LSTM to counter the fact that there is no direction. However, see the second point.
2) 7000 timesteps is way too many for an LSTM to handle. When it trains, at the backpropagation step, the LSTM is unrolled and the information has to flow back through "7000 layers" (not actually 7000 separate layers, because they share the same weights). This is very, very difficult to train. I would limit an LSTM to about 100 steps at most (from my experience).
Otherwise your input shape is correct :)
Have you tried a deep fully connected network? I believe this would be more efficient.
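For example, a minimal fully connected baseline could look something like this (a sketch only; the layer sizes are arbitrary guesses, using the same old-style Keras API as in the question):
from keras.models import Sequential
from keras.layers import Dense

n, L = 7000, 3

model = Sequential([
    Dense(256, activation='relu', input_shape=(n,)),
    Dense(64, activation='relu'),
    Dense(L, activation='linear'),
])
model.compile(loss='mean_squared_error', optimizer='adam')

# the spectra would need to be flattened from (m, n, 1) to (m, n) first, e.g.:
# model.fit(train_X.reshape(-1, n), train_Y, batch_size=32, nb_epoch=20,
#           validation_data=(cv_X.reshape(-1, n), cv_Y), verbose=2)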
