Meaning of the hidden state in Keras LSTM - python

Since I am new to deep learning, this question may seem funny to you, but I cannot visualize it in my mind. That's why I am asking.
I am feeding a sentence as a vector to the LSTM. Say I have a sentence that contains 10 words; I convert that sentence to vectors and feed them to the LSTM.
The number of LSTM time steps should then be 10. But in most of the tutorials I have seen, 128 hidden units are added, and I cannot understand or visualize that. What does an LSTM layer with a "128-dimensional hidden state" mean?
for example:
X = LSTM(128, return_sequences=True)(embeddings)
The summary of this looks like:
lstm_1 (LSTM) (None, 10, 128) 91648
Here it looks like 10 LSTM cells are added, but why are there 128 hidden states? I hope you understand what I am asking.

Short Answer:
If you are more familiar with convolutional networks, you can think of the size of the LSTM layer (128) as the equivalent of the size of a convolutional layer. The 10 only reflects the size of your input (the length of your sequence is 10).
Longer Answer:
You can check this article for a more detailed discussion of RNNs.
In the left image, an LSTM layer is represented with x_t as the input and h_t as the output. The feedback arrow shows that there is some kind of memory inside the cell.
In practice in Keras (right image), this model is "unrolled" so that the whole input sequence is given to the layer in parallel.
So when your summary is:
lstm_1 (LSTM) (None, 10, 128) 91648
It means that your input sequence length is 10 (x0, x1, x2, ..., x9) and that the size of your LSTM is 128 (128 is the dimension of each output h_t).
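To make the two numbers concrete, here is a minimal sketch; the 8-dimensional embedding size is an assumption chosen only for illustration, while the 10 and the 128 come from the question:
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

# 10 time steps (words), each an 8-d embedding vector (the 8 is arbitrary here).
inputs = Input(shape=(10, 8))
# 128 = dimension of the hidden state h_t produced at every time step.
outputs = LSTM(128, return_sequences=True)(inputs)
model = Model(inputs, outputs)

x = np.zeros((1, 10, 8))
print(model.predict(x).shape)  # (1, 10, 128): one 128-d hidden state per word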

Related

Neural network hidden layer vs. Convolutional hidden layer intuition

If there are 10 features and 1 output class (sigmoid activation) with a regression objective:
If I use only 5 neurons in my first dense hidden layer, will the first error be calculated solely based on half of the training feature set? Isn't it imperative to match the # of features with the neurons in hidden layer #1, so that the model can see all the features at once? Otherwise it's not getting the whole picture? The first fwd propagation iteration would use 5 out of 10 features and get the error value (and train during backprop; assume batch gradient descent). Then the 2nd fwd propagation iteration would see the remaining 5 out of 10 features with updated weights and hopefully arrive at a smaller error. But it's only seeing half the features at a time!
Conversely, if I have a convolutional 2D layer of 64 neurons, and my training shape is (100, 28, 28, 1) (pictures of cats and dogs in greyscale), will each of the 64 neurons see a different 28x28 vector? No, right? Because it can only send one example through the forward propagation at a time, so only a single picture (cat or dog) should be spanned across the 64 neurons? Why would you want that, since each neuron in that layer has the same filter, stride, padding and activation function? When you define a Conv2D layer... the parameters of each neuron are the same. So is only a part of the training example going into each neuron? Why have 64 neurons, for example? Just have one neuron, use a filter on it and pass it along to a second hidden layer with another filter with different parameters!
Please explain the flaws in my logic. Thanks so much.
EDIT: I just realized that for Conv2D you flatten the training data so it becomes a 1D vector, and so a 28x28 image would mean having an input layer of 784 neurons. But I am still confused about the dense neural network (paragraph #1 above).
What is your "first" layer?
Normally you have an input layer as the first layer, which does not contain any weights.
The shape of the input layer must match the shape of your feature data.
So when you train a model with 10 features but only have an input layer of shape (None, 5) (where None stands for the batch_size), TensorFlow will raise an exception, because it needs data for all inputs in the correct shape.
So what you describe is just not going to happen. If you only feed 5 features, the remaining 5 features won't be fed into the net in the next iteration; instead, the next sample will be sent to the model (assuming no exception is thrown), and again only its first 5 features would be used.
What you can do instead is use an input layer as the first layer with the correct shape for your features. Then, as the second layer, you can use any size you like: 1, 10, or 100 dense neurons, it's up to you (and what works well, of course). The shape of the output must in turn match the shape of your label data. A minimal sketch follows below.
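Here is that sketch; the 10 features and the 5 hidden neurons are taken from the question, while the activations are assumptions:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# The input layer must match the feature count: 10 features -> shape (10,).
inputs = Input(shape=(10,))
# The hidden layer can have any size, e.g. 5; each of the 5 neurons still
# receives all 10 features through its own 10 weights.
hidden = Dense(5, activation='relu')(inputs)
output = Dense(1, activation='sigmoid')(hidden)
model = Model(inputs, output)
model.summary()
# Feeding data of shape (batch, 5) instead of (batch, 10) raises a shape error.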
I hope this makes it clearer.

BiLSTM (Bidirectional Long Short-Term Memory Networks) with MLP (Multi-layer Perceptron)

I am trying to implement the network architecture of the paper Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks by Ruiqing Yin, Herve Bredin, and Claude Barras, which is described as follows:
The model is composed of two Bi-LSTMs (Bi-LSTM 1 and 2) and a multi-layer perceptron (MLP) whose weights are shared across the sequence. Bi-LSTM 1 has 64 outputs (32 forward and 32 backward). Bi-LSTM 2 has 40 (20 each). The fully connected layers are 40-, 10- and 1-dimensional respectively. The outputs of the forward and backward LSTMs are concatenated and fed forward to the next layer. The shared MLP is made of three fully connected feedforward layers, using a tanh activation function for the first two layers and a sigmoid activation function for the last layer, in order to output a score between 0 and 1.
I have taken reference from various sources and come up with the following code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(40, return_sequences=True)))
model.add(TimeDistributed(Dense(40, activation='tanh')))
model.add(TimeDistributed(Dense(10, activation='tanh')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.build(input_shape=(None, 200, 35))
model.summary()
I am confused by the TimeDistributed layer and how it can simulate an MLP, and also by how the weights are shared. Can you at least point out whether I am doing this right or not?
As the architecture in the paper suggests, you basically want to push each of the hidden states (which are themselves time distributed) into separate dense layers (thus forming an MLP at each time step). The summary of your model looks like this:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional (Bidirectional (None, 200, 128) 51200
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 80) 54080
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 40) 3240
_________________________________________________________________
time_distributed_1 (TimeDist (None, 200, 10) 410
_________________________________________________________________
time_distributed_2 (TimeDist (None, 200, 1) 11
=================================================================
Total params: 108,941
Trainable params: 108,941
Non-trainable params: 0
The Bi-LSTM here is set to return_sequences=True, therefore it returns the hidden-state sequence to the subsequent layer. If you pushed this sequence into a plain Dense layer it wouldn't make sense, since you would be passing it a 3D tensor (batch, time, feature). If you want to form a Dense network at each time step, you need it to be time distributed.
As the output shape suggests, this layer creates a 40-node layer at each of the 200 time steps output by the Bi-LSTM before it (the hidden states). Each of these is then stacked with a 10-node layer as well (None, 200, 10). The same logic follows for the final layer.
If your doubt is about what TimeDistributed layers are, per the official documentation:
This wrapper allows applying a layer to every temporal slice of an input.
The final goal is speaker change detection, meaning that you want to predict the speaker, or the probability of a speaker change, at each of the 200 time steps. Therefore the output layer returns 200 scores (None, 200, 1).
Hope that solves your confusion.
Another intuitive way of looking at it:
Your Bi-LSTM is set to return sequences instead of just the final features. Each time step in the returned sequence needs a Dense network of its own. A TimeDistributed Dense is basically a layer that takes in an input sequence and feeds each time step to separate dense nodes. So, instead of having 40 nodes like a standard Dense layer, it has 200 x 40 nodes, where the input to, say, the 3rd set of 40 nodes is the 3rd time step from the Bi-LSTM. This simulates a time-distributed MLP over the Bi-LSTM sequences.
A good visual intuition that I prefer when working with LSTMs -
If you DON'T return sequences, the output of the LSTM is just a single value, h_t.
If you return sequences, the output is the whole sequence (h0 to ht).
Adding a Dense layer in the first case will only take h_t as input. In the second case, you will need a TimeDistributed Dense, which "stacks" on top of each of h0 to ht, as the sketch below shows.
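A small sketch of both cases; the LSTM size of 32 is arbitrary, chosen only to make the output shapes visible, while the (200, 35) input matches the question:
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

x = np.zeros((1, 200, 35))
inputs = Input(shape=(200, 35))

# Without return_sequences: only the final hidden state h_t comes out.
last = LSTM(32)(inputs)
print(Model(inputs, Dense(1)(last)).predict(x).shape)  # (1, 1)

# With return_sequences: the whole sequence h0..ht comes out, and
# TimeDistributed applies the same Dense (shared weights) at every time step.
seq = LSTM(32, return_sequences=True)(inputs)
print(Model(inputs, TimeDistributed(Dense(1))(seq)).predict(x).shape)  # (1, 200, 1)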

Concatenate two inputs of different dimension at a specific index in a sequence in Keras

I am trying to train an LSTM using two types of embedding layers. Let's say that the following is my tokenized sentence:
tokenized_sentence = ['I', 'am', 'trying', 'to', 'TARGET_TOKEN', 'my', 'way', 'home']
Now, for the words surrounding "TARGET_TOKEN" I have an embedding layer of shape (vocab_size x 128), while for the token 'TARGET_TOKEN' at index 4 I have an embedding layer of shape (vocab_size x 512). So I need to transform the TARGET_TOKEN embedding from 512 to 128 dimensions and then insert this 128-dimensional vector at index 4 (this index will change depending on the feature) of the output of the surrounding-words embedding layer, before feeding the concatenated tensor to the LSTM. In my situation the positioning of the words/tokens is very important, so I do not wish to lose the position of the token 'TARGET_TOKEN' in the sentence.
Initially I looked at how to reduce the size of the 512-dimensional embedding, and found that with numpy I could take the average of every 4 adjacent values and thus end up with 128 dimensions instead of 512. However, as I understand it, this might no longer represent the vector correctly.
Let's call the token 'TARGET_TOKEN' the "target_token" and the rest of the words the "context_tokens". After further reading, I instead thought I could take the output of the target_token embedding layer and pass it through a Dense layer with 128 units (thus reducing its size to 128). Following this, I would concatenate the output of the Dense layer with the output of the context_tokens embedding layer. So far I know how to do this. My issue is that positioning is important, and it is important that my LSTM learns the target_token embedding with respect to its surrounding context. So, long story short, I need to concatenate at index 4 (maybe I'm looking at this the wrong way, but that's how I understand it).
However, the Concatenate layer in Keras does not have such a parameter, and I can only concatenate the two layers without taking the positioning into account.
My model will take three inputs:
input1 = target_token
input2 = context_tokens
input3 = target_token_index
and one output (as a sequence).
My code looks like this:
from tensorflow.keras.layers import (Input, Embedding, Dense, Concatenate, Bidirectional,
                                     LSTM, BatchNormalization, TimeDistributed)
from tensorflow.keras.models import Model

target_token_input = Input((1,))
sentence_input = Input((None,))
index_input = Input((1,), dtype="int32")

# weights list elided in the original post
target_token_embedding_layer = Embedding(500, 512, weights=[], trainable=False)(target_token_input)
target_token_dense_layer = Dense(128, activation="relu")(target_token_embedding_layer)
context_embedding_layer = Embedding(self.vocab_size, 128, weights=[self.weight_matrix], trainable=False)(sentence_input)
concatenation_layer = Concatenate()([target_token_dense_layer, context_embedding_layer])
bidirectional = Bidirectional(LSTM(64, return_sequences=self.return_sequences, dropout=0.2, recurrent_dropout=0.2))(concatenation_layer)
normalization_layer = BatchNormalization()(bidirectional)
output_layer = TimeDistributed(Dense(self.output_size, activation=self.activation))(normalization_layer)
model = Model([target_token_input, sentence_input, index_input], [output_layer])
My expected result is the following, where the numbers represent the dimensions of the token vectors:
original_tokens = ['I', 'am', 'trying', 'to', 'eng-12345', 'my', 'way', 'home']
vector_original_tokens = [128, 128, 128, 128, 512, 128, 128, 128]
post_concatenation_tokens = [128, 128, 128, 128, 128, 128, 128, 128]
Notice how at index 4 the embedding went from 512 to 128. I am looking into the possibility of transforming the tensor into a list, inserting the output of the target_token_embedding_layer into this list at the desired index, and then transforming the list back into a tensor to use as input for the LSTM. However, I'm still trying to figure this out.
Does anyone know how to do this? Any help would be greatly appreciated!
Umm, the short answer is: you can't. You could do it programmatically, so that things would look the way you want them to, but the Keras LSTM wouldn't make sense of it. The Keras LSTM needs to understand the connections between tokens. When you take a list of embeddings from one "universe" and try to mash it together with embeddings from another "universe", things just don't work.
You will have to tokenize/embed all the words with the same dimensions for it to work. I assume the TARGET_TOKEN has different embeddings (512) from the rest of the dictionary? You could create a new embedding of size (128 + 512). However, since you mention that you reduced the original 512 embeddings to 128, my suggestion is that you just go back to the 512 embeddings.
Note how I say Keras LSTM: what I wrote applies to the common, well-known LSTM layer with the standard 4 gates. If you are writing your own variation of the LSTM layer from scratch (say, in TensorFlow or numpy), all bets are off.
I'm guessing you're trying to build some kind of network that fills in the blanks? What was your reason for switching from the 512 embeddings to 128? Saving memory?

The Diagram explanation of the LSTM Network

I am working with an LSTM on my time series forecasting problem. I have the following network:
model = Sequential()
model.add(LSTM(units=300, activation=activation, input_shape=(20, 1)))
model.add(Dense(20))
My forecasting problem is to forecast the next 20 time steps looking back at the last 20 time steps. So, for each iteration, I have an input like (x_t-20, ..., x_t) and forecast the next (x_t+1, ..., x_t+20). For the hidden layer, I use 300 hidden units.
As an LSTM is different from a simple feed-forward neural network, I cannot understand how those 300 hidden units are used by the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells? As I describe above, I have 20 time steps to predict; are these steps all generated from the last LSTM cell? I have no idea. Could someone give a general diagram example of this kind of network structure?
Regarding these questions,
I cannot understand how those 300 hidden units are used by the LSTM cells and how the output comes out. Are there 20 LSTM cells and 300 units for each cell? How is the output generated from these cells?
It is simpler to consider the LSTM layer you have defined as a single block. This diagram is heavily borrowed from Francois Chollet's Deep Learning with Python book:
In your model, input shape is defined as (20,1), so you have 20 time-steps of size 1. For a moment, consider that the output Dense layer is not present.
model = Sequential()
model.add(LSTM(300, input_shape=(20,1)))
model.summary()
lstm_7 (LSTM) (None, 300) 362400
The output shape of the LSTM layer is (None, 300), which means the output is of size 300.
output = model.predict(np.zeros((1, 20, 1)))
print(output.shape)
(1, 300)
input (1,20,1) => batch size = 1, time-steps = 20, input-feature-size = 1.
output (1, 300) => batch size = 1, output-feature-size = 300
Keras ran the LSTM recurrently for 20 time steps and generated an output of size (300). In the diagram above, this is Output t+19.
Now, if you add the Dense layer after the LSTM, the output will be of size 20, which is straightforward, as the sketch below shows.
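Putting it together, a minimal sketch of the full model; the loss, optimizer and the default tanh activation are assumptions for illustration:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# 20 past steps in, 20 future steps out, via the final 300-d hidden state.
model = Sequential()
model.add(LSTM(300, input_shape=(20, 1)))
model.add(Dense(20))  # maps the 300-d state to the 20 forecast values
model.compile(loss='mse', optimizer='adam')
print(model.predict(np.zeros((1, 20, 1))).shape)  # (1, 20)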
To understand LSTMs, I'd recommend first spending a few minutes understanding 'plain vanilla' RNNs, as an LSTM is just a more complex version of that. I'll try to describe what would be happening in your network if it were a basic RNN.
You are training a single set of weights that is repeatedly used at every time step (t-20, ..., t). The first weight matrix (let's say W1) is for inputs. One by one, each of x_t-20, ..., x_t is multiplied by W1, then a non-linear activation function is applied, the same as in any NN forward pass.
The difference with RNNs is the presence of a separate 'state' (note: not a trained weight) that can start off random or zero, and which carries information about your sequence across time steps. There's another weight matrix for the state (W2). So, starting at the first time step t-20, the initial state is multiplied by W2 and an activation function is applied.
So at time step t-20 we have the output from W1 (on the input) and from W2 (on the state). We can combine these outputs at each time step and use the result to generate the state that is passed to the next time step, i.e. t-19. Because the state has to be calculated at each time step and passed to the next, these calculations have to happen sequentially, starting from t-20. To generate our desired output, we can take the output state across all time steps, or only the output at the final time step. As return_sequences=False by default in Keras, you are only using the output at the final time step, which then goes into your Dense layer.
The weights W1 and W2 need to have one dimension equal to the dimension of each time-step input x_t-20, ... for the matrix multiplication to work. This dimension is 1 in your case, as each of the 20 inputs is a 1d vector (a single number), which is multiplied by W1. However, we're free to set the second dimension of the weights as we please: 300 in your case. So W1 is of size 1x300, and it is multiplied 20 times, once for each time step, as the numpy sketch below shows.
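Here is that sketch: a plain-vanilla RNN forward pass in numpy (not an LSTM), using the shapes described above; the random initialisation is arbitrary:
import numpy as np

T, input_dim, hidden_dim = 20, 1, 300
W1 = np.random.randn(input_dim, hidden_dim) * 0.01   # input weights, 1 x 300
W2 = np.random.randn(hidden_dim, hidden_dim) * 0.01  # state weights, 300 x 300
b = np.zeros(hidden_dim)

x = np.random.randn(T, input_dim)  # the sequence x_t-20 ... x_t
h = np.zeros(hidden_dim)           # initial state

for t in range(T):  # sequential: the state at step t depends on the state at t-1
    h = np.tanh(x[t] @ W1 + h @ W2 + b)

print(h.shape)  # (300,): the final state, i.e. what return_sequences=False returns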
This lecture will take you through the basic flow diagram of RNNs that I described above, all the way to more advanced material, which you can skip. This is a famous explanation of LSTMs if you want to make the leap from basic RNNs to LSTMs, which you may not need to do: there are just more complicated weights and states.

Many-to-many classification with Keras LSTM

I am new to RNNs/LSTMs in Keras and need advice on whether and how to use them for my problem, which is many-to-many classification.
I have a number of time series: approximately 1500 "runs", each of which lasts for about 100-300 time steps and has multiple channels. I understand that I need to zero-pad my data to the maximum number of time steps, so my data looks like this:
[nb_samples, timesteps, input_dim]: [1500, 300, 10]
Since getting the label for a single time step is impossible without knowing the past, even for a human, I could do feature engineering and train a classical classification algorithm; however, I think LSTMs would be a good fit here. This answer tells me that for many-to-many classification in Keras I need to set return_sequences to True. However, I do not quite understand how to proceed from there: do I use the returned sequence as input for another, normal layer? How do I connect this to my output layer?
Any help, hints or links to tutorials are greatly appreciated - I found a lot of stuff for many-to-one classification, but nothing good on many-to-many.
There can be many approaches to this; I am describing one that should be a good fit for your problem.
If you want to stack two LSTM layers, then setting return_sequences=True on the first layer lets the second LSTM layer learn from the whole sequence, as shown in the following example.
from keras.layers import Dense, LSTM
from keras import Input, Model

seq_length = 15
input_dims = 10
output_dims = 8  # number of classes
n_hidden = 10

model1_inputs = Input(shape=(seq_length, input_dims,))
net1 = LSTM(n_hidden, return_sequences=True)(model1_inputs)
net1 = LSTM(n_hidden, return_sequences=False)(net1)
net1 = Dense(output_dims, activation='relu')(net1)
model1_outputs = net1
model1 = Model(inputs=model1_inputs, outputs=model1_outputs, name='model1')

## Fit the model
model1.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 15, 10) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 15, 10) 840
_________________________________________________________________
lstm_2 (LSTM) (None, 10) 840
_________________________________________________________________
dense_3 (Dense) (None, 8) 88
_________________________________________________________________
Another option is to use the complete returned sequence as the features for the next layer. In that case, Flatten it and add a simple Dense layer whose input will be [batch, seq_len * lstm_output_dims] (see the sketch after the note below).
Note: these features can be useful for a classification task, but mostly we use a stacked LSTM and use its output, without the complete sequence, as features for the classification layer.
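A minimal sketch of that second option; the softmax head is an assumption, and the sizes match the example above:
from keras.layers import Dense, Flatten, LSTM
from keras import Input, Model

seq_length, input_dims, output_dims, n_hidden = 15, 10, 8, 10
inputs = Input(shape=(seq_length, input_dims))
net = LSTM(n_hidden, return_sequences=True)(inputs)  # (None, 15, 10)
net = Flatten()(net)                                 # (None, 15 * 10)
net = Dense(output_dims, activation='softmax')(net)  # classify from the full sequence
model2 = Model(inputs=inputs, outputs=net)
model2.summary()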
This answer may be helpful to understand other approaches to LSTM architectures for different purposes.
