I am new to RNNs/LSTMs in Keras and need advice on whether/how to use them for my problem, which is many-to-many classification.
I have a number of time series: Approximately 1500 "runs" which each last for about 100-300 time steps and have multiple channels. I understand that I need to zero-pad my data to the maximum number of time steps, so my data looks like this:
[nb_samples, timesteps, input_dim]: [1500, 300, 10]
Since getting the label for a single time step is impossible without knowing the past (even for a human), I could do feature engineering and train a classical classification algorithm; however, I think LSTMs would be a good fit here. This answer tells me that for many-to-many classification in Keras I need to set return_sequences to True. However, I do not quite understand how to proceed from here: do I use the returned sequence as input for another, normal layer? How do I connect this to my output layer?
Any help, hints or links to tutorials are greatly appreciated - I found a lot of stuff for many-to-one classification, but nothing good on many-to-many.
There can be many approaches to this; I am describing one that can be a good fit for your problem.
If you want to stack two LSTM layers, then setting return_sequences=True on the first layer passes its full hidden-state sequence on to the second LSTM layer, as shown in the following example.
from keras.layers import Dense, LSTM
from keras import Input, Model

seq_length = 15
input_dims = 10
output_dims = 8  # number of classes
n_hidden = 10

model1_inputs = Input(shape=(seq_length, input_dims))
net1 = LSTM(n_hidden, return_sequences=True)(model1_inputs)   # passes the full sequence on
net1 = LSTM(n_hidden, return_sequences=False)(net1)           # keeps only the last hidden state
net1 = Dense(output_dims, activation='softmax')(net1)         # softmax over the 8 classes
model1 = Model(inputs=model1_inputs, outputs=net1, name='model1')

## Fit the model
model1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 15, 10)            0
_________________________________________________________________
lstm_1 (LSTM)                (None, 15, 10)            840
_________________________________________________________________
lstm_2 (LSTM)                (None, 10)                840
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 88
_________________________________________________________________
Another option is to use the complete returned sequence as the features for the next layer. In that case, flatten the sequence and feed it into a simple Dense layer, whose input will then be [batch, seq_len * lstm_output_dims], as sketched below.
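A minimal sketch of this option (a sketch only, reusing the sizes from the example above; the softmax classifier on top is my own choice):

from keras.layers import Dense, Flatten, LSTM
from keras import Input, Model

seq_length, input_dims, output_dims, n_hidden = 15, 10, 8, 10

model2_inputs = Input(shape=(seq_length, input_dims))
net2 = LSTM(n_hidden, return_sequences=True)(model2_inputs)   # (None, 15, 10)
net2 = Flatten()(net2)                                        # (None, 15 * 10) = (None, 150)
net2 = Dense(output_dims, activation='softmax')(net2)         # (None, 8)
model2 = Model(inputs=model2_inputs, outputs=net2, name='model2')
model2.summary()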
Note: these features can be useful for a classification task, but most commonly we stack LSTM layers and use only the final hidden state (without the complete sequence) as the features for the classification layer.
This answer may be helpful for understanding other LSTM architectures used for different purposes.
I'm working with the tensorflow.keras API, and I've encountered a syntax which I'm unfamiliar with, i.e., applying a layer to a sub-model's output, as shown in the following example from this tutorial:
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import resnet
target_shape = (200, 200)
base_cnn = resnet.ResNet50(
weights="imagenet", input_shape=target_shape + (3,), include_top=False
)
flatten = layers.Flatten()(base_cnn.output)
dense1 = layers.Dense(512, activation="relu")(flatten)
dense1 = layers.BatchNormalization()(dense1)
dense2 = layers.Dense(256, activation="relu")(dense1)
dense2 = layers.BatchNormalization()(dense2)
output = layers.Dense(256)(dense2)
embedding = Model(base_cnn.input, output, name="Embedding")
In the official reference for layers.Flatten, for example, I couldn't find an explanation of what applying it to a layer actually does. In the keras.Layer reference I've encountered this explanation:
call(self, inputs, *args, **kwargs): Called in __call__ after making sure build() has been called. call() performs the logic of applying the layer to the input tensors (which should be passed in as argument).
So my question is:
What does flatten = layers.Flatten()(base_cnn.output) do?
You are creating a model based on a pre-trained model. In this kind of setup the pre-trained model is usually frozen (its layers' trainable attribute set to False) so that it is not trained along with the rest of your layers; you are only interested in extracting its useful features. A flattening operation is usually used to convert a multidimensional output into a one-dimensional tensor, and that is exactly what is happening in this line: flatten = layers.Flatten()(base_cnn.output). A one-dimensional tensor is often a desirable end result of a model, especially in supervised learning. The output of the pre-trained resnet model is (None, 7, 7, 2048), and you want to generate a 1D feature vector for each input and compare them, so you flatten that output, resulting in a tensor with the shape (None, 100352), i.e. (None, 7 * 7 * 2048).
Alternatives to Flatten would be GlobalMaxPooling2D and GlobalAveragePooling2D, which downsample an input by taking the max or average value along the spatial dimensions. For more information on this topic check out this post.
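A quick shape check (a sketch of my own, not from the tutorial; weights=None is used here only to avoid downloading the ImageNet weights):

from tensorflow.keras import layers
from tensorflow.keras.applications import resnet

base_cnn = resnet.ResNet50(weights=None, input_shape=(200, 200, 3), include_top=False)
print(base_cnn.output.shape)                                    # (None, 7, 7, 2048)
print(layers.Flatten()(base_cnn.output).shape)                  # (None, 100352)
print(layers.GlobalAveragePooling2D()(base_cnn.output).shape)   # (None, 2048)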
Consider a simple Keras network like this:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import backend as K

def custom_loss(y_true, y_pred):
    return K.abs(y_true[0] - y_pred) + K.abs(y_true[1] - y_pred)

def gen():
    while True:
        a = np.random.random()
        b = 2*a
        c = 3*a
        yield (np.array([a]), np.array([b, c]))

model = Sequential()
model.add(Dense(1, input_dim=1))
model.compile(Adam(lr=0.01), custom_loss)
model.fit_generator(gen(), steps_per_epoch=20)
In this case, it is supposed to learn to predict the average of double and triple the value of its input. y_true is of shape [2], while y_pred is of shape [1]. Therefore, Keras throws an error: 'Input arrays should have the same number of samples as target arrays. Found 1 input samples and 2 target samples.' This is by design, though; how can you avoid having a bigger target array than input array if you have multiple targets?
So, I see that in your generator you have a one-dimensional input and a two-dimensional target.
In your loss function you are indexing in the wrong manner. The first axis is the batch axis, so you have to index along the last axis and keep all batch entries when calculating the loss.
The correct loss implementation would be as below:
return K.abs(y_true[:, 0] - y_pred[:, 0]) + K.abs(y_true[:, 0] - y_pred[:, 1])
I guess you were getting the error because of the improper indexing; if this is what you meant, then this will solve the problem.
Finally, the number of outputs is determined by the number of units in the last layer. Look at your model summary: the last layer has shape (None, 1), but you need (None, 2) there because you pass two target values, b and c, in your generator while your model has only one output (1 unit in the final Dense layer). It's as easy as changing the final Dense layer to 2 units to fix this.
import numpy as np
import tensorflow as K  # TensorFlow is used directly as the backend here
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def custom_loss(y_true, y_pred):
    print(y_true.shape)
    print(y_pred.shape)
    return K.abs(y_true[:, 0] - y_pred[:, 0]) + K.abs(y_true[:, 0] - y_pred[:, 1])
    # return K.keras.losses.mse(y_true, y_pred)  # this one works too

def gen():
    while True:
        a = np.random.random()
        b = 2*a
        c = 3*a
        yield (np.array([a]), np.array([b, c]))

model = Sequential()
model.add(Dense(2, input_dim=1))
model.compile('adam', custom_loss)
model.summary()
model.fit_generator(gen(), steps_per_epoch=20)
Model: "sequential_17"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_17 (Dense) (None, 2) 4
=================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0
_________________________________________________________________
(None, 1)
(None, 2)
(None, 1)
(None, 2)
20/20 [==============================] - 0s 2ms/step - loss: 2.8827
<tensorflow.python.keras.callbacks.History at 0x7f48e64f1d68>
I am trying to implement the network architecture from the paper Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks by Ruiqing Yin, Hervé Bredin, and Claude Barras, which is described as follows:
The model is composed of two Bi-LSTMs (Bi-LSTM 1 and 2) and a multi-layer perceptron (MLP) whose weights are shared across the sequence. Bi-LSTM 1 has 64 outputs (32 forward and 32 backward). Bi-LSTM 2 has 40 (20 each). The fully connected layers are 40-, 10- and 1-dimensional respectively. The outputs of the forward and backward LSTMs are concatenated and fed forward to the next layer. The shared MLP is made of three fully connected feedforward layers, using a tanh activation function for the first two layers and a sigmoid activation function for the last layer, in order to output a score between 0 and 1.
I have taken references from various sources and come up with the following code:
model = Sequential()
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(40, return_sequences=True)))
model.add(TimeDistributed(Dense(40,activation='tanh')))
model.add(TimeDistributed(Dense(10,activation='tanh')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.build(input_shape=(None, 200, 35))
model.summary()
I am confused about the TimeDistributed layer: how can it simulate an MLP, and how are the weights shared? Can you at least point out whether I am doing this right or not?
As the architecture in the paper suggests, you basically want to push each of the hidden states (which are themselves time distributed) into separate dense layers (thus forming an MLP at each time step).
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional (Bidirectional (None, 200, 128) 51200
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 80) 54080
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 40) 3240
_________________________________________________________________
time_distributed_1 (TimeDist (None, 200, 10) 410
_________________________________________________________________
time_distributed_2 (TimeDist (None, 200, 1) 11
=================================================================
Total params: 108,941
Trainable params: 108,941
Non-trainable params: 0
The Bi-LSTM here is set to return_sequences=True. Therefore it returns the full hidden-state sequence to the subsequent layer. If you pushed this sequence into a plain Dense layer it wouldn't make sense, since you would be passing a 3D tensor (batch, time, feature). If you want to form a Dense network at each time step, you need it to be TimeDistributed.
As the output shape suggests, this layer creates a 40-node layer at each of the 200 time steps that come out of the preceding Bi-LSTM (its hidden states). Each of these is then followed by a 10-node layer, giving (None, 200, 10). The same logic follows for the final layer.
If your doubt is about what TimeDistributed layers are, as per the official documentation:
This wrapper allows applying a layer to every temporal slice of an input.
The final goal is speaker change detection, meaning that you want to predict a speaker-change score (a probability between 0 and 1) at each of the 200 time steps. Therefore the output layer returns 200 scores, with shape (None, 200, 1).
Hope that solves your confusion.
Another intuitive way of looking at it -
Your Bi-LSTM is set to return sequences instead of just the final features. Each time step in the returned sequence needs a Dense network applied to it. A TimeDistributed Dense is basically a layer that takes in an input sequence and feeds every time step into the same dense nodes. So, while it produces 200 x 40 outputs rather than the 40 of a standard Dense layer, it reuses a single set of 40-node weights at every step; the input to, say, the 3rd application is the 3rd time step from the Bi-LSTM. This simulates a time-distributed MLP over the Bi-LSTM sequences.
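A quick way to see the weight sharing (a small check of my own, not from the question's code): the TimeDistributed Dense below has 80 * 40 + 40 = 3240 parameters, matching the summary above and independent of the 200 time steps, because one weight matrix is reused at every step.

import tensorflow as tf

inp = tf.keras.layers.Input(shape=(200, 80))   # (time steps, Bi-LSTM 2 output size)
out = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(40, activation='tanh'))(inp)
tf.keras.Model(inp, out).summary()   # output (None, 200, 40), params 3240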
A good visual intuition that I prefer when working with LSTMs:
If you DON'T return sequences, the output of the LSTM is just the single final hidden state ht.
If you return sequences, the output is the whole sequence (h0 to ht).
Adding a Dense layer in the first case will only take ht as input. In the second case, you will need a TimeDistributed Dense, which will "stack" on top of each of h0 to ht.
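A quick shape check of the two cases (a sketch, using the 200 x 35 input shape from the question):

import tensorflow as tf

x = tf.keras.layers.Input(shape=(200, 35))
print(tf.keras.layers.LSTM(64)(x).shape)                         # (None, 64)      -> only ht
print(tf.keras.layers.LSTM(64, return_sequences=True)(x).shape)  # (None, 200, 64) -> h0 ... ht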
I am trying to train a Keras model which includes two nested models, and I want to save the weights of both inner models separately. Right now I am able to save the weights of the whole model, but I am unable to load the weights of the nested models within the big model.
The output of Big_model.summary() looks like this:
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_4 (InputLayer) [(None, 128, 128, 1)] 0
_________________________________________________________________
model (Model) (None, 16, 16, 512) 170369024
_________________________________________________________________
model_1 (Model) (None, 128, 128, 1) 15342209
=================================================================
Total params: 185,711,233
Trainable params: 185,711,233
Non-trainable params: 0
How can I even see the summary of both inner models, e.g. something like Big_model.inner_Model1.summary()? Or save the weights of both inner models separately after training, using Big_model.inner_Model1.save_weights() and Big_model.inner_Model2.save_weights(), or via callbacks during model.fit?
What I am getting is that Big_model has no attribute inner_Model1. Any help please?
PS: There is no problem with training or anything; I can run the training. I am using the TensorFlow version of Keras, i.e. tf.keras.models.Model, for the models.
This is how I am creating the models:
inner_Model1 = tf.keras.models.Model()   # actual sub-model definitions omitted
inner_Model2 = tf.keras.models.Model()
x = tf.keras.layers.Input(shape=IMAGE_SHAPE)
Big_model = tf.keras.models.Model(x, inner_Model2(inner_Model1(x)))
Big_model.compile(optimizer=optimizer, loss='mean_absolute_error')
In that summary you posted, model is layer 1, and model_1 is layer 2:
Big_model.layers[1].summary() #this is inner_Model1.summary()
Big_model.layers[2].summary() #this is inner_Model2.summary()
Do whatever you want with them.
If you created the model like you did, there is nothing wrong with simply doing:
inner_Model1.save_weights(...)
inner_Model2.save_weights(...)
It will also work fine if you load the weights outside the big model; the big model will see the changes.
inner_Model1.load_weights(...)
inner_Model2.load_weights(...)
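For completeness, a minimal, hypothetical sketch (the sub-model layers and file names here are made up; only the input shape follows your summary) showing that the nested models can be summarized and saved independently:

import tensorflow as tf

IMAGE_SHAPE = (128, 128, 1)   # taken from the summary above

inp1 = tf.keras.layers.Input(shape=IMAGE_SHAPE)
enc = tf.keras.layers.Conv2D(8, 3, strides=8, padding='same')(inp1)
inner_Model1 = tf.keras.models.Model(inp1, enc, name='encoder')

inp2 = tf.keras.layers.Input(shape=inner_Model1.output_shape[1:])
dec = tf.keras.layers.Conv2DTranspose(1, 3, strides=8, padding='same')(inp2)
inner_Model2 = tf.keras.models.Model(inp2, dec, name='decoder')

x = tf.keras.layers.Input(shape=IMAGE_SHAPE)
Big_model = tf.keras.models.Model(x, inner_Model2(inner_Model1(x)))

Big_model.layers[1].summary()              # same object as inner_Model1
inner_Model1.save_weights('encoder.h5')    # hypothetical file names
inner_Model2.save_weights('decoder.h5')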
Since I am new to deep learning, this question may seem funny to you, but I couldn't visualize it in my mind. That's why I am asking about it.
I am giving a sentence as a vector to the LSTM. Say I have a sentence that contains 10 words; I then convert the sentence into vectors and feed it to the LSTM.
The length of the LSTM should therefore be 10 cells. But in most of the tutorials I have seen, they add 128 hidden states. I couldn't understand and visualize it. What is meant by an LSTM layer with a "128-dimensional hidden state"?
for example:
X = LSTM(128, return_sequences=True)(embeddings)
The summary of this looks like:
lstm_1 (LSTM) (None, 10, 128) 91648
Here it looks like 10 LSTM cells are added, but why are there 128 hidden states? I hope you understand what I am asking.
Short Answer:
If you are more familiar with convolutional networks, you can think of the size of the LSTM layer (128) as the equivalent of the size of a convolutional layer. The 10 only reflects the size of your input (the length of your sequence is 10).
Longer Answer:
You can check this article for a more detailed discussion of RNNs.
In the article's figure, an LSTM layer is represented with xt as the input and ht as the output; the feedback arrow shows that there is some kind of memory inside the cell.
In practice in Keras, this model is "unrolled" so that the whole input sequence xt is given to the layer in parallel.
So when your summary shows:
lstm_1 (LSTM) (None, 10, 128) 91648
It means that your input sequence length is 10 (x0, x1, x2, ..., x9), and that the size of your LSTM is 128 (128 is the dimension of your output ht).
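A small sketch to make this concrete (assuming an embedding dimension of 50, which is what the 91,648 parameter count implies): the 10 is the sequence length and the 128 is the size of each hidden state ht.

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(10, 50))   # 10 words, 50-dim embeddings (assumed)
X = tf.keras.layers.LSTM(128, return_sequences=True)(inputs)
tf.keras.Model(inputs, X).summary()   # lstm: (None, 10, 128), params 4*(128*(50+128)+128) = 91648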