I'm trying to understand the Keras LSTM layer a bit better with regard to timesteps, but am still struggling.
I want to create a model that is able to compare 2 inputs (siamese network). So my input is a pair of preprocessed texts. The preprocessing is done as follows:
max_len = 64
data['cleaned_text_1'] = data.apply(lambda x: clean_string(x['text_1']), axis=1)  # 'text_1' is the raw text column
data['text_1_seq'] = t.texts_to_sequences(data['cleaned_text_1'].astype(str).values)
data['text_1_seq_pad'] = [list(x) for x in pad_sequences(data['text_1_seq'], maxlen=max_len, padding='post')]
The same is done for the second text input. t is a keras.preprocessing.text.Tokenizer.
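Note that t has to be fit on the training texts before texts_to_sequences can be called. A minimal sketch of that assumed setup, where all_texts is a hypothetical list holding every cleaned string from both text columns:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t = Tokenizer()
t.fit_on_texts(all_texts)  # builds t.word_index, used later by the Embedding layer
seqs = t.texts_to_sequences(all_texts)  # lists of integer token ids
padded = pad_sequences(seqs, maxlen=64, padding='post')  # shape (n_texts, 64)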
I defined the model with:
common_embed = Embedding(
name="synopsis_embedd",
input_dim=len(t.word_index)+1,
output_dim=300,
input_length=len(data['text_1_seq_pad'].tolist()[0]),
trainable=True
)
lstm_layer = tf.keras.layers.Bidirectional(
tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)
)
input1 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e1 = common_embed(input1)
x1 = lstm_layer(e1)
input2 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e2 = common_embed(input2)
x2 = lstm_layer(e2)
merged = tf.keras.layers.Lambda(
function=l1_distance, output_shape=l1_dist_output_shape, name='L1_distance'
)([x1, x2])
conc = Concatenate(axis=-1)([merged, x1, x2])
x = Dropout(0.01)(conc)
preds = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
That seems to work when I feed the NumPy data with the fit method:
model.fit(
x = [np.array(data['text_1_seq_pad'].tolist()), np.array(data['text_2_seq_pad'].tolist())],
y = y_train.values.reshape(-1,1),
epochs=epochs,
batch_size=batch_size,
validation_data=([np.array(val['text_1_seq_pad'].tolist()), np.array(val['text_2_seq_pad'].tolist())], y_val.values.reshape(-1,1)),
)
What I'm trying to understand at the moment is what, in my case, the shapes for the LSTM layer are in terms of:
samples
time_steps
features
Is it correct that the input_shape for the LSTM layer would be input_shape=(300,1) because I set the embedding output dim to 300 and I have only 1 input feature per LSTM?
And do I need to reshape the embedding output or can I just set
lstm_layer = tf.keras.layers.Bidirectional(
tf.keras.layers.LSTM(32, input_shape=(300,1), dropout=0.2, recurrent_dropout=0.2)
)
from the embedding output?
An example notebook can be found on GitHub or as a Colab notebook.
In general, an LSTM layer needs 3D input shaped (batch_size, length of the input sequence, number of features). (The batch size is not really important here, so you can just consider that a single input needs to have the shape (length of sequence, number of features per item).)
In your case, the output dim of your embedding layer is 300, so your LSTM has 300 features.
Then, using an LSTM on sentences requires a constant number of tokens. An LSTM works with a fixed input dimension: you cannot pass it a text with 12 tokens followed by another one with 68 tokens. Instead, you need to fix a limit and pad the sequences where needed.
So, if your sentence is 20 tokens long and your limit is 50, you need to pad the sequence (add tokens at its end) with 30 "neutral" tokens (often zeros).
In the end, your LSTM input dimension must be (number of tokens per text, dimension of the embedding output) -> (50, 300) in my example.
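To make the shapes concrete, here is a minimal sketch (the vocabulary size of 1000 is a made-up placeholder): the Embedding layer turns (batch, 50) integer input into (batch, 50, 300), which is exactly the (time_steps, features) input the LSTM expects, so no reshape and no explicit input_shape on the LSTM are needed.

import tensorflow as tf

inp = tf.keras.Input(shape=(50,))  # 50 padded token ids per text
emb = tf.keras.layers.Embedding(1000, 300)(inp)  # -> (None, 50, 300)
out = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(emb)  # -> (None, 64)
print(emb.shape, out.shape)  # (None, 50, 300) (None, 64)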
To learn more about it, I suggest you take a look at this (in your case, you can replace time_steps with number_of_tokens):
https://shiva-verma.medium.com/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
I have developed a trivial feed-forward neural network with PyTorch.
The neural network uses GloVe pre-trained embeddings in a frozen nn.Embedding layer.
Next, the embedding layer splits into three embeddings. Each split is a different transformation applied to the initial embedding layer. Then the embeddings feed three nn.Linear layers, and finally I have a single output layer for a binary classification target.
The shape of the embedding tensor is [64,150,50]
-> 64: sentences in the batch,
-> 150: words per sentence,
-> 50: vector-size of a single word (pre-trained GloVe vector)
So after the transformation, the embedding layer splits into three layers of shape [64, 50], where each one is the torch.mean(), torch.max() or torch.min() over the 150 words per sentence.
My questions are:
How could I feed the output layer from three different nn.Linear layers to predict a single target value [0, 1]?
Is this efficient and helpful for the total predictive power of the model? Or is just selecting the average of the embeddings sufficient, with no improvement to be observed?
The forward() method of my PyTorch model is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
    else:
        embedded = self.flatten_layer(embedded)
    input_layer = self.input_layer(embedded_average)  # each Linear layer has the same number of hidden units
    input_layer = self.activation(input_layer)
    input_layer_max = self.input_layer(embedded_max)
    input_layer_max = self.activation(input_layer_max)
    input_layer_min = self.input_layer(embedded_min)
    input_layer_min = self.activation(input_layer_min)
    # What should I do here to exploit the weights of the 3 hidden layers?
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer)  # Sigmoid()
    return output_layer
After the proposed answer the function is:
def forward(self, text):
    embedded = self.embedding(text)
    if self.use_pretrained_embeddings:
        embedded_average = torch.mean(embedded, dim=1)
        embedded_max = torch.max(embedded, dim=1)[0]
        embedded_min = torch.min(embedded, dim=1)[0]
        # use of average embeddings transformation
        input_layer_average = self.input_layer(embedded_average)
        input_layer_average = self.activation(input_layer_average)
        # use of max embeddings transformation
        input_layer_max = self.input_layer(embedded_max)
        input_layer_max = self.activation(input_layer_max)
        # use of min embeddings transformation
        input_layer_min = self.input_layer(embedded_min)
        input_layer_min = self.activation(input_layer_min)
    else:
        embedded = self.flatten_layer(embedded)
    input_layer = torch.concat([input_layer_average, input_layer_max, input_layer_min], dim=1)
    input_layer = self.activation(input_layer)
    print("3", input_layer.shape)  # [192,1] vs [64,1] -> output layer
    if self.n_layers != 0:
        for layer in self.layers:
            input_layer = layer(input_layer)
    output_layer = self.output_layer(input_layer)
    output_layer = self.activation_output(output_layer)
    return output_layer
This generates the following error:
ValueError: Using a target size (torch.Size([64, 1])) that is different to the input size (torch.Size([192, 1])) is deprecated. Please ensure they have the same size.
An expected outcome, since the concatenated layer is 3x the number of sentences (64). Any fix that could resolve it?
Regarding 1: You can use torch.concat to concatenate the outputs along the appropriate dimension, and then e.g. map them to a single output using another linear layer.
Regarding 2: You will have to try it yourself and see whether this is useful.
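For the error above: the three branch outputs each have shape [64, hidden], so they should be concatenated along dim=1 (the feature dimension); concatenating along dim=0 stacks rows and produces the [192, 1] mismatch. A minimal sketch with made-up sizes and a hypothetical output layer (note that the layer after the concatenation must then expect 3 * hidden input features):

import torch
import torch.nn as nn

batch, hidden = 64, 32  # placeholder sizes
a = torch.randn(batch, hidden)  # average-pooled branch
b = torch.randn(batch, hidden)  # max-pooled branch
c = torch.randn(batch, hidden)  # min-pooled branch

merged = torch.cat([a, b, c], dim=1)  # (64, 96): features grow, rows don't
head = nn.Linear(3 * hidden, 1)  # hypothetical final linear layer
out = torch.sigmoid(head(merged))  # (64, 1), matches the target size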
I am working to understand Erik Linder-Norén's implementation of the Categorical GAN model, and am confused by the generator in that model:
def build_generator(self):
    model = Sequential()
    # ...some lines removed...
    model.add(Dense(np.prod(self.img_shape), activation='tanh'))
    model.add(Reshape(self.img_shape))
    model.summary()
    noise = Input(shape=(self.latent_dim,))
    label = Input(shape=(1,), dtype='int32')
    label_embedding = Flatten()(Embedding(self.num_classes, self.latent_dim)(label))
    model_input = multiply([noise, label_embedding])
    img = model(model_input)
    return Model([noise, label], img)
My question is: How does the Embedding() layer work here?
I know that noise is a vector that has length 100, and label is an integer, but I don't understand what the label_embedding object contains or how it functions here.
I tried printing the shape of label_embedding to try and figure out what's going on in that Embedding() line but that returns (?,?).
If anyone could help me understand how the Embedding() lines here work, I'd be very grateful for their assistance!
To keep in mind why an embedding is used here at all: the alternative is to concatenate the noise with the conditioned class, which may cause the generator to completely ignore the noise values and generate data with high similarity within each class (or even just one sample per class).
From the documentation, https://keras.io/layers/embeddings/#embedding,
Turns positive integers (indexes) into dense vectors of fixed size.
eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
In the GAN model, the input integer (0-9) is converted to a vector of shape 100. With this short code snippet, we can feed some test input and check the output shape of the Embedding layer.
from keras.layers import Input, Embedding
from keras.models import Model
import numpy as np
latent_dim = 100
num_classes = 10
label = Input(shape=(1,), dtype='int32')
label_embedding = Embedding(num_classes, latent_dim)(label)
mod = Model(label, label_embedding)
test_input = np.zeros((1))
print(f'output shape is {mod.predict(test_input).shape}')
mod.summary()
output shape is (1, 1, 100)
From the model summary, the output shape of the embedding layer is (1, 100) (excluding the batch dimension), which is the same as the output of predict.
embedding_1 (Embedding) (None, 1, 100) 1000
One additional point: in the output shape (1, 1, 100), the leftmost 1 is the batch size and the middle 1 is the input length. In this case, we provided an input of length 1.
The embedding stores the per-label state. If I read the code correctly, each label corresponds to a digit; i.e. there is an embedding that captures how to generate a 0, 1, ..., 9.
This code takes some random noise and multiplies it with this per-label state. The result should be a vector that leads the generator to produce the digit corresponding to the label (i.e. 0..9).
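The shape bookkeeping behind that multiply, as a small sketch (latent_dim = 100 as in the model): the Embedding output (batch, 1, 100) is flattened to (batch, 100) so it can be multiplied elementwise with the noise vector.

from keras.layers import Input, Embedding, Flatten, multiply

latent_dim, num_classes = 100, 10
noise = Input(shape=(latent_dim,))  # (None, 100)
label = Input(shape=(1,), dtype='int32')  # (None, 1)
emb = Embedding(num_classes, latent_dim)(label)  # (None, 1, 100)
flat = Flatten()(emb)  # (None, 100)
model_input = multiply([noise, flat])  # (None, 100), elementwise product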
I've had this suspicion for the longest time but couldn't figure out whether it's the case or not, so here's the scenario:
I'm trying to build a model that has 3 features from 3 different inputs:
A text sequence
A float
A float
Now all three of these make up one time step. But since I'm using GloVe to vectorize my text sequence using 100 dimensions, a 20-word text sequence ends up having length 2000. Hence, the overall input per step has length 2002 (at each time step a matrix with the shape (1, 2002) is fed in, with 2000 of those values coming from a single feature).
Is the text sequence overwhelming the two floats, so that whatever the values of the floats are, they are irrelevant to the prediction? If so, what can I do to fix this? Perhaps manually weight how much each feature should be used? Code is attached.
def build_model(embedding_matrix) -> Model:
    text = Input(shape=(9, news_text.shape[1]), name='text')
    price = Input(shape=(9, 1), name='price')
    volume = Input(shape=(9, 1), name='volume')
    text_layer = Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix]
    )(text)
    text_layer = Dropout(0.2)(text_layer)
    # Flatten the vectorized text matrix
    text_layer = Reshape((9, int_shape(text_layer)[2] * int_shape(text_layer)[3]))(text_layer)
    inputs = concatenate([
        text_layer,
        price,
        volume
    ])
    output = Convolution1D(128, 5, activation='relu')(inputs)
    output = MaxPool1D(pool_size=4)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2)(output)
    output = Dense(units=2, activation='linear', name='output')(output)
    model = Model(
        inputs=[text, price, volume],
        outputs=[output]
    )
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
Edit: note that the input shape into the LSTM is (?, 9, 2002), which means the 2000 values coming from the text are currently treated as 2000 independent features.
As I mentioned in my comments, one approach is to have a two-branch model where one branch processes the text data and the other processes the two float features. At the end, the outputs of the two branches are merged together:
# Branch one: process text data
text_input = Input(shape=(news_text.shape[1],), name='text')
text_emb = Embedding(embedding_matrix.shape[0],embedding_matrix.shape[1],
weights=[embedding_matrix])(text_input)
# you may alternatively use only Conv1D + MaxPool1D or
# stack multiple LSTM layers on top of each other or
# use a combination of Conv1D, MaxPool1D and LSTM
text_conv = Convolution1D(128, 5, activation='relu')(text_emb)
text_lstm = LSTM(units=128, dropout=0.2)(text_conv)
# Branch two: process float features
price_input = Input(shape=(9, 1), name='price')
volume_input = Input(shape=(9, 1), name='volume')
pv = concatenate([price_input, volume_input])
# you can also stack multiple LSTM layers on top of each other
pv_lstm = LSTM(units=128, dropout=0.2)(pv)
# merge output of branches
text_pv = concatenate([text_lstm, pv_lstm])
output = Dense(units=2, activation='linear', name='output')(text_pv)
model = Model(
inputs=[text_input, price_input, volume_input],
outputs=[output]
)
model.compile(optimizer='adam', loss='mean_squared_error')
As I have commented in the code, this is just a simple illustration. You may need to further add or remove layers or regularization and tune the hyper-parameters.
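The training call then takes three arrays whose first dimension (the number of samples) agrees. A hypothetical example (all array names are made up):

# texts: (n_samples, seq_len) integer token ids
# prices, volumes: (n_samples, 9, 1) floats
# targets: (n_samples, 2) for the two linear outputs
model.fit([texts, prices, volumes], targets, epochs=10, batch_size=32)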
I am working on a project where I have to use a combination of numeric and text data in a neural network to make predictions of a system's availability for the next hour. Instead of trying to use separate neural networks and doing something weird/unclear (to me) at the end to produce the desired output, I decided to use Keras' merge layer with two networks (one for numeric data, one for text). The idea is that I feed the model a sequence of performance metrics for the previous 6 hours in the shape of (batch_size, 6hrs, num_features). Alongside the input I am giving to the network that handles numeric data, I am giving the second network another sequence of the size (batch_size, max_alerts_per_sequence, max_sentence_length).
Any sequence of numeric data within a time range can have a variable number of events (text data) associated with it. For the sake of simplicity, I only allow a maximum of 50 events to accompany a sequence of performance data. Each event is hash encoded by word and padded. I have tried using a flatten layer to reduce the input shape from (50, 30) to (1500) so that the model can train on every event in these "sequences" (to clarify: I pass the model 50 sentences with 30 encoded elements each for every sequence of performance data).
My question is: Due to the fact that I need the NN to look at all events for a given sequence of performance metrics, how can I make the NN for text based data train on sequences of sentences?
My Model:
#LSTM Module for performance metrics
input = Input(shape=(shape[1], shape[2]))
lstm1 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=True, input_shape=shape))(input)
dropout1 = Dropout(rate=0.2)(lstm1)
lstm2 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=False))(dropout1)
dropout2 = Dropout(rate=0.2)(lstm2)
#LSTM Module for text based data
tInput = Input(shape=(50, 30))
flatten = Flatten()(tInput)
embed = Embedding(input_dim=vocabsize + 1, output_dim= 50 * 30, input_length=30*50)(flatten)
magic = Bidirectional(LSTM(100))(embed)
tOut = Dense(1, activation='relu')(magic)
#Merge the layers
concat = Concatenate()([dropout2, tOut])
output = Dense(units=1, activation='sigmoid')(concat)
nn = keras.models.Model(inputs=[input, tInput], outputs = output)
opt = keras.optimizers.SGD(lr=0.1, momentum=0.8, nesterov=True, decay=0.001)
nn.compile(optimizer=opt, loss='mse', metrics=['accuracy', coeff_determination])
So, as far as I understand, you have a sequence of at most 50 events, which you want to make predictions for. These events have text data attached, which can be treated as another sequence of word embeddings. Here is an article about a similar architecture.
I would propose a solution which involves LSTMs for the text part and a 1D convolution for the "real" sequence part. Every LSTM layer is concatenated with the numerical data. This involves 50 LSTM layers, which can be time-consuming to train, even if you use shared weights. It would also be possible to use only convolution layers for the text part, which is faster but does not model long-term dependencies. (In my experience, these long-term dependencies are often not that important in text mining.)
Text -> LSTM or 1DConv -> concat with numerical data -> 1DConv -> Output
Here is some example code, which shows how to use shared weights:
numeric_input = Input(shape=(x_numeric_train.values.shape[1],), name='numeric_input')
nlp_seq = Input(shape=(number_of_messages, seq_length,), name='nlp_input')
# shared layers
emb = TimeDistributed(Embedding(input_dim=num_features, output_dim=embedding_size,
input_length=seq_length, mask_zero=True,
input_shape=(seq_length, )))(nlp_seq)
x = TimeDistributed(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01))))(emb)
c1 = Conv1D(filter_size, kernel1, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p1 = GlobalMaxPooling1D()(c1)
c2 = Conv1D(filter_size, kernel2, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p2 = GlobalMaxPooling1D()(c2)
c3 = Conv1D(filter_size, kernel3, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p3 = GlobalMaxPooling1D()(c3)
x = concatenate([p1, p2, p3, numeric_input])
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[nlp_seq, numeric_input], outputs=[x])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
And training:
model.fit([x_train, x_numeric_train], y_train)
# where x_train is an array of shape num_samples * num_messages * seq_length
A complex model like this needs a lot of data to converge. For less data, a simpler solution could be implemented by aggregating the events so that you have only one sequence. For example, the text data of all events can be treated as one single text (with a separator token) instead of multiple texts, while the numerical data can be summed up, averaged, or even combined into a fixed-length list. But this depends on your data.
As I am working on something similar, I will update this answer with code later on.
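In the meantime, here is a rough sketch of that aggregation idea (all names are made up): join the event texts with a separator token and reduce the numeric rows to a fixed-length summary.

SEP = ' <sep> '

def aggregate(event_texts, event_numeric):
    # event_texts: list of strings, one per event
    # event_numeric: array of shape (n_events, n_numeric_features)
    joined_text = SEP.join(event_texts)  # one text per sequence
    numeric_summary = event_numeric.mean(axis=0)  # fixed-length vector
    return joined_text, numeric_summary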
I have the following idea to implement:
Input -> CNN-> LSTM -> Dense -> Output
The Input has 100 time steps, each step has a 64-dimensional feature vector
A Conv1D layer will extract features at each time step. The CNN layer contains 64 filters, each 16 taps long. Then, a max-pooling layer will extract the single maximum value of each convolutional output, so a total of 64 features will be extracted at each time step.
Then, the output of the CNN layer will be fed into an LSTM layer with 64 neurons. The number of recurrences equals the number of time steps of the input, which is 100. The LSTM layer should return a sequence of 64-dimensional outputs (the length of the sequence == the number of time steps == 100, so there should be 100*64 = 6400 numbers).
mfcc_input = Input(shape=(100, 64), dtype='float', name='mfcc_input')
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
... (more code) ...
But this doesn't work. The second line reports "list index out of range" and I don't understand what's going on.
I'm new to Keras, so I appreciate sincerely if anyone could help me with it.
A picture (omitted here) explains how the CNN should be applied to EACH TIME STEP.
The problem is with your input. Your input is of shape (100, 64), in which the first dimension is the timesteps. So, ignoring that, your input is of shape (64,) to a Conv1D.
Now, refer to the Keras Conv1D documentation, which states that the input should be a 3D tensor (batch_size, steps, input_dim). Ignoring the batch_size, your input should be a 2D tensor (steps, input_dim).
So, you are providing a 1D tensor as input, where the expected input is a 2D tensor. For example, if you are providing natural language input to the Conv1D in the form of words, then there are 64 words in your sentence, and supposing each word is encoded with a vector of length 50, your input should be (64, 50).
Also, make sure that you are feeding the right input to LSTM as given in the code below.
So, the correct code should be
embedding_size = 50 # Set this accordingingly
mfcc_input = Input(shape=(100, 64, embedding_size), dtype='float', name='mfcc_input')
CNN_out = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)
CNN_out = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True)(CNN_out)
CNN_out = TimeDistributed(MaxPooling1D(pool_size=(64-16+1), strides=None, padding='valid'))(CNN_out)
# Directly feeding CNN_out to the LSTM will also raise an error, since the 3rd dimension is 1; you need to squeeze it out as follows
CNN_out = Reshape((int(CNN_out.shape[1]), int(CNN_out.shape[3])))(CNN_out)
LSTM_out = LSTM(64,return_sequences=True)(CNN_out)
... (more code) ...
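A quick way to sanity-check the fixed pipeline is to build just these layers and print the summary (embedding_size = 50 is a placeholder, as above):

from keras.layers import Input, TimeDistributed, Conv1D, MaxPooling1D, Reshape, LSTM
from keras.models import Model

embedding_size = 50
mfcc_input = Input(shape=(100, 64, embedding_size), name='mfcc_input')
x = TimeDistributed(Conv1D(64, 16, activation='relu'))(mfcc_input)  # (None, 100, 49, 64)
x = TimeDistributed(MaxPooling1D(pool_size=64 - 16 + 1))(x)  # (None, 100, 1, 64)
x = Reshape((100, 64))(x)  # squeeze out the singleton dimension
x = LSTM(64, return_sequences=True)(x)  # (None, 100, 64)
Model(mfcc_input, x).summary()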