Keras: good result with MLP but bad with Bidirectional LSTM - python

I trained two neural networks with Keras: an MLP and a bidirectional LSTM.
My task is to predict the word order in a sentence, so for each word the network has to output a real number. When a sentence with N words is processed, the N real numbers in the output are ranked to obtain integers representing each word's position.
I'm using the same dataset and the same preprocessing for both. The only difference is that for the LSTM dataset I added padding so the sequences have the same length.
In the prediction phase, with the LSTM, I exclude the predictions produced by padding vectors, since I masked them during training.
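To make that concrete, here is a minimal sketch of the ranking step (the numbers are made up):
import numpy as np

# the network outputs one real number per word (padding predictions already removed)
scores = np.array([0.7, -1.2, 0.3])

# a double argsort turns the scores into integer positions: smallest score -> position 0
positions = np.argsort(np.argsort(scores))
print(positions)  # [2 0 1]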
MLP architecture:
mlp = keras.models.Sequential()
# input layer
mlp.add(keras.layers.Dense(
    units=training_dataset.shape[1],
    input_shape=(training_dataset.shape[1],),
    kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
    activation='relu'))
# hidden layer (input_shape is only needed on the first layer; Keras infers the rest)
mlp.add(keras.layers.Dense(
    units=training_dataset.shape[1] + 10,
    kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
    bias_initializer='zeros',
    activation='relu'))
# output layer: one real number per word
mlp.add(keras.layers.Dense(
    units=1,
    kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
    bias_initializer='zeros',
    activation='linear'))
Bidirectional LSTM architecture:
model = tf.keras.Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(Bidirectional(LSTM(units=20, return_sequences=True), input_shape=(timesteps, features)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
The task should be much better suited to an LSTM, which ought to capture dependencies between words well. Yet with the MLP I achieve good results, while with the LSTM the results are very bad.
Since I'm a beginner, could someone point out what is wrong with my LSTM architecture? I'm losing my mind over it.
Thanks in advance.

For this problem, I am actually not surprised that the MLP performs better.
The architecture of an LSTM, bidirectional or not, assumes that locality is very important to the structure: words next to each other are more likely to be related than words farther away.
But in your problem you have removed that locality and are trying to restore it. For that task, an MLP, which has global information, can do a better job at the sorting.
That said, I think there is still something to be done to improve the LSTM model.
One thing you can do is ensure that the complexity of the two models is similar. You can check this easily with count_params:
mlp.count_params()
model.count_params()
If I had to guess, your LSTM is much smaller. Twenty units seems small for an NLP problem. I used 512 units for a product-classification problem to process character-level information (a vocabulary of size 128 with an embedding of size 50). Word-level models trained on bigger datasets, like AWD-LSTM, get into the thousands of units.
So you probably want to increase that number. You can get an apples-to-apples comparison between the two models by increasing the number of units in the LSTM until the parameter counts are similar; you don't have to stop there, and can keep increasing the size until you start to overfit or training starts taking too long.
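For example, a rough sketch of that sweep, reusing the timesteps and features from the question (the candidate unit counts are arbitrary):
# grow the biLSTM until its parameter count catches up with the MLP's
target = mlp.count_params()
for units in (20, 64, 128, 256, 512):
    candidate = tf.keras.Sequential([
        Masking(mask_value=0., input_shape=(timesteps, features)),
        Bidirectional(LSTM(units=units, return_sequences=True)),
        Dropout(0.2),
        Dense(1, activation='linear'),
    ])
    print(units, candidate.count_params())
    if candidate.count_params() >= target:
        break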

Related

On what basis should I set input and output shapes in a Python Keras LSTM?

I have a dataset of shape (143312, 30) and I'm using the following code to set up the model:
model = Sequential()
model.add(LSTM(100,activation='sigmoid', input_shape = (30,1 ) ))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy',f1_m,precision_m, recall_m])
It works, but I have no idea why. Is it just about the number of features? When I have 30 features, do I simply set it like this? What does the 1 mean, and on what basis was the Dense layer set to 5?
About this one:
LSTM(100,activation='sigmoid', input_shape = (30,1))
You have created an RNN that works on sequences of 30 items, each item with one feature. This matches your dataset of shape (143312, 30): it contains 143312 sequences, each 30 items long, with each item being a single feature.
The 100 specifies the number of units (recurrent neurons) in the LSTM. It is a hyperparameter: use a bigger number for a more complex model, and a smaller one if your model overfits the data.
Regarding this one:
model.add(Dense(5, activation='softmax'))
This is the output layer of your model. Apparently you are using the model for classification (the 'softmax' activation function) and your labels have 5 classes, hence the 5 neurons in the Dense layer.
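One detail worth spelling out: the LSTM expects 3D input of shape (samples, timesteps, features), so the 2D dataset has to be reshaped before fitting. A minimal sketch, with a random array standing in for your data:
import numpy as np

X = np.random.rand(143312, 30)        # stand-in for the real dataset
X = X.reshape((X.shape[0], 30, 1))    # (samples, timesteps=30, features=1)
print(X.shape)                        # (143312, 30, 1)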

What can I do to help make my TensorFlow network overfit a large dataset?

The reason I am trying to overfit specifically is that I am following the steps for designing a network from "Deep Learning with Python" by François Chollet. This is important, as it is for the final project of my degree.
At this stage, I need to make a network large enough to overfit my data in order to determine a maximal capacity, an upper bound on the size of the networks that I will optimise.
However, as the title suggests, I am struggling to make my network overfit. Perhaps my approach is naïve, but let me explain my model:
I am using this dataset to train a model to classify stars. Each star must be classified along two dimensions: its spectral class (100 classes) and its luminosity class (10 classes).
For example, our Sun is a 'G2V': its spectral class is 'G2' and its luminosity class is 'V'.
To this end, I have built a double-headed network, it takes this input data:
[DataFrame containing input data]
It then splits into two parallel networks.
# Create our input layer:
input = keras.Input(shape=(3,), name='observation_data')  # shape must be a tuple, hence (3,)
# Build our spectral class branch
s_class_branch = layers.Dense(100000, activation='relu', name='s_class_branch_dense_1')(input)
s_class_branch = layers.Dense(500, activation='relu', name='s_class_branch_dense_2')(s_class_branch)
# Spectral class prediction
s_class_prediction = layers.Dense(100,
                                  activation='softmax',
                                  name='s_class_prediction')(s_class_branch)
# Build our luminosity class branch
l_class_branch = layers.Dense(100000, activation='relu', name='l_class_branch_dense_1')(input)
l_class_branch = layers.Dense(500, activation='relu', name='l_class_branch_dense_2')(l_class_branch)
# Luminosity class prediction
l_class_prediction = layers.Dense(10,
                                  activation='softmax',
                                  name='l_class_prediction')(l_class_branch)
# Now we instantiate our model using the layer setup above
scaled_model = Model(input, [s_class_prediction, l_class_prediction])
optimizer = keras.optimizers.RMSprop(learning_rate=0.004)
scaled_model.compile(optimizer=optimizer,
                     loss={'s_class_prediction': 'categorical_crossentropy',
                           'l_class_prediction': 'categorical_crossentropy'},
                     metrics=['accuracy'])
logdir = os.path.join("logs", "2raw100k")
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
scaled_model.fit(
    input_data,
    {'s_class_prediction': spectral_targets,
     'l_class_prediction': luminosity_targets},
    epochs=20,
    batch_size=1000,
    validation_split=0.0,
    callbacks=[tensorboard_callback])
In the code above you can see me attempting a model with two hidden layers in each branch: one layer with 100,000 units feeding into another with 500 units, before reaching the output layer. The training targets are one-hot encoded, so there is one node for every class.
I have tried a wide range of sizes, with one to four hidden layers of 500 to 100,000 units, only stopping because I ran out of RAM. I have only used dense layers, with the exception of trying a normalisation layer, to no effect.
[Graph of losses]
They will all happily train and slowly lower the loss, but they never seem to overfit. I have run networks for 100 epochs and they still will not overfit.
What can I do to make my network fit the data better? I am fairly new to machine learning, having only been doing this for a year now, so I am sure there is something that I am missing. I really appreciate any help and would be happy to provide the logs shown in the graph.
After a lot more training, I think I have this answered. Basically, the network did not have adequate capacity and needed more layers. I had tried more layers earlier, but because I was not comparing against validation data, the overfitting was not apparent!
The proof is in the pudding:
[Graph of training vs. validation loss]
So thank you to @Aryagm for their comment, because that let me work it out. As you can see, the validation data (grey and blue) clearly overfits, while the training data (green and orange) does not show it.
If anything, this goes to show why a separate validation set is so important, and I am a fool for not having used one in the first place! Lesson learned.
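For anyone else making the same mistake, the only change needed in the fit call is a non-zero validation_split (0.2 here is an arbitrary choice):
scaled_model.fit(
    input_data,
    {'s_class_prediction': spectral_targets,
     'l_class_prediction': luminosity_targets},
    epochs=20,
    batch_size=1000,
    validation_split=0.2,  # hold out 20% so overfitting shows up as diverging validation loss
    callbacks=[tensorboard_callback])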

Time Series Forecasting model with LSTM in Tensorflow predicts a constant

I am building a hurricane track predictor using satellite data. I have a many-to-many, multilayer LSTM model, with input and output arrays following the structure [samples[time[features]]]. The input and output features include the hurricane's coordinates, WS, and other dimensions.
The problem is that the loss stops decreasing and, as a consequence, the model always predicts a constant. After reading several posts, I standardized the data and removed some unnecessary layers, but the model still always predicts the same output.
I think the model is big enough, and the activation functions make sense given that the outputs are all within [-1, 1].
So my question is: what am I doing wrong?
The model is the following:
# Imports assumed from the rest of the script:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Reshape
from tensorflow.keras.callbacks import EarlyStopping

class Stacked_LSTM():
    def __init__(self, training_inputs, training_outputs, n_steps_in, n_steps_out,
                 n_features_in, n_features_out, metrics, optimizer, epochs):
        self.training_inputs = training_inputs
        self.training_outputs = training_outputs
        self.epochs = epochs
        self.n_steps_in = n_steps_in
        self.n_steps_out = n_steps_out
        self.n_features_in = n_features_in
        self.n_features_out = n_features_out
        self.metrics = metrics
        self.optimizer = optimizer
        self.stop = EarlyStopping(monitor='loss', min_delta=0.000000000001, patience=30)

        self.model = Sequential()
        self.model.add(LSTM(360, activation='tanh', return_sequences=True,
                            input_shape=(self.n_steps_in, self.n_features_in,)))  # kernel_regularizer=regularizers.l2(0.001) was not a good idea
        self.model.add(layers.Dropout(0.1))
        self.model.add(LSTM(360, activation='tanh'))
        self.model.add(layers.Dropout(0.1))
        self.model.add(Dense(self.n_features_out * self.n_steps_out))
        self.model.add(Reshape((self.n_steps_out, self.n_features_out)))
        self.model.compile(optimizer=self.optimizer, loss='mae', metrics=[metrics])

    def fit(self):
        return self.model.fit(self.training_inputs, self.training_outputs,
                              callbacks=[self.stop], epochs=self.epochs)

    def predict(self, input):
        return self.model.predict(input)
Notes
1) In this particular problem, the time-series data is not "continuous", because each time series belongs to a particular hurricane. I have therefore adapted the training and test samples to each hurricane. The implication is that I cannot use stateful=True in my layers, because that would mean the model makes no distinction between the different hurricanes (if my understanding is correct).
2) There is no image data, so no convolutional model is needed.
A few suggestions, based on my experience:
Four LSTM layers is too much. Stick to two, maximum three.
Don't use relu as the activation for LSTMs.
Do not use BatchNormalization for time series.
Other than these, I'd also suggest removing any Dense layers between two LSTM layers; a sketch of the resulting stack is below.
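A minimal sketch of a stack following those rules, reusing the dimensions from the question (the unit count is a placeholder):
model = Sequential()
model.add(LSTM(128, activation='tanh', return_sequences=True,
               input_shape=(n_steps_in, n_features_in)))  # first of at most two LSTM layers
model.add(LSTM(128, activation='tanh'))                   # second LSTM, no Dense in between
model.add(Dense(n_steps_out * n_features_out))            # single Dense head at the very end
model.add(Reshape((n_steps_out, n_features_out)))
model.compile(optimizer='adam', loss='mae')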

Acc decreasing to zero in LSTM Keras Training

While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. My training accuracy also keeps fluctuating without increasing significantly, as can be seen in TensorBoard:
[Training accuracy plot]
This is my model:
model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8,return_sequences=True))
model1.add(LSTM(8,return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the input and output shapes of the data, since the model itself seems to be fine. The input data has shape (2000, 40, 2) and the output has shape (2000, 1).
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
TimeDistributed applies the same Dense layer (the same weights) to the LSTM's outputs, one time step at a time.
I also recommend this tutorial: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/.
I was able to increase the accuracy to 97% with a few data-related adjustments. The main obstacle was an unbalanced dataset split between the training and validation sets. Further improvements came from normalizing the input trajectories (sketched below). I also increased the number of cells in the first layer.
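The normalization step was roughly this (a sketch; dataScatter is the (2000, 40, 2) input array):
# standardize each feature across all trajectories and time steps
mean = dataScatter.mean(axis=(0, 1), keepdims=True)
std = dataScatter.std(axis=(0, 1), keepdims=True)
dataScatter = (dataScatter - mean) / (std + 1e-8)  # epsilon guards against division by zero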

TimeDistributed layer and return sequences etc for LSTM in Keras

Sorry, I am new to RNNs. I have read this post on the TimeDistributed layer.
I have reshaped my data into the Keras-required [samples, time_steps, features] format: [140*50*19], which means I have 140 data points, each with 50 time steps and 19 features. My output is shaped [140*50*1]. I care most about the last data point's accuracy. This is a regression problem.
My current code is :
x = Input((None, X_train.shape[-1]) , name='input')
lstm_kwargs = { 'dropout_W': 0.25, 'return_sequences': True, 'consume_less': 'gpu'}
lstm1 = LSTM(64, name='lstm1', **lstm_kwargs)(x)
output = Dense(1, activation='relu', name='output')(lstm1)
model = Model(input=x, output=output)
sgd = SGD(lr=0.00006, momentum=0.8, decay=0, nesterov=False)
optimizer = sgd
model.compile(optimizer=optimizer, loss='mean_squared_error')
My questions are:
My case is many-to-many, so I need to use return_sequences=True? What if I only need the last time step's prediction? That would be many-to-one, so I would need my output to be [140*1*1] and return_sequences=False?
Is there any way to improve the accuracy at the last time point if I use many-to-many? I care more about it than about the other points' accuracy.
I have tried to use the TimeDistributed layer as
output = TimeDistributed(Dense(1, activation='relu'), name='output')(lstm1)
but the performance seems to be worse than without the TimeDistributed layer. Why is this?
I tried to use optimizer=RMSprop(lr=0.001). I thought RMSprop was supposed to stabilize the NN, but I was never able to get good results with it.
How do I choose a good lr and momentum for SGD? I have been testing different combinations manually. Is there a cross-validation method in Keras?
So:
Yes, return_sequences=False makes your network output only the last element of the sequence prediction.
You could define the output slicing using a Lambda layer. Here you can find an example of how to do this. Having sliced the output, you can provide an additional output where you feed the values of the last time step.
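A sketch of that slicing, using the functional API from the question (the layer name is illustrative):
from keras.layers import Lambda

# keep only the last time step: (batch, time, 1) -> (batch, 1, 1)
last_step = Lambda(lambda seq: seq[:, -1:, :], name='last_step')(output)
model = Model(input=x, output=[output, last_step])  # second output holds the last-timestep values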
From a computational point of view, these two approaches are equivalent. Maybe the problem lies in the randomness introduced by weight sampling.
Actually, using RMSprop as a first choice for RNNs is a rule of thumb, not a generally proven law. Moreover, it is strongly advised not to change its parameters, so this might be causing the problems. Another thing is that an LSTM needs a lot of time to stabilize; maybe you need to leave it training for more epochs. Lastly, maybe your data would favour another activation function.
You could use the Keras scikit-learn wrapper (keras.wrappers.scikit_learn.KerasRegressor) together with scikit-learn's cross-validation utilities.
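A sketch of how the wrapper plugs into scikit-learn's grid search, assuming a build_model function that recreates the compiled model from the question (the parameter grid is arbitrary):
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

def build_model(lr=0.00006, momentum=0.8):
    # hypothetical builder: recreates and compiles the model from the question
    x = Input((None, X_train.shape[-1]), name='input')
    lstm1 = LSTM(64, name='lstm1', dropout_W=0.25, return_sequences=True, consume_less='gpu')(x)
    out = Dense(1, activation='relu', name='output')(lstm1)
    m = Model(input=x, output=out)
    m.compile(optimizer=SGD(lr=lr, momentum=momentum), loss='mean_squared_error')
    return m

reg = KerasRegressor(build_fn=build_model, nb_epoch=20, verbose=0)
grid = GridSearchCV(reg, param_grid={'lr': [1e-3, 1e-4, 6e-5], 'momentum': [0.8, 0.9]}, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)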
