Keras RNN (GRU, LSTM) produces plateau and then improvement - python

I'm new to Keras (with TensorFlow backend) and am using it to do some simple sentiment analysis on user reviews. For some reason, my recurrent neural network is producing some unusual results that I do not understand.
First, my data is a straightforward sentiment analysis training and test set from the UCI ML archive. There were 2061 training instances, which is small. The data looks like this:
text label
0 So there is no way for me to plug it in here i... 0
1 Good case, Excellent value. 1
2 Great for the jawbone. 1
3 Tied to charger for conversations lasting more... 0
4 The mic is great. 1
Second, here is an FFNN implementation that produces good results.
# FFNN model.
# Build the model.
model_ffnn = Sequential()
model_ffnn.add(layers.Embedding(input_dim=V, output_dim=32))
model_ffnn.add(layers.GlobalMaxPool1D())
model_ffnn.add(layers.Dense(10, activation='relu'))
model_ffnn.add(layers.Dense(1, activation='sigmoid'))
model_ffnn.summary()
# Compile and train.
model_ffnn.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_ffnn = model_ffnn.fit(x_train, y_train, epochs=EPOCHS,
                              batch_size=128, validation_split=0.2, verbose=3)
As you can see, the learning curves show smooth improvement as the number of epochs increases.
Third, here is the problem. I trained a recurrent neural network with a GRU, as shown below. I also tried an LSTM and saw the same results.
# GRU model.
# Build the model.
model_gru = Sequential()
model_gru.add(layers.Embedding(input_dim=V, output_dim=32))
model_gru.add(layers.GRU(units=32))
model_gru.add(layers.Dense(units=1, activation='sigmoid'))
model_gru.summary()
# Compile and train.
model_gru.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
EPOCHS = 50
history_gru = model_gru.fit(x_train, y_train, epochs=EPOCHS,
                            batch_size=128, validation_split=0.2, verbose=3)
However, the learning curves are quite unusual. You can see a plateau where neither the loss nor the accuracy improves until about epoch 17, and then the model starts learning and improving. I have never seen this type of plateau at the start of training before.
Can anyone explain why this plateau is occurring, why it stops and gives way to gradual learning, and how I can avoid it?

Following the comment by @Gerges Dib, I tried out different learning rates in increasing order.
lr = 0.0001
lr = 0.001 (the default learning rate for RMSprop)
lr = 0.01
lr = 0.05
lr = 0.1
This is very interesting. It looks like the plateau was caused by the optimizer's learning rate being too low: the parameters were stuck in a local optimum (or on a flat region of the loss surface) until the optimizer could break out. I had not seen this pattern before.
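For reference, a minimal sketch of how such a sweep can be run, assuming the same x_train, y_train, and vocabulary size V as above (recent Keras versions use learning_rate=; older releases use lr=):

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, optimizers

for lr in [0.0001, 0.001, 0.01, 0.05, 0.1]:
    # Rebuild the model on each iteration so every run starts from fresh weights.
    model = Sequential([
        layers.Embedding(input_dim=V, output_dim=32),
        layers.GRU(units=32),
        layers.Dense(units=1, activation='sigmoid'),
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(x_train, y_train, epochs=50, batch_size=128,
                        validation_split=0.2, verbose=0)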

Related

Extracting weights from best Neural Network in Tensorflow/Keras - multiple epochs

I am working on a one-hidden-layer neural network with 2000 neurons and 8 input neurons plus a constant, for a regression problem.
In particular, as optimizer I am using RMSprop with learning rate = 0.001, ReLU activation from input to hidden layer and linear from hidden to output. I am also using mini-batch gradient descent (32 observations) and training for 2000 epochs.
My goal is, after training, to extract the weights from the best neural network out of the 2000 epochs (where, after many trials, the best one is never the last; by best I mean the one that leads to the smallest MSE).
Using save_weights('my_model_2.h5', save_format='h5') works, but as I understand it, it extracts the weights from the last epoch, while I want those from the epoch in which the NN performed best. Please find the code I have written:
def build_first_NN():
    model = keras.Sequential([
        layers.Dense(2000, activation=tf.nn.relu, input_shape=[len(X_34.keys())]),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
first_NN = build_first_NN()
history_firstNN_all_nocv = first_NN.fit(X_34,
                                        y_34,
                                        epochs=2000)
first_NN.save_weights('my_model_2.h5', save_format='h5')
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['dense_1/dense_1/kernel:0'][:])
The weights extracted this way should be those from the last of the 2000 epochs: how can I instead get the weights from the epoch in which the MSE was smallest?
Looking forward to any comments.
EDIT: SOLVED
Building on the suggestions I received, and for general interest, this is how I updated my code to meet my goal:
# build_first_NN() as defined before
first_NN = build_first_NN()
trained_weights_path = 'C:/Users/Myname/Desktop/otherfolder/Data/my_model_2.h5'
checkpoint = ModelCheckpoint(trained_weights_path,
                             monitor='mean_squared_error',
                             verbose=1,
                             save_best_only=True,
                             mode='min')
history_firstNN_all_nocv = first_NN.fit(X_34,
                                        y_34,
                                        epochs=2000,
                                        callbacks=[checkpoint])
trained_weights = h5py.File(trained_weights_path, 'r')
weights_0 = pd.DataFrame(trained_weights['model_weights/dense/dense/kernel:0'][:])
weights_1 = pd.DataFrame(trained_weights['model_weights/dense_1/dense_1/kernel:0'][:])
Use the ModelCheckpoint callback from Keras.
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_squared_error', verbose=1, save_best_only=True, mode='min')
Use this as a callback in your model.fit(). With mode='min' it will always save the model with the lowest validation MSE at the location specified by filepath.
You can find the documentation here.
Of course, you need validation data during training for this. Otherwise I think you could track the lowest training MSE by writing a callback yourself.
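A minimal sketch of such a custom callback, assuming the TensorFlow-bundled Keras and assuming the training MSE appears in logs under the key 'mean_squared_error' (the exact metric key can differ between Keras versions):

from tensorflow.keras.callbacks import Callback

class BestWeightsByTrainMSE(Callback):
    # Keeps an in-memory copy of the weights from the epoch with the lowest training MSE.
    def __init__(self):
        super().__init__()
        self.best_mse = float('inf')
        self.best_weights = None

    def on_epoch_end(self, epoch, logs=None):
        mse = (logs or {}).get('mean_squared_error')
        if mse is not None and mse < self.best_mse:
            self.best_mse = mse
            self.best_weights = self.model.get_weights()

# Usage:
# cb = BestWeightsByTrainMSE()
# model.fit(X_34, y_34, epochs=2000, callbacks=[cb])
# model.set_weights(cb.best_weights)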

Acc decreasing to zero in LSTM Keras Training

While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly, as can also be seen in TensorBoard:
Training accuracy: [TensorBoard plot not reproduced]
This is my model:
model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8, return_sequences=True))
model1.add(LSTM(8, return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad', loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:], outputScatter, validation_split=0.25,
                   epochs=50, batch_size=20, callbacks=[tensorboard], verbose=2)
I think the problem is probably due to the data input and output shape, since the model itself seems fine. The data input has shape (2000, 40, 2) and the output has shape (2000, 1).
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
TimeDistributed applies the same Dense layer (the same weights) to the LSTM's outputs, one time step at a time; note that this requires the preceding LSTM to return sequences (return_sequences=True).
I also recommend this tutorial: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/
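A minimal sketch of the resulting shapes, assuming the final LSTM is switched to return_sequences=True (imports shown for the TensorFlow-bundled Keras; illustrative only):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

m = Sequential()
m.add(LSTM(8, return_sequences=True, input_shape=(40, 2)))  # output: (batch, 40, 8)
m.add(TimeDistributed(Dense(1, activation='sigmoid')))      # output: (batch, 40, 1)
m.summary()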
I was able to increase the accuracy to 97% with a few data-related adjustments. The main obstacle was an unbalanced dataset split between the training and validation sets. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.
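For reference, a minimal sketch of one common way to normalize the input trajectories, assuming the (2000, 40, 2) array from the question (the variable name dataScatter is taken from the question; the specific normalization is an assumption, since the post does not say which one was used):

import numpy as np

# Standardize each of the 2 coordinates over all samples and time steps.
mean = dataScatter.mean(axis=(0, 1), keepdims=True)
std = dataScatter.std(axis=(0, 1), keepdims=True)
dataScatterNormed = (dataScatter - mean) / (std + 1e-8)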

Keras batch training online predicting not learning

I have been working on a side project trying to teach myself machine learning with Keras, and I think I am stuck here.
My intention is to predict the bike availability of a public bike-sharing system that has 31 stations. For now I am only training my model to predict the availability of a single station. I'd like to do online predictions with batch training: I'd like to start by giving it the number of bikes at, for example, 00:00, with N time steps of history, plus the day of the year and the weekday.
The input data is this:
Day of the year, encoded as ints, 1-JAN is 0, 2-JAN is 1...
Time in 5' intervals encoded as ints the same way as before, 00:00 is 0, 00:05 is 1...
Weekday, again encoded as int
Those 3 columns are then normalized; then I add the columns that refer to the bikes, which are one-hot encoded: if the station holds 20 bikes, the encoded array has length 21. The data is then transformed into a supervised problem, more or less following this tutorial.
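A minimal sketch of that one-hot step, assuming bike counts from 0 to 20 (hence 21 classes); keras.utils.to_categorical handles this directly (the example data is illustrative):

import numpy as np
from tensorflow.keras.utils import to_categorical

bikes = np.array([0, 7, 20])                     # example bike counts at three time steps
encoded = to_categorical(bikes, num_classes=21)  # shape: (3, 21)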
Now I divide my dataset into training (65%) and test (35%) samples, and then define the neural network like this:
model = Sequential()
model.add(LSTM(lstm_neurons, batch_input_shape=(1000, 5, 24), stateful=False))
model.add(Dense(max_cases, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy', 'mse', 'mae'])
Fit the model
for i in range(epochs):
    model.fit(train_x, train_y, epochs=1, batch_size=new_batch_size, verbose=2, shuffle=False)
    model.reset_states()
w = model.get_weights()
The accuracy plot looks good, but the loss plot does weird things.
Once training finishes I predict values; I change from stateless to stateful and modify the batch size:
model = Sequential()
model.add(LSTM(lstm_neurons, batch_input_shape=(1, 5, 24), stateful=True))
model.add(Dense(max_cases, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy', 'mse', 'mae'])
model.set_weights(w)
I now predict using the test values I got from before
for i in range(0, len(test_x)):
    auxx = test_x[i].reshape((batch_size, test_x[i].shape[0], test_x[i].shape[1]))  # (..., n_in, 4)
    yhat = model.predict(auxx, batch_size=batch_size)
This is the result; I am zooming in a bit to get a closer look and not a crowded plot. It doesn't look bad at all: it has some errors, but overall the predictions look good enough.
After this I create my set of data for the online prediction and predict:
for i in range(0, 290):
    # ...
    predicted_bikes = model.predict(data_to_feed, batch_size=1)
    # ...
The result is this one, a continuous line.
As seen in the previous plot, the predicted value is shifted about one interval later than the real value, which makes me think that the neural network has learnt to repeat the previous values. That's why I got a straight line here.

keras - seq2seq model predicting same output for all test inputs

I am trying to build a seq2seq model using LSTMs in Keras. I am currently working on the English-to-French pairs dataset, using 10k pairs (the original dataset has 147k pairs). After training completes, when I try to predict the output for a given input sequence, the model predicts the same output irrespective of the input sequence. I am also using a separate embedding layer for the encoder and the decoder.
What I observe is that the predicted words are nothing but the most frequent words in the dataset, and they are displayed in decreasing order of frequency.
e.g.: 'I know you', 'Can we go?', 'snap out of it' -- for all of these input sequences the output is 'je suis en train' (the same output for all three).
Can anyone help me understand why the model is behaving like this? Am I missing something basic?
I tried the following with batchsize=32, epochs=50, maxinp=8, maxout=8, embeddingsize=100.
encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
encoder_lstm1 = LSTM(units=HIDDEN_UNITS, return_state=True, name="encoder_lstm1" , stateful=False, dropout=0.2)
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm1(encoder_inputs)
encoder_states = [encoder_state_h, encoder_state_c]
decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
decoder_lstm = LSTM(units=HIDDEN_UNITS, return_sequences=True, return_state=True, stateful=False,
                    name='decoder_lstm', dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)
self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(self.model.summary())
self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Xtrain, Xtest, Ytrain, Ytest = train_test_split(input_texts_word2em, self.target_texts, test_size=0.2, random_state=42)
train_gen = generate_batch(Xtrain, Ytrain, self)
test_gen = generate_batch(Xtest, Ytest, self)
train_num_batches = len(Xtrain) // BATCH_SIZE
test_num_batches = len(Xtest) // BATCH_SIZE
self.model.fit_generator(generator=train_gen, steps_per_epoch=train_num_batches,
                         epochs=NUM_EPOCHS,
                         verbose=1, validation_data=test_gen, validation_steps=test_num_batches)  # , callbacks=[checkpoint]
self.encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_inputs = [Input(shape=(HIDDEN_UNITS,)), Input(shape=(HIDDEN_UNITS,))]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
self.decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
Update: I have run on the 147k dataset with 5 epochs and the results vary for every input. Thanks for the help.
However, I am now running the same model on another dataset whose input and output sequences contain, on average, 170 and 100 words respectively after cleaning (removing stopwords and such).
This dataset has around 30k records, and even if I run for 50 epochs, the results are again the same for every test sentence. So what are my next options to try? I was expecting at least different outputs for different inputs (even if wrong), but the identical output adds to my frustration about whether the model is learning properly at all.
Any answers?
An LSTM-based encoder-decoder (seq2seq) that is correctly set up may produce the same output for any input when the net has not trained for enough epochs. I can reliably reproduce the "same stupid output no matter what the input" result simply by reducing my number of epochs from 200 to 30.
One thing that may be confusing is that the default accuracy metrics may not seem to improve much as you go from 30 to 150 epochs. However, in cases where you are using seq2seq for chatbot or translation tasks, something like the BLEU score is more relevant. Even more relevant is your own evaluation of the 'realism' of the responses. These evaluations are done at inference time, not during training, but they can influence your judgment of whether the model has trained sufficiently.
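For reference, a minimal sketch of computing a corpus-level BLEU score at inference time with NLTK (the reference and hypothesis data here are purely illustrative):

from nltk.translate.bleu_score import corpus_bleu

# One list of reference translations per hypothesis; every sentence is a token list.
references = [[['je', 'vous', 'connais']], [['pouvons', 'nous', 'y', 'aller', '?']]]
hypotheses = [['je', 'suis', 'en', 'train'], ['je', 'suis', 'en', 'train']]

# Identical hypotheses that ignore the input score poorly against distinct references.
print(corpus_bleu(references, hypotheses))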
Another thing to consider is that, in my case, the training data was movie dialogue from the Cornell set. Chatbot dialogue is supposed to be helpful - movie dialogue has no such mandate - it's just supposed to be interesting. So we do have a mismatch here between training and use.
In any event, here are examples of the exact same net trained after 30 vs. 150 epochs, responding to the same inputs.
30 epochs: [example responses not reproduced]
150 epochs: [example responses not reproduced]
In this case, the default keras.Model.fit() accuracy went from .2118 (after 30 epochs) to .2243 (after 150), but the inputs are now clearly being differentiated. If this were a classifier and we were just looking at the training accuracy (without looking at sample inferences), we might reasonably assume that all those additional training epochs were pointless.
But think of it this way: evaluating a model's ability to classify a picture of a bird as a bird using labeled data is quite different from evaluating its ability to encapsulate the idea of a sentence or phrase and respond appropriately with a sequence of characters that forms a coherent thought, so the default training metrics aren't as useful.
Another thing we might suspect when we see an oscillating accuracy is that our learning rate is too high or not sufficiently adaptive. I was using rmsprop; maybe Adam would address this problem? But after 30 epochs with Adam, acc was .2085 and starting to oscillate.
At this point, looking at my results, it's clear that training on movie dialogue is just going to produce movie-dialogue-ish text which isn't inherently helpful and not that easy to assess in terms of 'accuracy'. You would expect movie dialogue to have a 'spark of life' - originality, the unexpected, a difference in tone between characters, etc. So at this point, if I want a better chatbot, I need more applicable training data.

Python neural network accuracy - correct implementation?

I wrote a simple neural net/MLP and I'm getting some strange accuracy values and wanted to double check things.
This is my intended setup: a features matrix with 913 samples and 192 features (913, 192). I'm classifying 2 outcomes, so my labels are binary with shape (913, 1). One hidden layer with 100 units (for now). All activations use tanh, the loss includes L2 regularization, and everything is optimized with SGD.
The code is below. It was written in Python with the Keras framework (http://keras.io/), but my question isn't specific to Keras.
input_size = 192
hidden_size = 100
output_size = 1
lambda_reg = 0.01
learning_rate = 0.01
num_epochs = 100
batch_size = 10
model = Sequential()
model.add(Dense(input_size, hidden_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(hidden_size, output_size, W_regularizer=l2(lambda_reg), init='uniform'))
model.add(Activation('tanh'))
sgd = SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")
history = History()
model.fit(features_all, labels_all, batch_size=batch_size, nb_epoch=num_epochs, show_accuracy=True, verbose=2, validation_split=0.2, callbacks=[history])
score = model.evaluate(features_all, labels_all, show_accuracy=True, verbose=1)
I have 2 questions:
This is my first time using Keras, so I want to double check that the code I wrote is actually correct for what I want it to do in terms of my choice of parameters and their values etc.
Using the code above, I get training and test set accuracy hovering around 50-60%. Maybe I'm just using bad features, but I wanted to test to see what might be wrong, so I manually set all the labels and features to something that should be predictable:
labels_all[:500] = 1
labels_all[500:] = 0
features_all[:500] = np.ones(192)*500
features_all[500:] = np.ones(192)
So I set the first 500 samples to have a label of 1; everything else is labelled 0. I manually set all the features to 500 for each of the first 500 samples, and all features for the rest of the samples get a 1.
When I run this, I get training accuracy of around 65% and validation accuracy around 0%. I was expecting both accuracies to be extremely high, almost perfect; is this incorrect? My thinking was that the samples with extremely high feature values all have the same label (1), while the samples with low feature values get a 0 label.
Mostly I'm just wondering whether my code/model is incorrect or whether my logic is wrong.
Thanks!
I don't know that library, so I can't tell you if this is correctly implemented, but it looks legit.
I think your problem lies with the activation function: tanh(500)=1 and tanh(1)=0.76. This difference seems too small to me. Try using -1 instead of 500 for testing purposes, and normalize your real data to something like [-2, 2]. If you need the full range of real numbers, try using a linear activation function. If you only care about the positive half of the real numbers, I propose softplus or ReLU. I've checked, and all those functions are provided with Keras.
You can try thresholding your output too: answering 0.75 when expecting 1, or 0.25 when expecting 0, is valid but may impact your accuracy.
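A minimal sketch of the suggested rescaling to roughly [-2, 2], assuming features_all is a NumPy array (the variable name is taken from the question; min-max scaling is one way to implement the suggestion):

import numpy as np

# Scale each feature column into [-2, 2] based on its observed min/max.
fmin = features_all.min(axis=0)
fmax = features_all.max(axis=0)
features_scaled = 4 * (features_all - fmin) / (fmax - fmin + 1e-8) - 2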
Also, try tweaking your parameters. Based on my own experience, I'd propose:
learning rate = 0.1
lambda in L2 = 0.2
number of epochs = 250 and bigger
batch size around 20-30
momentum = 0.1
learning rate decay about 10e-2 or 10e-3
I'd say that learning rate, number of epochs, momentum and lambda are the most important factors here - in order from most to least important.
PS. I've just spotted that you're initializing your weights uniformly (is that even a word? I'm not a native speaker...). I can't tell you why, but my intuition tells me that this is a bad idea. I'd go with random initial weights.
