I am trying to learn deep learning and have stumbled on one exercise here.
It is the first warm-up exercise. I am stuck. For constant sequences of small length (2, 3) it solves the task with no problem. However, when I try a whole sequence of length 50, it stops at 50% accuracy, which is basically a random guess.
According to here, the search space is too big and flat and the network can't find a gradient to solve it. So I tried the approach of continuously increasing the length and saving the model each time (2, 5, 10, 15, 20, 30, 40, 50). It seems it does not generalise well: if I feed it a longer sequence than the one it was trained on, it fails.
According to here it should be an easy problem. I can't figure it out. A somewhat different LSTM architecture is used there, however.
And one solution here to exactly the same problem says it works with the Adagrad optimizer and a learning rate of 0.5.
I am also unsure about feeding one bit at a time, i.e. whether I am feeding the input correctly in the first place. I hope I got it right.
And for variable lengths, I tried and failed miserably.
Code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM
from keras.optimizers import SGD, Adagrad, Adadelta
from keras.callbacks import TensorBoard
from keras.models import load_model
import numpy as np
import time
import os.path
# building the model
def build_model():
    model = Sequential()
    model.add(LSTM(
        32,
        input_shape=(None, 1),
        return_sequences=False))
    model.add(Dropout(0.1))
    model.add(Dense(1))
    model.add(Activation("sigmoid"))
    return model

# generating random data
def generate_data(num_points, seq_length):
    # seq_rand = np.random.randint(1, 12)
    x = np.random.randint(2, size=(num_points, seq_length, 1))
    y = x.sum(axis=1) % 2
    return x, y
X, y = generate_data(100000, 50)
X_test, y_test = generate_data(1000, 50)
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0,
                          write_graph=True, write_images=False)

if os.path.isfile('model.h5'):
    model = load_model('model.h5')
else:
    model = build_model()

opti = Adagrad(lr=0.5)
model.compile(loss="mse", optimizer=opti, metrics=['binary_accuracy'])

model.fit(
    X, y,
    batch_size=10, callbacks=[tensorboard],
    epochs=5)

score = model.evaluate(X_test, y_test,
                       batch_size=1,
                       verbose=1)
print('Test score:', score)

print('Model saved')
model.save('model.h5')
I am so confused now. Thanks for any response!
Edit: Fixed return_sequences to False typo from previous experiments.
Well, this might be a really valuable exercise about LSTMs and the vanishing gradient, so let's dive into it. I'd start by changing the task a little bit. Let's change our dataset to:
def generate_data(num_points, seq_length):
    # seq_rand = np.random.randint(1, 12)
    x = np.random.randint(2, size=(num_points, seq_length, 1))
    y = x.cumsum(axis=1) % 2
    return x, y
and the model by setting return_sequences=True, changing the loss to binary_crossentropy and setting epochs=10. If we solve this task perfectly, then we'd also solve the initial task. In 10 out of 10 runs of the setup I provided, I observed the following behavior: for the first few epochs the model saturated around 50% accuracy, and then it suddenly jumped to 99% accuracy.
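For reference, here is a minimal sketch of that changed setup. It assumes the imports and the cumsum version of generate_data shown above; keeping the 32-unit LSTM, Adagrad(lr=0.5) and batch size from the question's code is my assumption, not something pinned down in this answer.

def build_model_seq():
    model = Sequential()
    model.add(LSTM(32, input_shape=(None, 1), return_sequences=True))
    model.add(Dense(1))                        # applied independently at every time step
    model.add(Activation("sigmoid"))
    return model

X, y = generate_data(100000, 50)               # y now has shape (100000, 50, 1)
model = build_model_seq()
model.compile(loss="binary_crossentropy", optimizer=Adagrad(lr=0.5),
              metrics=['binary_accuracy'])
model.fit(X, y, batch_size=10, epochs=10)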
Why did this happen?
Well, in an LSTM the sweet spot for the parameters is a synchrony between the memory-cell dynamics and the normal activation dynamics. Very often one has to wait a long time for such behavior to emerge. Moreover, the architecture needs to be sufficient to catch the valuable dependencies. In the changed setup we are providing the network with much more information, thanks to which it can be trained faster. Still, it takes some time to find the sweet spot.
Why did your network fail?
Vanishing gradients and problem complexity: it is completely non-obvious what information the network should extract when it gets only a single signal at the end of the sequence of computations. This is why it needs either supervision of the form I provided (cumsum), or a lot of time and luck to finally find the sweet spot on its own.
I'm a newbie in neural networks and I'm trying to do MLP text classification using Keras. Every time I run the code, I get a different val loss and val accuracy; the val loss increases and the val accuracy decreases each time I re-run it. The code that I'm using is like this:
#Split data training and testing (80:20)
Train_X2, Test_X2, Train_Y2, Test_Y2 = model_selection.train_test_split(dataset['review'],dataset['sentiment'],test_size=0.2, random_state=1)
Encoder = LabelEncoder()
Train_Y2 = Encoder.fit_transform(Train_Y2)
Test_Y2 = Encoder.fit_transform(Test_Y2)
Tfidf_vect2 = TfidfVectorizer(max_features=None)
Tfidf_vect2.fit(dataset['review'])
Train_X2_Tfidf = Tfidf_vect2.transform(Train_X2)
Test_X2_Tfidf = Tfidf_vect2.transform(Test_X2)
#Model
model = Sequential()
model.add(Dense(100, input_dim= 1148, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
opt = Adam (learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model.summary()
from keras.backend import clear_session
clear_session()
es = EarlyStopping(monitor="val_loss",mode='min',patience=10)
history = model.fit(arr_Train_X2_Tfidf, Train_Y2, epochs=100,verbose=1, validation_split=0.2,validation_data=(arr_Test_X2_Tfidf, Test_Y2), batch_size=32, callbacks =[es])
I tried using clear_session() so that the model does not start off with the computed weights from the previous training, but I still get different values. How can I fix this? Thank you.
How can I get constant val accuracy and val loss in Keras?
I guess what you want is reproducible training runs. For that you will have to seed the random number generator. Getting reproducible results with a seed is tricky on a GPU because some GPU operations are non-deterministic. However, with the model architecture you are using, that is not a problem.
make model not start off with the computed weights from the previous training.
That is not what is happening: you create the model anew every time, and the Dense layers you are using get initialised from a glorot_uniform distribution.
Sample:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import callbacks
import matplotlib.pyplot as plt
import os
import numpy as np
import random as python_random

def seed():
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    np.random.seed(123)
    python_random.seed(123)
    tf.random.set_seed(1234)

def train(set_seed):
    if set_seed:
        seed()
    dataset = {
        'review': [
            'This is the first document.',
            'This document is the second document.',
            'And this is the third one.',
            'Is this the first document?',
        ],
        'sentiment': [0, 1, 0, 1]
    }
    Train_X2, Test_X2, Train_Y2, Test_Y2 = train_test_split(
        dataset['review'], dataset['sentiment'], test_size=0.2, random_state=1)
    Encoder = LabelEncoder()
    Train_Y2 = Encoder.fit_transform(Train_Y2)
    Test_Y2 = Encoder.fit_transform(Test_Y2)
    Tfidf_vect2 = TfidfVectorizer(max_features=None)
    Tfidf_vect2.fit(dataset['review'])
    Train_X2_Tfidf = Tfidf_vect2.transform(Train_X2).toarray()
    Test_X2_Tfidf = Tfidf_vect2.transform(Test_X2).toarray()
    # Model
    model = keras.Sequential()
    model.add(layers.Dense(100, input_dim=9, activation='sigmoid'))
    model.add(layers.Dense(1, activation='sigmoid'))
    opt = optimizers.Adam(learning_rate=0.01)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    # model.summary()
    history = model.fit(Train_X2_Tfidf, Train_Y2, epochs=10, verbose=0,
                        validation_data=(Test_X2_Tfidf, Test_Y2), batch_size=32)
    return history

def run(set_seed=False):
    plt.figure(figsize=(7, 7))
    for i in range(5):
        history = train(set_seed)
        plt.plot(history.history['val_loss'], label=f"{i+1}")
    plt.legend()
    plt.title("With Seed" if set_seed else "Without Seed")

run()
run(True)
Output (plots of val_loss across five runs, without and with a seed):
You can see how the val_loss differs without a seed (it depends on the initial values of the Dense layers and on the other places where random number generation is used), and how the val_loss is exactly the same with a seed, which makes sure the initial values of the Dense layers (and the randomness used elsewhere) are the same between runs.
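As a narrower, complementary option (my sketch, not part of the sample above), you can also give a single layer's initializer an explicit seed, which pins that layer's starting weights without touching the global seeds:

from tensorflow.keras import layers, initializers

# Hypothetical variant: only this Dense layer's initial weights are fixed between runs.
layer = layers.Dense(100, input_dim=9, activation='sigmoid',
                     kernel_initializer=initializers.GlorotUniform(seed=123))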
During the training process, several different sources of randomness exist. You would need to eliminate all of those to get perfectly reproducible results. Here are a few of those:
The random weight and bias initialization leads to different training runs each time.
During training mini batching is used which influences the trajectory of the training process differently each time. Each training process that uses Stochastic Gradient Descent (SGD) as an optimization technique will thus observe a different ordering of inputs.
Randomness induced by regularization techniques such as Dropout.
Some of these sources of randomness can be overcome by setting a fixed seed; how this is done is described in this older question.
This paper names several other possible sources of bias (such as data augmentation or hardware differences).
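For completeness, a common seeding recipe looks roughly like the sketch below; it assumes a reasonably recent TensorFlow version that provides these utilities.

import tensorflow as tf

# Seeds Python's random module, NumPy and TensorFlow in one call (TF >= 2.7).
tf.keras.utils.set_random_seed(42)
# Optionally force deterministic GPU kernels (newer TF versions); this can slow training down.
tf.config.experimental.enable_op_determinism()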
I am working on a project to implement CNN-LSTM sentiment analysis. Below is the code:
from keras.models import Sequential
from keras import regularizers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, Dropout
from keras.layers import Embedding, LSTM, Bidirectional
from keras.layers import BatchNormalization
model7 = Sequential()
model7.add(Embedding(max_words, 40,input_length=max_len)) #The embedding layer
model7.add(Conv1D(20, 5, activation='relu', kernel_regularizer = regularizers.l2(l = 0.0001), bias_regularizer=regularizers.l2(0.01)))
model7.add(Dropout(0.5))
model7.add(Bidirectional(LSTM(20,dropout=0.5, kernel_regularizer=regularizers.l2(0.01), recurrent_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))))
model7.add(Dense(1,activation='sigmoid'))
model7.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
checkpoint7 = ModelCheckpoint("best_model7.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model7.fit(X_train_padded, y_train, epochs=10,validation_data=(X_test_padded, y_test),callbacks=[checkpoint7])
Even after adding regularizers and dropout, my model has very high validation loss and low accuracy.
Epoch 3: val_accuracy improved from 0.54517 to 0.57010, saving model to best_model7.hdf5
2188/2188 [==============================] - 290s 132ms/step - loss: 0.4241 - accuracy: 0.8301 - val_loss: 0.9713 - val_accuracy: 0.5701
My train and test data:
train: (70000, 7)
test: (30000, 7)
train['sentiment'].value_counts()
1 41044
0 28956
test['sentiment'].value_counts()
1 17591
0 12409
Can anyone please let me know how to reduce overfitting.
Since your code works, I believe that your network is failing silently by 'not learning' a lot from the data. Here's a list of some of the things you can generally check:
Is your textual data well transformed into numerical data? Is it well represented using TF-IDF, bag of words, or any other method that returns a numerical representation?
I see that you imported batch normalization but you do not apply it. Batch norm actually helps and, most importantly, does the job of the regularizers, since each input to each layer is normalized over the mini-batch the network has seen. So maybe remove your L2 regularization from all the layers and apply a simple batch norm instead, which should reduce overfitting (also, use it without dropout, since some empirical studies show that the two should not be combined).
Your embedding output is currently set to 40, that is, 40 numerical elements for a text vector that may contain more than 10,000 elements. That seems a bit low. Try something more 'standard' such as 128 or 256 instead of 40.
Lastly, you set the Adam optimizer with all the default parameters. However, the learning rate can have a big impact on the way your loss function is minimized. As I am sure you know, the gradient step uses this learning rate to progress in its calculation of the derivatives for each neuron; the default is learning_rate=0.001. So try the following code and increase the learning rate a bit (for example 0.01 or even 0.1).
A simple example:
# define model (imports added for completeness)
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense

model = Sequential()
model.add(LSTM(32))  # or a CNN
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))  # sigmoid output to match binary_crossentropy
# define optimizer with a larger learning rate
optimizer = keras.optimizers.Adam(0.01)
# define loss function
loss = keras.losses.binary_crossentropy
# define metric to optimize
# BinaryAccuracy thresholds the sigmoid output; plain Accuracy would compare raw probabilities to the 0/1 labels
metric = [keras.metrics.BinaryAccuracy(name='accuracy')]  # you can add more
# compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metric)
Final thought: I see that you went for a combination of CNN and LSTM, which has great merit. However, it is always recommended to try a simple MLP network first to establish a baseline score that you may later try to beat (a rough sketch follows below). Does a simple MLP with 1 or 2 layers and not many units also produce a low accuracy score? If it performs better, then maybe the problem is in the implementation or in the hyperparameters that you chose for the layers (or even theoretical).
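A minimal sketch of such a baseline, reusing the variable names from your code (max_words, max_len, X_train_padded, X_test_padded); the unit count and epoch count are placeholders to tune, not recommended values:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# Hypothetical MLP baseline on the same padded inputs, with the same embedding size as above.
baseline = Sequential()
baseline.add(Embedding(max_words, 40, input_length=max_len))
baseline.add(Flatten())
baseline.add(Dense(64, activation='relu'))
baseline.add(Dense(1, activation='sigmoid'))
baseline.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
baseline.fit(X_train_padded, y_train, epochs=5, validation_data=(X_test_padded, y_test))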
I hope this answer helps and cheers!
I'm trying to train an LSTM autoencoder.
It's like a seq2seq model: you throw a signal in and get a reconstructed signal sequence back. The sequence I'm using should be quite easy. The loss function and metric are both MSE. The first hundred epochs went well, but after some more epochs the MSE becomes super high and sometimes goes to NaN. I don't know what causes this.
Can you inspect the code and give me a hint?
The sequence is normalized beforehand, so it's in the [0, 1] range; how can it produce such a high MSE error?
This is the input sequence I get from training set:
sequence1 = x_train[0][:128]
(The plot of sequence1 is not shown here.)
I get the data from a public signal dataset (128×1).
This is the code (I modified it from the Keras blog):
# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers
# define input sequence. sequence1 is only a one dimensional list
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50, 100, 1000, 2000]:
    model.fit(sequence, sequence, epochs=epo)
The first few epochs went well; all the losses were about 0.003X or so. Then the loss suddenly became very big, and eventually it went all the way up to NaN.
You might have a problem with exploding gradient values when doing the backpropagation.
Try using the clipnorm and clipvalue parameters to control gradient clipping: https://keras.io/optimizers/
Alternatively, what learning rate are you using? I would also try reducing the learning rate by a factor of 10, 100, or 1000 to check whether you observe the same behavior.
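For example, a minimal sketch of the compile step with clipping enabled; the clipping threshold and the lower learning rate are assumptions to tune, not known-good values:

from keras.optimizers import Adam

# Clip gradients by global norm (or use clipvalue=0.5 for element-wise clipping).
# Depending on your Keras version the argument may be lr instead of learning_rate.
opt = Adam(learning_rate=1e-4, clipnorm=1.0)
model.compile(optimizer=opt, loss='mse')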
'relu' is the main culprit - see here. Possible solutions:
Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01)
Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...)
Increase weight decay
Use a warmup learning rate scheme (start low, gradually increase)
Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
Since your training was stable for many epochs, the third option (increasing weight decay) sounds the most promising, as it seems that eventually your weight norms get too large and the gradients explode. Generally, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the L2 norms using the function below. I also recommend See RNN for inspecting layer activations & gradients.
import numpy as np

def inspect_weights_l2(model, names='lstm', axis=-1):
    def _get_l2(w, axis=-1):
        axis = axis if axis != -1 else len(w.shape) - 1
        reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
        return np.sqrt(np.sum(np.square(w), axis=reduction_axes))

    def _print_layer_l2(layer, idx, axis=-1):
        W = layer.get_weights()
        l2_all = []
        txt = "{} "
        for w in W:
            txt += "{:.4f}, {:.4f} -- "
            l2 = _get_l2(w, axis)
            l2_all.extend([l2.max(), l2.mean()])
        txt = txt.rstrip(" -- ")
        print(txt.format(idx, *l2_all))

    names = [names] if isinstance(names, str) else names
    for idx, layer in enumerate(model.layers):
        if any([name in layer.name.lower() for name in names]):
            _print_layer_l2(layer, idx, axis=axis)
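For example, in the question's training loop it could be called after each round of fitting to watch the norms grow (a usage sketch, not part of the original code):

for epo in [50, 100, 1000, 2000]:
    model.fit(sequence, sequence, epochs=epo)
    inspect_weights_l2(model)   # prints max/mean L2 norms of each LSTM weight tensor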
I was testing a toy problem where the input is a sequence of zeros and ones and the output is whether the number of ones is odd or even (simplicity itself). With an MLP that uses the tanh activation, I never managed to get above random-guess performance (~50%)! Completely by chance, I tried ReLU (out of desperation), and... it worked perfectly (reaching an accuracy of 100% most of the time).
Then, while discussing it with a friend, we wanted to see what would happen if we replaced the zeros with -1 (the task stays the same: odd or even number of ones). To my sheer surprise, it worked with tanh (accuracy between 75% and 90%). ReLU still performs better.
The code
import numpy as np
from sklearn.neural_network import MLPClassifier
# from sklearn.preprocessing import StandardScaler

def generate_data(batch_size, data_length=10, zeros=True):
    x = np.random.randint(0, 2, (batch_size, data_length))
    y = x.sum(axis=1) % 2
    y = y.astype(np.int16).reshape(-1)
    if not zeros:  # in this case, convert the zeros to -1
        x[x == 0] = -1
    return x, y

# With ReLU, it is perfect! With Tanh, it is terrible.
# clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="relu")
clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="tanh")

X_train, y_train = generate_data(batch_size=10000, data_length=10, zeros=True)
X_test, y_test = generate_data(batch_size=1000, data_length=10, zeros=True)

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
To get -1 instead of zeros, just set the zeros parameter to False when calling the generate_data function.
Can someone please explain what is happening here?
Edit:
Thanks to #BlackBear and #Andreas K. for their answers. So apparently using tanh leads the neurons to saturate (the gradient stops moving). With a better choice of learning rate, or by letting the network optimize for longer, it does work. For example, updating the classifier to
clf = MLPClassifier(solver='adam', verbose=True, batch_size=512, activation="tanh", max_iter=5000, learning_rate="adaptive", n_iter_no_change=100)
It always works!
It is just an issue with the optimization procedure, which is not able to find good values for the weights. You can construct a network with 2^10 = 1024 neurons in the hidden layer, one for each input pattern, and let the output neuron respond to the hidden neurons corresponding to inputs with an even number of ones. With this procedure, you can model every Boolean function.
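A concrete NumPy version of that construction (my sketch; it reuses generate_data from the question and wires the output weights to the odd-parity patterns instead, so the prediction matches the question's labels y = sum % 2):

import numpy as np
from itertools import product

# One hidden unit per possible length-10 {0,1} pattern: 2**10 = 1024 units.
patterns = np.array(list(product([0, 1], repeat=10)))   # shape (1024, 10)
W1 = 2 * patterns - 1                                    # +1 where the pattern has a 1, -1 where it has a 0
b1 = -(patterns.sum(axis=1) - 0.5)                       # unit i is positive only for exactly pattern i
w2 = (patterns.sum(axis=1) % 2).astype(float)            # output weight 1 for odd-parity patterns

def predict_parity(x):                                   # x: (n, 10) array of 0/1
    h = (x @ W1.T + b1 > 0).astype(float)                # one-hot indicator of the matching pattern
    return (h @ w2 > 0.5).astype(int)                    # 1 = odd number of ones

X_test, y_test = generate_data(batch_size=1000, data_length=10, zeros=True)
print((predict_parity(X_test) == y_test).mean())         # prints 1.0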
While trying to implement an LSTM network for trajectory classification, I have been struggling to get decent classification results even for simple trajectories. Also, my training accuracy keeps fluctuating without increasing significantly; this can also be seen in TensorBoard:
Training accuracy (plot not shown):
This is my model:
model1 = Sequential()
model1.add(LSTM(8, dropout=0.2, return_sequences=True, input_shape=(40,2)))
model1.add(LSTM(8,return_sequences=True))
model1.add(LSTM(8,return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
and my training code:
model1.compile(optimizer='adagrad',loss='binary_crossentropy', metrics=['accuracy'])
hist1 = model1.fit(dataScatter[:,70:110,:],outputScatter,validation_split=0.25,epochs=50, batch_size=20, callbacks = [tensorboard], verbose = 2)
I think the problem is probably due to the data input and output shapes, since the model itself seems to be fine. The data input has shape (2000, 40, 2) and the output has shape (2000, 1).
Can anyone spot a mistake?
Try to change:
model1.add(Dense(1, activation='sigmoid'))
to:
model1.add(TimeDistributed(Dense(1, activation='sigmoid')))
The TimeDistributed wrapper applies the same Dense layer (same weights) to the LSTM's outputs, one time step at a time.
I also recommend this tutorial: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/
I was able to increase the accuracy to 97% with a few data-related adjustments. The main obstacle was an unbalanced dataset split between the training and validation sets. Further improvements came from normalizing the input trajectories. I also increased the number of cells in the first layer.
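For anyone reading later, here is a rough sketch of those data fixes; it assumes dataScatter and outputScatter from the question, and the stratified split and per-coordinate normalization shown here are illustrative choices, not the author's exact code:

from sklearn.model_selection import train_test_split

X = dataScatter[:, 70:110, :]
X = (X - X.mean(axis=(0, 1))) / X.std(axis=(0, 1))       # normalize each trajectory coordinate
X_tr, X_val, y_tr, y_val = train_test_split(              # stratify to keep the class balance
    X, outputScatter, test_size=0.25, stratify=outputScatter, random_state=0)
hist1 = model1.fit(X_tr, y_tr, validation_data=(X_val, y_val),
                   epochs=50, batch_size=20, verbose=2)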