I'm working on an LSTM model in Keras for next-word prediction, using BERT word vectors as part of the model's inputs.
This is a multi-class categorical problem, and I've done some weird steps to simplify English into clusters of words using BERT and stop-words and k-means, and for my initial practice model I'm using 144 target categories. I plan to up that to about 1000 after working out some kinks.
Here's the architecture of my Keras model:
model = Sequential()
model.add(LSTM(32, input_shape=(SENTENCE_LENGTH, COM_WORDS), dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(COM_WORDS))
model.add(Activation('softmax'))
optimizer = Adam(lr=lr)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.fit(X, y, validation_split=0.05, batch_size=128, epochs=epochs)
My loss starts around 6 and goes down, which isn't unusual as far as I know. I then tried to incorporate class weights, since the model was over-predicting common words like 'the', which is expected. So I used this code to make the weights:
max_count = 0
for word in range(COM_WORDS):
    if Ys.count(word) > max_count:
        max_count = Ys.count(word)
class_weights = {}
for word in range(COM_WORDS):
    class_weights[word] = (max_count - Ys.count(word) + 1)
So my most common y-input would have a value of 1 in the dictionary, and a y-input that is only represented once would be weighted at the count of the most common y-input: around 1 million in this case. Then I added it to my fit() and restarted the model.
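To make the effect of this scheme concrete, here is a minimal sketch (not the code used above) that reproduces the same weights with a single counting pass and compares them with a hypothetical normalised alternative, `inverse_frequency_weights`. Both function names and the toy labels are assumptions for illustration. Because each sample's loss is multiplied by its class weight, the average weight per sample gives a rough idea of how much the reported loss gets scaled.

```python
from collections import Counter


def linear_gap_weights(labels, num_classes):
    """Same scheme as above: weight = max_count - count + 1."""
    counts = Counter(labels)
    max_count = max(counts.values())
    return {c: max_count - counts.get(c, 0) + 1 for c in range(num_classes)}


def inverse_frequency_weights(labels, num_classes):
    """Hypothetical alternative: n_samples / (num_classes * count)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (num_classes * counts[c])
            for c in range(num_classes) if counts[c] > 0}


def mean_sample_weight(labels, weights):
    """Average weight actually applied per training sample."""
    return sum(weights[y] for y in labels) / len(labels)


# Toy labels: class 0 is common, class 2 is rare.
labels = [0] * 6 + [1] * 3 + [2] * 1
gap = linear_gap_weights(labels, 3)
inv = inverse_frequency_weights(labels, 3)
print(gap, mean_sample_weight(labels, gap))
print(inv, mean_sample_weight(labels, inv))
```

With the inverse-frequency variant the mean sample weight stays at 1, so the weighted loss remains on roughly the same scale as the unweighted one, whereas the gap-based weights grow with the size of the dataset.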
When I run my model with the weights, I get insanely high loss (this is just a batch of 100,000 of all my inputs being run):
Epoch 1/3
950000/950000 [==============================] - 160s 168us/step - loss: 3014409.5359 - acc: 0.1261 - val_loss: 2808283.0898 - val_acc: 0.1604
The accuracy is fine though! Not too different than when I didn't use weights.
MY QUESTION(s):
Does this high loss matter? Is it just a reflection of my huge weight numbers, or is it indicating something sinister? Are loss numbers relative?
Side question: Should I use a better method to weight my inputs?
Thank you!
Firstly, I know that similar questions have been asked before, but mainly for classification problems. Mine is a regression-style problem.
I am trying to train a neural network using keras to evaluate chess positions using stockfish evaluations. The input is boards in a (12,8,8) array (representing piece placement for each individual piece) and output is the evaluation in pawns. When training, the loss stagnates at around 500,000-600,000. I have a little over 12 million boards + evaluations and I train on all the data at once. The loss function is MSE.
This is my current code:
model = Sequential()
model.add(Dense(16, activation = "relu", input_shape = (12, 8, 8)))
model.add(Dropout(0.2))
model.add(Dense(16, activation = "relu"))
model.add(Dense(10, activation = "relu"))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1, activation = "linear"))
model.compile(optimizer = "adam", loss = "mean_squared_error", metrics = ["mse"])
model.summary()
# model = load_model("model.h5")
boards = np.load("boards.npy")
evals = np.load("evals.npy")
perf = model.fit(boards, evals, epochs = 10).history
model.save("model.h5")
plt.figure(dpi = 600)
plt.title("Loss")
plt.plot(perf["loss"])
plt.show()
This is the output of a previous epoch:
145856/398997 [=========>....................] - ETA: 26:23 - loss: 593797.4375 - mse: 593797.4375
The loss will remain at 570,000-580,000 upon further fitting, which is not ideal. The loss should decrease by a few more orders of magnitude if I am not wrong.
What is the problem and how can I fix it to make the model learn better?
I suspect that your evaluation data contains very large values, such as 100000 pawns when one side has a forced win. Then, if your model predicts something like 0 for that position, the squared error is enormous and drags the MSE up with it. You might want to check your evaluation data and make sure the values lie in a limited range such as [-20..20].
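As a rough sketch of that check, the snippet below clips evaluations to a fixed range and shows how a couple of extreme labels dominate the MSE. The helper name `clip_evals`, the limit of 20 pawns and the sample values are assumptions for illustration, not values taken from the actual dataset.

```python
import numpy as np


def clip_evals(evals, limit=20.0):
    """Clamp evaluations (in pawns) to the range [-limit, limit]."""
    return np.clip(np.asarray(evals, dtype=np.float32), -limit, limit)


# Made-up evaluations: two ordinary positions and two forced wins.
raw = np.array([0.3, -1.5, 100000.0, -100000.0])
clipped = clip_evals(raw)

# MSE of a model that always predicts 0, before and after clipping.
mse_raw = float(np.mean(raw ** 2))
mse_clipped = float(np.mean(clipped ** 2))
print(mse_raw, mse_clipped)
```

In practice one would run `clip_evals` on the array loaded from evals.npy before calling fit, after first inspecting its minimum and maximum.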
Furthermore, evaluating a chess position is a very complex problem. It looks like your model has too few parameters for the task. Possible improvements:
Increase the number of neurons in your dense layers (say to 300, 200, 100).
Increase the number of hidden layers (say to 10).
Use convolutional layers.
Besides this, you might want to create a simple "baseline model" to better evaluate the performance of your neural network. This baseline model could be just a Python function that runs on the input data and evaluates each position by material counting (like bishop = 3 pawns, rook = 5, etc.). Then you can run this function on your dataset and see its MSE. If your neural network produces a smaller MSE than this baseline model, then it is really learning some useful patterns.
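A minimal sketch of such a baseline is shown below. It assumes the 12 planes are ordered as white pawn, knight, bishop, rook, queen, king followed by the same six for black, and it uses the conventional 1/3/3/5/9 piece values; both the plane order and the function names are assumptions, so adjust them to match how the boards were actually encoded.

```python
import numpy as np

# Assumed plane order (hypothetical): white P, N, B, R, Q, K, then black P, N, B, R, Q, K.
# Conventional material values in pawns; the king is not counted.
PIECE_VALUES = np.array([1, 3, 3, 5, 9, 0], dtype=np.float32)


def material_eval(boards):
    """Material balance in pawns (white minus black) for boards of shape (N, 12, 8, 8)."""
    boards = np.asarray(boards, dtype=np.float32)
    counts = boards.reshape(boards.shape[0], 12, -1).sum(axis=2)  # pieces per plane, (N, 12)
    white = counts[:, :6] @ PIECE_VALUES
    black = counts[:, 6:] @ PIECE_VALUES
    return white - black


def baseline_mse(boards, evals):
    """MSE of the material-count baseline against the reference evaluations."""
    return float(np.mean((material_eval(boards) - np.asarray(evals)) ** 2))


# Toy position: one white rook and one black pawn.
board = np.zeros((1, 12, 8, 8), dtype=np.float32)
board[0, 3, 0, 0] = 1.0  # white rook
board[0, 6, 1, 0] = 1.0  # black pawn
print(material_eval(board), baseline_mse(board, [4.0]))
```

Running `baseline_mse` on the full dataset gives a reference number to compare against the network's loss.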
I also recommend the following book: "Neural Networks For Chess: The magic of deep and reinforcement learning revealed" by Dominik Klein. The book contains a description of network architecture used in AlphaZero chess engine and a neural network used in Stockfish.
I am working on a project to implement CNN-LSTM sentiment analysis. Below is the code
from keras.models import Sequential
from keras import regularizers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, Dropout
from keras.layers import Embedding, Bidirectional, LSTM, BatchNormalization
model7 = Sequential()
model7.add(Embedding(max_words, 40,input_length=max_len)) #The embedding layer
model7.add(Conv1D(20, 5, activation='relu', kernel_regularizer = regularizers.l2(l = 0.0001), bias_regularizer=regularizers.l2(0.01)))
model7.add(Dropout(0.5))
model7.add(Bidirectional(LSTM(20,dropout=0.5, kernel_regularizer=regularizers.l2(0.01), recurrent_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))))
model7.add(Dense(1,activation='sigmoid'))
model7.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
checkpoint7 = ModelCheckpoint("best_model7.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model7.fit(X_train_padded, y_train, epochs=10,validation_data=(X_test_padded, y_test),callbacks=[checkpoint7])
Even after adding regularizers and dropout, my model has very high validation loss and low accuracy.
Epoch 3: val_accuracy improved from 0.54517 to 0.57010, saving model to best_model7.hdf5
2188/2188 [==============================] - 290s 132ms/step - loss: 0.4241 - accuracy: 0.8301 - val_loss: 0.9713 - val_accuracy: 0.5701
My train and test data:
train: (70000, 7)
test: (30000, 7)
train['sentiment'].value_counts()
1 41044
0 28956
test['sentiment'].value_counts()
1 17591
0 12409
Can anyone please let me know how to reduce overfitting?
Since your code works, I believe that your network is failing silently by 'not learning' a lot from the data. Here's a list of some of the things you can generally check:
Is your textual data well transformed into numerical data? Is it well represented using TF-IDF, bag of words, or any other method that returns a numerical representation?
I see that you imported batch normalization but you do not apply it. Batch norm actually helps and, most importantly, can do the job of the regularizers, since each input to each layer is normalized using the mini-batch the network has seen. So maybe remove the L2 regularization from all layers and apply a simple batch norm instead, which should reduce overfitting (also, use it without dropout, since some empirical studies show that the two should not be combined).
Your embedding output is currently set to 40, that is 40 numerical elements of a text vector that may contain more than 10,000 elements. It seems a bit low. Try something more 'standard' such as 128 or 256 instead of 40.
Lastly, you set the Adam optimizer with all the default parameters. However, the learning rate can have a big impact on how the loss evolves during training: each gradient step scales the weight update by the learning rate. The default is learning_rate=0.001. So try the following code and increase the learning rate a bit (for example 0.01 or even 0.1).
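A simple example:
import keras
from keras.models import Sequential
from keras.layers import LSTM, BatchNormalization, Dense
# define model
model = Sequential()
model.add(LSTM(32))  # or CNN
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))  # sigmoid so the output is a probability
# define optimizer
optimizer = keras.optimizers.Adam(0.01)
# define loss function
loss = keras.losses.binary_crossentropy
# define metric to track (BinaryAccuracy thresholds the predicted probability)
metric = [keras.metrics.BinaryAccuracy(name='accuracy')]  # you can add more
# compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metric)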
Final thought: I see that you went for a combination of CNN and LSTM, which has great merit. However, it is always recommended to try a simple MLP network first to establish a baseline score that you can later try to beat. Does a simple MLP with 1 or 2 layers and not a lot of units produce a low accuracy score as well? If it performs better, then maybe the problem is in the implementation or in the hyperparameters that you chose for the layers (or is even theoretical).
I hope this answer helps and cheers!
My model is like this:
def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)
history = model.fit(X_train, Y_train)
During training I get:
Epoch 1/2000
1/1 [==============================] - 1s 698ms/step - loss: 0.2338 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1185 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2341 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1181 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 2/2000
1/1 [==============================] - 0s 34ms/step - loss: 0.2328 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1175 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2329 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1169 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
Epoch 3/2000
1/1 [==============================] - 0s 38ms/step - loss: 0.2316 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1163 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2315 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1155 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00
When I look at the history:
print(history.history.keys())
dict_keys(['loss', 'activation_26_loss', 'lstm_151_loss', 'activation_26_accuracy', 'lstm_151_accuracy', 'val_loss', 'val_activation_26_loss', 'val_lstm_151_loss', 'val_activation_26_accuracy', 'val_lstm_151_accuracy'])
Which ones are the training loss and training accuracy?
Since there are only 2 outputs, why are there 3 losses (loss, activation_26_loss and lstm_151_loss) but only 2 accuracies (activation_26_accuracy and lstm_151_accuracy)? What does each loss and each accuracy stand for?
TLDR;
Three losses (2+1): two losses for the individual outputs, and one that combines them. By default each output loss has a weight of 1, so the combined loss is simply their sum (0.1153 + 0.1185 = 0.2338 in your log). You can set both the losses and their weights explicitly.
Two accuracies since there are 2 outputs. metrics are just for the user to view and don't affect the neural network.
Detailed explanation;
Let's try to see what you are doing here first. (I am referring to the previous question you asked to get the shapes for the inputs.)
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])  #<------- One input, 2 outputs
    model.compile(optimizer='adam', loss='mse')
    return model

#Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))
model = _get_model((15, 5), 7, 4)
You are building a supervised model that takes an input of (15,5) shape and outputs 2 things: first a (4,) vector that should contain probability values for the 4 classes, and second a (7,) vector that should contain the cell states from the 7 LSTM units. The loss you are using to train the model to predict both of the outputs is mse.
Since this is a supervised model, you will have to provide the model with samples of inputs and outputs. If you have 100 samples, then your inputs would be (100,15,5) shaped and your outputs would be (100,4) and (100,7), since you have 2 outputs.
Loss(y_actual, y_pred) is a function that tells the neural network how far is its prediction from the actual value. Based on this, it tells the neural network to update itself (its weights specifically using backpropagation) so that its predictions become closer and closer to actual and thus reduce the Loss.
If the above points are clear then let's look at what this network is doing specifically
Your current model has one input and 2 outputs.
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
Since you have defined mse as the loss, both outputs are trained to minimize mse. These are 2 of the 3 losses: activation_26_loss, which is the loss for the final Dense layer (after its activation), and lstm_151_loss, which is the loss on the LSTM cell state. Keras auto-generates these numbered layer names unless you name the layers yourself.
The loss named simply loss is the weighted sum of the other 2 losses; with the default weights of 1 it is just their sum. I'll talk about this more later.
The metrics=['accuracy'] is just a metric for users to track. Since there are 2 outputs, you get 2 different accuracy metrics, one for each output. They don't affect the neural network's training.
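As a quick check against the training log in the question, the reported total matches the plain sum of the two per-output losses, which is what Keras reports when loss_weights is left at its default. The sketch below redoes that arithmetic in plain Python; `combined_loss` is a hypothetical helper written for illustration, not a Keras function.

```python
import math

# Per-output losses and reported total from epoch 1 of the log in the question.
activation_loss = 0.1153
lstm_state_loss = 0.1185
reported_total = 0.2338


def combined_loss(losses, weights=None):
    """Weighted sum of per-output losses; every weight defaults to 1."""
    if weights is None:
        weights = [1.0] * len(losses)
    return sum(w * l for w, l in zip(weights, losses))


default_total = combined_loss([activation_loss, lstm_state_loss])
weighted_total = combined_loss([activation_loss, lstm_state_loss], weights=[0.4, 0.6])
print(math.isclose(default_total, reported_total, abs_tol=1e-4))
print(weighted_total)
```

The second call shows what the total would look like if the two outputs were weighted 0.4 and 0.6 instead.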
Now, when working with neural networks, it's important to know which loss to use where. Here is a table describing what loss and activation functions to use for which type of network.
As you can see, it's a good practice to use softmax and categorical_crossentropy for multi-class problems. So let's try to recreate the model with this change. We want each output to have a different loss to minimize.
Also, let's say the first output is more important than the second. We can also tell the model how to weigh the losses so that it prioritizes which loss to focus on more and by how much.
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('softmax')(fc_lyr)  #<--- Softmax for first output's activation
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam',
                  loss=['categorical_crossentropy', 'mse'],  #<--- 2 losses, one for each output
                  loss_weights=[0.4, 0.6])  #<--- 2 loss weights for final loss
    return model

#Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))
model = _get_model((15, 5), 7, 4)
utils.plot_model(model, show_layer_names=False, show_shapes=True)
Here, the final loss (named simply as loss) is the combination of the 2 separate losses after combining them with 0.4 and 0.6 weights.
Hope this clarifies what you are trying to achieve.
ON A SIDE NOTE: I am curious as to how you are getting the actual values for the final cell state in order to train the model to predict a cell state. Do let me know if that is your intention. It's not very clear what your final goal here is (as I had also asked on your previous question).
I am trying to optimize the best conditions for a sequential model I am building in keras.
I have recently come across the HParams dashboard, which looks like a really nice way of doing this. However, I am running into a problem at the stage of actually running the model to carry out the parameter optimization!
The code I am running is, to begin with, taken directly from the TF page:
https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I have modified the code for Hparams on tf to my sequential model. For the purpose of practice I have removed a dropout layer (as I don't have any in my model) as well as the optimizer. For now I would like to see how my model is affected by changing nodes in layers. My code is as follows:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
METRIC_ACCURACY = 'accuracy'

with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_NUM_UNITS],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )

def train_test_model(hparams):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=tf.nn.relu),
        tf.keras.layers.Dense(24, activation=tf.nn.sigmoid),
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(X_train.values, y_train, epochs=50)
    _, accuracy = model.evaluate(X_test, y_test)
    return accuracy

def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
Up to this point, everything works fine! For this first attempt, I have not changed much apart from removing the dropout and optimizer hyperparameters and plugging my own model into the code. I will need more units than 16 and 32, but this is just for the purpose of building a pipeline.
When I run the following code to execute the optimization, I get an error. The code is:
session_num = 0
for num_units in HP_NUM_UNITS.domain.values:
    hparams = {
        HP_NUM_UNITS: num_units,
    }
    run_name = "run-%d" % session_num
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})
    run('logs/hparam_tuning/' + run_name, hparams)
    session_num += 1
This throws the following error, which I don't quite understand:
ValueError: Cannot create an execution function which is comprised of elements from multiple graphs.
The error occurs after what looks like the first attempt at a model, since a model is fit for the first set of units (16). If I look at the traceback, I get the progress report:
Epoch 1/50
140/140 [==============================] - 0s 3ms/sample - loss: 0.6847 - accuracy: 0.5723......
Epoch 50/50
140/140 [==============================] - 0s 206us/sample - loss: 0.2661 - accuracy: 0.8857
It is after this that I get the error (Cannot create an execution function... etc.).
I am unsure about how to fix this and any help would be much appreciated!
I am more than happy to provide any more detail/code!
Thank you!
I had the same error and I fixed it by converting my train and test values from pandas DataFrames to NumPy arrays. So just use X_train.values, and so on for the other inputs.
If this does not work, just tell me at which line exactly the error occurs.
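A minimal sketch of that conversion is shown below, using toy data in place of the real train and test sets. The helper name `to_numpy` and the column names are assumptions for illustration; the point is that every array passed to fit and evaluate, including X_test and y_test, is converted the same way.

```python
import numpy as np
import pandas as pd


def to_numpy(data):
    """Return a NumPy array whether `data` is a pandas object or already array-like."""
    if isinstance(data, (pd.DataFrame, pd.Series)):
        return data.to_numpy()
    return np.asarray(data)


# Toy stand-ins for the real test set.
X_test = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1.0, 2.0]})
y_test = pd.Series([0, 1])

X_test_np, y_test_np = to_numpy(X_test), to_numpy(y_test)
print(type(X_test_np).__name__, X_test_np.shape, y_test_np.shape)
```

The converted arrays would then be passed to model.evaluate in place of the original pandas objects.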
I have access to a dataframe of 100 persons and how they performed on a certain motion test. The frame contains about 25,000 rows per person, since each person's performance is recorded approximately every centisecond (10^-2 s). We want to use this data to predict a binary y-label, that is to say, whether someone has a motor problem or not.
The columns and some values of the dataset are as follows:
'Person_ID', 'time_in_game', 'python_time', 'permutation_game', 'round', 'level', 'times_level_played_before', 'speed', 'costheta', 'y_label', 'gender', 'age_precise', 'ax_f', 'ay_f', 'az_f', 'acc', 'jerk'
1, 0.25, 1.497942e+09, 2, 1, 'level_B', 1, 0.8, 0.4655, 1, [...]
I reduced the dataset to only 480 rows per person, by just using the row at each half of a second.
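As an illustration of this kind of reduction, the sketch below keeps one sample per half second from a signal recorded at roughly 100 Hz. The function name `downsample`, the uniform sampling rate and the synthetic signal are assumptions for illustration; the real data may need to be resampled using its timestamp column instead.

```python
import numpy as np


def downsample(values, sample_rate_hz=100, interval_s=0.5):
    """Keep one sample per `interval_s` seconds, assuming uniform sampling."""
    step = int(round(sample_rate_hz * interval_s))
    return np.asarray(values)[::step]


# Synthetic stand-in for one person's costheta series (25,000 samples).
costheta = np.linspace(-1.0, 1.0, 25_000)
reduced = downsample(costheta)
print(len(costheta), len(reduced))
```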
Now I want to use a recurrent neural network to predict the binary y_label.
This code extracts the costheta feature used for the input data X and the y-label for output Y.
X = []
Y = []
for person_id in person_list:
    person_frame = df.loc[df['Person_ID'] == person_id]
    # costheta is a measurement of performance
    coslist = list(person_frame['costheta'])
    # extract y-label (take it from the person's first row)
    score = list(person_frame['y_label'].head(1))[0]
    X.append(coslist)
    Y.append(score)
I split the data into training and testing data using a 0.2 test split. Then I tried to create the RNN with Keras as follows:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
# different_input_values are the set of possible input values
model.add(Embedding(different_input_values, embedding_size, input_length=480))
model.add(LSTM(1000))
# output is binary
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
At last, I began training with this code:
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
batch_size = 64
num_epochs = 100
X_valid, y_valid = X_train[:batch_size], Y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], Y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)
However, the resulting accuracy is really low. Depending on the batch size, it varies between 0.4 and 0.6.
12/12 [==============================] - 13s 1s/step - loss: 0.6921 - acc: 0.7500 - val_loss: 0.7069 - val_acc: 0.4219
My question is: in general, with complicated data like this, how does one efficiently train an RNN? Should one refrain from reducing the data to 480 rows per person and keep it at around 25,000 rows per person? Could adding more features, such as acc (acceleration in game) and jerk, give a significant accuracy gain? What significant improvements could one consider?