Why do I have such inconsistent results when training my model?

I am using Keras to train my model.
I have initialised numpy and tensorflow seeds. I have made a 50-iterations loop where I train and test my Keras deep learning architecture (always the same) on the same training, validation and test sets. I get those results :
print (np.mean(train_accuracy_vec))
print (np.std(train_accuracy_vec))
print ()
print (np.mean(valid_accuracy_vec))
print (np.std(valid_accuracy_vec))
print ()
print (np.mean(test_accuracy_vec))
print (np.std(test_accuracy_vec))
print ()
I get this :
Sometimes it gives an unacceptable false positive rate, while at other times it works quite well. I used EarlyStopping based on val_acc behaviour.
So, what could cause such great instability?
Also, isn't it a bit odd to have the validation score far below the test score?
Thanks
EDIT: Despite @Thomas Pinetz's kind answer, I don't get better results the second time: the std is still high...
To be more precise, here is how my loop is made :
# tf, random and numpy seeds...
# lots of data reading, preprocessing,...(including split between train, valid and test sets)
for k in range (0,50) :
    print (k)
    model = Sequential()
    model.add(Dense(200, activation='elu', input_dim=trainX.shape[1], init=keras.initializers.glorot_uniform(1)))
    model.add(Dropout(0.3))
    # some additional layers...
    model.compile(loss='binary_crossentropy',metrics=['accuracy'], optimizer='adam')
    model.fit(trainX, trainY, validation_data=(validX, validY), epochs=100, verbose=0 , callbacks=callbacks_list)
    train_score = model.evaluate(trainX, trainY)
    train_accuracy_vec.append (train_score[1])
    print(train_score)
    trainPredict = model.predict(trainX)
    print(confusion_matrix(trainY, trainPredict.round()))
    # and the same for valid and test...

What causes differences between runs is the random initialization of weights. Gradient-descent-based methods get stuck in local minima, so the best solution found on each run depends on the initial weights. There's not much you can do about that; it's an inherent problem of neural networks. It might help to take a look at Xavier/He initialization, though.
As to why your validation error is considerably worse than the test error, it's indeed weird. However, if your dataset is relatively small and you use the same split for all runs, it might just have happened that the test set has patterns similar to the training set while the validation set does not. You'd be better off re-splitting at each run.
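
A minimal sketch of re-splitting on every run, assuming the samples fit in a plain Python list. The helper name split_dataset and the 70/15/15 fractions are illustrative choices, not something taken from the question:

```python
import random

def split_dataset(samples, seed, train_frac=0.7, valid_frac=0.15):
    """Shuffle a copy of `samples` and cut it into train/valid/test parts."""
    rng = random.Random(seed)          # one generator per run keeps each run repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left goes to the test set
    return train, valid, test

data = list(range(100))                # stand-in for the real (x, y) samples
for run in range(3):
    # A different seed per run gives a different split per run.
    train, valid, test = split_dataset(data, seed=run)
    print(run, len(train), len(valid), len(test))
```

Averaging the scores over such runs shows whether the gap between validation and test accuracy is a property of the model or just of one particular split.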

To obtain reproducible results in Keras, follow these instructions: https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development.
It might be that multi-threading is causing problems.
Edit:
Every time you call a method that returns a random number, that number depends on your initial seed and on how many random numbers have already been drawn. So while your script as a whole always returns the same values (e.g. the same mean and std for your training/validation/test evaluation), it will not use the same random numbers in each iteration of the for loop.
What you can try is to loop around the entire script and set the random seed at the beginning of the for loop. Maybe then you will get the same results.
There are all kinds of randomness in building and training a DL model, from the initialization of the weights to the order of your training set, which is shuffled by default. The initialization will not be the same if you do not reset the random seed. The same goes for the order of the dataset: in each epoch your training data is shuffled, and the shuffle will be different in every run of the for loop. There are also layers with stochastic elements, like dropout, which need the same seed to guarantee the same performance.
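
The difference between seeding once and seeding inside the loop can be illustrated with the standard library alone. In this sketch, fake_training_run is a hypothetical stand-in for one build/fit/evaluate cycle; a real Keras script would need to seed NumPy and TensorFlow in the same place, as described in the FAQ linked above:

```python
import random

def fake_training_run():
    """Hypothetical stand-in for one train/evaluate cycle; returns a 'random' score."""
    return round(random.random(), 6)

# Seed once before the loop: the script as a whole is repeatable,
# but every iteration consumes a different part of the random stream.
random.seed(42)
seeded_once = [fake_training_run() for _ in range(3)]

# Re-seed at the start of every iteration: each iteration sees the same stream.
seeded_each = []
for _ in range(3):
    random.seed(42)
    seeded_each.append(fake_training_run())

print(seeded_once)   # values differ between iterations
print(seeded_each)   # the same value repeated
```

Even with every seed reset, multi-threading or GPU kernels may still introduce small differences, so identical results are not guaranteed.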

Related

Why does my LSTM model predict wrong values although the loss is decreasing?

I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using an LSTM model with Tensorflow.
You can imagine my dataset to look something like this:
Index | x data                     | y data
0     | np.array(shape (10000,1) ) | numpy.float32
1     | np.array(shape (10000,1) ) | numpy.float32
2     | np.array(shape (10000,1) ) | numpy.float32
...   | ...                        | ...
56    | np.array(shape (10000,1) ) | numpy.float32
Easily said I just want my model to predict a number (y data) from a sequence of numbers (x data).
For example like this:
array([3.59280851, 3.60459062, 3.60459062, ...]) => 2.8989773
array([3.54752101, 3.56740332, 3.56740332, ...]) => 3.0893357
...
x and y data
From my x data I created a numpy array x_train which I want to use to train the network.
Because I am using an LSTM network, x_train should be of shape (samples, time_steps, features).
I reshaped my x_train array to be shaped like this: (57, 10000, 1), because I have 57 samples, which each are of length 10000 and contain a single number.
The y data was created similarly and is of shape (57,1) because, once again, I have 57 samples which each contain a single number as the desired y output.
Current model attempt
My model summary looks like this:
The model was compiled with model.compile(loss="mse", optimizer="adam") so my loss function is simply the mean squared error and as an optimizer I'm using Adam.
Current results
Training of the model works fine and I can see that the loss and validation loss decreases after some epochs.
The actual problem occurs when I want to predict some data y_verify from some data x_verify.
I do this after the training is finished to determine how well the model is trained.
In the following example I simply used the training data to check how well the model is trained (I know about overfitting and that verifying with the training set is not the right way of doing it, but that is not the problem I want to demonstrate right now).
In the following graph you can see the y data I provided to the model in blue.
The orange line is the result of calling model.predict(x_verify) where x_verify is of the same shape as x_train.
I also calculated the mean absolute percentage error (MAPE) of my prediction and the actual data and it came out to be around 4% which is not bad, because I only trained for 40 epochs. But this result still is not helpful at all because as you can see in the graph above the curves do not match at all.
Question:
What is going on here?
Am I using an incorrect loss function?
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be?
Ideally the prediction should be the y data which I provided so the curves should look the same (more or less).
Do you have any ideas?
Thanks! :)

After some back and forth in the comments, I'll give my best estimation to your questions:
What is going on here?
Very complex (too many layers deep) model with very little data, trained for too few epochs on non-normalized data (credit to Muhammad in his answer). The biggest issue, as far as I can tell, is the number of training epochs.
Am I using an incorrect loss function?
MSE is an appropriate loss function for a regression task.
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be? Ideally the prediction should be the y data which I provided so the curves should look the same (more or less). Do you have any ideas?
Too few training epochs is the biggest contributor, as far as I can tell.
Based on the Colab notebook that Luca shared:
30 Epochs, no normalization
Way off target, flat predictions (though I can't reproduce how flat the predictions are that Luca posted)
30 Epochs, with normalization
Worse off.
2000(!) epochs, no normalization
Okay, now the predictions are at least in the ballpark
2000 epochs, with normalization
And now the model seems to be starting to figure things out, like we'd hope it should. Granted, this is training on the 11 samples that were cobbled together in the notebook, so it's naturally going to overfit. We're just happy to see it learn something.
2000 epochs, normalization, different loss
Never be afraid to try out different losses, as some may be better suited than others. Not knowing the domain of this task, I'm just trying out mean_absolute_error instead of mean_squared_error.
Caution! Don't compare loss values between different losses. They're not on the same scale.
2000 epochs, normalization, larger learning rate
Okay, so it's taking a long time to learn. Can I nudge it along a little faster? Sure, up the learning rate of the optimizer, and it'll get you to where you're going faster. Here, we up it by a factor of 5.
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))
You could even employ a learning rate scheduler that starts big and slowly diminishes it over the course of epochs.
def scheduler(epoch, lr):
    if epoch < 400:
        return lr
    else:
        return lr * tf.math.exp(-0.01)
lrs = tf.keras.callbacks.LearningRateScheduler(scheduler)
history = model.fit(x=x_train, y=y_train, epochs=1000, callbacks=[lrs])
Hope this all helps!
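
On the caution about not comparing values across different losses: a small pure-Python sketch, with made-up numbers, shows that the same predictions score differently under MSE and MAE. The two helper functions are plain re-implementations for illustration, not the Keras ones:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.9, 3.1, 3.0, 2.8]
y_pred = [3.4, 2.6, 3.5, 2.3]   # every prediction is off by 0.5

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```

Here the MSE comes out at about 0.25 and the MAE at about 0.5 for exactly the same predictions, so a lower number under one loss says nothing about being better than a run trained with the other.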

From the notebook it seems you are not scaling your data. You should normalize or standardize your data before training your model.
https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
You can also add a normalization layer in Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
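
As a rough sketch of what standardizing means here, assuming plain Python lists and made-up values: the mean and standard deviation are computed on the training data only and then reused for any new data. The helper names fit_standardizer and standardize are hypothetical:

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Compute the mean and standard deviation from the training values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0                    # avoid dividing by zero for constant data
    return mu, sigma

def standardize(values, mu, sigma):
    """Shift and scale values to zero mean and unit variance (w.r.t. mu, sigma)."""
    return [(v - mu) / sigma for v in values]

train = [3.59, 3.60, 3.60, 3.55, 3.57]   # illustrative training values
new = [3.58, 3.61]                       # illustrative unseen values

mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
new_scaled = standardize(new, mu, sigma)  # reuse the training statistics
print(train_scaled, new_scaled)
```

If the targets are scaled too, predictions have to be mapped back with the inverse transform before comparing them to the original y values.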

I just wanted to post a quick update.
First of all, this is my current result:
I am absolutely happy, that I was finally able to achieve what I wanted to. At least to some extent.
There were some steps I had to take to achieve this result:
Normalization
Training for 500-1000 epochs
Most importantly: Reducing the amount of time steps to 1000
In the end my thought of "the more data, the better" was a huge misconception. I was not able to achieve such results with 10000 time steps per sample AT ALL. So I'm glad that I just gave 1000 a shot.
Thank you all very much for your answers!
I will try to further improve my model with your suggestions :)

I think it would help to change the loss to Huber loss and perhaps the optimizer to SGD. Because the dataset is small, first try to find the best learning rate with a callback (a learning rate schedule), and normalize or standardize the data before training the model.

Accuracy starts decreasing while loss keeps going down

I am training the following NN with tensorflow:
def build_model():
    inputs_layers = []
    concat_layers= []
    for k in range(k_i, k_f+1):
        kmers = train_datasets[k].shape[1]
        unique_kmers = train_datasets[k].shape[2]
        input = Input(shape=(kmers, unique_kmers))
        inputs_layers.append(input)
        x = Dense(4, activity_regularizer=tf.keras.regularizers.l2(0.2))(input)
        x = Dropout(0.4)(x)
        x = Flatten()(x)
        concat_layers.append(x)
    inputs = keras.layers.concatenate(concat_layers, name='concat_layer')
    x = Dense(4, activation='relu',activity_regularizer=tf.keras.regularizers.l2(0.2))(inputs)
    x = Dropout(0.3)(x)
    x = Flatten()(x)
    outputs = Dense(1, activation='sigmoid')(x)
    return inputs_layers, outputs
I used the for loop for creating the input layers because I need them to be flexible.
The problem is that when I train this NN, at the beginning the validation loss goes down as the accuracy goes up. But after some point, the validation accuracy starts to go down while the loss keeps going down.
I understand that this is possible because the accuracy is measured after the output probabilities are converted into 1 or 0, but I would expect this to be an exception that occurs when I am not "lucky" with a particular validation set. However, I shuffled my dataset and obtained different validation sets several times, and the output is always the same: loss and accuracy go down together.
I understand that the model is overfitting. Despite that, I would still expect to see a correlation between accuracy and loss. I am using a stop_early callback monitoring val_loss. I don't like the idea of changing it to monitor val_accuracy, because I feel I would be losing fitness (since I would prevent val_loss from reaching its lowest value).

To me this looks like overfitting:
One technique to stop training in such a case is early stopping (see also the Keras Early Stopping callback).
Avoiding overfitting is tricky. You are already employing one method, which is the Dropout layer. You can try increasing the dropout probability, but there is a sweet spot, and "wrong" values will just hurt the model quality and performance.
The holy grail here is usually "more data". If that is not available, you can try data augmentation.

This is not so unusual, especially when you're using a "regularizer".
Your loss is a sum of the "actual loss" (the one you defined in compile) and the regularization losses.
So, it's perfectly possible that your weights/activations are going down, thus your regularization loss is going down too, while the actual loss may be going up invisibly.
Also, accuracy and loss are not always well connected. Although your case seems extreme, sometimes the accuracy might stop improving while the loss keeps improving (especially in a case where the loss follows a strange logic).
So, some hints:
Try this model without regularization and see what happens
If you see the regularization was the problem and you still want it, decrease the regularization coefficients
Create a callback to stop training a few epochs after the maximum validation accuracy
Use losses that follow the accuracy relatively well (usually the standard losses do)
Don't forget to check whether your validation data is in the "same scale" as the training data.
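
That accuracy and loss are not always well connected can be shown without any regularization at all. In this sketch, with made-up predictions, the loss improves between two hypothetical epochs while the accuracy gets worse; the helper functions are plain re-implementations for illustration:

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy for labels in {0, 1}."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 1, 1, 0]
early = [0.60, 0.60, 0.60, 0.40]   # hesitant, but all on the right side of 0.5
late = [0.99, 0.99, 0.45, 0.01]    # very confident, but one sample flips sides

loss_early, acc_early = binary_crossentropy(y_true, early), accuracy(y_true, early)
loss_late, acc_late = binary_crossentropy(y_true, late), accuracy(y_true, late)
print(loss_early, acc_early)
print(loss_late, acc_late)
```

The later predictions are much more confident on three samples, which lowers the average loss, while the single sample that crosses the 0.5 threshold is enough to lower the accuracy.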

In TensorFlow, why does the same model (dropout of 0.8, Adam optimizer, 50 epochs) give a different accuracy each time I run it?

I am building ANN as below:-
model=Sequential()
model.add(Flatten(input_shape=(25,)))
model.add(Dense(25,activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(16,activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(5,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(xtraindata,ytraindata,epochs=50)
test_loss,test_acc=model.evaluate(xtestdata,ytestdata)
print(test_acc)
I am adding different features to the model and checking whether each newly added feature decreases or increases the accuracy. The problem is that each time I run this code with the same values I get a different accuracy; sometimes it gets as low as 0.50. So I have a few questions:
1. Is the model giving a different accuracy each time because dropout regularization silences random nodes, so that different nodes are silenced on each run, giving sometimes low and sometimes high accuracy?
2. How can I trust the accuracy of the model if it gives a different accuracy each time? How can I know whether the feature I added increased or decreased the accuracy?
3. If I get high accuracy and want to reproduce these results, how do I save the parameters that the model used?

Great questions. Answers:
1. I think your theory is right; it's the dropout. It is the only layer that behaves randomly at every training step, so it's a likely culprit (the initial weights and the shuffling of the training data are also random unless you fix the seeds). Try removing that layer, leaving everything else fixed, and run multiple times. Check if the accuracy is the same.
2. Cross validation. This article explains how it works, but the gist is that it is a statistical technique that trains and checks the accuracy of multiple runs of your model, all with different slices of data. The average accuracy of all runs is used, so highs and lows are averaged into a true(ish) accuracy. That being said, if your model gives inconsistent results just because the dropout varies, it's an indicator that it will perform poorly when you move it to production and use real data.
3. The Keras API has a method model.save("model_name") to save models. You can use keras.models.load_model("model_name") to get it back. As I said in point 2, though: if your model is so finicky that accuracy differs drastically between trainings, then even if you train and get good accuracy, it probably won't be useful on new data. So when you say "If I get high accuracy and wanted to reproduce these results", you really shouldn't be thinking along these lines. Instead, try to get consistently high training accuracy.
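
The cross-validation idea from point 2 can be sketched in plain Python. Everything here is illustrative: k_fold_indices and cross_validate are hypothetical helpers, and dummy_train_and_score stands in for building, fitting and evaluating the real model on the given indices:

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_score):
    """Hold out each fold once, train on the rest, and average the scores."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return mean(scores), scores

def dummy_train_and_score(train_idx, test_idx):
    # Hypothetical stand-in: a real version would fit a fresh model on
    # train_idx and return its accuracy on test_idx.
    return len(train_idx) / (len(train_idx) + len(test_idx))

avg, per_fold = cross_validate(n_samples=20, k=5, train_and_score=dummy_train_and_score)
print(avg, per_fold)
```

Comparing the averaged score with and without a new feature is a steadier signal than comparing two single runs, and the spread of the per-fold scores shows how much of a difference is just noise.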

How to get the prediction of new data by LSTM in python

This is a univariate time series prediction problem. As the following code shows, I divide the initial data into a train dataset (trainX) and a test dataset (testX), then I create an LSTM network with Keras. Next, I train the model on the train dataset. However, to get a prediction I have to feed in the test values, so my question is: why do I have to predict when I already know the true values, which are the test dataset in this problem? What I actually want is a prediction for future time steps. If I have misunderstood something about LSTM networks, please tell me.
Thank you!
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

Since we don't have the future values available while training the model, we divide the data into train and test sets and treat the test set as if it were the future. We train our model using the train set (and usually also a validation set). After the model is trained, we evaluate it on the test set to check the model's performance.

why do I have to predict since I have known the true value which is test dataset in this problem. What I want to get is the prediction value of future time?
In ML, we give the model test data X and it returns Y. Time series can mislead a beginner a bit, because the input X and the output Y apparently come from the same series. The difference is that we input old values of the time series as X, and the output Y is a value of the same series at a later time (the same setup can also be applied to the present or even the past), as you have correctly identified.
(P.S.: I would recommend beginning with simple regression and only then moving on to LSTMs etc., if all you want is to learn machine learning.)

I think the correct term in this context is 'forecasting'.
A good explanation is: after you train and test your model with the data that you already had (as the others here said before me), you want to predict future data, which is, I think, the truly interesting thing about recurrent networks.
To do this, you start by predicting the value for one day after the final date in your original dataset, using the model (which was trained on this past data). Once you have predicted this value, you do the same thing again, but now including the last predicted values in the input, and so on.
The fact that you are using predictions to make further predictions means that it is much more difficult to get good results, so it is common to forecast only short time ranges.
The exact code you need for this could vary, but I think that is the main concept.
In the link below, in the last part, where a forecast is performed, the author shows the code and explains how he did it.
https://towardsdatascience.com/time-series-forecasting-with-recurrent-neural-networks-74674e289816
I guess that's it.
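
The feed-the-prediction-back-in loop described above might look roughly like this. It is only a sketch: forecast is a hypothetical helper, and naive_predict_next (the mean of the window) stands in for a call to the trained model, which would also need its input reshaped to whatever the network expects:

```python
def forecast(history, predict_next, look_back, steps):
    """Predict `steps` future values, feeding each prediction back as input."""
    window = list(history[-look_back:])      # the most recent observed values
    predictions = []
    for _ in range(steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        window = window[1:] + [next_value]   # drop the oldest value, append the prediction
    return predictions

def naive_predict_next(window):
    # Hypothetical stand-in for the trained model: just the mean of the window.
    return sum(window) / len(window)

series = [1.0, 2.0, 3.0, 4.0]
future = forecast(series, naive_predict_next, look_back=2, steps=3)
print(future)
```

Because each step builds on earlier predictions, errors accumulate, which is why short forecast horizons tend to work better.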

Difference between training and testing accuracy + TensorFlow tutorial

The code in this TensorFlow tutorial uses this section of the code to calculate the validation accuracy, right?
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data},
    y=eval_labels,
    num_epochs=1,
    shuffle=False)
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)
Question: If I want to calculate the training set accuracy, to see whether my model is overfitting my training data, and I change the value of "x" to train_data (and y to the matching training labels) so that the training data is fed in for evaluation as well, would that give me the training set accuracy?
If not, how do I check if my model is overfitting my dataset?
How does the number of steps affect the accuracy?
For example, if I have trained it for 20000 steps and then train it for another 100, why does the accuracy change? Is it because the weights are being calculated all over again? Would it be advisable to do something like this, then?
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook])

Normally you have 3 datasets: one for training, one for validation and one for testing. These datasets must not overlap: an image from the training set may not occur in the validation or test set, and so on. You train with the training set, and after each epoch you validate the model with the validation data. The optimizer will always try to update the weights to classify the training data perfectly, so the training accuracy will get very high (>90%). The validation data is data the model has never seen before; validation is done after each epoch (or every x steps) to show how well the model handles data it hasn't seen, which indicates how well the model is improving over time.
The more you train, the higher the training accuracy will become, since the optimizer will do its best to get that value to 100%. The validation accuracy, which does not affect the weights, also increases over time, but not indefinitely. While the training accuracy keeps improving, the validation accuracy might stop improving. Once the validation accuracy starts decreasing over time, you're overfitting. This means that the model is focusing too much on the training data and can't classify another character correctly if it differs from the training set.
At the end of all the training you use a test set; this determines the actual accuracy of your model on new data.
@xmacz: I cannot add comments yet, only answers, so I'll just update my answer. Yes, I checked the source code: your first lines of code test the model on the test data.

The evaluate method is just a function that performs some numerical operations on the input data and produces some output. If you give it the testing data it should return the testing accuracy, and if you give it the training data it should return the training accuracy.
At the end of the day it is just mathematics. What the output is intuitively, is something that you would have to ascertain.

How to know whether your model is overfitting is something you do while training the model. You have to set apart another set called validation dataset which is different from the test and training sets. A typical split of datasets is 70%-20%-10% for training, testing and validating respectively.
During training, every n steps you test your model on the validation dataset. During the first iterations the score on your validation set will get better, but at some point it will get worse. You can use this information to stop training when your model starts to overfit, but doing it right is an art. You could, for instance, stop once the accuracy has decreased in 5 consecutive tests, because sometimes it gets worse in one test and better again in the next. It's hard to say; it depends on many factors.
Regarding your second question, iterating for another 100 steps could make your model better or worse, depending on whether it's overfitting or not, so I'm afraid that question doesn't have a clear answer. The weights will rarely stop changing because the iterations/steps keep "moving" them, for better or for worse. Again, it's difficult to say how to get good results, but you could try early stopping using a validation set, as I've mentioned before.
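
One common variant of this stopping rule is sketched below, assuming validation accuracy is recorded after every check. Instead of requiring five consecutive decreases, it stops after five checks without a new best score; stopping_step and the accuracy values are made up for illustration:

```python
def stopping_step(val_scores, patience=5):
    """Return (step at which training stops, step with the best score)."""
    best_score = float("-inf")
    best_step = None
    checks_without_improvement = 0
    for step, score in enumerate(val_scores):
        if score > best_score:
            best_score = score
            best_step = step
            checks_without_improvement = 0   # a new best resets the counter
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step, best_step       # patience exhausted: stop here
    return len(val_scores) - 1, best_step    # never triggered: ran to the end

# Made-up validation accuracies: a dip at step 3 does not stop training,
# the long decline after step 4 does.
val_accuracy = [0.70, 0.78, 0.83, 0.82, 0.84, 0.83, 0.82, 0.82, 0.81, 0.80, 0.79]
stopped_at, best_at = stopping_step(val_accuracy, patience=5)
print(stopped_at, best_at)
```

In practice you would also keep a copy of the weights from the best step, so that stopping late does not cost you the best model.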
