The code in this TensorFlow tutorial uses this section of the code to calculate the validation accuracy, right?
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data},
    y=eval_labels,
    num_epochs=1,
    shuffle=False)
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)
Question: If I wanted to calculate the training set accuracy, that is, to see whether my model is overfitting the training data, and I changed the value of "x" to train_data so that the training data is fed in for evaluation as well, would it give me the training set accuracy?
If not, how do I check if my model is overfitting my dataset?
How does the number of steps affect the accuracy?
For example, if I have trained it for 20000 steps and then train it for another 100, why does the accuracy change? Is it because the weights are being calculated all over again? Would it be advisable to do something like this then?
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook])
Normally you have 3 datasets: one for training, one for validation and one for testing. All these datasets have to be unique; an image from the training set may not occur in the validation or test set, and so on. You train with the training set and, after each epoch, you validate the model with the validation data. The optimizers will always try to update the weights to perfectly classify the training data, so the training accuracy will get very high (>90%). The validation data is data the model has never seen before, and evaluating on it after each epoch (or every x steps) shows how well the model reacts to data it hasn't seen before, i.e. how well the model improves over time.
The more you train, the higher the training accuracy will become, since the optimizer will do its best to get that value to 100%. The validation accuracy, which does not update the weights, also increases over time, but not continuously. While the training accuracy keeps improving, the validation accuracy might stop improving. The moment the validation accuracy decreases over time, you are overfitting. This means that the model is focusing too much on the training data and can no longer classify a character correctly if it differs from the training set.
At the end of all the training you use a test set, this will determine the actual accuracy of your model on new data.
#xmacz: I cannot add comments yet, only answers, so I am just updating my answer. Yes, I checked the source code; your first lines of code test the model on the test data.
The evaluate call is just a function that performs some numerical operations on the input data and produces some output. If you feed it the test data it gives the test accuracy, and if you feed it the training data it gives the training accuracy.
At the end of the day it is just mathematics; what the output means intuitively is something you have to ascertain yourself.
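As a rough sketch of that idea (assuming the same TF 1.x Estimator setup as in the tutorial, with train_data and train_labels already loaded), you could reuse numpy_input_fn on the training data and call evaluate a second time:

# Sketch only: evaluate the trained estimator on its own training data
# (train_data / train_labels are assumed to exist, as in the MNIST tutorial).
train_eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data},
    y=train_labels,
    num_epochs=1,    # a single pass is enough for evaluation
    shuffle=False)   # no shuffling needed when only evaluating

train_results = mnist_classifier.evaluate(input_fn=train_eval_input_fn)
print("training set metrics:", train_results)  # compare against eval_results above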
Knowing whether your model is overfitting is something you check while training the model. You have to set apart another set, called the validation dataset, which is different from the test and training sets. A typical split is 70%-20%-10% for training, testing and validation respectively.
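As a minimal sketch of such a split with scikit-learn (the arrays X and y below are dummies standing in for your own features and labels), two chained calls to train_test_split give 70% / 20% / 10%:

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data just to make the sketch self-contained.
X = np.random.rand(1000, 16)
y = np.random.randint(0, 2, size=1000)

# First hold out 30%, keeping 70% for training.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)

# Split the held-out 30% into 20% testing and 10% validation.
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=1/3, random_state=42)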
During training, every n steps you test your model on the validation dataset. During the first iterations the score on your validation set will get better, but at some point it will start to get worse. You can use this information to stop your training when your model starts to overfit, but doing it right is an art. You could, for instance, stop after your validation accuracy has been decreasing for 5 consecutive evaluations, because sometimes it gets worse on one check and better on the next. It's hard to say; it depends on many factors.
Regarding your second question, iterating for another 100 steps could make your model better or worse, depending on whether it's overfitting or not, so I'm afraid that question doesn't have a clear answer. The weights will rarely stop changing, because the iterations/steps keep "moving" them, for good or for bad. Again, it's difficult to say how to get good results, but you could try early stopping using a validation set, as mentioned before.
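As an illustrative sketch of that kind of early stopping rule (the training and evaluation functions below are placeholders, not real code from the question; the counter here tracks evaluations without improvement over the best score, which approximates the "5 consecutive drops" idea):

import random

def train_one_epoch():
    pass                      # placeholder for one epoch of training

def evaluate_on_validation():
    return random.random()    # placeholder for the real validation accuracy

max_epochs = 100
patience = 5                  # stop after 5 checks without improvement
best_val_acc = 0.0
checks_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch()
    val_acc = evaluate_on_validation()

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        checks_without_improvement = 0   # validation still improving, reset the counter
        # (a real implementation would also checkpoint the weights here)
    else:
        checks_without_improvement += 1
        if checks_without_improvement >= patience:
            print("stopping early at epoch", epoch)
            break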
Related
I have some strange behaviour which I cannot explain. During the training of my network, after I am done with the computation on my training batch, I immediately take a batch from the validation set and test my model on it. So my validation is not done at a separate step from the training; I just run a batch of training and then a batch of validation.
My code looks similar to this:
import itertools
from tqdm import tqdm

for (data, targets) in tqdm(training_loader):
    # Training step on one batch
    output = net(data)
    log_p_y = log_softmax_fn(output)
    loss = loss_fn(log_p_y, targets)
    # Do backpropagation (zero_grad, loss.backward(), optimizer.step())

    # Immediately afterwards, run one batch from the validation set
    val_data = itertools.cycle(val_loader)
    valdata, valtargets = next(val_data)
    val_output = net(valdata)
    log_p_yval = log_softmax_fn(val_output)
    loss_val = loss_fn(log_p_yval, valtargets)
At the end of the day, I can see that my model reaches very good training and validation accuracies per epoch. However, on the test data it is not as good. Moreover, the validation loss is roughly double the training loss.
What may be the reasons for poor test accuracy at the end? I would be glad if someone could explain what I am experiencing and recommend some solutions :)
Here is a picture of the training and validation losses.
And here is a picture of the training and validation accuracies.
I am pretty sure that the reason is an unrepresentative validation dataset.
An unrepresentative validation dataset means that the validation dataset does not provide sufficient information to evaluate the ability of the model to generalize.
This may occur if the validation dataset has too few examples as compared to the training dataset.
Make sure your training and testing data are picked randomly and represent as accurately as possible the same distribution and the real distribution.
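A hedged sketch of such a random, representative split with scikit-learn (X and y are dummy arrays here); stratify keeps the class proportions the same in both parts:

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy features/labels purely for illustration.
X = np.random.rand(1000, 32)
y = np.random.randint(0, 10, size=1000)

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% for validation
    shuffle=True,      # pick the examples at random
    stratify=y,        # keep the label distribution identical in both splits
    random_state=0)    # reproducible split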
What are your dataset sizes for training, test and validation?
Overfitting seems to be the issue here, which you can address by using regularization techniques like L1, L2 or dropout.
It seems very likely to be a special case of the unrepresentative train dataset described in this article: Diagnosing Model Performance with Learning Curves.
In addition, accuracy may not be a good metric in your scenario, so try a PR curve or ROC.
What will happen if I use the same training data and validation data for my machine learning classifier?
If the training data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits. We take 60-70% of the data to train the classifier, and then run the classifier against the remaining 30-40%, the validation data which the classifier has not seen yet. This helps measure the accuracy of the classifier and its behaviour, such as overfitting or underfitting, against a real test set with no labels.
We create multiple models and then use the validation to see which model performed the best. We also use the validation data to reduce the complexity of our model to the correct level. If you use train data as your validation data, you will achieve incredibly high levels of success (your misclassification rate or average square error will be tiny), but when you apply the model to real data that isn't from your train data, your model will do very poorly. This is called OVERFITTING to the train data.
Basically, nothing happens. You are just trying to validate your model's performance on the same data it was trained on, which practically doesn't yield anything different or useful. It is like teaching someone to recognize an apple, then asking them to recognize that very same apple and seeing how they perform.
Why a validation set is used then? To answer this in short, the train and validation sets are assumed to be generated from the same distribution and thus the model trained on training set should perform almost equally well on the examples from validation set that it has not seen before.
Generally, we divide the data into training and validation sets to prevent overfitting. To explain, consider a model that classifies whether an image shows a human or not, and a dataset containing 1000 human images. If you train your model with all the images in that dataset and then validate it on the same dataset, your accuracy will be around 99%. However, when you give the model an image from a different dataset to classify, the accuracy will be much lower. Generalizing the model, in this example, means training it to look for a stick figure to decide whether something is a human, instead of looking for one specific handsome blond man. That is why we divide the dataset into validation and training sets: to generalize the model and prevent overfitting.
TLDR;
If you use the same dataset for training and validation then:
training_accuracy = testing_accuracy
Your testing_accuracy will be the same as training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell whether your model has overfit or not.
Let's talk about datasets and evaluation metrics. Here is some terminology (reference) -
Datasets:
Training dataset: The data used to fit the model.
Validation dataset: the data used to validate the generalization ability of the model or for early stopping, during the training process. In most cases, this is the same as the test dataset
Evaluations:
Training accuracy: The accuracy you achieve when comparing predictions and actuals from the training data itself.
Testing accuracy: The accuracy you achieve when comparing predictions and actuals from the testing/validation data.
With the training_accuracy, you can get a sense of how well a model fits your data and the testing_accuracy tells you how well that model is generalizable. If train_accuracy is low, then your model has underfitted and you may need a better model (better features, different architecture, etc) for modeling the given problem. If training_accuracy is high but testing_accuracy is low, this means your model fits the data well, but it's not generalizable on unseen data. This is overfitting.
Note: In practice, it is better to have an overfit model and regularize it heavily than to work with an underfit model.
Another important thing you need to understand is that training a model (fit) and inference from a model (predict / score) are 2 separate tasks. Therefore, when you use the training dataset as the validation dataset, the model is still fit on that same training dataset, and at inference time you are scoring it on the training dataset, which gives you the same number as the training_accuracy.
You will therefore not come to know whether you have overfit at all, BUT that doesn't mean you will get 99% accuracy as the other answer suggests! You may still underfit and get an extremely low model accuracy.
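A tiny scikit-learn sketch of that point (synthetic data, an arbitrary classifier): scoring the model on its own training set simply reproduces the training accuracy, so overfitting stays invisible, while a low value would still reveal underfitting.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration.
X_train = np.random.rand(500, 10)
y_train = (X_train[:, 0] > 0.5).astype(int)

model = DecisionTreeClassifier().fit(X_train, y_train)

training_accuracy = model.score(X_train, y_train)
# "Validation" accuracy when the validation set *is* the training set:
validation_accuracy = model.score(X_train, y_train)

print(training_accuracy == validation_accuracy)  # True: the two numbers are identical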
I have training data with 3961 rows and 32 columns that I want to fit with a Random Forest and a Gradient Boosting model. While training, I need to fine-tune the hyper-parameters of the models to get the best AUC possible. To do so, I minimize the quantity 1-AUC(Y_real, Y_pred) using the Basin-Hopping algorithm described in SciPy; so my training and internal validation subsamples are the same.
When the optimization is finished, I get for Random Forest an AUC=0.994, while for the Gradient Boosting I get AUC=1. Am I overfitting these models? How could I know when an overfitting is taking place during training?
To know if you are overfitting you have to compute:
Training set accuracy (or 1-AUC in your case)
Test set accuracy (or 1-AUC in your case; you can use a validation dataset if you have one)
Once you have calculated these scores, compare them. If the training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
To know if you are overfitting, you always need to go through this process. However, if your training accuracy or score is suspiciously perfect (e.g. an accuracy of 100%), that alone already suggests overfitting.
So, if you don't have separate training and test data, you have to create them using sklearn.model_selection.train_test_split. Then you will be able to compare the two accuracies. Otherwise, you won't be able to know, with confidence, whether you are overfitting or not.
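A hedged sketch of that comparison for the random forest case (synthetic data standing in for the 3961 x 32 table, roc_auc_score from scikit-learn):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X = np.random.rand(3961, 32)
y = np.random.randint(0, 2, size=3961)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

print("train AUC:", round(train_auc, 3))  # usually close to 1.0 for a random forest
print("test  AUC:", round(test_auc, 3))   # a much lower value here means overfitting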
I have trained an LSTM model for Time Series Forecasting. I have used an early stopping method with a patience of 150 epochs.
I have used a dropout of 0.2, and this is the plot of train and validation loss:
The early stopping method stopped the training after 650 epochs and saved the best weights from around epoch 460, where the validation loss was best.
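For reference, a minimal self-contained sketch of that kind of setup (synthetic hourly-looking data and a tiny placeholder LSTM, not the asker's actual model), using Keras' EarlyStopping with a patience of 150 and restore_best_weights:

import numpy as np
import tensorflow as tf

# Tiny synthetic series standing in for the real hourly data (~35000 rows).
series = np.sin(np.arange(2000) / 24.0).astype("float32")
window = 24
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

split = int(0.8 * len(X))                 # 80/20 split kept in temporal order
x_train, y_train = X[:split], y[:split]
x_val, y_val = X[split:], y[split:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, dropout=0.2, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=150,                # wait 150 epochs without improvement, as described above
    restore_best_weights=True)   # keep the weights from the best validation epoch

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=2000, verbose=0,
          callbacks=[early_stop])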
My question is :
Is it normal that the train loss is always above the validation loss?
I know that if it were the opposite (validation loss above the train loss) it would be a sign of overfitting.
But what about this case?
EDIT:
My dataset is a time series with an hourly temporal frequency. It is composed of 35000 instances. I have split the data into 80% train and 20% validation, but in temporal order. So, for example, the training set contains the data up to the beginning of 2017 and the validation set the data from 2017 until the end.
I have created this plot by averaging the data over 15 days and this is the result:
So maybe the reason is, as you said, that the validation data have an easier pattern. How can I solve this problem?
In most cases, the validation loss should be higher than the training loss because the labels of the training set are accessible to the model. In fact, one good habit when training a new network is to use a small subset of the data and see whether the training loss can converge to 0 (fully overfit the training set). If not, it means the model is somehow unable even to memorize the data.
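A hedged PyTorch sketch of that sanity check (tiny synthetic batch; the network here is a placeholder, not the asker's model):

import torch
import torch.nn as nn

# A handful of synthetic samples: if the model cannot drive the loss close to 0
# on this tiny subset, something is wrong with the model or the training loop.
data = torch.randn(16, 10)
targets = torch.randint(0, 3, (16,))

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))  # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(net(data), targets)
    loss.backward()
    optimizer.step()

print("final loss on the tiny subset:", loss.item())  # should be close to 0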
Let's go back to your problem. The observation that the validation loss is less than the training loss does happen. But this is possibly not because of your model; it may be because of how you split the data. Consider that there are two types of patterns (A and B) in the dataset. If you split in such a way that the training set contains both pattern A and pattern B, while the small validation set contains only pattern B, and B is easier to recognize, then you might get a higher training loss.
In a more extreme example, pattern A is almost impossible to recognize but makes up only 1% of the dataset, and the model can recognize all of pattern B. If the validation set happens to contain only pattern B, then the validation loss will be smaller.
As alex mentioned, using k-fold cross validation is a good solution to make sure every sample is used as both validation and training data. Also, printing out the confusion matrix to make sure all labels are relatively balanced is another method to try.
Usually the opposite is true. But since you are using dropout, it is common to have the validation loss less than the training loss. And, like others have suggested, try k-fold cross validation.
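A minimal sketch of the k-fold suggestion with scikit-learn (synthetic data and a generic classifier, just to show the mechanics; the asker's actual network would need a manual fold loop instead):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data purely for illustration.
X = np.random.rand(300, 20)
y = np.random.randint(0, 2, size=300)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

# Every sample is used for validation exactly once; the spread of the
# per-fold scores also hints at how sensitive the model is to the split.
print(scores, scores.mean())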
I have trained my model on a set of 29K images for 36 classes and validated it on 7K images. The model has a training accuracy of 94.59% and a validation accuracy of 95.72%.
It was created for OCR on digits and characters. I know the number of images for training on 36 classes might not be sufficient. I'm not certain what to infer from these results.
Question: Is this a good result? Should the testing accuracy always be greater than training accuracy? Is my model overfitting?
Question: How would I know if my model was overfitting? I'm assuming a very high training accuracy and very low testing accuracy would indicate that?
95% is rather good for 36 classes. If your validation accuracy is higher than training accuracy, you are underfitting. You can run some more epochs, until your training accuracy is a bit higher than validation accuracy.
Exactly, if training accuracy is much higher, you are overfitting.
The training accuracy should normally be higher than the testing/validation accuracy, because your model has to fit the data it was given well in order to be able to predict unknown data. However, the opposite also happens sometimes, and the reason could be:
a. The test set wasn't randomly selected, or it was randomly selected but happened to be a favourable one (a coincidence).
b. Your model is very generalized, possibly in combination with the first issue.
Check the learning curve first. Your case, in which the training accuracy is the smaller one, is rare. The solution may be more data, more complex models, or more epochs (solutions to underfitting).
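As a rough illustration of checking the learning curve (the history values below are made up; in Keras they would come from the History object returned by model.fit):

import matplotlib.pyplot as plt

# Made-up numbers with the same structure as a Keras History.history dict,
# e.g. history = model.fit(..., validation_data=(x_val, y_val)).history
history = {
    "accuracy":     [0.60, 0.75, 0.85, 0.90, 0.93],
    "val_accuracy": [0.65, 0.78, 0.88, 0.92, 0.95],
}

plt.plot(history["accuracy"], label="train accuracy")
plt.plot(history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
# If the validation curve stays above the training curve (as in this question),
# more data, a more complex model, or more epochs may help; if it drops below
# and diverges, that points to overfitting instead.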