i'm using conv net for image classification.
There is something I dont understand theoretically
For training I split my data 60%train/20%validation/20% test
I save weight when metric on validation set is the best (I have same performance on training and validation set).
Now, I do a new split. Some data from training set will be on test set. I load the weight and I classify new test set.
Since weight have been computed on a part of the new test set, are we agree to says this is a bad procedure and I should retrain my model with my new training/validation set?
yes, for fair evaluation no sample in the test set should be seen during training
The all purpose of having a test set is that the model must never see it until the very last moment.
So if your model trained on some of the data in your test set, it becomes useless and the results it will gives you will have no meaning.
So basicly:
1.Train on your train set
2.Validate on your validation set
3.Repeat 1 and 2 until you are happy with the results
4.At the very end, finally test your model on the test set
Related
I've been trying to create a stateful LSTM model with keras, and I pretty much figured out the training part, but I don't get the predicting part.
So, let's imagine that we had 10000 time-series datapoints. we use 9000 in front for training, and the other 1000 for testing. So, as we start training, we set the window length to 2, and slide the window forward as we set the input(X) as the first datapoint and set the output(y) as the second datapoint.
And as we train, the model converges because of it's stateful nature. Finally we finish training.
Now, we are left with a model, and some test data. The problem begins here. We test the first datapoint.
It returns a guessed value. Nice.
We test the second datapoint of the test set.
We get an output. But, the problem is that because we were using a stateful model, and we only one value as an input, the only way the model is going to figure out the next value is from memory of the previous time-series.
But since we didn't train the data on the first datapoint of the test set, the time-series is broken, and the model will think that the second datapoint on the test set is the first datapoint on the test set!
So, my question is,
does keras take care of this and automaticaly train the network as it's predicting?
or do I have to train the net as I am predicting
or is there some other reason that enables me to just keep predicting without training the model farther?
For a stateful LSTM, if will retain information in its cells as you predict. If you were to take any random point in the train or test dataset and repeatedly predict on it, your answer will change each time, because it keeps seeing this data and uses it every time it predicts. The only way to get a repeatable answer would be to call reset_states().
You should be calling reset_states() after each training epoch, and when you save the model, those cells should be empty. Then if you want to start predicting on the test set, you can predict on the last n training points (without saving the values anywhere), then start saving values once you get to your first test point.
It is often good practice to seed the model before prediction. If I want to evaluate on test_set[10:20,:], I can let the model predict on test_set[:10,:] first to seed the model then start saving my predicted values once I get to the range I am interested in.
To address the further training question, you do not need to train the model further to predict. Training will only be for tuning the model's weights. Look into this blog for more information on Stateful vs Stateless LSTM.
What will happen if I use the same training data and validation data for my machine learning classifier?
If the train data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits. We take 60-70% of the training data to train the classifier, and then run the classifier against 30-40% of the data, the validation data which the classifier has not seen yet. This helps measure the accuracy of the classifier and its behavior, such as over fitting or under fitting, against a real test set with no labels.
We create multiple models and then use the validation to see which model performed the best. We also use the validation data to reduce the complexity of our model to the correct level. If you use train data as your validation data, you will achieve incredibly high levels of success (your misclassification rate or average square error will be tiny), but when you apply the model to real data that isn't from your train data, your model will do very poorly. This is called OVERFITTING to the train data.
Basically nothing happens. You are just trying to validate your model's performance on the same data it was trained on, which practically doesn't yield anything different or useful. It is like teaching someone to recognize an apple and asking them to recognize just the same apple and see how they performed.
Why a validation set is used then? To answer this in short, the train and validation sets are assumed to be generated from the same distribution and thus the model trained on training set should perform almost equally well on the examples from validation set that it has not seen before.
Generally, we divide the data to validation and training to prevent overfitting. To explain it, we can think a model that classifies that it is human or not and you have dataset contains 1000 human images. If you train your model with all your images in that dataset , and again validate it with again same data set your accuracy will be 99%. However, when you put another image from different dataset to be classified by the your model ,your accuracy will be much more lower than the first. Therefore, generalization of the model for this example is a training a model looking for a stickman to define basically it is human or not instead of looking for specific handsome blonde man. Therefore, we divide dataset into validation and training to generalize the model and prevent overfitting.
TLDR;
If you use the same dataset for training and validation then:
training_accuracy = testing_accuracy
Your testing_accuracy will be the same as training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell if your model has underfit or not.
Let's talk about datasets and evaluation metrics. Here is some terminology (reference) -
Datasets:
Training dataset: The data used to fit the model.
Validation dataset: the data used to validate the generalization ability of the model or for early stopping, during the training process. In most cases, this is the same as the test dataset
Evaluations:
Training accuracy: The accuracy you achieve when comparing predictions and actuals from the training data itself.
Testing accuracy: The accuracy you achieve when comparing predictions and actuals from the testing/validation data.
With the training_accuracy, you can get a sense of how well a model fits your data and the testing_accuracy tells you how well that model is generalizable. If train_accuracy is low, then your model has underfitted and you may need a better model (better features, different architecture, etc) for modeling the given problem. If training_accuracy is high but testing_accuracy is low, this means your model fits the data well, but it's not generalizable on unseen data. This is overfitting.
Note: In practice, it is better to have a overfit model and regularize it heavily rather than work with an underfit model.
Another important thing you need to understand that training a model (fit) and inference from a model (predict / score) are 2 separate tasks. Therefore, when you use the validation dataset as the training dataset, you are basically still training the model on the same training dataset but while inference, you are using the training dataset which will give you the same accuracy as the training_accuracy.
You will therefore not come to know if at all you overfit BUT that doesn't mean you will get 99% accuracy like the other answer to suggest! You may still underfit and get an extremely low model accuracy
I'm doing my best with creating model for imbalanced data using NN. I have a separate test set but I have a problem with validation data. Can I add validation data to train set after adjusting hyperparameters? Or it's better to leave this out and train final model only on the train data set? What do you think and what's your experience with this kind of data?
The validation dataset is only for keras to calculate you a score for each epoch. So your model is not affected by this dataset, but you will get better statistics.
That means: You can set the validatedata set after you adjusted the hyperparameters and if you don't want to, you don't have to set validation data.
I have trained an LSTM model for Time Series Forecasting. I have used an early stopping method with a patience of 150 epochs.
I have used a dropout of 0.2, and this is the plot of train and validation loss:
The early stopping method stop the training after 650 epochs, and save the best weight around epoch 460 where the validation loss was the best.
My question is :
Is it normal that the train loss is always above the validation loss?
I know that if it was the opposite(validation loss above the train) it would have been a sign of overfitting.
But what about this case?
EDIT:
My dataset is a Time Series with hourly temporal frequence. It is composed of 35000 instance. I have split the data into 80 % train and 20% validation but in temporal order. So for example the training will contain the data until the beginning of 2017 and the validation the data from 2017 until the end.
I have created this plot by averaging the data over 15 days and this is the result:
So maybe the reason is as you said that the validation data have an easier pattern. How can i solve this problem?
For most cases, the validation loss should be higher than the training loss because the labels in the training set are accessible to the model. In fact, one good habit to train a new network is to use a small subset of the data and see whether the training loss can converge to 0 (fully overfits the training set). If not, it means this model is somehow incompetent to memorize the data.
Let's go back to your problem. I think the observation that validation loss is less than training loss happens. But this possibly is not because of your model, but how you split the data. Consider that there are two types of patterns (A and B) in the dataset. If you split in a way that the training set contains both pattern A and pattern B, while the small validation set only contains pattern B. In this case, if B is easier to be recognized, then you might get a higher training loss.
In a more extreme example, pattern A is almost impossible to recognize but there are only 1% of them in the dataset. And the model can recognize all pattern B. If the validation set happens to have only pattern B, then the validation loss will be smaller.
As alex mentioned, using K-fold is a good solution to make sure every sample will be used as both validation and training data. Also, printing out the confusion matrix to make sure all labels are relatively balanced is another method to try.
Usually the opposite is true. But since you are using drop out,it is common to have the validation loss less than the training loss.And like others have suggested try k-fold cross validation
The code in this tensorflow tutorial uses this section of the code to calculate the validation accuracy right?
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
x={"x": eval_data},
y=eval_labels,
num_epochs=1,
shuffle=False)
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)
Question: So if I had to calculate the training set accuracy that is to see if my model is overfitting my training set data, if I changed the value of "x" to train_data and feed the training data for testing as well, Would it give me the training set accuracy?
If not, how do I check if my model is overfitting my dataset?
How does the number of steps affect the accuracy?
Like if I have trained it for 20000 steps and then if I train it for another 100. Why does it change the accuracy? Is it since the weights are being calculated all over again? Would it be advisable to do something like this then?
mnist_classifier.train(
input_fn=train_input_fn,
steps=20000,
hooks=[logging_hook])
Normally you have 3 datasets, 1 for training, 1 for validation and 1 for testing. All these datasets have to be unique, an image of the training set may not occur in the validation or test set, etc. You train with the training set and after each epoch, you validate the model with the validation data. The optimizers will always try to update the weights to perfectly classify the training data, the training accuracy will therefore get very high (>90). The validation data is data the model has never seen before, and its done after each epoch (or x amount of steps) to show how well the model reacts to data is hasn't seen before, this shows how well the model will improve overtime.
The more you train, the higher the training accuracy will become, since the optimizer will do its best to get that value to 100%. The validation data, that does not update the weights, also increases overtime, but not continuously. While the training accuracy keeps improving, the validation accuracy might stop improving. The moment the validation accuracy decreases over time, well then you're overfitting. This means that the model is focusing too much on the training data, and that it can't classify another character correctly if it differs from the training set.
At the end of all the training you use a test set, this will determine the actual accuracy of your model on new data.
#xmacz: I cannot add comments yet, only answers so I just update my answer. Yes, I checked the source code, your first lines of code tests the model on test data
The evaluate is just a function which does some numerical activities to the input data and produces some output. If you use it for testing data it should give the testing accuracy and if you input the training data it should output the training accuracy.
At the end of the day it is just mathematics. What the output is intuitively, is something that you would have to ascertain.
How to know whether your model is overfitting is something you do while training the model. You have to set apart another set called validation dataset which is different from the test and training sets. A typical split of datasets is 70%-20%-10% for training, testing and validating respectively.
During the training, every n steps you test your model on the validation dataset. During the first iterations the score on your validation set will get better but at some point it will get worse. You can use this information to stop your training when your model starts to overfit but doing it right is an art. You could for instance stop after 5 tests that your accuracy has been decreasing consecutively, because sometimes you can see that it gets worse but in the next test it gets better. It's hard to say, it depends on many factors.
Regarding to your second question, iterating another 100 steps could make your model better or worse, depending on whether it's overfitting or not, so I'm afraid that question doesn't have a clear answer. The weights will rarely stop changing because the iterations/steps are "moving" them, for good or for bad. Again, it's difficult to say how to get good results, but you could try early stopping using a validation set, as I've mentioned before.