Using the validation set to train the model after adjusting hyperparameters - python

I'm doing my best to build a model for imbalanced data using a neural network. I have a separate test set, but I have a question about the validation data. Can I add the validation data to the training set after adjusting the hyperparameters, or is it better to leave it out and train the final model only on the training set? What do you think, and what is your experience with this kind of data?

In Keras, the validation dataset is only used to compute a score for each epoch. Your model's weights are not updated from this data; you just get better statistics during training.
That means you can set the validation data after you have adjusted the hyperparameters, and if you don't want to, you don't have to set validation data at all.
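A rough sketch of both options in Keras (the model, the data shapes, and the variable names below are made-up stand-ins, not code from the question):

```python
# Rough sketch, not code from the question: dummy stand-ins for an
# imbalanced binary classification problem.
import numpy as np
import tensorflow as tf

x_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)
x_val,   y_val   = np.random.rand(200, 20), np.random.randint(0, 2, 200)

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# While tuning hyperparameters: validation_data only produces per-epoch
# scores; it is never used to update the weights.
model = build_model()
model.fit(x_train, y_train, epochs=20, batch_size=64,
          validation_data=(x_val, y_val))

# Once the hyperparameters are fixed: optionally refit a fresh model on
# train + validation and keep the separate test set for the final check.
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train, y_val])
final_model = build_model()
final_model.fit(x_full, y_full, epochs=20, batch_size=64)
```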

Related

Train and validation data structure

What will happen if I use the same training data and validation data for my machine learning classifier?
If the training data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits. We take 60-70% of the data to train the classifier, and then run the classifier against the remaining 30-40%, the validation data, which the classifier has not seen yet. This helps measure the accuracy of the classifier and its behavior, such as overfitting or underfitting, before it faces a real test set with no labels.
We create multiple models and then use the validation data to see which model performs best. We also use the validation data to reduce the complexity of our model to the correct level. If you use your training data as your validation data, you will achieve incredibly high levels of success (your misclassification rate or average squared error will be tiny), but when you apply the model to real data that isn't from your training data, your model will do very poorly. This is called OVERFITTING to the training data.
Basically nothing happens. You are just trying to validate your model's performance on the same data it was trained on, which practically doesn't tell you anything new or useful. It is like teaching someone to recognize an apple and then asking them to recognize that very same apple to see how well they learned.
Why a validation set is used then? To answer this in short, the train and validation sets are assumed to be generated from the same distribution and thus the model trained on training set should perform almost equally well on the examples from validation set that it has not seen before.
Generally, we split the data into training and validation sets to prevent overfitting. To illustrate, imagine a model that classifies whether an image contains a human, and a dataset of 1000 human images. If you train your model on all the images in that dataset and then validate it on the same dataset, your accuracy will be around 99%. However, when you give the model an image from a different dataset to classify, the accuracy will be much lower. For this example, generalizing means training a model that looks for a stick-figure shape to decide whether something is human, rather than looking for one specific person. That is why we split the dataset into training and validation sets: to generalize the model and prevent overfitting.
TLDR;
If you use the same dataset for training and validation then:
training_accuracy = testing_accuracy
Your testing_accuracy will be the same as your training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell whether your model has overfit.
Let's talk about datasets and evaluation metrics. Here is some terminology (reference) -
Datasets:
Training dataset: The data used to fit the model.
Validation dataset: The data used to validate the generalization ability of the model, or for early stopping, during the training process. In many cases this is the same dataset as the test dataset.
Evaluations:
Training accuracy: The accuracy you achieve when comparing predictions and actuals from the training data itself.
Testing accuracy: The accuracy you achieve when comparing predictions and actuals from the testing/validation data.
With the training_accuracy, you can get a sense of how well a model fits your data, and the testing_accuracy tells you how well that model generalizes. If training_accuracy is low, your model has underfit and you may need a better model (better features, a different architecture, etc.) for the given problem. If training_accuracy is high but testing_accuracy is low, your model fits the data well but does not generalize to unseen data. This is overfitting.
Note: In practice, it is better to have an overfit model and regularize it heavily than to work with an underfit model.
Another important thing to understand is that training a model (fit) and running inference with a model (predict / score) are two separate tasks. Therefore, when the validation dataset is the same as the training dataset, you are still training the model on that training data, and at inference time you are scoring it on the very same data, which gives you exactly the training_accuracy.
You will therefore not know whether you have overfit at all, BUT that doesn't mean you will get 99% accuracy as another answer suggests! You may still underfit and get an extremely low accuracy.
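A small scikit-learn sketch of the point above (synthetic data and an unpruned decision tree, chosen purely for illustration): scoring on the training data just reproduces training_accuracy and hides overfitting, while a held-out split exposes it.

```python
# Synthetic-data sketch: why scoring on the training data cannot reveal overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# An unpruned tree memorizes the training data.
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

print("training accuracy:", tree.score(X_train, y_train))  # ~1.0
print("testing accuracy: ", tree.score(X_test, y_test))    # noticeably lower -> overfitting

# "Validating" on the training data would just reproduce the first number,
# so the gap (the overfitting signal) would be invisible. An underfit model,
# by contrast, already shows up as a low training accuracy.
```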

Conv net: saving weights and a new test set

I'm using a conv net for image classification.
There is something I don't understand theoretically.
For training I split my data 60% train / 20% validation / 20% test.
I save the weights when the metric on the validation set is at its best (I get similar performance on the training and validation sets).
Now I make a new split, so some data from the old training set ends up in the new test set. I load the saved weights and classify this new test set.
Since the weights were computed on part of the new test set, do we agree that this is a bad procedure and that I should retrain my model with my new training/validation split?
Yes. For a fair evaluation, no sample in the test set should be seen during training.
The whole purpose of having a test set is that the model must never see it until the very last moment.
So if your model was trained on some of the data in your test set, that test set becomes useless and the results it gives you will have no meaning.
So basically:
1. Train on your training set.
2. Validate on your validation set.
3. Repeat 1 and 2 until you are happy with the results.
4. At the very end, finally test your model on the test set.
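A sketch of that workflow in Keras, using a ModelCheckpoint callback to save the best validation weights; the data, architecture, and file name are illustrative stand-ins, not the asker's actual setup:

```python
# Illustrative stand-ins for a 60/20/20 split; not the asker's actual data or model.
import numpy as np
import tensorflow as tf

x_train, y_train = np.random.rand(600, 64), np.random.randint(0, 2, 600)
x_val,   y_val   = np.random.rand(200, 64), np.random.randint(0, 2, 200)
x_test,  y_test  = np.random.rand(200, 64), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Save the weights only when the validation metric improves (steps 1-3).
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_weights.h5", monitor="val_accuracy",
    save_best_only=True, save_weights_only=True)

model.fit(x_train, y_train, epochs=20,
          validation_data=(x_val, y_val), callbacks=[checkpoint])

# Step 4: the test set is touched exactly once, at the very end,
# with the best validation weights loaded.
model.load_weights("best_weights.h5")
test_loss, test_acc = model.evaluate(x_test, y_test)
```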

Train and Export Model from Kfold Cross Validation with scikit learn and tensorflow

I have a small dataset and therefore want to use cross-validation to get more reliable results.
I use a canned tf.estimator (tf.estimator.DNNRegressor).
My question is now:
If I train the model for each split from KFold, do I have to call shutil.rmtree(model_dir, ignore_errors=True) every time to reset the training state, or is that done automatically with every train command?
If I do reset it, how can I export the optimal model from all the splits afterwards?
Or do I not have to reset training at all, and instead just keep training with the new data from each split?
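One way the per-fold loop being asked about could look (a rough sketch only; the data, feature column, and hyperparameters are made up, and using a fresh model_dir per fold is just one way to start each fold from scratch):

```python
# Rough sketch only: the data, feature column, and hyperparameters are made up.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

X = np.random.rand(500, 10).astype(np.float32)
y = np.random.rand(500).astype(np.float32)

feature_columns = [tf.feature_column.numeric_column("x", shape=(10,))]

def input_fn(features, labels, training=True):
    ds = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
    if training:
        ds = ds.shuffle(1000).repeat()
    return ds.batch(32)

best_loss, best_dir = float("inf"), None
for fold, (tr, va) in enumerate(KFold(n_splits=5, shuffle=True,
                                      random_state=0).split(X)):
    # Giving each fold its own model_dir starts that estimator from scratch,
    # so nothing needs to be deleted with shutil.rmtree between folds.
    model_dir = f"model_fold_{fold}"
    est = tf.estimator.DNNRegressor(hidden_units=[32, 16],
                                    feature_columns=feature_columns,
                                    model_dir=model_dir)
    est.train(lambda: input_fn(X[tr], y[tr]), steps=200)
    metrics = est.evaluate(lambda: input_fn(X[va], y[va], training=False))
    if metrics["average_loss"] < best_loss:
        best_loss, best_dir = metrics["average_loss"], model_dir

# best_dir now holds the checkpoints of the best-scoring fold, which can be
# reloaded by pointing a new DNNRegressor at the same model_dir.
```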

I need help understanding train_test_split?

from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data at all? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test? That doesn't make sense to me, since the data already has the answers filled in. If it's already filled in, then what are we predicting? I need help!
First, why do we split the data?
We split the data to set aside a portion of it for validation purposes. We use the remaining portion to fit the algorithm and then test the algorithm against the portion we set aside.
Where do we get the testing data from?
The testing data is actually part of your original dataset.
Do we just chop the data in half?
We usually chop off 20-40% of the data for testing.
If it is filled, then what are we predicting now?
You are not actually trying to predict the result directly. You are training the algorithm to fit the training set and then using the testing set to see how accurate the algorithm is.
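A minimal illustration of that split with scikit-learn (the iris data, the 80/20 ratio, and the logistic regression model are arbitrary choices for the example):

```python
# Illustrative split: 80% of the labelled data trains the model,
# the held-back 20% is used only to measure accuracy on unseen rows.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen test data:", model.score(X_test, y_test))
```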
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% could be used for testing. The data is split so that there is data for the model to be evaluated on to see how well the model performs on unseen data.
Training: This data is used to build your model, e.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real-world situation. It should be left completely unseen until you want to evaluate the model's performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree), the training data should itself be split into training and validation portions. K-Folds Cross Validation can be used here: the model is trained on the training folds and evaluated on the validation fold, and this is repeated across the folds.
The results (e.g. MSE, F1, etc.) can then be evaluated on each fold and used to tune the hyperparameters. By using cross-validation to tune the hyperparameters, you ensure that the model is not overfitting to the test data.
Once the model is tuned, it can then be applied to the test data.
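A sketch of that tune-then-test workflow with scikit-learn (the dataset, the max_depth grid, and the 80/20 split are illustrative assumptions):

```python
# Tune on the training data with K-Folds cross-validation, then evaluate once
# on the untouched test set. The max_depth grid is an arbitrary example.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# 5-fold cross-validation is run entirely inside the training data.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, 10, None]},
                      cv=5)
search.fit(X_train, y_train)

print("best max_depth:", search.best_params_["max_depth"])
print("test accuracy: ", search.score(X_test, y_test))  # touched only once
```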

Tensorflow Object Detection API validation vs test set

I recently started looking into the Tensorflow Object Detection API and have a question about the validation set:
Is the validation set used at all during model training?
For instance, are the weights of the model selected based on the accuracy on the validation set?
I am trying to figure out whether I need an independent test set (different from the evaluation set) to get unbiased results on the model performance, or whether I can use the validation set for that.
Thank you!
The validation dataset (the test.record) is not used in the training.
It is always better to have a validation dataset, for example to detect overfitting.
