I'm currently creating a model, and while building it I ran into some questions. Does training the same model with the same data multiple times lead to better precision for those objects, since you are training it each time? And what could be the issue when an object sometimes gets 90% precision, but when I re-run training it gets lower precision or even fails to predict the right object? Is it because TensorFlow is running on the GPU?
I will guess that you are doing image recognition and that you want to identify images (objects) using a neural network built with Keras. You should train it once, but during training you will run several epochs, meaning the algorithm adapts the weights over several rounds (epochs). In each round it goes over all training images. Once trained, you can use the model to identify images/objects.
You can evaluate the accuracy of your trained model on the same training set, but it is better to use a different set (see train_test_split from sklearn, for instance).
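As an illustration (assuming an already compiled Keras model stored in model with an accuracy metric, and NumPy arrays X and y), the split-and-evaluate workflow could look roughly like this:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the data for evaluation; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model.fit(X_train, y_train, epochs=20, batch_size=32)

# Accuracy on unseen data is a much better estimate of real-world performance
# than accuracy on the training set itself.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```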
Training is a stochastic process, meaning that every time you train your network the result will be slightly different. Hence, you will get different accuracies. The stochasticity comes, for instance, from the random initial weights or from using stochastic gradient descent methods.
The question does not appear to have anything to do with Keras or TensorFlow, but with a basic understanding of how neural networks work. There is no connection to running TensorFlow on the GPU. You will also not get better precision by training on the same objects again. If you train your model on a dataset for a very long time (many epochs), you might run into overfitting: the accuracy of your model on the training dataset will be very high, but the model will have low accuracy on other datasets.
A common technique is to split your data into training and validation datasets, then train your model with EarlyStopping. This trains on the training dataset, calculates the loss against the validation dataset after each epoch, and keeps training until no further improvement is seen. You can set a patience parameter to wait X epochs without improvement before stopping training (and optionally save the best model).
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
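A minimal sketch of that pattern, assuming a compiled Keras model and arrays X_train, y_train, X_val, y_val (the patience value is just an example):

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop when the validation loss has not improved for 10 consecutive epochs.
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    # Optionally keep the best model seen so far on disk.
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=1000,          # an upper bound; EarlyStopping usually ends training earlier
    callbacks=callbacks,
)
```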
Another trick is image augmentation with ImageDataGenerator, which generates synthetic data for you (rotations, shifts, mirror images, brightness adjustments, noise, etc.). This can effectively increase the amount of data you have to train with, thus reducing overfitting.
https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
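A hedged example of such a generator (the specific ranges are illustrative, not recommendations), assuming a compiled Keras model plus image arrays X_train and labels y_train:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each batch is randomly transformed on the fly, so the network rarely
# sees exactly the same image twice.
datagen = ImageDataGenerator(
    rotation_range=20,           # random rotations up to 20 degrees
    width_shift_range=0.1,       # horizontal shifts up to 10% of the width
    height_shift_range=0.1,      # vertical shifts up to 10% of the height
    horizontal_flip=True,        # mirror images
    brightness_range=(0.8, 1.2)  # random brightness adjustments
)

model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=50)
```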
Related
I am using TensorFlow to train my model and I am routinely saving it every 10 epochs. I have a limited number of samples to train on, so I am augmenting my dataset to create a larger training set.
If I need to use my saved model to resume training after a power outage would it be best to resume training using the same dataset or to make a new dataset?
Your question very much depends on how you're augmenting your dataset. If your augmentation skews the statistical distribution of the underlying dataset then you should resume training with the pre-power outage dataset. Otherwise, you're assuming that your augmentation has not changed the distribution of the dataset.
It is fairly safe to assume (as long as your augmentations do not change the data in an extremely significant way) that you can resume training on either a new dataset or the old one without a significant change in accuracy.
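For the checkpointing part of the question, a rough sketch of saving and resuming in Keras (the file names, train_ds, and the epoch numbers are illustrative assumptions):

```python
import os
import tensorflow as tf

os.makedirs("checkpoints", exist_ok=True)

# Save a full checkpoint (architecture, weights and optimizer state) after every epoch.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/epoch_{epoch:03d}.keras", save_freq="epoch")

model.fit(train_ds, epochs=100, callbacks=[ckpt_cb])

# After a power outage, reload the latest checkpoint and continue where you left off;
# the optimizer state is restored along with the weights.
model = tf.keras.models.load_model("checkpoints/epoch_040.keras")
model.fit(train_ds, initial_epoch=40, epochs=100, callbacks=[ckpt_cb])
```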
What will happen if I use the same training data and validation data for my machine learning classifier?
If the training data and the validation data are the same, the trained classifier will have a high accuracy, because it has already seen the data. That is why we use train-test splits. We take 60-70% of the data to train the classifier, and then run the classifier against the remaining 30-40%, the validation data, which the classifier has not seen yet. This helps measure the accuracy of the classifier and diagnose its behavior, such as overfitting or underfitting, before it faces a real test set with no labels.
We create multiple models and then use the validation data to see which model performs best. We also use the validation data to reduce the complexity of our model to the correct level. If you use your training data as your validation data, you will achieve incredibly high levels of success (your misclassification rate or average squared error will be tiny), but when you apply the model to real data that isn't from your training set, your model will do very poorly. This is called OVERFITTING to the training data.
Basically nothing happens. You are just validating your model's performance on the same data it was trained on, which practically doesn't yield anything different or useful. It is like teaching someone to recognize an apple and then asking them to recognize that very same apple to see how well they learned.
Why is a validation set used, then? In short, the training and validation sets are assumed to be generated from the same distribution, so a model trained on the training set should perform almost equally well on examples from the validation set that it has not seen before.
Generally, we split the data into training and validation sets to prevent overfitting. To illustrate, imagine a model that classifies whether an image shows a human or not, and you have a dataset containing 1000 human images. If you train your model with all the images in that dataset and then validate it against the very same dataset, your accuracy will be around 99%. However, when you give your model an image from a different dataset to classify, your accuracy will be much lower. Generalization in this example means training a model that looks for a stick figure to decide whether something is human, instead of looking for one specific handsome blond man. That is why we divide the dataset into training and validation sets: to generalize the model and prevent overfitting.
TLDR;
If you use the same dataset for training and validation then:
training_accuracy = testing_accuracy
Your testing_accuracy will be the same as your training_accuracy if you use the training dataset as the validation dataset. Therefore you will NOT be able to tell whether your model has overfit or not.
Let's talk about datasets and evaluation metrics. Here is some terminology (reference) -
Datasets:
Training dataset: The data used to fit the model.
Validation dataset: The data used to validate the generalization ability of the model or for early stopping during the training process. In most cases, this is the same as the test dataset.
Evaluations:
Training accuracy: The accuracy you achieve when comparing predictions and actuals from the training data itself.
Testing accuracy: The accuracy you achieve when comparing predictions and actuals from the testing/validation data.
With the training_accuracy, you can get a sense of how well a model fits your data, and the testing_accuracy tells you how well that model generalizes. If training_accuracy is low, your model has underfit and you may need a better model (better features, a different architecture, etc.) for the given problem. If training_accuracy is high but testing_accuracy is low, your model fits the data well but is not generalizable to unseen data. This is overfitting.
Note: In practice, it is better to have an overfit model and regularize it heavily than to work with an underfit model.
Another important thing to understand is that training a model (fit) and inference from a model (predict/score) are two separate tasks. Therefore, when you use the training dataset as the validation dataset, you are still training the model on that same training dataset, but at inference time you are evaluating on the training data, which gives you the same accuracy as the training_accuracy.
You will therefore not come to know whether you have overfit at all, BUT that doesn't mean you will get 99% accuracy as the other answer suggests! You may still underfit and get an extremely low model accuracy.
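To make this concrete, a small sketch (assuming a compiled Keras model with an accuracy metric, plus arrays X_train, y_train, X_test, y_test):

```python
# Validation set = training set: val_accuracy essentially mirrors the training
# accuracy, so it cannot reveal overfitting (it can still be low if the model underfits).
model.fit(X_train, y_train, validation_data=(X_train, y_train), epochs=20)

train_loss, train_acc = model.evaluate(X_train, y_train)
test_loss, test_acc = model.evaluate(X_test, y_test)

# A large gap between train_acc and test_acc indicates overfitting;
# a low train_acc indicates underfitting.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```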
I have training data with 3961 rows and 32 columns that I want to fit with a Random Forest and a Gradient Boosting model. While training, I need to fine-tune the hyperparameters of the models to get the best AUC possible. To do so, I minimize the quantity 1-AUC(Y_real, Y_pred) using the Basin-Hopping algorithm from SciPy, so my training and internal validation subsamples are the same.
When the optimization is finished, I get AUC=0.994 for the Random Forest, while for the Gradient Boosting I get AUC=1. Am I overfitting these models? How can I tell whether overfitting is taking place during training?
To know whether you are overfitting, you have to compute:
Training set accuracy (or 1-AUC in your case)
Test set accuracy (or 1-AUC in your case); you can use a validation dataset if you have one
Once you have calculated these scores, compare them. If the training set score is much better than the test set score, then you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
To know whether you are overfitting, you always need to do this comparison. However, if your training accuracy or score is suspiciously perfect (e.g. an accuracy of 100%), that alone is a strong hint that you are overfitting.
So, if you don't have separate training and test data, create them using sklearn.model_selection.train_test_split. Then you will be able to compare both accuracies. Otherwise, you won't be able to know, with confidence, whether you are overfitting or not.
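A hedged sketch of that comparison for the models in the question, using AUC (the variables X and y are assumed to hold the 3961 x 32 features and the binary labels):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows so train and test scores can be compared.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (RandomForestClassifier(), GradientBoostingClassifier()):
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # A training AUC near 1.0 with a clearly lower test AUC is the signature of overfitting.
    print(type(model).__name__, f"train AUC={train_auc:.3f}", f"test AUC={test_auc:.3f}")
```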
Today I was working on a classifier to detect whether or not a mushroom was poisonous given its features. The data was in a .csv file (read into a pandas DataFrame), and the link to the data can be found at the end.
I used scikit-learn's train_test_split function to split the data into training and testing sets.
I then removed the column that specified whether or not the mushroom was poisonous and used it as the training and testing labels, assigning it to yTrain and yTest variables.
I then applied one-hot encoding (using pd.get_dummies()) to the data, since the features were categorical.
After this, I normalized the training and testing input data.
Essentially, the training and testing input data were distinct lists of one-hot-encoded features, and the output data was a list of ones and zeroes representing the label (one meant poisonous, zero meant edible).
I used Keras and a simple feed-forward network for this project. The network consists of three layers: a Dense layer (a linear layer, for PyTorch users) with 300 neurons, a Dense layer with 100 neurons, and a Dense layer with two neurons, each representing the probability that the given mushroom features signify poisonous or edible. Adam was the optimizer I used, and sparse categorical crossentropy was my loss function.
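For reference, a rough sketch of an architecture matching this description (the ReLU activations, the softmax output and the X_train placeholder are assumptions, since the question does not state them):

```python
from tensorflow.keras import layers, models

n_features = X_train.shape[1]  # number of one-hot-encoded input columns

model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(300, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(2, activation="softmax"),  # probabilities for edible vs poisonous
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```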
I trained my network for 60 epochs. After about 5 epochs the loss was basically zero and my accuracy was 1. After training, I was worried that my network had overfitted, so I tried it on my separate testing data. The results were the same as for the training and validation data: the accuracy was 100% and the loss was negligible.
My validation loss at the end of 50 epochs is 2.258996e-07, and my training loss is 1.998715e-07. My testing loss was 4.732502e-09. I am really confused by this; is the loss supposed to be this low? I don't think I am overfitting, and my validation loss is only a bit higher than my training loss, so I don't think I am underfitting either.
Do any of you know the answer to this question? I am sorry if I have messed up in some silly way.
Link to dataset: https://www.kaggle.com/uciml/mushroom-classification
It seems that this Kaggle dataset is solvable, in the sense that you can create a model which gives the correct answer 100% of the time (if these results are to be believed). If you look at those results, you can see that the author was actually able to find models which give 100% accuracy using several methods, including decision trees.
Let's say I have a training sample (with its corresponding training labels) for a defined neural network (the architecture of the neural network does not matter for answering this question). Let's call the neural network 'model'.
To avoid any misunderstanding, let's say that I supply the initial weights and biases for 'model'.
Experiment 1.
I use the training sample and the training labels to train 'model' for 40 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB_Final_experiment1.
Experiment 2
I use the training sample and the training labels to train 'model' for 20 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB_Intermediate.
Now I load WB_Intermediate into 'model' and train for another 20 epochs. After the training, the neural network will have a specific set of weights and biases for the entire network; let's call it WB__Final_experiment2.
Considerations: every single parameter, hyperparameter, activation function, loss function, etc. is exactly the same for both experiments, except for the epochs.
Question: Are WB_Final_experiment1 and WB__Final_experiment2 exactly the same?
If you follow this tutorial here, you will find the results of the two experiments given below:
(figures: results of Experiment 1 and Experiment 2)
In the first experiment the model ran for 4 epochs; in the second experiment, the model ran for 2 epochs and was then trained for 2 more epochs starting from the last weights of the previous run. You will find that the results vary, but only by a very small amount, and they will always vary due to the randomized initialization of the weights. However, the predictions of both models will lie very close to each other.
If the models are initialized with the same weights, then the results at the end of 4 epochs will be the same for both models.
On the other hand, if you train for 2 epochs, shut down your training session without saving the weights, and then train for 2 more epochs after restarting the session, the predictions won't be the same. To avoid that, always load the saved weights before continuing training, using model.load_weights("path to model").
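As a rough sketch of the two experiments (assuming a helper build_model() that always creates the model with identical initial weights, plus arrays X_train and y_train; all names are illustrative):

```python
# Experiment 1: train for 40 epochs in one go.
model_1 = build_model()
model_1.fit(X_train, y_train, epochs=40, shuffle=False)

# Experiment 2: train for 20 epochs, save the weights, reload them, train 20 more.
model_2 = build_model()
model_2.fit(X_train, y_train, epochs=20, shuffle=False)
model_2.save_weights("wb_intermediate.weights.h5")

model_2.load_weights("wb_intermediate.weights.h5")
model_2.fit(X_train, y_train, epochs=20, shuffle=False)

# With identical initial weights, identical data ordering and fully deterministic
# operations, the final weights of model_1 and model_2 should be very close; any
# remaining difference comes from non-deterministic ops (e.g. on the GPU) or from
# optimizer state (e.g. Adam moments) that save_weights does not preserve.
```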
TL;DR
If the models are initialized with the exact same weights, then the output at the end of the same number of training epochs will be the same. If they are randomly initialized, the output will only vary slightly.
If the operations you are doing are entirely deterministic, then yes. Epochs are implemented as an iteration counter in a for loop around your training algorithm; you can see this in PyTorch implementations.
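For illustration, a skeletal PyTorch training loop (with model, loader, criterion and num_epochs as placeholders) showing that an epoch is just one pass of the outer loop:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(num_epochs):      # "epochs" = outer loop counter
    for inputs, targets in loader:   # one full pass over the training data
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```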
Typically no, the model weights will not be the same, as the optimiser accrues its own state (for example, momentum buffers) during training. You will need to save that state too to truly resume from where you left off.
See the PyTorch documentation regarding saving and resuming here. This concept is not limited to the PyTorch framework.
Specifically:
It is important to also save the optimizer’s state_dict, as this contains buffers and parameters that are updated as the model trains.
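A minimal sketch of that pattern, following the PyTorch documentation (the file name and the epoch bookkeeping are illustrative):

```python
import torch

# Save model weights *and* optimizer state (e.g. Adam moments) together.
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "checkpoint.pt")

# Later, to resume training exactly where you left off:
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
model.train()
```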