I'm not an ML expert, but the usual flow I follow to train a machine learning model is, after data cleaning, to split the dataset into train and test sets using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X.values,
                                                    y.values,
                                                    test_size=0.30,
                                                    random_state=0)
Skipping over the whole model-building process: when you go to train (fit) the model after defining and compiling it, one option is to use the validation_split parameter like below
history = model.fit(x_train, y_train, epochs=10, validation_split=0.2)
This seems to split the training data yet again, holding out 20% of it to validate the model during training. If you had, let's say, 1000 data points (rows) in a dataset, the first code above would lead to
700 data points for training and 300 for testing,
and the second would then take 20% of those 700 for validation, leaving
560 data points for training and 140 for validation,
which leaves us with rather little data to train our model on.
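A quick sketch to double-check that arithmetic with dummy arrays (the data here is just a placeholder):

import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.random.rand(1000, 5)            # 1000 rows, 5 features
y_dummy = np.random.randint(0, 2, 1000)

x_tr, x_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, test_size=0.30, random_state=0)
print(len(x_tr), len(x_te))                          # 700 300
print(int(len(x_tr) * 0.8), int(len(x_tr) * 0.2))    # 560 140, i.e. what validation_split=0.2 leaves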
I recently encountered a method, though, where you use the test data for validation, like below:
history = model.fit(x_train, y_train, validation_data=(x_test, y_test))
My question is: what actually happens to the validation data after training completes? Is it automatically added back somehow to train our model, improving our accuracy at the end? Also, is using the test data for validation enough, and if we do that, will it have an effect when we later try to evaluate our model on that same test data?
As stated by Keras (the last model.fit call above, the one with validation_data, comes from Keras):
validation_split
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data
Data on which to evaluate the loss and any model metrics at the end of
each epoch. The model will not be trained on this data. This could be
a list (x_val, y_val) or a list (x_val, y_val, val_sample_weights).
validation_data will override validation_split.
So with this setup, the model will not train on the validation data.
In theory, validation data is used to evaluate your model and tune its hyperparameters.
If you need to put this model in production, you should retrain the model on all the data, knowing that its performance will be roughly what you obtained on the validation/test data.
In any case, the performance on validation/test data is often an optimistic estimate of the real-world performance.
source
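As a rough illustration of that workflow (a sketch only; build_model() is a hypothetical helper that returns a compiled Keras model):

import numpy as np

# 1) tune using a held-out validation set (here the test split doubles as validation)
model = build_model()
history = model.fit(x_train, y_train, epochs=10,
                    validation_data=(x_test, y_test), verbose=0)

# 2) once the hyperparameters are fixed, retrain on all available data for production
x_all = np.concatenate([x_train, x_test])
y_all = np.concatenate([y_train, y_test])
final_model = build_model()
final_model.fit(x_all, y_all, epochs=10, verbose=0)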
I know that to train your model and see its accuracy you must use something like this:
print("Fit model on training data")
history = model.fit(
X_train,
y_train,
batch_size=64,
epochs=2,
validation_data=valid_set,
)
However, I don't know how to pass in my own validation data set, since it has a different length than my training data set, and that has caused me many errors along the way.
I even pre-processed my own validation dataset with the same process I used for my base dataset.
Any tips on how to solve this?
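For reference, validation_data is passed as a separate (x_val, y_val) pair and does not need to have the same number of rows as the training data, only the same per-sample shapes. A minimal sketch with placeholder arrays, reusing the model defined above:

import numpy as np

x_train = np.random.rand(800, 10)            # placeholder: 800 training rows, 10 features
y_train = np.random.randint(0, 2, 800)
x_val = np.random.rand(150, 10)              # fewer rows than the training set is fine
y_val = np.random.randint(0, 2, 150)

history = model.fit(
    x_train,
    y_train,
    batch_size=64,
    epochs=2,
    validation_data=(x_val, y_val),          # only the per-sample shape must match
)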
I'm a self-taught Python user.
In the Python code below,
model.fit(x_train, y_train, verbose=1, validation_split=0.2, shuffle=True, epochs=20000)
Then 80% of the data is used for training and 20% for validation, and training is repeated for 20,000 epochs.
And,
shuffle=True
So I think this code performs cross-validation, or more specifically, k-fold cross-validation with k=5.
I was wondering if this is correct, because when I looked up how to do k-fold cross-validation in Keras, I found code that uses scikit-learn's KFold.
I apologize for the rudimentary nature of this question, but I would appreciate it if you could help me.
With validation_split, Keras sets the validation data aside once, taking the last 20% of the samples before shuffling; shuffle=True then only shuffles the remaining training data at each epoch.
For every following epoch, the train and validation sets are the ones defined at the start; the data is not shuffled and split again.
Therefore it is not cross-validation: you have a single fixed hold-out set, whereas k-fold cross-validation rotates the validation fold and trains a new model on each of the k splits.
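If you do want genuine k-fold cross-validation with a Keras model, one common pattern (a sketch only; build_model() is a hypothetical helper that returns a freshly compiled model, assumed to be compiled with metrics=['accuracy']) is to combine scikit-learn's KFold with a fresh model per fold:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_scores = []

for train_idx, val_idx in kf.split(x_train):
    model = build_model()                     # a new model for every fold
    model.fit(x_train[train_idx], y_train[train_idx], epochs=20, verbose=0)
    loss, acc = model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)
    val_scores.append(acc)

print(np.mean(val_scores))                    # average validation accuracy over the folds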
Here I have this piece of Python code, taken from SoloLearn:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))
My question then is, do I need to create a new model in every split?
Why don't we just create one LogisticRegression before the for?
I would put it before to save computation time, but since it has been presented this way I thought there was a reason.
Great question! The answer is...you don't have to create the model each time. Your intuition is correct. Feel free to move model = LogisticRegression() to the top, outside the loop, and re-run to confirm.
The model object that exists after model.fit(X_train, y_train) each time through the loop will be the same either way.
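A sketch of that variant (same X and y as in the question; a fixed random_state is added only so the folds are reproducible):

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()                 # created once, outside the loop
scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])                # refits from scratch each time
    scores.append(model.score(X[test_index], y[test_index]))
print(scores)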
Short answer is yes.
The reason is that this is k-fold cross-validation.
Simply put, this means that you train k models, evaluate each of them, and average the results together.
We do this when we do not have separate datasets for training and testing. Cross-validation splits the training data into k subgroups, each of which gets its own train/test split (we call these folds). We train a model on the training portion of the first fold and test it on that fold's test portion, then repeat for all folds with a new model each time, ending up with out-of-sample predictions for the full dataset.
Here is a link to a detailed description of cross validation - https://machinelearningmastery.com/k-fold-cross-validation/
KFold is used for cross-validation, that is, training a model and evaluating it.
Here is an example of documentation on the subject.
When doing that you obviously need two datasets: a training AND an evaluation data set.
When using KFold, you split your training set into a number of folds (5 in your example) and fit five models, each time using one fold as the validation set and the remaining folds as the training set.
Now, to answer the question: you end up with five separate models, each trained on a different training set and evaluated on a different validation set. Creating a fresh estimator for each fold keeps those models independent; in scikit-learn, calling model.fit() again on the same object simply refits it from scratch and overwrites the previous fold's model.
If you want to configure it only once, you can clone it inside the loop, for example:
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(**params)

def parse_kfold(model):
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        model_fold = clone(model)   # unfitted copy with the same parameters
        ...
I split my data into training and test samples (70/30) for a regression/forecasting problem (MLP, LSTM, etc.).
Within the code:
history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_data=(X_test, y_test), verbose=0, shuffle=False)
I put my test data in as the validation set and did a couple of weeks' worth of predictions. So I did not hold back the test data...
But now that I think about it, I guess it was wrong to put the test data into the fit function, or was it OK?
NEVER EVER use your test data as part of training or validation. The test set should only be used for inference after training. So yes, it's wrong to use your test data in the fit function; it should only appear in model.predict(x_test) or model.evaluate(x_test, y_test).
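A sketch of the split this implies, assuming X and y hold the full feature matrix and targets and model is the compiled network from the question:

from sklearn.model_selection import train_test_split

# hold out the test set first (70/30 as before), then carve a validation set out of the 70%
x_temp, x_test, y_temp, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_temp, y_temp, test_size=0.20, random_state=0)

history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_val, y_val), verbose=0, shuffle=False)

# the test set is touched only once, at the very end
test_loss = model.evaluate(x_test, y_test, verbose=0)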
I'm trying to understand how to use k-fold cross-validation from the sklearn Python module.
I understand the basic flow:
instantiate a model e.g. model = LogisticRegression()
fitting the model e.g. model.fit(xtrain, ytrain)
predicting e.g. model.predict(xtest)
use e.g. cross_val_score to test the fitted model's accuracy.
Where I'm confused is using sklearn's KFold with cross_val_score. As I understand it, the cross_val_score function will fit the model and predict on the k folds, giving you an accuracy score for each fold.
e.g. using code like this:
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = linear_model.LogisticRegression()
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)
So if I have a dataset with training and testing data, and I use the cross_val_score function with KFold to determine the accuracy of the algorithm on my training data for each fold, is the model now fitted and ready for prediction on the testing data?
So, in the case above, could I just use lr.predict?
No, the model is not fitted. Looking at the source code for cross_val_score:
scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
                                          train, test, verbose, None,
                                          fit_params)
                  for train, test in cv)
As you can see, cross_val_score clones the estimator before fitting it on each fold's training data. cross_val_score outputs an array of scores, which you can analyse to see how the estimator performs on the different folds and whether it overfits the data. You can read more about it here.
You need to fit the estimator on the whole training data once you are satisfied with the results of cross_val_score, before you can use it to predict on the test data.
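Putting that together, a minimal sketch of the full flow (variable names follow the question; X_test is assumed to be the held-out test features):

from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = linear_model.LogisticRegression()

# cross-validation only estimates performance; it leaves lr unfitted
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)
print(accuracies.mean())

# fit on the full training data, then predict on the held-out test set
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)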