Is validation_split=0.2 in Keras a cross-validation? - python

I'm a self-taught Python user.
In my Python code,
model.fit(x_train, y_train, verbose=1, validation_split=0.2, shuffle=True, epochs=20000)
Then 80% of the data is used for training, 20% is used for validation, and training runs for 20,000 epochs.
And,
shuffle=True
So I think this code performs cross-validation, or more specifically, k-fold cross-validation with k=5.
I was wondering if this is correct, because when I looked up Keras code for k-fold cross-validation, I found examples that use scikit-learn's KFold.
I apologize for the rudimentary nature of this question, but I would appreciate it if you could help me.

Keras sets aside the last 20% of the samples as the validation set; this split happens before any shuffling, and shuffle=True only shuffles the remaining training data at each epoch.
For every subsequent epoch, the train and validation sets defined at the start are reused; the data is never re-split.
Therefore, it is not cross-validation; it is a single hold-out split. In k-fold cross-validation the model is trained k separate times, each time holding out a different fold for validation, which is why the examples you found use scikit-learn's KFold.
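For comparison, here is a minimal sketch of what actual 5-fold cross-validation with scikit-learn's KFold could look like. The build_model() helper is a hypothetical function standing in for whatever defines and compiles your Keras model, x_train/y_train are the arrays from the question, and the model is assumed to be compiled with an accuracy metric:
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(x_train):
    model = build_model()  # hypothetical: returns a freshly compiled Keras model
    model.fit(x_train[train_idx], y_train[train_idx], epochs=20, verbose=0)
    loss, acc = model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)
    scores.append(acc)
print("Mean cross-validation accuracy:", np.mean(scores))
Note that the model is re-initialized for every fold and every sample ends up in the validation set exactly once, which is what distinguishes this from a single validation_split.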

Related

How to use my own validation data set in TensorFlow for a prediction model?

I know that to train your model and see the accuracy you must use something like this:
print("Fit model on training data")
history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=2,
    validation_data=valid_set,
)
However, I don't know how to pass my own validation data set, since it has a different length than my training data set, and this has caused me many errors along the way.
I even pre-processed my own validation dataset with the same process I used for my base dataset.
Any tips on how to solve this?
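A validation set passed through validation_data does not have to contain the same number of samples as the training set; only the shape of the individual samples and labels must match. A minimal sketch, assuming the separately pre-processed arrays are named X_valid and y_valid (hypothetical names):
print("Fit model on training data")
history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=2,
    # an (x, y) tuple of NumPy arrays; its number of rows can differ from X_train
    validation_data=(X_valid, y_valid),
)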

In Keras, is doing fit() over many single datapoints the same as doing fit() over a dataset?

In Keras, is doing fit() over many single datapoints the same as doing fit() over a dataset? For example, is doing a single
model.fit(train_X,
          train_y,
          batch_size=1,
          epochs=1)
The same as doing
for i in range(len(train_X)):
    model.fit([train_X[i]],
              [train_y[i]],
              batch_size=1,
              epochs=1)
Or is it different?
The difference is that model.fit() on the whole dataset will shuffle the samples each epoch, which can help with the learning process. If you do it the looped way, the weights will be updated against the same progression of samples every time.
model.fit will update the weights after every batch, in your case after every sample. So aside from the shuffling, the two methods you proposed are the same.
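If you want the two approaches to match exactly, the per-epoch shuffling can be turned off. A minimal sketch, assuming train_X and train_y are NumPy arrays:
# Single call with shuffling disabled: samples are seen in their original order
model.fit(train_X, train_y, batch_size=1, epochs=1, shuffle=False)

# Alternatively, looping over individual samples produces essentially the same
# sequence of per-sample weight updates (slicing with i:i+1 keeps the batch dimension)
for i in range(len(train_X)):
    model.fit(train_X[i:i+1], train_y[i:i+1], batch_size=1, epochs=1, verbose=0)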

What happens to validation data after training completes?

I'm not an ML expert, but the normal flow I follow to train a machine learning model is, after data cleaning, to split the dataset into train and test sets using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X.values,
                                                    y.values,
                                                    test_size=0.30,
                                                    random_state=0)
Skipping over the whole model-building process: when you go to train the model (fit it) after defining and compiling it, you use the validation_split parameter like below:
history = model.fit(X_train, y_train, epochs=10,validation_split=0.2)
This seems to again divide the training data, setting aside 20% of the data points to validate our model during training. If you had, say, 1000 data points (rows) in a dataset, the first code above will lead to
700 data points for training and 300 for testing;
the second will again set aside 20% of those 700 for validation, leaving
560 data points for training and 140 for validation,
which leaves us with little data to train our model with.
I recently encountered a method, though, where you use the test data for validation, like below:
history = model.fit(x_train, y_train, validation_data=(x_test, y_test))
My question is: what actually happens to the validation data after training completes? Is it somehow automatically added back in to train our model, improving our accuracy at the end? Also, is using the test data for validation enough, and if we do that, will it have an effect when we try to evaluate our model using the test data?
As stated in the Keras documentation (the last model.fit call, the one with validation_data, comes from Keras):
validation_split
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data
Data on which to evaluate the loss and any model metrics at the end of
each epoch. The model will not be trained on this data. This could be
a list (x_val, y_val) or a list (x_val, y_val, val_sample_weights).
validation_data will override validation_split.
So with this setup, the model will not train on the validation data.
In theory, validation data is used to evaluate your model and tune its hyperparameters.
If you need to put this model in production, you should retrain it on all the data, taking the performance you obtained on the validation/test data as your estimate of its performance.
In any case, the performance on validation/test data is often an optimistic estimate of the real performance.
source
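To make that flow concrete, a rough sketch under stated assumptions: build_model() is a hypothetical helper that defines and compiles the network, and X, y are the full dataset from the question's train_test_split:
# 1) Tune hyperparameters while holding out validation data during training
model = build_model()  # hypothetical: returns a compiled Keras model
history = model.fit(x_train, y_train, epochs=10, validation_split=0.2)

# 2) For production, retrain on all available data, using the validation/test
#    scores obtained above as the (optimistic) estimate of final performance
final_model = build_model()
final_model.fit(X.values, y.values, epochs=10)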

Regression: Training Test Split - held out test?

I split my data into training and test samples (70/30) for a regression/forecasting problem (MLP, LSTM, etc.).
Within the code:
history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_data=(X_test, y_test), verbose=0, shuffle=False)
I put my test data in as the validation set and did a couple of weeks' worth of predictions. So I did not hold back the test data...
But now that I think about it, I guess it was wrong to put the test data into the fit function, or was it ok?
Never, ever use your test data as part of training or validation. The test set should only be used for inference after training. So yes, it is wrong to use your test data in the fit function; it should only appear in model.evaluate(X_test, y_test) or model.predict(X_test).
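A minimal sketch of keeping the test set held out, assuming the same variable names as the question and an extra validation split carved out of the training data only:
from sklearn.model_selection import train_test_split

# Carve a validation set out of the training data; the test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.2, random_state=0)

history = model.fit(X_tr, y_tr, epochs=100, batch_size=32,
                    validation_data=(X_val, y_val), verbose=0, shuffle=False)

# The test set is used exactly once, after training is completely finished
test_metrics = model.evaluate(X_test, y_test, verbose=0)
predictions = model.predict(X_test)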

Different accuracy by fit() and evaluate() in Keras with the same dataset

I wrote Keras code to train GoogLeNet. However, the accuracy reported by fit() is 100%, yet with the same training dataset used for evaluate(), the accuracy stays at only 25%, which is a huge discrepancy! Also, unlike fit(), the accuracy from evaluate() does not improve with more training; it stays at around 25%.
Does anyone have an idea of what is wrong in this situation?
# Training dataset and labels are given. Here, load the GoogLeNet model
from keras.models import load_model
model = load_model('FT_InceptionV3.h5')

# Training phase
model.fit(x=X_train,
          y=y_train,
          batch_size=5,
          epochs=20,
          validation_split=0,
          # callbacks=[tensorboard]
          )

# Testing phase
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=1)
print("Train loss =", train_loss, "Train accuracy =", train_acc)
[Training and testing result screenshots]
After some digging into the Keras issues, I found this:
The reason is that when you use fit, the weights are updated at each batch of the training data. The loss value reported by the fit method is therefore not the loss of the final model, but the mean of the losses of all the slightly different models used on each batch.
On the other hand, when you use evaluate, the same final model is used on the whole dataset. That model never even appears in the loss reported by fit, since even the loss computed on the last batch of training is then used to update the model's weights.
To sum everything up, fit and evaluate have two completely different behaviours.
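One way to see this directly is to re-evaluate the weights on the full training set at the end of every epoch: the number printed there is the evaluate()-style accuracy of the current model, and it will generally differ from the running average that fit prints. A minimal sketch using a custom callback, assuming TensorFlow's Keras and that the model is compiled with an accuracy metric:
import tensorflow as tf

class EvaluateOnTrainingSet(tf.keras.callbacks.Callback):
    # Evaluates the epoch's final weights on the whole training set
    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(X_train, y_train, verbose=0)
        print(f"epoch {epoch}: evaluate() accuracy on training data = {acc:.3f}")

model.fit(x=X_train, y=y_train, batch_size=5, epochs=20,
          callbacks=[EvaluateOnTrainingSet()])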
Reference:-
Keras_issues_thread
Keras_official_doc
