How is Cross Validation performed and how GridSearchCV() specifically? - python

How are GridSearchCV() and/or RandomizedSearchCV() implemented in scikit-learn? I wonder how the following aspects are taken into account when using one of these techniques:
validation set
model selection
hyperparameter tuning
? Here is a picture that summarizes my confusion:
What happens when and how often? Maybe to keep it simple, let's assume a single neural network acting as our model.
My understanding so far:
In the first iteration, the model is fit on the training fold, separated into different folds. Here I struggle already: Is the model trained on a single fold and then tested on the validation fold?
What happens then with the next fold? Does the model keep the weights achieved by its first training fold or will it re-initialize for the next training fold?
To be more precise: In the first iteration, is the model fit four times and tested four times on the validation set, independently between all folds?
When the next iteration begins, the model keeps no information from the first iteration, right?
Thus, are all iterations and all folds independent of each other?
How are the hyperparameters tuned here?
In the above example, there are 25 folds in total. Is the model with a constant set of hyperparameters fit and tested 20 times?
Let's say, we have two hyperparameters to tune: Learning rate and dropout rate, both with two levels:
learning_rate = [0.3, 0.6] and
dropout_rate = [0.4, 0.8].
Will the neural net now be fitted 80 times? And when there is not just a single model but e.g. two models (neural network and random forest), will the whole procedure be performed twice?
Is there a possibility to see how many folds GridSearchCV() will consider?
I have seen Does GridSearchCV perform cross-validation? , Model help using Scikit-learn when using GridSearch and scikit-learn GridSearchCV with multiple repetitions but I can't see a clear and precise answer to my questions.

First, the k-fold method:
You split your training set into k parts (folds), for example 5. You take the first part as the validation set and the other 4 parts as the training set. You train on those 4 parts and evaluate on the held-out part, which gives you one validation score. You do this 5 times (once per fold), so that each fold serves as the validation set once while the others form the training set. In scikit-learn, each of these fits starts from a fresh copy of the estimator, so nothing is carried over between folds. At the end you take the mean of the 5 scores to obtain the CV performance of your model. That is k-fold.
Now, GridSearchCV is a hyperparameter tuner that uses the k-fold method. The principle is that you give GridSearchCV a dictionary with all the hyperparameter values you want to test; it then evaluates every combination in the dictionary with k-fold CV and selects the best set of hyperparameters (the one with the best mean CV performance). It can take a very long time.
You pass GridSearchCV a model (estimator), a set of parameters and, if you want, the number of folds:
GridSearchCV(SVC(), parameters, cv = 5)
where SVC() is the estimator, parameters is your dictionary of hyperparameters and cv is the number of folds.
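As a rough illustration of how the number of fits adds up, here is a minimal sketch. The synthetic dataset and the small parameter grid are assumptions made up for this example; n_splits_ and cv_results_ are attributes of a fitted GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: 2 values of C x 2 kernels = 4 candidate settings.
parameters = {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # hyperparameter combinations tried
n_folds = search.n_splits_                        # folds actually used by the search
n_cv_fits = n_candidates * n_folds                # independent fits during the search
n_total_fits = n_cv_fits + 1                      # plus one final refit (refit=True is the default)

print("folds:", n_folds)
print("candidates:", n_candidates)
print("fits during CV:", n_cv_fits)
print("best params:", search.best_params_)
```

With two models (say, a neural network and a random forest), you would run one such search per model, each with its own grid.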


How to know when an overfitting is taking place?

I have training data with 3961 distinct rows and 32 columns that I want to fit with a Random Forest and a Gradient Boosting model. While training, I need to fine-tune the hyper-parameters of the models to get the best AUC possible. To do so, I minimize the quantity 1-AUC(Y_real,Y_pred) using the Basin-Hopping algorithm from SciPy; my training and internal validation subsamples are the same.
When the optimization finishes, I get AUC=0.994 for the Random Forest and AUC=1 for the Gradient Boosting. Am I overfitting these models? How can I tell when overfitting is taking place during training?
To know whether you are overfitting, you have to compute:
Training set accuracy (or 1-AUC in your case)
Test set accuracy (or 1-AUC in your case); you can use a validation set if you have one
Once you have calculated these scores, compare them. If the training set score is much better than the test set score, you are overfitting. This means that your model is "memorizing" the training data instead of learning patterns that carry over to future predictions.
To know whether you are overfitting, you always need to go through this process. However, if your training score is suspiciously perfect (e.g. an accuracy of 100%), that alone is a hint that you may be overfitting.
So, if you don't have separate training and test data, you have to create them using sklearn.model_selection.train_test_split. Then you will be able to compare the two scores. Otherwise, you won't be able to tell with confidence whether you are overfitting or not.
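A minimal sketch of that comparison, assuming synthetic data in place of the asker's 3961 x 32 table and a RandomForestClassifier with default-like settings (both are illustrative choices, not taken from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption of this sketch).
X, y = make_classification(
    n_samples=1000, n_features=32, n_informative=8, random_state=0
)

# Hold out a test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# AUC on the data the model was fit on vs. AUC on unseen data.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
gap = train_auc - test_auc

print(f"train AUC: {train_auc:.3f}")
print(f"test AUC:  {test_auc:.3f}")
print(f"gap:       {gap:.3f}")
```

A large gap between the two numbers is the signal to look for. A near-perfect training AUC on its own says little, because the model was scored on the same rows it was fit on.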

Using cross-validation to determine weights of machine learning algorithms (GridSearchCv,RidgeCV,StackingClassifier)

My question has to do with GridSearchCV, RidgeCV, and StackingClassifier/Regressor.
StackingClassifier/StackingRegressor: AFAIK, it first trains each base estimator individually on the whole training set. Then, it uses a cross-validation scheme, using the predictions of each base estimator as the new features to train the final estimator. From the documentation: "To generalize and avoid over-fitting, the final_estimator is trained on out-samples using sklearn.model_selection.cross_val_predict internally."
My question is, what exactly does this mean? Does it break the train data into k folds, and then for each fold, train the final estimator on the training section of the fold, test it on the testing section of the fold, and then take the final estimator weights from the fold with the best score? or what?
I think I can group GridSearchCV and RidgeCV into the same question as they are quite similar (albeit RidgeCV uses leave-one-out CV by default).
To find the best hyperparameters, do they run CV over all the folds for each set of hyperparameters, pick the set with the best average score, AND THEN, after finding the best hyperparameters, train the model with them on the WHOLE training set? Or am I looking at it wrong?
If anyone could shed some light on this, that would be great. Thanks!
You're exactly right. The process looks like this:
1. Select the first set of hyperparameters
2. Partition the data into k folds
3. Fit the model on the training part of each fold and score it on the held-out part
4. Obtain the average score (loss, r2, or whatever criterion was specified)
5. Repeat steps 2-4 for all other sets of hyperparameters
6. Choose the set of hyperparameters with the best score
7. Retrain the model on the entire dataset (as opposed to a single fold) using the best hyperparameters
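The procedure above can be sketched by hand and compared with what GridSearchCV reports. The Ridge estimator, the alpha grid and the synthetic data are assumptions chosen to keep the example small; this mirrors the described steps and is not the library's internal code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score

# Synthetic regression data and a small hypothetical grid.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
param_grid = {"alpha": [0.01, 1.0, 100.0]}

best_score, best_params = float("-inf"), None
for params in ParameterGrid(param_grid):  # loop over every hyperparameter set
    # 5-fold CV: a fresh Ridge is fit on 4 folds and scored (r2) on the 5th.
    scores = cross_val_score(Ridge(**params), X, y, cv=5)
    mean_score = scores.mean()            # average score across the folds
    if mean_score > best_score:           # keep the best-scoring set
        best_score, best_params = mean_score, params

# Final step: retrain on the entire dataset with the winning hyperparameters.
final_model = Ridge(**best_params).fit(X, y)

# The same search done by GridSearchCV, for comparison.
search = GridSearchCV(Ridge(), param_grid, cv=5).fit(X, y)
print("manual best:", best_params)
print("GridSearchCV best:", search.best_params_)
```

Note that the weights from the individual folds are discarded; only the hyperparameters survive into the final fit.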

Cross-validation of neural network: How to treat the number of epochs?

I'm implementing a PyTorch neural network (regression) and want to identify the best network topology, optimizer etc. I use cross validation because I have x databases of measurements and I want to evaluate whether I can train a neural network on a subset of the x databases and apply it to the unseen databases. Therefore, I also introduce a test database, which I don't use during hyperparameter identification.
I am confused on how to treat the number of epochs in cross validation, e.g. I have a number of epochs = 100. There are two options:
Option 1: The number of epochs is a hyperparameter to tune. In each epoch, the mean error across all cross validation iterations is determined. After models are trained with all network topologies, optimizers etc., the model with the smallest mean error is determined and has parameters like:
-network topology: 1
-optimizer: SGD
-number of epochs: 54
To calculate the performance on the test set, a model is trained with exactly these parameters (number of epochs = 54) on the training and the validation data. Then it is applied and evaluated on the test set.
Option 2: The number of epochs is NOT a hyperparameter to tune. Models are trained with all the network topologies, optimizers etc. For each model, the number of epochs where the error is the smallest is used. The models are compared and the best model can be determined with parameters like:
-network topology: 1
-optimizer: SGD
To calculate the performance on the test data, a "simple" training and validation split is used (e.g. 80-20). The model is trained with the above parameters and 100 epochs on the training and validation data. Finally, a model with a number of epochs yielding the smallest validation error is evaluated on the test data.
Which option is the correct or the better one?
It is better not to tune the number of epochs as a hyperparameter.
Option 2 is the better option.
Actually, if the number of epochs is fixed, you do not need a validation set for this purpose. The validation set is what tells you at which epoch to save the model.
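To illustrate how a validation set picks the epoch, here is a framework-free sketch. A linear model trained by plain gradient descent in NumPy stands in for the PyTorch network, and the data, learning rate and split are made up for the example.

```python
import numpy as np

# Synthetic regression data (assumption of this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Simple 80-20 training/validation split.
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]


def mse(w, X, y):
    """Mean squared error of the linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))


w = np.zeros(5)
n_epochs, lr = 100, 0.05
best_val_loss, best_epoch, best_w = float("inf"), 0, w.copy()
val_losses = []

for epoch in range(1, n_epochs + 1):
    # One full-batch gradient step on the training data.
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w = w - lr * grad

    # Track the validation loss and remember the best weights seen so far.
    val_loss = mse(w, X_val, y_val)
    val_losses.append(val_loss)
    if val_loss < best_val_loss:
        best_val_loss, best_epoch, best_w = val_loss, epoch, w.copy()

print("best epoch:", best_epoch)
print("best validation MSE:", best_val_loss)
```

The weights kept in best_w play the role of the saved checkpoint; the epoch count is a by-product of training rather than a value searched over in a grid.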

Interpretation of train-validation loss of a Neural Network

I have trained an LSTM model for Time Series Forecasting. I have used an early stopping method with a patience of 150 epochs.
I have used a dropout of 0.2, and this is the plot of train and validation loss:
The early stopping method stopped the training after 650 epochs and saved the best weights from around epoch 460, where the validation loss was lowest.
My question is :
Is it normal that the train loss is always above the validation loss?
I know that if it was the opposite(validation loss above the train) it would have been a sign of overfitting.
But what about this case?
My dataset is a time series with hourly temporal frequency. It is composed of 35000 instances. I have split the data into 80% train and 20% validation, in temporal order. So, for example, the training set contains the data until the beginning of 2017 and the validation set the data from 2017 until the end.
I have created this plot by averaging the data over 15 days and this is the result:
So maybe the reason is, as you said, that the validation data have an easier pattern. How can I solve this problem?
In most cases, the validation loss should be higher than the training loss, because the model has seen the labels of the training set. In fact, a good habit when training a new network is to use a small subset of the data and check whether the training loss can converge to 0 (i.e. the model fully overfits that subset). If it cannot, the model is somehow unable to memorize the data.
Let's go back to your problem. A validation loss lower than the training loss does happen. It is possibly not because of your model, but because of how you split the data. Suppose there are two types of patterns (A and B) in the dataset, and you split in such a way that the training set contains both pattern A and pattern B while the small validation set contains only pattern B. In this case, if B is easier to recognize, you might get a higher training loss.
In a more extreme example, pattern A is almost impossible to recognize but makes up only 1% of the dataset, and the model can recognize all of pattern B. If the validation set happens to contain only pattern B, then the validation loss will be smaller.
As alex mentioned, using K-fold is a good way to make sure every sample is used as both validation and training data. Printing out the confusion matrix to check that all labels are relatively balanced is another thing to try.
Usually the opposite is true. But since you are using dropout, it is common for the validation loss to be lower than the training loss: dropout is active while the training loss is computed and switched off during validation. And, as others have suggested, try k-fold cross-validation.
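As a small check of the K-fold claim, the sketch below counts how often each sample lands in the validation and training parts. The 10-sample toy array is an assumption for illustration. KFold keeps the rows in order by default (shuffle=False), but whether plain K-fold suits temporally ordered data is a separate question not settled here.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
kf = KFold(n_splits=5)            # contiguous folds, no shuffling by default

val_counts = np.zeros(len(X), dtype=int)
train_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in kf.split(X):
    train_counts[train_idx] += 1  # sample used for training in this split
    val_counts[val_idx] += 1      # sample used for validation in this split

print("times in validation:", val_counts)
print("times in training:  ", train_counts)
```

Every sample is validated on exactly once and trained on in the remaining splits, so no single "easy" block of data decides the validation score on its own.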

Does sklearn LogisticRegressionCV use all data for final model

I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5).fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key with a value that is an array with shape (n_folds,1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I have a few options I think it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning, on the assumption that they will contribute to optimal learning. From the scikit-learn documentation:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In the case of LogisticRegression, C is a hyper-parameter which describes the inverse of the regularization strength. The higher the C, the less regularization is applied during training. C is not changed during training; it stays fixed.
Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters passed to the constructor), the learnt values can differ.
There is another topic, namely how to choose the initial values of coef_ so that training is faster and better. That is optimization. Some start with random weights between 0 and 1, others start with 0, and so on. But that is not relevant to your question. LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the values of different C from constructor (In your example you passed 1.0).
For each value of C, do the cross-validation of supplied data, in which the LogisticRegression will be fit() on training data of the current fold, and scored on the test data. The scores from test data of all folds are averaged and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score will be chosen.
Now the chosen C is set as the final C and LogisticRegression is again trained (by calling fit()) on the whole data (Xdata,ylabels here).
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, LassoCV etc.
The initializing and updating of the coef_ feature weights is done inside the fit() function of the algorithm, which is outside the scope of hyper-parameter tuning. That optimization part depends on the internal optimization algorithm used, for example the solver param in the case of LogisticRegression.
Hope this makes things clear. Feel free to ask if anything is still in doubt.
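A minimal sketch of that behaviour with the default refit=True. The synthetic data and the list of C values are assumptions for the example; np.ravel is used only so that the selected C can be read whether C_ is stored as an array or as a scalar.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic binary data standing in for Xdata / ylabels.
Xdata, ylabels = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical candidate values for C; 5-fold CV picks one of them.
Cs = [0.01, 1.0, 100.0]
clf = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000).fit(Xdata, ylabels)

# The C chosen by cross-validation.
best_C = float(np.ravel(clf.C_)[0])

# Fit a plain LogisticRegression with that C on all the data for comparison.
manual = LogisticRegression(C=best_C, max_iter=1000).fit(Xdata, ylabels)
max_diff = float(np.max(np.abs(clf.coef_ - manual.coef_)))

print("selected C:", best_C)
print("coef_ shape:", clf.coef_.shape)
print("max |coef difference| vs. manual fit on all data:", max_diff)
```

The difference is expected to be small, though not necessarily exactly zero, since the two fits may start from different initial values and stop at the solver's tolerance.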
You have the parameter refit=True by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters, it is talking about the C regularization parameter. So it uses the C that gives the best average score across the K folds.
When refit=False, it returns the best model found in cross validation.
So if you trained on 5 folds, you will get the model (coefficients + C + intercept), trained on 4 folds of data, that gave the best score on its fold's test set.
I agree that the documentation here is not very clear, but averaging C values and coefficients does not really make much sense.
I just took a look at the source code. It seems that for refit=True, they just select the best hyperparameters (C and l1_ratio) and retrain the model on all the data.
For refit=False, it seems they do average the hyperparameters; see the source code below:
best_indices = np.argmax(scores, axis=1)
best_indices_C = best_indices % len(self.Cs_)

