Nested cross-validation example on Scikit-learn - python

I'm trying to wrap my head around the Nested vs. Non-Nested CV example in Sklearn. I checked multiple answers but I am still confused by the example.
To my knowledge, a nested CV aims to use a different subset of the data to select the best parameters of a classifier (e.g. C in SVM) and to validate its performance. Therefore, from a dataset X, the outer 10-fold CV (for simplicity n=10) creates 10 training sets and 10 test sets:
(Tr0, Te0), ..., (Tr9, Te9)
Then, the inner 10-fold CV splits EACH outer training set into 10 training and 10 test sets:
From Tr0: (Tr0_0, Te0_0), ..., (Tr0_9, Te0_9)
...
From Tr9: (Tr9_0, Te9_0), ..., (Tr9_9, Te9_9)
Now, using the inner CV, we can find the best values of C for every single outer Training set. This is done by testing all the possible values of C with the inner CV. The value providing the highest performance (e.g. accuracy) is chosen for that specific outer Training set. Finally, having discovered the best C values for every outer Training set, we can calculate an unbiased accuracy using the outer Test sets. With this procedure, the samples used to identify the best parameter (i.e. C) are not used to compute the performance of the classifier, hence we have a totally unbiased validation.
The example provided in the Sklearn page is:
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_
# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()
From what I understand, the code simply calculates the scores using two different cross-validations (i.e. different splits into training and test sets). Both of them use the entire dataset. The GridSearchCV identifies the best parameters using one of the two CVs, then cross_val_score calculates, with the second CV, the performance when using the best parameters.
Am I interpreting a Nested CV in the wrong way? What am I missing from the example?
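As a point of reference, when cross_val_score is handed the GridSearchCV object, what the nested line effectively does can be sketched as follows (same names as in the example above; this is an illustration of the mechanism, not scikit-learn's actual implementation):

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris):
    X_tr, X_te = X_iris[train_idx], X_iris[test_idx]
    y_tr, y_te = y_iris[train_idx], y_iris[test_idx]
    # inner CV: the grid search only sees the outer training portion
    grid = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    grid.fit(X_tr, y_tr)
    # the best C found on this outer training set is scored on the held-out outer fold
    outer_scores.append(grid.score(X_te, y_te))
nested_estimate = sum(outer_scores) / len(outer_scores)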

Related

GridSearchCV and KFold cross-validation

I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV and about how I should use GridSearchCV's recommendations afterwards.
Say I declare a GridSearchCV instance as below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

RFReg = RandomForestRegressor(random_state=1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
I had the following questions:
1. Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?
a. My understanding of the process is as follows:
1. X_train and y_train will be split into 10 folds.
2. The model will be trained on 9 folds and tested on the remaining fold, and its score will be stored in a list: say score_list.
3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list, giving 10 scores in all.
4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.
b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.
c. The best model will be the one with the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' from CV_rfc.best_params_.
Is my understanding of GridSearchCV correct?
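The procedure described in (a)-(c) corresponds roughly to the manual loop below. This is only an illustrative sketch, assuming X_train and y_train are NumPy arrays; it is not what GridSearchCV literally executes:

from itertools import product
from sklearn.model_selection import KFold

results = {}
for n_est, depth in product(param_grid['n_estimators'], param_grid['max_depth']):
    fold_scores = []
    for tr_idx, te_idx in KFold(n_splits=10).split(X_train):
        model = RandomForestRegressor(n_estimators=n_est, max_depth=depth, random_state=1)
        model.fit(X_train[tr_idx], y_train[tr_idx])
        fold_scores.append(model.score(X_train[te_idx], y_train[te_idx]))  # R^2 for a regressor
    results[(n_est, depth)] = sum(fold_scores) / len(fold_scores)  # the final_score for this combination
best_params = max(results, key=results.get)  # roughly what CV_rfc.best_params_ reports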
2. Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:
RFReg_best = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=1)
I now have two options, and I wanted to know which of them is correct:
a. Use cross-validation on the entire dataset to see how well the model performs, as below:
scores = cross_val_score(RFReg_best, X, y, cv=10, scoring='neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
b. Fit the model on X_train, y_train and then test it on X_test, y_test:
RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
Or are both of them correct?
Regarding (1), your understanding is indeed correct; one wording detail that should be corrected in principle is "better final_score" instead of "highest", as there are several performance metrics (everything measuring the error, such as MSE, MAE etc.) that are the-lower-the-better.
Now, step (2) is more tricky; it requires taking a step back to check the whole procedure...
To start with, in general CV is used either for parameter tuning (your step 1) or for model assessment (i.e. what you are trying to do in step 2), which are indeed different things. Splitting your data into training & test sets from the very beginning, as you have done here, and then sequentially performing step 1 (for parameter tuning) and step 2b (model assessment on unseen data) is arguably the most "correct" procedure in principle. (As for the bias you note in the comment, this is something we have to live with, since by default all our fitted models are "biased" toward the data used for their training, and this cannot be avoided.)
Nevertheless, since early on, practitioners have wondered whether they can avoid "sacrificing" part of their precious data for testing (model assessment) purposes alone, and whether they can actually skip the model assessment part (and the test set itself), using the best results obtained from the parameter tuning procedure (your step 1) as the model assessment. This is clearly cutting corners, but, as usual, the question is: how far off will the actual results be, and will they still be meaningful?
Again, in theory, what Vivek Kumar writes in his linked answer is correct:
If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.
But here is a relevant excerpt from the (highly recommended) Applied Predictive Modeling book (p. 78):
[excerpt included as an image in the original answer]
In short: if you use the whole X in step 1 and consider the results of the tuning as model assessment, there will indeed be a bias/leakage, but it is usually small, at least for moderately large training sets...
Wrapping up:
The "most correct" procedure in theory is indeed the combination of your steps 1 and 2b.
You can try to cut corners by using the whole training set X in step 1, and most probably you will still be within acceptable limits regarding your model assessment.
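A minimal sketch of that "steps 1 + 2b" combination, reusing the names and parameter grid from the question (X_train/X_test are assumed to come from an earlier train/test split):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# step 1: parameter tuning on the training set only
CV_rfc = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      cv=10, scoring='neg_mean_squared_error')
CV_rfc.fit(X_train, y_train)
# step 2b: model assessment on the held-out test set
y_pred = CV_rfc.best_estimator_.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))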

Cross validation: cross_val_score function from scikit-learn arguments

According to the scikit-learn documentation:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
with X and y described as:
X : array-like. The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None. The target variable to try to predict in the case of supervised learning.
I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset. In some of the notebooks from kaggle some people use the whole dataset and some others X_train and y_train.
To my knowledge, cross-validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). So, in my view, the more data you have the better the performance will be, and I would use the whole dataset.
What do you think?
Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize.
That is why we need cross-validation.
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, since over the course of the procedure the model is not only trained but also tested on all of the available data.
I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset.
[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
Suppose you use cross-validation with 5 folds (cv = 5).
We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set, and compute the metric of interest.
Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set, and compute the metric of interest.
By default, scikit-learn's cross_val_score() function uses the R^2 score as the metric of choice for regression.
The R^2 score is also called the coefficient of determination.
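For example (a small sketch for a regression setting, assuming X and y hold the whole dataset):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv=5)  # internally splits X, y into 5 train/test folds
print(scores)         # one R^2 score per fold
print(scores.mean())  # average R^2 across the 5 folds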

Model help using Scikit-learn when using GridSearch

As part of the Enron project, I built the attached model. Below is a summary of the steps.
The model below gives nearly perfect scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    gcv.best_estimator_.predict(x_test)
The model below gives more reasonable but lower scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
Used KBest to find the feature scores, sorted the features, and tried combinations of higher and lower scores.
Used SVM with a GridSearch, using a StratifiedShuffle split.
Used the best_estimator_ to predict and calculate the precision and recall.
The problem is that the estimator is spitting out perfect scores, in some cases 1.
But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.
My doubt/question is: what exactly does GridSearch do with the test data after the split, using the ShuffleSplit object we pass to it? I assumed it would not fit anything on the test data; if that is true, then when I predict using the same test data, it should not give such high scores, right? Since I used a random_state value, the ShuffleSplit should have created the same splits for the grid fit and for the predict.
So, is using the same ShuffleSplit for both wrong?
GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator for the given data.
Description of GridSearchCV:
1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
2. gcv.fit(features, labels)
3. clf_params will be expanded to get all possible combinations separately using ParameterGrid.
4. features will now be split into features_train and features_test using cv. The same goes for labels.
5. Now the gridSearch estimator (pipe) will be trained using features_train and labels_train, and scored using features_test and labels_test.
6. For each possible combination of parameters in step 3, steps 4 and 5 will be repeated for the cv iterations. The average score across the cv iterations will be calculated and assigned to that parameter combination. This can be accessed using the cv_results_ attribute of gridSearch.
7. For the parameters which give the best score, the internal estimator will be re-initialized using those parameters and refit on the whole data supplied to it (features and labels).
Because of the last step, you are getting different scores in your first and second approaches: in the first approach, all of the data was used for training and you are predicting on that same data, whereas the second approach makes predictions on previously unseen data.
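To make the last point concrete (same variable names as the question; a sketch rather than a recommendation, and clone is used so the stored best_estimator_ is not refit in place):

from sklearn.base import clone

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    # first approach: gcv.best_estimator_ was refit on ALL of (features, labels) in step 7,
    # so x_test is part of its training data and the scores come out inflated
    # second approach: refit a copy on the outer training split only, then x_test is unseen
    fold_clf = clone(gcv.best_estimator_).fit(x_train, y_train)
    fold_clf.predict(x_test)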
Basically the grid search will:
Try every combination of your parameter grid
For each of them, do a K-fold cross-validation
Select the best one available.
So your second case is the correct one. Otherwise you are actually predicting on data that you trained with (which is not the case in the second option; there you only keep the best parameters from your grid search).

Scikit - Combining scale and grid search

I am new to scikit-learn, and have two small issues with combining data scaling and grid search.
Efficient scaler
Considering a cross-validation using K folds, I would like the data scaler (using preprocessing.StandardScaler() for instance) to be fit only on the K-1 training folds each time we train the model, and then applied to the remaining fold.
My impression is that the following code will fit the scaler on the entire dataset, and therefore I would like to modify it so that it behaves as described above:
from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

classifier = svm.SVC(C=1)
clf = make_pipeline(preprocessing.StandardScaler(), classifier)
tuned_parameters = [{'C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)
Retrieve inner scaler fitting
When refit=True, after the grid search the model is refit (using the best estimator) on the entire dataset; my understanding is that the pipeline will be used again, and therefore the scaler will be fit on the entire dataset. Ideally I would like to reuse that fit to scale my 'test' dataset. Is there a way to retrieve it directly from GridSearchCV?
GridSearchCV knows nothing about the Pipeline object; it assumes that the provided estimator is atomic, in the sense that it cannot choose only some particular stage (StandardScaler for example) and fit different stages on different data.
All GridSearchCV does is call the fit(X, y) method on the provided estimator, where X, y are splits of the data. Thus it fits all stages on the same splits.
Try this:
best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline["standardscaler"]
When you wrap your transformers/estimators in a Pipeline, you have to add a prefix (the step name) to each parameter name, e.g. tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]; for more details look at these examples: Concatenating multiple feature extraction methods, Pipelining: chaining a PCA and a logistic regression.
In any case, read the GridSearchCV documentation; it may help you.
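Putting the two points together, a minimal sketch (the step name 'standardscaler' is the one make_pipeline generates automatically; X_train/X_test are assumed to exist):

from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC())
tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]  # note the 'svc__' prefix
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)
my_grid_search.fit(X_train, y_train)  # the whole pipeline (scaler included) is fit on each CV training split, then refit on all of X_train

best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline.named_steps['standardscaler']  # fitted on X_train by the final refit
X_test_scaled = best_scaler.transform(X_test)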

How to generate a custom cross-validation generator in scikit-learn?

I have an unbalanced dataset, so I have an oversampling strategy that I apply only during training. I'd like to use scikit-learn classes like GridSearchCV or cross_val_score to explore or cross-validate some parameters on my estimator (e.g. SVC). However, I see that you either pass the number of cv folds or a standard cross-validation generator.
I'd like to create a custom cv generator that gives me a stratified 5-fold split, oversamples only my training data (4 folds), and lets scikit-learn search through the grid of parameters of my estimator, scoring on the remaining fold used for validation.
The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[myDf['cvLabel'] != i].index.values.astype(int)
    testIndices = myDf[myDf['cvLabel'] == i].index.values.astype(int)
    myCViterator.append((trainIndices, testIndices))
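Such a list of (train_indices, test_indices) pairs can then be passed directly as the cv argument; for example (assuming an estimator clf, arrays X and y aligned with myDf, and a default 0..n-1 integer index):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=myCViterator)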
I had a similar problem and this quick hack is working for me:
import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            # positions (within the training fold rx) of the majority (0) and minority (1) classes
            nix = np.where(y[rx] == 0)[0]
            pix = np.where(y[rx] == 1)[0]
            # resample the minority class with replacement up to the majority size
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
This upsamples (with replacement) the minority class for a balanced (k-1)-fold training set, but leaves the k-th test set unbalanced. This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.
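For example, it can be plugged into a grid search the same way a built-in splitter would be (a sketch; the SVC, parameter grid, and the X/y arrays with 0/1 labels are just placeholders):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(SVC(), param_grid={'C': [1, 10, 100]},
                    cv=UpsampleStratifiedKFold(n_splits=5))
grid.fit(X, y)  # each training split is upsampled; each test split is left unbalanced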
Scikit-Learn provides a workaround for this with its Label k-fold (LabelKFold) iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
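Note that in recent scikit-learn versions the cross_validation module and LabelKFold have been removed; GroupKFold in model_selection plays the same role, and a roughly equivalent call would be:

from sklearn.model_selection import GroupKFold, cross_val_predict

predicted = cross_val_predict(clf, features, labels,
                              cv=GroupKFold(n_splits=5), groups=cv_labels)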
import numpy as np

class own_custom_CrossValidator:  # like those in sklearn/model_selection/_split.py
    def __init__(self):  # could e.g. store self.coordinates, self.meter
        pass
    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict, cross_val_score
        # (the original yield was truncated; one plausible completion is to
        #  train on all indices and test on sample i)
        for i in range(len(X)):
            yield np.arange(len(X)), np.array([i])
