I am using a Random Forest Classifier and I want to perform k-fold cross validation.
My dataset is already split in 10 different subsets, so I'd like to use them to do k-fold cross validation, without using automatic functions that randomly split the dataset.
Is it possible in Python?
Random Forest doesn't have the partial_fit() method, so I can't do an incremental fit.
try kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123) to evenly split your data
try kf=TimeSeriesSpit(n_splits=5) to split by time stamp
try kf = KFold(n_splits=5, random_state=123, shuffle=True) to shuffle your training data before splitting.
for train_index, test_index in kf.split(bryant_shots):
cv_train, cv_test = df.iloc[train_index], df.iloc[test_index]
#fit the classifier
you can also stratefy by groupings or categories and get mean averages for these groupings using kfold. It is super powerful for understanding your data.
It is best to join all subsets and then split them for k-fold but here is the other way:
for in range(10):
model = what_model_you_want
model.fit(dataset.drop(i_th_subset))
prediction = model.predict(i_th_subset)
test_result = compute_accuracy(i_th_subset.target, prediction)
Related
I am learning Machine learning and I am having this doubt. Can anyone tell me what is the difference between:-
from sklearn.model_selection import cross_val_score
and
from sklearn.model_selection import KFold
I think both are used for k fold cross validation, but I am not sure why to use two different code for same function.
If there is something I am missing please do let me know. ( If possible please explain difference between these two methods)
Thanks,
cross_val_score is a function which evaluates a data and returns the score.
On the other hand, KFold is a class, which lets you to split your data to K folds.
So, these are completely different. Yo can make K fold of data and use it on cross validation like this:
# create a splitter object
kfold = KFold(n_splits = 10)
# define your model (any model)
model = XGBRegressor(**params)
# pass your model and KFold object to cross_val_score
# to fit and get the mse of each fold of data
cv_score = cross_val_score(model,
X, y,
cv=kfold,
scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())
cross_val_score evaluates the score using cross validation by randomly splitting the training sets into distinct subsets called folds, then it trains and evaluated the model on the folds, picking a different fold for evaluation every time and training on the other folds.
cv_score = cross_val_score(model, data, target, scoring, cv)
KFold procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
cv_score = cross_val_score(model, data, target, scoring, cv=cv)
where model is your model on which you want to evaluate,
data is training data,
target is target variable,
scoring parameter controls what metric applied to the estimator applied and cv is the number of splits.
I'm training a dataset and then testing it on some other dataset.
To improve performance, I wanted to fine-tune my parameters with a 5-fold cross validation.
However, I think I'm not writing the correct code as when I try to fit the model to my testing set, it says it hasn't fit it yet. I though the cross-validation part fitted the model? Or maybe I have to extract it?
Here's my code:
svm = SVC(kernel='rbf', probability=True, random_state=42)
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
That is correct, the cross_validate_score doesn't return a fitted model. In your example, you have cv=5 which means that the model was fit 5 times. So, which of those do you want? The last?
The function cross_val_score is a simpler version of the sklearn.model_selection.cross_validate. Which doesn't only return the scores, but more information.
So you can do something like this:
from sklearn.model_selection import cross_validate
svm = SVC(kernel='rbf', probability=True, random_state=42)
cv_results = cross_validate(svm, data_train, lbs_train, cv=5, return_estimator=True)
# cv_results is a dict with the following keys:
# 'test_score' which is what cross_val_score returns
# 'train_score'
# 'fit_time'
# 'score_time'
# 'estimator' which is a tuple of size cv and only if return_estimator=True
accuracies = cv_results['test_score'] # what you had before
svms = cv_results['estimator']
print(len(svms)) # 5
svm = svms[-1] # the last fitted svm, or pick any that you want
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
Note, here you need to pick one of the 5 fitted SVMs. Ideally, you would use cross-validation for testing the performance of your model. So, you don't need to do it again at the end. Then, you would fit your model one more time, but this time with ALL the data which would be the model you will actually use in production.
Another note, you mentioned that you want this to fine tune the parameters of your model. Perhaps you should look at hyper-parameter optimization. For example: https://datascience.stackexchange.com/a/36087/54395 here you will see how to use cross-validation and define a parameter search space.
I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV or second, getting the best parameters from RandomizedSearchCV and then doing again a 10-fold cross-validation (with the same seed so that I get the same splits) with this best parameters.
Edit: Is the following code correct to get the probabilities of the best performing model? X is the training data and y are the labels and model is my RandomizedSearchCV containing a Pipeline with imputing missing values, standardization and SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_
for train, test in cv_outer.split(X, y):
probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
Example:
import numpy
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
train_indices, test_indices = k_fold.split(X)[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train) # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself but should work with minor modifications.
You need to look in cv_results_ this will give you the scores, and mean scores for all of your folds, along with a mean, fitting time etc...
If you want to predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of then, then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
best_params_ will give you the best fit parameters, for if you want to train a model just using the best parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
If I plan to use cross validation (KFold), should I still split the dataset into training and test data and perform my training (including cross valid) only on the training set? Or will CV do everything for me? E.g.
Option 1
X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = GridSearchCV(... cv=5)
clf.fit(X_train, y_train)
Option 2
clf = GridSearchCV(... cv=5)
clf.fit(X y)
CV is good, but it's better to have train/test split to provide independent score estimation on the untouched data.
If your CV and test data shows about the same score, then you can drop train/test split phase and CV on whole data to achive slightly better model score. But don't do it before you sure your split and CV score is consistent.
As part of the Enron project, built the attached model, Below is the summary of the steps,
Below model gives highly perfect scores
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels) ---> with the full dataset
for train_ind, test_ind in cv.split(features,labels):
x_train, x_test = features[train_ind], features[test_ind]
y_train, y_test = labels[train_ind],labels[test_ind]
gcv.best_estimator_.predict(x_test)
Below model gives more reasonable but low scores
cv = StratifiedShuffleSplit(n_splits = 100, test_size = 0.2, random_state = 42)
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels) ---> with the full dataset
for train_ind, test_ind in cv.split(features,labels):
x_train, x_test = features[train_ind], features[test_ind]
y_train, y_test = labels[train_ind],labels[test_ind]
gcv.best_estimator_.fit(x_train,y_train)
gcv.best_estimator_.predict(x_test)
Used Kbest to find out the scores and sorted the features and trying a combination of higher and lower scores.
Used SVM with a GridSearch using a StratifiedShuffle
Used the best_estimator_ to predict and calculate the precision and recall.
The problem is estimator is spitting out perfect scores, in some case 1
But when I refit the best classifier on training data then run the test it gives reasonable scores.
My doubt/question was what exactly GridSearch does with the test data after the split using the Shuffle split object we send in to it. I assumed it would not fit anything on Test data, if that was true then when I predict using the same test data, it should not give this high scores right.? since i used random_state value, the shufflesplit should have created the same copy for the Grid fit and also for the predict.
So, is using the same Shufflesplit for two wrong?
GridSearchCV as #Gauthier Feuillen said is used to search best parameters of an estimator for given data.
Description of GridSearchCV:-
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels)
clf_params will be expanded to get all possible combinations separate using ParameterGrid.
features will now be split into features_train and features_test using cv. Same for labels
Now the gridSearch estimator (pipe) will be trained using features_train and labels_inner and scored using features_test and labels_test.
For each possible combination of parameters in step 3, The steps 4 and 5 will be repeated for cv_iterations. The average of score across cv iterations will be calculated, which will be assigned to that parameter combination. This can be accessed using cv_results_ attribute of gridSearch.
For the parameters which give the best score, the internal estimator will be re initialized using those parameters and refit for the whole data supplied into it(features and labels).
Because of last step, you are getting different scores in first and second approach. Because in the first approach, all data is used for training and you are predicting for that data only. Second approach has prediction on previously unseen data.
Basically the grid search will:
Try every combination of your parameter grid
For each of them it will do a K-fold cross validation
Select the best available.
So your second case is the good one. Otherwise you are actually predicting data that you trained with (which is not the case in the second option, there you only keep the best parameters from your gridsearch)