Is
class sklearn.cross_validation.ShuffleSplit(
n,
n_iterations=10,
test_fraction=0.10000000000000001,
indices=True,
random_state=None
)
the right way to do 10*10-fold CV in scikit-learn (by changing random_state to 10 different numbers)?
I ask because I didn't find any random_state parameter in Stratified K-Fold or K-Fold, and the splits from K-Fold are always identical for the same data.
If ShuffleSplit is the right choice, one concern is that the documentation mentions:
Note: contrary to other cross-validation strategies, random splits do not
guarantee that all folds will be different, although this is still
very likely for sizeable datasets
Is this always the case for 10*10 fold CV?
I am not sure what you mean by 10*10 cross validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. You can either repeat this 10 times with an explicit outer loop, or call fit 100 times in a single loop, with 10% of the data reserved for testing each time, by using instead:
>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
...                   random_state=42)
If you want to do 10 runs of StratifiedKFold with k=10 you can shuffle the dataset between the runs (that would lead to a total 100 calls to the fit method with a 90% train / 10% test split for each call to fit):
>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
...     X, y = shuffle(X_orig, y_orig, random_state=i)
...     skf = StratifiedKFold(y, 10)
...     print(cross_val_score(clf, X, y, cv=skf))
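The snippets above use the old sklearn.cross_validation module, which later releases replaced with sklearn.model_selection. As a minimal sketch of the same 10 runs of stratified 10-fold CV with the newer API, one option is RepeatedStratifiedKFold. The iris data and the GaussianNB classifier are assumptions chosen only to make the example self-contained; substitute your own X, y and clf.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Illustrative data and classifier (assumptions, not from the question).
X, y = load_iris(return_X_y=True)
clf = GaussianNB()

# 10 repetitions of stratified 10-fold CV; the data is reshuffled
# before each repetition, so the folds differ between repetitions.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=rskf)

# One score per fit: 10 folds x 10 repetitions.
print(scores.shape)

# Splits are yielded repetition by repetition, so each row of the
# reshaped array holds the 10 fold scores of one repetition.
per_repeat_means = scores.reshape(10, 10).mean(axis=1)
print(per_repeat_means)
```

Unlike ShuffleSplit, every repetition here is a true partition, so each sample is tested exactly once per repetition.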
I've tried fitting a random forest like so:
from xgboost import XGBRFRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
forest = XGBRFRegressor(num_parallel_tree = 10, num_boost_round = 1000, verbose=3)
forest.fit(
    X_train,
    y_train,
    eval_set = [(X_test, y_test)],
    early_stopping_rounds = 10,
    verbose = True
)
However, early stopping never seems to kick in and as far as I can tell, the model fits the full 10,000 trees requested. The evaluation metric is only printed once, rather than after every boosting round as I would have expected.
What's the right way to set up this type of model (working within the scikit-learn API) so that early stopping takes effect as I would expect?
I have requested clarification from the developers here:
https://discuss.xgboost.ai/t/how-is-xgbrfregressor-intended-to-work-with-early-stopping/2391
The docs say:
[XGBRFRegressor has] default values and meaning of some of the parameters adjusted accordingly. In particular:
n_estimators specifies the size of the forest to be trained; it is converted to num_parallel_tree, instead of the number of boosting rounds
learning_rate is set to 1 by default
colsample_bynode and subsample are set to 0.8 by default
booster is always gbtree
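And you can see that in action in the code: num_parallel_tree is overridden with the input n_estimators, and the number of boosting rounds is overridden to 1. With a single boosting round there is nothing for early stopping to stop, which is why the evaluation metric is printed only once.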
It's probably worth reading the paragraphs preceding the documentation link in order to understand how xgboost treats random forests.
I am learning machine learning and have a question. Can anyone tell me what the difference is between:
from sklearn.model_selection import cross_val_score
and
from sklearn.model_selection import KFold
I think both are used for k-fold cross validation, but I am not sure why there are two different imports for the same job.
If there is something I am missing, please let me know (and, if possible, please explain the difference between the two).
Thanks,
cross_val_score is a function which evaluates an estimator on the data using cross validation and returns the score of each fold.
On the other hand, KFold is a class which lets you split your data into K folds.
So these are completely different things. You can create the K folds and use them for cross validation like this:
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor

# create a splitter object
kfold = KFold(n_splits = 10)
# define your model (any model); params is your own dict of hyperparameters
model = XGBRegressor(**params)
# pass your model and KFold object to cross_val_score
# to fit and get the negative RMSE of each fold of data
cv_score = cross_val_score(model,
                           X, y,
                           cv=kfold,
                           scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())
cross_val_score evaluates the score using cross validation by splitting the training set into distinct subsets called folds (without shuffling, unless the splitter you pass shuffles). It then trains and evaluates the model on the folds, picking a different fold for evaluation every time and training on the other folds.
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
KFold procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
cv_score = cross_val_score(model, data, target, scoring=scoring, cv=cv)
where model is the model you want to evaluate,
data is the training data,
target is the target variable,
scoring controls which metric is applied to the estimator, and cv is either the number of splits or a splitter object such as the KFold instance above.
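To make the division of labour concrete, here is a minimal sketch in which a hand-written loop over KFold reproduces what cross_val_score does. The synthetic data and the Ridge model are assumptions chosen only to keep the example small and fast; any regressor would do.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data and model (assumptions, not from the question).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = Ridge()
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Manual loop: KFold only yields index arrays; fitting and scoring is up to us.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], fold_model.predict(X[test_idx]))
    manual_scores.append(-np.sqrt(mse))  # negated to match the "neg_" scorer

# cross_val_score runs the same split / fit / score loop internally.
auto_scores = cross_val_score(model, X, y,
                              scoring="neg_root_mean_squared_error", cv=cv)

print(np.allclose(manual_scores, auto_scores))
```

The two sets of scores agree because the splitter has a fixed random_state, so both calls see the same folds.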
I have managed to write some code doing a nested cross-validation using lightGBM as my regressor and wrapping everything with sklearn.pipeline.
Ultimately, I would now like to do feature selection (or really just get the feature importances for the final model), but I am wondering what the best path to take from here is. I guess there are two possibilities:
1# Use this methodology to build a model (using .fit and .predict) with the best hyperparameters. Then check the importance of the features for this model.
2# Do feature selection in the inner fold of the nested CV, but I am unsure how to do this exactly.
I guess #1 would be the easiest, but I am unsure how to get the best hyperparameters for each outer fold.
This thread touches on it:
Putting together sklearn pipeline+nested cross-validation for KNN regression
But the selected answer drops cross_val_score altogether, meaning that it isn't nested cross-validation anymore (I would still like to perform the CV on the outer fold after getting the best hyperparameters on the inner fold).
So my problem is the following:
Can I get feature importances for each fold of the outer CV (I am
aware that if I have 5 folds, I will get 5 different sets of feature
importance)? And if yes, how?
Alternatively, should I just get the best hyperparameters for each
fold (how?) and build a new model without CV on the whole dataset,
based on these hyperparameters?
Here is the code I have so far:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import scipy.stats as st
#Parameters for model building and reproducibility
X = X_age
y = y_age
RNGesus = 42
state = 13
outer_scoring = 'neg_mean_absolute_error'
inner_scoring = 'neg_mean_absolute_error'
#### Nested CV with Random gridsearch ####
# Pipeline with standard scaling and the regressor
regressors = [lgb.LGBMRegressor(random_state = state)]
continuous_transformer = Pipeline([('scaler', StandardScaler())])
preprocessor = ColumnTransformer([('cont',continuous_transformer, continuous_variables)], remainder = 'passthrough')
for reg in regressors:
    steps = [('preprocessor', preprocessor), ('regressor', reg)]
    pipeline = Pipeline(steps)
    # inner and outer folds to be used
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=RNGesus)
    # Hyperparameters of the regressor to be optimized using randomized search
    params = {
        'regressor__max_depth': (3, 5, 7, 10),
        'regressor__lambda_l1': st.uniform(0, 5),
        'regressor__lambda_l2': st.uniform(0, 3)
    }
    # Pass the RandomizedSearchCV to cross_val_score
    regression = RandomizedSearchCV(estimator=pipeline, param_distributions=params, scoring=inner_scoring, cv=inner_cv, n_iter=200, verbose=3, n_jobs=-1)
    nested_score = cross_val_score(regression, X=X, y=y, cv=outer_cv, scoring=outer_scoring)
    print('\n MAE for lightGBM model predicting age: %.3f' % (abs(nested_score.mean())))
    print('\n' + str(nested_score) + ' <- outer CV')
Edit: Stated the problem clearly.
I encountered problems importing the lightGBM module, so I couldn't run your code. But here is a post explaining why you cannot get the "winning" or optimal hyperparameters (or feature_importances_) out of nested cross-validation done with cross_val_score. Briefly, the reason is that cross_val_score only returns the scores.
Can I get feature importances for each fold of the outer CV (I am aware that if I have 5 folds, I will get 5 different sets of feature importance)? And if yes, how?
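The answer is no with cross_val_score. But if you follow the code from that post, you'll be able to get the feature importances simply via GSCV.best_estimator_.feature_importances_ inside the for loop after GSCV.fit(). If the estimator is a Pipeline, read the attribute from the final step instead, e.g. GSCV.best_estimator_.named_steps['regressor'].feature_importances_.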
Alternatively, should I just get the best hyperparameters for each fold (how?) and build a new model without CV on the whole dataset, based on these hyperparameters?
This is exactly what that post is talking about: getting you the "best" hyperparameters by nested CV. Ideally, you'll observe one combination of hyperparameters that wins every time, and those are the hyperparameters you'll use for the final model (fitted on the entire training set). But when different "best" hyperparameter combinations appear during CV, there is no standard way to deal with it as far as I know.
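As an alternative to writing the outer loop by hand, a minimal sketch using cross_validate with return_estimator=True keeps the fitted search object of every outer fold, so both the best hyperparameters and the feature importances can be read per fold. This is not the code from the post mentioned above. RandomForestRegressor stands in for LGBMRegressor and the data is synthetic, both assumptions made only so the example runs without lightGBM.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

# RandomForestRegressor replaces LGBMRegressor here (assumption).
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=13)),
])
params = {"regressor__max_depth": [3, 5, 7, 10]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = RandomizedSearchCV(pipeline, params, n_iter=3, cv=inner_cv,
                            scoring="neg_mean_absolute_error", random_state=0)

# return_estimator=True keeps the fitted search object of every outer fold.
results = cross_validate(search, X, y, cv=outer_cv,
                         scoring="neg_mean_absolute_error",
                         return_estimator=True)

# Best hyperparameters and feature importances, one entry per outer fold.
fold_params = [est.best_params_ for est in results["estimator"]]
fold_importances = np.array([
    est.best_estimator_.named_steps["regressor"].feature_importances_
    for est in results["estimator"]
])

print("Nested MAE: %.3f" % abs(results["test_score"].mean()))
for i, best in enumerate(fold_params):
    print(i, best, fold_importances[i].round(3))
```

The outer scores are the same kind of values cross_val_score would return, so the procedure is still nested cross-validation; the fitted searches are just kept alongside the scores.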
I am looking to use k-fold cross validation on a random forest regressor in Python. I understand that k refers to the number of folds in the dataset, but how can I adjust the test-set size? Say I wanted to split the data in ten different ways, but in each one I wanted the data split 50/50, how would I do this? Here is what I currently have:
import numpy as np
from sklearn.cross_validation import cross_val_predict
from sklearn.ensemble import RandomForestRegressor as rfr
<wasn't sure how to include the data as its a big file>
# BUILD RANDOM FOREST MODEL
rfmodel = rfr(n_estimators = 100, random_state = 0)
# make cross validated predictions
cv_preds = cross_val_predict(rfmodel, x, y, cv=10)
cv_preds = (np.around(cv_preds, 2))
I'm aware that k-fold is not necessarily needed for RF, however for the purposes of this project that is what I need to do.
EDIT: I'll try to re-articulate as I probably haven't described my problem well enough. Say I have 100 observations with k=5, instead of dividing observations into five equal-sized folds, training on k-1 and testing on the remaining fold, what I am looking to do is to randomly allocate the 100 observations into a 50/50 train-test split, run the model, and then re-allocate the 100 observations into another 50/50 split and run the model again. I would then do this 5 times.
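What the EDIT describes is repeated random sub-sampling rather than k-fold, and scikit-learn's ShuffleSplit does this. A minimal sketch follows, using the current sklearn.model_selection module; the synthetic data is an assumption standing in for the asker's file.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the 100 observations (assumption).
x, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
rfmodel = RandomForestRegressor(n_estimators=100, random_state=0)

# 5 independent random 50/50 train/test splits of the same 100 observations.
splitter = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
split_sizes = [(len(train), len(test)) for train, test in splitter.split(x)]
print(split_sizes)

# One score per split; the default scorer for a regressor is R^2.
scores = cross_val_score(rfmodel, x, y, cv=splitter)
print(np.around(scores, 2))
```

Note that cross_val_predict is not a drop-in here: it expects the test sets to form a partition of the data, and the random test sets of ShuffleSplit overlap, so cross_val_score (or a manual loop over splitter.split) is the safer choice.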
How can a metric computed with cross_val_score differ from the same metric computed from the output of cross_val_predict (that is, obtaining predictions and then passing them to a metric function)?
Here is an example:
from sklearn import cross_validation
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
gnb_clf = GaussianNB()
# compute mean accuracy with cross_val_predict
predicted = cross_validation.cross_val_predict(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvp = metrics.accuracy_score(iris.target, predicted)
# compute mean accuracy with cross_val_score
score_cvs = cross_validation.cross_val_score(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvs = score_cvs.mean()
print('Accuracy cvp: %0.8f\nAccuracy cvs: %0.8f' % (accuracy_cvp, accuracy_cvs))
In this case, we obtain the same result:
Accuracy cvp: 0.95333333
Accuracy cvs: 0.95333333
Nevertheless, this does not always seem to be the case, since the official documentation says (regarding a result computed using cross_val_predict):
Note that the result of this computation may be slightly different
from those obtained using cross_val_score as the elements are grouped
in different ways.
Imagine the following labels and split:
[010|101|10]
So you have 8 data points, 4 per class, and you split them into 3 folds, leading to 2 folds with 3 elements and one with 2. Now let us assume that during cross validation you get the following predictions:
[010|100|00]
Thus your per-fold scores are [100%, 67%, 50%], and the cross val score (as an average) is around 72%. Now what about accuracy over the predictions? You clearly have 6/8 things right, thus 75%. As you can see, the scores are different, even though they both rely on cross validation. Here the difference arises because the splits are not exactly the same size, so the last "50%" lowers the total score more than it should: it is an average over just 2 samples, while the others are based on 3.
Other, similar effects may exist; in general it boils down to how the averaging is computed. cross_val_score gives an average over per-fold averages, which does not have to equal the average over all cross-validated predictions.
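The arithmetic of the example above can be checked directly. This small sketch hard-codes the labels, predictions and fold boundaries from the example; no estimator is fitted.

```python
import numpy as np

# Labels and predictions from the example, folds of size 3, 3 and 2.
labels = np.array([0, 1, 0, 1, 0, 1, 1, 0])
preds = np.array([0, 1, 0, 1, 0, 0, 0, 0])
folds = [slice(0, 3), slice(3, 6), slice(6, 8)]

# "cross_val_score style": accuracy per fold, then the mean of those.
fold_acc = [np.mean(labels[f] == preds[f]) for f in folds]
mean_of_folds = np.mean(fold_acc)

# "cross_val_predict style": one accuracy over all pooled predictions.
pooled_acc = np.mean(labels == preds)

print(fold_acc)
print(round(mean_of_folds, 4), pooled_acc)
```

The mean of the fold accuracies is (1 + 2/3 + 1/2) / 3 = 13/18, about 72%, while the pooled accuracy is 6/8 = 75%. The two would coincide if all folds had the same size.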
In addition to lejlot's answer, another way that you might get slightly different results between cross_val_score and cross_val_predict is when the target classes are not distributed in a way that allows them to be evenly split between folds.
According to the documentation for cross_val_predict, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used by default. This may lead to a situation where even though the total number of instances in the dataset is divisible by the number of folds, you end up with folds of slightly different sizes, because the splitter is splitting based on the presence of the target. This can then lead to the issue where an average of averages is slightly different to an overall average.
For example, if you have 100 data points, and 33 of these are the target class, then KFold with n_splits=5 would split this into 5 folds of 20 observations, but StratifiedKFold would not necessarily give you equally-sized folds.