I am learning Machine learning and I am having this doubt. Can anyone tell me what is the difference between:-
from sklearn.model_selection import cross_val_score
and
from sklearn.model_selection import KFold
I think both are used for k fold cross validation, but I am not sure why to use two different code for same function.
If there is something I am missing please do let me know. ( If possible please explain difference between these two methods)
Thanks,
cross_val_score is a function which evaluates a data and returns the score.
On the other hand, KFold is a class, which lets you to split your data to K folds.
So, these are completely different. Yo can make K fold of data and use it on cross validation like this:
# create a splitter object
kfold = KFold(n_splits = 10)
# define your model (any model)
model = XGBRegressor(**params)
# pass your model and KFold object to cross_val_score
# to fit and get the mse of each fold of data
cv_score = cross_val_score(model,
X, y,
cv=kfold,
scoring='neg_root_mean_squared_error')
print(cv_score.mean(), cv_score.std())
cross_val_score evaluates the score using cross validation by randomly splitting the training sets into distinct subsets called folds, then it trains and evaluated the model on the folds, picking a different fold for evaluation every time and training on the other folds.
cv_score = cross_val_score(model, data, target, scoring, cv)
KFold procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
cv_score = cross_val_score(model, data, target, scoring, cv=cv)
where model is your model on which you want to evaluate,
data is training data,
target is target variable,
scoring parameter controls what metric applied to the estimator applied and cv is the number of splits.
Related
Exploring some classification models in Scikit learn I noticed that the scores I got for log loss and for ROC AUC were consistently lower while performing cross validation than while fitting and predicting on the whole training set (done to check for overfitting), thing that did not make sense to me.
Specifically, using cross_validate I set the scorings as ['neg_log_loss', 'roc_auc'] and while performing manual fitting and prediction on the training set I used the metric functions log_loss' and roc_auc_score.
To try to figure out what was happening, i wrote a code to perform the cross validation manually in order to be able to call the metric functions manually on the various folds and compare the results with the ones from cross_validate. As you can see below, I got different results even like this!
from sklearn.model_selection import StratifiedKFold
kf = KFold(n_splits=3, random_state=42, shuffle=True)
log_reg = LogisticRegression(max_iter=1000)
for train_index, test_index in kf.split(dataset, dataset_labels):
X_train, X_test = dataset[train_index], dataset[test_index]
y_train, y_test = dataset_labels_np[train_index], dataset_labels_np[test_index]
log_reg.fit(X_train, y_train)
pr = log_reg.predict(X_test)
ll = log_loss(y_test, pr)
print(ll)
from sklearn.model_selection import cross_val_score
cv_ll = cross_val_score(log_reg, dataset_prepared_stand, dataset_labels, scoring='neg_log_loss',
cv=KFold(n_splits=3, random_state=42, shuffle=True))
print(abs(cv_ll))
Outputs:
4.795481869275026
4.560119170517534
5.589818973403791
[0.409817 0.32309 0.398375]
The output running the same code for ROC AUC are:
0.8609669592272686
0.8678563239907938
0.8367147503682851
[0.925635 0.94032 0.910885]
To be sure to have written the code right, I also tried the code using 'accuracy' as scoring for cross validation and accuracy_score as metric function and the results are instead consistent:
0.8611584327086882
0.8679727427597955
0.838160136286201
[0.861158 0.867973 0.83816 ]
Can someone explain me why the results in the case of the log loss and the ROC AUC are different? Thanks!
Log-loss and auROC both need probability predictions, not the hard class predictions. So change
pr = log_reg.predict(X_test)
to
pr = log_reg.predict_proba(X_test)[:, 1]
(the subscripting is to grab the probabilities for the positive class, and assumes you're doing binary classification).
I'm training a dataset and then testing it on some other dataset.
To improve performance, I wanted to fine-tune my parameters with a 5-fold cross validation.
However, I think I'm not writing the correct code as when I try to fit the model to my testing set, it says it hasn't fit it yet. I though the cross-validation part fitted the model? Or maybe I have to extract it?
Here's my code:
svm = SVC(kernel='rbf', probability=True, random_state=42)
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
That is correct, the cross_validate_score doesn't return a fitted model. In your example, you have cv=5 which means that the model was fit 5 times. So, which of those do you want? The last?
The function cross_val_score is a simpler version of the sklearn.model_selection.cross_validate. Which doesn't only return the scores, but more information.
So you can do something like this:
from sklearn.model_selection import cross_validate
svm = SVC(kernel='rbf', probability=True, random_state=42)
cv_results = cross_validate(svm, data_train, lbs_train, cv=5, return_estimator=True)
# cv_results is a dict with the following keys:
# 'test_score' which is what cross_val_score returns
# 'train_score'
# 'fit_time'
# 'score_time'
# 'estimator' which is a tuple of size cv and only if return_estimator=True
accuracies = cv_results['test_score'] # what you had before
svms = cv_results['estimator']
print(len(svms)) # 5
svm = svms[-1] # the last fitted svm, or pick any that you want
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
Note, here you need to pick one of the 5 fitted SVMs. Ideally, you would use cross-validation for testing the performance of your model. So, you don't need to do it again at the end. Then, you would fit your model one more time, but this time with ALL the data which would be the model you will actually use in production.
Another note, you mentioned that you want this to fine tune the parameters of your model. Perhaps you should look at hyper-parameter optimization. For example: https://datascience.stackexchange.com/a/36087/54395 here you will see how to use cross-validation and define a parameter search space.
I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross validation and have read I should do this before upsampling, a) should I do this/is it likely to help and b) how can I incorporate this into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN : I am far from being an expert with those but by upsampling your dataset you might amplify existing noise which induce overfitting. This could explain the fact that your algorithm cannot correctly classify, thus giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample you seem to encounter the same overfitting problem as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view on the data though.
Your overfitting problem might come from the number of features you use that could add unnecessary noise. You might try to reduce the number of features you use and gradually increase it (using a RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
For the models you used, you mention Random Forest and XGBoost, but you did not mention having used simpler model. You could try simpler model and focus on you data engineering.
If you have not try it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
# Define steps of the pipeline
steps = [('scaler', StandardScaler()),
('log_reg', LogisticRegression())]
pipeline = Pipeline(steps)
# Specify the hyperparameters
parameters = {'C':[1, 10, 100],
'penalty':['l1', 'l2']}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could be (I made it with Logistic Regression but you can change it with another ML algorithm and change the parameters grid consequently):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class = 'auto')
sme = SMOTEENN(smote = SMOTE(k_neighbors = 2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid = param_grid, score = "f1")
pipeline = Pipeline([('scale', StandardScaler()),
('SMOTEENN', sme),
('grid', grid)])
cv = StratifiedKFold(n_splits = 4, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added score = "f1" in the GridSearchCV)
I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV or second, getting the best parameters from RandomizedSearchCV and then doing again a 10-fold cross-validation (with the same seed so that I get the same splits) with this best parameters.
Edit: Is the following code correct to get the probabilities of the best performing model? X is the training data and y are the labels and model is my RandomizedSearchCV containing a Pipeline with imputing missing values, standardization and SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_
for train, test in cv_outer.split(X, y):
probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
Example:
import numpy
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
train_indices, test_indices = k_fold.split(X)[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train) # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself but should work with minor modifications.
You need to look in cv_results_ this will give you the scores, and mean scores for all of your folds, along with a mean, fitting time etc...
If you want to predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of then, then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
best_params_ will give you the best fit parameters, for if you want to train a model just using the best parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
I'm wondering, what is the difference between the default cross validation that is implemented in GridSearchCV method in sklearn, and the Kfold method used with it, like in the following code:
without using Kfold:
clf = GridSearchCV(estimator=model, param_grid=parameters, cv=10, scoring='f1_macro')
clf = clf.fit(xOri, yOri)
with Kfold:
NUM_TRIALS = 5
for i in range(NUM_TRIALS):
cv = KFold(n_splits=10, shuffle=True, random_state=i)
clf = GridSearchCV(estimator=model, param_grid=parameters, cv=cv, scoring='f1_macro')
clf = clf.fit(xOri, yOri)
As I understood from the manual, is that both of them split the data into 10 parts, 9 for training and 1 for validation, but in the example that uses Kfold .. it does the sampling process 5 times (NUM_TRIALS = 5) and each time the data is shuffled before splitting into 10 parts. Am I right?
Looks like you're right, ish.
Either KFold or StratifiedKFold are used by GridSearchCV depending if your model is for regression (KFold) or classification (then StratifiedKFold is used).
Since I don't know what your data is like I can't be sure what is being used in this situation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
But the code you have above will repeat the KFold validation 5 times with different random seeds.
Whether that will produe meaningfully different splits of the data? Not sure.