I'm trying to run a simple RandomForestClassifier() on a largish dataset. I typically first evaluate with a single train_test_split, and then start using cross_val_score.
In this case, though, I get very different results from these two approaches, and I can't figure out why. My understanding is that these two snippets should do exactly the same thing:
cfc = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(cfc, X, y,
                         cv=ShuffleSplit(len(X), 1, 0.25),
                         scoring='roc_auc')
print(scores)
>>> [ 0.88482262]
and this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
cfc = RandomForestClassifier(n_estimators=50)
cfc.fit(X_train, y_train)
roc_auc_score(y_test, cfc.predict(X_test))
>>> 0.57733474562203269
And yet the scores are wildly different. (These scores are representative; I observed the same behavior across many, many runs.)
Any idea why this might be? I am tempted to trust the cross_val_score result, but I want to be sure that I am not messing up somewhere.
**Update**
I noticed that when I reverse the order of the arguments to roc_auc_score, I get a similar result:
roc_auc_score(cfc.predict(X_test), y_test)
But the documentation explicitly states that the first argument should be the true values, and the second one the target scores.
I'm not sure what the issue is, but here are two things you could try:
ROC AUC needs prediction probabilities for proper evaluation, not hard predictions (i.e. 0 or 1). scoring='roc_auc' inside cross_val_score already works with probabilities, so change your manual evaluation to do the same (see the sketch below). You can check the first answer at this link for more details.
Compare this with roc_auc_score(y_test, cfc.predict_proba(X_test)[:,1])
As xysmas said, try setting a random_state for both approaches (e.g. on the ShuffleSplit/train_test_split and on the classifier) so the runs are comparable.
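A minimal sketch of that comparison (assuming X and y are defined as in the question; the random_state values are only there to make the runs reproducible):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
cfc = RandomForestClassifier(n_estimators=50, random_state=0)
cfc.fit(X_train, y_train)

# Hard 0/1 labels: this is what produced the low score above.
print(roc_auc_score(y_test, cfc.predict(X_test)))

# Probability of the positive class: this should land much closer to the
# cross_val_score result, since scoring='roc_auc' also ranks by probabilities.
print(roc_auc_score(y_test, cfc.predict_proba(X_test)[:, 1]))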
Related
I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). I noted from different blog posts that I should use cross_val_score with the original X, y values, not X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform really well, with a mean R² of only 0.241. Is this the correct interpretation?
However, I came across a Kaggle notebook working on the same data where the author ran cross_val_score on X_train and y_train. I gave this a try and the average R² was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this approach a problem?
Here is the code for my model:
X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pipe = Pipeline(
    steps=[('poly_feature', PolynomialFeatures(degree=2)),
           ('model', LinearRegression())]
)
## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score trains 10 models, each with 90% of your data as the training set and 10% as the validation set. We can see from these results that the estimator's R² variance is quite high: sometimes the model performs even worse than just predicting the mean (negative R²).
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the result obtained by running cross_val_score on the train set only is higher, but that score is most likely not representative of your model's performance, because the data it sees might be too small to capture all of its variance. (The train set inside the second cross_val_score is only 54% of your dataset: 90% of the 60% kept by train_test_split.)
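As a rough sketch of how to look at that spread directly (reusing pipe, X, and y from the question; the shuffled KFold is just one reasonable choice):
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # shuffle in case the rows are ordered
r_squareds = cross_val_score(pipe, X, y, cv=cv)

# Report the spread alongside the mean: a large standard deviation means the
# mean alone is not a stable summary of model performance.
print(r_squareds.mean(), r_squareds.std())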
I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups: training, validation, and test.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
                                                    random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                                  random_state=42, shuffle=True)
So in essence, training size ~1100, validation and test size ~350 each, and each subset has a unique set of data points not seen in the other subsets.
With these subsets, I can perform a fit using any of the regression models available in scikit-learn, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this, I then calculate the RMSE of the predictions, which in the case of the linear regressor is about 0.948.
Now, I could instead use cross-validation and not worry about splitting the data at all, using the following routine:
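A minimal sketch of how that RMSE can be computed (this step is not shown in the question; mean_squared_error is simply the usual choice, and predictions/y_test come from the snippet above):
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(rmse)  # reported as roughly 0.948 above for the linear regressor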
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about 2.4! To compare, I tried a similar routine, but swapped X for X_train and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received an RMSE of about 0.956.
I really do not understand why, when using the entire data set, the cross-validation RMSE is so much higher and the predictions are so much worse than with the reduced data set.
Additional Notes
Additionally, I have tried running the above routine using the reduced subset X_val, y_val as input for the cross-validation, and I still get a small RMSE. And when I simply fit a model on the reduced subset X_val, y_val and then make predictions on X_train, y_train, the RMSE is still better (lower) than the cross-validation RMSE!
This does not only happen for LinearRegression, but also for RandomForestRegressor and others. I have additionally tried changing the random state in the splitting, as well as completely shuffling the data before handing it to train_test_split, but the same outcome occurs.
Edit 1.)
I tested this on a make_regression data set from scikit-learn and did not get the same results; rather, all the RMSEs were small and similar. My guess is that it has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
Edit 2.)
Thank you (@desertnaut) for the suggestions. The solution was actually quite easy: in my routine to process the data I was using (targets, inputs) = (X, y), which is really wrong. I swapped that to (targets, inputs) = (y, X), and now the RMSE is about the same as the other profiles. I made a histogram profile of the data and found the problem. Thanks! I'll keep the question up for about an hour, then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters: the RMSE would be zero because the model could fit the data perfectly. Now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the data you are adding, of course), because your model no longer fits the data perfectly.
RMSE being low (or R-squared high) more often than not doesn't mean much on its own; you need to consider the standard errors of your parameter estimates. If you are just increasing the number of parameters (or, conversely, in your case decreasing the number of observations), you are just chewing away your degrees of freedom.
I'd wager that the standard errors of the parameter estimates in the X model are smaller than those in the X_train model, even though the RMSE is "lower" in the X_train model.
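A toy illustration of that point (purely synthetic data, not the questioner's set):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
for n_samples in (10, 100):
    X_demo = rng.randn(n_samples, 10)  # 10 features (+ intercept), so the 10-sample fit is exact
    y_demo = rng.randn(n_samples)
    fit = LinearRegression().fit(X_demo, y_demo)
    rmse = np.sqrt(mean_squared_error(y_demo, fit.predict(X_demo)))
    print(n_samples, rmse)  # essentially 0 for 10 samples, clearly above 0 for 100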
Edit: I'll add that your dataset exhibits high multicollinearity.
The documentation of best_params_ in GridSearchCV states:
best_params_ : dict
Parameter setting that gave the best results on the hold out data.
From that, I assumed "best results" means best score (highest accuracy / lowest error) and lowest variance over my k-folds.
However, this is not the case, as we can see in cv_results_:
Here best_params_ returns k=5 instead of k=9, where mean_test_score and the variance would be optimal.
I know I can implement my own scoring function or my own best_param function using the output of cv_results_. But what is the rationale behind not taking the variance into account in the first place?
I ran into this situation by applying KNN to the iris dataset with a 70% train split and 3-fold cross-validation.
Edit: Example code:
import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn import model_selection
from sklearn import datasets
X = datasets.load_iris().data
y = datasets.load_iris().target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=62)
knn_model = neighbors.KNeighborsClassifier()
param_grid = [{"n_neighbors" : np.arange(1, 31, 2)}]
grid_search = model_selection.GridSearchCV(knn_model, param_grid, cv=3, return_train_score=False)
grid_search.fit(X_train, y_train.ravel())
results = pd.DataFrame(grid_search.cv_results_)
k_opt = grid_search.best_params_.get("n_neighbors")
print("Value returned by best_param_:",k_opt)
results.head(6)
It results in a different table than the image above, but the situation is the same: for k=5, mean_test_score and std_test_score are optimal. However, best_params_ returns k=1.
From the GridSearchCV source
# Find the best parameters by comparing on the mean validation score:
# note that `sorted` is deterministic in the way it breaks ties
best = sorted(grid_scores, key=lambda x: x.mean_validation_score,
              reverse=True)[0]
It sorts by mean validation score and that's it. sorted() preserves the existing order for ties, so in this case k=1 comes out on top.
I agree with your thoughts and think a PR could be submitted to have better tie breaking logic.
In GridSearchCV, cv_results_ provides std_test_score, which is the standard deviation of the score. From this you can calculate the variance by squaring it.
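A minimal sketch of using that information (reusing grid_search from the example above; ranking by mean minus one standard deviation is just one possible variance-aware criterion, not scikit-learn's behaviour):
import numpy as np

res = grid_search.cv_results_
mean = res["mean_test_score"]
std = res["std_test_score"]
variance = std ** 2  # the variance mentioned above

# Prefer parameters that score well even after penalising instability across folds.
best_idx = int(np.argmax(mean - std))
print(res["params"][best_idx])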
I am doing some text mining/classification and am attempting to evaluate performance with the precision_recall_fscore_support function from the sklearn.metrics module. I am not sure how to create a really small example reproducing the problem, but maybe somebody can help even so, because it is probably something obvious I am missing.
The aforementioned function returns among other things the support for each class. The documentation states
support: int (if average is not None) or array of int, shape = [n_unique_labels] :
The number of occurrences of each label in y_true.
But in my case, the number of classes for which support is returned is not the same as the number of different classes in the testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
classifier = svm.SVC(kernel="linear")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
prec, rec, fbeta, supp = precision_recall_fscore_support(y_test, y_pred)
print(len(classifier.classes_)) # prints 18
print(len(supp)) # prints 19
print(len(np.unique(y_test))) # prints 18
How can this be? How can there be support for a class which is not in the data?
I am not sure what the problem is, but in my case there seems to be a mismatch between the classes learned by the classifier and the ones occurring in the test data. One can force the function to compute the performance measures for the right classes by explicitly naming them.
prec, rec, fbeta, supp = precision_recall_fscore_support(y_test, y_pred, labels=classifier.classes_)
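A minimal sketch for pinpointing the extra label (y_test, y_pred, and classifier as in the question). With labels=None the function scores the union of labels seen in y_true and y_pred, so a class the classifier predicts but that never occurs in y_test still gets a row (with zero support):
import numpy as np

# Classes the classifier knows about but that are missing from this test split.
print(set(classifier.classes_) - set(np.unique(y_test)))

# Classes actually predicted on X_test but missing from y_test.
print(set(np.unique(y_pred)) - set(np.unique(y_test)))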
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y,
    test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
The code in the answer above (Schneider) is outdated. As of scikit-learn==0.19.1, the following works as expected:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

regressor = LinearRegression()  # any scikit-learn estimator works here
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)