When I use the following code with Data matrix X of size (952,144) and output vector y of size (952), mean_squared_error metric returns negative values, which is unexpected. Do you have any idea?
from sklearn.svm import SVR
from sklearn import cross_validation as CV
reg = SVR(C=1., epsilon=0.1, kernel='rbf')
scores = CV.cross_val_score(reg, X, y, cv=10, scoring='mean_squared_error')
all values in scores are then negative.
Trying to close this out, so am providing the answer that David and larsmans have eloquently described in the comments section:
Yes, this is supposed to happen. The actual MSE is simply the positive version of the number you're getting.
The unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.
This is also described in sklearn GridSearchCV with Pipeline.
You can fix it by changing scoring method to "neg_mean_squared_error" as you can see below:
from sklearn.svm import SVR
from sklearn import cross_validation as CV
reg = SVR(C=1., epsilon=0.1, kernel='rbf')
scores = CV.cross_val_score(reg, X, y, cv=10, scoring='neg_mean_squared_error')
To see what are available scoring keys use:
import sklearn
print(sklearn.metrics.SCORERS.keys())
You can either use 'r2' or 'neg_mean_squared_error'. There are lots of options based on your requirement.
Related
I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination of a carried out grid search.
For the grid search, I used the GridSearchCV class and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same split and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same as the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
cv=StratifiedKFold(n_splits=10, random_state=2018),
param_grid={'nu': [0.4]},
scoring=['f1_macro'],
refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])
# Recompute score for exact same input
result = cross_validate(clf,
X,
y,
cv=StratifiedKFold(n_splits=10, random_state=2018),
scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
It is because the mean_test_f1_macro is not a simple average of all combination of folds, it is a weight average, with weights being the size of the test fold. To know more about the actual implementation of refer this answer.
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights= [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X,y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
In the sklearn.model_selection.cross_val_predict page it is stated:
Generate cross-validated estimates for each input data point. It is
not appropriate to pass these predictions into an evaluation metric.
Can someone explain what does it mean? If this gives estimate of Y (y prediction) for every Y (true Y), why can't I calculate metrics such as RMSE or coefficient of determination using these results?
It seems to be based on how samples are grouped and predicted. From the user guide linked in the cross_val_predict docs:
Warning Note on inappropriate usage of cross_val_predict
The result of
cross_val_predict may be different from those obtained using
cross_val_score as the elements are grouped in different ways. The
function cross_val_score takes an average over cross-validation folds,
whereas cross_val_predict simply returns the labels (or probabilities)
from several distinct models undistinguished. Thus, cross_val_predict
is not an appropriate measure of generalisation error.
The cross_val_score seems to say that it averages across all of the folds, while the cross_val_predict groups individual folds and distinct models but not all and therefore it won't necessarily generalize as well. For example, using the sample code from the sklearn page:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=3)
print("Cross Val Prediction score:{}".format(mean_squared_error(y,y_pred)))
print("Cross Val Score:{}".format(np.mean(cross_val_score(lasso, X, y, cv=3, scoring = make_scorer(mean_squared_error)))))
Cross Val Prediction score:3993.771257795029
Cross Val Score:3997.1789145156217
Just to add a little more clarity, it is easier to understand the difference if you consider a non-linear scoring function such as Maximum-Absolute-Error instead of something like a mean-absolute error.
cross_val_score() would compute the maximum-absolute-error on each off the 3-folds (assuming 3 fold cross-validator) and report the aggregate (say mean?) over 3 such scores. That is, something like mean of (a, b, c) where a , b, c are the max-abs-errors for the 3 folds respectively. I guess it is safe to conclude the returned value as the max-absolute-error of your estimator in the average or general case.
with cross_val_predict() you would get 3-sets of predictions corresponding to 3-folds and taking the maximum-absolute-error over the aggregate (concatenation) of these 3-sets of predictions is certainly not the same as above. Even if the predicted values are identical in both the scenarios, what you end up with here is max of (a, b,c ). Also, max(a,b,c) would be an unreasonable and overly pessimistic characterization of the max-absolute-error score of your model.
The documentation of best_param_ in GridSearchCV states:
best_params_ : dict
Parameter setting that gave the best results on the hold out data.
From that, I assumed "best results" means best score (highest accuracy / lowest error) and lowest variance over my k-folds.
However, this is not case as we can see in cv_results_:
Here best_param_ returns k=5 instead of k=9 where mean_test_score and the variance would be optimal.
I know I can implement my own scoring function or my own best_param function using the output of cv_results_. But what is the rationale behind not taking the variance into account in the first place?
I ran in that situation by applying KNN to iris dataset with 70% train split and a 3-fold cross-validation.
Edit: Example code:
import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn import model_selection
from sklearn import datasets
X = datasets.load_iris().data
y = datasets.load_iris().target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=62)
knn_model = neighbors.KNeighborsClassifier()
param_grid = [{"n_neighbors" : np.arange(1, 31, 2)}]
grid_search = model_selection.GridSearchCV(knn_model, param_grid, cv=3, return_train_score=False)
grid_search.fit(X_train, y_train.ravel())
results = pd.DataFrame(grid_search.cv_results_)
k_opt = grid_search.best_params_.get("n_neighbors")
print("Value returned by best_param_:",k_opt)
results.head(6)
It results in a different table than the image above, but the situation is the same: for k=5 mean_test_score and std_test_score are optimal. However best_param_ returns k=1.
From the GridSearchCV source
# Find the best parameters by comparing on the mean validation score:
# note that `sorted` is deterministic in the way it breaks ties
best = sorted(grid_scores, key=lambda x: x.mean_validation_score,
reverse=True)[0]
It sorts by mean_val score and that's it. sorted() preserves the existing order for ties, so in this case k=1 is best.
I agree with your thoughts and think a PR could be submitted to have better tie breaking logic.
In Grid Search ,cv_results_ provides std_test_score which is standard deviation of score. From this you can calculate variance error by squaring it
I want to score different classifiers with different parameters.
For speedup on LogisticRegression I use LogisticRegressionCV (which at least 2x faster) and plan use GridSearchCV for others.
But problem while it give me equal C parameters, but not the AUC ROC scoring.
I'll try fix many parameters like scorer, random_state, solver, max_iter, tol...
Please look at example (real data have no mater):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
'C': np.power(10.0, np.arange(-10, 10))
, 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
Solver newton-cg used just to provide fixed value, other tried too.
What I forgot?
P.S. In both cases I also got warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')" which I can't understand too. I'll be happy if someone also describe what it mean, but I hope it is not relevant to my main question.
EDIT UPDATES
By #joeln comment add max_iter=10000 and tol=10 parameters too. It does not change result in any digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
I have a matrix with 20 columns. The last column are 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:
using sklearn.cross_validation.cross_val_score
using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:18]
y = data.iloc[:,19]
depth = 5
maxFeat = 3
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False), X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295, 0.58824739])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print auc #something like 0.83
RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print auc #also something like 0.83
My question is:
why am I getting different results, ie, why is the AUC (the metric I am using) higher when I use train_test_split?
Note:
When I using more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
Thanks for your help!
When using cross_val_score, you'll frequently want to use a KFolds or StratifiedKFolds iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score will not randomize your data, which can produce odd results like this if you're data isn't random to begin with.
The KFolds iterator has a random state parameter:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like what you described are usually a result of a lack of randomnesss in the train/test set.
The answer is what #KCzar pointed. Just want to note the easiest way I found to randomize data(X and y with the same index shuffling) is as following:
p = np.random.permutation(len(X))
X, y = X[p], y[p]
source: Better way to shuffle two numpy arrays in unison