I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.
PARAMS = {
    'max_depth': [8, None],
    'n_estimators': [500, 1000]
}
rf = RandomForestClassifier()
clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
clf.fit(data, labels)
where data and labels are respectively the full dataset and the corresponding labels.
Now, I compared the performance returned by the GridSearchCV (from clf.grid_scores_) with a "manual" AUC estimation:
aucs = []
for fold in range(n_folds):
    train_data, train_labels = read_data(train_file_fold)
    test_data, test_labels = read_data(test_file_fold)
    clf = RandomForestClassifier(n_estimators=1000, max_depth=8)
    clf = clf.fit(train_data, train_labels)
    # keep only the probability of the positive class (column 1 of predict_proba)
    probabilities = clf.predict_proba(test_data)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(test_labels, probabilities, pos_label=1)
    fold_auc = metrics.auc(fpr, tpr)
    aucs.append(fold_auc)
performance = np.mean(aucs)
where I manually pre-split the data into training and test sets (the same 5-fold CV approach).
The AUC values returned by GridSearchCV are consistently higher than the ones calculated manually (e.g. 0.62 vs. 0.70) when using the same parameters for the RandomForest.
I know that different training/test splits can give different performance, but this occurred consistently over 100 repetitions of the GridSearchCV. Interestingly, if I use accuracy instead of roc_auc as the scoring metric, the difference in performance is minimal and can be attributed to the fact that I use different training and test sets. Is this happening because the AUC value of GridSearchCV is estimated in a different way than with metrics.roc_curve?
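For a like-for-like check, one option is to score the same fixed-parameter forest with the same 'roc_auc' scorer on a plain 5-fold split. The sketch below is standalone and only illustrative: it uses toy data from make_classification in place of data/labels, the current sklearn.model_selection import rather than the older grid_search module, and a smaller n_estimators purely to keep the toy run quick.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# toy data standing in for `data` and `labels`
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# the same fixed parameters as the manual loop (n_estimators reduced for speed)
rf = RandomForestClassifier(n_estimators=100, max_depth=8)

# 5-fold CV scored with the same 'roc_auc' scorer GridSearchCV uses
fold_aucs = cross_val_score(rf, X, y, cv=5, scoring='roc_auc')
print(fold_aucs, fold_aucs.mean())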
I am trying to find reliable hyperparameters for training a multiclass classifier, using both LightGBM's "gbdt" boosting and scikit-learn's GridSearchCV.
On the feature side of things there is a ~4k x 40 matrix containing continuous values.
On the label side there is a pool of 4 mutually exclusive categorical classes.
To judge whether any given fold is performing well I would like to use LightGBM's auc_mu metric, but I'm OK with any metric at this point. As you can see in the code below, I resorted to balanced accuracy instead.
Below is a simplified version of how the gridsearch is initialised.
param_set = {
    'n_estimators': [15, 25]
}
clf = lgb.LGBMModel(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=5,
    learning_rate=0.1,
    n_estimators=100,
    objective='multiclass',
    num_class=len(np.unique(training_data.label)),
    min_split_gain=0,
    min_child_weight=1e-3,
    min_child_samples=10,
    subsample=1,
    subsample_freq=0,
    colsample_bytree=0.6,
    reg_alpha=0.3,
    reg_lambda=0.7,
    random_state=42,
    n_jobs=2)

gsearch = GridSearchCV(estimator=clf,
                       param_grid=param_set,
                       scoring="balanced_accuracy",
                       error_score='raise',
                       n_jobs=2,
                       cv=5,
                       verbose=2)
When I try to call the fit function on the GridSearchCV object,
# separate total data into train/validation and test
stratifiedss = StratifiedShuffleSplit(
    n_splits=1, test_size=0.2, train_size=0.8, random_state=723)
for train_ind, test_ind in stratifiedss.split(X, y):
    train_feature_obs = X.loc[train_ind]
    train_labels = y[train_ind]
    validation_feature_obs = X.loc[test_ind]
    validation_labels = y[test_ind]

# transform data into lgb Dataset
training_data = lgb.Dataset(train_feature_obs, label=train_labels)

# call GridSearchCV.fit
lgb_model2 = gsearch.fit(training_data.data.reset_index(drop=True), training_data.label)
it returns
ValueError: Classification metrics can't handle a mix of unknown and continuous-multioutput targets
So I am guessing that sklearn's GridSearchCV has trouble evaluating the output of LGBMModel.predict().
I tried fitting an LGBMModel separately, and it returns an array with the probabilities of each observation belonging to each of the four classes, summing to 100%.
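A quick standalone check of that guess (toy data; only lgb.LGBMModel and its fit/predict are real API, everything else is illustrative) shows that with a multiclass objective, predict() returns an (n_samples, n_classes) probability matrix rather than class labels:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# toy 4-class problem standing in for the real ~4k x 40 data
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

model = lgb.LGBMModel(objective='multiclass', num_class=4, n_estimators=15)
model.fit(X, y)

preds = model.predict(X)
print(preds.shape)  # (200, 4): one probability per class, each row sums to ~1
If that is right, the scorer is being handed this probability matrix instead of predicted class labels, which would explain the "continuous-multioutput" part of the error message (LGBMClassifier.predict(), by contrast, returns class labels).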
I looked at:
ValueError: Classification metrics can't handle a mix of unknown and binary targets
I got the warning "UserWarning: One or more of the test scores are non-finite" when revising a toy scikit-learn gridsearchCV example
But that has not been conclusive yet.
How can I enable the sklearn.GridSearchCV to evaluate the performance of each fold of the lgbmModel classifier?
I am mostly confused as to where the "unknown" type is coming from.
Any help would be much appreciated.
Regards, Robert
I have a K nearest neighbour classifier which you can see below. From what I understand, the GridSearchCV is testing the model with different values of k between 1-20. When I do y_pred=knn_grid_cv.predict(x_test) I get a bunch of y predictions, but what value k (between 1-20) was used to obtain these y predictions? Would it be the highest scoring k value from the GridSearchCV?
x=football_df["Pace"].values.reshape(-1, 1)
print(x)
y=football_df["Position"].values.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.4,random_state=42)
param_grid={"n_neighbors":np.arange(1,20)}
knn = KNeighborsClassifier()
knn_grid_cv = GridSearchCV(knn, param_grid, cv=5)
knn_grid_cv.fit(x_train,y_train)
y_pred=knn_grid_cv.predict(x_test)
print(y_pred)
You are correct. The way you defined param_grid will test the performance of 19 different models (np.arange(1, 20) covers n_neighbors = 1 through 19), each with a different value for n_neighbors. The best model is chosen as the one with the highest average cross-validated score. In the case of a KNeighborsClassifier, the default score metric used is the mean accuracy.
In your case, that'd be the model with the highest mean accuracy across all five splits.
To see what value of n_neighbors was chosen, simply do:
# Option 1: print the parameters of the best classifier
print(knn_grid_cv.best_estimator_.get_params())
# Option 2: print results of all model combinations
import pandas as pd
res = pd.DataFrame(knn_grid_cv.cv_results_)
print(res)
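Since refit=True by default, knn_grid_cv.predict(x_test) uses a final model that was refit on all of x_train with that best-found value, so the predictions do come from the highest-scoring n_neighbors. If you only want the winning value and its cross-validated score (the printed values below are just illustrative):
# Option 3: just the best hyperparameter value and its mean CV accuracy
print(knn_grid_cv.best_params_)   # e.g. {'n_neighbors': 7}
print(knn_grid_cv.best_score_)    # mean cross-validated accuracy of that model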
I am using sklearn GridSearch to find best parameters for random forest classification using a predefined validation set. The scores from the best estimator returned by GridSearch do not match the scores obtained by training a separate classifier with the same parameters.
The data split definition
X = pd.concat([X_train, X_devel])
y = pd.concat([y_train, y_devel])
test_fold = -X.index.str.contains('train').astype(int)
ps = PredefinedSplit(test_fold)
The GridSearch definition
n_estimators = [10]
max_depth = [4]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestClassifier(random_state=0)
rf_grid = GridSearchCV(estimator = rf, param_grid = grid, cv = ps, scoring='recall_macro')
rf_grid.fit(X, y)
The classifier definition
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
The recall was calculated explicitly using sklearn.metrics.recall_score
y_pred_train = clf.predict(X_train)
y_pred_devel = clf.predict(X_devel)
uar_train = recall_score(y_train, y_pred_train, average='macro')
uar_devel = recall_score(y_devel, y_pred_devel, average='macro')
GridSearch:
uar train: 0.32189884516029466
uar devel: 0.3328299259976279
Random Forest:
uar train: 0.483040291148839
uar devel: 0.40706644557392435
What is the reason for such a mismatch?
There are multiple issues here:
Your input arguments to recall_score are reversed. The correct order is:
recall_score(y_true, y_pred)
But you are doing:
recall_score(y_pred_train, y_train, average='macro')
Correct that to:
recall_score(y_train, y_pred_train, average='macro')
You are doing rf_grid.fit(X, y) for the grid search. That means that after finding the best parameter combination, GridSearchCV refits the estimator on the whole X (the PredefinedSplit is only used during cross-validation to search for the best parameters). So in essence, the estimator returned by GridSearchCV has seen the whole data, and its scores will differ from what you get when you do clf.fit(X_train, y_train).
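A quick way to see this (a sketch, assuming the rf_grid and the data frames from the question): the score the search reports comes from a model trained on the train fold only, while best_estimator_ has since been refit on all of X.
from sklearn.metrics import recall_score

# recall_macro on the devel fold, from a model trained on the train fold only
print(rf_grid.cv_results_['mean_test_score'])   # with one parameter combination, equal to rf_grid.best_score_

# best_estimator_ was refit on all of X, so these devel predictions are optimistic
y_pred_devel = rf_grid.best_estimator_.predict(X_devel)
print(recall_score(y_devel, y_pred_devel, average='macro'))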
It's because in your GridSearchCV you are using 'recall_macro' as the scoring function, which returns the macro-averaged recall score. See this link.
However, the default score returned by your RandomForestClassifier is the mean accuracy. So that is why the scores are different (one is recall and the other is accuracy). See this link for info on the same.
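In other words (a small sketch, assuming the fitted clf and the devel split from the question): clf.score reports mean accuracy, while the search's 'recall_macro' scorer reports macro-averaged recall, and the two will generally differ.
from sklearn.metrics import recall_score

print(clf.score(X_devel, y_devel))  # RandomForestClassifier's default score = mean accuracy
print(recall_score(y_devel, clf.predict(X_devel), average='macro'))  # what 'recall_macro' computes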
I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet the aforementioned criteria.
Currently, my code is following:
clf = LogisticRegression(penalty='l2', class_weight='balanced')
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross validation
cv = StratifiedKFold(n_splits=2)
i = 0
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1

print("Average AUC: ", sum(aucs) / len(aucs), "AUC: ", aucs[-1])
Since I am using just 2 splits, is it considered as if I were using a 50:50 train-test split? Or should I first split the data 50:50, then use cross-validation on the training part, and finally use that model to test the remaining 50% of the data?
You should implement your second suggestion.
Cross-validation should be used to tune the parameters of your approach. Among others, such parameters in your example are the value of the C parameter and the class_weight='balanced' of Logistic Regression. So you should:
Split into 50% training and 50% test data
Use the training data to select the optimal values of your model's parameters with cross-validation
Refit the model with the optimal parameters on the training data
Predict on the test data and report the score of the evaluation measure you selected
Note that you should use the test data only for reporting the final score and not for tuning the model, otherwise you would be cheating. Imagine that in reality you might not have access to the test data until the last moment, so you cannot use it.
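A minimal sketch of that workflow (standalone toy data; the C grid, the 'roc_auc' scorer and the parameter values are only examples): split 50/50, tune with cross-validation on the training half only, then report a single score on the untouched test half.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, random_state=0)

# 1) 50% training, 50% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# 2) tune on the training half with cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10]}
clf = LogisticRegression(penalty='l2', class_weight='balanced', max_iter=1000)
search = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc')

# 3) refit=True (the default) retrains the best model on all of X_train
search.fit(X_train, y_train)

# 4) report the score on the untouched test half
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, test_auc)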
I have an imbalanced dataset for a binary classification problem. I have built a Random Forest classifier and used k-fold cross-validation with 10 folds.
kfold = model_selection.KFold(n_splits=10, random_state=42)
model=RandomForestClassifier(n_estimators=50)
I got the results of the 10 folds
results = model_selection.cross_val_score(model,features,labels, cv=kfold)
print results
[ 0.60666667 0.60333333 0.52333333 0.73 0.75333333 0.72 0.7
0.73 0.83666667 0.88666667]
I have calculated accuracy by taking mean and standard deviation of the results
print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)
Accuracy: 70.900% (10.345%)
I have computed my predictions as follows
predictions = cross_val_predict(model, features,labels ,cv=10)
Since this is an imbalanced dataset, I would like to calculate the precision, recall, and f1 score of each fold and average the results.
How to calculate the values in python?
When you use the cross_validate method (cross_val_score only accepts a single scorer), you can specify which scorings to calculate on each fold:
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score)}

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)  # shuffle=True is needed when passing random_state
model = RandomForestClassifier(n_estimators=50)

# cross_validate (unlike cross_val_score) accepts a dict of scorers
results = model_selection.cross_validate(estimator=model,
                                         X=features,
                                         y=labels,
                                         cv=kfold,
                                         scoring=scoring)
After cross-validation you get a results dictionary whose keys include 'test_accuracy', 'test_precision', 'test_recall' and 'test_f1_score', each holding that metric's value on every fold. For each metric you can then compute the mean and standard deviation with np.mean(results['test_<name>']) and np.std(results['test_<name>']), where <name> is one of your specified metric names.
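For example, to average each metric over the folds (assuming the results dict from the call above):
for name in ['accuracy', 'precision', 'recall', 'f1_score']:
    scores = results['test_' + name]
    print("%s: %.3f%% (%.3f%%)" % (name, scores.mean() * 100, scores.std() * 100))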
All of the scores you mentioned (accuracy, precision, recall, f1) rely on the classification threshold you (manually) set; if you don't specify one, the default threshold is 0.5.
The threshold should always be set according to the cost of misclassification (if no cost is given, you should make an assumption).
In order to be able to compare different models or hyperparameters, you might consider using the Area Under Curve (AUC) of the Precision-Recall curve, since it is independent of the threshold (it summarises precision and recall over all thresholds). In your specific case of imbalanced data, the PR-AUC is more appropriate than the AUC for the ROC.
See also here: https://datascience.stackexchange.com/a/96708/131238
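A minimal sketch of that comparison (assuming the model, features and labels from the question; 'average_precision' is scikit-learn's threshold-independent precision-recall summary, commonly used as a stand-in for PR-AUC):
from sklearn import model_selection

pr_aucs = model_selection.cross_val_score(model, features, labels,
                                          cv=10, scoring='average_precision')
roc_aucs = model_selection.cross_val_score(model, features, labels,
                                           cv=10, scoring='roc_auc')
print("PR-AUC:  %.3f (+/- %.3f)" % (pr_aucs.mean(), pr_aucs.std()))
print("ROC-AUC: %.3f (+/- %.3f)" % (roc_aucs.mean(), roc_aucs.std()))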