Scoring metric for multi-class cross validation - python

I have a DataFrame X with a column called target that holds 10 different labels: [0,1,2,3,4,5,6,7,8,9]. I have a machine learning model, say model = AdaBoostClassifier(), that I would like to fit to the data and then use to predict the labels, training it with cross-validation. I use two metrics for the cross-validation, accuracy and neg_mean_squared_error, to evaluate the performance and compute the ratio neg_mean_squared_error/accuracy. The code looks like this:
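import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# random_state (not a `seed` attribute) is what makes the model reproducible
model = AdaBoostClassifier(random_state=42)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = ('accuracy', 'neg_mean_squared_error')
scores = cross_validate(model, X.drop('target', axis=1), X['target'], cv=outer_cv, n_jobs=-1, scoring=scoring)
# Ratio of the RMSE to the mean accuracy over the folds
scores = np.sqrt(-np.mean(scores['test_neg_mean_squared_error'])) / np.mean(scores['test_accuracy'])
score_description = [model, model.__class__.__name__, "%0.5f" % scores]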
However, whenever I run this, I get the following error message:
ValueError: Samplewise metrics are not available outside of multilabel classification.
How can I solve this issue with the metrics and perform the classification? Which metrics could I use to evaluate the performance of the model in the multi-class case?

Related

What is the difference between grid.score(X_valid, y_valid) and grid.best_score_

While doing GridSearchCV, what is the difference between the scores obtained through grid.score(...) and grid.best_score_?
Assume that a model, features, target, and param_grid are in place. Here is the part of the code I am curious about.
grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                    return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
scores = grid.score(X_valid, y_valid)
best_score_1 = scores
best_score_2 = grid.best_score_
best_score_1 and best_score_2 give two different outputs.
I would like to understand the difference between the two, and which of them should be considered the best score obtained from the given param_grid.
The full function follows.
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides
    comparison between train's MSE and test's MSE to check for
    Generalization and optionally deploys the best-found parameters
    on the Test Set as well.
    Args:
        df: DataFrame
        model: a model to use
        features: features to consider
        target: labels
        params: Param_Grid for Optimization
        test: False by Default, if True, predicts on Test
    Returns:
        MSE scores on models and slice from the cv_results_
        to compare the models generalization performance
    '''
    my_model = model()
    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                        df[target], random_state=0)
    # Resplit the train dataset for GridSearchCV into train2 and valid to keep the test set separate
    X_train2, X_valid, y_train2, y_valid = train_test_split(X_train,
                                                            y_train, random_state=0)
    # Use Grid Search to find the best parameters from the param_grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True, scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)
    # Evaluate on Valid set
    scores = grid.score(X_valid, y_valid)  # CONFUSION
    print('Best MSE through GridSearchCV: ', grid.best_score_)  # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-'*120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)
    # If test is True, deploy the best_params_ on the test set
    if test:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)
        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-'*120)
    return report[['mean_train_score', 'mean_test_score']]
UPDATED
As explained in the sklearn documentation, GridSearchCV takes the lists of parameter values you pass and tries all possible combinations to find the best parameters.
To decide which parameters are best, it runs a k-fold cross-validation for each parameter combination. With k-fold cross-validation, the training set is divided into a training part and a validation part (which plays the role of a test set). If you choose, for example, cv=5, the dataset is divided into 5 non-overlapping folds, and each fold is used once as the validation set while all the others are used as the training set. Hence GridSearchCV, in this example, computes the validation score (accuracy or another metric) on each of the 5 folds and averages them, and does so for each parameter combination. At the end there is one average validation score per parameter combination, and the combination with the highest average validation score is selected. That average validation score, associated with the best parameters, is stored in grid.best_score_.
On the other hand, grid.score(X_valid, y_valid) gives the score on the given data, provided the estimator has been refitted (refit=True). This is not the average over the 5 folds: the model with the best parameters is retrained on the whole training set passed to fit, its predictions on X_valid are computed, and these are compared with y_valid to get the score.
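The distinction can be checked numerically. The following is a minimal sketch, not the asker's setup: the synthetic data from make_regression, the Ridge estimator and the alpha grid are all assumptions chosen only to keep the example small and fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (assumption, stands in for the real DataFrame)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=3,
                    scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)

# best_score_ is the mean of the per-fold validation scores of the best combination
best_row = grid.best_index_
fold_scores = [grid.cv_results_[f"split{i}_test_score"][best_row] for i in range(3)]
cv_mean = float(np.mean(fold_scores))

# grid.score uses the refitted best_estimator_ on data the search never saw
holdout_score = grid.score(X_valid, y_valid)
manual_holdout = -mean_squared_error(y_valid, grid.best_estimator_.predict(X_valid))

print("mean CV score (best_score_):", grid.best_score_)
print("hold-out score (grid.score):", holdout_score)
```

Both numbers are negative MSE values because of the scoring string, but they are computed on different data by differently trained models, so there is no reason for them to coincide.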

Checking for Overfitting and Underfitting in sklearn models

I am using the sklearn RandomForestClassifier as my classifier. I could not figure out how to check for overfitting and underfitting in sklearn models.
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
model.fit(X_train, y_train)
Currently, I am using other tools to evaluate my model, such as cross_val_score, confusion_matrix, classification_report and PermutationImportance. Could someone please help me with this?
There are multiple ways to test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate. According to the documentation, it returns a dictionary with the train scores (if you pass return_train_score=True) and the test scores for the metrics you supply.
Sample code:
from sklearn.model_selection import cross_validate

model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare the score on the training data with the score on the test data.
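As a rough sketch of how the returned dictionary can be read, the example below compares the mean train and test scores. The synthetic data from make_classification and the smaller forest are assumptions made to keep it quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data (assumption, stands in for the real X and y)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1)
cv_dict = cross_validate(model, X, y, cv=5, return_train_score=True)

train_mean = cv_dict["train_score"].mean()
test_mean = cv_dict["test_score"].mean()
gap = train_mean - test_mean

print(f"train accuracy: {train_mean:.3f}")
print(f"test accuracy:  {test_mean:.3f}")
print(f"gap:            {gap:.3f}")
```

A train score far above the test score points towards overfitting, while low scores on both point towards underfitting. How large a gap is acceptable depends on the problem.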

Sklearn GridSearch with PredefinedSplit scoring does not match a standalone classifier

I am using sklearn GridSearchCV to find the best parameters for random forest classification with a predefined validation set. The scores from the best estimator returned by GridSearchCV do not match the scores obtained by training a separate classifier with the same parameters.
The data split definition
X = pd.concat([X_train, X_devel])
y = pd.concat([y_train, y_devel])
test_fold = -X.index.str.contains('train').astype(int)
ps = PredefinedSplit(test_fold)
The GridSearch definition
n_estimators = [10]
max_depth = [4]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestClassifier(random_state=0)
rf_grid = GridSearchCV(estimator = rf, param_grid = grid, cv = ps, scoring='recall_macro')
rf_grid.fit(X, y)
The classifier definition
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
The recall was calculated explicitly using sklearn.metrics.recall_score
y_pred_train = clf.predict(X_train)
y_pred_devel = clf.predict(X_devel)
uar_train = recall_score(y_train, y_pred_train, average='macro')
uar_devel = recall_score(y_devel, y_pred_devel, average='macro')
GridSearch
uar train: 0.32189884516029466
uar devel: 0.3328299259976279
Random Forest:
uar train: 0.483040291148839
uar devel: 0.40706644557392435
What is the reason for such a mismatch?
There are multiple issues here:
Your input arguments to recall_score are reversed. The correct order is:
recall_score(y_true, y_pred)
But you are doing:
recall_score(y_pred_train, y_train, average='macro')
Correct that to:
recall_score(y_train, y_pred_train, average='macro')
You are calling rf_grid.fit(X, y) for the grid search. After finding the best parameter combination, GridSearchCV refits the estimator on the whole data (all of X, ignoring the PredefinedSplit, which is only used during cross-validation while searching for the best parameters). So the estimator from GridSearchCV has seen the whole data, including X_devel, and its scores will differ from those you get with clf.fit(X_train, y_train).
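The refit behaviour can be illustrated with a small sketch. Everything here is an assumption for demonstration purposes: the synthetic data, the 200/100 split and the array-based test_fold replace the asker's index-based split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic 3-class data; first 200 rows play "train", last 100 play "devel"
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train = X[:200], y[:200]
X_devel, y_devel = X[200:], y[200:]

# -1 = never used for validation, 0 = member of the single validation fold
test_fold = np.r_[-np.ones(200, dtype=int), np.zeros(100, dtype=int)]
ps = PredefinedSplit(test_fold)

params = {"n_estimators": [10], "max_depth": [4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), params,
                      cv=ps, scoring="recall_macro")
search.fit(X, y)

# Standalone classifier trained on the train part only
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
uar_devel = recall_score(y_devel, clf.predict(X_devel), average="macro")

print("best_score_ (train -> devel):      ", search.best_score_)
print("standalone clf on devel:           ", uar_devel)
print("refitted best_estimator_ on devel: ", search.score(X_devel, y_devel))
```

In this sketch best_score_ should agree with the standalone classifier, because both are trained on the train part and scored on the devel part. The refitted best_estimator_ is a different model trained on all of X. Passing refit=False, or fitting a fresh estimator on X_train with best_params_, avoids mixing the two.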
It's because in your GridSearchCV you are using recall_macro as the scoring function, which returns the macro-averaged recall. See this link.
However, the default score of your RandomForestClassifier is the mean accuracy, which is why the scores differ (one is recall and the other is accuracy). See this link for more information.

XGBoost algorithm, question about the evaluate_model function

This evaluate_model function is frequently used; I found it used here at IBM. I will show the function here:
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn import metrics

def evaluate_model(alg, train, target, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use xgb.cv with early stopping to pick the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(train[predictors].values, target['Default Flag'].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(train[predictors], target['Default Flag'], eval_metric='auc')
    # Predict training set:
    dtrain_predictions = alg.predict(train[predictors])
    dtrain_predprob = alg.predict_proba(train[predictors])[:, 1]
    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.6g" % metrics.accuracy_score(target['Default Flag'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(target['Default Flag'], dtrain_predprob))
    plt.figure(figsize=(12, 12))
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importance', color='g')
    plt.ylabel('Feature Importance Score')
    plt.show()
After tuning the parameters for XGBoost, I have
from xgboost import XGBClassifier

xgb4 = XGBClassifier(
    objective="binary:logistic",
    learning_rate=0.10,
    n_estimators=5000,
    max_depth=6,
    min_child_weight=1,
    gamma=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    nthread=4,
    scale_pos_weight=1.0,
    seed=27)
features = [x for x in X_train.columns if x not in ['Default Flag', 'ID']]
evaluate_model(xgb4, X_train, y_train, features)
and the result I get is:
Model Report
Accuracy : 0.803236
AUC Score (Train): 0.856995
My question, perhaps ill-informed, is that this evaluate_model() function is not tested on the test set of the data, which I found odd. When I do call it on the test set (evaluate_model(xgb4, X_test, y_test, features)) I get this:
Model Report
Accuracy : 0.873706
AUC Score (Train): 0.965286
I want to know whether these two Model Reports are concerning at all, given that the test set has a higher accuracy than the training set. My apologies if the structure of this question is poorly presented.
I will develop my answer a little more:
This function trains on the dataset you give it and returns the training accuracy and AUC, so it is not a reliable way to evaluate your models.
In the link you provided, it is said that this function is used to tune the number of estimators:
The function below performs the following actions to find the best number of boosting trees to use on your data:
Trains an XGBoost model using features of the data.
Performs k-fold cross validation on the model, using accuracy and AUC score as the evaluation metric.
Returns output for each boosting round so you can see how the model is learning. You will look at the detailed output in the next section.
It stops running after the cross-validation score does not improve significantly with additional boosting rounds, giving you an optimal number of estimators for the model.
You should not use it to evaluate your model performance, but rather perform a clean cross validation.
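Your test scores are higher in this case because the function refits the model on whatever data you pass it, so calling it on the test set trains on the test set and reports training metrics for it. The test set is smaller, so the model overfits it more easily.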
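A clean evaluation along those lines could look like the sketch below. To keep it self-contained it swaps in sklearn's GradientBoostingClassifier for XGBClassifier and uses synthetic data; both are assumptions, and the same pattern should apply to any estimator with fit and predict_proba.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary data (assumption, stands in for the real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=27)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=27)

model = GradientBoostingClassifier(n_estimators=50, random_state=27)

# Cross-validated AUC, computed on the training part only
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Fit once on the training part, then score on data the model never saw
model.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("CV AUC per fold:", cv_auc)
print("train AUC:", train_auc)
print("test AUC: ", test_auc)
```

The key point is that the model is never fitted on X_test, so the test AUC is an estimate of generalization rather than another training score.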

Python: Evaluating an Isolation Forest

I am running isolation forest clustering on the mulcross database, which has 2 classes. I split my data into training and test sets and try to compute the accuracy score, the roc_auc_score and the confusion_matrix on my test set. But there are two problems. First, a clustering method should not use the labels in the training phase, so y_train should not be passed, but I did not find another way to evaluate my model. Moreover, the results I get are wrong.
My problem is how to evaluate a clustering model like isolation forest.
Here is my code:
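import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('db.csv')
y_true = df['Target']
df_data = df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
# The behaviour="new" argument is dropped here because recent sklearn versions no longer accept it
alg = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
model = alg.fit(X_train, y_train)  # y_train is ignored by IsolationForest
preds = alg.predict(X_test)  # returns +1 for inliers and -1 for outliers
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")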
I do not understand why you are clustering and also dividing the data into training/testing sets. It seems to me that you are mixing up classification and clustering. If you have labels, try a supervised method. Easy wins are xgboost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want to have small and well-separated clusters. You can look at a metric called silhouette too.
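As an illustration of the silhouette idea, here is a minimal sketch using sklearn's silhouette_score. The blob data and the choice of KMeans as the clustering method are assumptions, used only because they give a simple, label-free example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs (assumption); the true labels are discarded
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Cluster without labels, then score the partition using only X and the assignments
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)

print("silhouette score:", score)
```

The score lies between -1 and 1. Values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones.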
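You can also try the following, where y_pred_test holds the predictions on the test set. Note that this is the fraction of test points predicted as inliers, which equals the accuracy only if the test set contains no anomalies:
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])
Also, look here for some more details.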
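If ground-truth labels are available for evaluation only, one possible approach is sketched below. It rests on assumptions: the data is synthetic, anomalies are encoded as 1 and normal points as 0, and the model is fitted without labels. IsolationForest.predict returns +1 for inliers and -1 for outliers, so the predictions need to be mapped to the label encoding before computing metrics.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.RandomState(42)
# Synthetic data (assumption): a dense normal cluster plus scattered anomalies
X_inliers = rng.normal(0, 1, size=(450, 2))
X_outliers = rng.uniform(-8, 8, size=(50, 2))
X = np.vstack([X_inliers, X_outliers])
y = np.r_[np.zeros(450, dtype=int), np.ones(50, dtype=int)]  # 1 = anomaly (assumed encoding)

# Fit without labels
alg = IsolationForest(n_estimators=100, contamination=0.1, random_state=42).fit(X)

preds = alg.predict(X)              # +1 = inlier, -1 = outlier
y_pred = (preds == -1).astype(int)  # map to the assumed 0/1 encoding

# score_samples is higher for normal points, so negate it to rank anomalies first
anomaly_scores = -alg.score_samples(X)
auc = roc_auc_score(y, anomaly_scores)
cm = confusion_matrix(y, y_pred)

print("ROC AUC:", auc)
print(cm)
```

Using the continuous anomaly score for the ROC AUC, rather than the hard +1/-1 predictions, keeps the metric independent of the contamination threshold.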
