This evaluate_model function is frequently used; I found it used here at IBM. For reference, here is the function:
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from xgboost import XGBClassifier

def evaluate_model(alg, train, target, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use xgb.cv with early stopping to pick the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(train[predictors].values, target['Default Flag'].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(train[predictors], target['Default Flag'], eval_metric='auc')
    # Predict on the same data the model was just fitted on
    dtrain_predictions = alg.predict(train[predictors])
    dtrain_predprob = alg.predict_proba(train[predictors])[:, 1]
    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.6g" % metrics.accuracy_score(target['Default Flag'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(target['Default Flag'], dtrain_predprob))
    plt.figure(figsize=(12, 12))
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importance', color='g')
    plt.ylabel('Feature Importance Score')
    plt.show()
After tuning the parameters for XGBoost, I have
xgb4 = XGBClassifier(
    objective="binary:logistic",
    learning_rate=0.10,
    n_estimators=5000,
    max_depth=6,
    min_child_weight=1,
    gamma=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    nthread=4,
    scale_pos_weight=1.0,
    seed=27)
features = [x for x in X_train.columns if x not in ['Default Flag','ID']]
evaluate_model(xgb4, X_train, y_train, features)
and the results I get are
Model Report
Accuracy : 0.803236
AUC Score (Train): 0.856995
The question I have, which is perhaps ill-informed, is that this evaluate_model() function is never run on the test set, which I found odd. When I do call it on the test set (evaluate_model(xgb4, X_test, y_test, features)), I get this:
Model Report
Accuracy : 0.873706
AUC Score (Train): 0.965286
I want to know whether these two model reports are concerning, given that the test set has a higher accuracy than the training set. My apologies if this question is poorly structured.
Let me develop my answer a little more:
This function trains on the dataset you give it and returns the training accuracy and AUC on that same dataset, so it is not a reliable way to evaluate your models.
In the link you provided, it is said that this function is used to tune the number of estimators:
The function below performs the following actions to find the best number of boosting trees to use on your data:
Trains an XGBoost model using features of the data.
Performs k-fold cross validation on the model, using accuracy and AUC score as the evaluation metric.
Returns output for each boosting round so you can see how the model is learning. You will look at the detailed output in the next section.
It stops running after the cross-validation score does not improve significantly with additional boosting rounds, giving you an optimal number of estimators for the model.
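You should not use it to evaluate your model's performance; run a clean cross-validation (or score on held-out data) instead.
Your "test" scores are higher in this case because calling the function on the test set refits the model on that set and then scores it on the same data. The test set is smaller, so the model overfits it more easily.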
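As a rough illustration of what a clean cross-validation looks like, here is a minimal sketch. It assumes scikit-learn is available, and it uses synthetic data and GradientBoostingClassifier as a stand-in for XGBClassifier (both are assumptions, not part of the original post). The only point is that the AUC measured on the data the model was fitted on is optimistic compared with the cross-validated AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic, noisy binary classification data (hypothetical stand-in for the real data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Score on the same data the model was fitted on (what evaluate_model reports).
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)
train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# 5-fold cross-validation: every fold is scored by a model that never saw it.
cv_auc = cross_val_score(GradientBoostingClassifier(random_state=0), X, y,
                         cv=5, scoring="roc_auc")

print("AUC on fitted data : %.3f" % train_auc)
print("Cross-validated AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))
```

The gap between the two numbers is the optimism you get from scoring a model on its own training data, which is also what happens when evaluate_model is called on the test set.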
While doing GridSearchCV, what is the difference between the scores obtained through grid.score(...) and grid.best_score_
Kindly assume that a model, features, target, and param_grid are in place. Here is a part of the code I am very curious to know about.
grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                    return_train_score=True, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
scores = grid.score(X_valid, y_valid)
best_score_1 = scores
best_score_2 = grid.best_score_
best_score_1 and best_score_2 give two different outputs.
I would like to understand the difference between the two, and which of them should be considered the best score obtained from the given param_grid.
Following is the full function.
def apply_grid(df, model, features, target, params, test=False):
    '''
    Performs GridSearchCV after re-splitting the dataset, provides
    comparison between train's MSE and test's MSE to check for
    Generalization and optionally deploys the best-found parameters
    on the Test Set as well.
    Args:
        df: DataFrame
        model: a model to use
        features: features to consider
        target: labels
        params: Param_Grid for Optimization
        test: False by Default, if True, predicts on Test
    Returns:
        MSE scores on models and slice from the cv_results_
        to compare the models generalization performance
    '''
    my_model = model()
    # Split the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                        df[target], random_state=0)
    # Resplit the train dataset for GridSearchCV into train2 and valid to keep the test set separate
    X_train2, X_valid, y_train2, y_valid = train_test_split(X_train,
                                                            y_train, random_state=0)
    # Use Grid Search to find the best parameters from the param_grid
    grid = GridSearchCV(estimator=my_model, param_grid=params, cv=3,
                        return_train_score=True, scoring='neg_mean_squared_error')
    grid.fit(X_train2, y_train2)
    # Evaluate on Valid set
    scores = grid.score(X_valid, y_valid)  # CONFUSION
    print('Best MSE through GridSearchCV: ', grid.best_score_)  # CONFUSION
    print('Best MSE through GridSearchCV: ', scores)
    print('I AM CONFUSED ABOUT THESE TWO OUTPUTS ABOVE. WHY ARE THEY DIFFERENT')
    print('Best Parameters: ', grid.best_params_)
    print('-' * 120)
    print('mean_test_score is rather mean_valid_score')
    report = pd.DataFrame(grid.cv_results_)
    # If test is True, deploy the best_params_ on the test set
    if test:
        my_model = model(**grid.best_params_)
        my_model.fit(X_train, y_train)
        predictions = my_model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print('TEST MSE with the best params: ', mse)
        print('-' * 120)
    return report[['mean_train_score', 'mean_test_score']]
UPDATED
As explained in the sklearn documentation, GridSearchCV takes the lists of parameter values you pass and tries every possible combination to find the best parameters.
To decide which parameters are best, it runs a k-fold cross-validation for each parameter combination. With k-fold cross-validation, the training set is divided into a training part and a validation part (which plays the role of a test set). If you choose, for example, cv=5, the dataset is divided into 5 non-overlapping folds, and each fold is used once as the validation set while the other four are used for training. Hence GridSearchCV, in this example, computes the validation score (accuracy or whatever scoring you chose) on each of the 5 folds and averages them, and it does so for every parameter combination. At the end there is one average validation score per parameter combination, and the combination with the highest average is selected. That average validation score of the best parameters is stored in the grid.best_score_ attribute.
On the other hand, the grid.score(X_valid, y_valid) method gives the score on the given data, provided the estimator has been refitted (refit=True). It is therefore not an average over the 5 folds: the model with the best parameters is retrained on the whole training set passed to fit, its predictions on X_valid are computed and compared with y_valid, and the resulting score is returned.
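To make the distinction concrete, here is a small sketch under stated assumptions: scikit-learn is installed, the data is synthetic, and Ridge with a toy alpha grid stands in for the asker's model and param_grid (none of these come from the question). It recomputes both numbers by hand so you can see where each one comes from.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data; X_train2 / X_valid mirror the names in the question.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train2, X_valid, y_train2, y_valid = train_test_split(X, y, random_state=0)

grid = GridSearchCV(estimator=Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                    cv=3, scoring="neg_mean_squared_error")
grid.fit(X_train2, y_train2)

# best_score_: mean of the 3 validation-fold scores for the best parameter setting.
best_cv_score = grid.cv_results_["mean_test_score"][grid.best_index_]

# score(): the refitted best_estimator_ evaluated once on the data you pass in.
valid_score = grid.score(X_valid, y_valid)
manual_valid_score = -mean_squared_error(y_valid, grid.best_estimator_.predict(X_valid))

print("grid.best_score_        :", grid.best_score_)
print("mean_test_score[best]   :", best_cv_score)
print("grid.score(X_valid, ...):", valid_score)
print("-MSE of best_estimator_ :", manual_valid_score)
```

The first two printed values agree with each other, and so do the last two; the two pairs differ because they are computed on different data with differently trained models.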
The pipeline exported by TPOT states that the average CV score on the training set was -128.90187963562252 (neg_MAE).
However, refitting the pipeline on exactly the same training set yields a much smaller MAE, around 35.
Moreover, predicting on the unseen test set yields an MAE of around 140, which is in line with what the exported pipeline states.
I am a bit confused and wondering how to reproduce the error score on the training set.
The pipeline seems to be overfitting, right?
cv = RepeatedKFold(n_splits=4, n_repeats=1, random_state=1)
model = TPOTRegressor(generations=10, population_size=25, offspring_size=None, mutation_rate=0.9,
crossover_rate=0.1, scoring='neg_mean_absolute_error', cv=cv,
subsample=0.75,n_jobs=-1, max_time_mins=None,
max_eval_time_mins=5,random_state=42,config_dict=None, template=None,
warm_start=False, memory=None,
use_dask=False,periodic_checkpoint_folder=None, early_stop=3, verbosity=2,
disable_update_check=False, log_file=None)
model.fit(train_df[x], train_df[y])
# The Exported model
# Average CV score on the training set was: -128.90187963562252
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LassoLarsCV(normalize=True)),
    StackingEstimator(estimator=ExtraTreesRegressor(bootstrap=True, max_features=0.4, min_samples_leaf=1,
                                                    min_samples_split=7, n_estimators=100)),
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=True, max_features=0.15000000000000002, min_samples_leaf=9,
                        min_samples_split=7, n_estimators=100))
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
Thanks in advance
I have a DataFrame X with a column called target that holds 10 different labels: [0,1,2,3,4,5,6,7,8,9]. I have a machine learning model, say model=AdaBoostClassifier(), that I would like to fit to the data and use to predict the labels, training it with cross-validation. I use two metrics for the cross-validation, accuracy and neg_mean_squared_error, to evaluate the performance and compute the ratio neg_mean_squared_error/accuracy. The code looks like this:
model.seed = 42
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring=('accuracy', 'neg_mean_squared_error')
scores = cross_validate(model, X.drop(target,axis=1), X[target], cv=outer_cv, n_jobs=-1, scoring=scoring)
scores = abs(np.sqrt(np.mean(scores['test_neg_mean_squared_error'])*-1))/np.mean(scores['test_accuracy'])
score_description = [model,'{model}'.format(model=model.__class__.__name__),"%0.5f" % scores]
However, whenever I start to run, I get the following error message:
ValueError: Samplewise metrics are not available outside of multilabel classification.
How can I solve this issue with the metrics and perform the corresponding classification? Which metrics could I use to evaluate the performance of the model in the multi-label case?
I am doing isolation forest clustering on the mulcross database, which has 2 classes. I divide my data into a training and a test set and try to calculate the accuracy score, the roc_auc_score and the confusion_matrix on my test set. But there are two problems. The first is that in a clustering method I should not use the labels in the training phase, which means that "y_train" should not be mentioned, but I did not find another way to evaluate my model. Moreover, the results I get are wrong.
My problem is how to evaluate a clustering model like isolation forest.
Here is my code:
df = pd.read_csv('db.csv')
y_true=df['Target']
df_data=df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
alg=IsolationForest(n_estimators=100, max_samples= 256 , contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0, behaviour="new")
model = alg.fit(X_train, y_train)
preds = alg.predict(X_test)
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")
I do not understand why you are clustering and also dividing the data into training/testing sets. It seems to me that you are mixing up classification and clustering. If you have labels, try a supervised method. Easy wins are xgboost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want compact and well-separated clusters. You can also look at a metric called the silhouette score.
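As a hedged illustration of the silhouette idea, the sketch below scores a clustering without using any labels. It assumes scikit-learn is available; the two synthetic blobs and the choice of KMeans are illustrative assumptions, not something from the question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=0)

# Cluster without labels, then score the result without labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# The silhouette lies in [-1, 1]; values near 1 mean compact, well-separated clusters.
print("Silhouette score: %.3f" % score)
```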
You can also try
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])
also, look here for some more details.
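If the labels are only used for evaluation, one possible setup is sketched below. It rests on a few assumptions that are not in the question: the data is synthetic, the label convention is 1 = anomaly and 0 = normal, and contamination is set to the known outlier fraction. Two library facts matter here: IsolationForest.fit ignores y, and predict returns +1 for inliers and -1 for outliers, so the predictions have to be mapped to the label convention before calling accuracy_score or confusion_matrix.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
# Hypothetical data: 980 normal points near the origin, 20 points scattered far away.
normal = rng.normal(0.0, 1.0, size=(980, 2))
outliers = rng.uniform(6.0, 12.0, size=(20, 2)) * rng.choice([-1.0, 1.0], size=(20, 2))
X = np.vstack([normal, outliers])
y = np.r_[np.zeros(980, dtype=int), np.ones(20, dtype=int)]  # assumed: 1 = anomaly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Unsupervised fit: only the features are passed, y_train is not needed.
alg = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.02, random_state=42)
alg.fit(X_train)

raw_preds = alg.predict(X_test)             # +1 = inlier, -1 = outlier
preds = (raw_preds == -1).astype(int)       # map to the assumed 0/1 convention
anomaly_score = -alg.score_samples(X_test)  # higher = more anomalous

print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, anomaly_score))  # rank by score, not by hard labels
print(confusion_matrix(y_test, preds))
```

Passing the continuous anomaly score to roc_auc_score, rather than the hard +1/-1 predictions, lets the AUC reflect the ranking of all test points.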
I'm currently trying to analyze data for the first time using XGBoost. I want to find the best parameters using GridSearchCV. I want to minimize the root mean squared error, so I used "rmse" as eval_metric. However, scoring in grid search does not have such a metric. I found on this site that "neg_mean_squared_error" does the same, but it gives me different results from the RMSE. When I take the square root of the absolute value of the "neg_mean_squared_error", I get a value of around 8.9, while a different function gives me an RMSE of about 4.4.
I don't know what goes wrong or how to get these two functions to agree and give the same values.
Because of this problem, I get wrong values for "best_params_", which give me a higher RMSE than some of the values I initially started tuning with.
Can anyone please explain to me how to score on the RMSE in the grid search, or why my code gives different values?
Thanks in advance.
def modelfit(alg, trainx, trainy, useTrainCV=True, cv_folds=10, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(trainx, label=trainy)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(trainx, trainy, eval_metric='rmse')
    # Predict training set:
    dtrain_predictions = alg.predict(trainx)
    # dtrain_predprob = alg.predict_proba(trainy)[:, 1]
    print(dtrain_predictions)
    print(np.sqrt(mean_squared_error(trainy, dtrain_predictions)))
    # Print model report:
    print("\nModel Report")
    print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(trainy, dtrain_predictions)))

param_test2 = {
    'max_depth': [6, 7, 8],
    'min_child_weight': [2, 3, 4]
}
grid2 = GridSearchCV(estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=2000, max_depth=5,
                                                min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='reg:linear', nthread=4, scale_pos_weight=1, random_state=4),
                     param_grid=param_test2, scoring='neg_mean_squared_error', n_jobs=4, iid=False, cv=10, verbose=20)
grid2.fit(X_train, y_train)
# Mean cross-validated score of the best_estimator
print(grid2.best_params_, np.sqrt(np.abs(grid2.best_score_)))
print(np.sqrt(np.abs(grid2.score(X_train, y_train))))
modelfit(grid2.best_estimator_, X_train, y_train)
print(np.sqrt(np.abs(grid2.score(X_train, y_train))))
In GridSearchCV the scoring parameter is transformed so that higher values are always better than lower values. In your example, neg_mean_squared_error is just the negated MSE (mean squared error), not the RMSE. You should not read neg_mean_squared_error as the RMSE; to get an RMSE, negate it and take the square root. Within your cross-validation, compare values of neg_mean_squared_error directly, where a higher value (closer to zero) is better.
In the scoring parameter portion of the model_evaluation documentation this behavior is mentioned.
Scikit-Learn Scoring Parameter Documentation
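The arithmetic can be sketched in a few lines of plain Python. The per-fold values below are made up for illustration only. The sketch shows how an RMSE is recovered from a neg_mean_squared_error score, and that the square root of the mean fold MSE is not the same number as the mean of the per-fold RMSEs. Both are cross-validated errors, so neither is expected to match an RMSE computed on the training data itself, as modelfit does. In scikit-learn 0.22 and later there is also a built-in 'neg_root_mean_squared_error' scoring string.

```python
import math

# Hypothetical per-fold scores as produced with scoring='neg_mean_squared_error'
# (MSE stored negated, so that higher is better).
fold_neg_mse = [-16.0, -64.0, -100.0]

# What best_score_ would hold for this parameter setting: the mean over folds.
best_score = sum(fold_neg_mse) / len(fold_neg_mse)

# RMSE recovered from the mean MSE: negate first, then take the square root.
rmse_from_mean_mse = math.sqrt(-best_score)

# Mean of the per-fold RMSEs, which is what averaging an RMSE scorer would give.
mean_of_fold_rmse = sum(math.sqrt(-s) for s in fold_neg_mse) / len(fold_neg_mse)

print(best_score)          # -60.0
print(rmse_from_mean_mse)  # about 7.746
print(mean_of_fold_rmse)   # about 7.333
```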
It's because XGBRegressor.score returns the coefficient of determination (R^2) of the prediction, not the RMSE.