I am working on a supervised machine learning algorithm and it seems to have a curious behavior.
So, let me start:
I have a function where I pass different classifiers, their parameters, training data and their labels:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

def HT(targets, train_new, algorithm, parameters):
    # create my scorer
    scorer = make_scorer(f1_score)
    # create the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm,
                               param_grid=parameters, scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline
To this function I pass parameters like:
clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})
OK, so here is where things start to get strange. This function returns an F1 score, but it is different from the score I compute manually using the formula:
F1 = 2 * (precision * recall) / (precision + recall)
The differences are pretty big (0.68 compared with 0.89).
Am I doing something wrong in the function?
Should the score computed by grid_search (grid_search.best_score_) be the same as the score on the whole training set (grid_search.best_estimator_.predict(train_new))?
Thanks
The score that you are manually calculating takes into account the global true positives and negatives across all classes. But in scikit-learn's f1_score, the default approach is the binary average (i.e. it scores only the positive class).
So, in order to get the same scores, build the scorer with f1_score as specified below:
scorer = make_scorer(f1_score, average='micro')
Or simply, in GridSearchCV, use:
scoring = 'f1_micro'
More information about how the averaging of scores is done is given at:
- http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
You may also want to take a look at the following answer, which describes the calculation of scores in scikit-learn in detail:
https://stackoverflow.com/a/31575870/3374996
EDIT:
Changed macro to micro. As written in the documentation:
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
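For illustration, a minimal sketch with made-up labels showing the difference (the toy arrays here are just examples, not from the question):

from sklearn.metrics import f1_score

# multiclass labels: 'micro' pools TP/FP/FN over all classes,
# which matches the manual global formula
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(y_true, y_pred, average='micro'))

# the default average='binary' is only defined for two classes
# and reports F1 for the positive class (pos_label=1)
y_true_bin = [0, 1, 1, 0, 1]
y_pred_bin = [0, 1, 0, 0, 1]
print(f1_score(y_true_bin, y_pred_bin))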
My goal is to get a well-fitted model (train and test set metric differences of only 1% - 5%), because Random Forest tends to overfit (with the default params, the train set F1 score for class 1 is 1.0).
The problem is that GridSearchCV only considers the test set metrics. It disregards the train set metrics, so the result is still an overfitted model.
What I've tried:
I tried to access the cv_results_ attribute, but there is a ton of output, I am not sure how to read it, and I believe we are not supposed to do that manually.
The code
# model definition
rf_cv = GridSearchCV(estimator=rf_clf_default,
                     # what the user cares about is the model's ability to find class 1
                     scoring=make_scorer(score_func=f1_score, pos_label=1),
                     param_grid={'randomforestclassifier__n_estimators': [37, 38, 39, 100, 200],
                                 'randomforestclassifier__max_depth': [4, 5, 6, 10, 20, 30],
                                 'randomforestclassifier__min_samples_leaf': [2, 3, 4]},
                     return_train_score=True,
                     refit=True)

# ignore OneHotEncoder warning about unknown categories
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=UserWarning)
    # train the algorithm
    rf_cv.fit(X=X_train, y=y_train)

# get the F1 score for class 1 (the scorer above is f1_score)
print("best f1 score for class 1", rf_cv.best_score_)
# get the best params
display("best parameters", rf_cv.best_params_)
You can provide a callable for the refit parameter:
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
For example, if you want to only consider hyperparameters whose train and test recall scores are within 0.05:
import pandas as pd

def my_refit_criteria(cv_results_):
    cv_frame = pd.DataFrame(cv_results_)
    # keep only candidates whose train/test recall gap is below 0.05
    candidate_mask = cv_frame['mean_train_recall'] - cv_frame['mean_test_recall'] < 0.05
    if candidate_mask.sum() > 0:
        candidates = cv_frame[candidate_mask]
    else:
        # if none qualify, just pick the best overall
        candidates = cv_frame
    # idxmax keeps the row's original position in cv_results_, which is what refit expects
    return candidates['mean_test_recall'].idxmax()

search = GridSearchCV(..., refit=my_refit_criteria)
(I haven't tested this; if you see errors let me know.)
There's a more complex example in the docs:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
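For completeness, here is a hypothetical setup (the estimator and grid are placeholders, not from the question) showing where the 'mean_train_recall' / 'mean_test_recall' columns come from: a named scoring dict plus return_train_score=True.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'max_depth': [4, 6, 10]},  # placeholder grid
    scoring={'recall': 'recall'},          # the key 'recall' names the *_recall columns in cv_results_
    refit=my_refit_criteria,               # the custom selection function defined above
    return_train_score=True,
    cv=5,
)
# search.fit(X_train, y_train)  # X_train / y_train as in the question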
I've been using GridSearchCV to optimize some parameters for a binary classifier. I want to operate the classifier at a point where it barely makes any false positives but still reaches a high true positive rate. So in short: optimize TPR while restricting FPR to 0 (or close to it).
Therefore I wanted to use a slightly adapted roc_auc_score as the scorer argument in GridSearchCV.
clf1 = SVC()
# define grid space (obviously I would use a bigger grid for the actual optimization)
grid1 = {'C': [1, 1000], 'kernel': ['poly'], 'degree': [3], 'class_weight': ['balanced'], 'probability': [True]}
# define scoring function: since we want to keep FPR close to 0, we calculate the ROC curve
# only between FPR = [0, 0.001] (instead of [0, 1])
roc_spec = make_scorer(roc_auc_score, max_fpr=0.001)  # ROC score for the unsafe class
grid_clf_acc = GridSearchCV(clf1, param_grid=grid1, scoring=roc_spec, n_jobs=-1, cv=cross_validation_folds)
grid_clf_acc.fit(X_train, y_train)
As you can see, I've adapted sklearn's standard roc_auc_score by setting its max_fpr to 0.001.
If I now run the grid search, unfortunately the algorithm no longer uses multiple confidence thresholds to compute the ROC score; it uses only one confidence threshold instead.
On the other hand, if I don't use the 'self-made' scorer and run the grid search with the built-in roc_auc_score, the algorithm does indeed use multiple thresholds to compute the ROC AUC score:
grid_clf_acc = GridSearchCV(clf1, param_grid=grid1, scoring='roc_auc', n_jobs=-1, cv=cross_validation_folds)
So somehow the slightly adapted roc_auc_score does not have the same capabilities as the original roc_auc_score. Is this a bug, or am I making a mistake when I define my own scorer?
(Remarks:
In this example I've used max_fpr=0.001. Even if I set it to 1, it still calculates the ROC AUC score based on one threshold only.
I also tried the two arguments of the make_scorer function (needs_threshold and needs_proba), but neither of them solved the problem.
Finally, I share an image that shows two ROC curves I made to localize the problem. The left one shows an ROC for the classifier that was generated with multiple thresholds; the number on top is the calculated ROC score. This score did not match the score I got in the grid search when using the customized scorer, but it did match the score when I used the built-in scorer. On the right I plotted an ROC for the classifier that was generated with one threshold only (i.e. I used predict instead of predict_proba). That curve did indeed match the calculated but "faulty" ROC AUC score of the GridSearchCV when using the customized scorer.)
I have found my mistake. What finally worked was initializing the scorer as follows:
roc_spec = make_scorer(roc_auc_score, max_fpr=0.001, needs_proba=True)
Then I also had to set probability=True in the SVC:
clf1 = SVC(probability=True)
This made it work.
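Putting the two pieces together, a minimal sketch of the working combination might look like this (the grid is shortened; needs_proba is the make_scorer flag available in the scikit-learn versions this question was written against, while newer releases replace it with response_method='predict_proba'):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, roc_auc_score

# scorer that evaluates the partial AUC up to FPR = 0.001 on predicted probabilities
roc_spec = make_scorer(roc_auc_score, max_fpr=0.001, needs_proba=True)

clf1 = SVC(probability=True)  # probability=True is required so the scorer can call predict_proba
grid_clf = GridSearchCV(clf1, param_grid={'C': [1, 1000], 'kernel': ['poly'], 'degree': [3]},
                        scoring=roc_spec, cv=5)
# grid_clf.fit(X_train, y_train)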
I'm using a diabetes dataset which has 3 classes for the target variable. I have used a Decision Tree Classifier, optimized the hyperparameters using scikit-learn's RandomizedSearchCV, and fitted the model to the training data. I have then computed the probability values for the test data, which give the probability of assigning each observation to each of the 3 classes. Now I want to calculate a cutoff value that I can use to assign the classes, and for this purpose I'm using the F1 score to find the appropriate cutoff.
Now I'm stuck on how to find the F1 score. Will the F1 score metric help me find it?
Here is the dataset
After preprocessing the data, I have split it into training and testing sets.
dtree = DecisionTreeClassifier()
params = {'class_weight': [None, 'balanced'],
          'criterion': ['entropy', 'gini'],
          'max_depth': [None, 5, 10, 15, 20, 30, 50, 70],
          'min_samples_leaf': [1, 2, 5, 10, 15, 20],
          'min_samples_split': [2, 5, 10, 15, 20]}
grid_search = RandomizedSearchCV(dtree, cv=10, n_jobs=-1, n_iter=10, scoring='roc_auc_ovr',
                                 verbose=20, param_distributions=params)
grid_search.fit(X_train, y_train)

mdl.fit(X_train, y_train)
test_score = mdl.predict_proba(X_test)
The following is the formula I created for the cutoff of a binary classifier:
cutoffs = np.linspace(0.01, 0.99, 99)
true = y_train
train_score = mdl.predict_proba(X_train)[:, 1]

F1_all = []
for cutoff in cutoffs:
    pred = (train_score > cutoff).astype(int)
    TP = ((pred == 1) & (true == 1)).sum()
    FP = ((pred == 1) & (true == 0)).sum()
    TN = ((pred == 0) & (true == 0)).sum()
    FN = ((pred == 0) & (true == 1)).sum()
    F1 = TP / (TP + 0.5 * (FP + FN))
    F1_all.append(F1)

my_cutoff = cutoffs[np.argmax(F1_all)]
preds = (test_score[:, 1] > my_cutoff).astype(int)
There is no cutoff value for the softmax output of a multiclass classifier in the same sense as the cutoff value for a binary classifier.
When your output is normalized probabilities for multiple classes and you want to convert this into class labels, you just take the label with the highest assigned probability.
Technically you could design some custom scheme, such as: if class 1 has a probability of 10% or more, choose the class 1 label; otherwise pick the class with the highest assigned probability.
That would be a sort of cutoff for class 1, but it is rather arbitrary and I have not seen anyone doing it in practice. If you have some deep insight into your problem suggesting that something like this may be useful, then go ahead and build your own "cutoff" formula; otherwise just stick with the general approach (argmax of the normalized probabilities).
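To make the contrast concrete, here is a small sketch with made-up probabilities comparing the standard argmax rule with the arbitrary "class 1 first" cutoff described above:

import numpy as np

proba = np.array([[0.70, 0.08, 0.22],
                  [0.45, 0.12, 0.43],
                  [0.30, 0.25, 0.45]])

# general approach: take the class with the highest probability
labels_argmax = proba.argmax(axis=1)                               # -> [0, 0, 2]

# custom scheme: pick class 1 whenever its probability reaches 10%, otherwise argmax
labels_custom = np.where(proba[:, 1] >= 0.10, 1, labels_argmax)    # -> [0, 1, 1]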
From the business perspective, false negatives lead to about tenfold higher costs (real money) than false positives. Given my standard binary classification models (logit, random forest, etc.), how can I incorporate this into my model?
Do I have to change (weight) the loss function in favor of the 'preferred' error (FP)? If so, how do I do that?
There are several options for you:
As suggested in the comments, class_weight shifts the loss function towards the preferred class. This option is supported by various estimators, including sklearn.linear_model.LogisticRegression, sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
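For example, a sketch of that option (assuming class 1 is the positive, expensive-to-miss class; the 1:10 ratio mirrors the stated cost asymmetry and should be tuned like any other hyperparameter):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# penalize mistakes on class 1 ten times more heavily than mistakes on class 0
logit = LogisticRegression(class_weight={0: 1, 1: 10})
rf = RandomForestClassifier(class_weight={0: 1, 1: 10})
# both also accept class_weight='balanced' to weight inversely to class frequencies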
You can also select the decision threshold very low during the cross-validation to pick the model that gives the highest recall (though possibly low precision). A recall close to 1.0 effectively means false negatives close to 0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:
y_scores = cross_val_predict(classifier, x_train, y_train, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
If you plot the precisions and recalls against the thresholds, you will see the trade-off between them: recall falls while precision generally rises as the threshold increases.
After picking the best threshold, you can use the raw scores from classifier.decision_function() method for your final classification.
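As a sketch of that last step (the target recall value is illustrative; recalls and thresholds come from the precision_recall_curve call above, and classifier / x_test stand for the fitted model and the new data):

import numpy as np

target_recall = 0.99
# recalls has one more element than thresholds; align them and take the highest
# threshold that still reaches the target recall
ok = np.where(recalls[:-1] >= target_recall)[0]
chosen_threshold = thresholds[ok[-1]]

# final classification from the raw decision scores
y_pred = (classifier.decision_function(x_test) >= chosen_threshold).astype(int)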
Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier (one that always predicts the positive class never misses a positive, but is useless).
As @Maxim mentioned, there are 2 stages for this kind of tuning: the model training stage (e.g. custom class weights) and the prediction stage (e.g. lowering the decision threshold).
Another tuning option for the model-training stage is to use a recall scorer. You can use it in your grid-search cross-validation (GridSearchCV) to tune your classifier towards the hyper-parameters that give high recall.
The GridSearchCV scoring parameter can accept either the 'recall' string or the function recall_score.
Since you're using binary classification, both options should work out of the box and call recall_score with its default values, which suit a binary classification:
average: 'binary' (i.e. one simple recall value)
pos_label: 1 (like numpy's True value)
Should you need to customize it, you can wrap an existing scorer, or a custom one, with make_scorer and pass it to the scoring parameter.
For example:
from sklearn.metrics import recall_score, make_scorer

recall_custom_scorer = make_scorer(
    # with average='binary' (the default) recall_score returns a single float
    lambda y, y_pred, **kwargs: recall_score(y, y_pred, pos_label='yes')
)

GridSearchCV(estimator=est, param_grid=param_grid, scoring=recall_custom_scorer, ...)
I am using sklearn.lda for classification and was a little puzzled about the score function that prints the mean classification error.
Is it determined by leave-one-out (jackknife)?
How do I interpret the result? It's only a float value, without much documentation.
Thanks in advance,
EL
The score method takes samples X and their true labels y and compares its own predictions with y. It returns the mean accuracy, which is always a single figure. For example,
lda = LDA().fit(X, y)
print(lda.score(X, y))
will print the accuracy of the classifier on its own training set.
Every classifier has a score method, which usually (though not necessarily) returns mean accuracy. The method is used by the GridSearchCV model selection algorithm to determine the quality of the classifier if you don't explicitly give it a scoring argument.
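As a small sketch of that (using the current class name LinearDiscriminantAnalysis, which replaced sklearn.lda.LDA in later scikit-learn versions, and toy data):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 1.], [2., 2.]])
y = np.array([0, 0, 1, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))                    # mean accuracy on the training set
print(accuracy_score(y, lda.predict(X)))  # the same number, computed explicitly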