How to penalize False Negatives more than False Positives


From the business perspective, false negatives lead to about tenfold higher costs (real money) than false positives. Given my standard binary classification models (logit, random forest, etc.), how can I incorporate this into my model?
Do I have to change (weight) the loss function in favor of the 'preferred' error (FP)? If so, how do I do that?

There are several options for you:
As suggested in the comments, class_weight should boost the loss function towards the preferred class. This option is supported by various estimators, including sklearn.linear_model.LogisticRegression,
sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
You can also set the decision threshold very low during cross-validation to pick the model that gives the highest recall (though possibly low precision). A recall close to 1.0 effectively means false_negatives close to 0.0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:
y_scores = cross_val_predict(classifier, x_train, y_train, cv=3,
method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
If you plot the precisions and recalls against the thresholds, you should see recall fall and precision generally rise as the threshold increases.
After picking the best threshold, you can use the raw scores from classifier.decision_function() method for your final classification.
Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier (one that always predicts the positive class never produces a false negative, but it is useless).
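The first two options can be sketched together as follows. This is a minimal illustration, not a recipe: the synthetic dataset, the 1:10 weight ratio and the 0.95 recall target are assumptions chosen for the example, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Option 1: make errors on the positive class ten times more costly in training.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)

# Option 2: get out-of-fold scores, then pick a threshold from the PR curve.
y_scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

# recalls has one more entry than thresholds; the extra last entry is recall = 0.
target_recall = 0.95  # hypothetical business requirement
meets_target = recalls[:-1] >= target_recall

# Recall never increases as the threshold grows, so the highest threshold
# that still meets the target gives the best precision available for it.
threshold = thresholds[meets_target].max()

y_pred = (y_scores >= threshold).astype(int)
print("threshold:", threshold)
print("recall:", recall_score(y, y_pred))
```

At prediction time the same threshold would be applied to classifier.decision_function(X_new) after fitting on the full training set.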

As @Maxim mentioned, there are two stages at which to do this kind of tuning: the model-training stage (e.g. custom weights) and the prediction stage (e.g. lowering the decision threshold).
Another option for the model-training stage is a recall scorer. You can use it in a grid-search cross-validation (GridSearchCV) to tune your classifier's hyperparameters towards high recall.
The GridSearchCV scoring parameter accepts either the string 'recall' or a scorer built from the recall_score function with make_scorer.
Since this is a binary classification, both options should work out of the box, calling recall_score with its default values, which suit a binary classification:
average: 'binary' (i.e. one simple recall value)
pos_label: 1 (like numpy's True value)
Should you need to customize it, you can wrap an existing metric, or a custom one, with make_scorer, and pass the result to the scoring parameter.
For example:
from sklearn.metrics import recall_score, make_scorer
# Extra keyword arguments given to make_scorer are forwarded to recall_score.
recall_custom_scorer = make_scorer(recall_score, pos_label='yes')
GridSearchCV(estimator=est, param_grid=param_grid, scoring=recall_custom_scorer, ...)
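A runnable sketch of this kind of search, under stated assumptions: the synthetic data, the logistic-regression estimator and the parameter grid are illustrative only. The second search uses an F-beta scorer with beta=2, which is not part of the answer above; it is shown as one possible alternative because optimizing recall alone tends to favor models that predict the positive class almost everywhere.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (assumption: roughly 10% positives, label 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Illustrative grid; the values are not taken from the question.
param_grid = {
    "C": [0.01, 0.1, 1.0],
    "class_weight": [None, "balanced", {0: 1, 1: 10}],
}

# Built-in string scorer: recall of the positive class.
recall_search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=3
)
recall_search.fit(X, y)

# Alternative (assumption): F-beta with beta=2 weights recall above precision
# without ignoring precision entirely.
f2_scorer = make_scorer(fbeta_score, beta=2)
f2_search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring=f2_scorer, cv=3
)
f2_search.fit(X, y)

print("recall:", recall_search.best_params_, recall_search.best_score_)
print("F2:    ", f2_search.best_params_, f2_search.best_score_)
```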

Related

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I split the dataset 70%-30% randomly into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests on the whole X_train, on undersampled training X_und and on oversampled X_sm respectively. X_und was generated by simply cutting down at random the rows of X_train labelled by 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn GridSearchCV I tweaked the three models to get the best parameters:
param_grid = {'min_samples_leaf':[3,5,7,10,15],'max_features':[0.5,'sqrt','log2'],
'max_depth':[10,15,20],
'class_weight':[{0:1,1:1},{0:1,1:2},{0:1,1:5},'balanced'],
'criterion':['entropy','gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(),param_grid,cv=sss,verbose=1,n_jobs=-1,scoring='roc_auc')
grid.fit(X_train,y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
max_features=0.5, n_jobs=-1, oob_score=True, class_weight={0:1,1:5})
trained on the whole X_train and giving classification report on the test set
              precision    recall  f1-score   support
           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668
   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
              precision    recall  f1-score   support
           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668
For SMOTE:
              precision    recall  f1-score   support
           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668
I played with the class_weight parameter to give more weight to the 1s, and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running the GridSearchCV to tweak the forests, I used different scores, focusing especially on f1 and roc_auc hoping to decrease the False Negatives. I got great scores with the SMOTE-oversampling, but this model did not generalize well on the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of False Negatives and increase the Recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in cross validation methods of sklearn like GridSearch.
I selected only the most important features, but this did not improve the result; on the contrary, it got worse. I noticed that the feature importances obtained from training an RF after SMOTE were completely different from those of the original sample.
I don't know exactly what to do with the oob_score other than considering it as a free validation score obtained when training the forests. With the oversampling I get the highest oob_score = 0.9535 but this is kind of natural since the training set is in this case balanced, the problem is still that it does not generalize well to the test set.
I have now run out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

Gridsearch technique in sklearn, python

I am working on a supervised machine learning algorithm and it seems to have a curious behavior.
So, let me start:
I have a function where I pass different classifiers, their parameters, training data and their labels:
def HT(targets, train_new, algorithm, parameters):
    # creating my scorer
    scorer = make_scorer(f1_score)
    # creating the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm,
                               param_grid=parameters, scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline
To this function I pass parameters like:
clf_param.append( {'C' : np.array([0.001,0.01,0.1,1,10]),
'kernel':(['linear','rbf']),
'decision_function_shape' : (['ovr'])})
Ok, so here is where things start to get strange. This function returns an f1_score, but it is different from the score I compute manually using the formula:
F1 = 2 * (precision * recall) / (precision + recall)
There are pretty big differences (0.68 compared with 0.89)
Am I doing something wrong in the function?
Should the score computed by grid_search (grid_search.best_score_) be the same as the score on the whole training set (grid_search.best_estimator_.predict(train_new))?
Thanks

The score that you are calculating manually takes into account the global true positives and negatives for all classes. But scikit-learn's f1_score defaults to average='binary' (i.e. it scores only the positive class).
So, in order to achieve the same scores, use the f1_score as specified below:
scorer=make_scorer(f1_score, average='micro')
Or simply, in GridSearchCV, use:
scoring = 'f1_micro'
More information about how the averaging of scores is done is given on:
- http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
You may also want to take a look at the following answer, which describes the calculation of scores in scikit-learn in detail:
https://stackoverflow.com/a/31575870/3374996
EDIT:
Changed macro to micro. As written in documentation:
'micro': Calculate metrics globally by counting the total true
positives, false negatives and false positives.
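To see how much the averaging choice alone can move the number, here is a small sketch on hand-made labels (the label vectors are invented for illustration). For single-label problems, micro-averaged F1 is the same quantity as accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Default average='binary': only the positive class (label 1) is scored.
# TP=2, FP=1, FN=2, so F1 = 2*TP / (2*TP + FP + FN) = 4/7.
f1_binary = f1_score(y_true, y_pred)

# Global counts over both classes; equals accuracy here (7 of 10 correct).
f1_micro = f1_score(y_true, y_pred, average="micro")

# Unweighted mean of the two per-class F1 scores.
f1_macro = f1_score(y_true, y_pred, average="macro")

accuracy = accuracy_score(y_true, y_pred)
print(f1_binary, f1_micro, f1_macro, accuracy)
```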

How to set a threshold for a sklearn classifier based on ROC results?

I trained an ExtraTreesClassifier (gini index) using scikit-learn and it suits my needs fairly well. The accuracy is not so good, but using 10-fold cross-validation, the AUC is 0.95. I would like to use this classifier in my work. I am quite new to ML, so please forgive me if I'm asking something conceptually wrong.
I plotted some ROC curves, and from them it seems there is a specific threshold where my classifier starts performing well. I'd like to set this value on the fitted classifier, so that every time I call predict, the classifier uses that threshold and I can rely on the FP and TP rates.
I also came across this post (scikit .predict() default threshold), where it's stated that a threshold is not a generic concept for classifiers. But since the ExtraTreesClassifier has the method predict_proba, and the ROC curve is also defined in terms of thresholds, it seems to me I should be able to specify it.
I did not find any parameter, nor any class/interface to use to do it. How can I set a threshold for it for a trained ExtraTreesClassifier (or any other one) using scikit-learn?
Many Thanks,
Colis

This is what I have done:
model = SomeSklearnModel()
model.fit(X_train, y_train)
predict = model.predict(X_test)
predict_probabilities = model.predict_proba(X_test)
# roc_curve expects one score per sample: take the positive-class column.
fpr, tpr, _ = roc_curve(y_test, predict_probabilities[:, 1])
However, I am annoyed that predict chooses a threshold corresponding to 0.4% of true positives (false positives are zero). The ROC curve shows a threshold I like better for my problem where the true positives are approximately 20% (false positive around 4%). I then scan the predict_probabilities to find what probability value corresponds to my favourite ROC point. In my case this probability is 0.21. Then I create my own predict array:
predict_mine = np.where(predict_probabilities[:, 1] > 0.21, 1, 0)
and there you go:
confusion_matrix(y_test, predict_mine)
returns what I wanted:
array([[6927, 309],
[ 621, 121]])
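The manual scan for the probability value can also be done with the thresholds that roc_curve returns. The sketch below is one way to do it under stated assumptions: the data is synthetic, the 4% false-positive budget (max_fpr) is a hypothetical parameter, and the threshold is chosen on a separate validation split rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
proba_pos = model.predict_proba(X_val)[:, 1]  # positive-class probability

fpr, tpr, thresholds = roc_curve(y_val, proba_pos)

# Among the ROC points within the false-positive budget, take the one
# with the highest true-positive rate.
max_fpr = 0.04  # hypothetical budget
candidates = np.where(fpr <= max_fpr)[0]
best = candidates[np.argmax(tpr[candidates])]
threshold = thresholds[best]

# roc_curve counts a sample as positive when its score is >= the threshold.
predict_mine = (proba_pos >= threshold).astype(int)
print("threshold:", threshold)
print(confusion_matrix(y_val, predict_mine))
```

scikit-learn classifiers do not store such a threshold themselves, so the comparison against proba_pos has to be repeated (or wrapped in a small helper) wherever predictions are made.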

It's difficult to provide an exact answer without any specific code examples. If you're already doing cross validation, you might consider specifying the AUC as the parameter to optimize:
from sklearn.model_selection import KFold, cross_val_score
shuffle = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(classifier, X_train, y_train, cv=shuffle, scoring='roc_auc')

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working in a sentiment analysis problem the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit's SVC. The problem is that I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:
First:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)
from sklearn.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, \
accuracy_score, f1_score
print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
0.930416613529
However, I'm getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal correctly with my unbalanced data in order to compute in the right way classifier's metrics?

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the class_weight parameter are used to train the classifier.
They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.
Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
The metrics
Once you have a classifier, you want to know how well it is performing.
Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...
Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.
             precision    recall  f1-score   support
          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50
The warning
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The
default `weighted` averaging is deprecated, and from version 0.18,
use of precision, recall or F-score with multiclass or multilabel data
or pos_label=None will result in an exception. Please set an explicit
value for `average`, one of (None, 'micro', 'macro', 'weighted',
'samples'). In cross validation use, for instance,
scoring="f1_weighted" instead of scoring="f1".
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed!
The question could be rephrased: from the above classification report, how do you output one global number for the f1-score?
You could:
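Take the average of the f1-score for each class: this is called macro averaging. (The avg / total line above is actually weighted by support, but the two nearly coincide here because the classes are almost equally sized.)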
Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.
These are 3 of the options in scikit-learn; the warning is there to say you have to pick one. So you have to specify an average argument for the scoring function.
Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you'll get more importance for the class 5.
The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.
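As a quick check of the macro and weighted options, both can be recomputed by hand from the per-class rows of the classification report printed earlier. Micro averaging is left out because it needs the raw true-positive and false-negative counts, which the rounded report does not provide.

```python
import numpy as np

# Per-class f1-score and support copied from the classification report above.
f1_per_class = np.array([0.79, 0.65, 0.10])
support = np.array([17, 16, 17])

# Macro: plain mean, every class counts equally.
macro_f1 = f1_per_class.mean()

# Weighted: mean weighted by the number of samples in each class.
weighted_f1 = np.average(f1_per_class, weights=support)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.51 0.51
```

The two agree here only because the three classes have almost the same support; with the 1190-versus-127 imbalance from the question they would diverge.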
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen.
This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
# The splitter is configured first; split(X, y) then yields index arrays.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
# Any classifier works here; a linear SVC is assumed.
svc = SVC(kernel='linear')
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

Lot of very detailed answers here but I don't think you are answering the right questions. As I understand the question, there are two concerns:
How do I score a multiclass problem?
How do I deal with unbalanced data?
1.
You can use most of the scoring functions in scikit-learn with multiclass problems as well as with binary problems. For example:
from sklearn.metrics import precision_recall_fscore_support as score
predicted = [1,2,3,4,5,1,2,1,1,4,5]
y_test = [1,2,3,4,5,1,2,1,1,4,1]
precision, recall, fscore, support = score(y_test, predicted)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
This way you end up with tangible and interpretable numbers for each of the classes.
| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1 | 94% | 83% | 0.88 | 204 |
| 2 | 71% | 50% | 0.54 | 127 |
| ... | ... | ... | ... | ... |
| 4 | 80% | 98% | 0.89 | 838 |
| 5 | 93% | 81% | 0.91 | 1190 |
Then...
2.
... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread.
However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.

Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure.
Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as in binary classification, they are already incorporated into the F-measure, and extra metrics complicate method comparison, parameter tuning, and so on.
Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, micro-averaged metrics still show good results.
Weighted averaging isn't well suited to imbalanced data, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such averaging in the following very detailed survey, which I strongly recommend looking through:
Sokolova, Marina, and Guy Lapalme. "A systematic analysis of
performance measures for classification tasks." Information Processing
& Management 45.4 (2009): 427-437.
Application-specific question
However, returning to your task, I'd research 2 topics:
metrics commonly used for your specific task: this lets you (a) compare your method with others and understand whether you are doing something wrong, and (b) reuse someone else's findings instead of exploring this yourself;
cost of the different errors of your method: for example, the use case of your application may rely on 4- and 5-star reviews only; in that case, a good metric should count only these 2 labels.
Commonly used metrics.
As I can infer after looking through literature, there are 2 main evaluation metrics:
Accuracy, which is used, e.g. in
Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using
Yelp Business."
(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.
Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class
relationships for sentiment categorization with respect to rating
scales." Proceedings of the 43rd Annual Meeting on Association for
Computational Linguistics. Association for Computational Linguistics,
2005.
(link)
MSE (or, less often, Mean Absolute Error - MAE) - see, for example,
Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with
restaurant reviews." Final Projects from CS N 224 (2010).
(link) - they explore both accuracy and MSE, considering the latter to be better
Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining
the Stars: Weighted Multiple-Instance Learning for Aspect-Based
Sentiment Analysis." Proceedings of the 2014 Conference on Empirical
Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.
(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.
Cost of different errors.
If you care more about avoiding gross blunders, e.g. assigning 1 star to a 5-star review or something like that, look at MSE;
if difference matters, but not so much, try MAE, since it doesn't square diff;
otherwise stay with Accuracy.
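A small sketch of how the three metrics rank errors differently on ordinal star ratings. The ratings and the two sets of predictions are invented for illustration.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Hypothetical true star ratings and two hypothetical models.
y_true = [5, 5, 4, 1, 3, 2]
near_miss = [4, 4, 5, 2, 2, 3]   # always wrong, but only by one star
gross_miss = [5, 5, 4, 5, 3, 2]  # one blunder: a 1-star review rated 5 stars

results = {}
for name, y_pred in [("near_miss", near_miss), ("gross_miss", gross_miss)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
    }
    print(name, results[name])
```

Accuracy and MAE both prefer gross_miss (accuracy 5/6 against 0, MAE 2/3 against 1), while MSE prefers near_miss (1 against 8/3) because squaring makes the single four-star error dominate.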
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.

First of all, it's hard to tell from counts alone whether your data is unbalanced. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and decide wisely.
Okay, what if it's really unbalanced?
Once again: look at your data. Sometimes you will find one or two observations duplicated hundreds of times. Sometimes it's useful to create such synthetic single-class observations yourself.
If all the data is clean, the next step is to use class weights in the prediction model.
So what about multiclass metrics?
In my experience none of your metrics is usually used. There are two main reasons.
First: it's always better to work with probabilities than with hard predictions (how else could you tell apart models that predict the same class with probabilities 0.9 and 0.6?)
And second: it's much easier to compare your prediction models and build new ones depending on only one good metric.
From my experience I could recommend logloss or MSE (or just mean squared error).
How to fix sklearn warnings?
Simply (as yangjie noticed) set the average parameter to one of these
values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and take their unweighted mean) or 'weighted' (like macro, but weighted by class support).
f1_score(y_test, prediction, average='weighted')
All your Warnings came after calling metrics functions with default average value 'binary' which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!
Edit:
I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there is no such thing as multiclass regression. Yes, there is multilabel regression, which is quite different, and yes, in some cases it's possible to switch between regression and classification (if the classes are somehow ordered), but that is pretty rare.
What I would recommend (within scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.
After that you can calculate the arithmetic or geometric mean of their predictions, and most of the time you'll get an even better result.
final_prediction = (KNNprediction * RFprediction) ** 0.5
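The one-line formula above can be fleshed out as follows, assuming (the answer does not say) that KNNprediction and RFprediction are per-class probability arrays from predict_proba. The synthetic data, the clipping constant eps and the renormalization step are additions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data (assumption, for illustration only).
X, y = make_classification(
    n_samples=600, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

knn = KNeighborsClassifier().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

knn_proba = knn.predict_proba(X_test)
rf_proba = rf.predict_proba(X_test)

# Geometric mean of the two probability arrays. Clipping avoids rows that
# collapse to all zeros when one model assigns probability 0 to a class.
eps = 1e-6
geo = np.sqrt(np.clip(knn_proba, eps, 1.0) * np.clip(rf_proba, eps, 1.0))
geo /= geo.sum(axis=1, keepdims=True)  # renormalize so each row sums to 1

# Both models were fit on the same labels, so their class order matches.
final_prediction = rf.classes_[geo.argmax(axis=1)]
print("ensemble log loss:", log_loss(y_test, geo, labels=rf.classes_))
```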

Linear Discriminant Analysis

I am using sklearn.lda for a classification purpose and was a little puzzled about the score function that prints the mean classification error.
Is it determined by leave one out - jackknife?
How do I interpret the result? It's only a float value without much documentation.
Thanks in advance,
EL

The score method takes samples X and their true labels y and compares its own predictions with y. It returns the mean accuracy, which is always a single figure. For example,
lda = LDA().fit(X, y)
print(lda.score(X, y))
will print the accuracy of the classifier on its own training set.
Every classifier has a score method, which usually (though not necessarily) returns mean accuracy. The method is used by the GridSearchCV model selection algorithm to determine the quality of the classifier if you don't explicitly give it a scoring argument.
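A short sketch showing that score is simply the accuracy on whatever data it is given, with no leave-one-out or jackknife involved. It uses LinearDiscriminantAnalysis from sklearn.discriminant_analysis, the current home of the LDA class, on synthetic data; a held-out estimate has to be requested separately, for example with cross_val_score.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic binary data (assumption, for illustration only).
X, y = make_classification(n_samples=300, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)

# score(X, y) predicts on X and returns the fraction of correct labels.
train_accuracy = lda.score(X, y)
same_value = accuracy_score(y, lda.predict(X))

# Accuracy on held-out folds; cross_val_score uses the estimator's
# score method when no scoring argument is given.
cv_accuracy = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()

print(train_accuracy, same_value, cv_accuracy)
```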
