Scikit-learn and Yellowbrick giving different scores - python

I am using sklearn to compute the average precision and roc_auc of a classifier, and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the two packages give different scores for both metrics and I do not know which one is correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# generate the data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed).fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
The code produces the following output:
As can be seen above, the metrics take different values depending on the package. The print statements show the values computed by scikit-learn, whereas the plots are annotated with the values computed by yellowbrick.

Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score: array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
where non-thresholded means precisely that the values are not hard classes. The same holds for roc_auc_score (docs).
Correcting this with the following code makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_pred = clf_lr.predict_proba(X_test)  # get probabilities
y_prob = y_pred[:, 1]  # keep the probability of the positive class (1)
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
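To make the point about non-thresholded scores concrete, here is a minimal sketch (not part of the original answer) that scores the same fitted model three ways. The variable names `auc_hard`, `auc_proba` and `auc_decision` are only illustrative. Since ROC AUC depends only on how the test samples are ranked, the probability-based and `decision_function`-based values should coincide, while the value computed from hard labels is a different quantity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

seed = 42
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed)
clf = LogisticRegression(random_state=seed).fit(X_train, y_train)

# Hard 0/1 labels: every test sample collapses onto one of two score values.
auc_hard = roc_auc_score(y_test, clf.predict(X_test))
# Probability of the positive class: a proper non-thresholded score.
auc_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
# Signed distance to the decision boundary: same ranking as the probabilities.
auc_decision = roc_auc_score(y_test, clf.decision_function(X_test))

print(f"AUC from hard labels:       {auc_hard}")
print(f"AUC from predict_proba:     {auc_proba}")
print(f"AUC from decision_function: {auc_decision}")
```

Either of the last two is a reasonable input for roc_auc_score and average_precision_score; the first is the mistake discussed above.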
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one might expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88


Can't replicate SKLearn Multi-layer Perceptron Loss function calculation

I am trying to use SKLearn's MLPRegressor model, however, I am unable to replicate the final loss value that the model is providing.
I have turned off regularisation by setting alpha to 0, and turned off early-stopping and validation. But my loss calculation differs from what SKLearn is providing by about 1%.
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
def squared_loss(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean() / 2
X, y = make_regression(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
regr = MLPRegressor(random_state=1, max_iter=500, alpha=0, validation_fraction=0).fit(X_train, y_train)
# sklearn loss calc
regr.loss_ # 354.05262744103453
# manual loss calc
squared_loss(y_train, regr.predict(X_train)) # 350.25399153165534
Any pointers would be great. I've taken my squared_loss function directly from the SKLearn docs and I can't find anything else that could be causing the difference when going through the code, but I must be missing something. My guess is that there is some sample-size-related adjustment somewhere, because whatever value I change the first 'random_state' to, the difference between the two calculations is always around 1%.
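One possible explanation, offered as an assumption about the implementation rather than something confirmed in this thread: with the default stochastic solver, loss_ appears to be accumulated over the mini-batches of the last epoch while the weights are still being updated, so it need not equal the loss of the final weights on the full training set. A rough way to probe this is to switch to the full-batch solver="lbfgs", where the reported loss should be the objective at the returned weights. The sketch below does only that; the solver change is my choice and is not in the question.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor


def squared_loss(y_true, y_pred):
    # Same definition as in the question: half the mean squared error.
    return ((y_true - y_pred) ** 2).mean() / 2


X, y = make_regression(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Full-batch solver (assumption: chosen here only to remove mini-batch effects).
regr = MLPRegressor(solver="lbfgs", alpha=0, random_state=1, max_iter=500)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence a possible convergence warning
    regr.fit(X_train, y_train)

manual = squared_loss(y_train, regr.predict(X_train))
print("loss_ :", regr.loss_)
print("manual:", manual)
```

If the two numbers agree here but not with the default solver, that would point to the mini-batch accumulation rather than to a sample-size adjustment.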

ValueError in kNN metrics

I have a project that consists of applying the kNN algorithm to a CSV file and showing selected metrics. But when I try to display some of the metrics, it throws a few errors.
When trying to use sensitivity, F1 score and precision:
Sensitivity - print(metrics.recall_score(y_test, y_pred_class))
F1 score - print(metrics.f1_score(y_test, y_pred_class))
Precision - print(metrics.precision_score(y_test, y_pred_class))
PyCharm throws the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting
The error when trying to print the ROC curve is a little different:
ValueError: multiclass format is not supported
import matplotlib
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib.dviread import Text
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Tools for testing
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
def main():
    dataset = pd.read_csv('filetestKNN.csv')
    X = dataset.drop(columns=['Label'])
    y = dataset['Label'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.34)
    Classifier = KNeighborsClassifier(n_neighbors=2, p=2, metric='euclidean')
    Classifier.fit(X_train, y_train)
    y_pred_class = Classifier.predict(X_test)
    y_pred_prob = Classifier.predict_proba(X_test)[:, 1]
    accuracy = Classifier.score(X_test, y_test)
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    print(metrics.accuracy_score(y_test, y_pred_class))
    print("Classification Error")
    print(1 - metrics.accuracy_score(y_test, y_pred_class))
    print("Confusion matrix")
    print(metrics.confusion_matrix(y_test, y_pred_class))
    print(metrics.recall_score(y_test, y_pred_class))
    print(metrics.roc_curve(y_test, y_pred_class))
    print(metrics.f1_score(y_test, y_pred_class))
    print(metrics.precision_score(y_test, y_pred_class))
I just wanted to show the algorithm metrics on the screen.
You need to set the average keyword argument of these sklearn.metrics functions. For example, look at the documentation of f1_score. Here is the part corresponding to the average keyword argument:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
‘binary’: Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
‘micro’: Calculate metrics globally by counting the total true positives, false negatives and false positives.
‘macro’: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
‘weighted’: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
‘samples’: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
Here we can see that this describes how results are aggregated over the different labels on your multiclass task. I'm not sure which one you'd like to use, but micro seems nice. Here's how your call to f1_score would look with this choice:
print(metrics.f1_score(y_test, y_pred_class, average='micro'))
You can adjust the other metrics similarly. Hope this helps.
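As a rough sketch of what the different average settings look like in practice, the snippet below uses the Iris data as a stand-in for the CSV from the question (the dataset, n_neighbors=5 and the variable names are my assumptions). It also shows one way around the ROC error: roc_curve itself only handles binary targets, but roc_auc_score accepts the full probability matrix when multi_class is set.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in three-class data (assumption: replaces 'filetestKNN.csv').
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_proba = knn.predict_proba(X_test)  # one column per class

accuracy = accuracy_score(y_test, y_pred)
f1_micro = f1_score(y_test, y_pred, average="micro")
f1_macro = f1_score(y_test, y_pred, average="macro")
precision_weighted = precision_score(y_test, y_pred, average="weighted")
recall_per_class = recall_score(y_test, y_pred, average=None)  # one value per class
# One-vs-rest AUC, averaged over the classes, from the full probability matrix.
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")

print("accuracy          :", accuracy)
print("F1 (micro)        :", f1_micro)
print("F1 (macro)        :", f1_macro)
print("precision (weighted):", precision_weighted)
print("recall per class  :", recall_per_class)
print("ROC AUC (ovr)     :", auc_ovr)
```

Note that for single-label multiclass problems the micro-averaged F1 is the same number as the accuracy, so macro or weighted may be more informative if you already print the accuracy.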

How to get accuracy for all the predicted class labels

How can I find the overall accuracy of the outputs that we get by running a decision tree algorithm? I am able to get the top five class labels for the active user input, but I am only getting the accuracy for the X_train and y_train dataset using accuracy_score(). Suppose I get the top five recommendations. I wish to get the accuracy for each class label and, with the help of these, the overall accuracy for the output. Please suggest some ideas.
My Python script is here:
Here, event holds the different class labels.
DTC = DecisionTreeClassifier().fit(X_train, y_train)
print("output from DTC:")
#Here I got the index value of top five probabilities
index=sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
Here is the sample code with which I tried to get the accuracy for the predicted class labels:
Here, index holds the indices of the top five class-label probabilities and event holds the different class labels.
for i in index:,y_train)
Since you have a multi-class classification problem, you can calculate accuracy of the classifier by using the confusion_matrix function in Python.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now, to calculate the overall accuracy, use the confusion matrix:
conf_mat = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predicted labels
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
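The question also asks for a figure per class label. One common reading of "accuracy for each class" is the per-class recall, i.e. the diagonal entry divided by the number of true samples of that class (the row sum). The sketch below shows that reading on the same Iris setup; the names `per_class` and `overall` are mine.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
y_pred = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train).predict(X_test)

# Rows are true labels, columns are predicted labels.
conf_mat = confusion_matrix(y_test, y_pred)

# Fraction of each true class that was predicted correctly (per-class recall).
per_class = conf_mat.diagonal() / conf_mat.sum(axis=1)
# Fraction of all samples that were predicted correctly.
overall = conf_mat.diagonal().sum() / conf_mat.sum()

for name, acc in zip(iris.target_names, per_class):
    print(f"{name}: {acc * 100:.1f} %")
print(f"Overall accuracy: {overall * 100:.1f} %")
```

The overall value is the same number that accuracy_score(y_test, y_pred) returns, so the confusion matrix is mainly useful for the per-class breakdown.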

Difference Between Python's Functions `cls.score` and `cls.cv_result_`

I have written code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have used GridSearchCV() and train_test_split() to search over parameters and split the input data.
My goal is to find the overall (average) accuracy over the 10 folds, with a standard error, on the test data. Additionally, I try to predict class labels, create a confusion matrix and prepare a classification report summary.
Please advise me on the following:
(1) Is my code correct? Please check each part.
(2) I have tried two different sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summaries are not included.)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Load any n x m data and label column. No missing or NaN values.
# I am skipping loading data part. One can load any data to test below code.
sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}
if __name__ == '__main__':
    clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
    X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)
    # Train the classifier on data1's feature and target data
    clf.fit(X_train, y_train)
    print("Accuracy on training set: {:.2f}% \n".format((clf.score(X_train, y_train))*100))
    print("Accuracy on test set: {:.2f}%\n".format((clf.score(X_test, y_test))*100))
    print("Best Parameters: ")
    print(clf.best_params_)
    # Alternately using cv_results_ (arrays with one mean value per parameter setting;
    # newer scikit-learn versions need return_train_score=True for mean_train_score)
    print("Accuracy on training set (%):", clf.cv_results_['mean_train_score'] * 100)
    print("Accuracy on test set (%):", clf.cv_results_['mean_test_score'] * 100)
    # Predict class labels
    y_pred = clf.best_estimator_.predict(X_test)
    # Confusion Matrix
    class_names = ['Positive', 'Negative']
    confMatrix = confusion_matrix(y_test, y_pred)
    # Accuracy Report (y_pred covers the test split only, so compare against y_test)
    classificationReport = classification_report(y_test, y_pred, target_names=class_names)
I will appreciate any advice.
First of all, the desired metric, i.e. accuracy, is already the default scorer of LogisticRegression(). Thus, we may omit the scoring='accuracy' parameter of GridSearchCV().
Secondly, the method score(X, y) returns the value of the chosen metric only if the classifier has been refit with the best_estimator_ after searching over all the options taken from param_grid. It works here because you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print the metric averaged over the folds, but rather the score of the best model on the data you pass in.
Thirdly, the attribute cv_results_ is a much broader summary, as it includes the results of each fit. It also reports the averaged results, obtained by averaging the per-fold results (e.g. mean_test_score). These are the values that you wish to store.
Quick Example
Let me hereby introduce a toy example for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
                   param_grid=param_grid).fit(X_train, y_train)
clf.best_estimator_.score(X_train, y_train)
clf.cv_results_
This code yields the following:
0.98107957707289928 # which is the best possible accuracy score
{'mean_fit_time': array([ 0.15465896, 0.23701136]),
'mean_score_time': array([ 0.0006465 , 0.00065773]),
'mean_test_score': array([ 0.934335 , 0.9376739]),
'mean_train_score': array([ 0.96475625, 0.98225632]),
'param_C': masked_array(data = [0.001 0.01],
'params': ({'C': 0.001}, {'C': 0.01})
mean_train_score has two mean values because the grid covers two options for the C parameter.
I hope that helps!
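Since the question asks for the average accuracy over the 10 folds together with a standard error, here is a hedged sketch of how those numbers could be read out of cv_results_. It assumes std_test_score is the standard deviation of the per-fold scores (computed without Bessel's correction, as far as I know), so dividing by the square root of the number of folds gives only an approximate standard error. The pipeline, max_iter=1000 and the variable names are my choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

n_folds = 10
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf = GridSearchCV(pipe, {"logisticregression__C": [0.001, 0.01]},
                   cv=n_folds, refit=True)
clf.fit(X_train, y_train)

best = clf.best_index_  # position of the winning parameter setting
mean_cv = clf.cv_results_["mean_test_score"][best]  # mean accuracy over the folds
std_cv = clf.cv_results_["std_test_score"][best]    # spread of the fold accuracies
sem_cv = std_cv / np.sqrt(n_folds)                   # approximate standard error

# Accuracy of the refit best model on the held-out test split.
test_acc = clf.score(X_test, y_test)

print(f"CV accuracy: {mean_cv:.4f} +/- {sem_cv:.4f} (standard error)")
print(f"Held-out test accuracy: {test_acc:.4f}")
```

The cross-validated mean and the held-out score answer different questions, which is why clf.score() and cv_results_ do not agree; neither is wrong.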

Evaluating Logistic regression with cross validation

I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).
These concepts are totally new to me and I am not very sure if I am doing it right. I would be grateful if anyone could advise me on the right steps to take where I have gone wrong. Part of my code is shown below.
Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?
Thank you
import pandas as pd
Data=pd.read_csv ('C:\\Dataset.csv',index_col='SNo')
Y1=Data['Status1'] # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted)
from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())
from nltk import ConfusionMatrix
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))
# sensitivity:
print (metrics.recall_score(y, predicted) )
import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print (ConfusionMatrix(list(y), list(preds)))
# check accuracy, sensitivity, specificity
print (metrics.accuracy_score(y, predicted))
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# calculate AUC
print (metrics.roc_auc_score(y, probs))
# use AUC as evaluation metric for cross-validation
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
You got it almost right. cross_val_predict gives you predictions for the entire dataset. You just need to remove the logreg.fit call earlier in the code. Specifically, what it does is the following:
It divides your dataset into n folds and in each iteration it leaves one of the folds out as the test set and trains the model on the rest of the folds (n-1 folds). So, in the end you will get predictions for the entire data.
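The fold-by-fold procedure described above can be written out by hand. The sketch below is only an illustration under a few assumptions: it uses the current sklearn.model_selection module, stratified folds (which, as far as I know, is what an integer cv means for a classifier), and max_iter=1000 to avoid convergence warnings. It then checks the manual loop against cross_val_predict.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Manual version: each sample is predicted exactly once, by a model that
# never saw it during training.
manual = np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual[test_idx] = fold_model.predict(X[test_idx])

# Library version of the same procedure.
predicted = cross_val_predict(model, X, y, cv=cv)

print("Identical predictions:", np.array_equal(manual, predicted))
print("Cross-validated accuracy:", (predicted == y).mean())
```

Because every fold trains its own copy of the model, nothing needs to be fitted beforehand.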
Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y
In [15]: iris['data'].shape
Out[15]: (150, 4)
To get predictions on the entire set with cross validation you can do the following:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, datasets
# cross_val_predict lived in sklearn.cross_validation before that module was removed
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
predicted = cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print(metrics.accuracy_score(iris['target'], predicted))
Out [1] : 0.9537
print(metrics.classification_report(iris['target'], predicted))
Out [2] :
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        50
          1       0.96      0.90      0.93        50
          2       0.91      0.96      0.93        50
avg / total       0.95      0.95      0.95       150
So, back to your code. All you need is this:
from sklearn import metrics
from sklearn.model_selection import cross_val_predict  # formerly sklearn.cross_validation
predicted = cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))
For plotting ROC curves in multi-class classification, you can follow this tutorial.
In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.
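For the part of the question about drawing several ROC curves on one graph, here is a minimal sketch under stated assumptions. The data are synthetic, "Status1" and "Status2" are hypothetical stand-ins for the label columns in the question (the second is just a noisy copy of the first), and plot_curves is a helper name I made up. The probabilities come from cross_val_predict with method="predict_proba", so each curve is built from out-of-fold predictions for the entire dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption: replaces Dataset.csv).
X, y1 = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(y1.shape[0]) < 0.1
y2 = np.where(flip, 1 - y1, y1)  # second label column: 10% of labels flipped
targets = {"Status1": y1, "Status2": y2}

curves = {}
for name, target in targets.items():
    # Out-of-fold probability of the positive class for every sample.
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, target,
                              cv=10, method="predict_proba")[:, 1]
    fpr, tpr, _ = roc_curve(target, probs)
    curves[name] = (fpr, tpr, roc_auc_score(target, probs))
    print(f"{name}: cross-validated AUC = {curves[name][2]:.3f}")


def plot_curves(curves):
    """Draw all stored ROC curves on the same axes (hypothetical helper)."""
    import matplotlib.pyplot as plt
    for name, (fpr, tpr, auc) in curves.items():
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

# plot_curves(curves)  # uncomment to display the figure
```

Calling plt.plot repeatedly before plt.show() is all it takes to overlay the curves; the legend then tells them apart.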

