I was trying to plot a confusion matrix nicely, so I used the built-in plot_confusion_matrix function introduced in scikit-learn 0.22. However, one of the values in my confusion matrix is 153, and it appears as 1.5e+02 in the plot:
Following the scikit-learn documentation, I found the values_format parameter, but I do not know how to set it so that scientific notation is suppressed. My code is as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
# import some data to play with
X = pd.read_csv("datasets/X.csv")
y = pd.read_csv("datasets/y.csv")
class_names = ['Not Fraud (positive)', 'Fraud (negative)']
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(logreg, X_test, y_test,
                                 display_labels=class_names,
                                 cmap=plt.cm.Greens,
                                 normalize=normalize, values_format = '{:.5f}'.format)
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)
plt.show()
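values_format expects a format specification string (the part that goes after the colon inside the braces), not a bound str.format method. Just remove ".format", the {} braces and the colon from the argument: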
disp = plot_confusion_matrix(logreg, X_test, y_test,
display_labels=class_names,
cmap=plt.cm.Greens,
normalize=normalize, values_format = '.5f')
In addition, you can use '.5g' to avoid trailing zeros after the decimal point.
Taken from source
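To see why 153 turned into 1.5e+02, it can help to try the format specifications directly with Python's built-in format(), which is, as far as I can tell, how the plotting code renders each cell. The sketch below assumes the default specification in scikit-learn 0.22 was '.2g' (two significant digits); the value 153 is the one from the question.

```python
# Each confusion-matrix cell is rendered roughly as format(value, values_format).
value = 153

# Assumed default in scikit-learn 0.22: two significant digits.
print(format(value, '.2g'))   # 1.5e+02

# Fixed-point with five decimals, as in the corrected call above.
print(format(value, '.5f'))   # 153.00000

# Up to five significant digits, no trailing zeros.
print(format(value, '.5g'))   # 153

# Plain integer formatting (only valid for integer counts).
print(format(value, 'd'))     # 153
```

The 'd' specification raises an error on floats, so it would only suit the non-normalized matrix.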
If you are using seaborn's heatmap to plot the confusion matrix and none of the answers above work for you, turn off scientific notation with fmt='g', like so:
sns.heatmap(conf_matrix,annot=True, fmt='g')
Simply pass values_format=''
Example:
plot_confusion_matrix(clf, X_test, Y_test, values_format = '')
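One caveat, sketched in plain Python under the assumption that each cell is rendered with the built-in format(): an empty specification behaves like str(), so integer counts print cleanly, but a normalized matrix will show every available decimal.

```python
# An empty format spec is equivalent to str(value).
count = format(153, '')
quarter = format(0.25, '')
third = format(1 / 3, '')
third_rounded = format(1 / 3, '.2f')

print(count)          # 153
print(quarter)        # 0.25
print(third)          # 0.3333333333333333
print(third_rounded)  # 0.33
```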
I recently discovered this amazing library for ML interpretability. I decided to build a simple gradient boosting classifier (scikit-learn's GradientBoostingClassifier) on a toy dataset from sklearn and to draw a force_plot.
To understand the plot the library says:
The above explanation shows features each contributing to push the
model output from the base value (the average model output over the
training dataset we passed) to the model output. Features pushing the
prediction higher are shown in red, those pushing the prediction lower
are in blue (these force plots are introduced in our Nature BME
paper).
So it looks to me as though the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, this is not the case when looking at the plot; the number is not even within [0,1]. I tried taking the log in different bases (10, e, 2), assuming it would be some kind of monotonic transformation, but still no luck. How can I get to this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
To get base_value in raw space (when link="identity") you need to unwind class labels --> to probabilities --> to raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the predicted probability:
import numpy as np
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# raw scores (log-odds), default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
The relevant plot for 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522
The relevant plot for 0th data point in probability space:
Note that the probability base_value from shap's perspective, which they call the baseline probability if no data is available, is not what a reasonable person would define as the prediction with no independent variables (0.6373626373626373 in this case).
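The gap between the two numbers comes from where the averaging happens: shap's base value averages the raw scores (log-odds) and only then applies the sigmoid, which in general differs from averaging probabilities or class labels. A minimal sketch with made-up probabilities (the four values below are illustrative, not taken from the breast cancer model):

```python
import math

def logit(p):
    """Inverse sigmoid: probability -> log-odds."""
    return math.log(p / (1 - p))

def expit(z):
    """Sigmoid: log-odds -> probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical predicted probabilities for four training rows.
probs = [0.99, 0.98, 0.95, 0.10]

# Averaging in probability space.
mean_prob = sum(probs) / len(probs)

# Averaging in raw (log-odds) space, then mapping back with the sigmoid.
mean_raw = sum(logit(p) for p in probs) / len(probs)
base_prob = expit(mean_raw)

print(round(mean_prob, 3))
print(round(base_prob, 3))
```

The second number is noticeably larger than the first here, because the confident predictions dominate the average in log-odds space.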
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522
I'm working on a K-nearest neighbor classifier and I want to add a confusion matrix to my report.
neigh = KNeighborsClassifier(n_neighbors = 3)
neigh.fit(X_train,y_train)
y_pre = neigh.predict(X_train)
yhat = neigh.predict(X_test)
I tried using the confusion_matrix function like this:
confusion_matrix(y_test, yhat)
It works fine, but I wanted a more presentable representation, so I switched to plot_confusion_matrix:
matrix = plot_confusion_matrix(neigh, X_test, yhat)
It turns out that the values from these two functions are not equal. What happened here? Which result is right? And what should I do to fix this?
It looks like you are passing y_test together with yhat to confusion_matrix, but X_test together with yhat to plot_confusion_matrix. You need to check which arrays each function expects.
Take a look at the scikit-learn documentation (sklearn.metrics.plot_confusion_matrix)
yhat is the prediction on X_test, and you are passing it as the true labels to plot_confusion_matrix instead of passing y_test. All you need to do is change that line:
matrix = plot_confusion_matrix(neigh, X_test, y_test)
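To see why the two results differ: plot_confusion_matrix calls the estimator's predict on X_test itself and compares the result with whatever you pass as the third argument. If that argument is yhat, the predictions are compared with themselves, so every sample lands on the diagonal. A small pure-Python sketch with made-up labels (the confusion helper is a hypothetical stand-in for sklearn.metrics.confusion_matrix):

```python
def confusion(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical ground truth and model predictions for six test samples.
y_test = [0, 0, 1, 1, 1, 0]
yhat = [0, 1, 1, 1, 0, 0]

# Correct: true labels against predictions.
correct = confusion(y_test, yhat, labels=[0, 1])
print(correct)   # [[2, 1], [1, 2]]

# Mistake from the question: predictions compared with themselves.
mistaken = confusion(yhat, yhat, labels=[0, 1])
print(mistaken)  # [[3, 0], [0, 3]]
```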
I have tried to create a confusion matrix on a knn-classifier in python, but the labeled classes are wrong.
The classes attribute of the dataset is 2 (for benign) and 4 (for malignant), but when I plot the confusion matrix, all labels are 2. The code I use is:
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', axis=1, inplace=True)
X = np.array(data.drop(' class ', axis=1))
Y = np.array(data[' class '])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(clf, X_test, Y_test,
display_labels=Y,
cmap=plt.cm.Blues,)
The problem is that you're passing Y as the display_labels argument, where it should just be the target names used for plotting. As it stands, it uses the first two values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as those specified in labels if it is provided, so you just need:
from sklearn.metrics import plot_confusion_matrix
fig, ax = plt.subplots(figsize=(8,8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=np.unique(Y),
                             cmap=plt.cm.Blues, ax=ax)
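As a rough illustration with made-up data (the short Y list and the target_names mapping are hypothetical; the 2 = benign, 4 = malignant coding is from the question), the class labels and matching display names could be derived like this:

```python
# Hypothetical slice of the class column: 2 = benign, 4 = malignant.
Y = [2, 2, 4, 2, 4, 4, 2]

# The leading entries of Y, which is what ended up on the axes in the question.
print(Y[:2])  # [2, 2]

# The distinct classes in sorted order, like np.unique(Y) would return.
labels = sorted(set(Y))
print(labels)  # [2, 4]

# Optional human-readable names, in the same order as labels.
target_names = {2: 'benign', 4: 'malignant'}
display_labels = [target_names[label] for label in labels]
print(display_labels)  # ['benign', 'malignant']
```

A list like display_labels, with one entry per class, is the kind of value the display_labels argument is meant to receive.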
I have been playing around with lasso regression on polynomial features using the code below. My question is whether I should be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and the plot produced by the code below suggest not. I would appreciate any advice on why this is the case, or on whether I have fundamentally got something wrong. Thanks in advance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
    # reuse the transformer fitted on the training data
    X_test_poly = poly.transform(X_test.reshape(-1,1))
    #Lasso Regression Model
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)
    #No feature scaling
    linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
    #With feature scaling
    linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
    # Jupyter magic; remove this line when running as a plain script
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()
Your X range is around [0,10], so the polynomial features will have a much wider range (up to about 10^12 for degree 12). Without scaling, the weights on these features are already small (because the feature values are large), so the L1 penalty on them is negligible and Lasso does not need to set them to zero. If you scale the features, the weights needed for the same fit become much larger, and Lasso sets most of them to zero. That's why the scaled case predicts poorly (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than the optimal value for the scaled case and should be tuned. If you decrease alpha, you should get similar results for both cases.
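The effect can be reproduced by hand for a single feature. The sketch below is a simplification under stated assumptions: one feature, no intercept, division by the maximum instead of MinMaxScaler, and made-up numbers. The lasso_1d helper is hypothetical and uses the closed-form soft-thresholding solution of the objective documented for scikit-learn's Lasso.

```python
def lasso_1d(x, y, alpha):
    """Closed-form Lasso for one feature and no intercept.

    Minimizes (1 / (2 n)) * sum((y - w * x)^2) + alpha * |w|.
    """
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n  # mean of x * y
    z = sum(xi * xi for xi in x) / n                # mean of x squared
    # Soft-thresholding: shrink rho towards zero by alpha, clip at zero.
    if rho > alpha:
        return (rho - alpha) / z
    if rho < -alpha:
        return (rho + alpha) / z
    return 0.0

# Hypothetical large-valued feature (think of a high power of x) with a weak effect.
x_raw = [100.0, 200.0, 300.0, 400.0]
y = [1e-5 * xi for xi in x_raw]

# Same feature after dividing by its maximum, so it lies in (0, 1].
x_scaled = [xi / max(x_raw) for xi in x_raw]

w_raw = lasso_1d(x_raw, y, alpha=0.01)
w_scaled = lasso_1d(x_scaled, y, alpha=0.01)

print(w_raw)     # small but non-zero
print(w_scaled)  # 0.0
```

With the same alpha, the unscaled feature keeps a tiny weight that is barely penalized, while the scaled feature is dropped entirely; a smaller alpha would let it back in.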