How to interpret base_value of GBT classifier when using SHAP? - python

I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier using a toy dataset from sklearn and to draw a force_plot.
To understand the plot the library says:
The above explanation shows features each contributing to push the
model output from the base value (the average model output over the
training dataset we passed) to the model output. Features pushing the
prediction higher are shown in red, those pushing the prediction lower
are in blue (these force plots are introduced in our Nature BME
paper).
So it looks to me as the base_value should be the same as clf.predict(X_train).mean()which equals 0.637. However this is not the case when looking at the plot, the number is actually not even within [0,1]. I tried doing the log in different basis (10, e, 2) assuming it would be some kind of monotonic transformation... but still not luck. How can I get to this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])

To get base_value in raw space (when link="identity") you need to unwind class labels --> to probabilities --> to raw scores. Note, the default loss is "deviance", so the raw is inverse sigmoid:
# probabilites
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
The relevant plot for 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilites
y = clf.predict_proba(X_train)[:,1]
# exected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability spacy
print(expit(y_raw))
0.8875405774316522
The relevant plot for 0th data point in probability space:
Note, the probability base_value from shap's perspective, what they call a baseline probability if no data is available, is not what a reasonable person would define by having no independent variables (0.6373626373626373 in this case)
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilites
y = clf.predict_proba(X_train)[:,1]
# exected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability spacy
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522

Related

How can I explain a part of my predictions as a whole with SHAP values? (and not every single prediction)

Background information
I fit a classifier on my training data. When testing my fitted best estimator, I predict the probabilities for one of the classes. I order both my X_test and my y_test by the probabilites in a descending order.
Question
I want to understand which features were important (and to what extend) for the classifier to predict only the 500 predictions with the highest probability as a whole, not for each prediction. Is the following code correct for this purpose?
y_test_probas = clf.predict_proba(X_test)[:, 1]
explainer = shap.Explainer(clf, X_train) # <-- here I put the X which the classifier was trained on?
top_n_indices = np.argsort(y_test_probas)[-500:]
shap_values = explainer(X_test.iloc[top_n_indices]) # <-- here I put the X I want the SHAP values for?
shap.plots.bar(shap_values)
Unfortunately, the shap documentation (bar plot) does not cover this case. Two things are different there:
They use the data the classifier was trained on (I want to use the data the classifier is tested on)
They use the whole X and not part of it (I want to use only part of the data)
Minimal reproducible example
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the Titanic Survival dataset
data = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# Preprocess the data
data = data.drop(["Name"], axis=1)
data = data.dropna()
data["Sex"] = (data["Sex"] == "male").astype(int)
# Split the data into predictors (X) and response variable (y)
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a logistic regression classifier
clf = LogisticRegression().fit(X_train, y_train)
# Get the predicted class probabilities for the positive class
y_test_probas = clf.predict_proba(X_test)[:, 1]
# Select the indices of the top 500 test samples with the highest predicted probability of the positive class
top_n_indices = np.argsort(y_test_probas)[-500:]
# Initialize the Explainer object with the classifier and the training set
explainer = shap.Explainer(clf, X_train)
# Compute the SHAP values for the top 500 test samples
shap_values = explainer(X_test.iloc[top_n_indices, :])
# Plot the bar plot of the computed SHAP values
shap.plots.bar(shap_values)
I don't want to know how the classifier decides all the predictions, but on the predictions with the highest probability. Is that code suitable to answer this question? If not, how would a suitable code look like?

Why LightGBM Python-package gives bad prediction using for regression task?

I have a sample time-series dataset (23, 208), which is a pivot table count for 24hrs count for some users; I was experimenting with different regressors from sklearn which work fine (except for SGDRegressor()), but this LightGBM Python-package gives me very linear prediction as follows:
my tried code:
import pandas as pd
dff = pd.read_csv('ex_data2.csv',sep=',')
dff.set_index("timestamp",inplace=True)
print(dff.shape)
from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
#target_attribute,
test_size=0.2,
random_state=42,
#stratify=y,
shuffle=False)
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
username = 'MMC_HEC_LVP' # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
user_list.append(column)
index = user_list.index(username)
X_trainf = trainingSetf.iloc[:,:].values
y_trainf = trainingSetf.iloc[:,:].values
X_testf = testSetf.iloc[:,:].values
y_testf = testSetf.iloc[:,:].values
test_set_copy = y_testf.copy()
model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:,[index]] = pred_LGBMRegressor[:,[index]]
#plot the results for selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:,[index]],label='trainingSet')
plt.plot(testSetf.iloc[:,[index]],"--",label='testSet')
plt.plot(test_set_copy[:,[index]],'b--',label='RF_predict')
plt.legend()
So what I am missing is if I use default (hyper-)parameters?
Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides ability to the prevent creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes which match too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
I created the following minimal, reproducible example, using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2 demonstrating this behavior.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
n_samples=20,
n_informative=5,
n_features=5,
random_state=708
)
# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
num_boost_round=20,
verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000
# training fits and predicts well
reg = LGBMRegressor(
min_data_in_bin=1,
min_data_in_leaf=1,
num_boost_round=20,
verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that better fit to the provided data.
In this way the model is overfitted!
If you do a random split after creating the dataset and evaluate the model on the test dataset, you will notice that the performance is essentially the same or worse (as in this example).
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815
reg = LGBMRegressor(
min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759
reg = LGBMRegressor(
min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219

Understand SHAP values for classification [duplicate]

I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier using a toy dataset from sklearn and to draw a force_plot.
To understand the plot the library says:
The above explanation shows features each contributing to push the
model output from the base value (the average model output over the
training dataset we passed) to the model output. Features pushing the
prediction higher are shown in red, those pushing the prediction lower
are in blue (these force plots are introduced in our Nature BME
paper).
So it looks to me as the base_value should be the same as clf.predict(X_train).mean()which equals 0.637. However this is not the case when looking at the plot, the number is actually not even within [0,1]. I tried doing the log in different basis (10, e, 2) assuming it would be some kind of monotonic transformation... but still not luck. How can I get to this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])
To get base_value in raw space (when link="identity") you need to unwind class labels --> to probabilities --> to raw scores. Note, the default loss is "deviance", so the raw is inverse sigmoid:
# probabilites
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
The relevant plot for 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilites
y = clf.predict_proba(X_train)[:,1]
# exected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability spacy
print(expit(y_raw))
0.8875405774316522
The relevant plot for 0th data point in probability space:
Note, the probability base_value from shap's perspective, what they call a baseline probability if no data is available, is not what a reasonable person would define by having no independent variables (0.6373626373626373 in this case)
Full reproducible example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilites
y = clf.predict_proba(X_train)[:,1]
# exected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability spacy
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522

Should I use feature scaling with polynomial regression with scikit-learn?

I have been playing around with lasso regression on polynomial functions using the code below. The question I have is should I be doing feature scaling as part of the lasso regression (when attempting to fit a polynomial function). The R^2 results and plot as outlined in the code I have pasted below suggests not. Appreciate any advice on why this is not the case or if I have fundamentally stuffed something up. Thanks in advance for any advice.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics.regression import r2_score
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
global X_train, X_test, y_train, y_test
degrees = 12
poly = PolynomialFeatures(degree=degrees)
X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
#Lasso Regression Model
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
#No feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
y_test_lassopredict = linlasso.predict(X_test_poly)
Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
#With feature scaling
linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
%matplotlib notebook
plt.figure()
plt.scatter(X_test, y_test, label='Test data')
plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()```
Your X range is around [0,10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their larger values), so Lasso will not need to set them to zero. If you scale them, their weights will be much larger, and Lasso will set most of them to zero. That's why it has a poor prediction for the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.

How to get accuracy for all the predicted class labels

How can I find the overall accuracy of the outputs that we got by running a decision tree algorithm.I am able to get the top five class labels for the active user input but I am getting the accuracy for the X_train and Y_train dataset using accuracy_score().Suppose I am getting five top recommendation . I wish to get the accuracy for each class labels and with the help of these, the overall accuracy for the output.Please suggest some idea.
My python script is here:
here event is the different class labels
DTC= DecisionTreeClassifier()
DTC.fit(X_train_one_hot,y_train)
print("output from DTC:")
res=DTC.predict_proba(X_test_one_hot)
new=list(chain.from_iterable(res))
#Here I got the index value of top five probabilities
index=sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
print(event[i])
Here is the sample code which i tried to get the accuracy for the predicted class labels:
here index is the index for the top five probability of class label and event is the different class label.
for i in index:
DTC.fit(X_train_one_hot,y_train)
y_pred=event[i]
AC=accuracy_score((event,y_pred)*100)
print(AC)
Since you have a multi-class classification problem, you can calculate accuracy of the classifier by using the confusion_matrix function in Python.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now to calculate overall accuracy, use confusion matrix:
conf_mat = confusion_matrix(y_pred, y_test)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))

Categories

Resources