I have been trying to compute SHAP values for a Gradient Boosting Classifier using the H2O module in Python. Below is an example for the predict_contributions method, adapted from the documentation (https://github.com/h2oai/h2o-3/blob/master/h2o-py/demos/predict_contributionsShap.ipynb).
import h2o
import shap
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o import H2OFrame
# initialize H2O
h2o.init()
# load JS visualization code to notebook
shap.initjs()
# Import the prostate dataset
h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")
# Split the data into Train/Test/Validation with Train having 70% and test and validation 15% each
train,test,valid = h2o_df.split_frame(ratios=[.7, .15])
# Convert the response column to a factor
h2o_df["CAPSULE"] = h2o_df["CAPSULE"].asfactor()
# Generate a GBM model using the training dataset
model = H2OGradientBoostingEstimator(distribution="bernoulli",
                                     ntrees=100,
                                     max_depth=4,
                                     learn_rate=0.1)
model.train(y="CAPSULE", x=["AGE","RACE","PSA","GLEASON"], training_frame=h2o_df)
# calculate SHAP values using function predict_contributions
contributions = model.predict_contributions(h2o_df)
# convert the H2O Frame to use with shap's visualization functions
contributions_matrix = contributions.as_data_frame().to_numpy() # the original method is as_matrix()
# shap values are calculated for all features
shap_values = contributions_matrix[:,0:4]
# the expected value (bias term) is the last returned column
expected_value = contributions_matrix[:,4].min()
# force plot for one observation
X=["AGE","RACE","PSA","GLEASON"]
shap.force_plot(expected_value, shap_values[0,:], X)
The code above produces a force plot for one observation.
What does the output mean? Considering the problem above is a classification problem, the predicted value should be a probability (or even the predicted category, 0 or 1), right? Both the base value and the predicted value are negative.
Can anyone help me with this?
What you got is most likely log-odds, not a probability itself.
To get a probability, you need to transform each log-odds value back to probability space, i.e.
p = exp(x) / (1 + exp(x))
When you use SHAP directly, you can achieve this by specifying the model_output parameter:
shap.TreeExplainer(model, data, model_output='probability')
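Applied to the question's output, the transformation looks like this (a minimal sketch; expit from scipy is a numerically stable exp(x) / (1 + exp(x)), and the variable names are carried over from the code above):
from scipy.special import expit  # expit(x) = exp(x) / (1 + exp(x))

# Contributions and the bias term are in log-odds space, so only
# their sum maps meaningfully to a probability
predicted_probability = expit(shap_values[0, :].sum() + expected_value)
# Base rate implied by the bias term alone
base_probability = expit(expected_value)
print(predicted_probability, base_probability)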
I fitted an ARIMA model to a time series. Now I would like to use the model to forecast the next steps, for example one step ahead, given a certain input series.
Usually I find that fit.forecast() is used (as below), but this forecasts from the series the model was fitted on, while I want to get the forecast for a different part of the same series.
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(series, order=(2,0,0))
fit = model.fit()
forecast = fit.forecast()[0] # this forecasts the next value given the last 2 steps in 'series'
There are a variety of ways to use the model and fitted parameters to produce forecasts from (a) different starting points within the original dataset, (b) after adding new observations, or (c) a completely different dataset.
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(series, order=(2,0,0))
fit = model.fit()
# Forecast five steps from the end of `series`
fit.forecast(5)
# Forecast five steps starting after the tenth observation in `series`
# Note that the `dynamic=True` argument specifies that it only uses the
# actual data through the tenth observation to produce each of the
# five forecasts
fit.predict(10, 14, dynamic=True)
# Add new observations (`new_obs`) to the end of the dataset
# *without refitting the parameters* and then forecast
# five steps from the end of the new observations
fit_newobs = fit.append(new_obs, refit=False)
fit_newobs.forecast(5)
# Apply the model and the fitted parameters to an
# entirely different dataset (`series2`) and then forecast
# five steps from the end of that new dataset
fit_newdata = fit.apply(series2)
fit_newdata.forecast(5)
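For completeness, here is a minimal setup under which the snippet above runs; the three series are placeholder random walks, not data from the question:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy series to fit on, plus new observations and a second dataset
series = pd.Series(np.cumsum(rng.normal(size=100)))
new_obs = pd.Series(np.cumsum(rng.normal(size=10)) + series.iloc[-1],
                    index=range(100, 110))
series2 = pd.Series(np.cumsum(rng.normal(size=50)))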
You may find the following notebook helpful: https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_forecasting.html
I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier, Pool
train_pool = Pool(pd.DataFrame({'size': [1, 1, 2, 1],
                                'shape': ['square', 'square', 'square', 'circle']}),
                  [1, 1, 0, 1],
                  feature_names=['size', 'shape'],
                  cat_features=['shape'])
model = CatBoostClassifier(iterations=2,
                           cat_features=['shape'],
                           ctr_leaf_count_limit=1)
model.fit(train_pool, plot=False)
I would now like to run a function on the model object to obtain the following:
Numerical Feature size has minimum value 1 and max value 2 (this should be part of CatBoost's split logic for numerical features)
Categorical Feature shape has the following training values:
values=['square', None].
Notice that circle is not in values because ctr_leaf_count_limit=1 would have selected the most frequently occurring value, which in this case is 'square'. I've put None here because I'm pretty sure CatBoost will assign None to any unseen classes.
Next, I've chosen the above data example to make sure that CatBoost decides to split on shape=='square'. Ideally I'd like to see an array used_values=['square'] which emphasizes that there was at least one split on this square value.
It's important to emphasize here that I want to operate on the model object only. Obviously, one can get some of these details by running functions on top of the training data. My motivation is to make doubly sure that I completely understand the training range of inputs into the model, and what it may do to them in preprocessing.
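A minimal sketch of what is inspectable from the model object alone, assuming a catboost version that provides get_borders() and the JSON model export (both are assumptions, not confirmed for every release):
import json

# Split borders chosen for each float feature, e.g. {0: [1.5]} for 'size'
print(model.get_borders())
# The exported JSON model describes the splits that were actually used
model.save_model('model.json', format='json')
with open('model.json') as f:
    dump = json.load(f)
print(dump['features_info'])  # float feature borders and categorical metadata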
I have created a KNN model in Python (module: scikit-learn) using three variables (Age, Distance, Travel Allowance) as my predictor variables, with the aim of using them to predict an outcome for the target variable (Method of Travel).
When constructing the model, I had to normalize the data for the three predictor variables (Age, Distance, Travel Allowance). This increased the accuracy of my model compared to not normalizing the data.
Now that I have constructed the model, I want to make a prediction. But how would I enter the predictor variables to make the prediction, given that the model has been trained on normalized data?
I want to enter KNN.predict([[30,2000,40]]) to carry out a prediction where Age = 30, Distance = 2000, and Allowance = 40. But as the data has been normalized, I can't see how to do this. I used the following code to normalize the data:
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
Actually, the answer is buried in the code you provided!
Once you fit the instance of preprocessing.StandardScaler(), it remembers how to scale data. Try this:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
# scaler is an object that knows how to normalize data points
X_normalized = scaler.transform(X.astype(float))
# use the scaler to normalize the data points in X
# Note: this is what you have done, just in two steps;
# I just capture the scaler object
#
# ... Train your model on X_normalized
#
# Now predict, passing new points through the *same* scaler
other_data = [[30, 2000, 40]]
other_data_normalized = scaler.transform(other_data)
KNN.predict(other_data_normalized)
Notice that I used scaler.transform twice in the same way
See StandardScaler.transform
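A related pattern, beyond what the question asked: wrapping the scaler and the classifier in a scikit-learn Pipeline makes predict apply the same normalization automatically (X and y below stand for the question's training data and labels):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# the pipeline fits the scaler on X and rescales every input to predict()
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipeline.fit(X, y)
knn_pipeline.predict([[30, 2000, 40]])  # no manual scaling needed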
For some objects from the catboost library (like the Python code export model - https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_save_model-docpage/) predictions (https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_apply_catboost_model-docpage/) will only give a so-called raw score per record (the parameter value is called "RawFormulaVal").
Other API functions also allow the result of a prediction to be a probability for the target class (https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_predict-docpage/) - the parameter value is called "Probability".
I would like to know
how this is related to probabilities (in case of a binary classification) and
if it can be transformed into a probability using the Python API (https://tech.yandex.com/catboost/doc/dg/concepts/python-quickstart-docpage/)?
The raw scores from the catboost prediction function with type "RawFormulaVal" are log-odds (https://en.wikipedia.org/wiki/Logit).
So if we apply the function p = exp(score) / (1 + exp(score)), we get the same probabilities as if we had used the prediction type "Probability".
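A quick way to check this equivalence on a fitted binary classifier (a sketch; model and X are assumed to be a trained CatBoostClassifier with a Logloss objective and its input data):
import numpy as np
from scipy.special import expit  # expit(s) = exp(s) / (1 + exp(s))

raw = model.predict(X, prediction_type='RawFormulaVal')
proba = model.predict_proba(X)[:, 1]  # probability of the positive class
# for binary classification the two representations should agree
assert np.allclose(expit(raw), proba)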
The line model.predict_proba(evaluation_dataset) computes probabilities directly. Here is some sample code to illustrate:
from catboost import Pool, CatBoostClassifier, cv

train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=cat_features)
eval_dataset = Pool(data=X_valid,
                    label=y_valid,
                    cat_features=cat_features)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=30,
                           learning_rate=1,
                           depth=2,
                           loss_function='MultiClass')
# Fit model
model.fit(train_dataset)
# Get predicted classes
preds_class = model.predict(eval_dataset)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(eval_dataset)
# Get predicted RawFormulaVal
preds_raw = model.predict(eval_dataset,
                          prediction_type='RawFormulaVal')
# Refit, keeping the iteration that performed best on the eval set
model.fit(train_dataset,
          use_best_model=True,
          eval_set=eval_dataset)
print("Count of trees in model = {}".format(model.tree_count_))
print(preds_proba)
print(preds_raw)
I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifier on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # pair each trained class with its probability for this sample,
    # and give the untrained classes probability 0
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
produces a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for the learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
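As a sanity check (assuming the snippet above has run), each row of new_prob should still sum to 1, since the unseen classes contribute only zeros:
assert numpy.allclose(new_prob.sum(axis=1), 1.0)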
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.