I am training an XGBoost classifier on BigQuery ML. The model trains fine, and the saved model file (model.bst) is then imported into a Python notebook for plotting. I want to plot the trees in the model to get an idea of how predictions are made.
When I plot the model, I get the results that are given below:
I am doing it like this:
import xgboost as xgb
import matplotlib.pyplot as plt
bst = xgb.Booster(model_file='model.bst')
fig, ax = plt.subplots(figsize=(30, 30))
xgb.plot_tree(bst, num_trees=4, ax=ax)
plt.show()
I have come to know that the column names are masked: f182, for example, stands for the feature at index 182 in the data the model was trained on. I would like to create a mapping for these trees to the actual column names that were used for training the model. The query used to train the model is given below:
CREATE OR REPLACE MODEL `d1.boost_clf1`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['y'],
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='isTrain',
AUTO_CLASS_WEIGHTS=TRUE,
EARLY_STOP=TRUE,
L2_REG = 0.3,
ENABLE_GLOBAL_EXPLAIN = TRUE
) AS
SELECT
  * EXCEPT(isTrain, x1, x2, x3_timestamp, x4_timestamp, y)
  , isTrain = 1 AS isTrain
FROM d1.t1_preprocessed;
I have tried printing bst.feature_names, but it returns nothing.
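I was hoping something along these lines would work, where I manually assign the training column names in their original order before plotting (the list below is just a placeholder, not my real schema), but I don't know how to reliably get that ordered list out of BigQuery:

# Hypothetical: assign the training column names (in training order) to the booster
training_columns = ['age', 'balance', 'duration']  # placeholder names only
bst.feature_names = training_columns
xgb.plot_tree(bst, num_trees=4, ax=ax)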
Any help in finding a way to plot the trees of XGBoost with actual column names would be highly appreciated. Thanks!
I am trying to carry out stock market prediction using an LSTM model (a type of RNN). I am following this article; however, I am not able to understand, in this particular code snippet,
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)
y_test_scaled = scaler.inverse_transform(y_test.reshape(-1, 1))
fig, ax = plt.subplots(figsize=(16,8))
ax.set_facecolor('#000041')
ax.plot(y_test_scaled, color='red', label='Original price')
plt.plot(predictions, color='cyan', label='Predicted price')
plt.legend()
where x_test is being passed into model.predict(). x_test essentially consists of the values of the time series that we are trying to predict. If we feed x_test into model.predict(), then we are essentially giving the model the very values we are trying to predict, so, in effect, we are not carrying out any prediction. If that is the case, then the method given in the article is wrong. Is my conclusion about the article correct?
Why are we feeding x_test (the data that is to be predicted) into the model to carry out the prediction of future values?
On the website you linked, the author splits the data he has available into train and test data. He defines his feature sets as x_train and x_test and his label sets as y_train and y_test. The training data is what you use to train your models; the test data is data the model won't see during the training process.
The features are the data you feed into your model so that it can predict the labels more or less correctly. In the training process you do the same: you show your model the features and the corresponding labels so that the model can learn how the data is connected and hopefully generalize without overfitting.
So what you have in your code snippet doesn't look wrong: the author takes his test set of features (x_test) and feeds it into the model so that the model can produce predicted labels for it. He then plots the predicted labels against the true labels (y_test) to see how well the model did its job.
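As a minimal sketch of that idea (with a toy dataset and a plain linear model, not the exact LSTM setup from the article):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy data: features X and labels y
X = np.arange(100).reshape(-1, 1)
y = 2 * np.arange(100) + 1

# Hold out the last 20% as a test set the model never sees during training
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# The model is fitted only on the training pairs
model = LinearRegression()
model.fit(x_train, y_train)

# At prediction time we pass only the test *features*; the model has never
# seen y_test, so comparing predictions with y_test is a genuine evaluation
predictions = model.predict(x_test)
print(predictions[:3], y_test[:3])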
I haven't been able to find many examples of SHAP values with PyTorch. I've used two techniques to generate SHAP values; however, their results don't appear to agree with each other.
SHAP KernelExplainer with PyTorch
import torch
from torch.autograd import Variable
import shap
import numpy as np
import pandas as pd
torch.set_grad_enabled(False)
# Get features
train_features_df = ... # pandas dataframe
test_features_df = ... # pandas dataframe
# Define function to wrap model to transform data to tensor
f = lambda x: model_list[0]( Variable( torch.from_numpy(x) ) ).detach().numpy()
# Convert my pandas dataframe to numpy
data = test_features_df.to_numpy(dtype=np.float32)
# The explainer doesn't like tensors, hence the f function
explainer = shap.KernelExplainer(f, data)
# Get the shap values from my test data
shap_values = explainer.shap_values(data)
# Enable the plots in jupyter
shap.initjs()
feature_names = test_features_df.columns
# Plots
#shap.force_plot(explainer.expected_value, shap_values[0], feature_names)
#shap.dependence_plot("b1_price_avg", shap_values[0], data, feature_names)
shap.summary_plot(shap_values[0], data, feature_names)
SHAP DeepExplainer with PyTorch
# It wants gradients enabled, and uses the training set
torch.set_grad_enabled(True)
e = shap.DeepExplainer(model, Variable( torch.from_numpy( train_features_df.to_numpy(dtype=np.float32) ) ) )
# Get the shap values from my test data (this explainer likes tensors)
shap_values = e.shap_values( Variable( torch.from_numpy(data) ) )
# Plots
#shap.force_plot(explainer.expected_value, shap_values, feature_names)
#shap.dependence_plot("b1_price_avg", shap_values, data, feature_names)
shap.summary_plot(shap_values, data, feature_names)
Comparing results
As you can see from the summary plots, the importance values given to the features by the same PyTorch model, with the same test data, are noticeably different.
For example, the feature b1_addresses_avg is ranked second from the bottom by the KernelExplainer, but third from the top by the DeepExplainer.
I'm not sure where to go from here.
Shapley values are very difficult to calculate exactly. Kernel SHAP and Deep SHAP are two different approximation methods to calculate the Shapley values efficiently, and so one shouldn't expect them to necessarily agree.
You can read the authors' paper for more details.
While Kernel SHAP can be used on any model, including deep models, it is natural to ask whether
there is a way to leverage extra knowledge about the compositional nature of deep networks to improve
computational performance. [...] This motivates our adapting DeepLIFT to become a compositional approximation
of SHAP values, leading to Deep SHAP.
In section 5, they compare the performance of Kernel SHAP and Deep SHAP. From their example it seems like Kernel SHAP performs better than Deep SHAP. So I guess if you aren't running into computational issues, you can stick with Kernel SHAP.
P.S. Just to make sure: you are passing the exact same trained model to both explainers, right? You shouldn't be training separate models, because they will learn different weights.
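For example (a sketch reusing the names from your snippets), both explainers should be constructed from the same trained model object:

# One trained model instance, shared by both explainers
model = model_list[0]

# Kernel SHAP works on a numpy-in / numpy-out function wrapping that model
f = lambda x: model(torch.from_numpy(x)).detach().numpy()
kernel_explainer = shap.KernelExplainer(f, data)

# Deep SHAP wraps the very same model object, with a tensor background set
background = torch.from_numpy(train_features_df.to_numpy(dtype=np.float32))
deep_explainer = shap.DeepExplainer(model, background)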
I have extracted word embeddings for two different text fields (title and description) and want to train an XGBoost model on both embeddings. Each embedding is 200-dimensional, as can be seen below:
I was able to train the model on one of the embeddings, and it worked perfectly, like this:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate

x = df['FastText']    # training features
y = df['Category']    # target variable
#Defining Model
model = XGBClassifier(objective='multi:softprob')
#Evaluation metrics
score=['accuracy','precision_macro','recall_macro','f1_macro']
#Model training with 5 Fold Cross Validation
scores = cross_validate(model, np.vstack(x), y, cv=5, scoring=score)
Now I want to use both features for training, but it gives me an error if I pass two columns of df like this:
x=df[['FastText_Title','FastText']]
One solution I tried is adding the two embeddings element-wise (x1 + x2), but that decreases accuracy significantly. How do I use both features in the cross_validate function?
In the past for multiple inputs, I've done this:
features = ['FastText_Title', 'FastText']
x = df[features]
y = df['Category']
This creates a dataframe containing both feature columns.
I usually need to scale the data as well using MinMaxScaler once the new array has been made.
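If each cell of those columns actually holds a 200-dimensional vector (as the np.vstack(x) call in the question suggests), a sketch of how the two columns could be expanded and concatenated into one flat feature matrix before scaling and cross-validation might look like this:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_validate

# Expand each embedding column into an (n_samples, 200) matrix
title_matrix = np.vstack(df['FastText_Title'])
desc_matrix = np.vstack(df['FastText'])

# Concatenate them side by side into an (n_samples, 400) matrix
x_combined = np.hstack([title_matrix, desc_matrix])

# Optional scaling, as mentioned above
x_combined = MinMaxScaler().fit_transform(x_combined)

scores = cross_validate(model, x_combined, y, cv=5, scoring=score)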
Judging by the error you are getting, it seems there is a problem with the data types. Try this; it will convert your features to numeric, and it should work:
df['FastText'] = pd.to_numeric(df['FastText'])
df['FastText_Title'] = pd.to_numeric(df['FastText_Title'])
I am learning linear regression. I wrote this linear regression code using scikit-learn. After making predictions on the training data, how do I make predictions for new data points that are not in my original data set?
In this data set you are given the salaries of people according to their work experience.
For example, the predicted salary for a person with 15 years of work experience should be [167005.32889087].
Here is an image of the data set.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv('project_1_dataset.csv')
X = data.iloc[:,0].values.reshape(-1,1)
Y = data.iloc[:,1].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X,Y)
Y_pred = linear_regressor.predict(X)
plt.scatter(X,Y)
plt.plot(X, Y_pred, color = 'red')
plt.show()
After fitting and training your model with your existing dataset (i.e. after linear_regressor.fit(X, Y)), you can make predictions on new instances in the same way:
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)
where new_data is your new data point.
If you just want to make predictions for a few arbitrary new data points, the above is enough. If your new data points live in another dataframe, you can replace new_data with that dataframe containing the new instances to be predicted.
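For instance, to reproduce the 15-years-of-experience example from the question (the exact number depends on the fitted coefficients):

import numpy as np

# predict() expects a 2D array: one row per sample, one column per feature
new_data = np.array([[15]])
print(linear_regressor.predict(new_data))  # should be close to 167005.33 for this data set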
I'm trying to improve my classification results by doing clustering and using the cluster assignments as another feature (or using them alone instead of all the other features; I'm not sure yet).
So let's say I'm using an unsupervised algorithm, a Gaussian mixture model (GMM):
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I trained the model on the training data and predicted the clusters for the test data.
Now I want to use a classifier (KNN, for example) and feed the cluster assignments into it. So I tried:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors':[3,5,7],
'leaf_size':[1,3,5],
'algorithm':['auto', 'kd_tree'],
'n_jobs':[-1]}
#Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1),Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
The train and test sets don't have the same number of samples.
So how can I implement such an approach?
Your method is not correct: you are attempting to use the cluster labels of your test data (pred_labels) as a single feature in order to fit a classifier against your training labels (Y_train). Even in the hugely coincidental case that these datasets had the same number of samples (and hence did not raise a dimension-mismatch error, as they do here), this would be conceptually wrong and would not actually make any sense.
What you actually want to do is:
Fit a GMM with your training data
Use this fitted GMM to get cluster labels for both your training and test data.
Append the cluster labels as a new feature in both datasets
Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd

# Fit the GMM on the training data only, then get labels for both sets
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)

# Append the cluster labels as a new feature in both dataframes
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)

# Fit the classifier on the "enhanced" training data
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data, only with your training data; otherwise you have data leakage, similar to what happens when the test set is used for feature selection, and your results will be both invalid and misleading.