How to get CORRECT feature importance plot in XGBOOST? - python

Using two different methods of XGBoost feature importance gives me two different most important features. Which one should be believed?
Which method should be used, and when? I am confused.
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
df = sns.load_dataset('mpg')
df = df.drop(['name','origin'],axis=1)
X = df.iloc[:,1:]
y = df.iloc[:,0]
Numpy arrays
# fit the model
model_xgb_numpy = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_numpy.fit(X.to_numpy(), y.to_numpy())
plt.bar(range(len(model_xgb_numpy.feature_importances_)), model_xgb_numpy.feature_importances_)
Pandas dataframe
# fit the model
model_xgb_pandas = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_pandas.fit(X, y)
axsub = xgb.plot_importance(model_xgb_pandas)
Problem
The numpy approach shows the 0th feature (cylinders) as the most important; the pandas approach shows model year as the most important. Which one is the CORRECT most important feature?
References
How to get feature importance in xgboost?
Feature importance 'gain' in XGBoost

It is hard to define THE correct feature importance measure. Each has pros and cons. It is a wide topic with no golden rule as of now, and I would personally suggest reading this online book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/. The book has an excellent overview of different measures and different algorithms.
As a rule of thumb, if you cannot use an external package, I would choose gain, as it is more representative of what one is interested in (one is typically not interested in the raw number of splits on a particular feature, but rather in how much those splits helped); see this question for a good summary: https://datascience.stackexchange.com/q/12318/53060. If you can use other tools, SHAP exhibits very good behaviour and I would always choose it over the built-in xgb tree measures, unless computation time is strongly constrained.
As for the difference you point at directly in your question, the root of it is that xgb.plot_importance uses weight as the default importance type, while the XGBModel itself uses gain as the default. If you configure them to use the same importance type, then you will get similar distributions (up to the additional normalisation in feature_importances_ and the sorting in plot_importance).
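As a quick check, here is a minimal sketch (reusing X, y and model_xgb_pandas from the question, and assuming a reasonably recent xgboost) that asks both APIs for gain and confirms the rankings line up:
# fit a model whose feature_importances_ explicitly report gain
model_gain = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror',
                              importance_type='gain')
model_gain.fit(X, y)
print(dict(zip(X.columns, model_gain.feature_importances_)))  # normalised gain per feature

# ask the plotting helper for gain as well (raw, unnormalised, sorted)
xgb.plot_importance(model_xgb_pandas, importance_type='gain', show_values=False)
plt.show()
Both should now rank the same features at the top; only the scaling and sort order differ.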

There are 3 ways to get feature importance from Xgboost:
use built-in feature importance (I prefer gain type),
use permutation-based feature importance
use SHAP values to compute feature importance
In my post I wrote code examples for all 3 methods. Personally, I'm using permutation-based feature importance. In my opinion, the built-in feature importance can show features as important after overfitting to the data (this is just an opinion based on my experience). SHAP explanations are fantastic, but sometimes computing them can be time-consuming (and you need to downsample your data).
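For reference, a minimal sketch of the latter two options, reusing the model_xgb_pandas, X and y fitted in the question (assumes scikit-learn and the shap package are available):
# permutation-based importance: model-agnostic, works on any fitted estimator
from sklearn.inspection import permutation_importance
perm = permutation_importance(model_xgb_pandas, X, y, n_repeats=10, random_state=0)
print(dict(zip(X.columns, perm.importances_mean)))

# SHAP values: can be slow on large data, so downsample X if needed
import shap
explainer = shap.TreeExplainer(model_xgb_pandas)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)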

From the answer here, which gives a neat explanation:
feature_importances_ returns weights - what we usually think of as "importance".
plot_importance returns the number of occurrences in splits.
Note: I think that the selected answer above does not actually cover the point.

Related

Pandas info for 100+ features

I have a dataset at my disposal consisting of around 500 columns, which I need to explore in order to keep only the relevant ones. Pandas' info(verbose=True) method does not even display this many columns properly. I also used the missingno library to visualise nulls; however, it uses a lot of RAM. What should I use instead of matplotlib here?
How do you approach datasets with a lot of features (more than 100)? Is there a useful workflow for eliminating useless features? How should info() or some alternative be used?
I have also used the display options below to view everything; the question is how to set them only locally:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
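If "locally" means only for a single statement rather than globally, one option (assuming a reasonably recent pandas) is the option_context context manager, which restores the previous settings on exit:
# temporarily raise the display limits for this block only
with pd.option_context('display.max_rows', 500, 'display.max_columns', 500, 'display.width', 1000):
    print(df.describe(include='all'))
# the options revert to their previous values here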
UPDATE:
Methods or solutions for exploring initial raw data are of interest, for instance a one-cell script which summarises numerical features as distributions, categorical features as counts, and possibly something else. I could write this myself, but maybe there is a library, or simply a function of yours, which already does so?
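A minimal sketch of what such a one-cell summary could look like (df is a placeholder for your DataFrame; dedicated profiling libraries such as ydata-profiling produce a much more complete report):
# quick per-dtype overview: distributions for numeric columns, counts for categorical ones
num = df.select_dtypes(include='number')
cat = df.select_dtypes(exclude='number')
print(num.describe().T)                                        # count/mean/std/quantiles per numeric column
print(df.isna().mean().sort_values(ascending=False).head(20))  # columns with the most missing values
for col in cat.columns:
    print(col, cat[col].value_counts(dropna=False).head(10), sep='\n')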
Regarding the issue of useless features, you could easily estimate some metrics associated with feature effectiveness and filter features out using some threshold. Check out the sklearn feature selection docs.
Of course, before doing that you'll have to make sure the features are numeric and their representation is fit for the tests of your choice. To do that I suggest you check out the sklearn pipelines (optional) and preprocessing docs.
Before estimating feature usefulness, make sure you cover missing-data handling, encoding of categorical variables and feature scaling.
You can also use XGBoost's feature_importances_ attribute: first train a model on your data with XGB, then keep only the important features by setting a threshold of your choice (a sketch follows below).
Dimensionality reduction with PCA or some other algorithm can also come in handy.
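A minimal sketch of the threshold idea using SelectFromModel (X and y are placeholders for your numeric features and target, with missing values already handled):
# keep only the features whose model-based importance is above the median
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBRegressor

selector = SelectFromModel(XGBRegressor(n_estimators=100), threshold='median')
selector.fit(X, y)
X_reduced = selector.transform(X)            # array with the surviving columns
kept = X.columns[selector.get_support()]     # assumes X is a DataFrame
print(len(kept), 'features kept:', list(kept))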

Is it necessary to use StandardScaler on y_train and y_test? If yes, in which cases?

I have read multiple cases where StandardScaler is used on y_train and y_test, and also cases where it is not used. Are there any specific rules for when it should be used on them?
Quoting from here:
Standardization of a dataset is a common requirement for many machine
learning estimators: they might behave badly if the individual
features do not more or less look like standard normally distributed
data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a
learning algorithm (such as the RBF kernel of Support Vector Machines
or the L1 and L2 regularizers of linear models) assume that all
features are centered around 0 and have variance in the same order. If
a feature has a variance that is orders of magnitude larger than
others, it might dominate the objective function and make the
estimator unable to learn from other features correctly as expected.
So, when your features have different scales/distributions, you should probably standardize/scale their values.
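If you also decide to scale the target, a common pitfall is fitting the scaler on the test set. A minimal sketch of the safer pattern (y_train, y_test, model and X_test are placeholders):
# fit the target scaler on the training target only
from sklearn.preprocessing import StandardScaler
import numpy as np

y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1))
y_test_scaled = y_scaler.transform(np.asarray(y_test).reshape(-1, 1))   # transform only, never fit on test

# after predicting in the scaled space, map predictions back to the original units
y_pred = y_scaler.inverse_transform(model.predict(X_test).reshape(-1, 1))
Note that many estimators (trees, gradient boosting) do not need the target scaled at all; it mostly matters for scale-sensitive models such as SVR or neural networks.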

Determine WHY Features Are Important in Decision Tree Models

Oftentimes stakeholders don't want a black-box model that's good at predicting; they want insights about the features so they can better understand their business and explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed)?
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()
Update:
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see if a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y).
Update2:
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the purpose of interpreting complex machine learning algorithms. Interpreting a linear regression model is not as complicated as interpreting a Support Vector Machine, Random Forest or Gradient Boosting Machine model; this is where Partial Dependence Plots come into use. For some statistical explanation you can refer here and to more advanced material. Some of the algorithms have methods for finding variable importance, but they do not express whether a variable is positively or negatively affecting the model.
tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively
correlated, i.e., does a change in the feature X cause the prediction y to increase/decrease
1. Predictive power
Your feature importance shows you which features retain the most information and are the most significant. Power could imply what causes the biggest change; you would have to check by plugging in dummy values to see their overall impact, much like you would have to do with linear regression coefficients.
2. Correlation/Dependence
As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (shallow decision trees, usually with only a handful of splits).
In a regression problem, the trees typically use a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step the algorithm calculates a vector for the "direction" of the weak learner, so in principle you know the direction of its influence (but keep in mind a feature may appear many times in one tree, and in multiple steps of the additive model).
But to cut to the chase: you could simply fix all of your features apart from f19, make predictions for a range of f19 values, and see how they relate to the response value.
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.
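A minimal sketch of that idea, reusing xgb and X from the question (assumes scikit-learn >= 1.0, whose PartialDependenceDisplay supersedes the older plot_partial_dependence example linked above, and that the xgboost sklearn wrapper is accepted as an estimator):
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# average prediction as f19 is varied, with the other features held at their observed values
PartialDependenceDisplay.from_estimator(xgb, X, features=[19])
plt.show()   # an upward-sloping curve suggests f19 pushes the prediction up, a downward one the opposite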
The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, an entropy-based criterion (information gain) is used. This means that the chosen feature set is the one that allows classification with the fewest decision steps.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes we do. Feature importance is not some magical object; it is a well-defined mathematical criterion. Its exact definition depends on the particular model (and/or some additional choices), but it is always an object which tells "why". The "why" is usually the most basic thing possible and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for this feature to be used on a decision path when a randomly selected training data point is pushed through the tree. So it gives "why" in a proper, mathematical sense.
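If you also want the sign of each feature's contribution to each prediction (the "positive or negative" part of the question) without installing the shap package, xgboost can emit SHAP-style contributions directly. A minimal sketch, reusing xgb and X from the question:
import numpy as np
import xgboost

booster = xgb.get_booster()
# one column per feature plus a final bias column, in log-odds space for a classifier
contribs = booster.predict(xgboost.DMatrix(X), pred_contribs=True)
print(np.round(contribs[0], 3))                                     # signed contribution of each feature to the first prediction
print('mean signed contribution of f19:', contribs[:, 19].mean())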

OLS Regression: Scikit vs. Statsmodels? [closed]

Short version: I was using the scikit LinearRegression on some data, but I'm used to p-values so put the data into the statsmodels OLS, and although the R^2 is about the same the variable coefficients are all different by large amounts. This concerns me since the most likely problem is that I've made an error somewhere and now I don't feel confident in either output (since likely I have made one model incorrectly but don't know which one).
Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.
I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.
For scikit's LR, the results are (statistically) the same whether or not I set normalize=True or =False, which I find somewhat strange.
For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709
That makes the warning go away but the results are exactly the same.
Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), while for statsmodels I just give it all the data.
R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.
The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.
Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.
The dependent variable is how many levels each character gained during that week (int).
Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.
I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm
I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.
I would love to know:
Which output might be accurate? (Granted they might both be if I missed a kwarg.)
If I made a mistake, what is it and how to fix it?
Could I have figured this out without asking here, and if so how?
I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.
(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)
Thanks!
It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e
# Fit regression model
sm.OLS(y, X).fit().params
>> array([ 1.4507724 , 0.08612654, 0.60129898])
LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 , 0.08612654, 0.60129898])
As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (e.g. dropping different columns).
I recommend you use pandas and patsy to take care of this:
import pandas as pd
from patsy import dmatrices
dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)
Or, alternatively, the statsmodels formula interface:
import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same OLS results from the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.
smod = smf.ols(formula='y ~ x', data=df)
result = smod.fit()
print(result.summary())
When in doubt, please
try reading the source code
try a different language for benchmark, or
try OLS from scratch, which is basic linear algebra.
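A minimal sketch of the last point, reusing the artificial X and y generated in the example earlier in this thread:
# OLS "from scratch": solve the least-squares problem directly with numpy
import numpy as np

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # should match sm.OLS(y, X).fit().params and LinearRegression(fit_intercept=False).coef_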
I just want to add here that, contrary to a common misconception, sklearn's LinearRegression does solve ordinary least squares under the hood (via a least-squares routine from scipy/LAPACK), not gradient descent; gradient-based fitting is what SGDRegressor does. statsmodels' OLS is likewise the analytical closed-form approach. So, given the same design matrix (including intercept handling), the two libraries should agree to numerical precision, and a large discrepancy usually points to a difference in the inputs or options rather than in the estimation method.

Classification tree in sklearn giving inconsistent answers

I am using a classification tree from sklearn, and when I train the model twice on the same data and predict with the same test data, I get different results. I tried to reproduce this on the smaller iris dataset and it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

# load the iris data and fit the same tree twice on it
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT: After looking into the documentation I see that DecisionTreeClassifier has a random_state input which controls the starting point. Setting this value to a constant gets rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended way to handle this? Try a few seeds at random? Or are all results expected to be about the same?
The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the method's signature and the criterion argument). It looks like the code does this N_features times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
See also: Source code for the splitter
The answer provided by Matt Krause does not answer the question entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation will still sample them at random from the list of features (even though this means all features will be sampled, in this case). Thus, the order in which the features are considered is pseudo-random. If two possible splits are tied, the first one encountered will be used as the best split.
This is exactly the reason why your decision tree is yielding different results each time you call it: the order of features considered is randomized at each node, and when two possible splits are then tied, the split to use will depend on which one was considered first.
As has been said before, the seed used for the randomization can be specified using the random_state parameter.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier#Notes
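A minimal sketch of the deterministic behaviour, using the (corrected) iris example from the question:
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

# fixing random_state makes two fits on the same data produce identical trees
iris = load_iris()
clf_a = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
clf_b = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(np.array_equal(clf_a.predict_proba(iris.data), clf_b.predict_proba(iris.data)))   # True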
I don't know much about sklearn, but I guess DecisionTreeClassifier has some internal state, created by fit, which only gets updated/extended. Perhaps you should create a new one?
