Oftentimes stakeholders don't want a black-box model that's good at predicting; they want insights about the features so they can better understand their business and explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that the tree building is random without a random_state or seed)?
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()
Update:
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see whether a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?
Update2:
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the purpose of interpreting complex machine learning algorithms. Interpreting a linear regression model is not as complicated as interpreting Support Vector Machine, Random Forest or Gradient Boosting Machine models; this is where Partial Dependence Plots come into use. For some statistical explanation you can refer here and More Advance. Some of the algorithms have methods for finding variable importance, but they do not express whether a variable is positively or negatively affecting the model.
tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively correlated with the target, i.e., does a change in feature X cause the prediction y to increase or decrease?
1. Predictive power
Your feature importance shows you which features retain the most information and are the most significant. Power could imply what causes the biggest change; you would have to check by plugging in dummy values and seeing their overall impact, much like you would with linear regression coefficients.
2. Correlation/Dependence
As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (shallow decision trees, usually with only one split).
In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step it calculates a vector for the "direction" of the weak learner, so in principle you know the direction of its influence (but keep in mind a feature may appear many times in one tree and in multiple steps of the additive model).
But to cut to the chase: you could just fix all your features apart from f19, make a prediction for a range of f19 values, and see how it relates to the response value.
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a section on it in The Elements of Statistical Learning, Section 10.13.2.
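As a concrete illustration of the "fix everything except f19" idea above, here is a minimal sketch that holds every other feature at its column mean and sweeps f19 over its observed range (it reuses the toy make_classification data from the question and assumes column index 19 is the plotted f19):

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

# Hold every feature at its mean, sweep f19 over its observed range
baseline = X.mean(axis=0)
grid = np.linspace(X[:, 19].min(), X[:, 19].max(), 50)

X_sweep = np.tile(baseline, (len(grid), 1))
X_sweep[:, 19] = grid

# Predicted probability of class 1 as f19 varies:
# an increasing curve means f19 pushes predictions towards class 1,
# a decreasing curve means the opposite.
probs = xgb.predict_proba(X_sweep)[:, 1]
for v, p in zip(grid, probs):
    print(f"f19 = {v:+.3f} -> P(y=1) = {p:.3f}")

This is essentially a one-dimensional partial dependence curve evaluated at a single reference point; scikit-learn's partial_dependence averages over the training rows instead, which is usually more faithful.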
The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes, we do. Feature importance is not some magical object; it is a well-defined mathematical criterion. Its exact definition depends on the particular model (and/or some additional choices), but it is always an object which tells you "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for the feature to be used on a decision path when a randomly selected training data point is pushed through the trees. So it gives "why" in a proper, mathematical sense.
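To make that concrete, xgboost itself exposes several importance definitions through get_score (split counts, average gain, average cover), and comparing them shows exactly which "why" you are looking at. A minimal sketch, refitting the same toy model as in the question:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=68)
xgb = XGBClassifier().fit(X, y)

# Different answers to "why is this feature important":
#   'weight' - how many times the feature was used to split
#   'gain'   - average loss reduction when it was used
#   'cover'  - average number of samples affected by those splits
booster = xgb.get_booster()
for imp_type in ("weight", "gain", "cover"):
    scores = booster.get_score(importance_type=imp_type)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(imp_type, top)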
Related
I have written code to find the importance of each feature in the entire dataset for multiclass classification. Now I want to find feature importance for each class in multiclass classification, i.e. I want to find the list of features (for each class) that are most important for classifying that individual class.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder multiclass data; in my case x3/y3 come from my own DataFrame df
x3, y3 = make_classification(n_samples=1000, n_features=20, n_informative=6,
                             n_classes=4, random_state=0)
df = pd.DataFrame(x3, columns=['feature_%d' % i for i in range(x3.shape[1])])

model = DecisionTreeClassifier()
model.fit(df, y3)

importance = model.feature_importances_
for i, v in enumerate(importance):
    print('Feature[%d]: %s, Score: %.6f' % (i, df.columns[i], v))

plt.subplots(figsize=(15, 7))
plt.bar(range(len(importance)), importance)
plt.xlabel('Feature index')
plt.ylabel('Feature importance score')
plt.xticks(np.arange(0, len(df.columns), 2.0), rotation=90)
plt.show()
EDIT (28-04-2022):
I read a paper titled Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization; quoting:
In the evaluation section, we first extract the 80 traffic features from the dataset and clarify the best short feature set to detect each attack family using the RandomForestRegressor algorithm. Afterwards, we examine the performance and accuracy of the selected features with seven common machine learning algorithms.
Can anyone explain how this is done? (click for picture from that paper)
The decision trees are split into nodes that maximise information gain. Each split is based on the Gini index or entropy values. So the only way I think what you want to do can be achieved is by printing out the tree and examining it yourself visually, provided there are not too many nodes.
You can't say with certainty that one of your features is very important in discriminating a certain class. Suppose you have two classes, A and B: a feature that discriminates class A from class B is also discriminating class B from class A, so its importance belongs to both classes. In general, you can only get the overall feature importance, not importance specific to any one of your classes; just the features that help get the work done.
Trees are highly unstable, and a slight change in your dataset will build an entirely new different tree from the first.
EDIT (28-04-2022):
The paper says they used RandomForestRegressor, which is different from the decision tree you used. RandomForestRegressor means they had a regression task. The paper used the algorithm as a feature selection technique to reduce the 80 features. The few features selected (based on feature importance) were then used to train seven other models. Using fewer features instead of the whole 80 makes the resulting models more elegant and less prone to overfitting.
It is important to know that random forest is an ensemble method with a lot of randomness in the background, such as bagging and bootstrapping. Feature importance is a form of model interpretation, and it is difficult to interpret ensemble algorithms at the level of detail you describe. So what they wrote in the paper is different from what you think.
Decision trees are a lot more interpretable. If you want to understand causality in your decision tree model, you can click here to see how the model can be converted into rules or, as suggested earlier, observe the tree with your naked eye.
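If you prefer rules over pictures, a minimal sketch with scikit-learn's export_text, assuming the fitted model and the DataFrame df from the question above:

from sklearn.tree import export_text

# Prints the tree as nested if/else rules, one line per split.
# feature_names must match the columns the model was trained on;
# this stays readable only for fairly small trees.
rules = export_text(model, feature_names=list(df.columns))
print(rules)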
I have a university project and I'm given a dataset in which almost all features have a very weak correlation with the target (only one feature has a moderate correlation). Its distribution is not normal either. I already tried to apply a simple linear regression model, which underfit; then I applied a simple random forest regressor, which overfit; and when I applied a random forest regressor optimized with RandomizedSearchCV it took far too long. Is there any way to get a decent model from a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try this route first, since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting, so that helps, too.
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (or, similarly, the minimum number of samples a leaf should have); see the sketch after this list.
Try increasing the number of trees in the RF.
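A minimal sketch of those knobs in scikit-learn; the make_regression data is only a placeholder for your own, and the specific values are illustrative starting points rather than recommendations:

from sklearn.datasets import make_regression   # placeholder for your own data
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# More trees, shallower trees, and larger split/leaf thresholds
# all push a random forest away from overfitting.
rf = RandomForestRegressor(
    n_estimators=500,       # try increasing the number of trees
    max_depth=8,            # try reducing the maximum depth
    min_samples_split=10,   # a node needs at least this many samples to split
    min_samples_leaf=5,     # each leaf must keep at least this many samples
    random_state=0,
)
scores = cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(scores.mean())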
Underfitting on linear regression
Add more parameters. If you have variables a, b, etc., adding their polynomial features, i.e. a^2, a^3, b^2, b^3, etc., may help (see the sketch after this list). If you add enough polynomial features you should be able to overfit, although a good fit on the train set (RMSE value) doesn't necessarily mean the model will generalize well.
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing your regression (or if the software is applying regularization automatically), try reducing the regularization parameter.
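For the polynomial-feature suggestion above, a minimal sketch combining PolynomialFeatures with a regularised linear model; again, the generated data and the degree/alpha values are only placeholders:

from sklearn.datasets import make_regression   # placeholder for your own data
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# Polynomial terms add capacity to the linear model; the Ridge penalty (alpha)
# then lets you dial the fit back if it starts to overfit.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
print(cross_val_score(model, X, y, cv=5).mean())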
I am using decision trees from Scikit Learn to do regression on a data set.
I am getting very good results, but one issue that concerns me is that the relative uncertainty on many of the features is very high.
I have tried just dropping the cases with high uncertainty, but that reduces the performance of the model significantly.
The features themselves are experimentally determined, so they have associated experimental uncertainty. The data itself is not noisy.
So my question, is there a good way to incorporate the uncertainty associated with the features to machine learning algorithms?
Thanks for all the help!
If the uncertain features are improving the algorithm that suggests that together, they are useful. However, some of them may not be. My suggestion would be to get rid of those features that don't improve the algorithm. You could use a greedy feature elimination algorithm.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
This begins by training a model on all the features, then gets rid of the feature deemed least useful and trains the model again with one fewer feature, repeating until the desired number of features remains.
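A minimal sketch with scikit-learn's RFE; the generated data, the base regressor, and the number of features to keep are placeholders for your own choices:

from sklearn.datasets import make_regression   # placeholder for your own data
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=15, n_informative=5, random_state=0)

# Recursively drops the feature deemed least useful and refits,
# until only n_features_to_select remain.
selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept, higher numbers were eliminated earlier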
Hope that helps
I had trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained the model; now I am getting an accuracy of only 87%. What could be the reason? Should I stick with the unnormalised data or switch to the normalised version?
To answer your question, you first need to understand how KNN works. Picture a simple diagram: a scatter of red and blue points, plus a new point marked "?".
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots, so this point would be classified as red. Let's also assume the correct label is red; therefore this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common scale (in your case, think of it as making the features more similar). Assume in the above example that you normalize the ?'s features and, as a result, its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Your algorithm would therefore label it as blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up for the data set you are currently working with doesn't mean you will get the same results with a different data set.
Long story short, instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and it is unexpected at first glance, because normalization usually helps a KNN classifier do better. Generally, good KNN performance requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the training data very well but does not work well at all on new data. The first model might have memorized more of the data due to some characteristic of that data, but that's not a good thing. You would need to check your prediction accuracy on a different set of data from the one you trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
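One way to do that check is to score both variants with cross-validation; a minimal sketch, where the generated data stands in for your own:

from sklearn.datasets import make_classification   # placeholder for your own data
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The scaler inside the pipeline is re-fit on each training fold,
# so neither variant is scored on data it was trained on.
print("unnormalised:", cross_val_score(knn_raw, X, y, cv=5).mean())
print("normalised:  ", cross_val_score(knn_scaled, X, y, cv=5).mean())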
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different than when you used unnormalized features, particularly when the ranges of the features are different. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen with unnormalized features, hence the difference in accuracy.
This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).
Resource for simple theory? Is there a simple and easy explanation somewhere about what it does, when and why regularization is necessary, and how to use it, for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that states the problem and solution more simply?
How to use in sklearn? Is there a step by step example showing why elastic net is chosen (over ridge, lasso, or just simple OLS) and how the parameters are calculated? Many of the examples on sklearn just include alpha and rho parameters directly into the prediction model, for example:
from sklearn.linear_model import ElasticNet

# X_train, X_test, y_train come from your own train/test split
alpha = 0.1
enet = ElasticNet(alpha=alpha, l1_ratio=0.7)  # 'rho' was renamed to 'l1_ratio' in later scikit-learn versions
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
However, they don't explain how these were calculated. How do you calculate the parameters for the lasso or net?
The documentation is lacking. I created a new issue to improve it. As Andreas said, the best resource is probably ESL II, freely available online as a PDF.
To automatically tune the value of alpha it is indeed possible to use ElasticNetCV, which will spare redundant computation, as opposed to using GridSearchCV on the ElasticNet class for tuning alpha. In addition, you can use a regular GridSearchCV for finding the optimal value of rho (called l1_ratio in current versions). See the docstring of ElasticNetCV for more details.
As for Lasso vs ElasticNet, ElasticNet will tend to select more variables, hence lead to larger models (also more expensive to train) but also be more accurate in general. In particular, Lasso is very sensitive to correlation between features and might randomly select one out of two very correlated informative features, while ElasticNet is more likely to select both, which should lead to a more stable model (in terms of generalization ability to new samples).
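A minimal sketch of that tuning with ElasticNetCV; note that in current scikit-learn the mixing parameter is called l1_ratio rather than rho, and the data and grids below are only placeholders:

from sklearn.datasets import make_regression   # placeholder for your own data
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validates alpha along a regularisation path for each candidate l1_ratio,
# reusing computation instead of refitting from scratch as GridSearchCV would.
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5)
enet_cv.fit(X_train, y_train)
print(enet_cv.alpha_, enet_cv.l1_ratio_)
y_pred_enet = enet_cv.predict(X_test)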
I would point you towards this blog post: http://www.datarobot.com/blog/regularized-linear-regression-with-scikit-learn/.
I will try helping you out with the question 'What is ElasticNet?'
The Elastic-Net is a regularised regression method that linearly combines both penalties, i.e. the L1 and L2 penalties of the Lasso and Ridge regression methods.
It is useful when there are multiple correlated features. The difference between Lasso and Elastic-Net lies in the fact that Lasso is likely to pick one of these features at random, while Elastic-Net is likely to pick both at once.
The below listed two links have got wonderful explanations for ElasticNet.
ElasticNet- TutorialsPoint
Lasso, Ridge and Elastic Net Regularization