I am trying to find out the importance of my features and wanted to understand how the forest of trees works?
To my understanding, it makes decision trees and the bar graphs show how much variance is explained by the feature which in turn shows the importance of the feature.
I also wanted to undestand what the lines at the end of the graph mean?
Link to the method:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
Is this the correct understanding?
Thanks
Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.
It is however important to note that feature_importances_ in Random Forests don't necessarily predict the correct rank of each feature. Two highly correlated features may be on opposite sides of rank table. This won't affect performance of the model if you drop the mistakenly ranked feature though.However it isn't a reliable method to know the importance of each feature. To get around this limitation, I use Sequential Backward Selection.
Related
from what i understand about random forest alogirthm is that the algorithm randomly samples the original dataset to build a new sampled/bootstrapped dataset. the sampled dataset then turned into decision trees.
in scikit learn, you can visualize each individual trees in random forest. but my question is, How to show the sampled/bootstrapped dataset from each of those trees?
i want to see the features and the rows of data used to build each individual trees.
I am not aware of a way to see the bootstrapped rows (samples) of your data, as this is a random process. Neither do I think it is of much importance, if the algorithm trained well.
Nevertheless, for the features: In your visualization of the tree you might already see which features were used by the splits that have been done. But you can also access them directly via the attribute feature_names_in_ (see here) for each DecisionTree in the forest, e.g.
print(rf.estimators_[0].feature_names_in_)
If your feature names in the data have not been defined as given in the documentation, a workaround is to use the feature_importances_ (see here) as a proxy instead - the features that were unused obviously have an importance of 0. You can contrast this with the number of max_features_ that you define for the training to see if you caught all.
I have written code to find the importance of each feature in the entire dataset for multiclass classification. Now I want to find feature importance for each class in multiclass classification, i.e. I want to find the list of features (for each class) that are more important to classify that individual classes.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
model = DecisionTreeClassifier()
model.fit(x3, y3)
importance = model.feature_importances_
for i,v in enumerate(importance):
print('Feature[%0d]:%s, Score: %.6f' % (i,df.columns[i],v))
plt.subplots(figsize=(15,7))
plt.bar([x for x in range(len(importance))], importance)
plt.xlabel('Feature index')
plt.ylabel('Feature importance score')
plt.xticks(rotation=90)
plt.xticks(np.arange(0,len(df.columns)-2, 2.0))
plt.show()
EDIT (28-04-2022):
I read a paper titled Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization; quoting:
On the evaluate section, we fist extract the 80 traffic features from the dataset and clarify the best short feature set to detect each attack family using RandomForestRegressor algorithm. Afterwards, we examine the performance and accuracy of the selected features
with seven common machine learning algorithms.
Can anyone explain how this is done?click for picture from that paper
The decision trees are split into nodes that maximise information gain. Each split is based on the Gini index or entropy values. So the only way I think what you want to do can be achieved is by printing out the tree and examining it yourself visually, provided there are not too many nodes.
You can't say with certainty that one of your features is very important in discriminating against a certain class because suppose you have two classes, A and B. The feature that discriminates class A against class B is also discriminating class B against class A. So the importance of that feature is for both classes. In general, you can only get the overall feature importance not specific to any of your classes but the features that help get the work done.
Trees are highly unstable, and a slight change in your dataset will build an entirely new different tree from the first.
EDIT(28-04-2022):
The paper says they used Random-ForestRegressor, different from the decision tree you used. Random-ForestRegressor meant they had a regression task. The paper used the algorithm as a feature selection technique to reduce the 80 features. The few features selected (based on feature importance) were then used to train seven other different models. Using fewer features instead of the whole 80 will make the resulting models more elegant and less prone to overfitting.
It is important to know that Random forest is an ensemble method and has a lot of random happenings in the background such as bagging and bootstrapping. Feature importance is a form of model interpretation. It is difficult to interpret Ensemble algorithms the way you have described. Such a way would be too detailed. So, definitely, what they wrote in the paper is different from what you think.
Decision trees are a lot more interpretable. If you want to understand causality in your decision tree model, you can click here to see how the model can be converted into rules or as suggested earlier, observe the tree with your naked eyes.
Can anyone explain how use of forests of trees to evaluate the importance of features (feature_importances_) works?
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
It is basically a random Forest Implementation. Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.
Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No it doesn't say anything about it. No RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_ so I know I could make a graphviz from one of those. There is an example of that here. I could even score all trees and find the tree with the highest score amongst the forest and choose that one... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?
A RandomForest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub classifiers are used, typically (but not always) in a majority voting ensemble, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other exotic form of combining the individual predictions (research on ensemble methods is ongoing although somewhat sidelined by deep learning).
There is no average tree that could be graphed, only the decision stumps that were trained from random sub samples of the whole dataset and the predictions that each of those produces. It's the predictions themselves that are averaged, not the trees / stumps.
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests1[2] are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.
Given a classification problem, sometimes we do not just predict a class, but need to return the probability that it is a class.
i.e. P(y=0|x), P(y=1|x), P(y=2|x), ..., P(y=C|x)
Without building a new classifier to predict y=0, y=1, y=2... y=C respectively. Since training C classifiers (let's say C=100) can be quite slow.
What can be done to do this? What classifiers naturally can give all probabilities easily (one I know is using neural network with 100 out nodes)? But if I use traditional random forests, I can't do that, right? I use the Python Scikit-Learn library.
If you want probabilities, look for sklearn-classifiers that have method: predict_proba()
Sklearn documentation about multiclass:[http://scikit-learn.org/stable/modules/multiclass.html]
All scikit-learn classifiers are capable of multiclass classification. So you don't need to build 100 models yourself.
Below is a summary of the classifiers supported by scikit-learn grouped by strategy:
Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees,
Random Forests, Nearest Neighbors, setting multi_class='multinomial'
in sklearn.linear_model.LogisticRegression.
Support multilabel: Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression.
One-Vs-One: sklearn.svm.SVC.
One-Vs-All: all linear models exceptsklearn.svm.SVC.
Random forests do indeed give P(Y/x) for multiple classes. In most cases
P(Y/x) can be taken as:
P(Y/x)= the number of trees which vote for the class/Total Number of trees.
However you can play around with this, for example in one case if the highest class has 260 votes, 2nd class 230 votes and other 5 classes 10 votes, and in another case class 1 has 260 votes, and other classes have 40 votes each, you migth feel more confident in your prediction in 2nd case as compared to 1st case, so you come up with a confidence metric according to your use case.