How does "Feature importances with forests of trees" work? - python

Can anyone explain how using forests of trees to evaluate the importance of features (feature_importances_) works?
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

It is basically a random forest implementation. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure on which the (locally) optimal condition is chosen is called impurity. For classification it is typically either Gini impurity or information gain/entropy, and for regression trees it is variance. Thus, when training a tree, it is possible to compute how much each feature decreases the weighted impurity in that tree. For a forest, the impurity decrease from each feature can be averaged over all trees and the features ranked according to this measure.
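For reference, here is a minimal sketch (using a synthetic dataset, so the numbers are only illustrative) of how these importances can be read off a fitted forest; the standard deviation across the individual trees computed below is what the linked example draws as error bars on top of the bars:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration: 3 informative features out of 10.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# Mean decrease in impurity per feature, averaged over all trees (sums to 1).
importances = forest.feature_importances_

# Spread of each feature's importance across the individual trees.
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

for rank, idx in enumerate(np.argsort(importances)[::-1], start=1):
    print(f"{rank}. feature {idx}: {importances[idx]:.3f} (+/- {std[idx]:.3f})")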

Related

How does the sklearn Decision Tree Regressor determine the optimal threshold for splitting a continuous feature?

I would like to know more about the mathematics behind the sklearn Decision Tree Regressor. I know which criteria a decision tree uses to evaluate splits. However, what if the feature we would like to split on is continuous? In that case, the decision tree makes the split based on a feature threshold.
How does the sklearn Decision Tree Regressor determine the optimal threshold for splitting a continuous feature?
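For intuition, here is a brute-force sketch of the idea (this is not scikit-learn's actual, optimized implementation): sort the samples by the continuous feature, treat the midpoints between consecutive distinct values as candidate thresholds, and keep the threshold that minimizes the weighted variance (the MSE criterion) of the two resulting groups:

import numpy as np

def best_threshold(x, y):
    # Illustrative only: scan candidate thresholds for one continuous feature
    # and return the one minimizing the weighted variance of the two groups.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid split between identical feature values
        t = (x_sorted[i] + x_sorted[i - 1]) / 2.0  # midpoint candidate
        left, right = y_sorted[:i], y_sorted[i:]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# The best split for this toy data lands around x = 0.5.
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])
print(best_threshold(x, y))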

Feature importances with forests of trees

I am trying to find out the importance of my features and wanted to understand how the forest of trees works.
To my understanding, it builds decision trees, and the bar graph shows how much variance is explained by each feature, which in turn indicates the importance of that feature.
I also wanted to understand what the lines at the end of the bars in the graph mean.
Link to the method:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
Is this the correct understanding?
Thanks
A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure on which the (locally) optimal condition is chosen is called impurity. For classification it is typically either Gini impurity or information gain/entropy, and for regression trees it is variance. Thus, when training a tree, it is possible to compute how much each feature decreases the weighted impurity in that tree. For a forest, the impurity decrease from each feature can be averaged over all trees and the features ranked according to this measure. (The vertical lines at the top of the bars in the linked example are error bars: they show the standard deviation of each feature's importance across the individual trees in the forest.)
It is however important to note that feature_importances_ in random forests don't necessarily give the correct ranking of the features. Two highly correlated features may end up at opposite ends of the ranking. Dropping the mistakenly ranked feature won't hurt the model's performance, but it does mean the ranking isn't a reliable measure of each feature's importance. To get around this limitation, I use Sequential Backward Selection.
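The answer does not say which implementation it uses; as one possible sketch, newer versions of scikit-learn (0.24+) ship a SequentialFeatureSelector that can run backward elimination (mlxtend offers a similar class):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Greedy backward elimination: repeatedly drop the feature whose removal
# hurts cross-validated accuracy the least.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    direction="backward",
    cv=3,
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))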

Make graphviz from sklearn RandomForestClassifier (not from individual clf.estimators_)

Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No, it doesn't say anything about it. No, RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_, so I know I could make a graphviz from one of those. There is an example of that here. I could even score all the trees, find the tree with the highest score in the forest and choose that one... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?
A random forest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub-classifiers are used, typically (but not always) in a majority-voting scheme, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other, more exotic form of combining the individual predictions (research on ensemble methods is ongoing, although somewhat sidelined by deep learning).
There is no average tree that could be graphed, only the individual trees that were trained on random sub-samples of the whole dataset and the predictions that each of them produces. It is the predictions themselves that are averaged, not the trees.
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.
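A short sketch of this point on a toy dataset: any single tree in clf.estimators_ can be exported with sklearn.tree.export_graphviz, but the forest's own probability estimate is just the average of the trees' estimates (in scikit-learn specifically this is soft voting over class probabilities rather than a hard vote count), and predict() takes the argmax of that average:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each fitted tree is a full DecisionTreeClassifier; any one of them could be
# passed to sklearn.tree.export_graphviz, but there is no combined tree.
proba_per_tree = np.stack([tree.predict_proba(X) for tree in clf.estimators_])

# The forest's probability estimate is the mean of the trees' estimates.
averaged = proba_per_tree.mean(axis=0)
print(np.allclose(averaged, clf.predict_proba(X)))  # True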

How do you get a probability of all classes to predict without building a classifier for each single class?

Given a classification problem, sometimes we do not just want to predict a class, but need to return the probability of each class.
i.e. P(y=0|x), P(y=1|x), P(y=2|x), ..., P(y=C|x)
I'd like to do this without building a new classifier to predict y=0, y=1, y=2, ..., y=C respectively, since training C classifiers (let's say C=100) can be quite slow.
What can be done to achieve this? Which classifiers can naturally give all the probabilities easily (one I know of is a neural network with 100 output nodes)? But if I use traditional random forests, I can't do that, right? I use the Python Scikit-Learn library.
If you want probabilities, look for sklearn classifiers that have the predict_proba() method.
Sklearn documentation about multiclass: http://scikit-learn.org/stable/modules/multiclass.html
All scikit-learn classifiers are capable of multiclass classification. So you don't need to build 100 models yourself.
Below is a summary of the classifiers supported by scikit-learn grouped by strategy:
Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, and sklearn.linear_model.LogisticRegression with multi_class='multinomial'.
Support multilabel: Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression.
One-Vs-One: sklearn.svm.SVC.
One-Vs-All: all linear models except sklearn.svm.SVC.
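As a small sketch of what this means in practice, a single RandomForestClassifier handles 100 classes directly, and predict_proba returns one probability column per class:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 100 classes handled by one model; no one-vs-rest wrapper is needed.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           n_classes=100, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])   # shape (1, 100), one column per class
print(proba.shape, proba.sum())    # each row sums to 1; columns follow clf.classes_
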
Random forests do indeed give P(y|x) for multiple classes. In most cases P(y|x) can be taken as:
P(y|x) = (number of trees that vote for the class) / (total number of trees).
However, you can play around with this. For example, if in one case the highest class has 260 votes, the 2nd class 230 votes and the other 5 classes 10 votes each, while in another case class 1 has 260 votes and the other classes have 40 votes each, you might feel more confident in your prediction in the 2nd case than in the 1st, so you can come up with a confidence metric suited to your use case.
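As a sketch of such a confidence metric (the margin function below is just one possible choice, not something scikit-learn defines), applied to the two vote patterns described above, each normalized by its own total number of trees:

import numpy as np

def confidence_margin(proba_row):
    # Gap between the top class probability and the runner-up.
    top_two = np.sort(proba_row)[-2:]
    return top_two[1] - top_two[0]

# Vote counts from the two scenarios above (assuming 7 classes), as fractions.
case_1 = np.array([260, 230, 10, 10, 10, 10, 10]) / 540.0
case_2 = np.array([260, 40, 40, 40, 40, 40, 40]) / 500.0

print(confidence_margin(case_1))  # small margin -> less confident
print(confidence_margin(case_2))  # large margin -> more confident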

Random Forest pruning

I have a sklearn random forest regressor. It's very heavy, 1.6 GB, and takes a very long time to predict values.
I want to prune it to make it lighter. As far as I know, pruning is not implemented for decision trees and forests. I can't implement it myself since the tree code is written in C and I don't know it.
Does anyone know the solution?
The size of the trees can be a solution for you. Try to limit the size of the trees in the forest (max_leaf_nodes, max_depth, min_samples_split, ...).
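A minimal sketch of what that looks like (the parameter values here are assumptions, only a starting point to tune on your own data):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=10000, n_features=20, random_state=0)

# Smaller, shallower trees give a much lighter pickled model and faster
# predictions, usually at some cost in accuracy.
small_forest = RandomForestRegressor(
    n_estimators=100,
    max_depth=12,          # cap the depth of each tree
    max_leaf_nodes=1000,   # cap the number of leaves per tree
    min_samples_leaf=5,    # don't create leaves for tiny groups of samples
    n_jobs=-1,
).fit(X, y)
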
You could try ensemble pruning. This boils down to removing from your random forest a number of the decision trees that make it up.
If you remove trees at random, the expected outcome is that the performance of the ensemble will gradually deteriorate with the number of removed trees. However, you can do something more clever, like removing those trees whose predictions are highly correlated with the predictions of the rest of the ensemble and thus do not significantly modify the outcome of the whole ensemble.
Alternatively, you can train a linear model that uses the outputs of the individual ensemble members as inputs, and include an L1 penalty in the training to enforce sparse weights. The weights with zero or very small values hint at which trees could be removed from the ensemble with a small impact on accuracy.
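A rough sketch of the L1 idea for a regressor (the alpha value and the synthetic data are assumptions; in practice you would tune alpha and use held-out data for the selection step):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Treat each tree's prediction on held-out data as one input feature
# for a sparse linear model.
tree_preds = np.column_stack([tree.predict(X_val) for tree in forest.estimators_])

# The L1 penalty drives many tree weights to (near) zero; alpha controls how
# aggressively trees are dropped.
lasso = Lasso(alpha=1.0).fit(tree_preds, y_val)

keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(f"Keeping {len(keep)} of {len(forest.estimators_)} trees")

# The surviving trees can then be copied into a smaller forest, e.g. by
# replacing estimators_ on a copy of the fitted model.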
