Random Forest pruning

Random Forest pruning - python

I have sklearn random forest regressor. It's very heavy, 1.6 GBytes, and works very long time when predicting values.
I want to prune it to make lighter. As I know pruning is not implemented for decision trees and forests. I can't implement it by myself since tree code is written on C and I don't know it.
Does anyone know the solution?

The size of the trees can be a solution for you. Try to limit the size of the trees in the forest (max leaf noders, max depth, min samples split...).

You could try ensemble pruning. This boils down to removing from your random forest a number of the decision trees that make it up.
If you remove trees at random, the expected outcome is that the performance of the ensemble will gradually deteriorate with the number of removed trees. However, you can do something more clever like removing those trees whose predictions are highly correlated with the predictions of the rest of the ensemble, and thus do to significantly modify the outcome of the whole ensemble.
Alternatively, you can train a linear classifier that uses as inputs the outputs of the individual ensembles, and include some kind of l1 penalty in the training to enforce sparse weights on the classifier. The weights with 0 or very small value will hint which trees could be removed from the ensemble with a small impact on accuracy.

Related

How to achieve regression model without underfitting or overfitting

I have my university project and i'm given a dataset which almost all features have a very weak (only 1 feature has moderate correlation with the target) correlation with the target. It's distribution is not normal too. I already tried to apply simple model linear regression it caused underfitting, then i applied simple random forest regressor but it caused overfitting but when i applied random forest regressor with optimization with randomsearchcv it took time so long. Is there any way to get decent model with not-so-good dataset without underfitting or overfitting? or it's just not possible at all?

Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using python, sci-kit learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the maximum number of a samples a tree must have in order to split (or similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF.
Underfitting on linear regression
Add more parameters. If you have variables a, b, ... etc. adding their polynomial features, i.e. a^2, a^3 ... b^2, b^3 ... etc. may help. If you add enough polynomial features you should be able to overfit -- although that doesn't necessarily mean it will have a good fit on the train set (RMSE value).
Try plotting some of the variables against the value to predict (y). Perhaps you may be able to see a non-linear pattern (i.e. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the multiple, or the division between two variables may be a good indicator.
If you are regularizing (or if the software is automatically applying) your regression, try reducing the regularization parameter.

Feature importances with forests of trees

I am trying to find out the importance of my features and wanted to understand how the forest of trees works?
To my understanding, it makes decision trees and the bar graphs show how much variance is explained by the feature which in turn shows the importance of the feature.
I also wanted to undestand what the lines at the end of the graph mean?
Link to the method:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
Is this the correct understanding?
Thanks

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.
It is however important to note that feature_importances_ in Random Forests don't necessarily predict the correct rank of each feature. Two highly correlated features may be on opposite sides of rank table. This won't affect performance of the model if you drop the mistakenly ranked feature though.However it isn't a reliable method to know the importance of each feature. To get around this limitation, I use Sequential Backward Selection.

How does Feature importances with forests of trees work?

Can anyone explain how use of forests of trees to evaluate the importance of features (feature_importances_) works?
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

It is basically a random Forest Implementation. Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.

Make graphviz from sklearn RandomForestClassifier (not from individual clf.estimators_)

Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No it doesn't say anything about it. No RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_ so I know I could make a graphviz from one of those. There is an example of that here. I could even score all trees and find the tree with the highest score amongst the forest and choose that one... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?

A RandomForest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub classifiers are used, typically (but not always) in a majority voting ensemble, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other exotic form of combining the individual predictions (research on ensemble methods is ongoing although somewhat sidelined by deep learning).
There is no average tree that could be graphed, only the decision stumps that were trained from random sub samples of the whole dataset and the predictions that each of those produces. It's the predictions themselves that are averaged, not the trees / stumps.
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests1[2] are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.

random_state parameter in classification models

Can someone explain why does the random_state parameter affects the model so much?
I have a RandomForestClassifier model and want to set the random_state (for reproducibility pourpouses), but depending on the value I use I get very different values on my overall evaluation metric (F1 score)
For example, I tried to fit the same model with 100 different random_state values and after the training ad testing the smallest F1 was 0.64516129 and the largest 0.808823529). That is a huge difference.
This behaviour also seems to make very hard to compare two models.
Thoughts?

If the random_state affects your results it means that your model has a high variance. In case of Random Forest this simply means that you use too small forest and should increase number of trees (which due to bagging - reduce variance). In scikit-learn this is controlled by n_estimators parameters in the constructor.
Why this happens? Each ML method tries to minimize the error, which from matematial perspective can be usually decomposed to bias and variance [+noise] (see bias variance dillema/tradeoff). Bias is simply how far from true values your model has to end up in the expectation - this part of an error usually comes from some prior assumptions, such as using linear model for nonlinear problem etc. Variance is how much your results differ when you train on different subsets of data (or use different hyperparameters, and in case of randomized methods random seed is a parameter). Hyperparameters are initialized by us and Parameters are learnt by the model itself in the training process. Finally - noise is not reducible error coming from the problem itself (or data representation). Thus, in your case - you simply encountered model with high variance, decision trees are well known for their extremely high variance (and small bias). Thus to reduce variance, Breiman proposed the specific bagging method, known today as Random Forest. The larger the forest - stronger the effect of variance reduction. In particular - forest with 1 tree has huge variance, forest of 1000 trees is nearly deterministic for moderate size problems.
To sum up, what you can do?
Increase number of trees - this has to work, and is well understood and justified method
treat random_seed as a hyperparameter during your evaluation, because this is exactly this - a meta knowledge you need to fix before hand if you do not wish to increase size of the forest.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.