RandomForestRegressor for classification problems - python

I've been doing the Applied Machine Learning in Python course on Coursera, and on the Week 4 assignment I found something interesting. During my first attempt to complete the assignment I tried using RandomForestClassifier from sklearn to predict labels, but the model was overfitting and showing poor test accuracy. As an experiment I switched to RandomForestRegressor and, guess what, not only did it not overfit, but test accuracy was also a lot higher. So, why does RandomForestRegressor perform a lot better on a binary classification problem?

The Random Forest regressor does differ somewhat from the Random Forest classifier when it comes to ensembling the decision trees:
The classifier uses the mode of the predicted classes of the decision trees
The regressor uses the mean of the predicted values of the decision trees
Due to this difference the models can produce different results, and in some cases this might result in the regressor performing better than the classifier.
That said, I would say that if you tune your hyperparameters correctly, the classifier should perform better on a classification problem than the regressor.
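As a rough illustration of the difference (hypothetical data and default settings, not taken from the course assignment), the classifier votes on the class directly, while the regressor averages the trees' outputs into a score that you threshold yourself:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classifier: each tree predicts a class and the mode (majority vote) wins
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("classifier accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Regressor: tree outputs are averaged into a value in [0, 1]; threshold at 0.5
reg = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("regressor accuracy:", accuracy_score(y_test, (reg.predict(X_test) >= 0.5).astype(int)))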

Related

How to have regression model predict ranges

I'm trying to make a model that can predict test scores. I'm currently using a simple linear regression model but getting an accuracy score close to 0 because it guesses a single number as the score. I was wondering if there is a way to have the model predict a range of about 10 numbers, and mark the guess as correct if the true number falls in that range.
The dataset I am using
Github page with notebook
It seems like you are using LogisticRegression. Despite its name, LogisticRegression is not for regression; it is for classification (for example, deciding whether the input belongs to class a or class b).
Use sklearn.linear_model.LinearRegression for linear regression; read this for more details.
There are also many other regression algorithms, too many to list in one answer. If you want to use regressions other than plain linear regression, read this for all the supervised learning algorithms scikit-learn provides; Ridge regression and SVR might be good places to start.
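A minimal sketch of both points, switching to LinearRegression and counting a prediction as correct when it falls within a 10-point window of the true score (the synthetic data and the +/- 5 window are assumptions for illustration, not taken from the linked dataset or notebook):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in for the real test-score data from the linked dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Count a prediction as "correct" if it lands within +/- 5 of the true score
range_accuracy = np.mean(np.abs(y_pred - y_test) <= 5)
print("range accuracy:", range_accuracy)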

Sklearn AdaBoost with Polynomial SVM predicts all 1s

I am trying to create a model using AdaBoost with Polynomial SVM as the base classifier.
The code snippet is as follows:
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

base_clf = SVC(kernel='poly', degree=3, class_weight='balanced', gamma='scale', probability=True)
model = AdaBoostClassifier(base_estimator=base_clf, n_estimators=10)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
The problem that I am facing is that the model always predicts only 1.
Is it incorrect to use AdaBoost with an SVM as the base classifier? Please guide.
In practice, we never use SVMs as base classifiers for Adaboost.
AdaBoost (and similar ensemble methods) was conceived about 20 years ago using decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is a good reason why, still today, if you don't explicitly specify the base_estimator argument, it defaults to DecisionTreeClassifier(max_depth=1), i.e. a decision stump.
DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer anything when used as base classifiers for boosting algorithms.
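For comparison, a minimal sketch of the conventional setup this answer describes: AdaBoost over decision stumps (the data is synthetic, and base_estimator is the argument name from the sklearn version used in the question; newer releases rename it to estimator):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical, mildly imbalanced binary data
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Equivalent to the default: an ensemble of decision stumps (max_depth=1)
model = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
).fit(X_train, y_train)
print(model.score(X_test, y_test))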

Finding label-specific top features for non-linear classifier

Is there any function that gives the top features of each label in a Random Forest / XGBoost classifier? The classifier.feature_importances_ only gives the top features for the classifier as a whole.
Looking for something similar to the classifier.coef_ that gives label-specific top features for SVM and Naive Bayes classifiers in sklearn.
import pandas as pd
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
Try this!
One-vs-Rest is also a good option, but it takes a lot of time.
Firstly, Random Forest / XGBoost, or even a simple DecisionTree or any tree ensemble, is an inherently multi-class classification model. Hence it will predict the multi-class output without using any wrapper (1 vs 1 / 1 vs Rest) on top of a binary classifier (which is what logistic regression/SVM/SGDClassifier would do).
Hence, you can only get the feature importance for the overall multi-class classification, not for individual labels.
If you really want the feature importance for individual labels, then use a OneVsRest wrapper with DecisionTree/RandomForest/XGBoost as the estimator.
This is not the recommended approach, because the results could be suboptimal compared with a single multi-class decision tree.
Some examples here.
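A rough sketch of the OneVsRest route described above, fitting one forest per label and reading each wrapped estimator's feature_importances_ (the dataset and feature names are synthetic, purely for illustration):
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 3-class dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
feature_names = ['f%d' % i for i in range(X.shape[1])]

ovr = OneVsRestClassifier(RandomForestClassifier(random_state=0)).fit(X, y)

# One fitted forest per label -> one importance ranking per label
for label, est in zip(ovr.classes_, ovr.estimators_):
    importances = pd.Series(est.feature_importances_, index=feature_names)
    print('label', label)
    print(importances.sort_values(ascending=False).head(3))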

Make graphviz from sklearn RandomForestClassifier (not from individual clf.estimators_)

Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No it doesn't say anything about it. No RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_ so I know I could make a graphviz from one of those. There is an example of that here. I could even score all trees and find the tree with the highest score amongst the forest and choose that one... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?
A RandomForest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub classifiers are used, typically (but not always) in a majority voting ensemble, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other exotic form of combining the individual predictions (research on ensemble methods is ongoing although somewhat sidelined by deep learning).
There is no averaged tree that could be graphed, only the individual trees that were trained on random sub-samples of the whole dataset and the predictions that each of those produces. It's the predictions themselves that are averaged, not the trees.
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.
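So the closest you can get is graphing individual members of clf.estimators_; a minimal sketch with export_graphviz (the dataset, the tree index and the file name are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Export the first tree in the forest; there is no single "averaged" tree to export
export_graphviz(clf.estimators_[0], out_file='tree_0.dot',
                filled=True, rounded=True)
# Render with: dot -Tpng tree_0.dot -o tree_0.png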

optimizing RandomForestRegressor for other metrics

The documentation page for the sklearn random forest regressor says:
The only supported criterion is “mse” for the mean squared error.
My data is messy and has outliers and I feel that MAE or some robust penalty function would perform much better.
Is there a way to fit a random forest regressor for another metric, for example iteratively, or is there another open-source Python alternative, or is my assumption that other metrics are needed wrong in itself? Sklearn is very well developed in other areas, so it seems strange to me that only MSE is supported for such an important approach as random forest.
You can use a GridSearchCV or RandomizedSearchCV to optimize for another criterion in a cross-validation loop. The forests themselves will still optimize for MSE, but the CV loop finds the forest, among the chosen parameter settings, that optimizes the actual criterion you're interested in. (And it optimizes for CV score, not training set score.)
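A minimal sketch of that idea, tuning a RandomForestRegressor with GridSearchCV scored by (negative) mean absolute error; the data and parameter grid are arbitrary examples:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

# The trees still split on MSE internally, but the search picks the
# hyperparameters that minimise MAE across the CV folds
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 5, 10]},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)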
