basic theory: adaboost classifier in scikit - python

my understanding is that AdaBoostClassifier uses a collection of different classifiers to train data, and outputs a meta classifier that's composed of the differently weighted classifiers that gives highest accuracy. I see in scikit 's page on adaboost there's base_estimator parameter, does this only choose the first/starting classifier? does the n_estimator parameter pick how many classifiers adaboost will try before ending? if so how does it know which of the say n=10 classifiers to pick when I call the function?
is there a complete list of classifiers AdaBoost uses?

Related

RandomForestRegressor for classification problems

I've been doing Applied Machine Learing in Python course on coursera and on Assignment of week 4 I`ve found something interesting. During my first attempt to complete the assignment I tried using RandomForestClassifier from sklearn to predict labels, but the model was overfitting and was showing poor test accuracy results. As an experiment I switched to RandomForestRegressor and, guess what, not only did it not overfit, but test accurary was also a lot higher. So, why does RandomForestRegressor perform a lot better on a binary classification problem?
The Random Forest regressor does differ somewhat from the Random Forest classifier when it comes to ensembling the decision trees:
The classifier uses the mode of the predicted classes of the decision trees
The regressor uses the mean of the predicted values of the decision trees
Due to this difference the models can have different results. And in some cases this might result in the regressor performing better than the classifier.
In addition to that I would say that if you tune your hyperparameters correctly, the classifier should perform better on a classification problem than the regressor.

Sklearn AdaBoost with Polynomial SVM predicts all 1s

I am trying to create a model using AdaBoost with Polynomial SVM as the base classifier.
The code snippet is as follows :
base_clf = SVC(kernel='poly', degree=3, class_weight='balanced', gamma='scale', probability=True)
model = AdaBoostClassifier(base_estimator=base_clf, n_estimators=10)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
The problem that I am facing is that the model is always predicting only 1
Is it incorrect to use AdaBoost with SVM as base classifier? Please guide.
Is it incorrect to use AdaBoost with SVM as base classifier? Please guide.
In practice, we never use SVMs as base classifiers for Adaboost.
Adaboost (and similar ensemble methods) were conceived ~ 20 years ago using decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why still today, if you don't specify explicitly the base_classifier argument, it assumes a value of DecisionTreeClassifier(max_depth=1), i.e. a decision stump.
DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer anything when used as base classifiers for boosting algorithms.

Finding label-specific top features for non-linear classifier

Is there any function that gives the top features of each label in a Random Forest/ XG Boost classifier? The classifier.feature_importances_ only gives top features for the classifier as a whole.
Looking for something similar to the classifier.coef_ that gives label-specific top features for SVM and Naive Bayes classifiers in sklearn.
import pandas as pd
feature_importances = pd.DataFrame(rf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance',ascending=False)
Try with this!
Or 1 vs Rest is also an good option but take lot of time.
Firstly, Random Forest / Xgboost or even a simple DecisionTree/ any Tree ensemble is a inherent multi-class classification model. Hence it will predict the multi-class output without using any wrapper ( 1 vs 1 / 1 vs Rest) on top of binary classifier (which is what the logistic regression/SVM/SGDClassifier would do).
Hence, you can get the feature importance for the overall multi-class classification alone and not for individual labels.
If you really want to know the feature importance for individual labels, then use onevsRest wrapper with decisionTree/ RandomForest/ Xgboost as the estimator.
This is not the recommended approach because the results could be suboptimal when compared with single decision Tree.
Some examples here.

Make graphviz from sklearn RandomForestClassifier (not from individual clf.estimators_)

Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No it doesn't say anything about it. No RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_ so I know I could make a graphviz from one of those. There is an example of that here. I could even score all trees and find the tree with the highest score amongst the forest and choose that one... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?
A RandomForest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub classifiers are used, typically (but not always) in a majority voting ensemble, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other exotic form of combining the individual predictions (research on ensemble methods is ongoing although somewhat sidelined by deep learning).
There is no average tree that could be graphed, only the decision stumps that were trained from random sub samples of the whole dataset and the predictions that each of those produces. It's the predictions themselves that are averaged, not the trees / stumps.
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests1[2] are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.

Comparing confidence scores from decision_function() for scikit-learn LinearSVC

I am using scikit-learn's LinearSVC SVM implementation to perform tagging of text. I have about 100 classifiers trained for different tags. I now want to rank the tags in order of their similarity with the model. I will need to compare confidence scores from decision_function() across these classifiers. Can anyone confirm if I can do that? What do the scores from decision_function() actually mean?

Categories

Resources