I am trying to create a model using AdaBoost with Polynomial SVM as the base classifier.
The code snippet is as follows :
base_clf = SVC(kernel='poly', degree=3, class_weight='balanced', gamma='scale', probability=True)
model = AdaBoostClassifier(base_estimator=base_clf, n_estimators=10)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
The problem I am facing is that the model always predicts only class 1.
Is it incorrect to use AdaBoost with SVM as base classifier? Please guide.
In practice, we never use SVMs as base classifiers for Adaboost.
AdaBoost (and similar ensemble methods) were conceived ~20 years ago using decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is a good reason why, still today, if you don't explicitly specify the base_estimator argument, it defaults to DecisionTreeClassifier(max_depth=1), i.e. a decision stump.
DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer anything when used as base classifiers for boosting algorithms.
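As a rough illustration of the point above, here is a minimal sketch of AdaBoost left on its default decision-stump base learner. The synthetic dataset from make_classification and the choice of 50 estimators are assumptions made for the example, not the asker's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced binary data stands in for the asker's X and y (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# No base estimator is passed, so the default decision stump is used.
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

y_predict = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
predicted_classes = sorted(int(c) for c in set(y_predict))
print("test accuracy:", test_accuracy)
print("classes predicted:", predicted_classes)
```

Checking which classes actually appear in y_predict is a quick way to spot the "always predicts 1" symptom described in the question.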
I'm working on data where I'm trying different classification algorithms and see which one performs best as a baseline model. The code for that is as follows:
# Trying out different classifiers and selecting the best
## Create list of classifiers we're going to loop through
classifiers = [
KNeighborsClassifier(),
SVC(),
DecisionTreeClassifier(),
RandomForestClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier()
]
classifier_names = [
'kNN',
'SVC',
'DecisionTree',
'RandomForest',
'AdaBoost',
'GradientBoosting'
]
model_scores = []
## Looping through the classifiers
for classifier, name in zip(classifiers, classifier_names):
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('selector', SelectKBest(k=len(X.columns))),
        ('classifier', classifier)])
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    model_scores.append(score)
    print("Model score for {}: {}".format(name, score))
The output is:
Model score for kNN: 0.7472524440239673
Model score for SVC: 0.7896621728161464
Model score for DecisionTree: 0.7302148734267939
Model score for RandomForest: 0.779058799919727
Model score for AdaBoost: 0.7949635904933918
Model score for GradientBoosting: 0.7930712637252372
Turns out the best model is the AdaBoostClassifier(). I would normally pick the best baseline model and perform a GridSearchCV on it to further improve its baseline performance.
However, what if, hypothetically speaking, the model that performed best as a baseline (in this case AdaBoost) only improves by 1% through hyperparameter tuning, while a model that initially performed worse (for example the SVC()) has more "potential" to improve through tuning (say, by 4%) and would, after tuning, end up being the better model?
Is there a way to know this "potential" beforehand, without performing a GridSearch for all classifiers?
No, there is no way to know with 100% certainty before hyperparameter tuning which classifier will end up performing best on a given problem. However, in practice, Kaggle competitions have shown that on tabular classification problems (as opposed to text or image-based ones), a gradient-boosted decision tree model (like XGBoost or LightGBM) works best in almost every case. Given this, it's likely that GradientBoosting will perform better under hyperparameter tuning, since it belongs to the same family of gradient-boosted tree models.
What your code above does is simply use the default values of all hyperparameters. For algorithms that are more sensitive to hyperparameter tuning, the default score is not necessarily indicative of the final (fine-tuned) performance, as you've suggested.
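One pragmatic compromise is to run a deliberately small grid search over a few of the top baseline candidates instead of a full search on only one of them. The sketch below assumes synthetic data and hypothetical, tiny parameter grids; none of these values come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data stands in for the asker's X and y (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately tiny grids: just enough to see how much each model moves.
candidates = {
    "SVC": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
    "AdaBoost": (
        AdaBoostClassifier(random_state=0),
        {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]},
    ),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 3]},
    ),
}

tuned_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    tuned_scores[name] = search.best_score_
    print("{}: best CV accuracy {:.3f} with {}".format(name, search.best_score_, search.best_params_))
```

Comparing these lightly tuned scores against the default-parameter scores gives a first hint of which model responds most to tuning, without committing to an exhaustive search for every classifier.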
Yes, there are ways like Univariate, Bivariate and Multivariate analysis to look at the data and then decide which model you can start as baseline.
You can also use scikit-learn's "choosing the right estimator" map:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
I've been doing the Applied Machine Learning in Python course on Coursera, and in the week 4 assignment I found something interesting. During my first attempt to complete the assignment I tried using RandomForestClassifier from sklearn to predict labels, but the model was overfitting and showed poor test accuracy. As an experiment I switched to RandomForestRegressor and, guess what, not only did it not overfit, but test accuracy was also a lot higher. So, why does RandomForestRegressor perform a lot better on a binary classification problem?
The Random Forest regressor does differ somewhat from the Random Forest classifier when it comes to ensembling the decision trees:
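The classifier uses the mode of the predicted classes of the decision trees (this is the classic formulation; scikit-learn's implementation averages the trees' predicted class probabilities instead)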
The regressor uses the mean of the predicted values of the decision trees
Due to this difference the models can have different results. And in some cases this might result in the regressor performing better than the classifier.
In addition to that I would say that if you tune your hyperparameters correctly, the classifier should perform better on a classification problem than the regressor.
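A toy sketch of how mode-based and mean-based aggregation can disagree for a single sample. The three per-tree leaf values are hypothetical numbers chosen for illustration, and the 0.5 threshold used to turn the regressor's output into a label is also an assumption.

```python
from statistics import mean, mode

# Hypothetical values for one sample: the fraction of positive training
# examples in the leaf that each of three trees routes the sample to.
leaf_fractions = [0.9, 0.4, 0.4]

# Classifier-style aggregation: each tree casts a hard vote, then take the mode.
hard_votes = [int(p >= 0.5) for p in leaf_fractions]
majority_class = mode(hard_votes)

# Regressor-style aggregation: average the raw values, then threshold at 0.5.
mean_value = mean(leaf_fractions)
thresholded_class = int(mean_value >= 0.5)

print("hard votes:", hard_votes, "-> majority class:", majority_class)
print("mean value:", round(mean_value, 3), "-> thresholded class:", thresholded_class)
```

One confident tree outweighs two lukewarm ones under averaging, but loses the vote under majority voting, which is one way the two ensembles can end up with different predictions on the same data.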
Is there any function that gives the top features of each label in a Random Forest/ XG Boost classifier? The classifier.feature_importances_ only gives top features for the classifier as a whole.
Looking for something similar to the classifier.coef_ that gives label-specific top features for SVM and Naive Bayes classifiers in sklearn.
import pandas as pd
feature_importances = pd.DataFrame(rf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance',ascending=False)
Try with this!
Alternatively, 1 vs Rest is also a good option, but it takes a lot of time.
Firstly, Random Forest / XGBoost, or even a simple decision tree or any tree ensemble, is an inherently multi-class classification model. Hence it will predict the multi-class output without using any wrapper (1 vs 1 / 1 vs Rest) on top of a binary classifier (which is what logistic regression/SVM/SGDClassifier would do).
Hence, you can get the feature importance only for the overall multi-class classification and not for individual labels.
If you really want to know the feature importance for individual labels, then use the OneVsRestClassifier wrapper with DecisionTree/RandomForest/XGBoost as the estimator.
This is not the recommended approach because the results could be suboptimal compared with a single (natively multi-class) tree model.
Some examples here.
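A minimal sketch of the one-vs-rest idea, assuming the iris dataset as stand-in data and a RandomForestClassifier with 50 trees as the wrapped estimator (both are choices made for this example). Each fitted binary forest in estimators_ exposes its own feature_importances_, which can be read as "importance for separating this label from the rest".

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Iris stands in for the asker's data (assumption).
iris = load_iris()
X, y = iris.data, iris.target

# One binary random forest is fitted per label (label vs. everything else).
ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
ovr.fit(X, y)

per_label_importances = {}
for label, binary_forest in zip(ovr.classes_, ovr.estimators_):
    # Rank features by the importance reported by this label's own forest.
    ranked = sorted(
        zip(iris.feature_names, binary_forest.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    per_label_importances[str(iris.target_names[label])] = ranked
    print(iris.target_names[label], "->", ranked[0][0])
```

Note that these importances describe the separate binary models, not the single multi-class forest, so they can differ from what classifier.feature_importances_ reports.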
My understanding is that AdaBoostClassifier uses a collection of different classifiers to train on the data, and outputs a meta-classifier composed of the differently weighted classifiers that gives the highest accuracy. I see on scikit-learn's page on AdaBoost that there's a base_estimator parameter; does this only choose the first/starting classifier? Does the n_estimators parameter pick how many classifiers AdaBoost will try before ending? If so, how does it know which of the, say, n=10 classifiers to pick when I call the function?
Is there a complete list of classifiers AdaBoost uses?
I use linear SVM from scikit learn (LinearSVC) for binary classification problem. I understand that LinearSVC can give me the predicted labels, and the decision scores but I wanted probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (as compared to sklearn.svm.SVC with linear kernel) Is it reasonable to use a logistic function to convert the decision scores to probabilities?
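import sklearn.svm as suppmach
# Fit model (x_train and y_train are assumed to be defined alongside x_test):
svmmodel = suppmach.LinearSVC(penalty='l1', dual=False, C=1)  # the l1 penalty requires dual=False
svmmodel.fit(x_train, y_train)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)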
I want to check if it makes sense to obtain Probability estimates simply as [1 / (1 + exp(-x)) ] where x is the decision score.
Alternately, are there other options wrt classifiers that I can use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it lets you add probability output to LinearSVC or any other classifier that implements the decision_function method:
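from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
svm = LinearSVC()
clf = CalibratedClassifierCV(svm)  # wraps the SVM and calibrates its decision scores
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)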
User guide has a nice section on that. By default CalibratedClassifierCV+LinearSVC will get you Platt scaling, but it also provides other options (isotonic regression method), and it is not limited to SVM classifiers.
I took a look at the APIs in the sklearn.svm.* family. The classifiers, i.e.,
sklearn.svm.SVC
sklearn.svm.NuSVC
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model (the regressors sklearn.svm.SVR and sklearn.svm.NuSVR do not have it). If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out, but two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the above two SVM classifiers, which support predict_proba.
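To make the difference concrete, here is a small sketch of the Platt scaling transform, P(y=1 | f) = 1 / (1 + exp(A*f + B)), next to the naive logistic transform. The values of A and B below are hypothetical; in practice they are fitted on held-out decision scores, which this sketch does not do.

```python
import math

def platt_probability(score, a, b):
    """Platt scaling: P(y=1 | score) = 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def naive_logistic(score):
    """The uncalibrated transform from the question: 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical constants; real values would be learned from data.
A, B = -2.0, 0.5

for score in (-1.0, 0.0, 1.3):
    print(score, round(naive_logistic(score), 4), round(platt_probability(score, A, B), 4))
```

The naive transform is the special case A = -1, B = 0, so it only gives sensible probabilities if the decision scores happen to be on that scale already.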
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. With solver='liblinear' it is backed by the same liblinear library as LinearSVC, but it optimizes log-loss instead of a hinge-type loss.
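A minimal sketch of that swap, assuming synthetic data from make_classification in place of the asker's x_train and x_test. LogisticRegression provides predict_proba directly, so no separate calibration step is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data stands in for the asker's training set (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# solver="liblinear" is chosen here to stay close to LinearSVC's backend.
clf = LogisticRegression(solver="liblinear")
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:5])
print(proba)
```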
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which by default fits a linear SVM using stochastic gradient descent. With the modified Huber loss, it estimates the binary probabilities as
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)