For a classification task, I am using a VotingClassifier to ensemble logistic regression and an SVM, with the voting parameter set to soft. The result is clearly better than either individual model. I am not sure I understand how it works, though. How can the model find a majority vote between only two models?
Assuming you have two classes, class-A and class-B:
Logistic regression (which has a built-in predict_proba() method) and SVC (with probability=True) can both estimate class probabilities for their outputs, i.e. they predict that an input is class-A with probability a and class-B with probability b. If a > b, the predicted class is A, otherwise B. In a VotingClassifier, setting the voting parameter to 'soft' makes each estimator (the SVM and the logistic regression) compute these probabilities (also known as confidence scores) individually and hand them to the voting classifier, which then averages them and outputs the class with the highest average probability.
Make sure that if you set voting='soft', every classifier you provide can actually compute this confidence score.
To compare the accuracy of each classifier (the individual models and the voting ensemble) you can do:
from sklearn.metrics import accuracy_score
# classifier_name is a trained SVM, logistic regression, or VotingClassifier
y_pred = classifier_name.predict(X_test)
print(classifier_name.__class__.__name__, accuracy_score(y_test, y_pred))
NOTE: a + b may not appear to be exactly 1 due to floating-point round-off, but it is 1. I can't speak for other confidence scores such as decision-function values, but with predict_proba() this is the case.
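For illustration, here is a minimal sketch of soft voting; the dataset and estimator names are placeholders, not taken from the question:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

log_clf = LogisticRegression(max_iter=1000)
svm_clf = SVC(probability=True, random_state=0)  # needed so SVC exposes predict_proba
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('svc', svm_clf)], voting='soft')
voting_clf.fit(X_train, y_train)

# Soft voting = average of the individual predict_proba outputs
avg_proba = (log_clf.fit(X_train, y_train).predict_proba(X_test)
             + svm_clf.fit(X_train, y_train).predict_proba(X_test)) / 2
print(np.allclose(voting_clf.predict_proba(X_test), avg_proba))  # expected: True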
Related
I have an imbalanced binary classification problem. The ratio of positive to negative class is about 1:10. I trained an XGBoost tree model to predict these two classes using continuous and categorical data as input.
Using this XGBoost library, I predict the probability of new inputs using predict_proba. I am assuming the probability values output here are the likelihood of these new test data being the positive class? Say I have an entire test set with test labels, how am I able to evaluate the quality of these (I'm assuming) likelihoods?
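One common way to score such predicted probabilities is a proper scoring rule like log loss or the Brier score. The sketch below is only illustrative and uses a sklearn classifier as a stand-in for the XGBoost model, since any estimator with predict_proba is evaluated the same way:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Toy imbalanced problem; stand-in model for the XGBoost classifier from the question
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba_pos = model.predict_proba(X_test)[:, 1]  # column 1 = probability of the positive class

print("log loss:", log_loss(y_test, proba_pos))
print("Brier score:", brier_score_loss(y_test, proba_pos))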
When using TensorFlow to train a neural network, I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training an SVM? Let's say I want my classifier to only optimize sensitivity (regardless of whether that makes sense), how would I do that?
This is not possible with Support Vector Machines, as far as I know. With other models you might either change the loss that is optimized, or change the classification threshold on the predicted probability.
SVMs, however, minimize the hinge loss, and they do not model class probabilities but rather the separating hyperplane, so there is not much room for manual adjustment.
If you need to focus on sensitivity or specificity, use a different model that allows maximizing that quantity directly, or one that predicts class probabilities (logistic regression or tree-based methods, for example).
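As a hedged illustration of the threshold approach (the dataset and threshold values are placeholders): with a model that outputs probabilities, lowering the decision threshold trades false positives for higher sensitivity (recall):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_test)[:, 1]

# Lowering the threshold from 0.5 increases sensitivity at the cost of more false positives
for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba_pos >= threshold).astype(int)
    print(threshold, "sensitivity:", recall_score(y_test, y_pred))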
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.
I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept that minimise half the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked against a naive implementation from scratch, and the results coincide.
Results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. Looking at the GitHub code, I'm not experienced enough to untangle it.
The reason I'm asking is that I like the RidgeClassifier results -- it generalises a bit better on my problem. But before I use it, I would like to at least have an idea of where it comes from.
Thanks for any help.
RidgeClassifier() works differently from LogisticRegression() with an l2 penalty. The loss function for RidgeClassifier() is not cross-entropy.
RidgeClassifier() uses the Ridge() regression model in the following way to create a classifier:
Let us consider binary classification for simplicity.
Convert the target variable into +1 or -1 depending on the class it belongs to.
Build a Ridge() model (which is a regression model) to predict this target. The loss function is MSE + l2 penalty.
If the Ridge() regression's prediction (the value returned by decision_function()) is greater than 0, predict the positive class, otherwise the negative class.
For multi-class classification:
Use LabelBinarizer() to create a multi-output regression problem, and then train independent Ridge() regression models, one for each class (One-vs-Rest modelling).
Get a prediction from each class's Ridge() regression model (a real number per class) and then use argmax to predict the class.
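A minimal sketch of that construction for the binary case (the dataset and alpha value are placeholders, assuming default solver settings):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # y is 0/1

clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Same thing built by hand: Ridge regression on targets mapped to {-1, +1},
# then thresholding the regression output at 0
reg = Ridge(alpha=1.0).fit(X, 2 * y - 1)
manual_pred = (reg.predict(X) > 0).astype(int)

print(np.array_equal(clf.predict(X), manual_pred))  # expected to print True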
Given a classification problem, sometimes we do not just predict a class, but need to return the probability of each class,
i.e. P(y=0|x), P(y=1|x), P(y=2|x), ..., P(y=C|x)
I would like to do this without building a separate classifier for each of y=0, y=1, ..., y=C, since training C classifiers (say C=100) can be quite slow.
What can be done here? Which classifiers naturally give all class probabilities easily (one I know of is a neural network with 100 output nodes)? If I use traditional random forests, I can't do that, right? I use the Python Scikit-Learn library.
If you want probabilities, look for sklearn classifiers that have the predict_proba() method.
Sklearn documentation about multiclass: http://scikit-learn.org/stable/modules/multiclass.html
All scikit-learn classifiers are capable of multiclass classification. So you don't need to build 100 models yourself.
Below is a summary of the classifiers supported by scikit-learn grouped by strategy:
Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, and sklearn.linear_model.LogisticRegression with multi_class='multinomial'.
Support multilabel: Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression.
One-Vs-One: sklearn.svm.SVC.
One-Vs-All: all linear models except sklearn.svm.SVC.
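To make this concrete, here is a minimal sketch (the dataset is a placeholder) showing a single inherently multiclass model returning the full probability vector per sample:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy problem with 5 classes; one model gives P(y=c|x) for every class at once
X, y = make_classification(n_samples=500, n_classes=5, n_informative=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])  # shape (3, 5), each row sums to 1
print(proba)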
Random forests do indeed give P(Y|x) for multiple classes. In most cases P(Y|x) can be taken as:
P(Y|x) = (number of trees that vote for the class) / (total number of trees).
However, you can play around with this. For example, suppose in one case the top class has 260 votes, the second class 230 votes, and the other five classes 10 votes each, while in another case the top class has 260 votes and every other class has 40 votes. You might feel more confident in the prediction in the second case than in the first, so you can come up with a confidence metric suited to your use case.
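For instance, a hedged sketch of one such metric, the margin between the two highest predicted probabilities (all names and settings here are illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

proba = rf.predict_proba(X[:5])           # shape (5, 3)
top_two = np.sort(proba, axis=1)[:, -2:]  # two largest probabilities per sample
margin = top_two[:, 1] - top_two[:, 0]    # large margin -> clear winner, small -> close race
print(margin)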
I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to keep using LinearSVC because of its speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model (penalty='l1' requires dual=False with the default squared hinge loss):
svmmodel = suppmach.LinearSVC(penalty='l1', dual=False, C=1)
svmmodel.fit(x_train, y_train)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)
I want to check whether it makes sense to obtain probability estimates simply as 1 / (1 + exp(-x)), where x is the decision score.
Alternatively, are there other classifiers I could use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it allows adding probability output to LinearSVC or any other classifier that implements the decision_function method:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)  # wraps the SVM and calibrates its decision scores
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default, CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
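For example, switching to isotonic calibration only changes the constructor call (a hedged sketch; the cv value is just an illustration):
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# method='isotonic' fits a non-parametric calibration curve instead of Platt's sigmoid
clf = CalibratedClassifierCV(LinearSVC(), method='isotonic', cv=5)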
I took a look at the APIs in the sklearn.svm.* family. All the models below, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out, but two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn SVM family. :) Also feel free to use the above four SVM variations that support predict_proba.
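As a small illustration (the dataset and settings are placeholders), enabling the built-in Platt-style calibration on SVC looks like this:
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# probability=True makes libsvm fit the Platt-style sigmoid after training
clf = SVC(kernel='linear', probability=True, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))  # one row of class probabilities per sample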
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
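A hedged sketch of the LogisticRegression swap mentioned above (the dataset is a placeholder):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Same linear model family as LinearSVC, but log-loss gives probability estimates for free
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # probability of the positive class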
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which performs stochastic gradient descent and fits a linear SVM (hinge loss) by default. For estimating binary probabilities with the modified Huber loss it uses
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier

# modified_huber loss is required for SGDClassifier to expose predict_proba
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)