I've got a question about where the sklearn SVM classifier, on default settings, will be on an ROC curve or, failing that, how to find out. I've been under the assumption that the ROC curve was a description of general performance, so trying to find the exact position of the classifier was new to me.
Assume that the ROC curve looks like the mean curve on the graph provided here.
Assuming you train an SVM on the entire dataset at default settings, where will it lie on the ROC curve?
EDIT: Clarification
Assume I train an SVM at default values (sklearn); how would I determine where on the ROC curve it sits? Alternatively, which setting on the SVC class allows me to set its position on the ROC curve?
I think you're misunderstanding the concept of an ROC curve. A model doesn't "lie on the ROC"; a model has an ROC curve. This can be used for evaluating your model, or for deciding how you're going to use your model.
Evaluating your model's performance
To calculate the ROC curve of your model, use the roc_curve function, passing the actual results (true labels) and the predicted probabilities of the positive class from your model:
from sklearn.metrics import roc_curve
# roc_curve expects the true labels first, then the scores for the positive class
fpr, tpr, thresholds = roc_curve(y, model.predict_proba(X)[:, 1])
If you want a single measure of your model's performance, you can use the area under the ROC curve; this can be useful if you're trying to tune the hyperparameters of your model, optimise your feature selection, etc. A typical way to calculate this (with k-fold cross validation) in sklearn would be:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, scoring='roc_auc')
Using your model to predict
If you just call model.predict(X), the model will predict based on a probability threshold of 0.5. This is probably not what you want: as @AndreHolzner pointed out in the comment on your question, you'll want to use your ROC curve to decide the false positive rate that you're willing to accept. After this you just check whether your predicted probabilities are above this threshold or not:
thresh = 0.8
# compare the probability of the positive class against the chosen threshold
predictions = model.predict_proba(X)[:, 1] > thresh
Related
I'm using BayesSearchCV from scikit-optimize to train a model on a fairly imbalanced dataset. From what I'm reading, precision or ROC AUC would be the best metrics for an imbalanced dataset. In my code:
knn_b = BayesSearchCV(estimator=pipe, search_spaces=search_space, n_iter=40, random_state=7, scoring='roc_auc')
knn_b.fit(X_train, y_train)
The number of iterations is just a value I chose (although I get a warning saying I already reached the best result, and as far as I'm aware there is no way to stop early?). For the scoring parameter, I specified roc_auc, which I'm assuming will be the primary metric used to pick the best parameters from the results. So when I call knn_b.best_params_, I should get the parameters for which the roc_auc metric is highest. Is that correct?
My confusion is when I look at the results using knn_b.cv_results_. Shouldn't the mean_test_score be the roc_auc score because of the scoring param in the BayesSearchCV class? What I'm doing is plotting the results and seeing how each combination of params performed.
import seaborn as sns

sns.relplot(
    data=knn_b.cv_results_, kind='line',
    x='param_classifier__n_neighbors', y='mean_test_score',
    hue='param_scaler', col='param_classifier__p',
)
When I try to use the roc_auc_score() function on the true and predicted values, I get something completely different.
Is the mean_test_score here different? How would I be able to get the individual/mean roc_auc score of each CV/split of each iteration? Similarly for when I want to use RandomizedSearchCV or GridSearchCV.
EDIT: tldr; I want to know what's being computed exactly in mean_test_score. I thought it was roc_auc because of the scoring param, or accuracy, but it seems to be neither.
mean_test_score is the AUROC, because of your scoring parameter, yes.
Your main problem is that the ROC curve (and the area under it) require the probability predictions (or other continuous score), not the hard class predictions. Your manual calculation is thus incorrect.
You shouldn't expect exactly the same score anyway. Your second score is on the test set, and the first score is optimistically biased by the hyperparameter selection.
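For example, a minimal sketch (it assumes a held-out X_test/y_test split, which is not in the original post, plus the fitted search object knn_b from above): the AUROC should be computed from the predicted probabilities of the positive class, and the per-split scores are already stored in cv_results_:
import numpy as np
from sklearn.metrics import roc_auc_score

# AUROC on the test set, computed from probabilities rather than hard labels
y_scores = knn_b.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_scores))

# Per-split AUROC for the best candidate found by the search
best_idx = knn_b.best_index_
split_scores = [knn_b.cv_results_[f'split{i}_test_score'][best_idx]
                for i in range(knn_b.n_splits_)]
print(split_scores, knn_b.cv_results_['mean_test_score'][best_idx])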
I have a mix of labeled and unlabeled data; the unlabeled part is what I would like to classify using semi-supervised learning. Suppose I already have an algorithm that gives me the best accuracy at predicting the labels of the training subsample. I want to use that algorithm to predict the labels of the unlabeled subsample. In semi-supervised learning, the pseudo-labeled data is added to the labeled (training) data. I would like to select from the pseudo-labeled data only those points whose probability of being correctly classified is higher than, say, 0.8, and repeat the procedure until all the unlabeled data is pseudo-labeled with high confidence.
How could I achieve this? Is there code or a built-in function that helps me compute such a probability?
All these algorithms
AdaBoostClassifier, BaggingClassifier, BayesianGaussianMixture, BernoulliNB, CalibratedClassifierCV, ComplementNB, DecisionTreeClassifier, ExtraTreeClassifier, ExtraTreesClassifier, GaussianMixture, GaussianNB, GaussianProcessClassifier,
GradientBoostingClassifier, KNeighborsClassifier, LabelPropagation, LabelSpreading, LinearDiscriminantAnalysis, LogisticRegression, LogisticRegressionCV, MLPClassifier, MultinomialNB, NuSVC, QuadraticDiscriminantAnalysis, RandomForestClassifier, SGDClassifier, SVC, _BinaryGaussianProcessClassifierLaplace, _ConstantPredictor
support a method called predict_proba(self, X) that does precisely that.
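As an illustration only, here is a rough self-training sketch (the names clf, X_labeled, y_labeled, and X_unlabeled are assumptions, and 0.8 is the threshold from the question); scikit-learn also ships sklearn.semi_supervised.SelfTrainingClassifier, which implements the same idea with a threshold parameter:
import numpy as np

threshold = 0.8
X_lab, y_lab = np.asarray(X_labeled), np.asarray(y_labeled)
X_unlab = np.asarray(X_unlabeled)

while len(X_unlab) > 0:
    clf.fit(X_lab, y_lab)
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold   # points classified with high confidence
    if not confident.any():
        break                                    # nothing left above the threshold
    pseudo_labels = clf.classes_[proba.argmax(axis=1)]
    # move the confident pseudo-labeled points into the labeled set
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo_labels[confident]])
    X_unlab = X_unlab[~confident]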
I have written code that performs logistic regression with leave-one-out cross-validation. I need to know the values of the coefficients of the logistic regression. But the attribute model.coef_ works only after the model has been fit, and since I have only performed cross-validation, I have not called fit to train the model.
Here is the code:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

reg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(reg, train1, labels, cv=loo)
print(scores)
print(scores.mean())
coef = reg.coef_   # fails: cross_val_score fits clones internally, so reg itself is never fitted
I want to know the coefficient values for my features in train1, but since I have not called the fit method, how can I get these coefficients?
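One possible sketch (not from the original post; it reuses reg, loo, train1, and labels from above): cross_validate with return_estimator=True keeps the fitted estimator from every fold, so you can inspect the per-fold coefficients, or you can simply fit once on the full data for a single coefficient vector:
import numpy as np
from sklearn.model_selection import cross_validate

cv_results = cross_validate(reg, train1, labels, cv=loo, return_estimator=True)
fold_coefs = np.array([est.coef_.ravel() for est in cv_results['estimator']])
print(fold_coefs.mean(axis=0))   # coefficients averaged over the leave-one-out folds

reg.fit(train1, labels)          # or fit on all the data for one set of coefficients
print(reg.coef_)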
I just ran a random forest model on an imbalanced dataset. I got the AUC and the confusion matrix. The AUC seemed not bad, but the model actually predicts every instance as positive. How did that happen, and how do I use AUC properly?
The ROC curve is shown in the figure below.
You can have this problem when your data is skewed in one direction or the other (somewhat like medical tests for rare conditions, where even a small false positive rate translates into many false positives). It can be helpful to look at the entire receiver operating characteristic (ROC) curve instead of just the AUC summary score.
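For example, a minimal sketch (assuming a fitted classifier clf and a held-out X_test/y_test, which are not in the original post): plot the full ROC curve and look at the confusion matrix at the default 0.5 threshold to see where the all-positive predictions come from:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, proba)

plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (AUC = %.3f)' % roc_auc_score(y_test, proba))
plt.show()

# Confusion matrix at the default 0.5 threshold shows the "everything positive" behaviour
print(confusion_matrix(y_test, proba >= 0.5))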
I use linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach

# Fit model (penalty='l1' requires dual=False with the default squared hinge loss):
svmmodel = suppmach.LinearSVC(penalty='l1', C=1, dual=False)
svmmodel.fit(x_train, y_train)   # assumes training data x_train, y_train

predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)
I want to check whether it makes sense to obtain probability estimates simply as 1 / (1 + exp(-x)), where x is the decision score.
Alternatively, are there other classifiers I could use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it allows you to add probability output to LinearSVC or any other classifier which implements the decision_function method:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default, CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
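For instance (a small sketch reusing the svm and training data from above), isotonic calibration is selected with the method parameter and the number of calibration folds with cv:
# isotonic regression instead of the default sigmoid (Platt) calibration
clf_iso = CalibratedClassifierCV(svm, method='isotonic', cv=5)
clf_iso.fit(X_train, y_train)
y_proba_iso = clf_iso.predict_proba(X_test)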
I took a look at the APIs in the sklearn.svm.* family. All the models below, e.g.,
sklearn.svm.SVC
sklearn.svm.NuSVC
sklearn.svm.SVR
sklearn.svm.NuSVR
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out; however, two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn SVM family. :) Also feel free to use the above four SVM variations that support predict_proba.
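For completeness, a minimal sketch (X_train, y_train, and X_test are assumed names here) of enabling the built-in Platt scaling on SVC:
from sklearn.svm import SVC

model = SVC(kernel='linear', probability=True)   # fits the Platt scaling step internally
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)              # calibrated probability estimates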
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
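A sketch of that swap (reusing the hypothetical x_train/y_train and x_test names from the question's setup):
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=1)            # linear model trained with log-loss
logreg.fit(x_train, y_train)
proba_test = logreg.predict_proba(x_test)   # probabilities come for free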
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which performs stochastic gradient descent on an SVM by default. For estimation of the binary probabilities it uses the modified Huber loss, via
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)