I'm using sklearn linear implementation of SVM classifier LinearSVM.
I didn't use it directly but I wrap it with CalibratedClassifierCV to get the probabilities in the prediction time, like:
model = CalibratedClassifierCV(LinearSVC(random_state=0))
After fitting the model, I tried to get the coef_ to print the Top features, following this post Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib, but this I got this error:
coef = classifier.coef_.ravel()
AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'
How can I get the coef in the case I wrap the classifier with a calibrator?, I'm not totally interested in this way, thus if there is another way to get the features importance, it will be welcomed.
coef_ is not an attribute of CalibratedClassifierCV however, it is an attribute of the base_estimator which is a LinearSVC in your case. You can access your base estimator via the calibrated_classifiers_ which is a list of the fitted models (which depends on the number of models you fit based on your cv value). I have shown a sample code which you can refer to for your need.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0))
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
[<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57c18>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0aec080>]
In this case my cv is three so I have three models built, so I would simple loop through them and taken an average.
coef_avg = 0
for i in model.calibrated_classifiers_:
coef_avg = coef_avg + i.base_estimator.coef_
coef_avg = coef_avg/len(model.calibrated_classifiers_)
array([[ 0.16464871, 0.45680981, -0.77801375, -0.4170196 ],
[ 0.1238834 , -0.89117967, 0.35451826, -0.89231957],
[-0.83826029, -0.9237139 , 1.30772955, 1.67592916]])
Note: Starting from sklearn version 0.24, CalibratedClassifierCV constructor exposes an ensemble argument, that, if set to False (assuming cv is not set to "prefit"), makes CalibratedClassifierCV expose only one calibrated classifier trained using all training data. This means we no longer need to loop over all calibrated_classifiers_ at prediction time:
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# Returns a list with one element, [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>]
(using an example above, given by Parthasarathy)
Related
from sklearn.linear_model import LogisticRegression
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])
grid2 = GridSearchCV(pipe4, {'clf':[ knn, LogisticRegression()]})
grid2.fit(X_train, y_train)
pd.DataFrame(grid2.cv_results_).T
I made a knn classifier and logistic regression model and wanted to check which model is better through pipeline method.
as you can see the code above I put the knn only in the pipe4 but in grid search, both knn and logsistic regression are working and I could check the result
does it mean I can add the models in Gridseacrh even though I put the one model in pipeline?
Sure. As long as the estimator given to the GridSearchCV (in your example: pipe4) supports the parameters passed to param_grid (in your example: 'clf'), you can pass any values to the estimator's parameters in the grid search (in your example: [knn, LogisticRegression()]).
I use LinearSVC for a multi-label classification problem. Since LinearSVC does not provide a predict_proba method, I decided to use CalibratedClassifierCV to scale the decision function into [0, 1] probabilities.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
classifier = CalibratedClassifierCV(LinearSVC(class_weight = 'balanced', max_iter = 100000)
classifier.fit(X_train, y_train)
However, I also need to access the weights coef_, but classifier.base_estimator.coef_ raise the following error:
AttributeError: 'LinearSVC' object has no attribute 'coef_'
I thought classifier.base_estimator returned the calibrated classifier and allowed to access all its attributes. Thanks in advance for explaining me what I missunderstood.
I want to make sure that cross_validate is using a stratified CV. In the documentation for cross_validate, there is written that
For int/None inputs, if the estimator is a classifier and y is either
binary or multiclass, StratifiedKFold is used. In all other cases,
KFold is used.
My estimator is a classifier and my dependent variable is binary. So in theory also by setting cv=None I should obtain a stratified CV.
How can I be sure of that? How to check whether cross_validate here:
rfc_score = cross_validate(rfc, desc_tfidf, labels, scoring=metrics)
is really using a stratified CV?
From the source code of cross-validate, we can see that the very first thing run by the method is:
cv = check_cv(cv, y, classifier=is_classifier(estimator))
And in check_cv, we have:
cv = 5 if cv is None else cv
if isinstance(cv, numbers.Integral):
if (classifier and (y is not None) and
(type_of_target(y) in ('binary', 'multiclass'))):
return StratifiedKFold(cv)
else:
return KFold(cv)
which is exactly what the documentation claims.
Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got the error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np
rng=np.random.RandomState(0)
boston = datasets.load_boston()
X=boston.data
y=boston.target
y_30=np.copy(y)
y_30[rng.rand(len(y))<0.3]=-999
label_propagation.LabelSpreading().fit(X,y_30)
It shows that "ValueError: Unknown label type: 'continuous'" in the label_propagation.LabelSpreading().fit(X,y_30) line.
How should I solve the problem? Thanks a lot.
It looks like the error in the documentation, code itself clearly is classification only (beggining of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the "check_classification_targets" call and use "regression like method", but it will not be the true regression since you will never "propagate" any value which is not encountered in the training set, you will simply treat the regression value as the class identifier. And you will be unable to use value "-1" since it is a codename for "unlabeled"...
Problem
I am trying to use scikit-learn's LogisticRegressionCV with roc_auc_score as the scoring metric.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
clf = LogisticRegressionCV(scoring=roc_auc_score)
But when I attempt to fit the model (clf.fit(X, y)), it throws an error.
ValueError: average has to be one of (None, 'micro', 'macro', 'weighted', 'samples')
That's cool. It's clear what's going on: roc_auc_score needs to be called with the average argument specified, per its documentation and the error above. So I tried that.
clf = LogisticRegressionCV(scoring=roc_auc_score(average='weighted'))
But it turns out that roc_auc_score can't be called with an optional argument alone, because this throws another error.
TypeError: roc_auc_score() takes at least 2 arguments (1 given)
Question
Any thoughts on how I can use roc_auc_score as the scoring metric for LogisticRegressionCV in a way that I can specify an argument for the scoring function?
I can't find an SO question on this issue or a discussion of this issue in scikit-learn's GitHub repo, but surely someone has run into this before?
You can use make_scorer, e.g.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.datasets import make_classification
# some example data
X, y = make_classification()
# little hack to filter out Proba(y==1)
def roc_auc_score_proba(y_true, proba):
return roc_auc_score(y_true, proba[:, 1])
# define your scorer
auc = make_scorer(roc_auc_score_proba, needs_proba=True)
# define your classifier
clf = LogisticRegressionCV(scoring=auc)
# train
clf.fit(X, y)
# have look at the scores
print clf.scores_
I found a way to solve this problem!
scikit-learn offers a make_scorer function in its metrics module that allows a user to create a scoring object from one of its native scoring functions with arguments specified to non-default values (see here for more information on this function from the scikit-learn docs).
So, I created a scoring object with the average argument specified.
roc_auc_weighted = sk.metrics.make_scorer(sk.metrics.roc_auc_score, average='weighted')
Then, I passed that object in the call to LogisticRegressionCV and it ran without any issues!
clf = LogisticRegressionCV(scoring=roc_auc_weighted)
A bit late (4 years later). But today you can use:
clf = LogisticRegressionCV(scoring='roc_auc')
Also, all other scoring keys can be obtained through:
from sklearn.metrics import SCORERS
print(SCORERS.keys())