sklearn pipeline and grid search - python

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import pandas as pd
# knn (a KNeighborsClassifier) and X_train/y_train are defined earlier
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])
grid2 = GridSearchCV(pipe4, {'clf': [knn, LogisticRegression()]})
grid2.fit(X_train, y_train)
pd.DataFrame(grid2.cv_results_).T
I made a KNN classifier and a logistic regression model and wanted to check which model is better by using a pipeline.
As you can see in the code above, I put only the KNN in pipe4, but in the grid search both KNN and logistic regression were run and I could check the results.
Does this mean I can add models in GridSearchCV even though I put only one model in the pipeline?

Sure. As long as the estimator given to GridSearchCV (in your example: pipe4) supports the parameter names passed in param_grid (in your example: 'clf'), you can pass any values for those parameters in the grid search (in your example: [knn, LogisticRegression()]). Each candidate value for 'clf' simply replaces that step of the pipeline before cross-validation is run.
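To see it end to end, here is a minimal self-contained sketch (using the iris dataset purely as a stand-in for your own X_train/y_train); each value in the 'clf' list replaces the pipeline's final step for that round of cross-validation, and cv_results_ reports one row per candidate:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier()
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])

# One candidate per value of 'clf': the whole estimator object is swapped in.
grid2 = GridSearchCV(pipe4, {'clf': [knn, LogisticRegression(max_iter=1000)]}, cv=5)
grid2.fit(X_train, y_train)

# 'param_clf' shows which estimator each row of results belongs to.
print(pd.DataFrame(grid2.cv_results_)[['param_clf', 'mean_test_score']])
print(grid2.best_estimator_)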

Related

AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'

I'm using sklearn's linear SVM implementation, LinearSVC.
I don't use it directly; I wrap it with CalibratedClassifierCV to get probabilities at prediction time, like:
model = CalibratedClassifierCV(LinearSVC(random_state=0))
After fitting the model, I tried to get coef_ to print the top features, following the post Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib, but I got this error:
coef = classifier.coef_.ravel()
AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'
How can I get the coefficients when I wrap the classifier with a calibrator? I'm not wedded to this particular approach, so if there is another way to get the feature importances, it would be welcome.
coef_ is not an attribute of CalibratedClassifierCV; however, it is an attribute of the base_estimator, which is a LinearSVC in your case. You can access the base estimators via calibrated_classifiers_, which is a list of the fitted models (how many there are depends on your cv value). Below is sample code you can adapt to your needs.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0))
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>,
#  <sklearn.calibration._CalibratedClassifier at 0x7f15d0c57c18>,
#  <sklearn.calibration._CalibratedClassifier at 0x7f15d0aec080>]
In this case my cv is three, so three models are built; I simply loop through them and take the average.
coef_avg = 0
for i in model.calibrated_classifiers_:
    coef_avg = coef_avg + i.base_estimator.coef_
coef_avg = coef_avg / len(model.calibrated_classifiers_)
# array([[ 0.16464871,  0.45680981, -0.77801375, -0.4170196 ],
#        [ 0.1238834 , -0.89117967,  0.35451826, -0.89231957],
#        [-0.83826029, -0.9237139 ,  1.30772955,  1.67592916]])
Note: Starting from sklearn version 0.24, the CalibratedClassifierCV constructor exposes an ensemble argument which, if set to False (assuming cv is not set to "prefit"), makes CalibratedClassifierCV expose only one calibrated classifier, trained on all of the training data. This means we no longer need to loop over all calibrated_classifiers_ at prediction time:
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# Returns a list with one element, [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>]
(building on the example above, given by Parthasarathy)
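Putting the two together, here is a minimal sketch of pulling the coefficients out of the single calibrated classifier when ensemble=False. It uses the base_estimator attribute as in the loop above; note as an assumption that newer scikit-learn releases may expose the inner estimator under a different name, so check this against your version:
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)

# Only one inner classifier when ensemble=False, so no averaging is needed.
# The inner attribute name (base_estimator vs. estimator) depends on the sklearn version.
inner = model.calibrated_classifiers_[0]
coef = inner.base_estimator.coef_
print(coef.shape)  # (3, 4) for the 3-class, 4-feature iris data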

Automatically selecting the best of several estimators in scikit-learn

Does scikit-learn have an estimator that runs several other estimators and automatically selects the one with the best performance (e.g. according to their cross-validation scores)?
I believe there must be something like this in a class that conforms to the estimator interface, so that it can be combined in a pipeline - correct?
You can use GridSearchCV not only to choose the best estimator but also to tune its hyperparameters. For example, I'm using this to find the best text classifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SVC())
])
parameters = {'clf': [
    SVC(),
    MultinomialNB(),
    BernoulliNB(),
    MLPClassifier(max_iter=1000),
    KNeighborsClassifier(),
    SGDClassifier(max_iter=1000),
    RandomForestClassifier()
]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1)
# X: list of raw text documents, y: labels
gs_clf = gs_clf.fit(X, y)
print("Best score", gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
Example from the official docs: http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#sphx-glr-auto-examples-plot-compare-reduction-py
You can even supply your own scoring function to define what "best" means to you:
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
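For instance, a tiny sketch with a hypothetical choice of metric (macro-averaged F1), just to illustrate the mechanism:
from sklearn.metrics import f1_score, make_scorer

# Pass this as GridSearchCV(..., scoring=macro_f1) to rank candidates by macro F1.
macro_f1 = make_scorer(f1_score, average='macro')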
scikit-learn itself currently doesn't have what you are looking for. However, there are libraries like TPOT and automl-learn with an sklearn-like interface for automatically selecting the best estimator or even constructing the whole pipeline.

How to pass argument to scoring function in scikit-learn's LogisticRegressionCV call

Problem
I am trying to use scikit-learn's LogisticRegressionCV with roc_auc_score as the scoring metric.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
clf = LogisticRegressionCV(scoring=roc_auc_score)
But when I attempt to fit the model (clf.fit(X, y)), it throws an error.
ValueError: average has to be one of (None, 'micro', 'macro', 'weighted', 'samples')
That's cool. It's clear what's going on: roc_auc_score needs to be called with the average argument specified, per its documentation and the error above. So I tried that.
clf = LogisticRegressionCV(scoring=roc_auc_score(average='weighted'))
But it turns out that roc_auc_score can't be called with an optional argument alone, because this throws another error.
TypeError: roc_auc_score() takes at least 2 arguments (1 given)
Question
Any thoughts on how I can use roc_auc_score as the scoring metric for LogisticRegressionCV in a way that I can specify an argument for the scoring function?
I can't find an SO question on this issue or a discussion of this issue in scikit-learn's GitHub repo, but surely someone has run into this before?
You can use make_scorer, e.g.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.datasets import make_classification

# some example data
X, y = make_classification()

# little hack to pick out Proba(y == 1)
def roc_auc_score_proba(y_true, proba):
    return roc_auc_score(y_true, proba[:, 1])

# define your scorer
auc = make_scorer(roc_auc_score_proba, needs_proba=True)

# define your classifier
clf = LogisticRegressionCV(scoring=auc)

# train
clf.fit(X, y)

# have a look at the scores
print(clf.scores_)
I found a way to solve this problem!
scikit-learn offers a make_scorer function in its metrics module that allows a user to create a scoring object from one of its native scoring functions with arguments specified to non-default values (see here for more information on this function from the scikit-learn docs).
So, I created a scoring object with the average argument specified.
from sklearn.metrics import make_scorer, roc_auc_score
roc_auc_weighted = make_scorer(roc_auc_score, average='weighted')
Then, I passed that object in the call to LogisticRegressionCV and it ran without any issues!
clf = LogisticRegressionCV(scoring=roc_auc_weighted)
A bit late (4 years later), but today you can simply use:
clf = LogisticRegressionCV(scoring='roc_auc')
Also, all other scoring keys can be obtained through:
from sklearn.metrics import SCORERS
print(SCORERS.keys())
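Newer scikit-learn releases also provide get_scorer_names(), which may be preferable since the SCORERS dict has been deprecated in recent versions:
from sklearn.metrics import get_scorer_names

# Every string in this list can be passed directly as scoring=...
print(sorted(get_scorer_names()))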

TypeError: If no scoring is specified, the estimator passed should have a 'score' method, when using CountVectorizer in a GridSearch

I'm practicing with some text using scikit-learn.
Towards getting more familiar with GridSearch, I am starting with some example code found here:
###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer())
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0)
}
grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)
Notice I am being very careful here, and I've only got one estimator and one parameter!
I'm finding that when I run this, I get the error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None))]) does not.
Hummmm...why am I missing some sort of 'score' attribute?
When I check the possible parameters,
print(CountVectorizer().get_params().keys())
I don't see anything where I can score, as was implied by this answer.
The documentation says "By default, parameter search uses the score function of the estimator to evaluate a parameter setting." So why do I need to specify a score method?
Regardless, I thought I might need to explicitly pass a scoring argument, but this didn't help and gave me an error: grid_search.fit(X_train, y_train, scoring=None)
I don't understand this error!
GridSearch maximizes a score over the grid of parameters. You have to specify what kind of score to use because there are many different types possible. For example, for classification problems you could use accuracy, F1 score, etc. Usually, the score type is specified by passing a string as the scoring argument (see the scoring parameter documentation). Alternatively, model classes like SVC or RandomForestRegressor have a .score() method, and GridSearch will call that if no scoring argument is provided. However, that may or may not be the type of score you want to optimize. There is also the option of passing in a function as the scoring argument if you have an unusual metric you want GridSearch to use.
Transformers, like CountVectorizer, do not implement a score method, because they are just deterministic feature transformations. For the same reason, there aren't any scoring methods that make sense to apply to that type of object. You need a model class (or possibly a clustering algorithm) at the end of your pipeline for scoring to make sense.
Aha! I figured it out.
I wasn't understanding how the pipeline works. Sure, I could create a CountVectorizer, but why? There is no way you can get a score out of it, or basically do anything with it other than have a sparse matrix just sitting there.
I need to add a regressor (SGDRegressor) or a classifier (SGDClassifier) as the final step.
I didn't realize that the pipeline will go
CountVectorizer --> Regressor
or
CountVectorizer --> Classifier
The pipeline does what its name implies: it pipes the objects together in series.
In other words, this works:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDRegressor())
])
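For completeness, a minimal end-to-end sketch showing that the grid search runs once a scorer-capable estimator ends the pipeline. The tiny corpus and labels are made up purely for illustration, standing in for your own X_train and y_train:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Tiny illustrative corpus and labels (stand-ins for X_train, y_train).
X_train = ["good movie", "great film", "bad movie", "awful film",
           "great plot", "bad acting", "good acting", "awful plot"]
y_train = [1, 1, 0, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier())   # final step supplies predict()/score()
])
parameters = {'vect__max_df': (0.5, 0.75, 1.0)}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)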

How to get the folds themselves that are partitioned internally in sklearn.cross_validation.cross_val_score?

I'm using:
sklearn.cross_validation.cross_val_score
to run cross-validation and get the results of each run.
The output of this function is the scores.
Is there a method to get the folds (partitions) themselves that are partitioned internally in the cross_val_score function?
There isn't a way to extract the internal cross-validation splits used by cross_val_score, as this function does not expose any state about them. As mentioned in the documentation, either a KFold or a StratifiedKFold with k=3 will be used by default.
However, if you need to keep track of the cross-validation splits used, you can explicitly pass your own cross-validation iterator via the cv argument of cross_val_score:
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC
iris = load_iris()
kf = KFold(len(iris.target), 5, random_state=0)
clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=kf)
so that it uses the splits you specified exactly instead of rolling its own.
The default cross-validator used by cross_val_score for classification is a StratifiedKFold with k=3. You can instead create the cross-validation iterator yourself with StratifiedKFold and loop over its splits, as shown in the example above.
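If you are on a newer scikit-learn where the cross_validation module has been replaced by model_selection, here is a minimal sketch of the same idea that also keeps the fold indices, so you know exactly which samples each score came from:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC(kernel='linear', C=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Materialize the (train, test) index pairs so they can be inspected later...
folds = list(skf.split(iris.data, iris.target))
for i, (train_idx, test_idx) in enumerate(folds):
    print("fold %d: %d train / %d test samples" % (i, len(train_idx), len(test_idx)))

# ...and pass the very same splits to cross_val_score, so scores line up with folds.
scores = cross_val_score(clf, iris.data, iris.target, cv=folds)
print(scores)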
