Automatically selecting the best of several estimators in scikit-learn - python

Does scikit-learn have an estimator that runs several other estimators and automatically selects the one with the best performance (e.g. according to their cross-validation scores)?
I believe there must be something like this in a class that conforms to the estimator interface, so that it can be combined in a pipeline - correct?

You can use GridSearchCV not only to choose the best estimator but also to tune its hyperparameters. For example, I'm using this to find the best text classifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SVC())
])
# The 'clf' step itself is treated as a tunable parameter,
# so the grid search tries each of these estimators in turn.
parameters = {'clf': [
    SVC(),
    MultinomialNB(),
    BernoulliNB(),
    MLPClassifier(max_iter=1000),
    KNeighborsClassifier(),
    SGDClassifier(max_iter=1000),
    RandomForestClassifier()
]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X, y)
print("Best score", gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
Example from the official docs: http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#sphx-glr-auto-examples-plot-compare-reduction-py
You can even define your own scoring function to specify what "best" means to you:
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
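For example, a minimal sketch of a custom scorer built with make_scorer (the F-beta metric and the beta value are just illustrative choices):
from sklearn.metrics import fbeta_score, make_scorer

# Weight recall more heavily than precision (beta=2); purely illustrative
f2_scorer = make_scorer(fbeta_score, beta=2)

gs_clf = GridSearchCV(pipeline, parameters, scoring=f2_scorer, n_jobs=-1)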

scikit-learn itself currently doesn't have what you are looking for. However, there are libraries such as TPOT and auto-sklearn that offer a scikit-learn-like interface for automatically selecting the best estimator, or even for constructing the whole pipeline.
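For instance, a minimal TPOT sketch (the class and methods are from TPOT's documented API; the settings are just illustrative, and X_train/X_test are assumed train/test splits):
from tpot import TPOTClassifier

# TPOT searches over many scikit-learn estimators and preprocessing steps
tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')  # exports the winning pipeline as plain scikit-learn code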

Related

sklearn pipeline and grid search

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

knn = KNeighborsClassifier()  # defined earlier in my notebook
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])
grid2 = GridSearchCV(pipe4, {'clf': [knn, LogisticRegression()]})
grid2.fit(X_train, y_train)
pd.DataFrame(grid2.cv_results_).T
I made a knn classifier and a logistic regression model and wanted to check which model is better using the pipeline approach.
As you can see in the code above, I put only the knn model in pipe4, but in the grid search both knn and logistic regression are evaluated, and I can check the results.
Does this mean I can add models in GridSearchCV even though I put only one model in the pipeline?
Sure. As long as the estimator given to the GridSearchCV (in your example: pipe4) supports the parameters passed to param_grid (in your example: 'clf'), you can pass any values to the estimator's parameters in the grid search (in your example: [knn, LogisticRegression()]).
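If you also want to tune per-model hyperparameters, you can pass a list of grids so each estimator is only combined with its own parameters. A sketch, reusing pipe4 and knn from the question (grid3 and the parameter values are just illustrative):
param_grid = [
    {'clf': [knn], 'clf__n_neighbors': [3, 5, 7]},
    {'clf': [LogisticRegression(max_iter=1000)], 'clf__C': [0.1, 1.0, 10.0]},
]
grid3 = GridSearchCV(pipe4, param_grid, cv=5)
grid3.fit(X_train, y_train)
print(grid3.best_params_)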

How to know for sure whether cross_validate is using stratified K-fold?

I want to make sure that cross_validate is using a stratified CV. The documentation for cross_validate says:
"For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used."
My estimator is a classifier and my dependent variable is binary, so in theory, even with cv=None, I should obtain a stratified CV.
How can I be sure of that? How to check whether cross_validate here:
rfc_score = cross_validate(rfc, desc_tfidf, labels, scoring=metrics)
is really using a stratified CV?
From the source code of cross_validate, we can see that the very first thing the method runs is:
cv = check_cv(cv, y, classifier=is_classifier(estimator))
And in check_cv, we have:
cv = 5 if cv is None else cv
if isinstance(cv, numbers.Integral):
    if (classifier and (y is not None) and
            (type_of_target(y) in ('binary', 'multiclass'))):
        return StratifiedKFold(cv)
    else:
        return KFold(cv)
which is exactly what the documentation claims.
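You can also verify it from your own code by calling check_cv with the same arguments cross_validate would use (rfc and labels are the objects from your question):
from sklearn.base import is_classifier
from sklearn.model_selection import check_cv

cv = check_cv(None, labels, classifier=is_classifier(rfc))
print(cv)  # e.g. StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
Alternatively, pass an explicit cv=StratifiedKFold(n_splits=5) to cross_validate to remove any doubt.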

Is it necessary to use make_scorer in the scoring argument for a pre-defined score object in cross-validation?

I am using different hyperparameter tuning functions for a classification problem with cross-validation. In particular, I am comparing the performance of GridSearchCV, RandomizedSearchCV and BayesSearchCV.
All these functions have a "scoring" parameter where you can specify either a string naming a predefined scoring metric or a callable that evaluates the predictions on the test set. I understand that sometimes you need to define your own scoring function via this callable using make_scorer. That's fine, no problem.
My question is whether it is preferable to convert a given pre-defined score function (e.g. average_precision_score, f1_score, ...) into a scorer suitable for model selection via make_scorer.
For instance, do these two chunks of code do the same thing?
1) Using a string in the "scoring" argument:
from skopt import BayesSearchCV  # scikit-optimize

opt = BayesSearchCV(clf,
                    search_spaces,
                    scoring='average_precision',
                    cv=4,
                    n_iter=40,
                    n_jobs=-1)
2) Using make_scorer in the "scoring" argument:
from sklearn.metrics import average_precision_score, make_scorer

# define scorer
avg_prec = make_scorer(average_precision_score, greater_is_better=True, needs_proba=True)

opt = BayesSearchCV(clf,
                    search_spaces,
                    scoring=avg_prec,
                    cv=4,
                    n_iter=40,
                    n_jobs=-1)
No need to do it yourself; scikit-learn does the same thing internally. When you provide a string value for the 'scoring' parameter, it is matched against a dict of pre-defined scorers whose values are make_scorer(scorer, ...) objects. See the source code here:
SCORERS = dict(explained_variance=explained_variance_scorer,
...
...
average_precision=average_precision_scorer,
...
...
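You can confirm what a scoring string resolves to with get_scorer (the exact repr varies across scikit-learn versions):
from sklearn.metrics import get_scorer

# The string 'average_precision' maps to a make_scorer(...) wrapper
# around average_precision_score that asks the model for probabilities.
print(get_scorer('average_precision'))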

Pulling Hyperparameters from a Pipeline Object

I'm using LogisticRegressionCV on my data in a pipeline. After fitting to the data, I'd like to return my optimal C value. How do I do this? I can't use .best_params_, since that is an attribute of GridSearchCV. I know that .C_ is the right attribute of LogisticRegressionCV, but my estimator is inside a pipeline, so that doesn't work right now.
lr_cv2 = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', LogisticRegressionCV(solver='liblinear', cv=10,
                                                             Cs=np.logspace(-5, 8, 15)))])
lr_cv2.fit(X_train, y_train)
lr_cv2.C_
AttributeError: 'Pipeline' object has no attribute 'C_'
By using the named_steps attribute of your Pipeline instance, you can access the individual steps of the pipeline and their fitted attributes:
print(lr_cv2.named_steps['classifier'].C_ )
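On newer scikit-learn versions (0.21+), a Pipeline can also be indexed like a dict, so this is equivalent:
print(lr_cv2['classifier'].C_)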

TypeError: If no scoring is specified, the estimator passed should have a 'score' method, when using CountVectorizer in a GridSearch

I'm practicing with some text using scikit-learn.
Towards getting more familiar with GridSearch, I am starting with some example code found here:
###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer())
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0)
}
grid_search = GridSearchCV(pipeline, parameters)  # grid search over the single pipeline parameter
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)
Notice I am being very careful here, and I've only got one estimator and one parameter!
I'm finding that when I run this, I get the error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None))]) does not.
Hummmm...why am I missing some sort of 'score' attribute?
When I check the possible parameters,
print(CountVectorizer().get_params().keys())
I don't see anything where I can score, as was implied by this answer.
The documentation says "By default, parameter search uses the score function of the estimator to evaluate a parameter setting." So why do I need to specify a score method?
Regardless, I thought I might need to explicitly pass a scoring argument, but this didn't help; it just gave me another error: grid_search.fit(X_train, y_train, scoring=None)
I don't understand this error!
GridSearch maximizes a score over the grid of parameters. You have to specify what kind of score to use because there are many different types of scores possible. For example, for classification problems, you could use accuracy, F1 score, etc. Usually, the score type is specified by passing a string in the scoring argument (see the scoring parameter documentation). Alternatively, model classes, like SVC or RandomForestRegressor, have a .score() method, and GridSearch will call that if no scoring argument is provided. However, that may or may not be the type of score that you want to optimize. There is also the option of passing in a function as the scoring argument if you have an unusual metric that you want GridSearch to use.
Transformers, like CountVectorizer, do not implement a score method, because they are just deterministic feature transformations. For the same reason, there aren't any scoring methods that make sense to apply to that type of object. You need a model class (or possibly a clustering algorithm) at the end of your pipeline for scoring to make sense.
Aha! I figured it out.
I wasn't understanding how the pipeline works. Sure, I could create a CountVectorizer, but why? There is no way you can get a score out of it, or basically do anything with it other than have a sparse matrix just sitting there.
I need to add a regressor (e.g. SGDRegressor) or a classifier (e.g. SGDClassifier) at the end.
I didn't realize that the pipeline will go
CV --> Regressor
or
CV --> Classifier
The pipeline does what its name implies...pipes the objects together in series.
In other words, this works:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDRegressor())
])
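With an estimator at the end of the pipeline, the grid search from the question runs as expected. A sketch (the added clf__alpha values are just illustrative, and y_train is assumed to be a continuous target for the regressor):
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'clf__alpha': (1e-4, 1e-3),
}
grid_search = GridSearchCV(pipeline, parameters, cv=3)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)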
