I'm using GridSearchCV and a pipeline to classify some text documents. A code snippet:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
parameters = {'vect__ngram_range': [(1, 2)], 'vect__min_df': [2], 'vect__stop_words': ['english'],
              'vect__lowercase': [True], 'vect__norm': ['l2'], 'vect__analyzer': ['word'], 'vect__binary': [True],
              'clf__kernel': ['rbf'], 'clf__C': [100], 'clf__gamma': [0.01], 'clf__probability': [True]}
grid_search = GridSearchCV(clf, parameters, n_jobs=-2, refit=True, cv=10)
grid_search.fit(corpus, labels)
My problem is that when using grid_search.predict_proba(new_doc) and then wanting to find out which classes the probabilities correspond to with grid_search.classes_, I get the following error:
AttributeError: 'GridSearchCV' object has no attribute 'classes_'
What have I missed? I thought that if the last "step" in the pipeline was a classifier, then the return of GridSearchCV would also be a classifier, so one could use that classifier's attributes, e.g. classes_.
As mentioned in the comments above, grid_search.best_estimator_.classes_ raised an error, since best_estimator_ returns a pipeline with no classes_ attribute. However, by first accessing the classifier step of the pipeline I was able to use the classes_ attribute. Here is the solution:
grid_search.best_estimator_.named_steps['clf'].classes_
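For completeness, a minimal sketch (assuming new_doc is a list of documents, as in the question) that pairs each predicted probability with its class label:

probs = grid_search.predict_proba(new_doc)
classes = grid_search.best_estimator_.named_steps['clf'].classes_
# the columns of predict_proba are ordered like classes_, so zip pairs them up
for label, p in zip(classes, probs[0]):
    print(label, p)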
Try grid_search.best_estimator_.classes_.
The return of GridSearchCV is a GridSearchCV instance which is not really an estimator itself. Rather, it instantiates a new estimator for each parameter combination it tries (see the docs).
You may think the return value is a classifier because you can use methods such as predict or predict_proba when refit=True, but GridSearchCV.predict_proba actually looks like this (spoiler from the source):
def predict_proba(self, X):
    """Call predict_proba on the estimator with the best found parameters.

    Only available if ``refit=True`` and the underlying estimator supports
    ``predict_proba``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.
    """
    return self.best_estimator_.predict_proba(X)
Hope this helps.
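Note that recent scikit-learn releases expose a delegating classes_ property on both GridSearchCV and Pipeline, so on a new enough version the direct attribute access may simply work:

# delegates to best_estimator_, which in turn delegates to the final pipeline step
grid_search.classes_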
Related
I have a sklearn pipeline with PolynomialFeatures() and LinearRegression() in series. My aim is to fit data to this pipeline using different degrees of the polynomial features and measure the score. The following is the code I use:
steps = [('polynomials', preprocessing.PolynomialFeatures()),
         ('linreg', linear_model.LinearRegression())]
pipeline = pipeline.Pipeline(steps=steps)
scores = dict()
for i in range(2, 6):
    params = {'polynomials__degree': i, 'polynomials__include_bias': False}
    #pipeline.set_params(**params)
    pipeline.fit(X_train, y=yCO_logTrain, **params)
    scores[i] = pipeline.score(X_train, yCO_logTrain)
scores
I receive the error - TypeError: fit() got an unexpected keyword argument 'degree'.
Why is this error thrown even though the parameters are named in the format <estimator_name>__<parameter_name>?
As per the sklearn.pipeline.Pipeline documentation:

**fit_params : dict of string -> object
    Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
This means that parameters passed this way are handed directly to step s's .fit() method. If you check the PolynomialFeatures documentation, the degree argument is used in the construction of the PolynomialFeatures object, not in its .fit() method.
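The commented-out set_params line in the question points at the right fix: set the constructor parameters on the pipeline before fitting. A minimal sketch of the loop, assuming the same variable names as the question:

for i in range(2, 6):
    # set constructor parameters on the 'polynomials' step, then fit normally
    pipeline.set_params(polynomials__degree=i, polynomials__include_bias=False)
    pipeline.fit(X_train, yCO_logTrain)
    scores[i] = pipeline.score(X_train, yCO_logTrain)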
If you want to try different hyperparameters for estimators/transformers within a pipeline, you could use GridSearchCV as shown here. Here's the example from the link, with the imports and the calibrated_forest definition from the same page included for completeness:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

calibrated_forest = CalibratedClassifierCV(
    base_estimator=RandomForestClassifier(n_estimators=10))

pipe = Pipeline([
    ('select', SelectKBest()),
    ('model', calibrated_forest)])
param_grid = {
    'select__k': [1, 2],
    'model__base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
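Adapted to the pipeline from the question, a minimal sketch (reusing the question's variable names, with scores taken from the cross-validated results rather than the training set):

param_grid = {'polynomials__degree': list(range(2, 6)),
              'polynomials__include_bias': [False]}
search = GridSearchCV(pipeline, param_grid, cv=5).fit(X_train, yCO_logTrain)
# one mean cross-validated score per degree, in grid order
search.cv_results_['mean_test_score']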
My question seems to be similar to this one but there is no solid answer there.
I'm doing multi-class multi-label classification, and for that I have defined my own scorers. However, in order to use the refit parameter and get the best parameters of the model at the end, we need to pass one of the scorer functions to refit. If I do so, I get the error missing 1 required positional argument: 'y_pred'. y_pred should be the outcome of fit, but I am not sure where this issue is coming from or how I can solve it.
Below is the code:
scoring = {'roc_auc_score': make_scorer(roc_auc_score),
           'precision_score': make_scorer(precision_score, average='samples'),
           'recall_score': make_scorer(recall_score, average='samples')}

params = {'estimator__n_estimators': [500, 800],
          'estimator__max_depth': [10, 50]}

model = xgb.XGBClassifier(n_jobs=4)
model = MultiOutputClassifier(model)

cls = GridSearchCV(model, params, cv=3, refit=make_scorer(roc_auc_score),
                   scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)
You should use refit="roc_auc_score", the name of the scorer in your dictionary. From the docs:
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Using a callable for refit has a different purpose: the callable should take the cv_results_ dict and return the best_index_. That explains the error message: sklearn is trying to pass cv_results_ to your AUC scorer function, but that function expects the parameters y_true and y_pred.
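In other words, a minimal fix, keeping everything else from the question unchanged:

cls = GridSearchCV(model, params, cv=3, refit="roc_auc_score",
                   scoring=scoring, verbose=3, n_jobs=-1)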
I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps = [('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'estimator__classifier__C': [1, 10, 100],
              'estimator__classifier__gamma': [0.001, 0.0001]}

# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv=5)
_ = SVM.fit(X_train, y_train)
Which I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Your param grid keys should be classifier__C and classifier__gamma; you just need to drop the estimator__ prefix. That prefix appears in your list because you called get_params() on the GridSearchCV object, whose own estimator parameter holds the pipeline, whereas param_grid keys are resolved against the pipeline itself, in which you named your SVC step classifier.
parameters = {'classifier__C': [1, 10, 100],
              'classifier__gamma': [0.001, 0.0001]}
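To see which keys param_grid actually accepts, inspect the estimator you passed to GridSearchCV rather than the search object itself (a short sketch using the names from the question):

# param_grid keys are resolved against the pipeline:
print(pipeline.get_params().keys())   # includes 'classifier__C' and 'classifier__gamma'
# SVM.get_params() describes the GridSearchCV object, whose own 'estimator'
# parameter is the pipeline -- hence the extra 'estimator__' prefix in your list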
I have set up a sklearn GridSearchCV with a Pipeline as the estimator. My problem is multiclass classification. I receive this error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
This is because I use the F1 score without setting the average argument. My question is: where exactly should I pass this argument to the object?
my code:
estimator = GridSearchCV(
    estimator=Pipeline(setting['layers']),
    param_grid=setting['hyper_parameters'],
    cv=cv,
    scoring=self.scoring,
    refit=self.refit_metric,
    n_jobs=n_jobs,
    return_train_score=True,
    verbose=True
)
and then:
estimator.fit(
    self.x_train,
    self.y_train
)
The error is raised on the .fit() line, but I guess I should pass the parameter when instantiating the GridSearchCV.
For the scoring parameter of GridSearchCV, you can just pass e.g. 'f1_weighted' as a string. That should do the trick. You can have a look at the sklearn docs for possible values.
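If you want more control than the string alias gives you, here is a minimal sketch of both options (assuming the rest of the question's setup stays the same):

from sklearn.metrics import f1_score, make_scorer

# option 1: the built-in string alias
scoring = 'f1_weighted'

# option 2: an explicit scorer, which makes the average argument visible
scoring = make_scorer(f1_score, average='weighted')

# either value can then be passed as scoring=scoring when constructing the GridSearchCV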
I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier (i.e. create a pipeline of scaler and estimator).
Finally, I want to create a grid search cross-validator for the above, where param_grid is a dictionary with n_neighbors as the hyperparameter and k_vals as its values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this will give me an error saying that
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                             metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                             weights='uniform'),
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation