Error when using pipelines in scikit-learn - python

I am trying to scale features using StandardScaler and define a KNeighborsClassifier (i.e. create a pipeline of the scaler and the estimator).
Finally, I want to create a grid-search cross-validator for the above, where param_grid will be a dictionary containing n_neighbors as the hyperparameter and k_vals as its values.
def kNearest(k_vals):
    skf = StratifiedKFold(n_splits=5, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    parameters = {'n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf
But doing this gives me an error saying:
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.

You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid: the step name, followed by a double underscore, followed by the parameter name as you would pass it to the estimator, i.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                             metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                             weights='uniform'),
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
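Putting it together, a minimal sketch of the corrected function (note that recent scikit-learn versions require shuffle=True whenever StratifiedKFold is given a random_state, so it is added here):
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def kNearest(k_vals):
    # shuffle=True is required alongside random_state in newer sklearn
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
    svp = Pipeline([('ss', StandardScaler()),
                    ('knc', neighbors.KNeighborsClassifier())])
    # step name + double underscore + parameter name
    parameters = {'knc__n_neighbors': k_vals}
    clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
    return clf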
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation

Related

Tuned 3 parameters using grid search but the best_estimator_ has only 2 parameters

I am tuning a gradient boosted classifier using a pipeline and grid search
My pipeline is
pipe = make_pipeline(StandardScaler(with_std=True, with_mean=True),
                     RFE(RandomForestClassifier(), n_features_to_select=15),
                     GradientBoostingClassifier(random_state=42, verbose=True))
The parameter grid is:
tuned_parameters = [{'gradientboostingclassifier__max_depth': range(3, 5),
                     'gradientboostingclassifier__min_samples_split': range(4, 6),
                     'gradientboostingclassifier__learning_rate': np.linspace(0.1, 1, 10)}]
The grid search is done as
grid = GridSearchCV(pipe, tuned_parameters, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)
After fitting the model on the training data, when I check grid.best_estimator_ I can only find two of the parameters (learning_rate and min_samples_split) that I am tuning. I don't find the max_depth parameter in the best estimator.
grid.best_estimator_.named_steps['gradientboostingclassifier']
GradientBoostingClassifier(learning_rate=0.9, min_samples_split=5,
                           random_state=42, verbose=True)
But if I use grid.cv_results_ to find the best 'mean_test_score' and look up the corresponding parameters for that test score, then I can find max_depth in it.
inde = np.where(grid.cv_results_['mean_test_score'] == max(grid.cv_results_['mean_test_score']))
grid.cv_results_['params'][inde[-1][0]]
{'gradientboostingclassifier__learning_rate': 0.9,
 'gradientboostingclassifier__max_depth': 3,
 'gradientboostingclassifier__min_samples_split': 5}
My question now is: if I use the trained pipeline (the object named 'grid' in my case), will it still use the max_depth parameter or not?
Is it then better to use the 'best parameters', which gave me the best 'mean_test_score', taken from grid.cv_results_?
Your pipeline has been tuned on all three parameters that you specified. It is just that the best value for max_depth happens to be the default value. When printing the classifier, default values will not be included. Compare the following outputs:
print(GradientBoostingClassifier(max_depth=3)) # default
# output: GradientBoostingClassifier()
print(GradientBoostingClassifier(max_depth=5)) # not default
# output: GradientBoostingClassifier(max_depth=5)
In general, it is best practice to access the best parameters via the best_params_ attribute of the fitted GridSearchCV object, since this will always include all parameters:
grid.best_params_
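For illustration, with the values from the cv_results_ output shown in the question, this would report all three tuned parameters, including the one that matches the default:
print(grid.best_params_)
# {'gradientboostingclassifier__learning_rate': 0.9,
#  'gradientboostingclassifier__max_depth': 3,
#  'gradientboostingclassifier__min_samples_split': 5}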

GridSearchCV with Scoring Function and Refit Parameter

My question seems to be similar to this one but there is no solid answer there.
I'm doing multi-class, multi-label classification, and I have defined my own scorers. However, in order to use the refit parameter and get the best parameters of the model at the end, we need to pass one of the scorer functions to refit. If I do so, I get the error missing 1 required positional argument: 'y_pred'. y_pred should be the outcome of fit, but I'm not sure where this issue is coming from or how I can solve it.
Below is the code:
scoring = {'roc_auc_score': make_scorer(roc_auc_score),
           'precision_score': make_scorer(precision_score, average='samples'),
           'recall_score': make_scorer(recall_score, average='samples')}
params = {'estimator__n_estimators': [500, 800],
          'estimator__max_depth': [10, 50]}
model = xgb.XGBClassifier(n_jobs=4)
model = MultiOutputClassifier(model)
cls = GridSearchCV(model, params, cv=3, refit=make_scorer(roc_auc_score),
                   scoring=scoring, verbose=3, n_jobs=-1)
model = cls.fit(x_train_ups, y_train_ups)
print(model.best_params_)
You should use refit="roc_auc_score", the name of the scorer in your dictionary. From the docs:
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Using a callable for refit has a different purpose: the callable should take the cv_results_ dict and return the best_index_. That explains the error message: sklearn is trying to pass cv_results_ to your auc scorer function, but that function should take parameters y_true and y_pred.
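A minimal sketch of both variants, reusing model, params, and scoring from the question:
import numpy as np

# Variant 1: refit by scorer name (what you want here)
cls = GridSearchCV(model, params, cv=3, refit="roc_auc_score",
                   scoring=scoring, verbose=3, n_jobs=-1)

# Variant 2: refit via a callable that maps cv_results_ to best_index_
def best_auc_index(cv_results):
    # with a scoring dict, per-scorer results appear under 'mean_test_<name>'
    return int(np.argmax(cv_results["mean_test_roc_auc_score"]))

cls = GridSearchCV(model, params, cv=3, refit=best_auc_index,
                   scoring=scoring, verbose=3, n_jobs=-1)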

Problem with SelectKBest method in pipeline

I am trying to solve a classification problem using the KNN algorithm. While building the pipeline, I decided to add SelectKBest, but I get the error below:
All intermediate steps should be transformers and implement fit and transform.
I don't know whether I can use this selection algorithm with KNN, but I tried SVM as well and got the same result. Here is my code:
sel = SelectKBest('chi2', k=3)
clf = kn()
s = ss()
step = [('scaler', s), ('kn', clf), ('sel', sel)]
pipeline = Pipeline(step)
parameter = {'kn__n_neighbors': range(1, 40, 1),
             'kn__weights': ['uniform', 'distance'],
             'kn__p': [1, 2]}
kfold = StratifiedKFold(n_splits=5, random_state=0)
grid = GridSearchCV(pipeline, param_grid=parameter, cv=kfold, scoring='accuracy', n_jobs=-1)
grid.fit(x_train, y_train)
The order of the operations in the pipeline, as determined in steps, matters; from the docs:
steps : list
List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
The error is due to adding SelectKBest as the last element of your pipeline:
step = [('scaler', s), ('kn', clf), ('sel',sel)]
which is not an estimator (it is a transformer), as well as to your intermediate step kn not being a transformer.
I guess you don't really want to perform feature selection after you have fitted the model...
Change it to:
step = [('scaler', s), ('sel', sel), ('kn', clf)]
and you should be fine.
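For reference, a minimal runnable sketch of the corrected pipeline. Two caveats beyond the ordering: SelectKBest expects the callable chi2 rather than the string 'chi2', and chi2 requires non-negative features, so MinMaxScaler is used here instead of StandardScaler:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

sel = SelectKBest(chi2, k=3)       # callable score_func, not the string 'chi2'
clf = KNeighborsClassifier()
s = MinMaxScaler()                 # keeps features non-negative for chi2
step = [('scaler', s), ('sel', sel), ('kn', clf)]  # estimator last
pipeline = Pipeline(step)
parameter = {'kn__n_neighbors': range(1, 40),
             'kn__weights': ['uniform', 'distance'],
             'kn__p': [1, 2]}
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(pipeline, param_grid=parameter, cv=kfold,
                    scoring='accuracy', n_jobs=-1)
# grid.fit(x_train, y_train)       # x_train, y_train assumed defined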
So, I didn't think the order of the pipeline was important, but then I found out that every step except the last has to be able to fit/transform. I changed the order of the pipeline by making clf the last step. Problem solved.

Invalid parameter for sklearn estimator pipeline

I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer_ngram_range": [(1, 1), (1, 2), (1, 3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline from v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameter in a Pipeline:
logisticregression__C. Do the same for tfidfvectorizer.
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
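Applied to the code in the question, the grid becomes:
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}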
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i.e. 'model'
pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step, so in the grid search any hyperparameter for the Lasso regression should be given with the prefix model__. The parameter names in the grid depend on the names you gave in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      SVC())
votingClassifier = VotingClassifier(estimators=[
    ('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
    'p2__svc__kernel': ['rbf', 'poly'],
    'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use model.get_params().keys() (in case you are using only a model) or pipeline.get_params().keys() (in case you are using a pipeline) to get the keys of the parameters you can adjust.
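For instance, on the voting pipeline above (illustrative; the exact keys depend on your versions and step names):
print(sorted(votingClassifier.get_params().keys()))
# includes entries such as 'p1__logisticregression__C',
# 'p2__svc__kernel', 'p2__svc__gamma', ...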

Get corresponding classes to predict_proba (GridSearchCV sklearn)

I'm using GridSearchCV and a pipeline to classify some text documents. A code snippet:
clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
parameters = {'vect__ngram_range': [(1, 2)], 'vect__min_df': [2], 'vect__stop_words': ['english'],
              'vect__lowercase': [True], 'vect__norm': ['l2'], 'vect__analyzer': ['word'], 'vect__binary': [True],
              'clf__kernel': ['rbf'], 'clf__C': [100], 'clf__gamma': [0.01], 'clf__probability': [True]}
grid_search = GridSearchCV(clf, parameters, n_jobs=-2, refit=True, cv=10)
grid_search.fit(corpus, labels)
My problem is that when I call grid_search.predict_proba(new_doc) and then want to find out which classes the probabilities correspond to via grid_search.classes_, I get the following error:
AttributeError: 'GridSearchCV' object has no attribute 'classes_'
What have I missed? I thought that if the last "step" in the pipeline was a classifier, then the return of GridSearchCV is also a classifier. Hence one can use the attributes of that classifier, e.g. classes_.
As mentioned in the comments above, grid_search.best_estimator_.classes_ returned an error message, since best_estimator_ is a pipeline with no attribute .classes_. However, by first accessing the classifier step of the pipeline, I was able to use the classes_ attribute. Here is the solution:
grid_search.best_estimator_.named_steps['clf'].classes_
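As a quick illustration of how the recovered classes line up with the probability columns (new_doc as in the question):
proba = grid_search.predict_proba(new_doc)
classes = grid_search.best_estimator_.named_steps['clf'].classes_
# column i of proba corresponds to classes[i]
for label, p in zip(classes, proba[0]):
    print(label, p)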
Try grid_search.best_estimator_.classes_.
The return of GridSearchCV is a GridSearchCV instance which is not really an estimator itself. Rather, it instantiates a new estimator for each parameter combination it tries (see the docs).
You may think the return value is a classifier because you can use methods such as predict or predict_proba when refit=True, but the GridSearchCV.predict_proba actually looks like (spoiler from the source):
def predict_proba(self, X):
    """Call predict_proba on the estimator with the best found parameters.

    Only available if ``refit=True`` and the underlying estimator supports
    ``predict_proba``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.
    """
    return self.best_estimator_.predict_proba(X)
Hope this helps.
