If I use a pipeline like the one below, I receive the error message:
Invalid parameter C for estimator Pipeline(steps=[('robustscaler', RobustScaler()),
('svc', SVC())]). Check the list of available parameters with `estimator.get_params().keys()`
pipe = make_pipeline(RobustScaler(), SVC())
param_grid = {'C':[1,2,3,4]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
mod1 = grid.fit(X_train, y_train)
In contrast, the same param_grid used without the pipeline fits the model without error:
param_grid = {'C':[1,2,3,4]}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
mod1 = grid.fit(X_train, y_train)
This happens with other estimators and scalers I have tried too, regardless of which parameter I try to specify. Similarly, with the same pipeline I can fit a model directly without GridSearchCV (pipe.fit(X_train, y_train)). However, the two do not work together.
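As the answers below explain, a parameter that lives inside a pipeline step has to be prefixed with that step's name. A minimal sketch of the corrected grid, assuming make_pipeline's default step name svc and the same X_train and y_train as above:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

pipe = make_pipeline(RobustScaler(), SVC())
param_grid = {'svc__C': [1, 2, 3, 4]}  # step name + double underscore + parameter name
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
mod1 = grid.fit(X_train, y_train)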
Related
I am using SelectFromModel for feature selection and LogisticRegression as estimator. I have a preprocessing pipeline for transforming numerical and categorical columns. This feature-selection pipe is then combined with a model in another pipeline, which goes into GridSearchCV.
In the param_grid I want to access the max_iter of the LogisticRegression used inside SelectFromModel. So I tried 'selectfrommodel__logisticregression__max_iter': [400, 500].
preprocessor = make_column_transformer(
    (num_transformer, make_column_selector(dtype_include=np.number)),
    (cat_transformer, make_column_selector(dtype_include=object))
)

fs_pipe = make_pipeline(
    preprocessor,
    SelectFromModel(estimator=LogisticRegression(solver='saga'))
)

lr_pipe = make_pipeline(fs_pipe, LogisticRegression(n_jobs=-1))

param_grid = {
    'selectfrommodel__logisticregression__max_iter': [400, 500],
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['saga'],
    'logisticregression__max_iter': [400, 500],
}

lr_grid = GridSearchCV(
    estimator=lr_pipe,
    param_grid=param_grid,
    verbose=1, scoring='f1_micro',
    error_score='raise')

lr_grid.fit(trainX, trainY)
But it's throwing this error:
ValueError: Invalid parameter selectfrommodel for estimator Pipeline(steps=[('pipeline',
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline-1',
Pipeline(steps=[('simpleimputer',
SimpleImputer()),
('minmaxscaler',
MinMaxScaler())]),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000027DBB9E1C48>),
('pipeline-2',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000027DCA939D88>)])),
('selectfrommodel',
SelectFromModel(estimator=LogisticRegression(solver='saga')))])),
('logisticregression', LogisticRegression(n_jobs=-1))]). Check the list of available parameters with `estimator.get_params().keys()`.
How do I access the max_iter parameter of LogisticRegression(), used as estimator in SelectFromModel() from GridSearchCV param_grid?
Your selectfrommodel step is nested inside lr_pipe, under the step named pipeline. Following the error's suggestion and running lr_pipe.get_params().keys() should make that clear. You need pipeline__selectfrommodel__estimator__max_iter: the inner LogisticRegression is reached through SelectFromModel's estimator parameter, not through its class name. (You may also want to skip make_pipeline and make_column_transformer so you can give the steps shorter names, or build one flat three-step pipeline instead of nesting fs_pipe inside lr_pipe.)
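Following that suggestion, a sketch of the flat three-step version, reusing the preprocessor, trainX and trainY from the question:

lr_pipe = make_pipeline(
    preprocessor,
    SelectFromModel(estimator=LogisticRegression(solver='saga')),
    LogisticRegression(n_jobs=-1),
)
param_grid = {
    'selectfrommodel__estimator__max_iter': [400, 500],  # inner estimator of SelectFromModel
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['saga'],
    'logisticregression__max_iter': [400, 500],
}
lr_grid = GridSearchCV(estimator=lr_pipe, param_grid=param_grid,
                       verbose=1, scoring='f1_micro', error_score='raise')
lr_grid.fit(trainX, trainY)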
I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps=[('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'estimator__classifier__C': [1, 10, 100],
              'estimator__classifier__gamma': [0.001, 0.0001]}
# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv = 5)
_ = SVM.fit(X_train,y_train)
I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Your param grid keys should be classifier__C and classifier__gamma. Drop the estimator__ prefix: it only appears because you listed the parameters of the GridSearchCV object (SVM), whose own estimator parameter holds the pipeline. The param_grid is resolved against the pipeline itself, so the keys start at the step name you chose, classifier.
parameters = {'classifier__C': [1, 10, 100],
              'classifier__gamma': [0.001, 0.0001]}
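A quick way to see the difference, using the objects from the question, is to list the parameters of the pipeline itself next to those of the GridSearchCV wrapper:

print(sorted(pipeline.get_params().keys()))  # includes 'classifier__C', 'classifier__gamma', ...
print(sorted(SVM.get_params().keys()))       # the same keys again, but prefixed with 'estimator__'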
In the example below,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(); is this the correct way to apply it to the test set as well?
Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use StandardScaler as a step inside a Pipeline, scikit-learn handles this for you internally.
What happens can be described as follows (a manual sketch of the same steps follows the list):
Step 0: the data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: the scaler is fitted on the TRAINING data.
Step 2: the scaler transforms the TRAINING data.
Step 3: the models are fitted/trained using the transformed TRAINING data.
Step 4: the scaler is used to transform the TEST data.
Step 5: the trained models predict using the transformed TEST data.
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV automatically splits the data into training and test folds (this happens internally).
Use something like this:
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)

print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the results of the grid search on the fitted grid object. The best_score_ attribute gives the best score observed during the search, and best_params_ gives the combination of parameters that achieved it.
IMPORTANT EDIT 1: if you want to hold back a validation set from the original dataset, use this:

from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
Quick answer: Your methodology is correct.
Although the above answer is very good, I just would like to point out some subtleties:
best_score_ [1] is the best cross-validation score, not the generalization performance of the model [2]. To evaluate how well the best found parameters generalize, you should call score on the test set, as you've done. So you need to start by splitting the data into training and test sets, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
Deep Dive:
A threefold split of the data into training, validation and test sets is one way to prevent overfitting the parameters during grid search. GridSearchCV instead uses cross-validation within the training set, which replaces the need for a separate validation set but not for a test set. This can be verified in [2] and [3].
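In code, that workflow looks roughly like this (a sketch reusing the pipe and param_grid from the question; X and y stand for the full dataset):

from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)         # cross-validation happens only inside the training set
print(grid.best_params_)           # best parameter combination found by the search
print(grid.best_score_)            # best cross-validation score, not generalization performance
print(grid.score(X_test, y_test))  # generalization estimate on the held-out test set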
References:
[1] GridSearchCV
[2] Introduction to Machine Learning with Python
[3] 3.1 Cross-validation: evaluating estimator performance
I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline from v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameter in a Pipeline:
logisticregression__C. Do the same for tfidfvectorizer (tfidfvectorizer__ngram_range).
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
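With the double underscores in place, the grid from the question would look like this (same pipe, X_train and y_train as above):

param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))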
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i.e. 'model'
pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', Lasso())
])

# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}

search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for the Lasso regression should be given with the prefix model__. The parameters in the grid depend on what names you gave in the pipeline. In plain old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      SVC())
votingClassifier = VotingClassifier(estimators=[
    ('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
    'p2__svc__kernel': ['rbf', 'poly'],
    'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name the classifier gets inside that pipe; the third segment is the parameter you want to tune.
You can always use model.get_params().keys() (if you are using only a model) or pipeline.get_params().keys() (if you are using a pipeline) to list the parameter keys you can adjust.
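For example, listing the keys of the voting classifier above confirms the layered names, and the grid search is then set up as usual (a sketch; X_train and y_train stand for whatever training data you have):

print(sorted(votingClassifier.get_params().keys()))  # includes 'p2__svc__kernel', 'p2__svc__gamma', ...
grid = GridSearchCV(votingClassifier, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)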
In the following code:
# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
clf = RandomForestClassifier(5000)

model = Pipeline([
    ('fs', feat_selection),
    ('clf', clf),
])

params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
}

gs = GridSearchCV(model, params, ...)
gs.fit(X, y)
What should be used for a prediction?
gs?
gs.best_estimator_?
or
gs.best_estimator_.named_steps['clf']?
What is the difference between these 3?
gs.predict(X_test) is equivalent to gs.best_estimator_.predict(X_test). Using either, X_test will be passed through your entire pipeline and it will return the predictions.
gs.best_estimator_.named_steps['clf'].predict(), however, is only the last step of the pipeline. For it to work, the feature-selection step must already have been applied; that is, you must first run your data through gs.best_estimator_.named_steps['fs'].transform().
Three equivalent methods for generating predictions are shown below:
Using gs directly.
pred = gs.predict(X_test)
Using best_estimator_.
pred = gs.best_estimator_.predict(X_test)
Calling each step in the pipeline individually.
X_test_fs = gs.best_estimator_.named_steps['fs'].transform(X_test)
pred = gs.best_estimator_.named_steps['clf'].predict(X_test_fs)
If you pass True as the value of the refit parameter of GridSearchCV (which is the default anyway), then the estimator with the best found parameters is refit on the whole training data, so you can simply use gs.predict(X_test) for prediction.
If refit is set to False when fitting the GridSearchCV object, best_estimator_ is not available; in that case you have to read the best parameters from gs.best_params_ (or cv_results_) and refit an estimator with them yourself before predicting.
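A sketch of the refit=False case, reusing model and params from the question (the cv=5 value and X_test are assumptions here):

from sklearn.base import clone

gs = GridSearchCV(model, params, refit=False, cv=5)
gs.fit(X, y)
best_model = clone(model).set_params(**gs.best_params_)  # rebuild the pipeline with the best parameters
best_model.fit(X, y)                                     # refit manually on the training data
pred = best_model.predict(X_test)                        # X_test: your held-out data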