I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using Make_pipeline from v.0.16? What is causing this error?
There should be two underscores between estimator name and it's parameters in a Pipeline
logisticregression__C. Do the same for tfidfvectorizer
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i. e. 'model'
pipeline = Pipeline(steps=[
('preprocess', preprocess),
('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for Lasso regression should be given with the prefix model__. The parameters in the grid depends on what name you gave in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
SVC())
votingClassifier = VotingClassifier(estimators=[
('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
'p2__svc__kernel': ['rbf', 'poly'],
'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use the model.get_params().keys() [ in case you are using only model ] or pipeline.get_params().keys() [ in case you are using the pipeline] to get the keys to the parameters you can adjust.
Related
If I use a pipeline like below, I receive the error message
Invalid parameter C for estimator Pipeline(steps=[('robustscaler', RobustScaler()),
('svc', SVC())]). Check the list of available parameters with `estimator.get_params().keys()`
pipe = make_pipeline(RobustScaler(), SVC())
param_grid = {'C':[1,2,3,4]}
grid = GridSearchCV(pipe,param_grid=param_grid , cv=5)
mod1 = grid.fit(X_train, y_train)
On the contrary, the same param_grid outside the pipeline fits the model without error
param_grid = {'C':[1,2,3,4]}
grid = GridSearchCV(SVC(),param_grid=param_grid , cv=5)
mod1 = grid.fit(X_train, y_train)
This happens with other estimators & scalers I have tried too, regardless of the parameter I would like to specify. Similarly, using the same pipeline, I can fit a model directly without GridSearchCV (pipe.fit(X_train,y_train)). However, the two cannot function together.
I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps=[('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'estimator__classifier__C':[1, 10, 100],
'estimator__classifier__gamma':[0.001, 0.0001]}
# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv = 5)
_ = SVM.fit(X_train,y_train)
Which I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Your param grid should be classifier__C and classifier__gamma. You just need to get rid of estimator in the front because you named your SVC estimator as classifier in your pipeline.
parameters = {'classifier__C':[1, 10, 100],
'classifier__gamma':[0.001, 0.0001]}
I am trying to solve a problem where I use KNN algorithm for classification. While using pipeline, I decided to add SelectKBest but I get the error below :
All intermediate steps should be transformers and implement fit and transform.
I don't know if I can use this selection algorithm with KNN. But I tried with SVM as well and got the same result. Here is my code :
sel = SelectKBest('chi2',k = 3)
clf = kn()
s = ss()
step = [('scaler', s), ('kn', clf), ('sel',sel)]
pipeline = Pipeline(step)
parameter = {'kn__n_neighbors':range(1,40,1), 'kn__weights':['uniform','distance'], 'kn__p':[1,2] }
kfold = StratifiedKFold(n_splits=5, random_state=0)
grid = GridSearchCV(pipeline, param_grid = parameter, cv=kfold, scoring = 'accuracy', n_jobs = -1)
grid.fit(x_train, y_train)
The order of the operations in the pipeline, as determined in steps, matters; from the docs:
steps : list
List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
The error is due to adding SelectKBest as the last element of your pipeline:
step = [('scaler', s), ('kn', clf), ('sel',sel)]
which is not an estimator (it is a transformer), as well as to your intermediate step kn not being a transformer.
I guess you don't really want to perform feature selection after you have fitted the model...
Change it to:
step = [('scaler', s), ('sel', sel), ('kn', clf)]
and you should be fine.
So, I didn't think the order of the pipeline is important but then, I found out the last member of the pipeline has to be able to fit/transform. I changed the order of the pipeline by making clf the last. Problem is solved.
In the following code:
# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
clf = RandomForestClassifier(5000)
model = Pipeline([
('fs', feat_selection),
('clf', clf),
])
params = {
'fs__threshold': [0.5, 0.3, 0.7],
'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
'clf__max_features': ['auto', 'sqrt', 'log2'],
}
gs = GridSearchCV(model, params, ...)
gs.fit(X,y)
What should be used for a prediction?
gs?
gs.best_estimator_?
or
gs.best_estimator_.named_steps['clf']?
What is the difference between these 3?
gs.predict(X_test) is equivalent to gs.best_estimator_.predict(X_test). Using either, X_test will be passed through your entire pipeline and it will return the predictions.
gs.best_estimator_.named_steps['clf'].predict(), however is only the last phase of the pipeline. To use it, the feature selection step must already have been performed. This would only work if you have previously run your data through gs.best_estimator_.named_steps['fs'].transform()
Three equivalent methods for generating predictions are shown below:
Using gs directly.
pred = gs.predict(X_test)
Using best_estimator_.
pred = gs.best_estimator_.predict(X_test)
Calling each step in the pipeline individual.
X_test_fs = gs.best_estimator_.named_steps['fs'].transform(X_test)
pred = gs.best_estimator_.named_steps['clf'].predict(X_test_fs)
If you pass True to the value of refit parameter of GridSearchCV (which is the default value anyway), then the estimator with best parameters refits on the whole dataset, so you can use gs.fit(X_test) for prediction.
If the value of refit is equal to False while fitting the GridSearchCV object on your training set, then for prediction, you have only one option which is using gs.best_estimator_.predict(X_test).
I am attempting to tune an AdaBoost Classifier ("ABT") using a DecisionTreeClassifier ("DTC") as the base_estimator. I would like to tune both ABT and DTC parameters simultaneously, but am not sure how to accomplish this - pipeline shouldn't work, as I am not "piping" the output of DTC to ABT. The idea would be to iterate hyper parameters for ABT and DTC in the GridSearchCV estimator.
How can I specify the tuning parameters correctly?
I tried the following, which generated an error below.
[IN]
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {dtc__criterion : ["gini", "entropy"],
dtc__splitter : ["best", "random"],
abc__n_estimators: [none, 1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
[OUT]
ValueError: Invalid parameter dtc for estimator AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight='auto', criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=11, splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=11)
There are several things wrong in the code you posted:
The keys of the param_grid dictionary need to be strings. You should be getting a NameError.
The key "abc__n_estimators" should just be "n_estimators": you are probably mixing this with the pipeline syntax. Here nothing tells Python that the string "abc" represents your AdaBoostClassifier.
None (and not none) is not a valid value for n_estimators. The default value (probably what you meant) is 50.
Here's the code with these fixes.
To set the parameters of your Tree estimator you can use the "__" syntax that allows accessing nested parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"n_estimators": [1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
Also, 1 or 2 estimators does not really make sense for AdaBoost. But I'm guessing this is not the actual code you're running.
Hope this helps.
Trying to provide a shorter (and hopefully generic) answer.
If you want to grid search within a BaseEstimator for the AdaBoostClassifier e.g. varying the max_depth or min_sample_leaf of a DecisionTreeClassifier estimator, then you have to use a special syntax in the parameter grid.
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
parameters = {'base_estimator__max_depth':[i for i in range(2,11,2)],
'base_estimator__min_samples_leaf':[5,10],
'n_estimators':[10,50,250,1000],
'learning_rate':[0.01,0.1]}
clf = GridSearchCV(abc, parameters,verbose=3,scoring='f1',n_jobs=-1)
clf.fit(X_train,y_train)
So, note the 'base_estimator__max_depth' and 'base_estimator__min_samples_leaf' keys in the parameters dictionary. That's the way to access the hyperparameters of a BaseEstimator for an ensemble algorithm like AdaBoostClassifier when you are doing a grid search. Note the __ double underscore notation in particular. Other two keys in the parameters are the regular AdaBoostClassifier parameters.