Invalid parameter for sklearn pipeline - python

I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps=[('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'estimator__classifier__C':[1, 10, 100],
'estimator__classifier__gamma':[0.001, 0.0001]}
# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv = 5)
_ = SVM.fit(X_train,y_train)
Which I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose

Your param grid should be classifier__C and classifier__gamma. You just need to get rid of estimator in the front because you named your SVC estimator as classifier in your pipeline.
parameters = {'classifier__C':[1, 10, 100],
'classifier__gamma':[0.001, 0.0001]}

Related

Tuned 3 parameters using grid search but the best_estimator_ has only 2 parameters

I am tuning a gradient boosted classifier using a pipeline and grid search
My pipeline is
pipe = make_pipeline(StandardScaler(with_std=True, with_mean=True), \
RFE(RandomForestClassifier(), n_features_to_select= 15), \
GradientBoostingClassifier(random_state=42, verbose=True))
The parameter gri is:
tuned_parameters = [{'gradientboostingclassifier__max_depth': range(3, 5),\
'gradientboostingclassifier__min_samples_split': range(4,6),\
'gradientboostingclassifier__learning_rate':np.linspace(0.1, 1, 10)}]
The grid search is done as
grid = GridSearchCV(pipe, tuned_parameters, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)
After fitting the model in train data, when I check the grid.best_estimator I can only find the 2 parameters(learning_rate and min_samples_split )that I am fitting. I don't find the max_depth parameter in the best estimator.
grid.best_estimator_.named_steps['gradientboostingclassifier'] =
GradientBoostingClassifier(learning_rate=0.9, min_samples_split=5,
random_state=42, verbose=True)
But, if I use the grid.cv_results to find the best 'mean_test_score' and find the corresponding parameters for that test score, then I can find the max_depth in it.
inde = np.where(grid.cv_results_['mean_test_score'] == max(grid.cv_results_['mean_test_score']))
grid.cv_results_['params'][inde[-1][0]]
{'gradientboostingclas...rning_rate': 0.9, 'gradientboostingclas..._max_depth': 3, 'gradientboostingclas...ples_split': 5}
special variables
function variables
'gradientboostingclassifier__learning_rate':0.9
'gradientboostingclassifier__max_depth':3
'gradientboostingclassifier__min_samples_split':5
My doubt now is, if I use the trained pipeline (name of the object is 'grid' in my case) will it still use the 'max_depth' parameter also or will it not?
Is it then better to use the 'best parameters' which gave me the best 'mean_test_score' taken from the grid.cv_results
Your pipeline has been tuned on all three parameters that you specified. It is just that the best value for max_depth happens to be the default value. When printing the classifier, default values will not be included. Compare the following outputs:
print(GradientBoostingClassifier(max_depth=3)) # default
# output: GradientBoostingClassifier()
print(GradientBoostingClassifier(max_depth=5)) # not default
# output: GradientBoostingClassifier(max_depth=5)
In general, it is best-practice to access the best parameters by the best_params_ attribute of the fitted GridSearchCV object since this will always include all parameters:
grid.best_params_

Passing parameters to a pipeline's fit() in sklearn

I have a sklearn pipeline with PolynomialFeatures() and LinearRegression() in series. My aim is to fit data to this using different degree of the polynomial features and measure the score. The following is the code I use -
steps = [('polynomials',preprocessing.PolynomialFeatures()),('linreg',linear_model.LinearRegression())]
pipeline = pipeline.Pipeline(steps=steps)
scores = dict()
for i in range(2,6):
params = {'polynomials__degree': i,'polynomials__include_bias': False}
#pipeline.set_params(**params)
pipeline.fit(X_train,y=yCO_logTrain,**params)
scores[i] = pipeline.score(X_train,yCO_logTrain)
scores
I receive the error - TypeError: fit() got an unexpected keyword argument 'degree'.
Why is this error thrown even though the parameters are named in the format <estimator_name>__<parameter_name>?
As per sklearn.pipeline.Pipeline documentation:
**fit_paramsdict of string -> object Parameters passed to the fit method of each step, where each parameter name is prefixed such that
parameter p for step s has key s__p.
This means that the parameters passed this way are directly passed to s step .fit() method. If you check PolynomialFeatures documentation, degree argument is used in construction of the PolynomialFeatures object, not in its .fit() method.
If you want to try different hyperparameters for estimators/transformators within a pipeline, you could use GridSearchCV as shown here. Here's an example code from the link:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
pipe = Pipeline([
('select', SelectKBest()),
('model', calibrated_forest)])
param_grid = {
'select__k': [1, 2],
'model__base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

Error when using scikit-learn to use pipelines

I am trying to perform scaling using StandardScaler and define a KNeighborsClassifier(Create pipeline of scaler and estimator)
Finally, I want to create a Grid Search cross validator for the above where param_grid will be a dictionary containing n_neighbors as hyperparameter and k_vals as values.
def kNearest(k_vals):
skf = StratifiedKFold(n_splits=5, random_state=23)
svp = Pipeline([('ss', StandardScaler()),
('knc', neighbors.KNeighborsClassifier())])
parameters = {'n_neighbors': k_vals}
clf = GridSearchCV(estimator=svp, param_grid=parameters, cv=skf)
return clf
But doing this will give me an error saying that
Invalid parameter n_neighbors for estimator Pipeline. Check the list of available parameters with `estimator.get_params().keys()`.
I've read the documentation, but still don't quite get what the error indicates and how to fix it.
You are right, this is not exactly well-documented by scikit-learn. (Zero reference to it in the class docstring.)
If you use a pipeline as the estimator in a grid search, you need to use a special syntax when specifying the parameter grid. Specifically, you need to use the step name followed by a double underscore, followed by the parameter name as you would pass it to the estimator. I.e.
'<named_step>__<parameter>': value
In your case:
parameters = {'knc__n_neighbors': k_vals}
should do the trick.
Here knc is a named step in your pipeline. There is an attribute that shows these steps as a dictionary:
svp.named_steps
{'knc': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
'ss': StandardScaler(copy=True, with_mean=True, with_std=True)}
And as your traceback alludes to:
svp.get_params().keys()
dict_keys(['memory', 'steps', 'ss', 'knc', 'ss__copy', 'ss__with_mean', 'ss__with_std', 'knc__algorithm', 'knc__leaf_size', 'knc__metric', 'knc__metric_params', 'knc__n_jobs', 'knc__n_neighbors', 'knc__p', 'knc__weights'])
Some official references to this:
The user guide on pipelines
Sample pipeline for text feature extraction and evaluation

Invalid parameter for sklearn estimator pipeline

I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using Make_pipeline from v.0.16? What is causing this error?
There should be two underscores between estimator name and it's parameters in a Pipeline
logisticregression__C. Do the same for tfidfvectorizer
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i. e. 'model'
pipeline = Pipeline(steps=[
('preprocess', preprocess),
('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for Lasso regression should be given with the prefix model__. The parameters in the grid depends on what name you gave in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
SVC())
votingClassifier = VotingClassifier(estimators=[
('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
'p2__svc__kernel': ['rbf', 'poly'],
'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use the model.get_params().keys() [ in case you are using only model ] or pipeline.get_params().keys() [ in case you are using the pipeline] to get the keys to the parameters you can adjust.

Using GridSearchCV with AdaBoost and DecisionTreeClassifier

I am attempting to tune an AdaBoost Classifier ("ABT") using a DecisionTreeClassifier ("DTC") as the base_estimator. I would like to tune both ABT and DTC parameters simultaneously, but am not sure how to accomplish this - pipeline shouldn't work, as I am not "piping" the output of DTC to ABT. The idea would be to iterate hyper parameters for ABT and DTC in the GridSearchCV estimator.
How can I specify the tuning parameters correctly?
I tried the following, which generated an error below.
[IN]
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {dtc__criterion : ["gini", "entropy"],
dtc__splitter : ["best", "random"],
abc__n_estimators: [none, 1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
[OUT]
ValueError: Invalid parameter dtc for estimator AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight='auto', criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=11, splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=11)
There are several things wrong in the code you posted:
The keys of the param_grid dictionary need to be strings. You should be getting a NameError.
The key "abc__n_estimators" should just be "n_estimators": you are probably mixing this with the pipeline syntax. Here nothing tells Python that the string "abc" represents your AdaBoostClassifier.
None (and not none) is not a valid value for n_estimators. The default value (probably what you meant) is 50.
Here's the code with these fixes.
To set the parameters of your Tree estimator you can use the "__" syntax that allows accessing nested parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"n_estimators": [1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
Also, 1 or 2 estimators does not really make sense for AdaBoost. But I'm guessing this is not the actual code you're running.
Hope this helps.
Trying to provide a shorter (and hopefully generic) answer.
If you want to grid search within a BaseEstimator for the AdaBoostClassifier e.g. varying the max_depth or min_sample_leaf of a DecisionTreeClassifier estimator, then you have to use a special syntax in the parameter grid.
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
parameters = {'base_estimator__max_depth':[i for i in range(2,11,2)],
'base_estimator__min_samples_leaf':[5,10],
'n_estimators':[10,50,250,1000],
'learning_rate':[0.01,0.1]}
clf = GridSearchCV(abc, parameters,verbose=3,scoring='f1',n_jobs=-1)
clf.fit(X_train,y_train)
So, note the 'base_estimator__max_depth' and 'base_estimator__min_samples_leaf' keys in the parameters dictionary. That's the way to access the hyperparameters of a BaseEstimator for an ensemble algorithm like AdaBoostClassifier when you are doing a grid search. Note the __ double underscore notation in particular. Other two keys in the parameters are the regular AdaBoostClassifier parameters.

Categories

Resources