I am attempting to tune an AdaBoostClassifier ("ABC") using a DecisionTreeClassifier ("DTC") as the base_estimator. I would like to tune both ABC and DTC parameters simultaneously, but am not sure how to accomplish this - a pipeline shouldn't work, since I am not "piping" the output of DTC to ABC. The idea is to iterate over hyperparameters for ABC and DTC in the GridSearchCV estimator.
How can I specify the tuning parameters correctly?
I tried the following, which generated an error below.
[IN]
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {dtc__criterion : ["gini", "entropy"],
dtc__splitter : ["best", "random"],
abc__n_estimators: [none, 1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
[OUT]
ValueError: Invalid parameter dtc for estimator AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight='auto', criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=11, splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=11)
There are several things wrong in the code you posted:
The keys of the param_grid dictionary need to be strings. You should be getting a NameError.
The key "abc__n_estimators" should just be "n_estimators": you are probably mixing this with the pipeline syntax. Here nothing tells Python that the string "abc" represents your AdaBoostClassifier.
None (and not none) is not a valid value for n_estimators. The default value (probably what you meant) is 50.
Here's the code with these fixes.
To set the parameters of your Tree estimator you can use the "__" syntax that allows accessing nested parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"n_estimators": [1, 2]
}
DTC = DecisionTreeClassifier(random_state = 11, max_features = "auto", class_weight = "auto",max_depth = None)
ABC = AdaBoostClassifier(base_estimator = DTC)
# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring = 'roc_auc')
Also, 1 or 2 estimators does not really make sense for AdaBoost. But I'm guessing this is not the actual code you're running.
Hope this helps.
Trying to provide a shorter (and hopefully generic) answer.
If you want to grid search over the parameters of the base estimator of an AdaBoostClassifier, e.g. varying the max_depth or min_samples_leaf of a DecisionTreeClassifier estimator, then you have to use a special syntax in the parameter grid.
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
parameters = {'base_estimator__max_depth': [i for i in range(2, 11, 2)],
              'base_estimator__min_samples_leaf': [5, 10],
              'n_estimators': [10, 50, 250, 1000],
              'learning_rate': [0.01, 0.1]}
clf = GridSearchCV(abc, parameters, verbose=3, scoring='f1', n_jobs=-1)
clf.fit(X_train, y_train)
So, note the 'base_estimator__max_depth' and 'base_estimator__min_samples_leaf' keys in the parameters dictionary. That is how you access the hyperparameters of the base estimator of an ensemble algorithm like AdaBoostClassifier when doing a grid search; note the double-underscore (__) notation in particular. The other two keys in the dictionary are regular AdaBoostClassifier parameters.
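If you are ever unsure of the exact key names, you can list them from the estimator itself (a quick check, not part of the original answer):
print(sorted(abc.get_params().keys()))  # includes the nested base_estimator__ keys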
I am trying to determine which alpha is the best in a Ridge Regression with scoring = 'neg_mean_squared_error'.
I have an array with some values for alpha ranging from 5e09 to 5e-03:
array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
Then, I used RidgeCV to try and determine which of these values would be best:
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error',
                  normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
and I got ridgecv.alpha_ = 0.006609705742330144
However, I received a warning that normalize = True is deprecated and will be removed in version 1.2. The warning advised me to use Pipeline and StandardScaler instead. So, following the instructions for building a Pipeline, I did:
steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
y_pred = ridge_pipe2.predict(X_test)
ridge_pipe2.named_steps.model.alpha_
Doing this way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342
For a last check, I also used GridSearchCV as follows:
steps = [
    ('scalar', StandardScaler()),
    ('model', Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
parameters = [{'model__alpha':alphas}]
grid_search = GridSearchCV(estimator = ridge_pipe,
                           param_grid = parameters,
                           scoring = 'neg_mean_squared_error',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_estimator_.get_params
where I got grid_search.best_estimator_.get_params = 1.328043891473342 (same as the other Pipeline approach).
Therefore, my question: why does normalizing my dataset with normalize=True versus with StandardScaler() yield different best alpha values?
The corresponding warning message for ordinary Ridge makes an additional mention:
Set parameter alpha to: original_alpha * n_samples.
(I don't entirely understand why this is, but for now I'm willing to leave it. There should probably be a note added to the warning for RidgeCV along these lines.) Changing your alphas parameter in the second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but after rescaling it back with ridge_pipe2.named_steps.model.alpha_ / X.shape[0], I retrieve the same value as in the first approach (as well as the same rescaled coefficients).
(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
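A minimal sketch of that rescaling, assuming X_train and y_train are the training data from the question (so X.shape[0] above is the number of training rows):
# scale the candidate alphas by the number of training samples, per the Ridge warning
scaled_alphas = [alph * X_train.shape[0] for alph in alphas]
steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=scaled_alphas, scoring='neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
# divide the selected alpha back out to compare with the normalize=True result
ridge_pipe2.named_steps.model.alpha_ / X_train.shape[0]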
You need to ensure that the same cross-validation is used and that you scale without centering the data.
When you run with normalize=True, you get this as part of the warning:
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
Regarding the cv, if you check the documentation, RidgeCV by default performs leave-one-out cross-validation:
Ridge regression with built-in cross-validation.
See glossary entry for cross-validation estimator.
By default, it performs efficient Leave-One-Out Cross-Validation.
So to get the same result, we can define a cross-validation to use:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
kf = KFold(10)
X_train, y_train = datasets.make_regression()
alphas = [0.001,0.005,0.01,0.05,0.1]
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.001
And use it in the pipeline:
steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=kf))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
ridge_pipe2.named_steps.model.alpha_
0.001
I am using SelectFromModel for feature selection with LogisticRegression as the estimator. I have a preprocessing pipeline for the numerical and categorical columns, and I am using this feature-selection pipe together with a model in a pipeline in GridSearchCV.
In the param_grid I want to access the max_iter of the LogisticRegression used inside SelectFromModel, so I tried 'selectfrommodel__logisticregression__max_iter': [400, 500].
preprocessor = make_column_transformer(
    (num_transformer, make_column_selector(dtype_include=np.number)),
    (cat_transformer, make_column_selector(dtype_include=object))
)
fs_pipe = make_pipeline(
    preprocessor,
    SelectFromModel(estimator=LogisticRegression(solver='saga'))
)
lr_pipe = make_pipeline(fs_pipe, LogisticRegression(n_jobs=-1))
param_grid = {
    'selectfrommodel__logisticregression__max_iter': [400, 500],
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['saga'],
    'logisticregression__max_iter': [400, 500],
}
lr_grid = GridSearchCV(
    estimator=lr_pipe,
    param_grid=param_grid,
    verbose=1, scoring='f1_micro',
    error_score='raise')
lr_grid.fit(trainX, trainY)
But it's throwing this error:
ValueError: Invalid parameter selectfrommodel for estimator Pipeline(steps=[('pipeline',
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline-1',
Pipeline(steps=[('simpleimputer',
SimpleImputer()),
('minmaxscaler',
MinMaxScaler())]),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000027DBB9E1C48>),
('pipeline-2',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
<sklearn.compose._column_transformer.make_column_selector object at 0x0000027DCA939D88>)])),
('selectfrommodel',
SelectFromModel(estimator=LogisticRegression(solver='saga')))])),
('logisticregression', LogisticRegression(n_jobs=-1))]). Check the list of available parameters with `estimator.get_params().keys()`.
How do I access the max_iter parameter of LogisticRegression(), used as estimator in SelectFromModel() from GridSearchCV param_grid?
Your selectfrommodel is nested under lr_pipe: the whole fs_pipe becomes a single step that make_pipeline names pipeline, so there is no top-level selectfrommodel parameter. Following the error's suggestion to run lr_pipe.get_params().keys() should clue you in to that. Also note that SelectFromModel exposes its inner model under its parameter name estimator, not logisticregression, so it looks like you need pipeline__selectfrommodel__estimator__max_iter. (And maybe consider skipping make_pipeline and make_column_transformer, so you can give shorter names to the steps? You could also make one three-step pipeline instead of the nested pipes fs_pipe and lr_pipe.)
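A sketch of the corrected grid, keeping everything else from the question unchanged:
param_grid = {
    'pipeline__selectfrommodel__estimator__max_iter': [400, 500],
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['saga'],
    'logisticregression__max_iter': [400, 500],
}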
I want to perform a random search on a classification problem, where the scoring method is AUC instead of the accuracy score. Have a look at my code for reproducibility:
# Define imports and create data
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
x = np.random.normal(0, 1, 100)
y = np.random.binomial(1, 0.5, 100)  # binary target labels
### Let's define parameter grid
rf = RandomForestClassifier(random_state=0)
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=4)]
min_samples_split = [2, 5, 10]
param_grid = {'n_estimators': n_estimators,
              'min_samples_split': min_samples_split}
# Define model
clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5).fit(x.reshape(-1, 1), y)
Now, according to the documentation of RandomizedSearchCV, I can pass another argument, scoring, which chooses the metric used to evaluate the model. I tried to pass scoring='auc', but I got an error that there is no such metric. What do I have to do to get AUC instead of accuracy?
According to the documentation of RandomizedSearchCV, scoring can be a string or a callable. The scoring parameter documentation lists all valid string values; for AUC the string is 'roc_auc'. You can also pass a callable scorer.
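For instance, a minimal sketch of a callable scorer built with make_scorer (assuming the rf, param_grid, x and y from the question, with a binary y):
from sklearn.metrics import make_scorer, roc_auc_score

# needs_proba=True makes the scorer use predict_proba, which ROC AUC expects
auc_scorer = make_scorer(roc_auc_score, needs_proba=True)

clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5,
                         scoring=auc_scorer).fit(x.reshape(-1, 1), y)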
As explained by Danylo and this answer, you can set the search's scoring function to ROC AUC, so that it picks the parameter values maximizing it:
clf = RandomizedSearchCV(rf,
                         param_grid,
                         random_state=0,
                         n_iter=3,
                         cv=5,
                         scoring='roc_auc').fit(x.reshape(-1, 1), y)
I know this appears to be a common problem and that it's based on the specific name of the parameters, but I'm still getting an error after looking at the keys.
steps=[('classifier', svm.SVC(decision_function_shape="ovo"))]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'estimator__classifier__C': [1, 10, 100],
              'estimator__classifier__gamma': [0.001, 0.0001]}
# Instantiate the GridSearchCV object: cv
SVM = GridSearchCV(pipeline, parameters, cv = 5)
_ = SVM.fit(X_train,y_train)
Which I then get:
ValueError: Invalid parameter estimator for estimator ... Check the list of available parameters with `estimator.get_params().keys()`.
So I then look at SVM.get_params().keys() and get the following group, including the two I'm using. What am I missing?
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__preprocessor
estimator__classifier
estimator__preprocessor__n_jobs
estimator__preprocessor__remainder
estimator__preprocessor__sparse_threshold
estimator__preprocessor__transformer_weights
estimator__preprocessor__transformers
estimator__preprocessor__verbose
estimator__preprocessor__scale
estimator__preprocessor__onehot
estimator__preprocessor__scale__memory
estimator__preprocessor__scale__steps
estimator__preprocessor__scale__verbose
estimator__preprocessor__scale__scaler
estimator__preprocessor__scale__scaler__copy
estimator__preprocessor__scale__scaler__with_mean
estimator__preprocessor__scale__scaler__with_std
estimator__preprocessor__onehot__memory
estimator__preprocessor__onehot__steps
estimator__preprocessor__onehot__verbose
estimator__preprocessor__onehot__onehot
estimator__preprocessor__onehot__onehot__categories
estimator__preprocessor__onehot__onehot__drop
estimator__preprocessor__onehot__onehot__dtype
estimator__preprocessor__onehot__onehot__handle_unknown
estimator__preprocessor__onehot__onehot__sparse
estimator__classifier__C
estimator__classifier__break_ties
estimator__classifier__cache_size
estimator__classifier__class_weight
estimator__classifier__coef0
estimator__classifier__decision_function_shape
estimator__classifier__degree
estimator__classifier__gamma
estimator__classifier__kernel
estimator__classifier__max_iter
estimator__classifier__probability
estimator__classifier__random_state
estimator__classifier__shrinking
estimator__classifier__tol
estimator__classifier__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Your param grid keys should be classifier__C and classifier__gamma. You need to get rid of estimator__ at the front: the keys you listed come from SVM.get_params(), i.e. from the GridSearchCV object, where estimator refers to the pipeline you passed in. The keys in param_grid are resolved against that pipeline directly, and within the pipeline you named your SVC step classifier.
parameters = {'classifier__C': [1, 10, 100],
              'classifier__gamma': [0.001, 0.0001]}
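If in doubt, compare the keys of the pipeline itself with those of the GridSearchCV object (a quick check, not part of the original answer):
print(sorted(pipeline.get_params().keys()))  # keys that param_grid must use
print(sorted(SVM.get_params().keys()))       # the GridSearchCV object's keys carry the extra estimator__ prefix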
I am implementing an example from the O'Reilly book "Introduction to Machine Learning with Python", using Python 2.7 and sklearn 0.16.
The code I am using:
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
param_grid = {"logisticregression_C": [0.001, 0.01, 0.1, 1, 10, 100], "tfidfvectorizer_ngram_range": [(1,1), (1,2), (1,3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
The error being returned boils down to:
ValueError: Invalid parameter logisticregression_C for estimator Pipeline
Is this an error related to using make_pipeline in v0.16? What is causing this error?
There should be two underscores between the estimator name and its parameters in a Pipeline: logisticregression__C. Do the same for tfidfvectorizer.
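With that fix, the grid from the question becomes:
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}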
It is mentioned in the user guide here: https://scikit-learn.org/stable/modules/compose.html#nested-parameters.
See the example at https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
For a more general answer to using Pipeline in a GridSearchCV, the parameter grid for the model should start with whatever name you gave when defining the pipeline. For example:
# Pay attention to the name of the second step, i. e. 'model'
pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', Lasso())
])
# Define the parameter grid to be used in GridSearch
param_grid = {'model__alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(pipeline, param_grid)
search.fit(X_train, y_train)
In the pipeline, we used the name model for the estimator step. So, in the grid search, any hyperparameter for Lasso regression should be given with the prefix model__. The parameters in the grid depend on the names you gave the steps in the pipeline. In plain-old GridSearchCV without a pipeline, the grid would be given like this:
param_grid = {'alpha': np.arange(0, 1, 0.05)}
search = GridSearchCV(Lasso(), param_grid)
You can find out more about GridSearch from this post.
Note that if you are using a pipeline with a voting classifier and a column selector, you will need multiple layers of names:
pipe1 = make_pipeline(ColumnSelector(cols=(0, 1)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      SVC())
votingClassifier = VotingClassifier(estimators=[
    ('p1', pipe1), ('p2', pipe2)])
You will need a param grid that looks like the following:
param_grid = {
    'p2__svc__kernel': ['rbf', 'poly'],
    'p2__svc__gamma': ['scale', 'auto'],
}
p2 is the name of the pipe and svc is the default name of the classifier you create in that pipe. The third element is the parameter you want to modify.
You can always use model.get_params().keys() (if you are using only a model) or pipeline.get_params().keys() (if you are using a pipeline) to list the parameter keys you can adjust.
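For instance, a short sketch using the voting-classifier example above (X_train and y_train are assumed to be your training data):
# inspect every available key, including the nested p2__svc__ ones
print(sorted(votingClassifier.get_params().keys()))

grid = GridSearchCV(votingClassifier, param_grid, cv=5)
grid.fit(X_train, y_train)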