Combine GridSearchCV and StackingClassifier - python

I want to use StackingClassifier to combine some classifiers and then use GridSearchCV to optimize the parameters:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
sclf = StackingClassifier(estimators=[clf1, clf2],final_estimator=dt)
params = {'randomforestclassifier__n_estimators': [10, 50],
'logisticregression__C': [1,2,3]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(x, y)
But this turns out an error:
'RandomForestClassifier' object has no attribute 'estimators_'
I have used n_estimators. Why it warns me that no estimators_?
Usually GridSearchCV is applied to single model so I just need to write the name of parameters of the single model in a dict.
I refer to this page https://groups.google.com/d/topic/mlxtend/5GhZNwgmtSg but it uses parameters of early version. Even though I change the newly parameters it doesn't work.
Btw, where can I learn the details of the naming rule of these params?

First of all, the estimators need to be a list containing the models in tuples with the corresponding assigned names.
estimators = [('model1', model()), # model() named model1 by myself
('model2', model2())] # model2() named model2 by myself
Next, you need to use the names as they appear in sclf.get_params().
Also, the name is the same as the one you gave to the specific model in the bove estimators list. So, here for model1 parameters you need:
params = {'model1__n_estimators': [5,10]} # model1__SOME_PARAM
Working toy example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
estimators = [('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('logreg', LogisticRegression())]
sclf = StackingClassifier(estimators= estimators , final_estimator=DecisionTreeClassifier())
params = {'rf__n_estimators': [5,10]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(X, y)

After some trial, maybe I find an available solution.
The key to solve this problem is to use get_params() to know the parameters of StackingClassifier.
I use another way to create sclf:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
estimators = [('rf', clf1),
('lr', clf2)]
sclf = StackingClassifier(estimators=estimators,final_estimator=dt)
params = {'rf__n_estimators': list(range(100,1000,100)),
'lr__C': list(range(1,10,1))}
grid = GridSearchCV(estimator=sclf, param_grid=params,verbose=2, cv=5,n_jobs=-1)
grid.fit(x, y)
In this way, I can name every basic classifiers and then set the params with their names.

Related

How to modify GridSearchCV fit() method

i would like to ask how could i modify the fit() method of GridSearchCV in order to be able to pass some parameters to the custom cross validator. Here is the code snippet that im using:
from sklearn.model_selection import GridSearchCV
skf = CombPurgedKFoldCV(n_splits=10, n_test_splits= 2 ,embargo_td=pd.Timedelta(minutes=100))
clf = DecisionTreeClassifier(criterion='entropy',max_features='auto',class_weight='balanced',min_weight_fraction_leaf=0.)
classifier=BaggingClassifier(base_estimator=clf,n_estimators=1000,max_features=1.,
max_samples=avgU,oob_score=True,n_jobs=1)
gs = GridSearchCV(estimator = classifier, param_grid = grid_param, scoring = 'f1',n_jobs = 1, cv=skf)
gs.fit(X,y,pred_times,eval_times)

set_params() in sklean pipeline not working with TransformTargetRegressor

I would like to make a prediction of a single tree of my random forest. However, if I wrap my pipeline around TransformedTargetRegressor .set_params does not seem to work.
Please find below an example:
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# loading data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# pipeline and training
pipe = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(n_estimators = 100, max_depth = 4, random_state = 0))
])
treg = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())
treg.fit(X, Y)
# single tree from random forest
tree = treg.regressor_.named_steps['model'].estimators_[0]
x_sample = X[0:1]
print('baseline: ', treg.predict(x_sample))
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_predicted = tree.predict(x_scaled)
y_transformed = treg.transformer_.inverse_transform([y_predicted])
print("internal pipeline changes: ", y_transformed)
new_model = treg.set_params(**{'regressor__model': tree})
y_predicted = new_model.predict(x_sample)
print('with set_params(): ', y_predicted)
The output that I am getting is shown below. I would expect 'with set_params()' to be the same like 'internal pipeline changes:
baseline: [26.41013313]
internal pipeline changes: [[30.02424242]]
with set_params(): [26.41013313]
TransformedTargetRegressor has a parameter regressor and an attribute regressor_. The former can be set with set_params and is considered a hyperparameter, but is not used in prediction; rather, it is cloned and fitted when the TTR is fitted, and stored in the regressor_ attribute.
So you cannot use set_params to update the fitted regressor attribute. (You can check that in your code, new_model.regressor_['model'] is still a random forest.) The best you can do is directly modify the attribute (though this is probably unorthodox, and in some situations may lead to other issues):
import copy
mod_model = copy.deepcopy(treg)
mod_model.regressor_.steps[-1] = ('model', tree)
y_predicted = mod_model.predict(x_sample)
print('with modifying regressor: ', y_predicted)
Apparently, scikit-learn TransformedTargetRegressor objects don't allow you to change the regressor used to predict, unless you re-fit the dataset on the new regressor in set_params. If you do this:
new_model = treg.set_params(**{'regressor__model': tree})
print(new_model)
you can see that the new parameters have been set. However, as you correctly discovered, the estimator used in predict is still the old one. If you want to change the estimator in the object, you can do:
new_model = treg.set_params(**{'regressor__model': tree})
new_model.fit(X, Y)
new_model.predict(x_sample)
And you can see that the prediction changes and uses the single tree to perform the estimation. If you are interested in the sinlge tree's prediciton and not re-fit on the whole dataset, you can just call, tree.predict() separately.

Cross validation with multiple parameters using f1-score

I am trying to do feature selection using SelectKBest and the best tree depth for binary classification using f1-score. I have created a scorer function to select the best features and to evaluate the grid search. An error of "call() missing 1 required positional argument: 'y_true'" pops up when the classifier is trying to fit to the training data.
#Define scorer
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f1_scorer, k=all)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
#initialize a grid search with features to be optimized
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
#order best selected features into a single variable
selector = SelectKBest(score_func=f1_scorer, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)
On the fit line I get a TypeError: __call__() missing 1 required positional argument: 'y_true'.
The problem is in the score_func which you have used for SelectKBest. score_func is a function which takes two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores, but in your code you have fed the callable f1_scorer as the score_func which just takes your y_true and y_pred and computes the f1 score. You can use one of chi2, f_classif or mutual_info_classif as your score_func for the classification task. Also, there is a minor bug in the parameter k for SelectKBest it should have been "all" instead of all. I have modified your code incorporating these changes,
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import f_classif
from sklearn.metrics import f1_score, make_scorer
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_classes=2,
n_informative=4, weights=[0.7, 0.3],
random_state=0)
f1_scorer = make_scorer(f1_score)
#Split data into training, CV and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state = 0)
#initialize tree and Select K-best features for classifier
kbest = SelectKBest(score_func=f_classif)
clf = DecisionTreeClassifier(random_state=0)
#create a pipeline for features to be optimized
pipeline = Pipeline([('kbest',kbest),('dt',clf)])
gs = GridSearchCV(pipeline,{'kbest__k': range(2,11), 'dt__max_depth':range(3,7)}, refit=True, cv=5, scoring = f1_scorer)
gs.fit(X_train,y_train)
gs.best_params_
OUTPUT
{'dt__max_depth': 6, 'kbest__k': 9}
Also modify your last two lines as below:
selector = SelectKBest(score_func=f_classif, k=gs.best_params_['kbest__k'])
X_new = selector.fit_transform(X_train,y_train)

Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error

I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.
GridSearchCV and RFE with "bare" classifier works fine:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
Putting classifier in a pipeline returns an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
EDIT:
I have realised that I was not clear describing the problem. This is the clearer snippet:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)
# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)
As you can see, the only difference is putting the estimator in a pipeline. Pipeline, however, hides "coef_" or "feature_importances_" attributes. The questions are:
Is there a nice way of dealing with this in scikit-learn?
If not, is this behaviour desired for any reason?
EDIT2:
Updated, working snippet based on the answer provided by #Chris
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class MyPipe(pipeline.Pipeline):
def fit(self, X, y=None, **fit_params):
"""Calls last elements .coef_ method.
Based on the sourcecode for decision_function(X).
Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
----------
"""
super(MyPipe, self).fit(X, y, **fit_params)
self.coef_ = self.steps[-1][-1].coef_
return self
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
You have an issue with your use of pipeline.
A pipeline works as below:
first object is applied to data when you call .fit(x,y) etc. If that method exposes a .transform() method, this is applied and this output is used as the input for the next stage.
A pipeline can have any valid model as a final object, but all previous ones MUST expose a .transform() method.
Just like a pipe - you feed in data and each object in the pipeline takes the previous output and does another transform on it.
As we can see,
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform
RFE exposes a transform method, and so should be included in the pipeline itself.
E.g.
some_sklearn_model=RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
pipe_params = [('std_scaler', std_scaler), ('RFE', rfe),('clf', est)]
Your attempt has a few issues.
Firstly, you are trying to scale a slice of your data. Imagine I had two partitions [1,1], [10,10]. If I normalize by the mean of the partition I lose the information that my second partition is significantly above the mean. You should scale at the start, not in the middle.
Secondly, SVR does not implement a transform method, you cannot incorporate it as a non final element in a pipeline.
RFE takes in a model which it fits to the data and then evaluates the weight of each feature.
EDIT:
You can include this behaviour if you wish, by wrapping the sklearn pipeline in your own class. What we want to do is when we fit the data, retrieve the last estimators .coef_ method and store that locally in our derived class under the correct name.
I suggest you look into the sourcecode on github as this is only a first start and more error checking etc would probably be required. Sklearn uses a function decorator called #if_delegate_has_method which would be a handy thing to add to ensure the method generalises. I have run this code to make sure it works runs, but nothing more.
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class myPipe(pipeline.Pipeline):
def fit(self, X,y):
"""Calls last elements .coef_ method.
Based on the sourcecode for decision_function(X).
Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
----------
"""
super(myPipe, self).fit(X,y)
self.coef_=self.steps[-1][-1].coef_
return
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]
pipe = myPipe(pipe_params)
selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)
print clf.best_params_
if anything is not clear, please ask.
I think you had a slightly different way of constructing the pipeline than what was listed in the pipeline documentation.
Are you looking for this?
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection',selector),('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print clf.grid_scores_
Also see this useful example for combining things in a pipeline. For the RFE object, I just used the official documentation for constructing it with your SVR estimator - I then just put the RFE object into the pipeline in the same way as you had done with the scaler and estimator objects.

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'accuracy')
grid_search_object.fit(X_train,Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.

Categories

Resources