Problem passing Pipeline through nested RFECV and GridSearchCV

I'm trying to perform feature selection and grid search in the inner loop of a nested CV in sklearn. I can pass a pipeline as the estimator to RFECV, but I get an error on fitting when I then pass the RFECV as the estimator to GridSearchCV.
I have found that renaming the model step in the pipeline to 'estimator' moves the error: the Pipeline now reports 'regressor' as an invalid parameter, whereas before the RFECV reported whatever name I had given the model step as the invalid parameter.
I have verified using rfcv.get_params().keys() and pipeline.get_params().keys() that the parameters I'm calling do exist.
I do not receive this error if I pass SGDRegressor() directly as the estimator and skip the pipeline entirely, but this model requires feature scaling and a log transform of the y variable.
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
import numpy as np

# random sample data
X = np.random.rand(100, 2)
y = np.random.rand(100)

# expose coef_ and feature_importances_ through the pipeline and TransformedTargetRegressor
class MyPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

# build pipeline
pipeline = MyPipeline([('scaler', MinMaxScaler()),
                       ('estimator', MyTransformedTargetRegressor(regressor=SGDRegressor(), func=np.log1p, inverse_func=np.expm1))])
# define tuning grid
parameters = {"estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}
# instantiate inner cv
inner_kv = KFold(n_splits=5, shuffle=True, random_state=42)
rfcv = RFECV(estimator=pipeline, step=1, cv=inner_kv, scoring="neg_mean_squared_error")
# note: iid was removed from GridSearchCV in scikit-learn 0.24; drop it on newer versions
cv = GridSearchCV(estimator=rfcv, param_grid=parameters, cv=inner_kv, iid=True,
                  scoring="neg_mean_squared_error", n_jobs=-1, verbose=True)
cv.fit(X, y)
I receive the following error, even though I can confirm that regressor is a parameter of the 'estimator' step in the pipeline:
ValueError: Invalid parameter regressor for estimator MyPipeline(memory=None,
steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('estimator',
MyTransformedTargetRegressor(check_inverse=True,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>,
regressor=SGDRegressor(alpha=0.0001,
average=False,
early_stopping=False,
epsilon=0.1,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.15,
learning_rate='invscaling',
loss='squared_loss',
max_iter=1000,
n_iter_no_change=5,
penalty='l2',
power_t=0.25,
random_state=None,
shuffle=True,
tol=0.001,
validation_fraction=0.1,
verbose=0,
warm_start=False),
transformer=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Thanks

The parameter names have to start with estimator__estimator__regressor, because the pipeline sits inside the RFECV: the first estimator is RFECV's own estimator parameter (the pipeline), and the second is the pipeline step you named 'estimator'.
Try this:
parameters = {"estimator__estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}
Note: nesting the CV like this is not the right approach. Maybe you can do the feature selection separately and then do the model training.
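A quick way to check the prefix yourself is to list the parameter names that the RFECV exposes. This is only a minimal sketch: it uses the plain Pipeline and TransformedTargetRegressor instead of the custom subclasses from the question (parameter naming is the same either way), and nothing is fitted.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same structure as in the question: scaler, then a step named 'estimator'
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("estimator", TransformedTargetRegressor(
        regressor=SGDRegressor(), func=np.log1p, inverse_func=np.expm1)),
])
rfecv = RFECV(estimator=pipeline)

# Every nesting level adds one '<name>__' prefix to the parameter key
alpha_keys = sorted(k for k in rfecv.get_params() if k.endswith("regressor__alpha"))
print(alpha_keys)
```

The only key that ends in regressor__alpha carries the estimator__ prefix twice, once for the RFECV argument and once for the pipeline step.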

Related

Pipeline with SMOTE and Imputer Errors

I am trying to create a pipeline that first imputes missing data, then oversamples with SMOTE, and then fits the model.
My code worked perfectly before I added SMOTE; now I can't find a solution.
Here is the code without SMOTE:
scoring = ['balanced_accuracy', 'f1_macro']
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
And here's the code after adding SMOTE.
Note: I tried importing make_pipeline from imblearn.
imputer = SimpleImputer(strategy='most_frequent')
pipeline = Pipeline(steps=[('i', imputer),('over', SMOTE()),('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
When I import Pipeline from sklearn, I get this error:
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
When I tried importing make_pipeline from imblearn, I got this error:
Last step of Pipeline should implement fit or be the string 'passthrough'. '[('i', SimpleImputer(strategy='most_frequent')), ('over', SMOTE()), ('m', RandomForestClassifier())]' (type <class 'list'>) doesn't

Use the imblearn Pipeline, which accepts samplers such as SMOTE as intermediate steps:
from imblearn.pipeline import Pipeline
pipeline = Pipeline([('i', imputer),('over', SMOTE()),('m', model)])

How to test preprocessing combinations in nested pipeline using GridSearchCV?

I've been working on this classification problem using sklearn's Pipeline to combine the preprocessing step (scaling) and the cross validation step (GridSearchCV) using Logistic Regression.
Here is the simplified code:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
# scaler and encoder options
scaler = StandardScaler()  # there are 3 options that I want to try
encoder = OneHotEncoder()  # only one option, no need to GridSearch it
# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers=[('categorical', encoder, cat_columns),
                                               ('numerical', scaler, num_columns)])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('log_reg', LogisticRegression())])
What I want to do is try different scaling methods (e.g. standard scaling, robust scaling, etc.) and, after trying all of them, pick the one that yields the best metric (i.e. accuracy). However, I don't know how to do this using GridSearchCV:
from sklearn.model_selection import GridSearchCV
# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)
# fit the data
grid_cv.fit(X_train, y_train)
I know that the code above won't work, particularly because of the scaler_options that I've set as param_grid. GridSearchCV can't process them because 'numerical' isn't a hyperparameter of the pipeline (unlike 'log_reg__C', a hyperparameter of LogisticRegression() that GridSearchCV can access). Instead, it's a component of the ColumnTransformer that I have nested inside full_pipeline.
So the main question is: how do I get GridSearchCV to test all of my scaler options, given that the scaler is a component of a sub-pipeline (i.e. the ColumnTransformer above)?

As you suggested, you could create a class that takes the scaler you want to use as an __init__() parameter.
Then, in your grid search parameters, you can specify which scaler the class should be initialized with.
I wrote this; I hope it helps:
class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # fit should return the estimator itself, not the wrapped scaler

    def transform(self, X, y=None):
        return self.scaler.transform(X)
Here is a full example that you can run to test:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import load_breast_cancer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # fit should return the estimator itself, not the wrapped scaler

    def transform(self, X, y=None):
        return self.scaler.transform(X)

data = load_breast_cancer()
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()
# wrap the scaler so that the grid search can swap it
my_scaler = ScalerSelector()
preprocessor = ColumnTransformer(transformers=[('numerical', my_scaler, col_names)])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('log_reg', LogisticRegression())])
# set params combination I want to try
scaler_options = {'preprocessor__numerical__scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options)
# fit the data
grid_cv.fit(data, target)
# best params:
grid_cv.best_params_

You can do what you intend without creating a custom transformer. You can even pass 'passthrough' as one of the options in param_grid to test the scenario where you don't scale at all in that step.
In this example, suppose we want to investigate whether the model does better when a scaler is applied to the numerical features, num_features.
# `train` is assumed to be a pandas DataFrame with a 'target' column
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))
cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler(with_mean=False))  # with_mean=False because the one-hot output is sparse
])
num_preprocessor = Pipeline(steps=[
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler())  # placeholder step that the grid search swaps out
])
preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features)
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier())
])
X = train.drop('target', axis=1)
y = train['target']
param_grid = {
    'prep__num__ss': ['passthrough', StandardScaler()]  # no scaling vs. standard scaling
}
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2
)
gs.fit(X, y)
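For anyone who wants to run the idea end to end, here is a self-contained sketch. It is not the code from the answer above: the breast cancer dataset, the step names ('prep', 'num', 'scaler', 'clf') and the LogisticRegression settings are my own choices for illustration. The grid replaces the whole 'scaler' step, including with 'passthrough'.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
num_columns = list(range(X.shape[1]))  # every column in this dataset is numeric

# 'scaler' is only a placeholder; the grid search swaps the whole step
num_preprocessor = Pipeline(steps=[("scaler", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[("num", num_preprocessor, num_columns)])
model = Pipeline(steps=[("prep", preprocessor),
                        ("clf", LogisticRegression(max_iter=1000))])

# A step name without a trailing '__param' replaces the step itself
param_grid = {
    "prep__num__scaler": ["passthrough", StandardScaler(), RobustScaler(), MinMaxScaler()],
}
gs = GridSearchCV(model, param_grid=param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

The 'passthrough' candidate may trigger a convergence warning from LogisticRegression because the features are unscaled; that is expected and is part of what the comparison shows.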

Combine GridSearchCV and StackingClassifier

I want to use StackingClassifier to combine some classifiers and then use GridSearchCV to optimize the parameters:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
sclf = StackingClassifier(estimators=[clf1, clf2],final_estimator=dt)
params = {'randomforestclassifier__n_estimators': [10, 50],
'logisticregression__C': [1,2,3]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(x, y)
But this raises an error:
'RandomForestClassifier' object has no attribute 'estimators_'
I have used n_estimators. Why does it complain that there is no estimators_?
Usually GridSearchCV is applied to a single model, so I just need to put the parameter names of that single model in a dict.
I referred to this page https://groups.google.com/d/topic/mlxtend/5GhZNwgmtSg but it uses parameters from an earlier version. Even after I change them to the new parameter names, it doesn't work.
By the way, where can I learn the details of the naming rules for these params?

First of all, estimators needs to be a list of tuples, each pairing a name you assign with the corresponding model.
estimators = [('model1', model()),    # model() named model1 by myself
              ('model2', model2())]   # model2() named model2 by myself
Next, you need to use the parameter names as they appear in sclf.get_params().
The prefix is the same name you gave to the specific model in the estimators list above. So, for the parameters of model1 you need:
params = {'model1__n_estimators': [5, 10]}  # model1__SOME_PARAM
Working toy example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
estimators = [('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
              ('logreg', LogisticRegression())]
sclf = StackingClassifier(estimators=estimators, final_estimator=DecisionTreeClassifier())
params = {'rf__n_estimators': [5, 10]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
grid.fit(X, y)
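On the naming-rule question, a small sketch that may help: print the keys of get_params() and filter by prefix. The names 'rf' and 'logreg' are the ones chosen in the toy example above; nothing needs to be fitted for this.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

sclf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("logreg", LogisticRegression())],
    final_estimator=DecisionTreeClassifier(),
)

# Keys follow '<name>__<param>': the name comes from the estimators list,
# or is 'final_estimator' for the meta-model
param_names = sorted(sclf.get_params().keys())
tunable = [name for name in param_names
           if name.startswith(("rf__", "logreg__", "final_estimator__"))]
print(tunable)
```

Any key in that list can be used directly in param_grid.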

After some trial and error, I think I have found a working solution.
The key to solving this problem is to use get_params() to find the parameter names of the StackingClassifier.
I create sclf in a different way:
clf1 = RandomForestClassifier()
clf2 = LogisticRegression()
dt = DecisionTreeClassifier()
estimators = [('rf', clf1),
('lr', clf2)]
sclf = StackingClassifier(estimators=estimators,final_estimator=dt)
params = {'rf__n_estimators': list(range(100,1000,100)),
'lr__C': list(range(1,10,1))}
grid = GridSearchCV(estimator=sclf, param_grid=params,verbose=2, cv=5,n_jobs=-1)
grid.fit(x, y)
This way, I can name each base classifier and then set the params using those names.

FeatureUnion: sklearn FeatureUnion does not allow fit params

FeatureUnion fails to fit. The last line of the following code, the fit() call, throws an error:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
svm = SVC(kernel="linear")
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
pipeline.fit(X, y, features__univ_select__k=2)
Error thrown:
TypeError: fit_transform() got an unexpected keyword argument 'univ_select__k'

The argument features__univ_select__k=2 is not a valid argument to fit; moreover, it is simply not necessary here.
If you are following the feature union example in the scikit-learn docs (as you seem to be), you should notice that there it is used as an argument in param_grid, and not in fit.
But here you don't perform any parameter grid search; and since you have already defined
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
there is no other "choice" for the number of features to be used, so you should not pass any features__univ_select__k=2 argument. Simply calling
pipeline.fit(X, y)
does the job:
Pipeline(memory=None,
steps=[('features', FeatureUnion(n_jobs=None,
transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('univ_select', SelectKBest(k=1, score_func=<function f_classif at 0x000000000808AD08>))],
tran...r', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
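If you do want k=2, or want to compare several values of k, the nested name belongs in set_params or in a param_grid rather than in fit. A minimal sketch of both, reusing the objects from the question (the choice of cv=5 and the candidate values [1, 2] are my own assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

combined_features = FeatureUnion([("pca", PCA(n_components=2)),
                                  ("univ_select", SelectKBest(k=1))])
pipeline = Pipeline([("features", combined_features),
                     ("svm", SVC(kernel="linear"))])

# Option 1: change the hyperparameter before fitting
pipeline.set_params(features__univ_select__k=2)
pipeline.fit(X, y)
# 2 PCA components plus 2 selected features
n_features = pipeline.named_steps["features"].transform(X).shape[1]

# Option 2: let a grid search try several values of k
grid = GridSearchCV(pipeline, param_grid={"features__univ_select__k": [1, 2]}, cv=5)
grid.fit(X, y)
print(n_features, grid.best_params_)
```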

Optimising a meta estimator

I'm trying to use the GridSearchCV functions of scikit-learn to find the best parameters of some base models, which I then feed into a stacking estimator.
My code is based on this post (which I'm using to illustrate): https://stats.stackexchange.com/questions/139042/ensemble-of-different-kinds-of-regressors-using-scikit-learn-or-any-other-pytho/274147
I'd like to perform a grid search over the parameters of my estimators (mostly the ridge parameter, the number of neighbours in KNN, and the RF depth and split), but I can't get it working. I define the model below:
from sklearn.base import TransformerMixin
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

class RidgeTransformer(Ridge, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)

class RandomForestTransformer(RandomForestRegressor, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)

class KNeighborsTransformer(KNeighborsRegressor, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)

def build_model():
    ridge_transformer = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('poly_feats', PolynomialFeatures()),
        ('ridge', RidgeTransformer())
    ])
    pred_union = FeatureUnion(
        transformer_list=[
            ('ridge', ridge_transformer),
            ('rand_forest', RandomForestTransformer()),
            ('knn', KNeighborsTransformer())
        ],
        n_jobs=2
    )
    model = Pipeline(steps=[
        ('pred_union', pred_union),
        ('lin_regr', LinearRegression())
    ])
    return model
Now, I'd like to run CV on the parameters of the forest. I can get the parameters with:
model = build_model()
print(model.get_params().keys())
But when I run the code below, I still get an error:
from sklearn.model_selection import GridSearchCV
pipe = Pipeline(steps=[('reg', model)])
parameters = {'pred_union__rand_forest__n_estimators': [20, 50, 100, 200]}
g_search = GridSearchCV(pipe, parameters)
X, y = make_regression(n_features=10, n_targets=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
g_search.fit(X_train, y_train)
Invalid parameter pred_union for estimator Pipeline(memory=None,
steps=[('reg', Pipeline(memory=None,
steps=[('pred_union', FeatureUnion(n_jobs=2,
transformer_list=[('ridge', Pipeline(memory=None,
steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('poly_feats', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('ridge', RidgeTransformer(...=None)), ('lin_regr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]))]). Check the list of available parameters with `estimator.get_params().keys()`.
What am I doing wrong?

Your model is already a pipeline, so why wrap it in another one? There is no need for pipe = Pipeline(steps=[('reg', model)]). Just pass model directly to the grid search.
But if you do want to keep the outer pipeline, you need to update the parameter names by prefixing each one with 'reg__':
parameters = {'reg__pred_union__rand_forest__n_estimators': [20, 50, 100, 200]}
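The prefix rule is easy to verify on a smaller stand-in. The sketch below is not the stacking model from the question; it assumes a simple scaler plus Ridge pipeline wrapped in an outer pipeline with a single step named 'reg', just to show how the wrapper changes the key.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

inner = Pipeline(steps=[("scaler", StandardScaler()), ("ridge", Ridge())])
outer = Pipeline(steps=[("reg", inner)])  # the extra wrapper, as in the question

# The short name exists only on the inner pipeline
inner_has_short_name = "ridge__alpha" in inner.get_params()
outer_has_short_name = "ridge__alpha" in outer.get_params()
outer_has_prefixed_name = "reg__ridge__alpha" in outer.get_params()
print(inner_has_short_name, outer_has_short_name, outer_has_prefixed_name)

# With the wrapper in place, the grid must use the prefixed name
g_search = GridSearchCV(outer, {"reg__ridge__alpha": [0.1, 1.0, 10.0]}, cv=3)
g_search.fit(X, y)
print(g_search.best_params_)
```

Passing inner to GridSearchCV with the key 'ridge__alpha' would work just as well, which is the first option suggested above.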
