Optimising a meta estimator - python

I'm trying to use the GridSearchCV functions of scikit-learn to find the best parameters of some base models, which I then feed into a stacking estimator.
My code is based on this post (which I'm using to illustrate): https://stats.stackexchange.com/questions/139042/ensemble-of-different-kinds-of-regressors-using-scikit-learn-or-any-other-pytho/274147
I'd like to perform a grid search over the parameters of my estimators (mostly the ridge parameter, the number of neighbours in KNN, and the RF depth and split), but I can't get it working. I define the model below:
from sklearn.base import TransformerMixin
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
# Each wrapper exposes a regressor's predictions as transformer output,
# so the base models can be combined in a FeatureUnion.
class RidgeTransformer(Ridge, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)
class RandomForestTransformer(RandomForestRegressor, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)
class KNeighborsTransformer(KNeighborsRegressor, TransformerMixin):
    def transform(self, X, *_):
        return self.predict(X)
def build_model():
    ridge_transformer = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('poly_feats', PolynomialFeatures()),
        ('ridge', RidgeTransformer())
    ])
    pred_union = FeatureUnion(
        transformer_list=[
            ('ridge', ridge_transformer),
            ('rand_forest', RandomForestTransformer()),
            ('knn', KNeighborsTransformer())
        ],
        n_jobs=2
    )
    model = Pipeline(steps=[
        ('pred_union', pred_union),
        ('lin_regr', LinearRegression())
    ])
    return model
Now, I'd like to run CV on the parameters of the forest. I can get the parameters with:
model = build_model()
print(model.get_params().keys())
But when I run the code below, I still get an error:
pipe = Pipeline(steps=[('reg', model)])
parameters = {'pred_union__rand_forest__n_estimators':[20, 50, 100, 200]}
g_search = GridSearchCV(pipe, parameters)
X, y = make_regression(n_features=10, n_targets=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
g_search.fit(X_train, y_train)
Invalid parameter pred_union for estimator Pipeline(memory=None,
steps=[('reg', Pipeline(memory=None,
steps=[('pred_union', FeatureUnion(n_jobs=2,
transformer_list=[('ridge', Pipeline(memory=None,
steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('poly_feats', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('ridge', RidgeTransformer(...=None)), ('lin_regr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]))]). Check the list of available parameters with `estimator.get_params().keys()`.
What am I doing wrong?

Your model is already a pipeline, so there is no need to wrap it in another one with pipe = Pipeline(steps=[('reg', model)]). Just pass model to the grid search directly.
If you do want to keep the outer pipeline, you need to prefix each parameter name with the name of the wrapping step, 'reg':
parameters = {'reg__pred_union__rand_forest__n_estimators':[20, 50, 100, 200]}
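The effect of the extra wrapper on parameter names can be checked without fitting anything. The following is a minimal sketch, assuming a small hypothetical inner pipeline (a scaler plus Ridge) in place of the stacked model from the question; only the step name 'reg' is taken from the original code.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical inner model standing in for the stacked model in the question.
inner = Pipeline(steps=[("scaler", StandardScaler()), ("ridge", Ridge())])

# Wrapping it again adds one more name level ("reg") to every nested parameter.
outer = Pipeline(steps=[("reg", inner)])

inner_keys = set(inner.get_params().keys())
outer_keys = set(outer.get_params().keys())

print("ridge__alpha" in inner_keys)       # valid on the inner pipeline
print("reg__ridge__alpha" in outer_keys)  # valid on the wrapped pipeline
print("ridge__alpha" in outer_keys)       # not valid once wrapped

# set_params accepts exactly the names that get_params reports.
outer.set_params(reg__ridge__alpha=10.0)
```

Whatever object is handed to GridSearchCV, the keys of param_grid have to come from that object's own get_params(), which is why the unprefixed name fails on the wrapped pipeline.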

Related

How to tune a custom classifier in sklearn with additional required parameters in fit and predict methods?

For the purposes of this question, I have created a dummy dataset below:
import numpy as np
import pandas as pd
cities = ['Berlin', 'Frankfurt', 'Hamburg',
          'Nuremberg', 'Munich', 'Stuttgart',
          'Hanover', 'Saarbruecken', 'Cologne',
          'Constance', 'Freiburg', 'Karlsruhe']
n = len(cities)
data = pd.DataFrame({
    'City': cities,
    'Temperature': np.random.normal(24, 3, n),
    'Humidity': np.random.normal(78, 2.5, n),
    'Wind': np.random.normal(15, 4, n),
    'Target': np.random.randint(2, size=n)
})
I have a (simplified) custom classifier below which maps the text feature to a continuous value before combining it with the remaining features for classification:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
class CustomClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, n_estimators=100):
        # store the constructor argument rather than a hard-coded 100
        self.n_estimators = n_estimators
        self.NLP = Pipeline(
            [
                ('preprocessor', TfidfVectorizer()),
                ('regressor', LogisticRegression())
            ]
        )
        self.Classifier = GradientBoostingClassifier(n_estimators=self.n_estimators)
    def fit(self, X, y, text_data, **kwargs):
        # map the text feature to a probability, then prepend it to X
        self.NLP.fit(text_data, y)
        text_feature = self.NLP.predict_proba(text_data)
        new_X = np.concatenate(
            (text_feature[:, 1, np.newaxis], X),
            axis=1
        )
        self.Classifier.fit(new_X, y)
        return self
    def predict(self, X, text_data):
        text_feature = self.NLP.predict_proba(text_data)
        new_X = np.concatenate(
            (text_feature[:, 1, np.newaxis], X),
            axis=1
        )
        y_pred = self.Classifier.predict(new_X)
        return y_pred
I can run fit and predict without issue, but if I try to tune the pipeline using cross validation as follows:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import RandomizedSearchCV
custom_model = CustomClassifier()
pipe = Pipeline([
    ('scaling', MaxAbsScaler()),
    ('classifier', custom_model)
])
params = {'classifier__n_estimators': [100, 200]}
tuner = RandomizedSearchCV(
    pipe,
    param_distributions=params,
    cv=3,
    n_iter=2
)
tuner.fit(X=data.drop(['City', 'Target'], axis=1), y=data.loc[:,'Target'], classifier__text_data=data.loc[:,'City'])
I receive the following error:
predict() missing 1 required positional argument: 'text_data'
Which also prevents me from using something like cross_val_score.
Any ideas on how I can approach tuning my custom pipeline?
This might be a limitation of the RandomizedSearchCV class (which does not accept additional parameters in predict), so I tried to write a custom class with a custom predict method:
class RandomizedSearchCVPlus(RandomizedSearchCV):
    def predict(self, X, **predict_params):
        return self.best_estimator_.predict(X, **predict_params)
But this did not solve the issue.

Using StandardScaler as Preprocessor in Mlens Pipeline generates Classification Warning

I am trying to scale my data within the cross-validation folds of an mlens SuperLearner pipeline. When I use StandardScaler in the pipeline (as demonstrated below), I receive the following warning:
/miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details:
ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets")
(name, inst_name, exc), MetricWarning)
Of note, when I omit the StandardScaler() the warning disappears, but the data is not scaled.
from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()
X = breast_cancer_data['data']
y = breast_cancer_data['target']
from sklearn.model_selection import train_test_split
X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=0)
from sklearn.base import BaseEstimator
class RFBasedFeatureSelector(BaseEstimator):
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None
    def fit(self, X, y):
        clf = RandomForestClassifier(n_estimators=self.n_estimators, random_state=RANDOM_STATE, class_weight='balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold=0.001)
        return self  # scikit-learn convention: fit returns the estimator
    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. You cannot call transform before first calling fit or fit_transform.')
        return self.selector.transform(X)
    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
N_FOLDS = 5
RF_ESTIMATORS = 1000
N_ESTIMATORS = 1000
RANDOM_STATE = 42
from mlens.metrics import make_scorer
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
accuracy_scorer = make_scorer(balanced_accuracy_score, average='micro', greater_is_better=True)
from mlens.ensemble.super_learner import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE, n_jobs=10, scorer=balanced_accuracy_score, backend="multiprocessing")
preprocessing1 = {'pipeline-1': [StandardScaler()]}
preprocessing2 = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS)]}
estimators = {'pipeline-1': [
    RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE, class_weight='balanced'),
    MLPClassifier(hidden_layer_sizes=(10, 10, 10), activation='relu', solver='sgd', max_iter=5000)
]}
ensemble.add(estimators, preprocessing2, preprocessing1)
ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight = 'balanced'))
ensemble.fit(X,y)
yhat = ensemble.predict(X_val)
balanced_accuracy_score(y_val, yhat)
You are currently passing your preprocessing steps as two separate arguments when calling the add method.
You can instead combine them as follows:
preprocessing = {'pipeline-1': [RFBasedFeatureSelector(N_ESTIMATORS),StandardScaler()]}
Please refer to the documentation for the add method found here:
https://mlens.readthedocs.io/en/0.1.x/source/mlens.ensemble.super_learner/

How to test preprocessing combinations in nested pipeline using GridSearchCV?

I've been working on this classification problem using sklearn's Pipeline to combine the preprocessing step (scaling) and the cross validation step (GridSearchCV) using Logistic Regression.
Here is the simplified code:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
# scaler and encoder options
scaler = StandardScaler() # there are 3 options that I want to try
encoder = OneHotEncoder() # only one option, no need to GridSearch it
# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('categorical', encoder, cat_columns),
    ('numerical', scaler, num_columns),
])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('log_reg', LogisticRegression())
])
What I want to do is try different scaling methods (e.g. standard scaling, robust scaling, etc.) and then pick the one that yields the best metric (i.e. accuracy). However, I don't know how to do this using GridSearchCV:
from sklearn.model_selection import GridSearchCV
# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)
# fit the data
grid_cv.fit(X_train, y_train)
I know that the code above won't work, particularly because of the scaler_options that I've set as param_grid. GridSearchCV can't process it because 'numerical' isn't a hyperparameter of the pipeline (unlike 'log_reg__C', a hyperparameter of LogisticRegression() that GridSearchCV can access). Instead, it's a component of the ColumnTransformer, which I have nested inside the full_pipeline.
So the main question is: how do I get GridSearchCV to test all of my scaler options, given that the scaler is a component of a nested step (i.e. the ColumnTransformer above)?
As you suggested, you could create a class that takes the scaler you want to use as a parameter of its __init__() method.
Then, in your grid search parameters, you can specify which scaler the class should be initialized with.
I wrote the following; I hope it helps:
class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler
    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # return the wrapper itself, not the inner scaler
    def transform(self, X, y=None):
        return self.scaler.transform(X)
Here is a full example that you can run to test it:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import load_breast_cancer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler
    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # return the wrapper itself, not the inner scaler
    def transform(self, X, y=None):
        return self.scaler.transform(X)
data = load_breast_cancer()
features = data["data"]
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()
# scaler wrapped in the selector class
my_scaler = ScalerSelector()
preprocessor = ColumnTransformer(transformers=[
    ('numerical', my_scaler, col_names)
])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('log_reg', LogisticRegression())
])
# set params combination I want to try
scaler_options = {'preprocessor__numerical__scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options)
# fit the data
grid_cv.fit(data, target)
# best params:
grid_cv.best_params_
You can fulfill what you intend without creating a custom transformer. And you can even pass the 'passthrough' argument into param_grid to experiment with the scenario where you don't want to do any scaling in that step at all.
In this example, suppose we want to investigate whether it is better for the model to impose a Scaler transformer on numerical features, num_features.
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
# `train` is assumed to be a pandas DataFrame with a 'target' column
cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))
cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler())
])
num_preprocessor = Pipeline(steps=[
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler())  # Create a place holder for your test here !!!
])
preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features)
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier())
])
X = train.drop('target', axis=1)
y = train['target']
param_grid = {
    # each entry replaces the whole 'ss' step; 'passthrough' skips it
    'prep__cat__ss': ['passthrough', StandardScaler(with_mean=False)]
}
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2
)
gs.fit(X, y)
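The same step-replacement idea also seems to apply directly to the ColumnTransformer from the question, since a named transformer can be swapped through the parameter name 'preprocessor__numerical'. The following is a minimal sketch under some assumptions: the data is synthetic (make_classification), every column is treated as numerical, and max_iter=1000 is an arbitrary choice to avoid convergence warnings.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data (an assumption); every column is treated as numerical here.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
num_columns = list(range(X.shape[1]))

preprocessor = ColumnTransformer(
    transformers=[("numerical", StandardScaler(), num_columns)]
)
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# Each candidate replaces the whole 'numerical' transformer;
# 'passthrough' leaves the columns unscaled.
param_grid = {
    "preprocessor__numerical": [
        StandardScaler(), RobustScaler(), MinMaxScaler(), "passthrough",
    ]
}
grid_cv = GridSearchCV(full_pipeline, param_grid=param_grid, cv=3)
grid_cv.fit(X, y)
print(grid_cv.best_params_)
```

The scaler instances placed in the grid are cloned for each fit, so it is safe to list them as unfitted objects.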

Problem passing Pipeline through nested RFECV and GridSearchCV

I'm trying to perform feature selection and grid search for the inner loop of a nested CV in sklearn. While I can pass a pipeline as the estimator to the RFECV, I then receive an error on fitting when I pass the RFECV as the estimator to GridSearchCV.
I have found that changing the name of the model in the pipeline to 'estimator' moves the error to the Pipeline with 'regression an invalid parameter' rather than in the RFECV where whatever I named the model was an invalid parameter.
I have verified using rfcv.get_params().keys() and pipeline.get_params().keys() that the parameters I'm calling do exist.
I do not receive this error if I name SGDRegressor() directly as ‘estimator’ and ignore the pipeline entirely, but this model requires feature scaling and log transforming of the Y variable.
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
import numpy as np
# random sample data
X = np.random.rand(100,2)
y = np.random.rand(100)
# passing coef and importance through pipe and TransformedTargetRegressor
class MyPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_
    @property
    def coef_(self):
        return self.regressor_.coef_
# build pipeline
pipeline = MyPipeline([
    ('scaler', MinMaxScaler()),
    ('estimator', MyTransformedTargetRegressor(regressor=SGDRegressor(), func=np.log1p, inverse_func=np.expm1))
])
# define tuning grid
parameters = {"estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}
# instantiate inner cv
inner_kv = KFold(n_splits=5, shuffle=True, random_state=42)
rfcv = RFECV(estimator=pipeline, step=1, cv=inner_kv, scoring="neg_mean_squared_error")
cv = GridSearchCV(estimator=rfcv, param_grid=parameters, cv=inner_kv, iid=True,
                  scoring="neg_mean_squared_error", n_jobs=-1, verbose=True)
cv.fit(X,y)
I receive the following error and can confirm that regressor is a parameter for estimator pipeline:
ValueError: Invalid parameter regressor for estimator MyPipeline(memory=None,
steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('estimator',
MyTransformedTargetRegressor(check_inverse=True,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>,
regressor=SGDRegressor(alpha=0.0001,
average=False,
early_stopping=False,
epsilon=0.1,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.15,
learning_rate='invscaling',
loss='squared_loss',
max_iter=1000,
n_iter_no_change=5,
penalty='l2',
power_t=0.25,
random_state=None,
shuffle=True,
tol=0.001,
validation_fraction=0.1,
verbose=0,
warm_start=False),
transformer=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Thanks
The parameter names have to start with estimator__estimator__regressor, because the pipeline is nested inside the RFECV: the first estimator is RFECV's own estimator argument, and the second is the pipeline step named 'estimator'.
Try this:
parameters = {"estimator__estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}
Note: nesting the cross-validation like this is not the right approach. Maybe you can do the feature selection separately and then train the model.
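The prefix rule from this answer can be checked by listing the parameter names without fitting anything. The following is a minimal sketch, assuming the plain Pipeline and TransformedTargetRegressor classes in place of the question's subclasses (the property overrides do not affect parameter names).

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("estimator", TransformedTargetRegressor(regressor=SGDRegressor())),
])
rfcv = RFECV(estimator=pipeline)

keys = rfcv.get_params().keys()

# One 'estimator__' level reaches the pipeline, the next reaches its step.
print("estimator__regressor__alpha" in keys)
print("estimator__estimator__regressor__alpha" in keys)
```

The first check prints False and the second True, which matches the error message in the question: with a single prefix, 'regressor' is looked up on the pipeline itself rather than on its 'estimator' step.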

How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross-validation to measure the accuracy. However, most of the existing tutorials use only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct procedure to perform SMOTE using cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single iteration.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)
I am happy to provide more details if needed.
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    # oversample the training fold only; fit_sample is called fit_resample in newer imbalanced-learn releases
    X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the scores to a list defined outside.
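Collecting the per-fold scores in a list could look like the sketch below. To keep it runnable with scikit-learn alone, plain random oversampling via sklearn.utils.resample stands in for SMOTE here; that swap, the synthetic data, the LogisticRegression model and the helper name oversample_minority are all assumptions for illustration. The point it shows is the same: only the training fold is resampled, and the test fold keeps its original class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (an assumption): roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)


def oversample_minority(X_train, y_train, random_state=0):
    """Hypothetical stand-in for SMOTE: duplicate random minority rows
    (label 1 assumed to be the minority) until both classes are equal."""
    minority = y_train == 1
    n_extra = int((~minority).sum() - minority.sum())
    X_extra, y_extra = resample(
        X_train[minority], y_train[minority],
        replace=True, n_samples=n_extra, random_state=random_state,
    )
    return np.vstack([X_train, X_extra]), np.concatenate([y_train, y_extra])


scores = []  # defined outside the loop so every fold can append to it
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Resample the training fold only; the test fold stays untouched.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean f-score over {len(scores)} folds: {np.mean(scores):.3f}")
```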
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_sample(X_train, y_train)
    ...
I think you can also solve this with a pipeline from the imbalanced-learn library.
I saw this solution in a blog called Machine Learning Mastery https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
The idea is to use a pipeline from imblearn to do the cross-validation. Please, let me know if that works. The example below is with a decision tree, but the logic is the same.
#decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores)
