Scaler fitted in a pipeline turns out to be not fitted yet

Please consider this code:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
# data
train_X = pd.DataFrame(data=np.random.rand(20, 3), columns=["a", "b", "c"])
train_y = pd.Series(data=np.random.randint(0,2, 20), name="y")
test_X = pd.DataFrame(data=np.random.rand(10, 3), columns=["a", "b", "c"])
test_y = pd.Series(data=np.random.randint(0,2, 10), name="y")
# scaler
scaler = StandardScaler()
# feature selection
p = Pipeline(steps=[("scaler0", scaler),
                    ("model", SVC(kernel="linear", C=1))])
rfe = RFE(p, n_features_to_select=2, step=1,
          importance_getter="named_steps.model.coef_")
rfe.fit(train_X, train_y)
# apply the scaler to the test set
scaled_test = scaler.transform(test_X)
I get this message:
NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Why is the scaler not fitted?

When you pass a pipeline or an estimator to RFE, it gets cloned internally: RFE fits the clone repeatedly while it eliminates features, so the original scaler object you created is never fitted itself.
To access the fitted clone you can use
fit_pipeline = rfe.estimator_
But note that this fitted pipeline was trained on, and expects, only the top n_features_to_select features.
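A minimal sketch of how to reuse the scaler that RFE actually fitted (assuming the rfe object from the question; rfe.transform first reduces the test set to the selected features, which is all the fitted scaler has ever seen):
fit_pipeline = rfe.estimator_                        # the clone RFE fitted on the selected features
fitted_scaler = fit_pipeline.named_steps["scaler0"]  # this one IS fitted
scaled_test = fitted_scaler.transform(rfe.transform(test_X))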

Related

Extract feature names after Pipeline usage with ColumnTransformer (sklearn)

I have the following toy code.
I use a pipeline to automatically normalize numerical variables and apply one-hot encoding to the categorical ones.
I can get the coefficients of the logistic regression model easily using pipe['logisticregression'].coef_, but how can I get all the feature names, in the same order as they appear in the coef_ matrix?
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# data from https://www.kaggle.com/datasets/uciml/adult-census-income
data = pd.read_csv("adult.csv")
data = data.iloc[0:3000,:]
target = "workclass"
y = data[target]
X = data.drop(columns=target)
numerical_columns_selector = make_column_selector(dtype_exclude=object)
categorical_columns_selector = make_column_selector(dtype_include=object)
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)
ct = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
                        ('std', StandardScaler(), numerical_columns)])
model = LogisticRegression(max_iter=500)
pipe = make_pipeline(ct, model)
data_train, data_test, target_train, target_test = train_test_split(
    X, y, random_state=42)
pipe.fit(data_train, target_train)
pipe['logisticregression'].coef_.shape
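A minimal sketch of one way to recover the names, assuming scikit-learn >= 1.0 (where ColumnTransformer exposes get_feature_names_out); the names come back in the transformer's output order, which is exactly the column order of the coef_ matrix:
# make_pipeline names steps after their lowercased class names
feature_names = pipe.named_steps['columntransformer'].get_feature_names_out()
assert len(feature_names) == pipe['logisticregression'].coef_.shape[1]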

Adding feature scaling to a nested cross-validation with randomized search and recursive feature elimination

I have a classification task and want to use a repeated nested cross-validation to simultaneously perform hyperparameter tuning and feature selection. For this, I am running RandomizedSearchCV on RFECV using Python's sklearn library, as suggested in this SO answer.
However, I additionally need to scale my features and impute some missing values first. Those two steps should also be included in the CV framework to avoid information leakage between training and test folds. I tried to create a Pipeline to get there, but I think it "destroys" my CV nesting (i.e., performs the RFECV and the random search separately from each other):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
# create example data with missings
Xtrain, ytrain = make_classification(n_samples = 500,
                                     n_features = 150,
                                     n_informative = 25,
                                     n_redundant = 125,
                                     random_state = 1897)
c = 10000 # number of missings
Xtrain.ravel()[np.random.choice(Xtrain.size, c, replace = False)] = np.nan # introduce random missings
folds = 5
repeats = 5
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897)
n_iter = 100
scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
sgdc = SGDClassifier(loss = 'log', penalty = 'elasticnet', class_weight = 'balanced', random_state = 1897)
sel = RFECV(sgdc, cv = folds)
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold, scoring = 'accuracy', random_state = 1897, verbose = 1, n_jobs = -1)
rskfold_search.fit(Xtrain, ytrain)
Does anyone know how to include scaling and imputation into the CV framework without losing the nesting of my RandomizedSearchCV and RFECV?
Any help is highly appreciated!
You haven't lost the nested CV.
You have a search object at the top level; when you call fit, it splits the data into multiple folds. Let's focus on one such training fold. Your pipeline gets fitted on it, so you scale and impute; then the RFECV splits that same fold into inner folds. Finally, a new estimator gets fitted on the outer training fold and scored on the outer testing fold.
That said, the RFECV itself does see a little leakage, since scaling and imputation happen before its inner splits. You can fix that by putting the scaler and imputer in a pipeline before the estimator, and using that pipeline as the RFECV's estimator. And since RFECV refits its estimator with the discovered optimal number of features and exposes it for predict and so on, you don't really need the second copy of sgdc; using just one copy has the side effect of hyperparameter-tuning the selection as well:
scl = StandardScaler()
imp = KNNImputer(n_neighbors=5, weights='uniform')
sgdc = SGDClassifier(loss='log', penalty='elasticnet', class_weight='balanced', random_state=1897)
base_pipe = Pipeline([
    ('scaler', scl),
    ('imputer', imp),
    ('clf', sgdc),
])
# a pipeline has no coef_ of its own, so point RFECV at the classifier's
sel = RFECV(base_pipe, cv=folds, importance_getter='named_steps.clf.coef_')
param_rand = {'estimator__clf__l1_ratio': stats.uniform(0, 1),
              'estimator__clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(sel, param_rand, n_iter=n_iter, cv=rskfold, scoring='accuracy', random_state=1897, verbose=1, n_jobs=-1)
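Fitting then works exactly as before; a short usage sketch (best_estimator_ is the refitted RFECV, and its n_features_ attribute reports how many features it kept):
rskfold_search.fit(Xtrain, ytrain)
print(rskfold_search.best_params_)
print(rskfold_search.best_estimator_.n_features_)  # features kept by the refitted RFECV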

How to test preprocessing combinations in nested pipeline using GridSearchCV?

I've been working on this classification problem using sklearn's Pipeline to combine the preprocessing step (scaling) and the cross validation step (GridSearchCV) using Logistic Regression.
Here is the simplified code:
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
# scaler and encoder options
scaler = StandardScaler() # there are 3 options that I want to try
encoder = OneHotEncoder() # only one option, no need to GridSearch it
# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers=[('categorical', encoder, cat_columns),
                                               ('numerical', scaler, num_columns)])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('log_reg', LogisticRegression())])
What I'm trying to do is try different scaling methods (e.g. standard scaling, robust scaling, etc.), and after trying all of them, pick the one that yields the best metric (i.e. accuracy). However, I don't know how to do this using GridSearchCV:
from sklearn.model_selection import GridSearchCV
# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)
# fit the data
grid_cv.fit(X_train, y_train)
I know that the code above won't work, particularly because of the scaler_options that I've set as param_grid. I realize that the scaler_options I set can't be processed by GridSearchCV. Why? Because the scaler isn't a hyperparameter of the pipeline (unlike 'log_reg__C', a hyperparameter from LogisticRegression() that can be accessed by GridSearchCV); instead it's a component of the ColumnTransformer, which I have nested inside the full_pipeline.
So the main question is, how do I automate GridSearchCV to test all of my scaler options? Since the scaler is a component of a sub-pipeline (i.e. the previous ColumnTransformer).
As you suggested, you could create a class that takes, as an __init__() parameter, the scaler you want to use.
Then you can specify in your grid-search parameters the scaler your class should be initialized with.
I wrote this; I hope it helps:
class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # fit must return self, per the scikit-learn transformer contract

    def transform(self, X, y=None):
        return self.scaler.transform(X)
Here is a full example that you can run to test it:
# import dependencies
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.base import BaseEstimator, TransformerMixin

class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # return self, per the scikit-learn transformer contract

    def transform(self, X, y=None):
        return self.scaler.transform(X)
data = load_breast_cancer()
features = data["data"]
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()
# scaler options (no encoder needed here: all features are numerical)
my_scaler = ScalerSelector()
preprocessor = ColumnTransformer(transformers=[('numerical', my_scaler, col_names)])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('log_reg', LogisticRegression())])
# set params combination I want to try
scaler_options = {'preprocessor__numerical__scaler':[StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options)
# fit the data
grid_cv.fit(data, target)
# best params:
grid_cv.best_params_
You can fulfill what you intend without creating a custom transformer, and you can even pass 'passthrough' in the param_grid to test the scenario where you do no scaling at all in that step.
In this example, suppose we want to investigate whether it is better for the model to impose a scaler on one branch of the preprocessor: the param_grid below toggles the 'ss' step of the categorical branch, and the numerical branch's 'ss' step (see the placeholder comment) can be searched in exactly the same way.
# imports for this snippet; train is assumed to be a DataFrame with a 'target' column
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))
cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler())
])
num_preprocessor = Pipeline(steps=[
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler())  # create a placeholder for your test here!
])
preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features)
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier())
])
X = train.drop('target', axis=1)
y = train['target']
param_grid = {
    # with_mean=False because the one-hot output is sparse and cannot be centered
    'prep__cat__ss': ['passthrough', StandardScaler(with_mean=False)]
}
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2
)
gs.fit(X, y)
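After fitting, the winning option shows up in best_params_; a value of 'passthrough' there means skipping the scaling step scored better under roc_auc:
print(gs.best_params_)  # e.g. {'prep__cat__ss': 'passthrough'}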

How to combine SGD-like SVM Kernel Approximation with Feature Selection on a Multiclass Dataset

I want to train on a relatively large dataset (200,000 rows and 400 columns) in a pipeline, and only a weak notebook is available for the task.
The dataset has 15 classes and mixed categorical and numerical features, and an SVM-like algorithm should be chosen.
I already tried to put some code together.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelBinarizer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.multiclass import OneVsRestClassifier
X, y = make_classification(n_samples=200000, n_features=130, n_informative=105,
                           n_redundant=25, n_classes=15, n_clusters_per_class=15)
# add some categorical columns
X[:, :2] = np.abs(X[:, :2]).astype(int)
X = pd.DataFrame(X, columns=[f'F{i}' for i in range(X.shape[1])])
cols = X.columns.tolist()
y = LabelBinarizer().fit_transform(y)
#%%Transformation
full_pipeline = ColumnTransformer([
    ('numerical', StandardScaler(), cols[2:]),
    ('categorical', OneHotEncoder(categories='auto'), cols[:2])
])
#Sparse matrix
X = full_pipeline.fit_transform(X)
#set start
rbf = RBFSampler(gamma=0.1, random_state=42)
semi_svm = SGDClassifier(loss="hinge", penalty="l2", max_iter=50)
clf_pipe = Pipeline([
    ('rbf', rbf),
    ('svm', semi_svm)
])
cv = StratifiedShuffleSplit(n_splits=5)
grid_search = RFECV(estimator=OneVsRestClassifier(clf_pipe), step=3, cv=cv,
                    scoring='accuracy', n_jobs=-1, verbose=10)
grid_search.fit(X, y)
ValueError: bad input shape (200000, 15)
How to handle the multiclass error in this case?
The following solution worked for me:
# new imports relative to the question's code
from sklearn.preprocessing import LabelEncoder
from sklearn.multiclass import OneVsOneClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
...
y = LabelEncoder().fit_transform(y)
...
rbf = RBFSampler(gamma=0.1, random_state=42)
semi_svm = OneVsOneClassifier(SGDClassifier(loss="hinge", penalty="l2", max_iter=5000))
selection = SelectKBest(k=1)
clf_pipe = Pipeline([
    ('rbf', rbf),
    ('features', selection),
    ('svm', semi_svm)
])
cv = StratifiedShuffleSplit(n_splits=5)
param_grid = dict(features__k=np.logspace(1, 6, num=5, base=2).round().astype(int),
                  rbf__gamma=[0.1, 1])
# note: plain 'f1' is binary-only; with 15 classes an averaged variant is needed
grid_search = GridSearchCV(estimator=clf_pipe, cv=cv, param_grid=param_grid,
                           scoring='f1_macro', n_jobs=-1, verbose=10)
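Fitting then mirrors the last line of the question's code:
grid_search.fit(X, y)
print(grid_search.best_params_)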

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
# sklearn.cross_validation and sklearn.grid_search were removed long ago;
# the modern equivalents live in sklearn.model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = StratifiedKFold(n_splits=10)  # labels are now passed at fit time
scaler = MinMaxScaler(feature_range=(0, 1))
logistic_fit = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
pipeline_object = Pipeline([('scaler', scaler), ('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01, 0.1, 1, 10],
                     'model__penalty': ['l1', 'l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters,
                                  cv=cross_validation_object, scoring='accuracy')
grid_search_object.fit(X_train, Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.
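A simplified sketch of what the pipeline does at predict time (not the actual scikit-learn implementation, just the idea): intermediate steps only call transform, and only the final step predicts:
best = grid_search_object.best_estimator_
Xt = X_test
for name, step in best.steps[:-1]:
    Xt = step.transform(Xt)  # transform only; nothing is refitted on X_test
predictions = best.steps[-1][1].predict(Xt)  # equivalent to best.predict(X_test)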
