Using scikit StandardScaler in Pipeline on a subset of Pandas dataframe columns - python

I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
But now assume I have a column 'C' in df of type string, and the following pipeline definition:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)
How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler

You could check out sklearn-pandas, which offers an integration between pandas DataFrames and sklearn, e.g. with the DataFrameMapper:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
If you don't need external dependencies, you could use a simple transformer of your own, as I answered here:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
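For instance, with list_of_columnnames = ['A', 'B'], the pipe can be fit directly on the df from the question; the string column 'C' is simply dropped by the Columns step:
df_scaled = pipe.fit_transform(df)  # only 'A' and 'B' are selected and scaled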

In direct sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:
pipeline = Pipeline([
    ('scale_sum', FeatureUnion(...))
])
where, within the feature union, one transformer applies the standard scaler to some of the columns and the other passes the remaining columns through untouched.
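A minimal sketch of that idea, reusing the column names 'A', 'B' and 'C' from the question (the lambda-based selectors are just illustrative placeholders):
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

scale_cols = ['A', 'B']   # columns to scale
other_cols = ['C']        # columns to pass through untouched

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        # select A and B, then scale them
        ('scaled', Pipeline([
            ('select', FunctionTransformer(lambda X: X[scale_cols], validate=False)),
            ('standard', StandardScaler()),
        ])),
        # pass C through unchanged
        ('rest', FunctionTransformer(lambda X: X[other_cols], validate=False)),
    ])),
])

df = pd.DataFrame({'A': [1., 2., 3.], 'B': [10., 20., 30.], 'C': ['x', 'y', 'z']})
df_scaled = pipeline.fit_transform(df)  # numpy array: scaled A and B, followed by C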
Using Ibex (which I co-wrote exactly to make sklearn and pandas work better), you could write it as follows:
from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans
pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>

Related

How to specify the parameter for FeatureUnion to let it pass to underlying transformer

In my code, I am trying to pass sample_weight to the StandardScaler. However, this StandardScaler is within a Pipeline, which in turn is within a FeatureUnion. I can't seem to get the parameter name right: scaler_pipeline__scaler__sample_weight, which should be specified in the fit method of the preprocessor object.
I get the following error: KeyError: 'scaler_pipeline
What should this parameter name be? Alternatively, if there is a generally better way to do this, feel free to propose it.
The code below is a standalone example.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
import pandas as pd

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

    def set_output(self, *, transform=None):
        return self

df = pd.DataFrame({'ds': [1, 2, 3, 4], 'y': [1, 2, 3, 4], 'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4], 'c': [1, 2, 3, 4]})
sample_weight = [0, 1, 1, 1]

scaler_pipeline = Pipeline(
    [
        (
            "selector",
            ColumnSelector(['a', 'b']),
        ),
        ("scaler", StandardScaler()),
    ]
)

remaining_pipeline = Pipeline([("selector", ColumnSelector(["ds", "y"]))])

# FeatureUnion fitting the training data
preprocessor = FeatureUnion(
    transformer_list=[
        ("scaler_pipeline", scaler_pipeline),
        ("remaining_pipeline", remaining_pipeline),
    ]
).set_output(transform="pandas")

df_training_transformed = preprocessor.fit_transform(
    df, scaler_pipeline__scaler__sample_weight=sample_weight
)
fit_transform has no parameter called scaler_pipeline__scaler__sample_weight.
Instead, it is expecting to receive "parameters passed to the fit method of each step" as a dict of string, "where each parameter name is prefixed such that parameter p for step s has key s__p".
So, in your example, it should be:
df_training_transformed = preprocessor.fit_transform(
    df, {"scaler_pipeline__scaler__sample_weight": sample_weight}
)

How can I exclude specific columns from a ColumnTransformer/Pipeline in sklearn?

I'm trying to build a data preprocessing pipeline with sklearn's Pipeline and ColumnTransformer. The preprocessing steps consist of imputing values and applying transformations (power transform, scaling, or OHE) to specific columns in parallel. This preprocessing ColumnTransformer works perfectly.
However, after doing some analysis on the preprocessed result, I decided to exclude some columns from the final result. My goal is to have one pipeline, starting from the original dataframe, that imputes and transforms values, excludes pre-selected columns, and triggers the model fitting, all in one. So to be clear, I don't want to drop columns after the pipeline is fitted/transformed. I want the dropping of columns to be part of the column transformation instead.
It's easy to remove the numerical columns from the model (by simply not adding them), but how can I exclude the columns created by OHE? I don't want to exclude all columns created by OHE, just some of them. For example, if categorical column "Example" becomes Example_1, Example_2, and Example_3, how can I exclude only Example_2?
Example code:
### Importing libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col': [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col': range(10, 100, 10)})

### Pipelines
SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])

SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)
preprocessor_transformer

### Preprocessing dummy dataframe
df_foo = pd.DataFrame(preprocessor_transformer.fit_transform(df_foo),
                      columns=preprocessor_transformer.get_feature_names_out()
                      )
print(df_foo)
Finally, I've seen this solution out there (Adding Dropping Column instance into a Pipeline), but I didn't manage to make the custom columnDropperTransformer work in my case. Adding the columnDropperTransformer to my pipeline returns a ValueError: A given column is not a column of the dataframe, referring to the column "Example" no longer existing in the dataframe.
from sklearn.pipeline import make_pipeline

class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)

    def fit(self, X, y=None):
        return self

processor = make_pipeline(preprocessor_transformer, columnDropperTransformer([]))
processor.fit_transform(df_foo)
Any suggestions?
In my experience, and as of today, automating these kinds of treatments in sklearn is not that easy, for the following reasons:
the steps performed by the Pipeline (when calling .fit_transform() on it) make you lose the DataFrame structure (the pandas DataFrame becomes a numpy array). I'd suggest reading "How to be sure that sklearn pipeline applies fit_transform method when using feature selection and ML model in pipeline?" and "How to use ColumnTransformer() to return a dataframe?" for some details on what happens when calling .fit() or .fit_transform() on a Pipeline instance.
In turn, "columns" in a numpy array can't be referenced via their names anymore, but only positionally.
Here are a couple of solutions which are not scalable in my opinion, but which can work for your case:
You can add a step in your pipeline which is only intended to transform the numpy array (the standard output of the intermediate transformations performed by the pipeline) back into a pandas DataFrame (via the ColumnExtractor transformer). Once you have a DataFrame, you can exploit the columnDropperTransformer of the referenced link to drop the column Example_B via its name.
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)

    def fit(self, *_):
        return self

class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)

    def fit(self, X, y=None):
        return self
The ColumnExtractor transformer is only intended to map the resulting array into a DataFrame. The clear disadvantage is that you'll need to manually specify the columns you would like your DataFrame to be made of.
from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col': [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col': range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])

SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

processor = make_pipeline(
    preprocessor_transformer,
    ColumnExtractor(['Num_col', 'Example_A', 'Example_B', 'Example_C']))

f_processor = make_pipeline(
    processor,
    columnDropperTransformer('Example_B'))

f_processor.fit_transform(df_foo)
You can reference the "columns" of the array which comes out of the Pipeline transformations positionally (i.e. by index). For instance, you might define a dummy transformer like this:
class NumpyColumnSelector():
    def __init__(self):
        pass

    def transform(self, X, y=None):
        return X[:, [0, 1, 3]]

    def fit(self, X, y=None):
        return self
which retains all columns except the one corresponding to Example_B.
from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col': [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col': range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])

SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

f_processor = make_pipeline(preprocessor_transformer, NumpyColumnSelector())
f_processor.fit_transform(df_foo)
For the very specific use case of removing dummy columns generated by the OHE (+1 to amiola for a more generic answer), you can specify categories and handle_unknown='ignore'. In your example, replacing the OHE line with this:
('OHE', OneHotEncoder(sparse=False, categories=[['A', 'C']], handle_unknown='ignore')),
produces this:
Num_col Example_A Example_C
0 0.000000 1.0 0.0
1 0.125000 0.0 0.0
2 0.482143 0.0 1.0
3 0.375000 1.0 0.0
4 0.500000 0.0 0.0
5 0.625000 1.0 0.0
6 0.750000 1.0 0.0
7 0.482143 0.0 1.0
8 1.000000 0.0 1.0

Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I'm trying to do RFECV on transformed data using scikit-learn.
For that, I create a pipeline and pass the pipeline to RFECV. It works fine unless I have a ColumnTransformer as a pipeline step. It gives me the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have checked the answers to this question, but I'm not sure if they are applicable here. The code is as follows:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.DataFrame({'out': [i * 3 for i in range(100)]})

ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])

pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])

rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
)

# pipe.fit(X, y)  # pipe can fit, no problems
rfecv.fit(X, y)
Obviously, I can do this transformation step outside the pipeline and then use the transformed X, but I was wondering if there is any workaround.
I'd also like to raise this as an RFECV design issue (it converts X to a numpy array as the first thing it does, while other approaches with built-in cross-validation, e.g. GridSearchCV, do not do that).
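For reference, a minimal sketch of the workaround mentioned above, i.e. fitting the ColumnTransformer outside the pipeline and running RFECV on the result. The remainder='passthrough' is an addition here so that col2 survives the transformation, and note that fitting the transformer on the full data before cross-validation can leak information across folds:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.DataFrame({'out': [i * 3 for i in range(100)]})

# remainder='passthrough' keeps col2 alongside the normalized col1
ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])],
                       remainder='passthrough')
X_transformed = ct.fit_transform(X)  # plain numpy array, no column names

rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_transformed, y.values.ravel())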

Adding get_feature_names to ColumnTransformer pipeline

I'm trying to create an sklearn.compose.ColumnTransformer pipeline for transforming both categorical and continuous input data:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        'a': [1, 'a', 1, np.nan, 'b'],
        'b': [1, 2, 3, 4, 5],
        'c': list('abcde'),
        'd': list('aaabb'),
        'e': [0, 1, 1, 0, 1],
    }
)

for col in df.select_dtypes('object'):
    df[col] = df[col].astype(str)

categorical_columns = list('acd')
continuous_columns = list('be')

categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'

column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)

X = column_transformer.fit_transform(df)
I want to access the feature names created by this transformation pipeline, so I try this:
column_transformer.get_feature_names()
Which raises:
NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.
Since I'm not doing anything with columns b and e, I could technically just append them onto X after one-hot encoding all the other features, but is there some way I can use one of the scikit-learn base classes (e.g. TransformerMixin, BaseEstimator, or FunctionTransformer) to add to this pipeline, so I can grab the continuous feature names in a pipeline-friendly way?
Something like this, perhaps:
class PassthroughTransformer(FunctionTransformer, BaseEstimator):
    def fit(self):
        return self

    def transform(self, X):
        self.X = X
        return X

    def get_feature_names(self):
        return self.X.values.tolist()
continuous_transformer = PassthroughTransformer()

column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)

X = column_transformer.fit_transform(df)
But this raises this exception:
TypeError: Cannot clone object '<__main__.PassthroughTransformer object at 0x1132ddf60>' (type <class '__main__.PassthroughTransformer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
There are multiple issues here:
The "Cannot clone object" error is due to parallel processing:
By default, scikit-learn clones the supplied transformers and estimators (which uses pickling) when working in a Pipeline and similar constructs (FeatureUnion, ColumnTransformer, etc.) or in cross-validation (cross_val_score, GridSearchCV, etc.).
Now, you have specified n_jobs=-1 in your ColumnTransformer, which introduces multiprocessing into the code. Python's built-in pickling doesn't work well with multiprocessing here, hence the error.
Options:
Set n_jobs=1 to not use multiprocessing. You still need to correct the code according to points 2 and 3.
If you want to use multiprocessing, then the simplest solution is to define the custom classes in a separate file (module) and import them into your main file. Something like this:
Make a new file in the same folder named custom_transformers.py with the following contents:
from sklearn.base import TransformerMixin, BaseEstimator

# Changed the base classes here, see Point 3
class PassthroughTransformer(BaseEstimator, TransformerMixin):

    # I corrected the `fit()` method here, it should take X, y as input
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        self.X = X
        return X

    # I have corrected the output here, see Point 2
    def get_feature_names(self):
        return self.X.columns.tolist()
Now in your main file, do this:
from custom_transformers import PassthroughTransformer
For more information, see these questions:
Python multiprocessing pickling error
Python: Can't pickle type X, attribute lookup failed
Custom sklearn pipeline transformer giving "pickle.PicklingError" (I am suggesting this workaround)
You return self.X.values.tolist():
Here X is a pandas DataFrame, so X.values.tolist() will return the actual data of the columns you specify, not the column names. So even if you solve the first error, you will get an error here. Correct this to:
return self.X.columns.tolist()
(Minor) Class inheritance:
You defined the PassthroughTransformer as:
PassthroughTransformer(FunctionTransformer, BaseEstimator)
FunctionTransformer already inherits from BaseEstimator, so I don't think there is a need to inherit from BaseEstimator again. You can change it in one of the following ways:
class PassthroughTransformer(FunctionTransformer):
OR
# Standard way
class PassthroughTransformer(BaseEstimator, TransformerMixin):
Hope this helps.
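For illustration, here is a rough sketch of the corrected transformer plugged back into the ColumnTransformer from the question (df, categorical_columns and continuous_columns as defined there). This assumes a scikit-learn version where get_feature_names, rather than get_feature_names_out, is available, and leaves n_jobs at its default to sidestep the pickling issue from point 1:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from custom_transformers import PassthroughTransformer

column_transformer = ColumnTransformer(
    [
        ('categorical', OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_columns),
        ('continuous', PassthroughTransformer(), continuous_columns),
    ],
    sparse_threshold=0.,
)

X = column_transformer.fit_transform(df)
# now also includes the passthrough column names ('b', 'e'), prefixed with the transformer name
print(column_transformer.get_feature_names())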

How can I use a custom feature selection function in scikit-learn's `pipeline`

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by using the pipeline class.
For example, if I want to experiment with PCA vs LDA I could do something like:
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.lda import LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
])

# Constructing the k-fold cross-validation iterator (k=10)
cv = KFold(n=X_train.shape[0],  # total number of samples
           n_folds=10,          # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
    for clf in [clf_all, clf_pca, clf_lda]
]
But now, let's say that -- based on some "domain knowledge" -- I have the hypothesis that the features 3 & 4 might be "good features" (the third and fourth column of the array X_train) and I want to compare them with the other approaches.
How would I include such a manual feature selection in the pipeline?
For example
def select_3_and_4(X_train):
    return X_train[:, 2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),
    ('classification', GaussianNB())
])
would obviously not work.
So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns those two columns of the numpy array? Or is there a better way?
I just want to post my solution for completeness, and maybe it is useful to others:
class ColumnExtractor(object):
    def transform(self, X):
        cols = X[:, 2:4]  # columns 3 and 4 are "extracted"
        return cols

    def fit(self, X, y=None):
        return self
Then, it can be used in the Pipeline like so:
clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),
    ('classification', GaussianNB())
])
EDIT: General solution
And for a more general solution, if you want to select and stack multiple columns, you can use the following class:
import numpy as np

class ColumnExtractor(object):
    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1, 3))),  # selects the second and fourth column
    ('classification', GaussianNB())
])
Adding to Sebastian Raschka's and eickenberg's answers, the requirements a transformer object should meet are specified in scikit-learn's documentation.
There are several more requirements than just having fit and transform if you want the estimator to be usable in parameter estimation, such as implementing set_params.
If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is
select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda *args: select_3_and_4  # fit may be called with (X, y); just return the function object itself
and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.
Data-driven feature selection tools are maybe off-topic, but always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression, with e.g. k=2 in your case.
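For instance, a minimal sketch of that idea, mirroring the pipelines from the question:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', SelectKBest(score_func=f_classif, k=2)),  # keep the 2 best-scoring features
    ('classification', GaussianNB())
])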
I didn't find the accepted answer very clear, so here is my solution for others.
Basically, the idea is to make a new class based on BaseEstimator and TransformerMixin.
The following is a feature selector based on the percentage of NAs within a column. The perc value is the NA threshold, expressed as a fraction.
from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):
    """Extract columns with less than x percentage NA to impute further
    down the line.
    Class to use in the pipeline
    -----
    attributes
    fit : identify columns - in the training set
    transform : only use those columns
    """
    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        # keep only the column names whose NA fraction is below the threshold
        self.columns_with_less_than_x_na_id = X.columns[
            (X.isna().sum() / X.shape[0]) < self.perc
        ].tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    def get_params(self, deep=False):
        return {"perc": self.perc}
You can use the following custom transformer to select the columns specified:
# Custom transformer that extracts columns passed as an argument to its constructor
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class constructor
    def __init__(self, feature_names):
        self._feature_names = feature_names

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Method that describes what we need this transformer to do
    def transform(self, X, y=None):
        return X[self._feature_names]
Here feature_names is the list of features you want to select.
For more details, you can refer to this link
[1]: https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
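A short usage sketch (the column names are placeholders; the selector has to come first in the pipeline, while X is still a pandas DataFrame, since it indexes columns by name):
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline(steps=[
    ('feature_select', FeatureSelector(['feat_3', 'feat_4'])),
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
])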
Another way is to simply use the ColumnTransformer with an «empty» FunctionTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# a FunctionTransformer with func=None yields the identity function / passthrough
empty_func = make_pipeline(FunctionTransformer(func=None))

clf_all = make_pipeline(StandardScaler(),
                        ColumnTransformer([("select", empty_func, [2, 3])]),  # third and fourth column
                        GaussianNB(),
                        )
This works because the ColumnTransformer by default drops the remainder of columns that aren't selected.
