Adding get_feature_names to ColumnTransformer pipeline - python

I'm trying to create an sklearn.compose.ColumnTransformer pipeline for transforming both categorical and continuous input data:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        'a': [1, 'a', 1, np.nan, 'b'],
        'b': [1, 2, 3, 4, 5],
        'c': list('abcde'),
        'd': list('aaabb'),
        'e': [0, 1, 1, 0, 1],
    }
)
for col in df.select_dtypes('object'):
    df[col] = df[col].astype(str)
categorical_columns = list('acd')
continuous_columns = list('be')
categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'

column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)
X = column_transformer.fit_transform(df)
I want to access the feature names created by this transformation pipeline, so I try this:
column_transformer.get_feature_names()
Which raises:
NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.
Since I'm not doing anything with columns b and e, I could technically just append them onto X after one-hot encoding all the other features, but is there some way I can use one of the scikit-learn base classes (e.g. TransformerMixin, BaseEstimator, or FunctionTransformer) to add to this pipeline so I can grab the continuous feature names in a very pipeline-friendly way?
Something like this, perhaps:
class PassthroughTransformer(FunctionTransformer, BaseEstimator):
    def fit(self):
        return self

    def transform(self, X):
        self.X = X
        return X

    def get_feature_names(self):
        return self.X.values.tolist()
continuous_transformer = PassthroughTransformer()
column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)
X = column_transformer.fit_transform(df)
But this raises this exception:
TypeError: Cannot clone object '<__main__.PassthroughTransformer object at 0x1132ddf60>' (type <class '__main__.PassthroughTransformer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.

There are multiple issues here:

1. The "Cannot clone object" error is due to parallel processing:

By default, scikit-learn clones the supplied transformers and estimators when working in a Pipeline and similar constructs (FeatureUnion, ColumnTransformer, etc.) or in cross-validation (cross_val_score, GridSearchCV, etc.).

Now, you have specified n_jobs=-1 in your ColumnTransformer, which introduces multiprocessing into the code. Python's built-in pickling doesn't work well with multiprocessing for classes defined in the main script or an interactive session, hence the error.
Options:
Set n_jobs=1 to avoid multiprocessing. You will still need to correct the code according to points 2 and 3.
If you want to use multiprocessing, then the simplest solution is to define the custom classes in a separate file (module) and import them into your main file. Something like this:
Make a new file in the same folder named custom_transformers.py with these contents:
from sklearn.base import TransformerMixin, BaseEstimator

# Changed the base classes here, see point 3
class PassthroughTransformer(BaseEstimator, TransformerMixin):

    # I corrected the fit() method here; it should take X, y as input
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        self.X = X
        return X

    # I have corrected the output here, see point 2
    def get_feature_names(self):
        return self.X.columns.tolist()
Now in your main file, do this:
from custom_transformers import PassthroughTransformer
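For completeness, here is a minimal sketch of how the imported class slots into the original pipeline (assuming the older scikit-learn API used in the question, where ColumnTransformer still exposes get_feature_names()):

# main.py -- the printed feature names are illustrative
from custom_transformers import PassthroughTransformer

continuous_transformer = PassthroughTransformer()
column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)
X = column_transformer.fit_transform(df)
# Names are prefixed with the transformer's name, e.g.
# ['categorical__x0_1', ..., 'continuous__b', 'continuous__e']
print(column_transformer.get_feature_names())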
For more information, see these questions:
Python multiprocessing pickling error
Python: Can't pickle type X, attribute lookup failed
Custom sklearn pipeline transformer giving "pickle.PicklingError" (I am suggesting this workaround)
2. You return self.X.values.tolist():

Here X is a pandas DataFrame, so X.values.tolist() will return the actual data of the specified columns, not the column names. So even if you solve the first error, you will get an error here. Correct this to:

return self.X.columns.tolist()
3. (Minor) Class inheritance:

You defined the PassthroughTransformer as:

PassthroughTransformer(FunctionTransformer, BaseEstimator)

FunctionTransformer already inherits from BaseEstimator, so I don't think there is a need to inherit from BaseEstimator again. You can change it in the following ways:

class PassthroughTransformer(FunctionTransformer):

OR

# Standard way
class PassthroughTransformer(BaseEstimator, TransformerMixin):
Hope this helps.
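As a side note: on newer scikit-learn versions (1.0 and later), ColumnTransformer provides get_feature_names_out(), which handles 'passthrough' natively, so the custom class is no longer needed there. A minimal sketch:

column_transformer = ColumnTransformer(
    [
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
        ('continuous', 'passthrough', continuous_columns),
    ]
)
column_transformer.fit(df)
# e.g. ['categorical__a_1', ..., 'continuous__b', 'continuous__e']
print(column_transformer.get_feature_names_out())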

Related

How to specify the parameter for FeatureUnion to let it pass to underlying transformer

In my code, I am trying to access the sample_weight of the StandardScaler. However, this StandardScaler is within a Pipeline, which again is within a FeatureUnion. I can't seem to get this parameter name correct: scaler_pipeline__scaler__sample_weight, which should be specified in the fit method of the preprocessor object.
I get the following error: KeyError: 'scaler_pipeline'
What should this parameter name be? Alternatively, if there is a generally better way to do this, feel free to propose it.
The code below is a standalone example.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
import pandas as pd

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

    def set_output(self, *, transform=None):
        return self
df = pd.DataFrame({'ds': [1, 2, 3, 4], 'y': [1, 2, 3, 4], 'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4], 'c': [1, 2, 3, 4]})
sample_weight = [0, 1, 1, 1]

scaler_pipeline = Pipeline(
    [
        (
            "selector",
            ColumnSelector(['a', 'b']),
        ),
        ("scaler", StandardScaler()),
    ]
)
remaining_pipeline = Pipeline([("selector", ColumnSelector(["ds", "y"]))])

# FeatureUnion fitting training data
preprocessor = FeatureUnion(
    transformer_list=[
        ("scaler_pipeline", scaler_pipeline),
        ("remaining_pipeline", remaining_pipeline),
    ]
).set_output(transform="pandas")

df_training_transformed = preprocessor.fit_transform(
    df, scaler_pipeline__scaler__sample_weight=sample_weight
)
fit_transform has no parameter called scaler_pipeline__scaler__sample_weight.
Instead, it is expecting to receive "parameters passed to the fit method of each step" as a dict of strings, "where each parameter name is prefixed such that parameter p for step s has key s__p".
So, in your example, it should be:
df_training_transformed = preprocessor.fit_transform(
    df, {"scaler_pipeline__scaler__sample_weight": sample_weight}
)
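For reference, this s__p prefixing is the same convention a plain Pipeline uses for its fit parameters, where they are passed as keyword arguments (a minimal sketch with illustrative data):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.arange(8.0).reshape(4, 2)
y = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
# parameter `sample_weight` of step "lr" gets the key "lr__sample_weight"
pipe.fit(X, y, lr__sample_weight=[0, 1, 1, 1])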

Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack correctly when putting together the resulting matrix.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                   ['b', 'How you been Tom', 'hot coffee', 2],
                   ['c', 'Hi you', 'I want some coffee', 3]],
                  columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('tf', tf_transformer)])

ohe_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)
I get AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I've seen this question and suspect CountVectorizer() is the culprit, but I'm not sure how to solve it (the previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer, but unfortunately it is not supported in my company.
Imo, the first thing to consider is that CountVectorizer() requires 1D input; your example is not working because the imputation is returning a 2D numpy array, which means you'll need a customized treatment to make it work.
Then you should also consider that when using a CountVectorizer() instance (which, again, requires 1D input) as a transformer in a ColumnTransformer(), this is how you should pass the transformer's columns:
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. [...]
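In other words (a sketch with an assumed text column named text_col), a scalar column name hands the transformer a 1D Series, while a one-element list hands it a 2D DataFrame:

ColumnTransformer([('tf', CountVectorizer(), 'text_col')])    # 1D Series -> works
ColumnTransformer([('tf', CountVectorizer(), ['text_col'])])  # 2D DataFrame -> fails at fit time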
This would be useful in interpreting the snippet I'll post as a possible solution.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable
from sklearn.base import BaseEstimator, TransformerMixin

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                   ['b', 'How you been Tom', 'hot coffee', 2],
                   ['c', 'Hi you', 'I want some coffee', 3]],
                  columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

class DimTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, *_):
        return self

    def transform(self, X, *_):
        return pd.DataFrame(X)

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('dt', DimTransformer()),
           ('ct', ColumnTransformer([
               ('tf1', tf_transformer, 0),
               ('tf2', tf_transformer, 1)
           ]))
           ])

ohe_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)
Namely, I'm adding a transformer that simply converts the array returned by the SimpleImputer instance into a DataFrame. Then, and most importantly, since it seems impossible to apply the vectorization to the 2D input that comes out of the previous two steps ('imputer' and 'dt'), I'm adding a further ColumnTransformer which splits the vectorization into two parallel steps (one vectorization per column). Notice that at this point columns are referenced positionally, as the column names have possibly changed. Of course, this is a custom solution, but at least it may provide some hints.
Given that you don't actually have missing values, you can see that it works by comparing it with the output from:

dt = DimTransformer().fit_transform(df)
ct = ColumnTransformer([
    ('tf1', tf_transformer, 1),
    ('tf2', tf_transformer, 2)
])
ct.fit_transform(dt)
print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)
and noticing that the columns from the fourth to the second-to-last of the previous output (namely, those affected by 'cat_tf') coincide with the ones just below.
Here are a couple of posts with a focus on the usage of CountVectorizer in a ColumnTransformer instance, though they did not consider imputing the dataset beforehand.
Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly
Vectorize only text column and standardize numeric column using pipeline
In CountVectorizer, pass lowercase=False.

Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I'm trying to do RFECV on the transformed data using SciKit.
For that, I create a pipeline and pass the pipeline to the RFECV. It works fine unless I have ColumnTransformer as a pipeline step. It gives me the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have checked the answers to this question, but I'm not sure they are applicable here. The code is as follows:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.DataFrame({'out': [i * 3 for i in range(100)]})

ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])

pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])

rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
)

# pipe.fit(X, y)  # pipe can fit, no problems
rfecv.fit(X, y)
Obviously, I can do this transformation step outside the pipeline and then use the transformed X, but I was wondering if there is any workaround.
I'd also like to raise this as an RFECV design issue (it converts X to a numpy array as the first thing it does, while other approaches with built-in cross-validation, e.g. GridSearchCV, do not do that).
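For completeness, a sketch of that outside-the-pipeline fallback (note that with the ColumnTransformer above, remainder columns are dropped by default):

X_transformed = ct.fit_transform(X)
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_transformed, y.values.ravel())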

Python Pipeline returns NaN Scores When Used in Cross Validation

I want to create a pipeline with sklearn including some preprocessing steps and a final step with a model to fit to the data. I use this pipeline to get scores by cross-validation. Later on I would like to use the pipeline in GridSearchCV for parameter optimization.
As of now, the preprocessing steps include:
One step in which some columns are dropped, using a ColumnsRemoval() class that I created.
One step which is specific to each feature type (categorical or numerical). To simplify, in the example below I have just included a StandardScaler() for numerical features and a OneHotEncoder() for categorical features.
The problem is that the scores I get are all NaN. It runs quite fast, and it seems as if empty arrays were being passed into the model:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate

# Create random dataframe
num_data = np.random.random_sample((5, 4))
cat_data = ['good', 'bad', 'fair', 'excellent', 'bad']
col_list_stack = ['SalePrice', 'Id', 'TotalBsmtSF', 'GrdLivArea']
data = pd.DataFrame(num_data, columns=col_list_stack)
data['Quality'] = cat_data

X_train = data.drop(labels=['SalePrice'], axis=1)
y_train = data['SalePrice']

#------------------------------------------------------------#
# create a custom transformer to remove columns
class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip=False, remove_cols=['Id', 'TotalBsmtSF']):
        self._remove_cols = remove_cols
        self._skip = skip

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if not self._skip:
            return X.drop(labels=self._remove_cols, axis=1)
        else:
            return X

#------------------------------------------------------------#
# PIPELINE and cross-validation
# Preprocessing steps common to numerical and categorical data
preprocessor_common = Pipeline(steps=[
    ('remove_features', ColumnsRemoval())])

# Separated preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor_by_cat = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['GrdLivArea']),
        ('cat', categorical_transformer, ['Quality'])], remainder='passthrough')

# Full pipeline with model
pipe = Pipeline(steps=[('preprocessor_common', preprocessor_common),
                       ('preprocessor_by_cat', preprocessor_by_cat),
                       ('model', LinearRegression())])

# Use cross validation to obtain scores
scores = cross_validate(pipe, X_train, y_train,
                        scoring=["neg_mean_squared_error", "r2"], cv=4)
I have tried the following:
Using only one of the preprocessing steps plus the model in the pipeline. When I use the preprocessor_by_cat + model steps in the pipeline, I get score values; using the preprocessor_common + model steps gives NaN scores as well.
Performing both preprocessing steps in a pipeline (preprocessor_common + preprocessor_by_cat), calling .fit_transform() on the training data, and then sending it to cross_validate(), roughly as below:
pipe = Pipeline(steps=[('preprocessor_common', preprocessor_common),
                       ('preprocessor_by_cat', preprocessor_by_cat),
                       ])
X_processed = pipe.fit_transform(X_train)

# Use cross validation to obtain scores
scores = cross_validate(LinearRegression(), X_processed, y_train,
                        scoring=["neg_mean_squared_error", "r2"], cv=4)
From my understanding, doing the preprocessing in a separate pipeline and doing the preprocessing + model in a single pipeline should be equivalent, which is why I believe getting NaN values is a problem.
I hope the problem is clear, congratulations if you made it this far :)
TL;DR
You need to redefine the __init__() method of your custom ColumnsRemoval, as passing a Python list as a default value will result in an error. One possible solution:
class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip=False, remove_cols=None):
        if remove_cols is None:
            remove_cols = ['Id', 'TotalBsmtSF']
        self._remove_cols = remove_cols
        self._skip = skip

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if not self._skip:
            return X.drop(labels=self._remove_cols, axis=1)
        else:
            return X
With this, your pipeline should work as expected.
Background
I ran your MWE and got the following error:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
It was related to the following line of your custom ColumnsRemoval:
return X.drop(labels=self._remove_cols, axis=1)
which threw the error:
ValueError: Need to specify at least one of 'labels', 'index' or 'columns'
This seems to be a known issue when passing a standard Python list to the drop() function, and it is discussed in this post. The solution is to instead pass e.g. a numpy array or a pandas index object. Another solution, which I proposed, is to not set a default for remove_cols in the function definition but to assign it in the function body. This works as well.
It does not look like anybody really knows why this is happening. Sorry that I cannot elaborate more on the actual reason (more than happy if anybody can add). But the problem should be resolved.
I found where the problem was. I have been doing some further tests, also using a float instead of a list as the default value.
As detailed here, under the Instantiation section:
    the object's attributes used in __init__() should have exactly the name of the argument in the constructor.
So what I did was to use the same object attribute names as the parameter names passed in __init__(), and now everything works well. For example:
class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.9):
        self.threshold = threshold
Using self._threshold (note the _ before threshold) had strange behavior: in some cases the object was used with the provided value (or the default one), but in other cases self._threshold was set to None. This also allows using a list as a default value passed through __init__() (although using a list as a default should be avoided; see afsharov's answer for details).
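Putting both answers together, here is a sketch of the original transformer following that naming rule (so that the get_params()/clone() machinery used by cross_validate can re-create the object faithfully) while also avoiding the mutable default argument:

class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip=False, remove_cols=None):
        # attribute names match the __init__ argument names exactly
        self.skip = skip
        self.remove_cols = remove_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.skip:
            return X
        # resolve the default here instead of in the signature
        cols = self.remove_cols if self.remove_cols is not None else ['Id', 'TotalBsmtSF']
        return X.drop(labels=cols, axis=1)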

Using scikit StandardScaler in Pipeline on a subset of Pandas dataframe columns

I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
But now assume I have column 'C' in df of type string and the following pipeline definition
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)
How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
You could check out sklearn-pandas, which offers an integration of pandas DataFrames and sklearn, e.g. with the DataFrameMapper:

mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
If you don't need external dependencies, you could use a simple transformer of your own, as I answered here:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
In direct sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:
pipeline = Pipeline([
    ('scale_sum', feature_union(...))
])
where within the feature union, one function will apply the standard scaler to some of the columns, and the other will pass the other columns untouched.
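A concrete sketch of that idea, using the question's columns 'A', 'B' (to scale) and 'C' (to pass through); the selector functions are illustrative:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer

select_ab = FunctionTransformer(lambda df: df[['A', 'B']], validate=False)
select_c = FunctionTransformer(lambda df: df[['C']], validate=False)

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        ('scaled', Pipeline([('select', select_ab), ('scaler', StandardScaler())])),
        ('passthrough', select_c),  # leaves column 'C' untouched
    ])),
])
df_scaled = pipeline.fit_transform(df)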
Using Ibex (which I co-wrote exactly to make sklearn and pandas work better), you could write it as follows:
from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans
pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>
