How can I standardize only numeric variables in an sklearn pipeline? - python

I am trying to create an sklearn pipeline with 2 steps:
1. Standardize the data
2. Fit the data using KNN
However, my data has both numeric and categorical variables, which I have converted to dummies using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. I have been doing this like this:
X = dataframe containing both numeric and categorical columns
numeric = [list of numeric column names]
categorical = [list of categorical column names]
scaler = StandardScaler()
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)
However, if I were to create a pipeline like:
pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())
It would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

UPD: 2021-05-10
For sklearn >= 0.20 we can use sklearn.compose.ColumnTransformer
Here is a small example:
imports and data loading
# Author: Pedro Morales <part.morales#gmail.com>
#
# License: BSD 3 clause
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
np.random.seed(0)
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
pipeline-aware data preprocessing using ColumnTransformer:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
classification
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
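For the exact setup in the question (dummies already created with pd.get_dummies, then KNN), a minimal sketch could look like the following. The toy data and the column names height, weight and colour are hypothetical; remainder="passthrough" is what leaves the dummy columns unscaled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "colour": ["red", "blue", "red", "blue", "red", "blue"],
})
y = [0, 0, 0, 1, 1, 1]

# Dummy-encode up front, as in the question.
X = pd.get_dummies(raw, columns=["colour"], dtype=float)
numeric = ["height", "weight"]

# Scale only the numeric columns; all other columns (the dummies)
# are passed through unchanged and appended on the right.
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)
pipe = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=3))
pipe.fit(X, y)
predictions = pipe.predict(X)

# Inspect the preprocessed matrix on its own.
X_std = preprocess.fit_transform(X)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of each split when the pipeline is used with cross-validation or GridSearchCV.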
OLD Answer:
Assuming you have the following DF:
In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333
In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object
you can find all numeric columns:
In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')
In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
and apply StandardScaler only to those numeric columns:
In [168]: scaler = StandardScaler()
In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745
now you can "one hot encode" categorical (non-numeric) columns...
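The last step (one-hot encoding the non-numeric columns after scaling the numeric ones) could be sketched as below. It rebuilds the same small DataFrame; using pd.get_dummies and pd.concat here is an assumption about how one might finish the job, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Split the column names by dtype.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Standardize the numeric columns, keeping the original index.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)

# One-hot encode the remaining columns and join everything back together.
dummies = pd.get_dummies(df[cat_cols], dtype=int)
df_encoded = pd.concat([scaled, dummies], axis=1)
```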

I would use FeatureUnion. I then usually do something like that, assuming you dummy-encode your categorical variables also within the pipeline instead of before with Pandas:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]
numeric = [list of numeric column names]
categorical = [list of categorical column names]
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric), StandardScaler())),
        # On scikit-learn >= 1.2, use sparse_output=False instead of sparse=False.
        ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse=False)))
    ])),
    ('model', KNeighborsClassifier())
])
You could further check out Sklearn Pandas, which is also interesting.

Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need OneHotEncoder. Your pipeline can then be:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
# cat_indices and num_indices are the positional indices of the dummy and
# numeric columns; the input is expected to be a NumPy array (e.g. df.values).
knn = KNeighborsClassifier()
pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        # categorical (dummies): selected and left unchanged
        ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
        # numeric: selected, then standardized
        ('numeric', Pipeline(steps=[
            ('select', FunctionTransformer(lambda data: data[:, num_indices])),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', knn)
])
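A self-contained sketch of the same FunctionTransformer idea, with the index lists filled in, might look like this. The array, the index lists and the helper select_columns are hypothetical; a named function with kw_args is used instead of lambdas only so the transformers can be pickled.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical data: columns 0-1 are numeric, columns 2-3 are dummies.
X = np.array([
    [1.0, 10.0, 1.0, 0.0],
    [2.0, 20.0, 0.0, 1.0],
    [3.0, 30.0, 1.0, 0.0],
    [4.0, 40.0, 0.0, 1.0],
    [5.0, 50.0, 1.0, 0.0],
    [6.0, 60.0, 0.0, 1.0],
])
y = np.array([0, 0, 0, 1, 1, 1])
num_indices = [0, 1]
cat_indices = [2, 3]


def select_columns(data, indices):
    """Return the requested columns of a 2-D array."""
    return data[:, indices]


features = FeatureUnion([
    # Dummies are selected and left as they are.
    ("categorical", FunctionTransformer(select_columns, kw_args={"indices": cat_indices})),
    # Numeric columns are selected and then standardized.
    ("numeric", Pipeline([
        ("select", FunctionTransformer(select_columns, kw_args={"indices": num_indices})),
        ("scale", StandardScaler()),
    ])),
])
pipeline = Pipeline([("features", features), ("clf", KNeighborsClassifier(n_neighbors=3))])
pipeline.fit(X, y)

# Output order follows the FeatureUnion: dummies first, scaled numerics second.
combined = features.fit_transform(X)
```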

Making the answer of MaxU - stop WAR against UA more general for any columns:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# Preprocessing Step
numeric_features = Xtrain.select_dtypes(include=['int64','float64']).columns
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = Xtrain.select_dtypes(exclude=['int64','float64']).columns
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# Training Step
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])
clf.fit(Xtrain, Ytrain)
print("model score: %.3f" % clf.score(Xtest, Ytest))
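As a variation on the dtype-based selection above, scikit-learn (0.22 and later) also ships make_column_selector, which defers the column lookup until fit time. The sketch below uses a hypothetical three-column frame and omits the imputers for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with two numeric columns and one categorical column.
X = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "fare": [7.25, 71.3, 8.05, 53.1],
    "sex": ["male", "female", "female", "male"],
})

preprocessor = ColumnTransformer(transformers=[
    # Every numeric column is standardized.
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    # Every other column is one-hot encoded.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])
Xt = preprocessor.fit_transform(X)
```

The selectors are evaluated on whatever frame is passed to fit, so the same preprocessor can be reused on frames with different column sets.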

Another way of doing this would be
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame()
df['col1'] = np.random.randint(1,20,10)
df['col2'] = np.random.randn(10)
df['col3'] = list(5*'Y' + 5*'N')
numeric_cols = list(df.dtypes[df.dtypes != 'object'].index)
# Assigning whole columns avoids dtype warnings when an int column becomes float.
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

Related

Getting "valueError: could not convert string to float: ..." for sklearn pipeline

I'm a beginner trying to learn sklearn pipelines. I get ValueError: could not convert string to float when I run the code below. I'm not sure what the reason is, since OneHotEncoder shouldn't have any problem converting strings to floats for categorical variables.
import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c!='income']
X = df[x_cols]
y = df['income']
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
preprocessor = ColumnTransformer(
transformers=[
('imputer', SimpleImputer(strategy='most_frequent'),['workclass','education','native-country']),
('onehot', OneHotEncoder(), ['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex','native-country'])
]
)
clf = Pipeline([('preprocessor', preprocessor),
('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)
Unfortunately, there is an issue with scikit-learn's SimpleImputer when it tries to impute string variables. Here is an open issue about it on their GitHub page.
To get around this, I'd recommend splitting your pipeline into two steps: 1) the replacement of null values and 2) the rest, something like this:
cols_with_null = ['workclass','education','native-country']
preprocessor = ColumnTransformer(
transformers=[
(
'imputer',
SimpleImputer(missing_values=np.nan, strategy='most_frequent'),
cols_with_null),
])
preprocessor.fit(X_train)
X_train_new = preprocessor.transform(X_train)
for icol, col in enumerate(cols_with_null):
    X_train.loc[:, col] = X_train_new[:, icol]
# confirm no null values in these columns:
for col in cols_with_null:
    print('{}, null values: {}'.format(col, pd.isnull(X_train[col]).sum()))
Now that you have X_train with no null values, the rest should work without SimpleImputer:
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(), ['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex','native-country'])])
clf = Pipeline([('preprocessor', preprocessor),
('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)
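A possible alternative, sketched here under the assumption of a recent scikit-learn version where SimpleImputer(strategy='most_frequent') accepts string columns: ColumnTransformer applies its transformers side by side to the original columns and concatenates the results, so in the question's setup the imputed string columns appear to reach the classifier unencoded. Chaining the imputer and the encoder in a nested Pipeline keeps both steps inside the model. The tiny frame below is hypothetical and only reuses two column names from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one missing categorical value.
X = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private", "Private", "State-gov"],
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
}, dtype=object)
y = np.array([0, 1, 1, 0, 1, 0])

categorical = ["workclass", "sex"]

# Impute first, then encode: the steps run in sequence on the same columns.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    transformers=[("cat", categorical_transformer, categorical)])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
clf.fit(X, y)
Xt = preprocessor.transform(X)
```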

Don't know why this error is coming?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("car-sales-extended.csv")
df.head()
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
np.random.seed(42)
df.dropna(subset=["Price"], inplace=True)
categorical_features=["Make", "Colour"]
categorical_transformer=Pipeline(steps=[
("imputer",SimpleImputer(strategy="constant", fill_value="missing")),
("Onehot", OneHotEncoder(handle_unknown="ignore"))])
door_feature=["Doors"]
door_transformer=Pipeline(steps=[
("imputer",SimpleImputer(strategy="constant", fill_value=4)),
])
numeric_features=["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="mean")),
("scaler", StandardScaler())])
# Setup preprocessing steps (fill the missing values, then convert to numbers)
preprocessor = ColumnTransformer(
transformers=[(
"cat", categorical_transformer, categorical_features),
("door", door_transformer, door_feature),
("num", numeric_features, numeric_transformer)])
#creating a preprocessing and modelling pipeline
model=Pipeline(steps=[("preprocessing", preprocessor),
("model", RandomForestClassifier())])
# Split data
x=df.drop("Price", axis=1)
y=df["Price"]
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2)
# Fit and score the model
model.fit(x_train, y_train)
model.score(x_test, y_test)
I don't know why this TypeError is occurring:
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. '['Odometer (KM)']' (type <class 'list'>) doesn't.
You swapped the transformer and the column list in the last tuple of:
preprocessor = ColumnTransformer(
transformers=[(
"cat", categorical_transformer, categorical_features),
("door", door_transformer, door_feature),
("num", numeric_features, numeric_transformer)])
It should be:
preprocessor = ColumnTransformer(
transformers=[(
"cat", categorical_transformer, categorical_features),
("door", door_transformer, door_feature),
("num", numeric_transformer, numeric_features)])
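To make the tuple order concrete, here is a small sketch with a hypothetical one-column frame: each entry is (name, transformer, columns), and swapping the last two makes fit fail because a list of column names is not an estimator.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column frame reusing the column name from the question.
df = pd.DataFrame({"Odometer (KM)": [10000.0, 20000.0, 30000.0]})

# Wrong order: (name, columns, transformer). Fitting is expected to fail.
swapped_error = None
try:
    ColumnTransformer(
        transformers=[("num", ["Odometer (KM)"], StandardScaler())]
    ).fit(df)
except (TypeError, ValueError, AttributeError) as exc:
    swapped_error = exc

# Right order: (name, transformer, columns).
correct = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Odometer (KM)"])]
)
scaled = correct.fit_transform(df)
```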

could not convert string to float: 'N' while implementing StandardScaler

Code:
num_features = [feature for feature in x.columns if x[feature].dtypes != 'O']
x[num_features].replace('N',value=0)
from sklearn.preprocessing import StandardScaler
stds = StandardScaler()
x[num_features]= stds.fit(x)
Error:
ValueError: could not convert string to float: 'N'
To select numeric columns:
numeric_features = df.select_dtypes(include=['number'])
Your current method will not exclude Booleans (True and False), category values, or dates and times.
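A quick sketch of how select_dtypes treats the different column types; the frame and its column names are made up for illustration. pandas accepts 'number' (or np.number) as the selector for all numeric dtypes.

```python
import pandas as pd

# Hypothetical frame with one column per kind of dtype.
df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [1.5, 2.5, 3.5],
    "flag": [True, False, True],
    "grade": pd.Categorical(["a", "b", "a"]),
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "label": ["N", "Y", "N"],
})

# Only int and float columns are selected; bool, category, datetime
# and string columns are left out.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Categorical columns can be picked out the same way.
category_cols = df.select_dtypes(include=["category"]).columns.tolist()
```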
Just having numeric values is not enough. We may also have missing values. We need to deal with them before normalisation:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# numeric transformer
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
# the pipeline imputes missing values with each column's median before scaling.
normalise_numeric_features = numeric_transformer.fit_transform(numeric_features)
This might not be enough, as you might have different data types in your dataset each needing their own transformer. You can do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
# select feature types (ColumnTransformer expects column names, hence .columns)
numeric_features = df.select_dtypes(include=['number']).columns
categorical_features = df.select_dtypes(include=['category']).columns
# create transformer for each feature types
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# create a process
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# if you have a model you can add it to a pipeline too
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])
# now we can train the model
clf.fit(df[train_features], df[train_target])
# fit takes the features, runs the preprocessing (impute, normalise, etc., i.e. fit_transform) and trains your model.
clf.predict(df[test_features])
# predict only transforms your features, using the statistics learned from train_features.
Hope this helps get the most of scikit-learn.

Using Pipeline to Avoid Data Leakage both for X and y

I followed the example at this link very closely: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
but used a different dataset (found here: https://archive.ics.uci.edu/ml/datasets/Adult)
Unlike the example dataset, my dataset contains a couple of nulls in the y (aka 'target') column as well. I am wondering how I would build a pipeline similar to the one in the sklearn documentation that both transforms columns in X and y rather than just in X. This is what I have tried but of course it doesn't work because the 'target' feature has been dropped from X in order to allow for train_test_split:
import os
import pandas as pd
import numpy as np
os.listdir()
df_train = pd.read_csv('data_train.txt', header=None)
df_test = pd.read_csv('data_test.txt', header=None)
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-sign', 'capital-loss',
'hours-per-week', 'native-country', 'target']
df = pd.concat([df_train, df_test], axis=0)
df.head()
df.columns = names
Data Leakage Example - Preprocessing is done on the entire dataset
df_leakage = df.copy()
#Exploring nulls:
df_leakage[df_leakage.isnull().any(axis=1)]
There are only two nulls so it would be safe to discard them but for the purposes of this demo we will impute values
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Defining numeric features and the transformations we will apply to them
numeric_features = ['capital-sign', 'capital-loss', 'hours-per-week', 'education-num']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())])
#Defining categorical features and the transformations we will apply to them
categorical_features = ['marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
#Defining categorical ordinal features and the transformations we will apply to them
#Note: we already have a education-continuous feature available but we will demo how to apply an ordinal transformation:
ordinal_features = ['education']
ordinal_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('ordinal', OrdinalEncoder())])
#Defining categorical ordinal features and the transformations we will apply to them
#Note: we already have a education-continuous feature available but we will demo how to apply an ordinal transformation:
ordinal_features = ['education']
label_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('label', LabelEncoder())])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
('ord', ordinal_transformer, ordinal_features),
('lab', label_transformer, label_features),
])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])
X = df_leakage.drop('target', axis=1)
y = df_leakage['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Error log (partial):
658 columns = list(key)
659
--> 660 return [all_columns.index(col) for col in columns]
661
662 elif hasattr(key, 'dtype') and np.issubdtype(key.dtype, np.bool_):
ValueError: 'y' is not in list
Obviously, I could easily apply the transformation to y separately, before any of these other transformations take place. I think that would be fine for handling character string errors, e.g. ".<=50K" vs "<=50K". But if I want to impute the value of y with its mean, the value imputed would depend on the specific choice of y_train, thus causing some data leakage.
How would I be able to use the pipeline library to do this effectively?
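One way to keep target imputation leak-free, sketched under assumptions: scikit-learn pipelines pass y through untouched, so the target has to be handled outside the pipeline, with the imputer fitted on y_train only and then applied to y_test. The arrays below are hypothetical, and many practitioners would simply drop rows with a missing target instead of imputing them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical features and a string target with two missing entries.
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array(["<=50K", ">50K", np.nan, ">50K", "<=50K", "<=50K", np.nan, "<=50K"],
             dtype=object)

# Split first, so no statistic is computed on the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Learn the fill value from the training target only...
target_imputer = SimpleImputer(strategy="most_frequent")
y_train_filled = target_imputer.fit_transform(y_train.reshape(-1, 1)).ravel()
# ...and reuse that same value for the test target.
y_test_filled = target_imputer.transform(y_test.reshape(-1, 1)).ravel()

# Encode the labels, again fitting on the training target only.
label_encoder = LabelEncoder().fit(y_train_filled)
y_train_encoded = label_encoder.transform(y_train_filled)
```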

Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer

I want to get feature names after I fit the pipeline.
categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
Then
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor())])
After fitting with pandas dataframe, I can get feature importances from
clf.steps[1][1].feature_importances_
and I tried clf.steps[0][1].get_feature_names() but I got an error
AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.
How can I get feature names from this?
You can access the feature_names using the following snippet:
clf.named_steps['preprocessor'].transformers_[1][1]\
.named_steps['onehot'].get_feature_names(categorical_features)
Using sklearn >= 0.21 version, we can make it even simpler:
clf['preprocessor'].transformers_[1][1]\
['onehot'].get_feature_names(categorical_features)
Reproducible example:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
'category': ['asdf', 'asfa', 'asdfas', 'as'],
'num1': [1, 1, 0, 0],
'target': [0.2, 0.11, 1.34, 1.123]})
numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', LinearRegression())])
clf.fit(df.drop('target', axis=1), df['target'])
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out there.
clf.named_steps['preprocessor'].transformers_[1][1]\
    .named_steps['onehot'].get_feature_names(categorical_features)
# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
# 'category_asdf' 'category_asdfas' 'category_asfa']
Scikit-Learn 1.0 now has new features to keep track of feature names.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                       self.feature_names_in_)
num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
(num_pipeline, ["age", "height"]),
(OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())
df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
"age": [32, 65, 18, 24],
"height": [172, 163, 169, 190],
"weight": [65, 62, 54, 95]},
index=["Alice", "Bunji", "Cécile", "Dave"])
pipeline.fit(df, df["weight"])
## get pipeline feature names
pipeline[:-1].get_feature_names_out()
## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
columns=pipeline[:-1].get_feature_names_out(),
index=df.index)
EDIT: actually Peter's comment answer is in the ColumnTransformer doc:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names returned by the ColumnTransformer's .get_feature_names() method depends on the order in which the transformers are declared when the ColumnTransformer is instantiated.
I could not find any doc so I just played with the toy example below and that let me understand the logic.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler
class testEstimator(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return np.full(X.shape, self.string).reshape(-1,1)
    def get_feature_names(self):
        return self.string
transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)
dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)
for name,step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())
For the sake of having a more representative example, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. The loop over the steps is my version of Venkatachalam's way of getting the feature names. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension:
[i for k, v in pipeline.named_steps.items() if hasattr(v, 'get_feature_names') for i in v.get_feature_names()]
So play around with dt_test and the estimators to see how the feature names are built and how they are concatenated in get_feature_names().
Here is another example with a transformer that outputs 2 columns, using the input column:
class testEstimator3(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string
    def fit(self,X,y=None):
        self.unique = np.unique(X)[0]
        return self
    def transform(self,X):
        return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)
    def get_feature_names(self):
        return list((self.unique,self.string))
dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)
transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)
pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())
If you are looking for how to access column names after successive pipelines with the last one being ColumnTransformer, you can access them by following this example:
In full_pipeline there are two pipelines, gender and relevent_experience:
full_pipeline = ColumnTransformer([
("gender", gender_encoder, ["gender"]),
("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])
The gender pipeline looks like this:
gender_encoder = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
("cat", OneHotEncoder())
])
After fitting the full_pipeline, you can access the column names using the following snippet
full_pipeline.transformers_[0][1][1].get_feature_names()
In my case the output was:
array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)
