Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer - python

I want to get feature names after I fit the pipeline.
categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
Then
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor())])
After fitting with pandas dataframe, I can get feature importances from
clf.steps[1][1].feature_importances_
and I tried clf.steps[0][1].get_feature_names() but I got an error
AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.
How can I get feature names from this?

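You can access the feature names using the following snippet (get_feature_names was deprecated in scikit-learn 1.0 in favour of get_feature_names_out, and removed in 1.2):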
clf.named_steps['preprocessor'].transformers_[1][1]\
.named_steps['onehot'].get_feature_names(categorical_features)
Using sklearn >= 0.21 version, we can make it even simpler:
clf['preprocessor'].transformers_[1][1]\
['onehot'].get_feature_names(categorical_features)
Reproducible example:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
'category': ['asdf', 'asfa', 'asdfas', 'as'],
'num1': [1, 1, 0, 0],
'target': [0.2, 0.11, 1.34, 1.123]})
numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', LinearRegression())])
clf.fit(df.drop('target', axis=1), df['target'])
clf.named_steps['preprocessor'].transformers_[1][1]\
.named_steps['onehot'].get_feature_names(categorical_features)
# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
# 'category_asdf' 'category_asdfas' 'category_asfa']

Scikit-Learn 1.0 now has new features to keep track of feature names.
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pandas as pd
# In Scikit-Learn 1.0, SimpleImputer does not have get_feature_names_out, so we
# add it manually. Later releases provide this method for SimpleImputer, so the
# patch is only needed on versions that lack it.
SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                       self.feature_names_in_)
num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
(num_pipeline, ["age", "height"]),
(OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())
df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
"age": [32, 65, 18, 24],
"height": [172, 163, 169, 190],
"weight": [65, 62, 54, 95]},
index=["Alice", "Bunji", "Cécile", "Dave"])
pipeline.fit(df, df["weight"])
## get pipeline feature names
pipeline[:-1].get_feature_names_out()
## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
columns=pipeline[:-1].get_feature_names_out(),
index=df.index)

EDIT: actually Peter's comment answer is in the ColumnTransformer doc:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
To complete Venkatachalam's answer with what Paul asked in his comment: the order of feature names returned by the ColumnTransformer .get_feature_names() method depends on the order in which the transformers are declared when the ColumnTransformer is instantiated.
I could not find any documentation at first, so I played with the toy example below, which helped me understand the logic.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler
class testEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, string):
        self.string = string
    def fit(self, X):
        return self
    def transform(self, X):
        # replace the selected column with a constant string column
        return np.full(X.shape, self.string).reshape(-1, 1)
    def get_feature_names(self):
        return self.string
transformers = [('first_transformer', testEstimator('A'), 1), ('second_transformer', testEstimator('B'), 0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler', RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)
dt_test = np.zeros((1000, 2))
pipeline.fit_transform(dt_test)
# only steps that expose get_feature_names (scikit-learn < 1.2) are printed
for name, step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())
To make the example more representative, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. The loop above is my version of Venkatachalam's approach: it iterates over the steps and prints the feature names of each step that provides them. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension:
[i for k, v in pipeline.named_steps.items() if hasattr(v, 'get_feature_names') for i in v.get_feature_names()]
So play around with dt_test and the estimators to see how each feature name is built, and how the names are concatenated in get_feature_names().
Here is another example with a transformer that outputs 2 columns, using the input column:
class testEstimator3(BaseEstimator, TransformerMixin):
    def __init__(self, string):
        self.string = string
    def fit(self, X):
        self.unique = np.unique(X)[0]
        return self
    def transform(self, X):
        return np.concatenate((X.reshape(-1, 1), np.full(X.shape, self.string).reshape(-1, 1)), axis=1)
    def get_feature_names(self):
        return list((self.unique, self.string))
dt_test2 = np.concatenate((np.full((1000, 1), 'A'), np.full((1000, 1), 'B')), axis=1)
transformers = [('first_transformer', testEstimator3('A'), 1), ('second_transformer', testEstimator3('B'), 0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)
pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())

If you are looking for how to access column names after successive pipelines, with the last one being a ColumnTransformer, you can follow this example.
The full_pipeline contains two pipelines, gender and relevent_experience:
full_pipeline = ColumnTransformer([
("gender", gender_encoder, ["gender"]),
("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])
The gender pipeline looks like this:
gender_encoder = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
("cat", OneHotEncoder())
])
After fitting the full_pipeline, you can access the column names using the following snippet
full_pipeline.transformers_[0][1][1].get_feature_names()
In my case the output was:
array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)

Related

Adding scale to my pipeline showing me weird error

from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
model = DecisionTreeRegressor()
my_pipeline = Pipeline(steps=[
("scale", StandardScaler),
("preprocessor", preprocessor),
("model", model)
])
my_pipeline.fit(X_train, y_train)
This raises the error AttributeError: 'DataFrame' object has no attribute 'fit'.
When I run the code without scaling there is no error, but with scaling I get an error that looks weird to me.
So my questions are:
How can I properly add scaling to my pipeline?
Why is the variable my_pipeline a pandas object?
Edit
#Creation of imputer to deal with missing value for numerical_cols
from sklearn.impute import SimpleImputer
numerical_transformer = SimpleImputer(strategy="mean")
#Creation of pipe to deal with categorical value and their missing value
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
categorical_cols = [cname for cname in X.columns if data[cname].nunique() < 10 and data[cname].dtype == "object"]
numerical_cols = [cname for cname in X.columns if data[cname].dtypes != "object"]
numerical_cols.remove("PassengerId")
categorical_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy='most_frequent')),
("onehot", OneHotEncoder(handle_unknown='ignore'))
])
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_cols),
("cat", categorical_transformer, categorical_cols)
])
The error is line 12 at my_pipeline.fit(X_train, y_train)
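One likely explanation, offered as a sketch rather than a confirmed diagnosis: the step ("scale", StandardScaler) passes the class itself instead of an instance, so the fit call ends up being bound to the DataFrame. Instantiating it as StandardScaler() avoids that. Because scaling only makes sense for numeric columns, the sketch below also moves the scaler into the numeric branch of the ColumnTransformer; the toy data and column names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy training data (hypothetical columns)
X_train = pd.DataFrame({
    "Age": [22.0, 38.0, float("nan"), 35.0],
    "Sex": ["male", "female", "female", "male"],
})
y_train = pd.Series([7.25, 71.28, 7.92, 53.10])
numerical_cols = ["Age"]
categorical_cols = ["Sex"]

# Impute, then scale, numeric columns only; note StandardScaler() is an instance
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])
my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", DecisionTreeRegressor(random_state=0)),
])
my_pipeline.fit(X_train, y_train)
predictions = my_pipeline.predict(X_train)
print(predictions.shape)  # (4,)
```

Under this reading, my_pipeline itself is still a Pipeline; the DataFrame in the message is the training data that was mistakenly treated as the transformer.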

Getting "ValueError: could not convert string to float: ..." for sklearn pipeline

I'm a beginner trying to learn sklearn pipelines. I get ValueError: could not convert string to float when I run the code below. I'm not sure what the reason is, since OneHotEncoder shouldn't have any problem converting strings to floats for categorical variables.
import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c!='income']
X = df[x_cols]
y = df['income']
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
preprocessor = ColumnTransformer(
transformers=[
('imputer', SimpleImputer(strategy='most_frequent'),['workclass','education','native-country']),
('onehot', OneHotEncoder(), ['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex','native-country'])
]
)
clf = Pipeline([('preprocessor', preprocessor),
('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)
Unfortunately, there is an issue with scikit-learn's SimpleImputer when it tries to impute string variables. There is an open issue about it on their GitHub page.
To get around this, I'd recommend splitting your pipeline into two steps: 1) replacing the null values, and 2) the rest. Something like this:
cols_with_null = ['workclass','education','native-country']
preprocessor = ColumnTransformer(
transformers=[
(
'imputer',
SimpleImputer(missing_values=np.nan, strategy='most_frequent'),
cols_with_null),
])
preprocessor.fit(X_train)
X_train_new = preprocessor.transform(X_train)
for icol, col in enumerate(cols_with_null):
X_train.loc[:, col] = X_train_new[:, icol]
# confirm no null values in these columns:
for col in cols_with_null:
print('{}, null values: {}'.format(col, pd.isnull(X_train[col]).sum()))
Now that you have X_train with no null values, the rest should work without SimpleImputer:
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(), ['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'sex','native-country'])])
clf = Pipeline([('preprocessor', preprocessor),
('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)
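An alternative that keeps everything in a single pipeline, offered as a sketch: a ColumnTransformer applies its transformers side by side to the original columns rather than one after another, so in the question the imputed columns are concatenated to the output still as strings. Chaining the imputer and the encoder inside a nested Pipeline makes the encoder see the imputed values. The toy data below stands in for the adult dataset, and remainder="passthrough" is an assumption made here so that the numeric column is kept.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the adult dataset (hypothetical values)
X = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private", "Private", "State-gov"],
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "age": [39, 50, 38, 53, 28, 37],
})
y = np.array([0, 1, 0, 1, 0, 1])

categorical_cols = ["workclass", "sex"]
# Impute first, then one-hot encode the imputed values
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    transformers=[("cat", categorical_transformer, categorical_cols)],
    remainder="passthrough",  # keep the numeric column unchanged
)
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=0)),
])
clf.fit(X, y)
predictions = clf.predict(X)
print(predictions.shape)  # (6,)
```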

Why do I get more columns than expected after OneHotEncoding in a Sklearn Pipeline?

I'm using sklearn pipelines to preprocess my data.
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler()),
('imputer', KNNImputer(n_neighbors=2,weights='uniform', metric='nan_euclidean', add_indicator=True))
])
categorical_transformer = Pipeline(steps=[
('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
from sklearn.compose import make_column_selector as selector
numeric_features = ['Latitud','Longitud','Habitaciones','Dormitorios','Baños','Superficie_Total','Superficie_cubierta']
categorical_features = ['Tipo_de_propiedad']
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('numeric', numeric_transformer, numeric_features, selector(dtype_exclude="category"))
,('categorical', categorical_transformer, categorical_features, selector(dtype_include="category"))])
The feature Tipo_de_propiedad has 3 classes: 'Departamento', 'Casa', 'PH'. So the 7 other features plus these dummies should give me 10 after transforming, but when I apply fit_transform, it returns 14 features.
train_transfor=pd.DataFrame(preprocessor.fit_transform(X_train))
train_transfor.head()
When I use pd.get_dummies it works well, but I can't use that to apply in the Pipeline; OneHotEncoder is useful because I can fit on train set and transform on the test set.
dummy=pd.get_dummies(df30[["Tipo_de_propiedad"]])
df_new=pd.concat([df30,dummy],axis=1)
df_new.head()
Your KNNImputer has used the parameter add_indicator=True, so the additional columns are presumably missingness indicators for some of your numeric columns.
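The effect is easy to reproduce on a small array. This sketch uses made-up values: with add_indicator=True, the imputer appends one binary indicator column for each input column that contained missing values during fit.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Three numeric columns; the second and third each contain one missing value
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 4.0, 9.0],
    [7.0, 8.0, 12.0],
])

plain = KNNImputer(n_neighbors=2).fit_transform(X)
flagged = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(X)

print(plain.shape)    # (4, 3)
print(flagged.shape)  # (4, 5): 3 imputed columns + 2 missingness indicators
```

By the same logic, 14 output columns instead of 10 would be consistent with four numeric columns having missing values in the training set.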

could not convert string to float: 'N' while implementing StandardScaler

Code:
num_features = [feature for feature in x.columns if x[feature].dtypes != 'O']
x[num_features].replace('N',value=0)
from sklearn.preprocessing import StandardScaler
stds = StandardScaler()
x[num_features]= stds.fit(x)
Error:
ValueError: could not convert string to float: 'N'
To select numeric columns:
numeric_features = df.select_dtypes(include=['number'])
Your current method will not exclude Booleans (True and False), category values, or dates and times.
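The difference between the two selection methods can be seen on a small frame. This is an illustrative sketch with made-up columns, comparing the dtype != 'O' filter from the question with select_dtypes.

```python
import pandas as pd

# Toy frame (hypothetical) covering several dtypes
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.5],
    "count": [1, 2, 3],
    "flag": [True, False, True],
    "label": pd.Categorical(["a", "b", "a"]),
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "code": ["N", "Y", "N"],
})

# Filter from the question: keeps everything that is not object dtype
not_object = [c for c in df.columns if df[c].dtype != "O"]
# select_dtypes keeps only integer and float columns here
numeric = df.select_dtypes(include=["number"]).columns.tolist()

print(not_object)  # ['price', 'count', 'flag', 'label', 'when']
print(numeric)     # ['price', 'count']
```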
Just having numeric values is not enough. We may also have missing values. We need to deal with them before normalisation:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# numeric transformer
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
# impute missing values with each column's median, then standardise
normalise_numeric_features = numeric_transformer.fit_transform(numeric_features)
This might not be enough, as you might have different data types in your dataset each needing their own transformer. You can do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
# select features types
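# ColumnTransformer expects column names, so keep the names rather than the sub-frames
numeric_features = df.select_dtypes(include=['number']).columns.tolist()
categorical_features = df.select_dtypes(include=['category']).columns.tolist()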
# create transformer for each feature types
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# create a process
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# if you have a model you can add it to the pipeline too
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])
# now we can train the model
clf.fit(df[train_features], df[train_target])
# fit takes the features, runs the preprocessing (impute, normalise, etc.) via fit_transform, and trains your model.
clf.predict(df[test_features])
# predict only transforms your features, using the statistics learned from train_features.
Hope this helps you get the most out of scikit-learn.

Using Pipeline to Avoid Data Leakage both for X and y

I followed the example at this link very closely: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
but used a different dataset (found here: https://archive.ics.uci.edu/ml/datasets/Adult)
Unlike the example dataset, my dataset contains a couple of nulls in the y (aka 'target') column as well. I am wondering how I would build a pipeline similar to the one in the sklearn documentation that both transforms columns in X and y rather than just in X. This is what I have tried but of course it doesn't work because the 'target' feature has been dropped from X in order to allow for train_test_split:
import os
import pandas as pd
import numpy as np
os.listdir()
df_train = pd.read_csv('data_train.txt', header=None)
df_test = pd.read_csv('data_test.txt', header=None)
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-sign', 'capital-loss',
'hours-per-week', 'native-country', 'target']
df = pd.concat([df_train, df_test], axis=0)
df.head()
df.columns = names
Data Leakage Example - Preprocessing is done on the entire dataset
df_leakage = df.copy()
#Exploring nulls:
df_leakage[df_leakage.isnull().any(axis=1)]
There are only two nulls so it would be safe to discard them but for the purposes of this demo we will impute values
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Defining numeric features and the transformations we will apply to them
numeric_features = ['capital-sign', 'capital-loss', 'hours-per-week', 'education-num']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())])
#Defining categorical features and the transformations we will apply to them
categorical_features = ['marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
#Defining categorical ordinal features and the transformations we will apply to them
#Note: we already have an education-continuous feature available but we will demo how to apply an ordinal transformation:
ordinal_features = ['education']
ordinal_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('ordinal', OrdinalEncoder())])
#Defining categorical ordinal features and the transformations we will apply to them
#Note: we already have an education-continuous feature available but we will demo how to apply an ordinal transformation:
ordinal_features = ['education']
label_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('label', LabelEncoder())])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
('ord', ordinal_transformer, ordinal_features),
('lab', label_transformer, label_features),
])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])
X = df_leakage.drop('target', axis=1)
y = df_leakage['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Error log (partial):
658 columns = list(key)
659
--> 660 return [all_columns.index(col) for col in columns]
661
662 elif hasattr(key, 'dtype') and np.issubdtype(key.dtype, np.bool_):
ValueError: 'y' is not in list
Obviously, I could easily apply the transformation to y separately, before any of these other transformations take place. I think it would be fine to handle character strings errors e.g. ".<=50K" vs "<=50K". But if I want to impute the value of y with its mean, the value imputed would depend on the specific choice of y_train -- thus causing some data leakage.
How would I be able to use the pipeline library to do this effectively?
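One possible approach, sketched under assumptions and not taken from the scikit-learn example: a Pipeline only transforms X and passes y through unchanged, so the target is usually cleaned outside the pipeline. Row-wise fixes (normalising label spelling, dropping rows with a missing target) do not use statistics from other rows, so doing them before the split does not leak information; anything that is fitted, such as a LabelEncoder, can be fitted on y_train only. The toy data and label variants below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical) with inconsistent labels and missing targets
df = pd.DataFrame({
    "hours-per-week": [40, 50, 38, 45, 60, 20, 35, 40],
    "target": ["<=50K", ">50K.", np.nan, "<=50K.", ">50K", "<=50K", None, ">50K"],
})

# Row-wise cleanup: strip whitespace and a trailing period from each label
df["target"] = df["target"].str.strip().str.rstrip(".")
# Drop rows whose target is missing instead of imputing it
labelled = df.dropna(subset=["target"])

X = labelled.drop(columns="target")
y = labelled["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the encoder on the training labels only, then reuse it on the test labels
encoder = LabelEncoder().fit(y_train)
y_train_enc = encoder.transform(y_train)
y_test_enc = encoder.transform(y_test)
print(list(encoder.classes_))
```

The X-side preprocessing can then stay in the ColumnTransformer and Pipeline exactly as in the documentation example, fitted with clf.fit(X_train, y_train_enc).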
