My question is about creating custom transformations based on a train set and reapplying them to new observations. To achieve this I usually use the Pipeline object from sklearn.
The transformation I want to build is a custom grouping for categorical variables. The method lets you choose the proportion below which a category is considered rare: if the proportion of occurrences of a category is less than the specified threshold (the parameter), we relabel that category as 'OTHER'.
The problem comes when there are categories in the test set that do not occur in the train set. The code below raises the following error:
ValueError: Cannot setitem on a Categorical with a new category, set the
categories first
Here is the code I use:
trainDF = df[0:8000]
testDF = df[8000:10142]
class CustomBinningCategoricalFeature(TransformerMixin):
    def __init__(self, percRareClass):
        self.percRareClass = percRareClass
        self.rareClass = {}

    def fit(self, X, y=0):
        for column_name in list(X.head(0)):
            if X[column_name].dtype != np.dtype(float):
                df = pd.crosstab(index=X[column_name], columns='count')
                df['prop'] = df['count'] / df['count'].sum()
                self.rareClass[column_name] = df.index[df['prop'] < self.percRareClass].tolist()
        return self

    def transform(self, X, y=0):
        for column_name in self.rareClass.keys():
            #X.loc[~X[column_name].isin(list(X[column_name].unique())), column_name] = 'OTHER'
            X.loc[X[column_name].isin(self.rareClass[column_name]), column_name] = 'OTHER'
        return X
pipeline = make_pipeline(CustomBinningCategoricalFeature(0.01))
pipeline.fit(trainDF)
transformed_testDF = pipeline.transform(testDF)
In production, it can happen that new categories occur. In this situation, we face at least two choices:
Do not score new data if a category is unknown in the train set.
As it is the first time the category occurs, consider it a rare class and assign it the 'OTHER' category.
In our case, we want to choose the second option.
Do you know a way to code the fit and transform methods so they can be used in pipelines and applied to new data according to the second option above (assigning 'OTHER' to new categories occurring in the test set)?
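For reference, here is a minimal sketch of one possible implementation (untested on your data): during fit, record the categories to keep per column rather than the rare ones; during transform, cast the column to object dtype (which avoids the Categorical setitem error) and send everything not in the kept list, rare or unseen, to 'OTHER'. The keptClass attribute replaces the original rareClass.
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.pipeline import make_pipeline

class CustomBinningCategoricalFeature(TransformerMixin):
    def __init__(self, percRareClass):
        self.percRareClass = percRareClass
        self.keptClass = {}

    def fit(self, X, y=None):
        for column_name in X.columns:
            if X[column_name].dtype != np.dtype(float):
                prop = X[column_name].value_counts(normalize=True)
                # categories frequent enough in the train set to be kept as-is
                self.keptClass[column_name] = prop.index[prop >= self.percRareClass].tolist()
        return self

    def transform(self, X, y=None):
        X = X.copy()
        for column_name, kept in self.keptClass.items():
            # leave Categorical dtype so that new labels can be assigned,
            # then map rare *and* unseen categories to 'OTHER'
            X[column_name] = X[column_name].astype(object)
            X.loc[~X[column_name].isin(kept), column_name] = 'OTHER'
        return X

pipeline = make_pipeline(CustomBinningCategoricalFeature(0.01))
pipeline.fit(trainDF)
transformed_testDF = pipeline.transform(testDF)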
I am trying to set up my GBDTLRClassifier following the instructions here.
First, I label-encoded my columns. Then I defined my categorical and continuous features, putting the column names in two lists.
cat # categorical column names
conts # continuous column names
gbm = lgb.LGBMClassifier(n_estimators=90)
classifier = GBDTLRClassifier(gbm, LogisticRegression(penalty='l2'))
dm = DataFrameMapper([([cat_col], CategoricalDomain()) for cat_col in cat] + [(conts, ContinuousDomain())])
pipeline = PMMLPipeline([('mapper', dm), ('classifier', classifier)])
pipeline.fit(df[cat + conts], df['y'], classifier__gbdt__eval_set=[(val[cat + conts], val['y'])], classifier__gbdt__early_stopping_rounds = 5, classifier__gbdt__categorical_feature=cat)
pp = make_pmml_pipeline(pipeline, target_fields=['y'])
sklearn2pmml(pp, '/tmp/lgb+lr.pmml')
I get an error message during fitting: TypeError: Wrong type(str) or unknown name(root) in categorical_feature. However, root is definitely in cat. It looks like LightGBM is not aware of which columns are categorical, which is confusing.
Moreover, when I remove the mapper part there is no fitting error, but the conversion to a PMML file fails with the message: transformer object of the first step does not specify the number of input features.
Could anyone tell me how to make this procedure work? Thanks.
Based on a comment here, I need to set feature_name when passing string column names to categorical_feature. A little tricky.
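In practice the adjusted fit call would look roughly like this (untested; it assumes the mapper preserves the cat + conts column order so the names line up):
pipeline.fit(df[cat + conts], df['y'],
             classifier__gbdt__eval_set=[(val[cat + conts], val['y'])],
             classifier__gbdt__early_stopping_rounds=5,
             # give LightGBM the column names so the string names in
             # categorical_feature can be resolved
             classifier__gbdt__feature_name=cat + conts,
             classifier__gbdt__categorical_feature=cat)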
How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the same approach based on this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentations and further model-building steps. I've been searching for a systematic approach, with no luck.
The answer mentioned above is based on this example in scikit-learn.
You can get the answer to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features
import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
columns=get_feature_names(preprocessor))
transformed_cols = get_feature_names(preprocessor)
def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]
print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether there has been some imputation or scaling done on a feature is not trivial with the current version of Sklearn.
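For questions 3 and 4, though, the fitted values can usually be read directly off the fitted transformers. A minimal sketch, assuming your numerical pipeline is registered under the name 'num' and contains steps named 'imputer' (a SimpleImputer) and 'scaler' (a StandardScaler); adjust the names to match your own ColumnTransformer:
num_pipe = preprocessor.named_transformers_['num']  # hypothetical transformer name

imputer = num_pipe.named_steps['imputer']   # SimpleImputer
scaler = num_pipe.named_steps['scaler']     # StandardScaler

print(imputer.statistics_)  # exact value imputed into each numerical column
print(scaler.mean_)         # mean used to standardize each column (computed after imputation)
print(scaler.scale_)        # stdev used to standardize each column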
I've been trying to develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location.
This is my class definition
#initial model to predict the amount of fines a nursing home might expect to pay based on its location
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin

class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    #defines what a group is by using grouper
    #initialises an empty dictionary for group averages
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    #Any calculation I require for my predict method goes here
    #Specifically, I want to groupby the group grouper is set by
    #I want to then find out what is the mean penalty by each group
    #X is the data containing the groups
    #y is fine_totals
    #map each state to its mean fine_tot
    def fit(self, X, y):
        #Use self.group_averages to store the average penalty by group
        Xy = X.join(y)  #Joining X & y together
        state_mean_series = Xy.groupby(self.grouper)[y.name].mean()  #Creating a series of state:mean penalties
        #populating a dictionary with state:mean key:value pairs
        for row in state_mean_series.iteritems():
            self.group_averages[row[0]] = row[1]
        return self

    #The amount of fine an observation is likely to receive is based on its group mean
    #Want to first populate the list with the number of observations
    #For each observation in the list, find its group and then set the likely fine to its group mean
    #Return the list
    def predict(self, X):
        dictionary = self.group_averages
        group = self.grouper
        list_of_predictions = []  #initialising a list to store our return values
        for row in X.itertuples():  #iterating through each row in X
            prediction = dictionary[row.STATE]  #Getting the value from the group_averages dict using key row.STATE
            list_of_predictions.append(prediction)
        return list_of_predictions
It works for this
state_model.predict(data.sample(5))
But breaks down when I try to do this:
state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))
My model can't handle a state it has not seen during fitting, and I would like help rectifying this.
The problem I see is in your fit method: iteritems iterates over columns of a DataFrame rather than rows. You should use itertuples, which gives you row-wise data. Just change the loop in your fit method to:
for row in pd.DataFrame(state_mean_series).itertuples():  # row format is (STATE, mean_value)
    self.group_averages[row[0]] = row[1]
and then, in your predict method, add a fail-safe check:
prediction = dictionary.get(row.STATE, None)  # None is the default in case 'AS' doesn't exist; replace it with whatever you want
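Putting both pieces together, here is a minimal sketch of the amended estimator (the global_average_ fallback is my own addition, not part of the original code, and it assumes grouper is a single column name such as 'STATE'):
from sklearn.base import BaseEstimator, RegressorMixin

class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    def fit(self, X, y):
        Xy = X.join(y)
        # mean penalty per group, stored as a plain dict
        self.group_averages = Xy.groupby(self.grouper)[y.name].mean().to_dict()
        # fallback for groups never seen during fitting (assumption, not in the original)
        self.global_average_ = y.mean()
        return self

    def predict(self, X):
        # unseen groups such as 'AS' fall back to the global mean instead of raising a KeyError
        return [self.group_averages.get(getattr(row, self.grouper), self.global_average_)
                for row in X.itertuples()]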
I am running a TensorFlow model on the GCP AI Platform. The dataset is large and not everything can be kept in memory at the same time, therefore I read the data into a tf.data.Dataset using the following code:
def read_dataset(filepattern):
    def decode_csv(value_column):
        cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
        features = [cols[1], cols[2]]
        label = cols[0]
        return features, label
    # Create list of files that match pattern
    file_list = tf.io.gfile.glob(filepattern)
    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    return dataset
training_data=read_dataset(<filepattern>)
The problem is that the second column in my data is categorical, and I need to use one-hot encoding. How can this be done, either in the decode_csv function or by manipulating the tf.data.Dataset afterwards?
You could use tf.one_hot. Assuming that the second column is cols[1] and that the categorical values have been converted to integers, you could do the following:
def decode_csv(value_column):
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
    features = [tf.one_hot(cols[1], nb_classes), cols[2]]  # nb_classes = total number of categories
    label = cols[0]
    return features, label
NOTE: Not tested.
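If the categorical column actually contains strings rather than integer codes, one option (sketch only; the vocabulary below is made up) is to map the strings to indices with a lookup table before one-hot encoding:
import tensorflow as tf

# hypothetical vocabulary of the categorical column, known ahead of time
vocab = ['red', 'green', 'blue']
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(vocab, tf.range(len(vocab), dtype=tf.int64)),
    default_value=-1)  # strings not in the vocabulary map to -1 (an all-zero one-hot vector)

def decode_csv(value_column):
    # read the categorical column as a string this time
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [''], [0.0]])
    cat_index = table.lookup(cols[1])
    features = [tf.one_hot(cat_index, len(vocab)), cols[2]]
    label = cols[0]
    return features, label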
Suppose I have a location feature. In the train data set its unique values are 'NewYork' and 'Chicago'. But in the test set it has 'NewYork', 'Chicago' and 'London'.
So when creating the one-hot encoding, how do I ignore 'London'?
In other words, how do I avoid encoding categories that only appear in the test set?
You can use the handle_unknown parameter of one-hot encoding.
ohe = OneHotEncoder(handle_unknown='ignore')
This will not raise an error and will let execution proceed; categories unseen during fit are simply encoded as all zeros.
See Documentation for more
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
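For illustration, a small made-up example (column and category names are assumptions):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'location': ['NewYork', 'Chicago']})
test = pd.DataFrame({'location': ['NewYork', 'Chicago', 'London']})

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['location']])

print(ohe.categories_)                               # [array(['Chicago', 'NewYork'], ...)]
print(ohe.transform(test[['location']]).toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]   <- 'London' was not seen during fit, so its row is all zeros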
Often you don't want to eliminate information; you want to build it into your model up front. For example you might have some data with NaN values:
train_data = ['NewYork', 'Chicago', np.nan]
Solution 1
You will likely have a way of dealing with this; whether you impute, delete, etc. is up to you and depends on the problem. More often than not you can make NaN its own category, as this is information as well. Something like this can suffice:
# function to replace NA in categorical variables
def fill_categorical_na(df, var_list):
    X = df.copy()
    X[var_list] = df[var_list].fillna('Missing')
    return X
# replace missing values with new label: "Missing"
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)
Therefore, when you move to production you could write a script that pushes unseen categories into this "missing" category you've established earlier.
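A minimal sketch of what that production-time script could look like (train_categories is an assumed dict of the category sets recorded per column at training time):
import numpy as np

# assumed: recorded at training time, after the 'Missing' label was added
train_categories = {var: set(X_train[var].unique()) for var in vars_with_na}

for var in vars_with_na:
    # anything not seen during training gets folded into the 'Missing' bucket
    X_test[var] = np.where(X_test[var].isin(train_categories[var]), X_test[var], 'Missing')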
Solution 2
If you're not satisfied with that idea, you could always turn these unusual cases into a new unique category that we'll call "rare" because it's not present often.
train_data = ['NewYork', 'Chicago', 'NewYork', 'Chicago', 'London']
# let's capture the categorical variables first
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

def find_frequent_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)['Target_Variable'].count() / len(df)
    return tmp[tmp > rare_perc].index

for var in cat_vars:
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')
Now, given enough instances of the "normal" categories, London will get pushed into the "Rare" category. Regardless of how many new categories show up, they will be grouped into 'Rare', provided they remain rare instances and don't become dominant categories.
Assuming these are your lists:
train_data = ['NewYork', 'Chicago']
test_set = ['NewYork', 'Chicago', 'London']
Based on your question:
How not to encode the categories that only appear in the test set?
for each in test_set:
    if each in train_data:
        print(each)
This outputs NewYork & Chicago, which means London is skipped.