How to encode a high-cardinality feature? - python

I have a dataset with 1500+ categories in a single feature. How do I encode this feature? I tried target encoding, but there is a category mismatch between the training and test datasets. For example, in the training dataset feature X has the categories A, B, C, while in the test dataset feature X has the categories A, B, D, F.
How do I deal with this category mismatch and encode a categorical variable with high cardinality?

Related

Differences between OneHotEncoding (sklearn) and get_dummies (pandas)

I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code is currently working like the one below:
import pandas as pd

def one_hot_encode(ds, feature):
    # Get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    # One dummy variable to drop (dummy variable trap)
    dummyDrop = dummies.columns[0]
    # Create a DF from the original and the dummies' DF
    # Drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final

# Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns

# Perform one-hot encoding (see function above) on the categorical features
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)

# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should obtain the same result.
If you apply get_dummies() to the whole dataset and OneHotEncoder() only to the train dataset, you will probably obtain a few (very small) differences if the test data contains a "new" category. If not, they should give the same result.
The main difference between get_dummies() and OneHotEncoder() is how they behave when you use the model in real life (or in production) and you receive a "new" class of a categorical column that you haven't faced before.
Example: imagine your category "sex" can only be male or female, and you sold your model to a company. What happens if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" is a valid option that only appears 0.001% of the time, and by chance you don't have any of these values in your dataset.)
Using get_dummies(), you will have a problem, since your model was trained on only 2 different categories of sex, and now you have a different, new category that the model can't handle.
Using OneHotEncoder() allows you to "ignore" this new category that your model can't handle, keeping the same shape between the model input and your new sample input.
That's why people use OneHotEncoder() on the train set and not on the whole dataset: they are "simulating" this type of situation (having a "new" class you haven't faced before in a categorical column).
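As a rough illustration of that difference, here is a small sketch; the toy "sex" column is invented for illustration and handle_unknown="ignore" is assumed on the encoder:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: training only ever sees "male"/"female"; production data brings an unseen "NA".
train = pd.DataFrame({"sex": ["male", "female", "female"]})
new = pd.DataFrame({"sex": ["male", "NA"]})

# get_dummies encodes whatever it sees, so train and new end up with different columns.
print(pd.get_dummies(train["sex"]).columns.tolist())  # ['female', 'male']
print(pd.get_dummies(new["sex"]).columns.tolist())    # ['NA', 'male']

# OneHotEncoder fitted on train keeps a fixed output shape;
# with handle_unknown="ignore" the unseen "NA" becomes an all-zero row.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["sex"]])
print(ohe.transform(new[["sex"]]).toarray())
# [[0. 1.]
#  [0. 0.]]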

Handling category, float and int type features while using LIME for model interpretation

I am using LIME (Local Interpretable Model-agnostic Explanations) with mixed feature types in order to evaluate my model predictions for a classification task. Does anyone know how to specify binary features in the lime.lime_tabular.LimeTabularExplainer() method? How does LIME actually handle these types of features (several features with only 1s and 0s)?
I think you should declare your binary features as categorical features in order to allow your LIME explainer to use its sampling mechanism efficiently when performing local perturbation around the studied sample.
You can do it using the categorical_features keyword parameter in the LimeTabularExplainer constructor.
from lime.lime_tabular import LimeTabularExplainer

my_binary_feature_column_index = 0  # put your column index here
explainer = LimeTabularExplainer(
    my_data,
    categorical_features=[my_binary_feature_column_index],
    categorical_names={my_binary_feature_column_index: ["foo", "bar", "baz"]},
)
categorical_features is a list of categorical column indexes, and
categorical_names is a dictionary mapping each column index to its list of category names.
As mentioned in the LIME code:
Explains predictions on tabular (i.e. matrix) data. For numerical features, perturb them by sampling from a Normal(0,1) and doing the inverse operation of mean-centering and scaling, according to the means and stds in the training data. For categorical features, perturb by sampling according to the training distribution, and making a binary feature that is 1 when the value is the same as the instance being explained.
So, categorical features are one hot encoded under the hood and the value 0 or 1 is used according to the feature distribution in your training dataset (unless you chose to use a LabelEncoder, which will result in LIME processing the feature as a continuous variable).
A good tutorial is available in the LIME project: https://github.com/marcotcr/lime/blob/master/doc/notebooks/Tutorial%20-%20continuous%20and%20categorical%20features.ipynb
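For completeness, here is a rough, self-contained sketch of how such an explainer might be wired up; the data, model, feature names and category names below are invented for illustration:

import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

# Invented data: two continuous features plus one binary feature at index 2.
rng = np.random.RandomState(0)
X = np.hstack([rng.normal(size=(100, 2)), rng.randint(0, 2, size=(100, 1))])
y = (X[:, 0] + X[:, 2] > 0.5).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["f0", "f1", "is_flag"],
    categorical_features=[2],              # declare the binary column as categorical
    categorical_names={2: ["no", "yes"]},  # value 0 -> "no", 1 -> "yes"
    class_names=["class_0", "class_1"],
    mode="classification",
)

exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())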

All independent variables are categorical and the dependent (target) variable is continuous

This is the data I need to build a model on:
The dataframe contains 834 rows and 4 columns ('Size', 'Sector', 'Road Connectivity', 'Price').
The aim is to train a model to predict the price.
'Size', 'Sector' and 'Road Connectivity' are the 3 features assigned to the X variable.
'Price', the target feature, is assigned to the y variable.
I have made a pipeline which consists of a one-hot encoder and a linear regressor; below is the code:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

ohc = OneHotEncoder(categories="auto")
lr = LinearRegression(fit_intercept=True, normalize=True)
pipe = make_pipeline(ohc, lr)

kfolds = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(pipe, X, y, cv=kfolds).mean()
# output = 0.8970496076598085

xinp = [['04M', 'Sec 10', 'C road']]
pipe.fit(X, y)
pipe.predict(xinp)
Now when I pass the values to the pipeline to predict, it shows an error:
"Found unknown categories ['Sec 10'] in column 1 during transform"
Any suggestions that help build the model are appreciated.
It looks like you provided a category (in xinp, the 'Sec 10' value) that was not present in the training data, so it cannot be one-hot encoded because there is no dummy variable (no corresponding binary column) for it. One possible solution is the following:
ohc = OneHotEncoder(categories="auto", handle_unknown="ignore")
From the scikit-learn OneHotEncoder documentation:
handle_unknown : {'error', 'ignore'}, default='error'
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
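A minimal sketch of the fix in context; the small dataframe below is an invented stand-in for the asker's data, with only the column names and the xinp row coming from the question:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Invented stand-in for the asker's training data.
X = pd.DataFrame({
    "Size": ["04M", "02M", "04M"],
    "Sector": ["Sec 1", "Sec 2", "Sec 1"],
    "Road Connectivity": ["A road", "B road", "C road"],
})
y = [100, 80, 120]

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising "Found unknown categories ... during transform".
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
pipe.fit(X, y)

xinp = pd.DataFrame([["04M", "Sec 10", "C road"]], columns=X.columns)
print(pipe.predict(xinp))  # 'Sec 10' was never seen, but prediction no longer fails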

How to use one hot encoding for multiple labels (trainy) in the .fit() method?

I have a mobile price classification dataset with 20 features and one target variable called price_range. I need to classify mobile prices as low, medium, high, or very high. I applied one-hot encoding to my target variable. After that, I split the data into trainX, testX, trainy, testy, so the shapes of trainX and trainy are (1600, 20) and (1600, 4) respectively.
Now when I try to fit trainX and trainy with logistic regression,
i.e. lr.fit(trainX, trainy)
I get an error that says: bad input (1600, 4)
So I understand that I have to give trainy in shape (1600, 1),
but one-hot encoding gave me an array with 4 columns, one for each price_range value.
So now I am totally confused: how do people use one-hot encoding for the target variable in practice?
Please help me out.
To train the model, you should only apply OneHotEncoder to the features to obtain X.
Apply LabelEncoder() to convert y:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform(['a', 'b', 'a'])
# output: array([0, 1, 0])
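A short sketch of that flow for this kind of setup; the data below is invented, and only the low/medium/high/very high labels come from the question:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented stand-in for the 20 mobile features and the price_range target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 20))
price_range = rng.choice(["low", "medium", "high", "very high"], size=200)

# LabelEncoder turns the 4 class names into integers 0..3, keeping y one-dimensional.
y = LabelEncoder().fit_transform(price_range)

trainX, testX, trainy, testy = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000)
lr.fit(trainX, trainy)          # works: trainy has shape (n_samples,)
print(lr.score(testX, testy))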

How to one hot encode with pandas on a new dataset?

I have a training data set that has categorical features on which I use pd.get_dummies to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.
For example, suppose your training dataframe tradf (after get_dummies) has the columns ['A_1', 'A_2'], while your new df only has column 'A' with the single category 1. You can do:
pd.get_dummies(df).reindex(columns=tradf.columns, fill_value=0)
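A small self-contained sketch of this pattern; tradf and df follow the naming in the answer above, and the data is invented:

import pandas as pd

# Training data: dummy columns A_1 and A_2 are created.
train = pd.DataFrame({"A": [1, 2, 1]})
tradf = pd.get_dummies(train, columns=["A"], dtype=int)
print(tradf.columns.tolist())  # ['A_1', 'A_2']

# New data only contains category 1, so get_dummies alone would give fewer columns.
df = pd.DataFrame({"A": [1, 1]})
new_encoded = pd.get_dummies(df, columns=["A"], dtype=int).reindex(columns=tradf.columns, fill_value=0)
print(new_encoded.columns.tolist())  # ['A_1', 'A_2'] - same columns as training
print(new_encoded["A_2"].tolist())   # [0, 0] - the missing dummy is filled with zeros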
