I am wondering how pandas' get_dummies() encoding of categorical features differs from sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encodings for categories not seen in the training dataset (answers here). However, this is a result of having applied get_dummies() separately to the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied get_dummies() to the original dataset before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code currently looks like this:
import pandas as pd

def one_hot_encode(ds, feature):
    # Get a DataFrame of dummy variables for the feature
    dummies = pd.get_dummies(ds[feature])
    # One dummy variable to drop (dummy variable trap)
    dummyDrop = dummies.columns[0]
    # Build a DataFrame from the original and the dummies,
    # then drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final
# Load the data into a DataFrame
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
# Perform one-hot encoding (see function above) on the categorical features
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)
# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() on the whole dataset, you should obtain the same result.
If you apply get_dummies() on the whole dataset and OneHotEncoder() only on the train dataset, you will probably see a few (very small) differences if the test data contains a "new" category. If not, they should give the same result.
The main difference between get_dummies() and OneHotEncoder() is how they behave when you use the model in real life (or in production) and it receives a "new" class of a categorical column that it has never faced before.
Example: imagine your category "sex" can only be male or female, and you sold your model to a company. What will happen if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" is a valid option, but it appears only 0.001% of the time, and by chance you don't have any such value in your dataset.)
Using get_dummies(), you will have a problem, since your model was trained on only 2 categories of sex, and now there is a new category that the model can't handle.
Using OneHotEncoder() allows you to "ignore" this new category that your model can't face, keeping the same shape between the model input and your new sample input.
That's why people use OneHotEncoder() on the train set and not on the whole dataset: they are "simulating" this kind of scenario (encountering a "new" class in a categorical column that was never seen before).
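As a rough illustration of this difference, here is a minimal sketch (the "sex" data below is made up): get_dummies() encodes each frame independently, while an OneHotEncoder fitted with handle_unknown='ignore' keeps the training columns and encodes the unseen value as all zeros.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"sex": ["male", "female", "female"]})
test = pd.DataFrame({"sex": ["male", "NA"]})  # "NA" never appears in train

# get_dummies works per frame, so the column sets differ
print(pd.get_dummies(train["sex"]).columns.tolist())  # ['female', 'male']
print(pd.get_dummies(test["sex"]).columns.tolist())   # ['NA', 'male']

# An encoder fitted on train keeps the same two columns;
# the unseen "NA" row becomes all zeros instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["sex"]])
print(enc.transform(test[["sex"]]).toarray())
# [[0. 1.]
#  [0. 0.]]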
Related
I am trying to convert the categorical columns of my dataset into numerical values using LabelEncoder.
[screenshot of the dataset]
Here is the conversion code:
for i in cat_columns:
    df[i] = encoder.fit_transform(df[i])
After conversion, the dataset looks like this: [screenshot of the dataset after transformation]
But the problem is that whenever I try to transform my test dataset, it gives this error:
y contains previously unseen labels: 'Male'
Code for the transformation on the test data:
for i in cat_columns:
    df1[i] = encoder.transform(df1[i])
[screenshot of the test data]
Now how can I solve this problem?
I guess the problem is that you are using the same encoder to fit all the different columns. You should instead fit each column using a different encoder. For example, you can use a dictionary to store the different encoders:
from sklearn import preprocessing

encoders = {}
for i in cat_columns:
    encoders[i] = preprocessing.LabelEncoder()
    df[i] = encoders[i].fit_transform(df[i])

for i in cat_columns:
    df1[i] = encoders[i].transform(df1[i])
The error you encounter (previously unseen labels: 'Male') is caused by the fact that you are trying to transform the gender column using the last encoder created in the previous for loop, which in your case might be the encoder fitted on smoking_status.
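For intuition, here is a minimal sketch (with made-up values) of that failure mode: a single encoder only remembers the classes from the last column it was fitted on, so values from an earlier column look unseen.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit_transform(["Male", "Female"])          # fitted on gender...
encoder.fit_transform(["smokes", "never smoked"])  # ...then re-fitted on smoking_status

# The encoder now only knows the smoking_status classes, so transforming
# gender values raises: y contains previously unseen labels: 'Male'
encoder.transform(["Male"])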
How can I do OneHotEncoding after LabelEncoding in sklearn?
What I have done so far is map all the string features of my dataset like this:
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that I applied this to the dataset columns, with indexing, like this:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not very good, so I want to use OneHotEncoding to see if performance improves.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what I am trying to accomplish here is to use the OneHotEncoding technique to overwrite the features.
OneHotEncoder directly supports categorical features, so there is no need to use a LabelEncoder prior to using it. Also note that you should not use a LabelEncoder to encode features. Check LabelEncoder for features? for a detailed explanation on this. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch of how you could proceed:
# OneHot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you want to reconstruct the DataFrame (as your code suggests), get_feature_names will give you the category names of the categorical features:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
columns=one_hot_cols.get_feature_names(input_features=oh_cols))
I am trying to use a high cardinality feature (siteid) in a sci-kit learn model and am using get_dummies to one-hot encode this feature. I get around 800 new binary columns which returns a decent accuracy using logistic regression. My problem is that when I pass a new dataset through my model I have a different cardinality on this feature with say 300 unique values and the model rightly asks, where are the other 500 columns you trained me on? How can I resolve this?
I don't want to have to retrain the model every time the cardinality changes, nor do I want to hard-code these columns in my SQL data load.
cat_columns = ["siteid"]
df = pd.get_dummies(df, prefix_sep="__",
columns=cat_columns)
My recommendation would be to pad the remaining columns with zeros. So if your new data has, for example, 10 unique values and the model expects 50 values (total_cols), then create 40 columns of zeros to the right to "fill out" the rest of the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"siteid": range(10)})
cat_columns = ["siteid"]
df1 = pd.get_dummies(df, columns=cat_columns)
# df1 has shape (10, 10)
total_cols = 50  # Number of columns that the model expects
zero_padding = pd.DataFrame(np.zeros((df1.shape[0], total_cols - df1.shape[1])))
df = pd.concat([df1, zero_padding], axis=1)
df.columns = ["siteid__" + str(i) for i in range(df.shape[1])]
# df now has shape (10, 50)
I suggest using scikit-learn's OneHotEncoder; see its documentation for details.
In your case, the usage would look something like
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[['cat_columns']])
categories = [cat for cats in enc.categories_ for cat in cats]
df[categories] = enc.transform(df[['cat_columns']]).toarray()  # densify the sparse output before assigning
The handle_unknown parameter is key, and keeping the fitted enc object is necessary for repeatability on new data.
On new dataframes you would run
df_new[categories] = enc.transform(df_new[['cat_columns']]).toarray()
This will one-hot encode the same categories and ignore any new ones that your model is not accustomed to.
I have a training data set that has categorical features on which I use pd.get_dummies to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.
For example, suppose your training DataFrame tradf has the dummy columns ['A_1','A_2'].
If your new df has the column ['A'] but contains only category 1, you can do
pd.get_dummies(df).reindex(columns=tradf.columns, fill_value=0)
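As a rough sketch (with made-up values), this is what the reindex trick looks like end to end: the new data is encoded, then aligned to the training columns, with missing dummies filled with zeros.

import pandas as pd

# Hypothetical training frame: categories 1 and 2 were present at training time
tradf = pd.get_dummies(pd.DataFrame({"A": [1, 2, 1]}), columns=["A"])
print(tradf.columns.tolist())  # ['A_1', 'A_2']

# New frame only contains category 1, so get_dummies alone would produce just 'A_1'
df = pd.DataFrame({"A": [1, 1]})
new = pd.get_dummies(df, columns=["A"]).reindex(columns=tradf.columns, fill_value=0)
print(new.columns.tolist())  # ['A_1', 'A_2'] -- 'A_2' is filled with zeros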
I have a large multi-dimensional unlabelled dataset of cars (price, mileage, horsepower, ...) for which I want to find outliers. I decided to use the sklearn OneClassSVM to build a decision boundary, and I have two main issues with my approach:
My dataset contains a lot of missing values. Is there a way to make the SVM classify data with missing features as an inlier if any possible values for the missing features would be an inlier?
I now want to add a feedback loop of manually moderated outliers. The manually moderated data should improve the classification of the SVM. I've read about the LabelSpreading model for semi-supervised learning. Would it be feasible to feed the classification output of the OneClassSVM to the LabelSpreading model and retrain this model once a sufficient number of records have been manually validated?
For the first question: you could use sklearn.preprocessing.Imputer to impute the missing values with the mean or median:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
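As a minimal sketch (note: in newer scikit-learn versions the linked Imputer class has been replaced by sklearn.impute.SimpleImputer, which serves the same purpose here; the feature values below are made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer in newer versions

# Hypothetical car data with missing values
X = pd.DataFrame({"price": [12000, np.nan, 9500], "mileage": [80000, 120000, np.nan]})

# Replace each NaN with the median of its column before fitting the OneClassSVM
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)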
You could also add some boolean features that record whether other features had NaNs. So if you have features X_1 and X_2, you add the boolean features
X_1_has_NaN and X_2_has_NaN
that are 1 if X_1 or X_2, respectively, is NaN. If X is your original pd.DataFrame, you can create these as follows:
import pandas as pd

X = pd.DataFrame()
# Create your features here

# Get the locations of the NaNs (1.0 where a value is missing, 0.0 otherwise)
X_2 = 1.0 * X.isnull()
# Rename the indicator columns
X_2.rename(columns=lambda x: str(x) + "_has_NaN", inplace=True)
# Paste them together
X = pd.concat([X, X_2], axis=1)
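For a concrete (made-up) example of what this produces:

import numpy as np
import pandas as pd

X = pd.DataFrame({"price": [12000, np.nan], "mileage": [np.nan, 120000]})
X_2 = 1.0 * X.isnull()
X_2.rename(columns=lambda x: str(x) + "_has_NaN", inplace=True)
print(pd.concat([X, X_2], axis=1))
# The result has four columns: price, mileage, price_has_NaN, mileage_has_NaN,
# where the *_has_NaN columns are 1.0 exactly where the original value is missing.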