How do you measure which features of your dataframe are important for your K-means model?
I'm working with a dataframe that has 37 columns of which 33 columns are of categorical data.
These 33 data columns go through one-hot-encoding and now I have 400 columns.
I want to see which columns have an impact on my model and which don't.
Is there a method for this or do I loop this?
For categorical values there is K-Modes, and for mixed data (categorical and continuous values) there is K-Prototypes. Either might be worth trying and potentially easier to evaluate. You wouldn't use one-hot encoding there, though.
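To make the idea concrete, here is a minimal sketch of the simple matching dissimilarity that K-Modes is built on, together with a per-column mode acting as the cluster center. It is not the API of any K-Modes library; the helper names and the toy rows are made up for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent value of each column (a cluster 'mode')."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]


# Hypothetical toy rows: (workclass, sex, marital_status)
rows = [
    ("private", "male", "married"),
    ("private", "female", "married"),
    ("state", "female", "single"),
]

center = column_modes(rows)
distances = [matching_dissimilarity(r, center) for r in rows]
print(center)
print(distances)
```

Because the distance is a sum over the original columns, each column's contribution can be inspected directly, which may be easier to reason about than 400 one-hot columns.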
I am wondering what the difference is between pandas' get_dummies() encoding of categorical features and sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code is currently working like the one below:
import pandas as pd
def one_hot_encode(ds, feature):
    # Get a DataFrame of dummy variables
    dummies = pd.get_dummies(ds[feature])
    # One dummy variable to drop (dummy variable trap)
    dummyDrop = dummies.columns[0]
    # Create a DF from the original and the dummies' DF,
    # then drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final
# Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
# Perform one-hot encoding on the DF (see function above) on categorical features
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)
# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should obtain the same result.
If you apply get_dummies() to the whole dataset and fit OneHotEncoder() on the training set only, the results will differ slightly whenever the test data contains a "new" category that never appears in the training data. Otherwise, they should match.
The main difference between get_dummies() and OneHotEncoder() is how they behave when the model is used in real life (in production) and you receive a "new" class in a categorical column that you haven't seen before.
Example: Imagine your category "sex" can only be male or female, and you sold your model to a company. What happens if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" was always an option but appears only 0.001% of the time, and by chance none of the rows in your dataset have it.)
Using get_dummies(), you will have a problem: your model was trained on only 2 categories of sex, and now there is a new category that the model can't handle.
Using OneHotEncoder() with handle_unknown='ignore' (the default is to raise an error) allows you to "ignore" this new category: it is encoded as all zeros, so the new sample keeps the same shape as the model input.
That's why people fit OneHotEncoder() on the training set and not on the whole dataset: they are "simulating" this kind of event (a "new" class you haven't seen before in a categorical column).
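The difference can be seen on a toy example. This is only a sketch: the two small frames are made up, and it assumes scikit-learn's OneHotEncoder is created with handle_unknown="ignore", since the default setting raises an error on unseen categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "NA" never appears in the training set
train = pd.DataFrame({"sex": ["male", "female", "female"]})
new = pd.DataFrame({"sex": ["male", "NA"]})

# get_dummies builds its columns from whatever values each frame contains,
# so the two frames end up with different column sets
train_cols = list(pd.get_dummies(train).columns)
new_cols = list(pd.get_dummies(new).columns)
print(train_cols, new_cols)

# OneHotEncoder remembers the training categories;
# with handle_unknown="ignore" an unseen value becomes a row of zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_new = encoder.transform(new).toarray()
print(encoded_new)
```

The encoded array keeps one column per training category, so it can be passed to a model trained on the training encoding.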
I am using LIME (Local Interpretable Model-agnostic Explanations) with mixed feature types in order to evaluate my model predictions for a classification task. Does anyone know how to specify binary features in the lime.lime_tabular.LimeTabularExplainer() method? How does LIME actually handle these types of features (several features with only 1's and 0's)?
I think you should declare your binary features as categorical features so that your LIME explainer can use its sampling mechanism efficiently when performing local perturbation around the studied sample.
You can do it using the categorical_features keyword parameter in the LimeTabularExplainer constructor.
from lime.lime_tabular import LimeTabularExplainer
my_binary_feature_column_index = 0  # put your column index here
# categorical_names[i][v] is the display name of value v of column i
explainer = LimeTabularExplainer(my_data, categorical_features=[my_binary_feature_column_index], categorical_names={my_binary_feature_column_index: ["foo", "bar"]})
categorical_features is a list of categorical column indexes, and
categorical_names is a dictionary mapping each column index to its list of category names.
As mentioned in the LIME code:
Explains predictions on tabular (i.e. matrix) data.
For numerical features, perturb them by sampling from a Normal(0,1) and
doing the inverse operation of mean-centering and scaling, according to the
means and stds in the training data. For categorical features, perturb by
sampling according to the training distribution, and making a binary
feature that is 1 when the value is the same as the instance being
explained.
So, categorical features are one hot encoded under the hood and the value 0 or 1 is used according to the feature distribution in your training dataset (unless you chose to use a LabelEncoder, which will result in LIME processing the feature as a continuous variable).
A good tutorial is available in the LIME project: https://github.com/marcotcr/lime/blob/master/doc/notebooks/Tutorial%20-%20continuous%20and%20categorical%20features.ipynb
In sklearn, how can I do OneHotEncoding after LabelEncoding?
What I have done so far is map all the string features of my dataset like this:
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that I applied this to the dataset columns, with indexing like this:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not very good, so I want to use OneHotEncoding to see if performance improves.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what I am trying to accomplish here is to use the OneHotEncoding technique to overwrite the features.
OneHotEncoder directly supports categorical features, so there is no need to use a LabelEncoder before it. Also note that you should not use a LabelEncoder to encode features; see "LabelEncoder for features?" for a detailed explanation. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch of how you could proceed:
from sklearn.preprocessing import OneHotEncoder
# One-hot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
# fit returns the fitted encoder itself
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you want to reconstruct the dataframe (as your code suggests), get_feature_names_out (get_feature_names before scikit-learn 1.0) will give you the category names of the categorical features:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
                        columns=one_hot_cols.get_feature_names_out(input_features=oh_cols))
I have a training data set that has categorical features on which I use pd.get_dummies to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.
For example, say your encoded training frame tradf has the columns ['A_1', 'A_2'].
Your new df has the column 'A', but it only contains category 1. You can do:
pd.get_dummies(df).reindex(columns=tradf.columns,fill_value=0)
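A slightly fuller version of the same idea, as a sketch on made-up data (the frame contents and variable names here are illustrative only):

```python
import pandas as pd

# Hypothetical training data with two categories and new data with only one
train = pd.DataFrame({"A": ["1", "2", "1"]})
new = pd.DataFrame({"A": ["1", "1"]})

# Encode the training data once and keep its column layout
train_dummies = pd.get_dummies(train)

# Encode the new data, then align it to the training columns:
# missing dummy columns are added and filled with 0
new_dummies = pd.get_dummies(new).reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(new_dummies)
```

Note that reindex also drops any dummy column that exists only in the new data, so categories unseen during training are silently ignored.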
There is a huge data file consisting of all categorical columns. I need to dummy code the data before applying kmeans in mllib. How is this doable in pySpark?
Thank you
Well, technically it is possible. Spark, including PySpark, provides a number of transformers which can be used to encode categorical data. In particular, you should take a look at ml.feature.StringIndexer and OneHotEncoder.
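from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Assumes an active SparkContext `sc` (toDF also needs a SparkSession)
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["label", "feature"])
# Map each string category to a numeric index
stringIndexer = StringIndexer(inputCol="feature", outputCol="indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)
# One-hot encode the indexed column
encoder = OneHotEncoder(inputCol="indexed", outputCol="encoded")
# Spark 3.0+: OneHotEncoder is an estimator and must be fitted first;
# on Spark 2.x call encoder.transform(indexed) directly instead
encoded = encoder.fit(indexed).transform(indexed)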
So far so good. The problem is that categorical variables are not very useful for k-means: it relies on Euclidean distance, which, even after encoding, is rather meaningless for categorical data.
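One way to see why: after one-hot encoding, every pair of distinct categories is exactly the same distance apart, so the distance carries no notion of "closer" or "further" categories. A small plain-Python sketch (the category names beyond "foo" and "bar" are made up, and math.dist needs Python 3.8+):

```python
import math
from itertools import combinations


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical category values of a single column
categories = ["foo", "bar", "baz", "qux"]
vectors = {name: one_hot(i, len(categories)) for i, name in enumerate(categories)}

# Euclidean distance between every pair of distinct categories
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in combinations(categories, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))
```

Every pair comes out at the square root of 2, regardless of which categories are compared.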