How to apply pyspark-mllib-kmeans to categorical variables

There is a huge data file consisting of all categorical columns. I need to dummy code the data before applying kmeans in mllib. How is this doable in pySpark?
Thank you

Well, technically it is possible. Spark, including PySpark, provides a number of transformers which can be used to encode categorical data. In particular, you should take a look at ml.feature.StringIndexer and OneHotEncoder.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["label", "feature"])
stringIndexer = StringIndexer(inputCol="feature", outputCol="indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)
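encoder = OneHotEncoder(inputCol="indexed", outputCol="encoded")
# Spark 3.0+: OneHotEncoder is an estimator, so it has to be fit before transforming.
# (On Spark 2.x it was a plain transformer: encoder.transform(indexed).)
encoded = encoder.fit(indexed).transform(indexed)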
So far so good. The problem is that categorical variables are not very useful for k-means. It assumes a Euclidean norm which, even after encoding, is rather meaningless for categorical data.
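To illustrate why the Euclidean norm says little here, the following minimal sketch (plain NumPy, not Spark; the three category names are made up for illustration) computes the pairwise distances between one-hot vectors. Every pair of distinct categories ends up exactly the same distance apart, so the geometry k-means relies on carries no information about which categories are "closer".

```python
import numpy as np

# Hypothetical categories, one-hot encoded as the rows of an identity matrix.
categories = ["foo", "bar", "baz"]
one_hot = np.eye(len(categories))

# Pairwise Euclidean distances between all one-hot vectors.
diffs = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Every off-diagonal entry is sqrt(2): all distinct categories are equidistant.
print(distances)
```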

Related

Kmeans clustering measuring important features

How do you measure which features of your dataframe are important for your K-means model?
I'm working with a dataframe that has 37 columns of which 33 columns are of categorical data.
These 33 data columns go through one-hot-encoding and now I have 400 columns.
I want to see which columns have an impact on my model and which don't.
Is there a method for this or do I loop this?

For categorical values there is K-Modes, and for mixed data (categorical and continuous values) there is K-Prototypes. Those might be worth trying and are potentially easier to evaluate. You wouldn't use one-hot encoding there, though.
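As a rough sketch of the idea behind K-Modes (not a full clustering implementation): it compares rows by counting mismatched attributes instead of using Euclidean distance, and it uses the per-column mode as the cluster center instead of the mean. The helper names and the toy rows below are hypothetical.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value of each column, used as the cluster 'center'."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Toy data: each row is (color, size).
rows = [("red", "small"), ("red", "large"), ("blue", "large")]

d = matching_dissimilarity(rows[0], rows[2])  # both attributes differ
center = column_modes(rows)                   # per-column mode
print(d, center)
```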

Differences between OneHotEncoding (sklearn) and get_dummies (pandas)

I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code is currently working like the one below:
import pandas as pd

def one_hot_encode(ds, feature):
    # Get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    # One dummy variable to drop (dummy trap)
    dummyDrop = dummies.columns[0]
    # Create a DF from the original and the dummies' DF,
    # then drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final

# Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
# Perform one-hot encoding on the categorical features (see function above)
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)
# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()

If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should obtain the same result.
If you apply get_dummies() to the whole dataset and fit OneHotEncoder() on the training set only, you will probably see a few (very small) differences if the test data contains a "new" category. If not, they should give the same result.
The main difference between get_dummies() and OneHotEncoder() is how they behave when the model is used in real life (or in production) and you receive a "new" class in a categorical column that you have not seen before.
Example: Imagine your category "sex" can only be male or female, and you sold your model to a company. What happens if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" was always an option but appears in only 0.001% of cases, and by chance none of those values are in your dataset.)
Using get_dummies(), you will have a problem: your model was trained on only 2 categories of sex, and now there is a new category that the model cannot handle.
Using OneHotEncoder() with handle_unknown='ignore' lets you "ignore" this new category, so the model input keeps the same shape for the new sample. (With the default handle_unknown='error', transform raises an error instead.)
That's why people fit OneHotEncoder() on the training set and not on the whole dataset: they are "simulating" this kind of event (a "new" class in a categorical column that was not seen before).
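The difference can be seen in a small sketch. The toy frames below are made up for illustration, and the encoder is constructed with handle_unknown='ignore', which is an explicit choice rather than the default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: training rows, plus a new sample with a category never seen before.
train = pd.DataFrame({"sex": ["male", "female", "male"]})
new = pd.DataFrame({"sex": ["NA"]})

# get_dummies builds its columns from whatever values it sees each time,
# so the new sample gets a different set of columns than the training data.
train_dummies = pd.get_dummies(train["sex"])
new_dummies = pd.get_dummies(new["sex"])

# A fitted OneHotEncoder remembers the training categories; with
# handle_unknown="ignore" an unseen value becomes an all-zero row.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["sex"]])
new_encoded = enc.transform(new[["sex"]]).toarray()

print(list(train_dummies.columns), list(new_dummies.columns))
print(new_encoded)
```

The encoded new sample has the same number of columns as the training encoding, so it can be fed to a model trained on that encoding.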

standardizing data column-wise before using keras models

I'm working with a large dataset whose data I want to standardize to use with a CNN.
Does keras have a quick utility to standardize a block of numbers column-wise that you can use inside a Sequential model? I'm asking this because I expect the data to eventually be used online, so ideally this standardization feature could be applied to incoming data, i.e. a trailing moving average of the mean and std to normalize the incoming data.
import numpy as np
import pandas as pd
np.random.seed(42)
col_names = ['Column' + str(x+1) for x in range(5)]
training_data = pd.DataFrame(np.random.randint(1,10 **6, 50).reshape(-1,5), columns = col_names)

I am not sure about the online case, but sklearn's StandardScaler(), as described here, seems like the right tool.

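This can be done with sklearn. StandardScaler standardizes each column (feature) by default, so no transpose is needed:
from sklearn.preprocessing import StandardScaler
scaled = StandardScaler().fit_transform(training_data)
training_data = pd.DataFrame(scaled, columns=training_data.columns, index=training_data.index)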
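For the online part of the question, one option outside of Keras is StandardScaler.partial_fit, which updates the running per-column mean and variance batch by batch. The sketch below uses made-up batches. Note that this accumulates statistics over all data seen so far; it is not a trailing moving average, which would need extra logic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two hypothetical incoming batches with 5 columns each.
batch_1 = rng.integers(1, 10**6, size=(10, 5)).astype(float)
batch_2 = rng.integers(1, 10**6, size=(10, 5)).astype(float)

scaler = StandardScaler()
scaler.partial_fit(batch_1)   # statistics from the first batch
scaler.partial_fit(batch_2)   # updated with the second batch

# Standardize incoming data column-wise with the statistics seen so far.
scaled = scaler.transform(batch_2)

# The running statistics match those of all rows seen so far.
all_rows = np.vstack([batch_1, batch_2])
print(np.allclose(scaler.mean_, all_rows.mean(axis=0)))
```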

OneHotEncoding after LabelEncoding

How can I do OneHotEncoding after LabelEncoding in sklearn?
What I have done so far is map all the string features of my dataset like this:
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that I applied the encoder to those dataset columns, like this:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not very good, so I want to use OneHotEncoding to see if performance improves.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what I am trying to accomplish here is to use the OneHotEncoding technique to overwrite the features.

OneHotEncoder directly supports categorical features, so there is no need to use a LabelEncoder before it. Also note that you should not use a LabelEncoder to encode features. See "LabelEncoder for features?" for a detailed explanation. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch of how you could proceed:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# OneHot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you want to reconstruct the dataframe (as your code suggests), get_feature_names_out (get_feature_names in scikit-learn versions before 1.0) will give you the names of the encoded columns:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
                        columns=one_hot_cols.get_feature_names_out(input_features=oh_cols))

Passing categorical data to Sklearn Decision Tree

There are several posts about how to encode categorical data to Sklearn Decision trees, but from Sklearn documentation, we got these
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data. Is it possible with sklearn?

(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but is computationally expensive.

(..)
Able to handle both numerical and categorical data.
This only means that you can use the DecisionTreeClassifier class for classification problems and the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])

For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.

As of v0.24.0, scikit-learn natively supports categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor!
To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])
Equivalently, one can pass a list of integers indicating the indices of the categorical features:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])
You still need to encode your strings, otherwise you will get a "could not convert string to float" error. See here for an example of using OrdinalEncoder to convert strings to integers.

Yes, a decision tree is able to handle both numerical and categorical data.
That holds true in theory, but in practice you should apply either OrdinalEncoder or one-hot encoding to the categorical features before training or testing the model. Always remember that ML models don't understand anything other than numbers.

Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:
def cat2int(column):
    vals = list(set(column))
    for i, string in enumerate(column):
        column[i] = vals.index(string)
    return column

You can apply a conversion method such as one-hot encoding to transform your categorical data into numeric values and then build the tree.
Refer to this URL for more information:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

With sklearn classifiers, you can model categorical variables both as an input and as an output.
Let's assume you have categorical predictors and categorical labels (i.e. a multi-class classification task). Moreover, you want to handle missing or unknown values for both predictors and labels.
First, you need an encoder such as OrdinalEncoder.
Basic example:
# encoders
from sklearn.preprocessing import OrdinalEncoder
input_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)
output_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1 )
input_enc.fit(df[['Attribute A','Attribute B']].values)
output_enc.fit(df[['Label']].values)
# build classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
X = input_enc.transform(df[['Attribute A','Attribute B']].values)
Y = output_enc.transform(df[['Label']].values)
clf.fit(X, Y)
# predict
predicted = clf.predict(input_enc.transform([('Value 1', 'Value 2')]))
predicted_label = output_enc.inverse_transform([predicted])
If you use df[...].values, your encoder will not store attribute names (column names). This does not matter as long as the same format is used for enc.transform() or enc.inverse_transform() (otherwise you will get a warning).
By default, OrdinalEncoder does not encode nan values, and they are not handled by clf.fit(). This is solved by the encoded_missing_value param.
In the prediction phase, the encoder will by default throw an error when asked to transform unknown labels. This is handled by the handle_unknown param.

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes them into numbers for your machine learning algorithms. Now this also supports going back to strings from integers. You can do this by simply calling inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This would return ['tokyo', 'tokyo', 'paris'].
Also note that for many other classifiers, apart from decision trees, such as logistic regression or SVM, you would like to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
Hope this helps!
