I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+) with an average of 4 categories each, and a very large number of instances. I've found two ways to do this:
1. Using StringIndexer + OneHotEncoder + VectorAssembler + Pipeline from pyspark.ml. I like this approach because I can just chain several of these transformers and get a final one-hot encoded vector representation.
2. Appending binary columns to the main dataframe (using .withColumn), where each column represents a category of a categorical feature.
Since I'll be modelling (including feature selection and data balancing) using Python packages, I need the one-hot encoded features to be encoded as in Point 2. And since .withColumn is computationally very expensive, this is not feasible with the number of columns I end up with. Is there any way to get a reference to the column names after fitting the Pipeline described in Point 1? Or is there any way to make a scalable version of Point 2 that doesn't involve sequentially appending more columns to the dataframe?
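For reference, the generated slot names can usually be recovered from the metadata Spark attaches to the assembled vector column. A minimal sketch, assuming the pipeline's VectorAssembler writes to a column called "features" and that the fitted pipeline has already produced a transformed DataFrame named encoded_df (both names are illustrative):

# Spark ML stores per-slot attributes ("numeric", "binary", "nominal") under "ml_attr"
attrs = encoded_df.schema["features"].metadata["ml_attr"]["attrs"]
pairs = [(a["idx"], a["name"]) for group in attrs.values() for a in group]
slot_names = [name for _, name in sorted(pairs)]  # one generated name per vector position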
I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising it?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to just one-hot encode them and start training the model on the resulting numerical data. But with the number of columns you have, and depending on the number of unique values in such a large amount of data, the dimensionality will increase considerably.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect the model to work, as there is a very high chance that the model will pick up on a numeric ordering that does not exist in the first place (more info).
Bag of words, Word2Vec - Since you excluded tokenization, I don't know whether you want to do this, but these are also options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + VectorAssembler - I only came up with this in theory. If you manage to convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here, and a sketch follows below.
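To make that last point concrete, here is a minimal sketch of such a pipeline; the column names are made up, and the exact OneHotEncoder constructor arguments differ slightly between Spark 2.x and 3.x:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cat_cols = ["country", "colour"]  # illustrative string columns
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep") for c in cat_cols]
encoders = [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec") for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_vec" for c in cat_cols], outputCol="features")
encoded_df = Pipeline(stages=indexers + encoders + [assembler]).fit(df).transform(df)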
I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?
Options 1 & 2
If you do use OneHotEncoder, the problem is how to transform that one vector column into separate binary columns. Searching around, I found that most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:
Use a UDF. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to use one repeatedly for every categorical feature I want to one-hot encode.
Convert to an RDD. Then, for each row, extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    # keep all original column values and append the entries of the one-hot vector
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)
The problem is that this will extract each value for each row and then convert that back into a dataframe. This sounds horrible if you need to do this for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.
Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not, maybe it's better not to use OneHotEncoder at all, since I can't use the result efficiently. Which leads me to Option 3.
Option 3
Option 3 would be to do the one-hot encoding myself using a reduce over withColumn calls. For example:
df = reduce(
    # 1 if the row's value matches the category, 0 otherwise
    lambda df, category: df.withColumn(
        category,
        sf.when(sf.col('categoricalFeature') == category, sf.lit(1)).otherwise(sf.lit(0))
    ),
    categories,
    df
)
Where categories is of course the list of possible categories in the feature categoricalFeature.
Before I spend hours or days implementing this, waiting for the results, debugging, etc., I wanted to ask whether anyone has experience with this. Is there an efficient way I'm missing? If not, which of the three would probably be fastest?
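One possible alternative worth sketching: the sequential withColumn cost can be avoided by producing all the binary columns in a single select; this uses the same illustrative names categoricalFeature and categories as above:

import pyspark.sql.functions as sf

one_hot_df = df.select(
    "*",
    *[(sf.col("categoricalFeature") == category).cast("int").alias(str(category))
      for category in categories]
)  # one 0/1 column per category, added in a single projection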
When I use Python's scikit-learn for machine learning projects, I often use one-hot encoding.
My X dataset consists of rows like this: [1, 2, [1, 0]], where the third entry ([1, 0]) comes from the one-hot encoding.
Is this equivalent to using a data set where the rows are like [1,2,1,0]?
(I.e. where the rows have been 'flattened')
Individual values of the one-hot encoded vector can be flattened into individual features. There is a dedicated function in pandas to perform this: get_dummies().
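A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 3], "color": ["red", "blue"]})
flat = pd.get_dummies(df, columns=["color"])  # one 0/1 column per color, alongside the flat "a" and "b" columns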
As far as I know, VectorAssembler lets you combine multiple columns into a single column containing a Vector, which you can later pass to different ML algorithms and preprocessing implementations.
I would like to know whether there is something like a "VectorDisassembler", that is, a helper which would take one Vector column and split its values back into multiple columns (e.g. at the end of an ML pipeline)?
If not, what is the best way to achieve that (best in Python, if possible)?
Here's what I had in mind:
from pyspark.sql import Row

PcaComponents = Row(*["p" + str(i) for i in range(35)])  # a Row "class" with one field per component
pca_features = reduced_dataset_df.rdd.map(lambda x: PcaComponents(*x[0].values.tolist())).toDF()
Can we do better?
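As far as I know there is no built-in "VectorDisassembler", but on newer Spark versions the detour through the RDD API can be avoided. A minimal sketch, assuming Spark 3.0+ (where pyspark.ml.functions.vector_to_array is available), a vector column named pca_features, and 35 components; the names are illustrative:

from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as sf

n = 35  # known vector length
arr_df = reduced_dataset_df.withColumn("arr", vector_to_array(sf.col("pca_features")))
flat_df = arr_df.select(*[sf.col("arr")[i].alias("p" + str(i)) for i in range(n)])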
Say I have a categorical feature, color, which takes the values
['red', 'blue', 'green', 'orange'],
and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.
I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.
No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.
A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.
This article by Will McGinnis has a very good discussion of one-hot-encoding and alternatives.
This article by Nick Dingwall and Chris Potts has a very good discussion about categorical variables and tree-based learners.
You have to turn the categorical variable into a series of dummy variables. Yes, I know it's annoying and seems unnecessary, but that is how sklearn works.
If you are using pandas, use pd.get_dummies; it works really well.
Maybe you can use 1 to 4 to replace these four colors, i.e., put the number rather than the color name in that column. The column with numbers can then be used in the models.
No.
There are 2 types of categorical features:
Ordinal: use OrdinalEncoder
Nominal: use LabelEncoder or OneHotEncoder
Note the differences between LabelEncoder and OneHotEncoder:
Label: only for one column => usually we use it to encode the label column (i.e., the target column)
OneHot: for multiple columns => can handle more features at one time
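A minimal sketch of the encoders mentioned above, with made-up data (OrdinalEncoder and OneHotEncoder work on 2-D column input, while LabelEncoder expects a single 1-D target):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["small", "large", "medium"], "color": ["red", "blue", "red"]})
size_num = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(df[["size"]])
color_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["color"]]).toarray()
target = LabelEncoder().fit_transform(["yes", "no", "yes"])  # typically used on the target column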
You can directly feed categorical variables to a random forest using the approach below:
First, convert the categories of the feature to numbers using sklearn's LabelEncoder.
Second, convert the label-encoded feature's type to string (object):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df[col] = le.fit_transform(df[col]).astype('str')
The above code should solve your problem.