Most efficient way to one-hot encode using pyspark - python

I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?
Option 1 & 2
If you indeed use OneHotEncoder, the problem is how to transform that one vector column into separate binary columns. When searching around, I found that most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:
Use a UDF. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to use one repeatedly for every categorical feature I want to one-hot encode.
Convert to an RDD. Then, for each row, extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    # keep every existing field and append the one-hot vector's values as plain floats
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
result = df.rdd.map(extract).toDF(allColNames)
The problem is that this will extract each value for each row and then convert that back into a dataframe. This sounds horrible if you need to do this for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.
Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not, maybe it's better not to use OneHotEncoder at all, since I can't use its result efficiently. Which leads me to option 3.
Option 3
Option 3 would be to just one-hot encode myself using reduce statements and withColumn. For example:
from functools import reduce
from pyspark.sql import functions as sf

df = reduce(
    lambda df, category: df.withColumn(category, sf.when(sf.col('categoricalFeature') == category, sf.lit(1)).otherwise(sf.lit(0))),
    categories,
    df
)
Where categories is of course the list of possible categories in the feature categoricalFeature.
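For what it's worth, the same logic can also be written as a single select instead of a chain of withColumn calls, which keeps the query plan flat; this is just a sketch of the same idea, assuming sf is pyspark.sql.functions:
dummy_cols = [
    # one binary column per category, named after the category value
    sf.when(sf.col('categoricalFeature') == category, sf.lit(1)).otherwise(sf.lit(0)).alias(category)
    for category in categories
]
df = df.select('*', *dummy_cols)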
Before I spend hours/days implementing this, waiting for the results, debugging, etc., I wanted to ask whether anyone has any experience with this. Is there an efficient way I'm missing? If not, which of the three would probably be fastest?

Related

Is there a way to compare two dataframes and report which column is different in Pyspark?

I'm using df1.subtract(df2).rdd.isEmpty() to compare two dataframes (assuming the schemas of the two dataframes are the same, or at least we expect them to be the same), but if one of the columns doesn't match, I can't tell from the output logs, and it takes a long time to track down the issue in the data (which is exhausting when the datasets are quite big).
Is there a way to compare two dataframes and report which column doesn't match with PySpark? Thanks a lot.
You could use the chispa library; it is a great tool for comparing dataframes.
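For example, a minimal sketch using chispa's assert_df_equality (the exact error output depends on the chispa version):
from chispa import assert_df_equality

# raises a descriptive error showing the mismatching rows/columns
# instead of the opaque result of subtract().rdd.isEmpty()
assert_df_equality(df1, df2, ignore_row_order=True)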

Building machine learning with a dataset which has only string values

I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to just one-hot encode them and start training the model with the resulting numerical data. But with the number of columns that you have, and given the unique values in such a large amount of data, the dimensionality will increase by a lot.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect our model to work, as there is a very high chance that the model will pick up a numeric order which does not exist in the first place (more info).
Bag of words, Word2Vec - Since you excluded tokenisation, I don't know if you want to do this, but these options exist as well.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + Vector Assembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be converted into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here; a sketch of this idea follows below.
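A minimal sketch of that last point, assuming a hypothetical string column country and Spark 3.x, where OneHotEncoder is an estimator (on older versions the equivalent class is OneHotEncoderEstimator):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# 'country' is a hypothetical column name; repeat the indexer/encoder per categorical column
indexer = StringIndexer(inputCol='country', outputCol='country_idx')
encoder = OneHotEncoder(inputCols=['country_idx'], outputCols=['country_vec'])
assembler = VectorAssembler(inputCols=['country_vec'], outputCol='features')

pipeline = Pipeline(stages=[indexer, encoder, assembler])
encoded = pipeline.fit(df).transform(df)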

pySpark scalable onehotencoder with binary representation

I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+) with an average of 4 categories each, and a very large number of instances. I've found two ways to do this:
Using StringIndexer + OneHotEncoder + VectorAssembler + Pipeline from pyspark.ml (MLlib). I like this approach because I can just chain several of these transformers and get a final one-hot encoded vector representation.
Appending binary columns to the main dataframe (using .withColumn), where each column represents a category from a categorical feature.
Since I'll be modelling (including feature selection and data balancing) using Python packages, I need the one-hot encoded features to be encoded as in Point 2. And since .withColumn is computationally very expensive, this is not feasible with the resulting number of columns I have. Is there any way to get a reference to the column names after fitting the Pipeline described in Point 1? Or is there any way to make a scalable version of Point 2 that doesn't involve sequentially appending more columns to the dataframe?
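Regarding the column names in Point 1: Spark records the name of each assembled feature in the output column's metadata, so it may be possible to recover them without touching the data itself. A rough sketch, assuming the pipeline's final vector column is called features (the exact metadata layout can vary by Spark version):
# after fitting/transforming with the Pipeline from Point 1
attrs = encoded.schema['features'].metadata['ml_attr']['attrs']

# 'binary' holds the one-hot dummy features, 'numeric' any passthrough numeric ones
names_by_index = {a['idx']: a['name'] for group in attrs.values() for a in group}
feature_names = [names_by_index[i] for i in sorted(names_by_index)]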

Aggregating data in a tf.Dataset

I have a problem aggregating rows and transforming rows in a tf.data.Dataset.
Each row has a string id and a string category, where some categories are subcategories of others.
I want to transform the dataset by mapping each category value to a one-hot encoding of the base categories, then grouping the rows by id and summing up the one-hot encodings.
I can combine multiple rows using tf.data.experimental.group_by_reducer, but I cannot for the life of me figure out how to map them to one-hot encodings before reducing them.
Any help would be appreciated.
So far I have tried using tf.one_hot, but it does not work with strings.
I've also tried to implement a tf.lookup.StaticHashTable but could not get it to work with tensors as values, it complained about shape.
Unfortunately the code was written in a notebook and is gone now...
Regards
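For what it's worth, here is a minimal sketch of the mapping step described above, with hypothetical base categories and field names; the lookup table turns the string category into an index so tf.one_hot can be applied afterwards:
import tensorflow as tf

base_categories = ['a', 'b', 'c']  # hypothetical base categories

# map each category string to an integer index
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(base_categories),
        values=tf.range(len(base_categories), dtype=tf.int64)),
    default_value=-1)

def to_one_hot(example):
    idx = table.lookup(example['category'])
    return example['id'], tf.one_hot(idx, depth=len(base_categories))

ds = tf.data.Dataset.from_tensor_slices(
    {'id': ['x', 'x', 'y'], 'category': ['a', 'c', 'b']}).map(to_one_hot)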

Spark VectorAssembler

As far as I know, VectorAssembler enables you to combine multiple columns into one column containing a Vector, which you can later pass to different ML algorithms and preprocessing implementations.
I would like to know whether there's something like "VectorDisassembler", that is, a helper which would take one Vector column and split its values back into multiple columns (e.g. at the end of ML pipeline)?
If not, what is the best way to achieve that (best in Python, if possible)?
Here's what I had in mind:
from pyspark.sql import Row

PcaComponents = Row(*["p" + str(i) for i in range(35)])
# on Spark 2.x+ the Python DataFrame no longer exposes .map directly, so go through .rdd
pca_features = reduced_dataset_df.rdd.map(lambda x: PcaComponents(*x[0].values.tolist())).toDF()
Can we do better?
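If you are on Spark 3.0 or later, pyspark.ml.functions.vector_to_array may do most of the work; a rough sketch, assuming the vector column is called features and has 35 components:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as sf

# convert the vector column to an array, then slice it into one column per component
arr_df = reduced_dataset_df.withColumn('arr', vector_to_array('features'))
result = arr_df.select(*[sf.col('arr')[i].alias('p' + str(i)) for i in range(35)])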
