As far as I know, VectorAssembler lets you combine multiple columns into a single column containing a Vector. You can later pass this column to various ML algorithms and preprocessing implementations.
I would like to know whether there is something like a "VectorDisassembler", that is, a helper that would take one Vector column and split its values back into multiple columns (e.g. at the end of an ML pipeline)?
If not, what is the best way to achieve that (preferably in Python)?
Here's what I had in mind:
from pyspark.sql import Row

# On Spark 2.x+ a DataFrame has no .map, so go through the underlying RDD
PcaComponents = Row(*["p" + str(i) for i in range(35)])
pca_features = reduced_dataset_df.rdd.map(lambda x: PcaComponents(*x[0].values.tolist())).toDF()
Can we do better?
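If Spark 3.0+ is an option, pyspark.ml.functions.vector_to_array might avoid the RDD round trip entirely. A rough sketch; the column name "pca_features" and the slot count are assumptions:

# Rough sketch, assuming Spark 3.0+ and a vector column named "pca_features"
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as sf

n_components = 35  # number of slots in the vector, assumed known
split_df = (
    reduced_dataset_df
    .withColumn("arr", vector_to_array(sf.col("pca_features")))
    .select(*[sf.col("arr")[i].alias("p" + str(i)) for i in range(n_components)])
)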
I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+) with an average of 4 categories each, and a very large number of instances. I've found two ways to do this:
Using StringIndexer + OneHotEncoder + VectorAssembler + Pipeline from pyspark.ml. I like this approach because I can just chain several of these transformers and get a final one-hot encoded vector representation (see the sketch after this list).
Appending binary columns to the main dataframe (using .withColumn), where each column represents a category from a categorical feature.
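For concreteness, a minimal sketch of the Point 1 chain (the column names are made up, and the single-column OneHotEncoder signature varies somewhat between Spark 2.x and 3.x):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cat_cols = ["color", "size"]  # illustrative categorical columns
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cat_cols]
encoders = [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec") for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_vec" for c in cat_cols], outputCol="features")
pipeline_model = Pipeline(stages=indexers + encoders + [assembler]).fit(df)
encoded_df = pipeline_model.transform(df)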
Since I'll be modelling (including feature selection and data balancing) using Python packages, I need the one-hot encoded features to be laid out as in Point 2. And since .withColumn is computationally very expensive, this is not feasible with the resulting number of columns I have. Is there any way to get a reference to the column names after fitting the Pipeline described in Point 1? Or is there any way to make a scalable version of Point 2 that doesn't involve sequentially appending more columns to the dataframe?
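Regarding the column names: one approach I've seen used is reading the ML attribute metadata that VectorAssembler attaches to its output column. A sketch, assuming the pipeline above and an assembled column called "features" (the exact metadata layout can vary by Spark version):

# Sketch: recover per-slot feature names from the assembled vector's metadata
meta = encoded_df.schema["features"].metadata
attrs = meta.get("ml_attr", {}).get("attrs", {})
slots = sorted((a["idx"], a["name"]) for group in attrs.values() for a in group)
feature_names = [name for _, name in slots]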
I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?
Options 1 & 2
If you indeed use OneHotEncoder, the problem is how to transform that one vector column into separate binary columns. When searching around, I found that most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:
Use a UDF. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to use one repeatedly for every categorical feature I want to one-hot encode.
Convert to an RDD. Then, for each row, extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)
The problem is that this will extract each value for each row and then convert that back into a dataframe. This sounds horrible if you need to do this for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.
Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not, maybe it's better not to use OneHotEncoder at all, since I can't use the result efficiently. Which leads me to Option 3.
Option 3
Option 3 would be to just do the one-hot encoding myself using reduce and withColumn. For example:
from functools import reduce
import pyspark.sql.functions as sf

df = reduce(
    lambda df, category: df.withColumn(
        category,
        sf.when(sf.col('categoricalFeature') == category, sf.lit(1)).otherwise(sf.lit(0))),
    categories,
    df
)
Where categories is of course the list of possible categories in the feature categoricalFeature.
Before I spend hours or days implementing this, waiting for the results, and debugging, I wanted to ask whether anyone has any experience with this. Is there an efficient way I'm missing? If not, which of the three would probably be fastest?
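One variation that might be cheaper than repeated withColumn calls is building all the binary columns in a single select, which keeps the query plan from growing with every category. A sketch, with categoricalFeature and categories as in Option 3:

import pyspark.sql.functions as sf

# One select instead of many withColumn calls keeps the query plan small
df = df.select(
    "*",
    *[(sf.col("categoricalFeature") == c).cast("int").alias(c) for c in categories]
)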
When I use Python's scikit-learn for a machine learning project, I often use one-hot encoding.
My X dataset consists of rows like this: [1, 2, [1, 0]], where the third entry ([1, 0]) comes from one-hot encoding.
Is this equivalent to using a data set where the rows are like [1,2,1,0]?
(I.e. where the rows have been 'flattened')
Individual values of the one-hot encoded vector can be flattened into individual features. pandas has a dedicated function for this, get_dummies().
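A minimal illustration with pandas (the data and column names are made up):

import pandas as pd

df = pd.DataFrame({"x1": [1, 2], "x2": [2, 3], "color": ["red", "blue"]})
flat = pd.get_dummies(df, columns=["color"])
# 'color' is replaced by the binary columns color_blue and color_red
print(flat)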
I have two dataframes that contain similar data in their columns but have different column names. I need to identify whether two columns are similar or not.
colName1=['movieName','movieRating','movieDirector','movieReleaseDate']
colName2=['name','release_date','director']
My approach is to tokenize the names in colName1 and compare them using:
- Levenshtein/Jaccard distance
- similarity based on TF-IDF scores
But this only works for column names that are lexically similar, e.g. movieName and name. If you have 'IMDB_Score' and 'average_rating', this approach is not going to work.
Is there any way word2vec can be utilized for the above-mentioned problem?
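One way this could be tried, sketched below under the assumption that gensim and some pretrained embedding are available (the model name and helper functions here are just illustrations): tokenize each column name, average the word vectors of its tokens, and compare names by cosine similarity.

# Sketch only: pretrained word vectors for column-name similarity
import re
import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # example pretrained embedding

def name_vector(col_name):
    # split camelCase / snake_case names into tokens and average their word vectors
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", col_name)
    vecs = [model[t.lower()] for t in tokens if t.lower() in model]
    return np.mean(vecs, axis=0) if vecs else None

def name_similarity(a, b):
    va, vb = name_vector(a), name_vector(b)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(name_similarity("IMDB_Score", "average_rating"))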
Tensorflow beginner here.
My data is split into two csv files, a.csv and b.csv, relating to two different events a and b. Both files contain information on the users concerned and, in particular, they both have a user_id field that I can use to merge the data sets.
I want to train a model to predict the probability of b happening based on the features of a. To do this, I need to append a label column 'has_b_happened' to the data A retrieved from a.csv. In Scala Spark, I would do something like:
val joined = A
  .join(B.groupBy("user_id").count, A("user_id") === B("user_id"), "left_outer")
  .withColumn("has_b_happened", col("count").isNotNull.cast("double"))
In TensorFlow, however, I haven't found anything comparable to Spark's join. Is there a way of achieving the same result, or am I trying to use the wrong tool for the job?
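If the goal is just to build the labelled training table, one common pattern is to do the join outside TensorFlow, e.g. with pandas, and only then feed the result to the model. A rough sketch (the file and column names come from the question, everything else is assumed):

import pandas as pd

a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")

# Equivalent of the Spark left outer join on the per-user counts above
b_counts = b.groupby("user_id").size().rename("count").reset_index()
joined = a.merge(b_counts, on="user_id", how="left")
joined["has_b_happened"] = joined["count"].notna().astype("float64")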