When I use Python's scikit-learn for a machine learning project, I often use one-hot encoding.
My X dataset consists of rows like [1, 2, [1, 0]], where the third entry ([1, 0]) comes from one-hot encoding.
Is this equivalent to using a dataset where the rows look like [1, 2, 1, 0]?
(i.e. where the rows have been 'flattened')
Yes, the individual values of the one-hot-encoded vector can be flattened into individual features. pandas has a dedicated function for this, get_dummies().
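For illustration, a minimal sketch using pandas (the frame and the column name 'color' are just made-up examples):

import pandas as pd

# Hypothetical example frame; 'color' stands in for your categorical column.
df = pd.DataFrame({"a": [1, 4], "b": [2, 5], "color": ["red", "blue"]})
flat = pd.get_dummies(df, columns=["color"])
# 'flat' now has one binary column per category (color_blue, color_red)
# alongside the original numeric columns -- the "flattened" layout above.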
I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+), each with an average of 4 categories, over a very large number of instances. I've found two ways to do this:
Using StringIndexer + OneHotEncoder + VectorAssembler + Pipeline from pySpark.MlLib. I like this approach because I can chain several of these transformers and get a final one-hot encoded vector representation (see the sketch after this list).
Appending binary columns to the main dataframe (using .withColumn), where each column represents one category of a categorical feature.
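For reference, the approach in Point 1 looks roughly like this (a minimal sketch; cat_col_1, cat_col_2 and df are placeholder names, and note that on Spark 3+ OneHotEncoder takes inputCols/outputCols rather than one column at a time):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cat_cols = ["cat_col_1", "cat_col_2"]  # placeholder categorical columns

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cat_cols]
encoders = [OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec") for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_vec" for c in cat_cols], outputCol="features")

# Chain everything and produce one assembled vector column.
pipeline = Pipeline(stages=indexers + encoders + [assembler])
encoded = pipeline.fit(df).transform(df)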
Since I'll be doing the modelling (including feature selection and data balancing) with Python packages, I need the one-hot encoded features to be encoded as in Point 2. And since .withColumn is computationally very expensive, this is not feasible with the number of columns I would end up with. Is there any way to get a reference to the column names after fitting the Pipeline described in Point 1? Or is there a scalable version of Point 2 that doesn't involve sequentially appending more columns to the dataframe?
I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?
Options 1 & 2
If you indeed use OneHotEncoder, the problem is how to transform that one vector column to separate binary columns. When searching around, I find that most other answers seem to use some variant of this answer which gives two options, both of which look very inefficient:
Use a udf. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to apply one repeatedly for every categorical feature I want to one-hot encode.
Convert to an RDD. Then, for each row, extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    # Keep all existing column values and append the expanded one-hot vector.
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)
The problem is that this will extract each value for each row and then convert that back into a dataframe. This sounds horrible if you need to do this for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.
Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not, maybe it's better not to use OneHotEncoder at all, since I can't use the result efficiently. Which leads me to option 3.
Option 3
Option 3 would be to just one-hot encode the feature myself using a reduce statement and withColumn. For example:
from functools import reduce
import pyspark.sql.functions as sf

# One binary column per category; .otherwise(0) gives 0 instead of null for non-matches.
df = reduce(
    lambda df, category: df.withColumn(
        category, sf.when(sf.col('categoricalFeature') == category, 1).otherwise(0)),
    categories,
    df,
)
Where categories is of course the list of possible categories in the feature categoricalFeature.
Before I spend hours/days implementing this, waiting for results, debugging, etc., I wanted to ask whether anyone has experience with this. Is there an efficient way I'm missing? If not, which of the three would probably be fastest?
I've just started learning numpy to analyse experimental data, and I am trying to give custom names to the columns of an ndarray so that I can do operations without slicing.
The data I receive is generally a two-column .txt file (I'll call the columns X and Y for the sake of clarity) with a large number of rows corresponding to the measured data. I do operations on those data and generate new columns (I'll call them F(X,Y), G(X,Y,F), ...). I know I can do column-wise operations by slicing, i.e. Data[:,2] = Data[:,1] + Data[:,0], but with a large number of added columns this becomes tedious. Hence I'm looking for a way to label the columns so that I can refer to a column by its label, and can also label the new columns I generate. So essentially I'm looking for something that will allow me to write F = X + Y directly (as a substitute for the example above).
Currently, I'm assigning the entire column to a new variable and doing the operations, and then 'hstack'ing it to the data, but I'm unsure of the memory usage here. For example,
X = Data[:, 0]
Y = Data[:, 1]
F = X + Y
Data = numpy.hstack((Data, F.reshape(n, 1)))  # n is the number of rows
I've seen the use of structured arrays and record arrays, but the data I'm dealing with is homogeneous and new columns are being added continuously. Also, I hear pandas is well suited for what I'm describing, but since I'm working with purely numerical data, I don't see the need to learn a new module unless it's really needed. Thanks in advance.
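For context, the record-array approach mentioned above would look roughly like this (a sketch; the file name and field names are assumptions):

import numpy as np

# Load the two measured columns and give them labels.
Data = np.loadtxt("measurement.txt")     # assumed file name, shape (n, 2)
rec = np.rec.fromarrays(Data.T, names="X,Y")
F = rec["X"] + rec["Y"]                  # refer to columns by label instead of index

Appending a newly computed column under a new label would still require rebuilding the record array (or using numpy.lib.recfunctions.append_fields), which is part of the trade-off described above.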
TensorFlow beginner here.
My data is split into two csv files, a.csv and b.csv, relating to two different events a and b. Both files contain information on the users concerned and, in particular, they both have a user_id field that I can use to merge the data sets.
I want to train a model to predict the probability of b happening based on the features of a. To do this, I need to append a label column 'has_b_happened' to the data A retrieved from a.csv. In Scala Spark, I would do something like:
val joined = A
  .join(B.groupBy("user_id").count, A("user_id") === B("user_id"), "left_outer")
  .withColumn("has_b_happened", col("count").isNotNull.cast("double"))
In TensorFlow, however, I haven't found anything comparable to Spark's join. Is there a way of achieving the same result, or am I trying to use the wrong tool for it?
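One option is to do the join outside TensorFlow, e.g. in pandas, before feeding the data to the model. A minimal sketch, assuming both files fit in memory and share the user_id column described above:

import pandas as pd

A = pd.read_csv("a.csv")
B = pd.read_csv("b.csv")

# Count b-events per user, then left-join onto A, mirroring the Spark snippet above.
b_counts = B.groupby("user_id").size().rename("count").reset_index()
joined = A.merge(b_counts, on="user_id", how="left")
joined["has_b_happened"] = joined["count"].notnull().astype(float)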
As far as I know, VectorAssembler enables you to combine multiple columns into one column containing a Vector. This column can later be passed to various ML algorithms and preprocessing implementations.
I would like to know whether there's something like a "VectorDisassembler", that is, a helper which would take one Vector column and split its values back into multiple columns (e.g. at the end of an ML pipeline)?
If not, what is the best way to achieve that (best in Python, if possible)?
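For context, the assembling direction described above looks roughly like this (a sketch; the column names and df are placeholders):

from pyspark.ml.feature import VectorAssembler

# Combine two placeholder numeric columns into a single Vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled_df = assembler.transform(df)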
Here's what I had in mind:
from pyspark.sql import Row

# On Spark 2+, use reduced_dataset_df.rdd.map instead of .map.
PcaComponents = Row(*["p" + str(i) for i in range(35)])
pca_features = reduced_dataset_df.map(lambda x: PcaComponents(*x[0].values.tolist())).toDF()
Can we do better?