Aggregating data in a tf.Dataset - python

I have a problem aggregating and transforming rows in a tf.data.Dataset.
Each row has a string id and a string category, where some categories are subcategories of others.
I want to transform the dataset by mapping each category value to a one-hot encoding of the base categories, then grouping the rows by id and summing up the one-hot encodings.
I can combine multiple rows using tf.data.experimental.group_by_reducer, but I cannot for the life of me figure out how to map them to one-hot encodings before reducing them.
Any help would be appreciated.
So far I have tried using tf.one_hot, but it does not work with strings.
I've also tried to implement a tf.lookup.StaticHashTable but could not get it to work with tensors as values; it complained about the shape.
Unfortunately the code was written in a notebook and is gone now...
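Roughly, the pipeline I was trying to build looked something like the sketch below (the category names, the number of base categories and the hash bucket size are placeholders, and the sub-category-to-base-category mapping is left out):

import tensorflow as tf

# Illustrative base categories and rows.
base_categories = ["news", "sport", "tech"]
num_categories = len(base_categories)

# Map category strings to integer indices, then one-hot encode them.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(base_categories),
        values=tf.range(num_categories, dtype=tf.int64)),
    default_value=-1)

ds = tf.data.Dataset.from_tensor_slices({
    "id": ["u1", "u1", "u2"],
    "category": ["news", "sport", "news"],
})

def encode(row):
    index = table.lookup(row["category"])
    return {"id": row["id"], "one_hot": tf.one_hot(index, num_categories)}

ds = ds.map(encode)

# Group rows by id (hashed to an int64 key) and sum the one-hot encodings.
reducer = tf.data.experimental.Reducer(
    init_func=lambda key: tf.zeros(num_categories),
    reduce_func=lambda state, row: state + row["one_hot"],
    finalize_func=lambda state: state)

grouped = ds.apply(tf.data.experimental.group_by_reducer(
    key_func=lambda row: tf.strings.to_hash_bucket_fast(row["id"], 1000),
    reducer=reducer))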
Regards

Related

Building machine learning with a dataset which has only string values

I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all the data is stored as string values.
Is there any way of building a model with this kind of data,
other than tokenising?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it could easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work on something like country.
One-Hot Encoding - For features like country, it would be fairly simple to just one-hot encode them and start training the model with the numerical data thus obtained (see the sketch below). But with the number of columns that you have, and depending on the unique values in such a large amount of data, the dimensionality will increase by a lot.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect our model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the first place. more info
Bag of words, Word2Vec - Since you excluded tokenization, I don't know if you want to do this, but these are also options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R so cannot confirm how far this is true. Nevertheless, find more about it here
One-Hot + Vector Assembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be converted into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here
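To illustrate the one-hot option, a minimal sketch with pandas and scikit-learn might look like this (the column names and values are made up; the real data would have ~190 string columns):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the real string-only data.
df = pd.DataFrame({
    "country": ["US", "DE", "US"],
    "answered": ["Yes", "No", "Yes"],
})

# handle_unknown="ignore" avoids failures on categories unseen at fit time.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(df)           # sparse matrix, one column per unique value
print(encoder.get_feature_names_out())  # e.g. ['country_DE' 'country_US' ...]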

Relabeling categorical values in pandas data frame using fuzzy

I have a large data frame with 371 unique categorical entries; however, some of the entries are similar, and in some cases I want to merge categories that may have been separated. For example, here are 3 categories that I know of:
3d
3d_platformer
3d_vision
I want to combine these under a general category of just 3d. I feel like this should be possible on a small scale, but I want to scale it up to all the categories as well. The problem is that I don't know the names of all my categories. So in short, the full question is:
How can I search for similar categorical names and then replace all the similar names with one group name, without searching for each one individually?
Can regular expressions help?
df.col = df.col.str.replace(r'3d.*', '3d', regex=True)
If you're looking for more semantic similarity, NLP libraries like Gensim provide string-similarity methods:
https://betterprogramming.pub/introduction-to-gensim-calculating-text-similarity-9e8b55de342d
You can try using your category names as the corpus.
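To scale the regex idea up without listing every category by hand, one rough option is to collapse each entry to its leading token. A sketch with made-up data (the column name is illustrative):

import pandas as pd

# Toy data; 'category' stands in for the real column.
df = pd.DataFrame({"category": ["3d", "3d_platformer", "3d_vision", "racing", "racing_sim"]})

# Collapse every entry to the part before the first underscore, so
# '3d_platformer' and '3d_vision' both become '3d'.
df["category_grouped"] = df["category"].str.replace(r"_.*", "", regex=True)
print(df["category_grouped"].unique())  # ['3d' 'racing']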

Most efficient way to one-hot encode using pyspark

I want to one-hot encode multiple categorical features using pyspark (version 2.3.2). My data is very large (hundreds of features, millions of rows). The problem is that pyspark's OneHotEncoder class returns its result as one vector column. I need to have the result as a separate column per category. What is the most efficient way to achieve this?
Option 1 & 2
If you indeed use OneHotEncoder, the problem is how to transform that one vector column into separate binary columns. When searching around, I found that most other answers seem to use some variant of this answer, which gives two options, both of which look very inefficient:
Use a UDF. But I would like to avoid that, since UDFs are inherently inefficient, especially if I have to use one repeatedly for every categorical feature I want to one-hot encode.
Convert to rdd. Then for each row extract all the values and append the one-hot encoded features. You can find a good example of this here:
def extract(row):
    return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())

result = df.rdd.map(extract).toDF(allColNames)
The problem is that this will extract each value for each row and then convert that back into a dataframe. This sounds horrible if you need to do this for a dataframe with hundreds of features and millions of rows, especially if you need to do it dozens of times.
Is there another, more efficient way to convert a vector column to separate columns that I'm missing? If not maybe it's better not to use OneHotEncoder at all, since I can't use the result efficiently. Which leads me to option 3.
Option 3
Option 3 would be to just one-hot encode myself using reduce statements and withColumn. For example:
from functools import reduce
from pyspark.sql import functions as sf

df = reduce(
    lambda df, category: df.withColumn(
        category, sf.when(sf.col('categoricalFeature') == category, sf.lit(1)).otherwise(sf.lit(0))),
    categories,
    df
)
Where categories is of course the list of possible categories in the feature categoricalFeature.
Before I spend hours/days implementing this, waiting for the results, debugging, etc. I wanted to ask if anyone has any experience with this? Is there an efficient way I'm missing? If not which of the three would probably be fastest?
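For what it's worth, a variant of option 3 that builds all the dummy columns in one select (rather than chaining withColumn calls, which keeps growing the query plan) could look roughly like this; the column name categoricalFeature follows the example above:

from pyspark.sql import functions as sf

# Collect the distinct categories once (assumes the cardinality is manageable).
categories = [r[0] for r in df.select("categoricalFeature").distinct().collect()]

# Build one 0/1 column per category and add them all in a single select.
dummies = [
    sf.when(sf.col("categoricalFeature") == c, sf.lit(1)).otherwise(sf.lit(0))
      .alias("categoricalFeature_{}".format(c))
    for c in categories
]
df_encoded = df.select("*", *dummies)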

Preparing csv data for ML

I would like to implement ML model for classification problem. My csv data looks like this:
Method1; Method2; Method3; Method4; Category; Class
result1; result2; result3; result4; Sport; 12
...
...
All methods give a text. Sometimes it is one word, sometimes more, and sometimes the cell is empty (no answer for this method). The "Category" column always has a text, and the "Class" column is a number indicating which methods gave the correct answer (i.e. the number 12 means that only the results from methods 1 and 2 are correct). Maybe I will add more columns if necessary.
Now, having new answers from all the methods, I would like to classify them into one of the classes.
How should I prepare this data? I know I should have a numerical data but how to do that, and handle with all empty cells, and inconsistent number of words in each answer?
There are many different ways of doing this, but the simplest would be to just use a Bag of Words representation, which means concatenating all your Methodx columns and counting how many times each word appears in them.
With that, you have a vector representation (each word is a column/feature, each count is a numerical value).
Now, from here there are several problems (the main one is that the number of columns/features in your dataset will be quite large), so you may have to either preprocess your data further or find an ML technique that can deal with it for you. In any case, I would recommend having a look at several tutorials on NLP to get a better idea of this and a better sense of what the best solution for your dataset would be.
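A minimal bag-of-words sketch along those lines, using pandas and scikit-learn (the column names follow the CSV sample above; the values are made up):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy rows standing in for the semicolon-separated CSV described above.
df = pd.DataFrame({
    "Method1": ["blue car", None, "red bike"],
    "Method2": ["car", "blue", None],
    "Method3": ["fast car", "sky", "bike"],
    "Method4": [None, "blue sky", "red"],
    "Category": ["Sport", "Nature", "Sport"],
    "Class": [12, 2, 13],
})

# Concatenate the Method columns, treating empty cells as empty strings.
method_cols = ["Method1", "Method2", "Method3", "Method4"]
text = df[method_cols].fillna("").agg(" ".join, axis=1)

# Bag of words: one feature per word, values are per-row counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)   # sparse matrix of word counts
y = df["Class"]                      # classification target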

Adding labels to a dataset in tensorflow by using a second dataset

Tensorflow beginner here.
My data is split into two csv files, a.csv and b.csv, relating to two different events a and b. Both files contain information on the users concerned and, in particular, they both have a user_id field that I can use to merge the data sets.
I want to train a model to predict the probability of b happening based on the features of a. To do this, I need to append a label column 'has_b_happened' to the data A retrieved from a.csv. In Scala Spark, I would do something like:
val joined = A
  .join(B.groupBy("user_id").count, A("user_id") === B("user_id"), "left_outer")
  .withColumn("has_b_happened", col("count").isNotNull.cast("double"))
In tensorflow, however, I haven't found anything comparable to spark's join. Is there a way of achieving the same result or am I trying to use the wrong tool for it?
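tf.data has nothing like a join, so the usual workaround is to do the merge in pandas (or keep it in Spark) before building the Dataset. A rough sketch, assuming both files fit in memory and share a user_id column:

import pandas as pd
import tensorflow as tf

a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")

# Count b-events per user and left-join them onto a.
b_counts = b.groupby("user_id").size().rename("b_count").reset_index()
joined = a.merge(b_counts, on="user_id", how="left")
joined["has_b_happened"] = joined["b_count"].notna().astype("float64")

# Build the tf.data.Dataset from the labelled frame.
features = joined.drop(columns=["b_count", "has_b_happened", "user_id"])
labels = joined["has_b_happened"]
ds = tf.data.Dataset.from_tensor_slices((dict(features), labels.values))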
