Relabeling categorical values in pandas data frame using fuzzy - python

I have a large data frame with 371 unique categorical entries, but some of the entries are similar, and in some cases I want to merge categories that may have been separated. For example, here are 3 categories that I know of:
3d
3d_platformer
3d_vision
I want to combine these under a general category of just 3d. I feel like this should be possible on a small scale, but I want to scale it up for all the categories as well. The problem is that I don't know the names of all my categories. So, in short, the full question is:
How can I search for similar category names and then replace all the similar names with one group name, without searching individually?

Can regular expressions help?
df.col = df.col.str.replace(r'3d.*', '3d', regex=True)
If you're looking for more semantic similarity, NLP libraries like Gensim provide string-similarity methods:
https://betterprogramming.pub/introduction-to-gensim-calculating-text-similarity-9e8b55de342d
You can try using your category names as the corpus.
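Before reaching for an NLP library, a stdlib-only sketch with difflib can group near-duplicate spellings. The category names and the 0.8 cutoff below are assumptions to tune for your data; note that difflib's character-level ratio will group misspellings like 'stratergy'/'strategy' but will *not* group prefix families like 3d_*, which the regex approach above handles more simply:

```python
import difflib

# Hypothetical category names; 'stratergy' is a misspelling of 'strategy'.
categories = ['3d', '3d_platformer', '3d_vision', 'strategy', 'stratergy']

canonical = {}   # raw name -> group representative
reps = []        # representatives seen so far
for cat in categories:
    # get_close_matches returns reps scoring above the cutoff (0.8 here).
    match = difflib.get_close_matches(cat, reps, n=1, cutoff=0.8)
    if match:
        canonical[cat] = match[0]
    else:
        reps.append(cat)
        canonical[cat] = cat

# canonical['stratergy'] == 'strategy'
# Apply to the column with: df.col = df.col.map(canonical)
```

Lowering the cutoff merges more aggressively, so it's worth printing the resulting groups and eyeballing them before applying the map.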

Related

R/Python method to combine multiple SPSS-style crosstables into one table

My supervisor wants a single table comparing multiple different categorical variables against another categorical variable. For example, the attached image
x-tab cross table
is found here https://strengejacke.wordpress.com/2014/02/20/no-need-for-spss-beautiful-output-in-r-rstats/ and is made with R's sjt.xtab() [though the function name has since changed].
I could use sjt.xtab() to create another cross-table with different index variables, for example age category (0-15, 16-29, etc.) with the same column variables (dependency level). What I need to be able to do is combine both of these crosstables into one table where the column categories stay in the same position and several different categorical variables (sex, age categories, shoe size, etc.) are listed in the index. This doesn't seem statistically correct, as it would appear to duplicate numbers, but my supervisor just wants it for referencing reasons, not publication.
Is there any way to do this in R or python? Happy to clarify my question if needed!
Edit, Here is a terribly edited Microsoft Paint example of what I am looking for Combined cross-tab Image
You can do that with the gt package: https://gt.rstudio.com/articles/intro-creating-gt-tables.html
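Since the question also asks about Python, one way to get the stacked layout is to build one pd.crosstab per index variable and concatenate them with labeled blocks. The data below is made up for illustration; the point is the dict passed to pd.concat, whose keys become the outer index level:

```python
import pandas as pd

# Hypothetical survey data: two index variables against one column variable.
df = pd.DataFrame({
    'sex': ['m', 'f', 'f', 'm', 'f', 'm'],
    'age_cat': ['0-15', '16-29', '0-15', '16-29', '16-29', '0-15'],
    'dependency': ['low', 'high', 'low', 'low', 'high', 'high'],
})

# One crosstab per index variable, stacked with a label for each block.
combined = pd.concat(
    {var: pd.crosstab(df[var], df['dependency']) for var in ['sex', 'age_cat']},
    names=['variable', 'category'],
)
print(combined)
```

The result has a two-level row index (variable, category) with the dependency levels staying in the same column positions across blocks, which matches the Paint mock-up's layout.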

Aggregating data in a tf.Dataset

I have a problem aggregating rows and transforming rows in a tf.data.Dataset.
Each row has a string id and a string category, where some categories are subcategories of others.
I want to transform the dataset by mapping each category value to a one-hot encoding of the base categories, then grouping the rows by id and summing up the one-hot encodings.
I can combine multiple rows using tf.data.experimental.group_by_reducer, but I can't for the life of me figure out how to map them to one-hot encodings before reducing them.
Any help would be appreciated.
So far I have tried using tf.one_hot, but it does not work with strings.
I've also tried to implement a tf.lookup.StaticHashTable but could not get it to work with tensors as values; it complained about the shape.
Unfortunately the code was written in a notebook and is gone now...
Regards
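Since the original notebook is gone, here is a plain-Python sketch of the transformation being described, with hypothetical category names. It pins down the lookup-then-reduce logic: map each category string to a base category, look up an integer index, one-hot it, and sum per id. In TF terms, the usual route is a string-to-int lookup table (values must be scalar ints, which may be why tensor-valued StaticHashTable complained about shape) followed by tf.one_hot on the integer, with the summation done in the reducer:

```python
from collections import defaultdict

# Hypothetical base categories and a subcategory -> base mapping.
base = ['3d', 'rpg', 'strategy']
to_base = {'3d': '3d', '3d_vision': '3d', 'rpg': 'rpg',
           'rpg_maker': 'rpg', 'strategy': 'strategy'}
index = {name: i for i, name in enumerate(base)}   # string -> int lookup

rows = [('a', '3d_vision'), ('a', 'rpg'), ('b', 'strategy'), ('b', '3d')]

# Map each row's category to a one-hot vector, then sum per id.
agg = defaultdict(lambda: [0] * len(base))
for row_id, cat in rows:
    agg[row_id][index[to_base[cat]]] += 1

# agg == {'a': [1, 1, 0], 'b': [1, 0, 1]}
```

Once this logic is nailed down, each piece has a TF counterpart: the string-to-int dictionary becomes the lookup table, the `+= 1` at a one-hot index becomes `tf.one_hot` plus a summing reduce_func in group_by_reducer.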

Is there any inbuilt pandas operation which can find similar columns of two different dataframes?

I have two dataframes that have similar data in their columns but different column names. I need to identify whether two columns are similar or not.
colName1=['movieName','movieRating','movieDirector','movieReleaseDate']
colName2=['name','release_date','director']
My approach is to tokenize colName1 and compare the tokens using:
- levenshtein/Jaccard Distance
- Find similarity using TFIDF score.
But this only works for column names that are literally similar, e.g. movieName and name. Suppose you have 'IMDB_Score' and 'average_rating'; this approach is not going to work.
Is there any way word2vec can be utilized for the above-mentioned problem?
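For the string-similarity half of the problem, a minimal stdlib sketch (normalizing case and separators before comparing) already pairs up the example columns; the `normalize` helper is my own addition, not an established API:

```python
import difflib

cols1 = ['movieName', 'movieRating', 'movieDirector', 'movieReleaseDate']
cols2 = ['name', 'release_date', 'director']

def normalize(name):
    # Lowercase and strip separators so 'movieReleaseDate' ~ 'release_date'.
    return name.lower().replace('_', '')

# For each column in cols2, find its best string match in cols1.
matches = {}
for c2 in cols2:
    best = max(cols1, key=lambda c1: difflib.SequenceMatcher(
        None, normalize(c1), normalize(c2)).ratio())
    matches[c2] = best
```

This covers lexically similar names only. For semantically related but lexically different pairs like 'IMDB_Score' vs 'average_rating', you would indeed need embeddings: tokenize each name, average the word2vec (or fastText) vectors of its tokens, and compare with cosine similarity, which is what Gensim's pretrained models let you do.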

Prediction based on more dataframes

I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about restaurant(open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried basic prediction with the user dataframe (predicting one column from a few others using RandomForest) and it was pretty straightforward. But these dataframes are logically different, and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (cuisine, for example)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially characteristics of the restaurant that might influence the first dataset. For example, opening times or location are strong factors a customer could consider. You can use them by merging the two dataframes at the restaurant level. That could help you understand how people reflect location and timing in their score for the restaurant; note that you could even apply clustering and see that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location and timing if it's part of their daily routine.
You should apply modelling techniques and run multiple simulations to get variable-importance box plots, and see whether variables like location and timing have high variance in their importance scores when calculated on different subsets of the data; that would indicate different customer sensitivities.
2) You can look at label encoding or one-hot encoding, or even use the variable as it is. It would help here to know how many levels the variable has in the data. You can look at functions like pd.get_dummies.
Hope this helps.
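The merge-at-restaurant-level idea plus pd.get_dummies can be sketched as follows. All column names here are hypothetical stand-ins for your schema; the key assumption is that each review row links one user to one restaurant, so both feature sources can be joined onto it:

```python
import pandas as pd

# Hypothetical schema: each review row links a user to a restaurant.
reviews = pd.DataFrame({'user_id': [1, 1, 2], 'rest_id': [10, 11, 10],
                        'score': [4, 3, 5]})
users = pd.DataFrame({'user_id': [1, 2], 'family_size': [3, 1]})
rests = pd.DataFrame({'rest_id': [10, 11], 'cuisine': ['thai', 'italian']})

# Merge both sources onto the review rows, then one-hot encode cuisine.
X = (reviews.merge(users, on='user_id')
            .merge(rests, on='rest_id'))
X = pd.get_dummies(X, columns=['cuisine'])
# X now has numeric cuisine_italian / cuisine_thai columns plus the
# user features, ready for a RandomForest targeting 'score'.
```

With many cuisine levels, one-hot encoding can blow up the column count, which is why knowing the number of levels matters before choosing between one-hot and label encoding.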

Python Pandas Regression

I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:
I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 and 16.4
The examples I've been reading online about regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper wording for what I am asking is: identify specific values, or ranges of values, over which two columns in a Pandas dataframe are linearly related.
Can anyone help point me in the right direction?
Thank you all in advance!
Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and every other column using pandas.Series.corr (which really is the same as bivariate regression), which you could list:
other_cols = [col for col in df.columns if col != 'B']
corr_B = [{other: df.loc[:, 'B'].corr(df.loc[:, other])} for other in other_cols]
To get a handle on specific ranges, I would recommend looking at:
the cut and qcut functionality to bin your data as you like and either plot or correlate subsets accordingly: see docs here and here.
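A short sketch of the cut-and-summarize idea, on synthetic data (random numbers standing in for your 195-row dataframe): bin column D into ranges, then look at the mean of B within each bin. Statements like "B is above 3.5 when D is between 10.2 and 16.4" can then be read straight off the summary:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real 195-row dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({'B': rng.normal(size=195),
                   'D': rng.uniform(5, 20, size=195)})

# Bin column D into 4 equal-width ranges and summarize B within each bin.
df['D_bin'] = pd.cut(df['D'], bins=4)
summary = df.groupby('D_bin', observed=True)['B'].agg(['mean', 'count'])
print(summary)
```

Swapping pd.cut for pd.qcut gives equal-count bins instead of equal-width ones, which is often more robust when D is skewed.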
To visualize bivariate and simple multivariate relationships, I would recommend
the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the examples for univariate and bivariate distributions here, linear relationship plots here, and categorical data plots here.
The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could return to the scikit-learn or statsmodels packages best suited for this in python IMHO. Hope this helps to get you started.
