R/Python method to combine multiple SPSS-style crosstables into one table

My supervisor wants a single table comparing multiple different categorical variables against another categorical variable. For example, the attached image
x-tab cross table
(found here: https://strengejacke.wordpress.com/2014/02/20/no-need-for-spss-beautiful-output-in-r-rstats/) is made with R's sjt.xtab() [though the function name has since changed].
I could use sjt.xtab() to create another cross-table with different index variables, for example age category (0-15, 16-29, etc.) with the same column variable (dependency level). What I need to be able to do is combine both of these crosstables into one table, where the column categories stay in the same position and several different categorical variables (sex, age categories, shoe size, etc.) are listed in the index. This doesn't seem statistically correct, as it would appear to duplicate numbers, but my supervisor just wants it for reference, not for publication.
Is there any way to do this in R or python? Happy to clarify my question if needed!
Edit: here is a terribly edited Microsoft Paint example of what I am looking for: Combined cross-tab image

You may do that with gt tables: https://gt.rstudio.com/articles/intro-creating-gt-tables.html
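Since the question also asks about Python: a minimal pandas sketch of the same idea is to build one crosstab per index variable against the shared column variable and stack them with pd.concat. All column names and data below are hypothetical.

import pandas as pd

# Hypothetical survey data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "age_group": ["0-15", "16-29", "16-29", "0-15", "16-29", "0-15"],
    "dependency": ["low", "mid", "high", "mid", "low", "high"],
})

# One crosstab per index variable, all sharing the same column categories.
index_vars = ["sex", "age_group"]
tables = [pd.crosstab(df[v], df["dependency"]) for v in index_vars]

# Stack the tables vertically; `keys` labels each block with its source
# variable, so every column category stays in the same position.
combined = pd.concat(tables, keys=index_vars, names=["variable", "category"])
print(combined)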

Related

How to Add a Filter to this Excel Sheet with these Criteria?

I have this Dataset
Now, in this dataset, I want to add a filter. As we can see, there are identical or slightly different names for the same product, so we want a filter such that if we choose LOSARTAN, it will show all the values in Product relating to LOSARTAN, and the same for the other products too. Basically, a filter that groups all the products which have similar names: if we choose one name in the filter, we will be able to see all the different names used for that specific product.
Thank you!
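One way to approximate such a filter in pandas (rather than in Excel itself) is to derive a canonical key from each product name and filter on the key instead of the raw string. All names and values below are hypothetical:

import pandas as pd

# Hypothetical data; product names and values are assumptions for illustration.
df = pd.DataFrame({
    "Product": ["LOSARTAN 50MG", "Losartan Potassium", "LOSARTAN-HCTZ", "ASPIRIN 81MG"],
    "Sales": [120, 80, 45, 200],
})

# Derive a canonical key: first word, uppercased, with any suffix after a
# hyphen dropped, so all the Losartan variants share the key "LOSARTAN".
df["ProductKey"] = (
    df["Product"].str.split().str[0].str.split("-").str[0].str.upper()
)

# Filtering on the key shows every name variant of the chosen product.
print(df[df["ProductKey"] == "LOSARTAN"])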

Relabeling categorical values in a pandas data frame using fuzzy matching

I have a large data frame with 371 unique categorical entries; however, some of the entries are similar, and in some cases I want to merge categories that may have been separated. For example, here are 3 categories that I know of:
3d
3d_platformer
3d_vision
I want to combine these under the general category of just 3d. I feel like this should be possible on a small scale, but I want to scale it up to all the categories as well. The problem is that I don't know the names of all my categories. So, in short, the full question is:
How can I search for similar category names and then replace all the similar names with one group name, without searching individually?
Can regular expressions help?
# collapse every category that starts with "3d" into plain "3d";
# regex=True must be passed explicitly in recent pandas versions
df['col'] = df['col'].str.replace(r'^3d.*', '3d', regex=True)
If you're looking for something more like semantic similarity, NLP libraries such as Gensim provide string-similarity methods:
https://betterprogramming.pub/introduction-to-gensim-calculating-text-similarity-9e8b55de342d
You can try using your category names as the corpus.
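For purely lexical similarity, here is a minimal sketch using Python's standard-library difflib, a simpler stand-in for the Gensim approach above; the sample categories are hypothetical. Note this groups near-duplicates and typos, while prefix families such as 3d_* are better handled by the regex shown earlier:

import difflib
import pandas as pd

df = pd.DataFrame({"col": ["strategy", "stratgy", "strategey", "puzzle", "puzzles"]})

def group_similar(categories, cutoff=0.8):
    """Map each category to the first 'canonical' name it closely resembles."""
    mapping, canonical = {}, []
    for cat in categories:
        match = difflib.get_close_matches(cat, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[cat] = match[0]   # reuse an existing canonical name
        else:
            canonical.append(cat)     # first of its kind becomes canonical
            mapping[cat] = cat
    return mapping

mapping = group_similar(df["col"].unique())
df["col"] = df["col"].map(mapping)
print(df["col"].unique())   # ['strategy' 'puzzle']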

What is the best way to integrate different Excel files with different sheets with different formats in Python?

I have multiple Excel files with different sheets in each file. These files have been made by people, so each one has a different format, a different number of columns, and a different structure for representing the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column; in another it starts at the 122nd row, etc.
What I want to retrieve from these Excel files is what they have in common: variable names and their associated information.
However, I don't know how I could retrieve all this information without parsing each individual file, which is not an option because there are a lot of these files, with lots of sheets in each one.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity in your input files (such as at least a common column name), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than regular expressions.
If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, say you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names
    # into a variable `columns` here
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column:
                # add to a pandas DataFrame, calculate
                # the mean, etc.
                ...
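As a concrete stand-in for d, here is a minimal sketch using Python's standard-library difflib; it converts a similarity ratio into a distance rather than computing a true Levenshtein distance, which is an assumption on my part:

import difflib

def d(a: str, b: str) -> float:
    """Distance in [0, 1]: 0.0 for identical strings, 1.0 for no overlap.
    Uses difflib's similarity ratio as a stand-in for Levenshtein distance."""
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

threshold = 0.3
print(d("Interesting Column", "interesting_column"))  # small distance, matches
print(d("something else", "interesting_column"))      # larger distance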
If the data itself can tell you whether you should process it (such as following a particular distribution, or lying in a particular range), you can use such features to decide whether to use a given column. Even better, you can combine several of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without manually inspecting some of the data and studying the distribution of the data, the variability in the column names, etc.
For the future
Even with fancy methods to calculate features and some analysis of the data you have right now, I think it would be impossible to guarantee that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end; I suppose this is a manual process, with people entering data into Excel directly. I would argue that the best solution is to get rid of the problem at the root: create a unified form or Excel sheet format and distribute it to the people who will fill in the data, so that the data can be ingested automatically while minimizing the risk of errors.

Prediction based on multiple dataframes

I'm trying to predict the score that a user gives to a restaurant.
The data I have can be grouped into two dataframes:
data about the user (taste, personal traits, family, ...)
data about the restaurant (open hours, location, cuisine, ...).
The first major question is: how do I approach this?
I've already tried a basic prediction with the user dataframe (predicting one column from a few others using RandomForest), and it was pretty straightforward. These dataframes are logically different, and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (cuisine, for example)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and there are probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset; for example, opening times or location are strong factors that a customer could consider. You can use them by merging the two datasets at the restaurant level. This could help you understand how people reflect location or timing in their score for a restaurant. Note that you could even apply clustering here and find that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location, timing, etc. if it is part of their daily routine.
You should apply modelling techniques and run multiple simulations to get variable-importance box plots, and see whether variables like location or timing have a high variance in their importance scores when calculated on different subsets of the data; that would be indicative of different customer sensitivities.
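A minimal sketch of the merge step, assuming a hypothetical restaurant_id key that links each user's score to a restaurant (all column names are made up for illustration):

import pandas as pd

# Hypothetical frames; every column name here is an assumption.
users = pd.DataFrame({"user_id": [1, 2, 3],
                      "restaurant_id": [10, 11, 10],
                      "score": [4, 5, 3]})
restaurants = pd.DataFrame({"restaurant_id": [10, 11],
                            "cuisine": ["thai", "italian"],
                            "opens_at": [8, 12]})

# Attach restaurant characteristics to each scored visit so a single model
# can use user-level and restaurant-level features together.
merged = users.merge(restaurants, on="restaurant_id", how="left")
print(merged)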
2) You can look at label encoding or one-hot encoding, or even use the variable as it is. It would be helpful here to know how many levels there are in the data. You can look at functions like pd.get_dummies.
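A minimal sketch of one-hot encoding a hypothetical cuisine column with pd.get_dummies:

import pandas as pd

df = pd.DataFrame({"cuisine": ["thai", "italian", "thai", "mexican"]})

# One indicator column per cuisine level.
encoded = pd.get_dummies(df, columns=["cuisine"], prefix="cuisine")
print(encoded)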
Hope this helps.

How to use user-defined input for column name in pandas series

I am looking to understand how to use a user-defined variable within a column name. I am using pandas, and I have a dataframe with several columns that are in the same format, but the code will be run against different column names. I don't want to have to type in the different column names each time when only the first part of the name actually changes.
For example,
df['input_same_same']
Where the code will call out different columns where only the first part of the column is different and the rest remains the same.
Is it possible to do something along the lines of:
vari = 'cats' (and the next time I run it I can input dogs, pigs, etc.)
for
df['vari_count_litter']
I have tried using %s within the column name, but that doesn't work.
I'd appreciate any insight or understanding how this is possible. Thanks!
If I understand right, you could do df[vari + '_count_litter']. However, you may be better off using a MultiIndex, which would let you do df[vari, 'count_litter']. It's difficult to say how to set it up without knowing what your data structure is and how you want to access it.
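A minimal sketch of both approaches, using hypothetical columns such as cats_count_litter:

import pandas as pd

df = pd.DataFrame({"cats_count_litter": [1, 2],
                   "dogs_count_litter": [3, 4]})

vari = "cats"

# Build the column name from the variable (concatenation or an f-string):
print(df[vari + "_count_litter"])
print(df[f"{vari}_count_litter"])

# Alternatively, split the names into a two-level MultiIndex so the varying
# part becomes its own level:
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("_", 1)) for c in df.columns]
)
print(df[vari, "count_litter"])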
