Categorical Nested Lists into Pandas DataFrame

Categorical Nested Lists into Pandas DataFrame - python

I have three levels of categorical data that I need to convert into a Pandas DataFrame with repeating labels on the upper categories. I have lists for "main", "sub", and "tertiary" as follows:
main_labels = ['Certain infectious and parasitic diseases','Neoplasms']
main_icds = ['A00-B99','C00-D49']
sub_labels = ['Intestinal infectious diseases','Tuberculosis','Malignant neoplasms of lip, oral cavity and pharynx','Malignant neoplasms of digestive organs']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_labels = ['Cholera','Typhoid and paratyphoid fevers','Respiratory tuberculosis','Tuberculosis of nervous system','Malignant neoplasm of lip','Malignant neoplasm of base of tongue','Malignant neoplasm of esophagus','Malignant neoplasm of stomach']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
For illustration and example purposes, I need them to look like below in a Pandas DataFrame. If I can accomplish this, I can add in the label values.
It seemed like it would be easy but I'm stumped. Any help greatly appreciated. I tried searching historical posts but was having trouble finding the right key words to get anything close to what I'm trying to do. Thanks!

I think the best way is to start with the ternary category, then find its sub and main classifications. python allows inequalities on alphanumeric strings, so this should be pretty robust.
import pandas as pd
main_icds = ['A00-B99','C00-D49']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
#split on '-' to get bounds for each category
subs = [sub.split('-') for sub in sub_icds]
mains = [main.split('-') for main in main_icds]
df = pd.DataFrame({'ter_icd':ter_icds})
df['sub_icd'] = [sub_icd for ter in ter_icds
for sub_icd,sub in zip(sub_icds,subs)
if (ter >= sub[0]) & (ter <= sub[1])]
df['main_icd'] = [main_icd for ter in ter_icds
for main_icd,main in zip(main_icds,mains)
if (ter >= main[0]) & (ter <= main[1])]

Related

Getting specific values from list of key value pairs in dataframe

I've written the code below to get some citation data from an API and write it to a CSV. It works fine except that one of the columns returns a list of authors and it comes into the CSV like this:
[{'authorId': '83129125', 'name': 'June A. Sekera'}, {'authorId': '13328115', 'name': 'A. Lichtenberger'}]
How can I parse this so I get simply a comma-separated list of the authors in a single cell, ignoring the authorId?
import requests
import json
import pandas as pd
# get data from the API
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
json = r.json()
df = pd.DataFrame(json['data'])
# new df from the column of lists
split_df = pd.DataFrame(df['citingPaper'].tolist())
# display the resulting df
print(split_df)
split_df.to_csv("citations.csv", index=False)

Something like the below (do the authors "cleanup" before we populate the df)
import requests
import pandas as pd
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
data = []
if r.status_code == 200:
entries = r.json()['data']
for entry in entries:
entry['citingPaper']['authors'] = ','.join(x['name'] for x in entry['citingPaper'].get('authors',[]))
data.append(entry['citingPaper'])
df = pd.DataFrame(data)
df.to_csv("citations.csv",index = False)
citations.csv
paperId,url,title,year,authors
1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,https://www.semanticscholar.org/paper/1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,"A comparative study on deformation mechanisms, microstructures and mechanical properties of wide thin-ribbed sections formed by sideways and forward extrusion",2021,"Wenbin Zhou,Junquan Yu,Xiaona Lu,Jianguo Lin,T. Dean"
46f209486a9e8f81c77dbbb39991f4045dbc8f7d,https://www.semanticscholar.org/paper/46f209486a9e8f81c77dbbb39991f4045dbc8f7d,The low-carbon steel industry-Interactions between the hydrogen direct reduction of steel and the electricity system,2021,A. Toktarova
5638200cc8b188a48f923b75dee9793c06c99b62,https://www.semanticscholar.org/paper/5638200cc8b188a48f923b75dee9793c06c99b62,Pore-scale assessment of subsurface carbon storage potential: implications for the UK Geoenergy Observatories project,2021,"R. Payton,M. Fellgett,B. Clark,D. Chiarella,A. Kingdon,S. Hier‐Majumder"
6055ea75b377b6776f468ccb9f21551614b5f61f,https://www.semanticscholar.org/paper/6055ea75b377b6776f468ccb9f21551614b5f61f,Can Nature-Based Solutions Deliver a Win-Win for Biodiversity and Climate Change Adaptation?,2021,"Isabel Key,Alison C. Smith,B. Turner,A. Chausson,C. Girardin,Megan MacGillivray,N. Seddon"
8734d34823bcfb362f05df2a48bad19cc026b1c1,https://www.semanticscholar.org/paper/8734d34823bcfb362f05df2a48bad19cc026b1c1,Trends in air travel inequality in the UK: From the few to the many?,2021,"M. Büchs,Giulio Mattioli"
020eecb5f6edf918b6ef1120d97276b8d0748dc7,https://www.semanticscholar.org/paper/020eecb5f6edf918b6ef1120d97276b8d0748dc7,"Decarbonising the critical sectors of aviation, shipping, road freight and industry to limit warming to 1.5–2°C",2020,"M. Sharmina,O. Edelenbosch,C. Wilson,R. Freeman,D. Gernaat,P. Gilbert,A. Larkin,E. Littleton,M. Traut,D. V. van Vuuren,N. Vaughan,F. R. Wood,C. Le Quéré"
1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,https://www.semanticscholar.org/paper/1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,Investments in climate-friendly materials to strengthen the recovery package JUNE 2020,2020,"F. Lettow,Olga Chiappinelli"
3916ee1df6b8e07f8798a90726407554e990847e,https://www.semanticscholar.org/paper/3916ee1df6b8e07f8798a90726407554e990847e,Pathways for Low-Carbon Transition of the Steel Industry—A Swedish Case Study,2020,"A. Toktarova,I. Karlsson,Johan Rootzén,L. Göransson,M. Odenberger,F. Johnsson"
55667e4e2e3c35d7c6fcf98021a083d4397a308d,https://www.semanticscholar.org/paper/55667e4e2e3c35d7c6fcf98021a083d4397a308d,Potentials for reducing climate impact from tourism transport behavior,2020,"Anneli Kamb,E. Lundberg,J. Larsson,Jonas Nilsson"
9fc21836d61fbc76e3923021cb88f94e3e8c5a41,https://www.semanticscholar.org/paper/9fc21836d61fbc76e3923021cb88f94e3e8c5a41,Decarbonization of construction supply chains-Achieving net-zero carbon emissions in the supply chains linked to the construction of buildings and transport infrastructure,2020,
b657008ea89807ccbb204ed1e2f4debe92ce252b,https://www.semanticscholar.org/paper/b657008ea89807ccbb204ed1e2f4debe92ce252b,Roadmap for Decarbonization of the Building and Construction Industry—A Supply Chain Analysis Including Primary Production of Steel and Cement,2020,"I. Karlsson,Johan Rootzén,A. Toktarova,M. Odenberger,F. Johnsson,L. Göransson"
c611929ef750ba845a9c87da862ad1f8c9711e64,https://www.semanticscholar.org/paper/c611929ef750ba845a9c87da862ad1f8c9711e64,"Assessing Carbon Capture: Public Policy, Science, and Societal Need",2020,"June A. Sekera,A. Lichtenberger"

The easiest way I can think of is to use apply(lambda x: ...), creating a list of values for dictionary key "name" in each dictionary p in each item of the column authors.
Add this underneath split_df = pd.DataFrame(...):
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x])
split_df["authors"][0]
#Out: ['Wenbin Zhou', 'Junquan Yu', 'Xiaona Lu', 'Jianguo Lin', 'T. Dean']
Edit
To have blank "" if there are no authors:
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x] if len(x) > 0 else "")

How to divide a pandas data frame into sublists of n at a time?

I have a data frame made of tweets and their author, there is a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time such that I can export them later into csv files.
I tried using the following: (given that the authors are in column named 'B' and the tweets are in columns named 'A')
I took the following from this question
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
in order to separate the lists :
dgroups =[]
for i in range(0,len(authors)-1,2):
dgroups.append(df.loc[df.B==authors[i]])
dgroups.extend(df.loc[df.B ==authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
prior to this I was able to divide them correctly into 45 sub-lists derived from the previous link 1 as follows:
for i in authors:
groups.append(df.loc[df.B==i])
so how would i do that for 2 authors or 3 authors or like that?
EDIT: from #Jonathan Leon answer, i thought i would do the following, which worked but isn't a dynamic solution and is inefficient i guess, especially if n>3 :
dgroups= []
for i in range(2,len(authors)+1,2):
tempset1=[]
tempset2=[]
tempset1 = df.loc[df.B==authors[i-2]]
if(i-1 != len(authors)):
tempset2=df.loc[df.B ==authors[i-1]]
dgroups.append(tempset1.append(tempset2))
else:
dgroups.append(tempset1)

This imports the foreign language incorrectly, but the logic works to create a new csv for every two authors.
pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors=df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors)+auths_in_subset, auths_in_subset):
# print(authors[i-auths_in_subset:i])
dft = df[df.B.isin(authors[i-auths_in_subset:i])]
# print(dft)
dft.to_csv('df' + str(i) + '.csv')

Moving Average for python issue

I am looking at corona data from the NY Times which can be found here: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
And open for everyone to use.
The dataset is set up like this:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df2 = df.copy()
df2 = df2.set_index('date')
df2['cases_lagged'] = df2.groupby(['county', 'state'])['cases'].shift()
df2[df2['fips']== 34041.0].head(10)
I was hoping I could create a moving average column the same way using a groupby statement along with the .rolling() command from pandas to compile a 7-day and 14-day moving average for the data but it does not work.
I tried it two separate ways:
#way 1
df2['moving_avg'] = df2.groupby(['county', 'state']).iloc[:4].rolling(window = 7).mean()
#way 2
df2['moving_avg'] = df2.groupby(['county', 'state'])['cases'].rolling(window = 7).mean()
And neither seems to work here.
Any thoughts on how to compile the moving average for each county within each state without having to break out each and every county into its own df for it to work? Thanks

When I ran it with all data, I terminated it because it was running for a long time. We limited the run to specific counties.
I'm not sure I think I've achieved the intended result.
los = df2[(df2['county'] == 'Los Angeles') & (df2['state'] == 'California')]
los['moving_avg'] = los[['county', 'state', 'cases']].groupby(['county', 'state'], group_keys=False).rolling(window = 7).mean()

How to combine two sets of data with differences in merge-index strings?

I want to merge two csv-files with soccer data. They hold different data of the same and different games (partial overlap). Normally I would do a merge with df.merge, but the problem is, that the nomenclature differs for some teams in the two Datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to norm the team-naming on the two Datasets in order to be able to do a simple df.merge-operation on dates and teamnames. At the moment this would result in extra-lines, when a team has different names in the two sets.
So my main question is: How can I norm the teamnames in the two sets easily, without having to analyse all the differences "by hand" and hardcode "replace"-operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam awayteam date homeproba drawproba awayproba homexg awayxg
Manchester United Leicester 2018-08-10 22:00:00 0.2812 0.3275 0.3913 1.5137 1.73813
--Edit after first comments--
So the main question is: How could I automatically analyse the differences in the two datasets naming? Helpful facts:
As the sets hold wholes seasons, the overlap per team name is at least 30+ games.
Most of the teams have same names, name differences are the smaller part of the team names.
Most name differences have at least a common substring.
Both datasets have date-information of the games.
We know, a team plays only one game a day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to analyse that: {'Atletic Club':'Atletic Bilbao'}

So this is how I could solve this finally:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a dataframe holding each team's name existing in both datasets like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...

Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. This type of thing is always super fragile I think though, and you shouldn't really rely on it.
import pandas as pd
names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()
mapping_dict = dict()
for common_name in set(names_1).intersection(set(names_2)):
mapping_dict[common_name] = common_name
unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))
trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]
aligned_data = trim_df1.join(trim_df2, on = ['hometeam', 'date'], how = 'inner', lsuffix = '_1', rsuffix = '_2')
for unknown_name in unknown_1:
matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
if len(matching_name) != 1:
raise ValueError("Couldn't find a unique match")
mapping_dict[unknown_name] = matching_name[0]
unknown_2.remove(matching_name[0])
if len(unknown_2) != 0:
raise ValueError("We have extra team names for some reason")

Find and Replace in DataFrame using Pandas in optimized way

I am trying to find and replace words from the 20K comments. Find and replace words are stored in dataframe and its around more than 20000. Comments in different dataframe and its around 20K.
Below is the example
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
And I am expecting the output below
op = ["Hull Damage happened and its insured by maritime hull insurance company","Blocked and claims are blocked"]})
Please help.
I am using loop but its taking more than 20 mins to do this.
20 k records in the data, 30000 words to be replaced
""KeywordSynonym"" -- Dataframe holds find and replace data in sql
""backup"" -- Dataframe hold data to be cleaned
backup = str(backup)
TrainingClaimNotes_KwdSyn = []
for index,row in KeywordSynonym.iterrows():
word = KeywordSynonym.Synonym[index].lower()
value = KeywordSynonym.Keyword[index].lower()
my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
if re.search(my_regex,backup):
backup = re.sub(my_regex, value, backup)
TrainingClaimNotes_KwdSyn.append(backup)
TrainingClaimNotes_KwdSyn_Cmp = backup.split('\'", "\'')

Use:
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Output
>>> print(df1['Data_1'].tolist())
['hull damage happened and its insured by maritime hull insurance company', 'blocked and claims are blocked']
Explanation
dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower())) creates a mapping between what you want to replace and the string you want to replace with -
{'insurence': 'insurance', 'non cash entry': 'blocked'}
Convert the lookups to regex making it ready for lookup -
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
{'(\\b)insurence(\\b)': '\\1insurance\\2', '(\\b)non cash entry(\\b)': '\\1blocked\\2'}
The final piece is just making the actual replacement -
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Note: I did a .lower() everywhere to find proper matches. Obviously you can reshape it to the way you want it to look.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.