Pandas: searching one column's values inside another column - python

I want to search for the values of the paper_title column within the reference column. When a title is found as whole text inside a reference, I need the _id of that reference row (not the _id of the paper_title row) and want to save that _id in the paper_title_in column.
In[1]:
d = {
    "_id": [
        "Y100", "Y100", "Y100",
        "Y101", "Y101", "Y101",
        "Y102", "Y102", "Y102"
    ],
    "paper_title": [
        "translation using information on dialogue participants",
        "translation using information on dialogue participants",
        "translation using information on dialogue participants",
        "#emotional tweets",
        "#emotional tweets",
        "#emotional tweets",
        "#supportthecause: identifying motivations to participate in online health campaigns",
        "#supportthecause: identifying motivations to participate in online health campaigns",
        "#supportthecause: identifying motivations to participate in online health campaigns"
    ],
    "reference": [
        "beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
        "burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
        "gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
        "paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
        "saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
        "robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
        "alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
        "j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
        "tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781"
    ]
}

import pandas as pd
df = pd.DataFrame(d)
df
Out:
Expected Results:
The final result dataframe (with unique rows) should have a paper_title_in column that holds, for each title, a list of the _ids of the rows whose reference contains that title.
I tried the code below, but it returns the _id of the paper_title row being searched in paper_present_in, rather than the _id of the reference row where the match occurs. The expected result described above makes the difference clearer.
def return_id(paper_title, reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    else:
        return None

df1['paper_present_in'] = df1.apply(lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)

To solve this you will need two dictionaries and a list to hold some values temporarily.
# A list to store unique paper titles
unique_paper_title = []
# A dict to map each unique paper title to its _id
mapping_dict_paper_to_id = dict()
# A dict to map row indexes to the matched ids
mapping_id_to_idx = dict()

# This gives us the list of unique paper titles
unique_paper_title = df["paper_title"].unique()

# Storing values in the dict mapping_dict_paper_to_id
for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"] == value].unique()[0]

# Storing values in the dict mapping_id_to_idx
for value in unique_paper_title:
    # indexes of the rows whose reference contains the paper_title
    idx_list = df[df['reference'].str.contains(value, regex=False)].index
    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]

# This loop checks whether the index has a matched id and updates paper_present_in accordingly
for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"
The code above checks each row and updates the searched values in the dataframe.
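For reference, here is a more compact sketch of the same idea, using a title-to-id lookup and a plain string containment check. It assumes, like the sample data, that titles appear verbatim inside the reference strings, and it collects all matches per row as a list:
# Build a lookup from each unique title to its _id
title_to_id = df.drop_duplicates("paper_title").set_index("paper_title")["_id"].to_dict()

def ids_in_reference(reference):
    # Return the _id of every known title contained in this reference string
    found = [pid for title, pid in title_to_id.items() if title in reference]
    return found if found else "None"

df["paper_present_in"] = df["reference"].apply(ids_in_reference)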

Related

Getting specific values from list of key value pairs in dataframe

I've written the code below to get some citation data from an API and write it to a CSV. It works fine except that one of the columns returns a list of authors and it comes into the CSV like this:
[{'authorId': '83129125', 'name': 'June A. Sekera'}, {'authorId': '13328115', 'name': 'A. Lichtenberger'}]
How can I parse this so I get simply a comma-separated list of the authors in a single cell, ignoring the authorId?
import requests
import json
import pandas as pd
# get data from the API
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
json = r.json()
df = pd.DataFrame(json['data'])
# new df from the column of lists
split_df = pd.DataFrame(df['citingPaper'].tolist())
# display the resulting df
print(split_df)
split_df.to_csv("citations.csv", index=False)
Something like the below (do the authors "cleanup" before we populate the df)
import requests
import pandas as pd
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
data = []
if r.status_code == 200:
    entries = r.json()['data']
    for entry in entries:
        # flatten the authors list into a comma-separated string of names
        entry['citingPaper']['authors'] = ','.join(x['name'] for x in entry['citingPaper'].get('authors', []))
        data.append(entry['citingPaper'])
df = pd.DataFrame(data)
df.to_csv("citations.csv", index=False)
citations.csv
paperId,url,title,year,authors
1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,https://www.semanticscholar.org/paper/1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,"A comparative study on deformation mechanisms, microstructures and mechanical properties of wide thin-ribbed sections formed by sideways and forward extrusion",2021,"Wenbin Zhou,Junquan Yu,Xiaona Lu,Jianguo Lin,T. Dean"
46f209486a9e8f81c77dbbb39991f4045dbc8f7d,https://www.semanticscholar.org/paper/46f209486a9e8f81c77dbbb39991f4045dbc8f7d,The low-carbon steel industry-Interactions between the hydrogen direct reduction of steel and the electricity system,2021,A. Toktarova
5638200cc8b188a48f923b75dee9793c06c99b62,https://www.semanticscholar.org/paper/5638200cc8b188a48f923b75dee9793c06c99b62,Pore-scale assessment of subsurface carbon storage potential: implications for the UK Geoenergy Observatories project,2021,"R. Payton,M. Fellgett,B. Clark,D. Chiarella,A. Kingdon,S. Hier‐Majumder"
6055ea75b377b6776f468ccb9f21551614b5f61f,https://www.semanticscholar.org/paper/6055ea75b377b6776f468ccb9f21551614b5f61f,Can Nature-Based Solutions Deliver a Win-Win for Biodiversity and Climate Change Adaptation?,2021,"Isabel Key,Alison C. Smith,B. Turner,A. Chausson,C. Girardin,Megan MacGillivray,N. Seddon"
8734d34823bcfb362f05df2a48bad19cc026b1c1,https://www.semanticscholar.org/paper/8734d34823bcfb362f05df2a48bad19cc026b1c1,Trends in air travel inequality in the UK: From the few to the many?,2021,"M. Büchs,Giulio Mattioli"
020eecb5f6edf918b6ef1120d97276b8d0748dc7,https://www.semanticscholar.org/paper/020eecb5f6edf918b6ef1120d97276b8d0748dc7,"Decarbonising the critical sectors of aviation, shipping, road freight and industry to limit warming to 1.5–2°C",2020,"M. Sharmina,O. Edelenbosch,C. Wilson,R. Freeman,D. Gernaat,P. Gilbert,A. Larkin,E. Littleton,M. Traut,D. V. van Vuuren,N. Vaughan,F. R. Wood,C. Le Quéré"
1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,https://www.semanticscholar.org/paper/1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,Investments in climate-friendly materials to strengthen the recovery package JUNE 2020,2020,"F. Lettow,Olga Chiappinelli"
3916ee1df6b8e07f8798a90726407554e990847e,https://www.semanticscholar.org/paper/3916ee1df6b8e07f8798a90726407554e990847e,Pathways for Low-Carbon Transition of the Steel Industry—A Swedish Case Study,2020,"A. Toktarova,I. Karlsson,Johan Rootzén,L. Göransson,M. Odenberger,F. Johnsson"
55667e4e2e3c35d7c6fcf98021a083d4397a308d,https://www.semanticscholar.org/paper/55667e4e2e3c35d7c6fcf98021a083d4397a308d,Potentials for reducing climate impact from tourism transport behavior,2020,"Anneli Kamb,E. Lundberg,J. Larsson,Jonas Nilsson"
9fc21836d61fbc76e3923021cb88f94e3e8c5a41,https://www.semanticscholar.org/paper/9fc21836d61fbc76e3923021cb88f94e3e8c5a41,Decarbonization of construction supply chains-Achieving net-zero carbon emissions in the supply chains linked to the construction of buildings and transport infrastructure,2020,
b657008ea89807ccbb204ed1e2f4debe92ce252b,https://www.semanticscholar.org/paper/b657008ea89807ccbb204ed1e2f4debe92ce252b,Roadmap for Decarbonization of the Building and Construction Industry—A Supply Chain Analysis Including Primary Production of Steel and Cement,2020,"I. Karlsson,Johan Rootzén,A. Toktarova,M. Odenberger,F. Johnsson,L. Göransson"
c611929ef750ba845a9c87da862ad1f8c9711e64,https://www.semanticscholar.org/paper/c611929ef750ba845a9c87da862ad1f8c9711e64,"Assessing Carbon Capture: Public Policy, Science, and Societal Need",2020,"June A. Sekera,A. Lichtenberger"
The easiest way I can think of is to use apply(lambda x: ...), creating a list of values for dictionary key "name" in each dictionary p in each item of the column authors.
Add this underneath split_df = pd.DataFrame(...):
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x])
split_df["authors"][0]
#Out: ['Wenbin Zhou', 'Junquan Yu', 'Xiaona Lu', 'Jianguo Lin', 'T. Dean']
Edit
To have blank "" if there are no authors:
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x] if len(x) > 0 else "")
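If you want the cell to contain a plain comma-separated string rather than a Python list (a list is what otherwise ends up written to the CSV), a small variation of the same lambda would do it; an empty authors list then simply becomes an empty string:
split_df["authors"] = split_df["authors"].apply(lambda x: ", ".join(p["name"] for p in x))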

How to manage the combining of multiple REST API calls into one data model?

I'm learning to collect data from REST APIs to generate custom reports.
For example, one of the APIs I'm dealing with is the POS application MobileBytes. For this API my goal is to model daily sales > group by room > summarized by each category.
A Daily Sales Report uses the debit credit model:
Account                     DR    CR
Credit Cards receivable     DR
Cash receivable             DR
Bar liquor                        CR
Bar wine                          CR
Bar beer                          CR
Bar food                          CR
Patio Food                        CR
Patio liquor                      CR
Patio wine                        CR
Patio beer                        CR
Sales Tax                         CR
Tips                              CR
As this example Sales Report shows, the API's "rooms" represent the different physical areas in the establishment: here bar and patio; other rooms could include dining, banquet, take-out, hrubgub, eberuats. Report categories represent the common categories of sales used in food-service accounting: food, liquor, wine, beer; other categories could include dessert or retail. The result of each API endpoint is a variable-length array of features and their totals. The constraint of the debit-credit model is that total debits equal total credits (just like any purchase receipt: the total of your purchases plus tax equals what you paid).
And in case you thought it were a simple job of querying each endpoint to collect each table for the final report: no. Each record item's label is an id that points to the Setup API, where the string label for each record item is found. Putting it all together as {"bar" : {"food": $x_1, "liquor": $x_2, "wine": $x_3, "beer": $x_4}, .. means calling at least 3-4 different endpoints: two for each subsection of the sales report (i.e. rooms and categories), plus one or more for their labels in the setup and menu endpoints.
How could I manage, organize and combine all of these different API calls?
I'm leaning towards using Pandas DataFrames as described in this example: Query API’s with Json Output in Python (Medium article)
There appears to be no actual ID for you to join against in the data.
You have a list of "report categories" and "rooms". Each with their own IDs. Each with their own quantity, for example.
It's also not clear what each object represents. Days? If so, create a simple loop over each day from start-to-end, then parse each object.
import requests
from datetime import date, timedelta

start = date(2022, 1, 21)
startDate = start.strftime("%Y-%m-%d")
end = date(2022, 1, 22)
endDate = end.strftime("%Y-%m-%d")

# TODO: You need to add API keys in here
sales_api = 'https://api.mobilebytes.com/v2/reports/sales'
categories = requests.get(f'{sales_api}/reportCategories/{startDate}/{endDate}')
rooms = requests.get(f'{sales_api}/rooms/{startDate}/{endDate}')

output = []  # to build your output for a dataframe
if categories.status_code // 100 == 2 and rooms.status_code // 100 == 2:
    # These lists should be the same size, so you can zip them
    c = categories.json()['application/json']
    r = rooms.json()['application/json']
    d = start
    for category, room in zip(c, r):
        print(d, category, room)  # For debugging
        # TODO: parse both objects and populate your list above
        output.append({
            'date': d.strftime("%Y-%m-%d"),
            'category_id': category['report_category_id'],
            # TODO: parse more category values
            'room_id': room['room_id'],
            # TODO: parse more room values
        })
        d += timedelta(days=1)
else:
    raise RuntimeError('Unable to connect')
From a list of dictionaries, it is easy to create a dataframe
import pandas as pd
df = pd.DataFrame(output)
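If you also pull the label lookups from the setup endpoints, the id columns can then be translated to readable names before reporting. A rough sketch; the setup endpoint path and the field names below are assumptions for illustration, not taken from the MobileBytes documentation:
import requests
import pandas as pd

# Hypothetical setup endpoint returning e.g. [{"room_id": 1, "name": "bar"}, ...]
setup_api = 'https://api.mobilebytes.com/v2/setup'
rooms_setup = requests.get(f'{setup_api}/rooms').json()
room_labels = pd.DataFrame(rooms_setup).rename(columns={'name': 'room'})

# Join the human-readable room name onto the report rows built above
df = df.merge(room_labels[['room_id', 'room']], on='room_id', how='left')

# Once the numeric totals are parsed into the rows, the final grouping could be e.g.:
# report = df.groupby(['date', 'room'])['total'].sum()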
Regarding ORM, you can use swagger-codegen to create Python classes that represent the documented response bodies.

Print corresponding value in pandas DF row

How do I print out entries in a df using a keyword search? I have a legislative database I'm running a list of climate keywords against:
climate_key_words = ['climate', 'gas', 'coal', 'greenhouse', 'carbon monoxide', 'carbon',
                     'carbon dioxide', 'education',
                     'gas tax', 'regulation']
Here's my for loop:
for bill in df.title:
    for word in climate_key_words:
        if word in bill:
            print(bill)
            print(word)
            print(df.state)
            print('------------')
When it prints, df.state forces everything to print funky:
24313 AK
24314 AK
24315 AK
24316 AK
24317 AK
Name: state, Length: 24318, dtype: object
------------
Relating to limitations on food regulations at farms, farmers' markets, and cottage food production operations.
regulation
But when print(df.state) is absent, it looks much nicer:
------------
Higher education; providing for the protection of certain expressive activities.
education
------------
Schools; allowing a school district board of education to amend certain policy to stock inhalers. Effective date. Emergency.
education
------------
How can I include df.state (and other values) and have them printed only once?
Ideally, my output should look like this:
###bill
###corresponding title
###corresponding state
print(df.state) is going to print out the column/field 'state'. You presumably want the state associated with that row of the dataframe?
So I would suggest tweaking your approach slightly and doing something like:
for row in range(df.shape[0]):  # for each row in the dataframe
    for word in climate_key_words:
        if word in df.iloc[row]['title']:
            print(df.iloc[row]['title'])  # access values in the df by row, column
            print(word)
            print(df.iloc[row]['state'])
            print('------------')
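An alternative sketch, if you prefer to filter the frame once instead of looping over every row (assuming the columns are named 'title' and 'state', as in the question):
import re

# Build one regex that matches any of the keywords, then keep only matching rows
pattern = '|'.join(re.escape(word) for word in climate_key_words)
matches = df[df['title'].str.contains(pattern, case=False, na=False)]

for _, row in matches.iterrows():
    print(row['title'])
    print(row['state'])
    print('------------')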

How to clean up messy "Country" attribute from biopython pubmed extracts?

I have extracted ~60,000 PubMed abstracts into a data frame using Biopython. The attributes include "Authors", "Title", "Year", "Journal", "Country", and "Abstract".
The attribute "Country" is very messy, with a mixture of countries, cities, names, addresses, free-text items (e.g., "freelance journalist with interest in Norwegian science"), faculties, etc.
I want to clean up the column only to contain the country - and "NA" for those records that are missing the entry, or have a free-text item that does not make sense.
Currently, my clean-up process of this column is very cumbersome:
import numpy as np

pub = df['Country']
chicago = pub.str.contains('Chicago')
df['Country'] = np.where(chicago, 'USA', pub.str.replace('-', ' '))
au = pub.str.contains('#edu.au')
df['Country'] = np.where(au, 'Australia', pub.str.replace('-', ' '))
# ... and so on
Are you aware of some python libraries, or have some ideas for a more automated way of cleaning up this column?
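Short of a dedicated library, one way to make the process less repetitive is to keep the pattern-to-country pairs in a single mapping and loop over it, defaulting everything else to "NA". A sketch only; the patterns and target countries below are illustrative assumptions:
import numpy as np
import pandas as pd

# Illustrative pattern -> country pairs; extend as needed
pattern_to_country = {
    'Chicago': 'USA',
    'edu.au': 'Australia',
}

pub = df['Country'].str.replace('-', ' ')
cleaned = pd.Series('NA', index=df.index)          # default to NA
for pattern, country in pattern_to_country.items():
    mask = pub.str.contains(pattern, case=False, regex=False, na=False)
    cleaned = cleaned.where(~mask, country)         # set country where the pattern matches
df['Country'] = cleaned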

How to combine two sets of data with differences in merge-index strings?

I want to merge two CSV files with soccer data. They hold different data for the same and for different games (partial overlap). Normally I would do this with df.merge, but the problem is that the naming differs for some teams between the two datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to normalize the team names across the two datasets so that I can do a simple df.merge on dates and team names. At the moment a merge would produce extra rows whenever a team has different names in the two sets.
So my main question is: how can I normalize the team names in the two sets easily, without having to analyse all the differences "by hand" and hardcode "replace" operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam awayteam date homeproba drawproba awayproba homexg awayxg
Manchester United Leicester 2018-08-10 22:00:00 0.2812 0.3275 0.3913 1.5137 1.73813
--Edit after first comments--
So the main question is: how could I automatically analyse the naming differences between the two datasets? Helpful facts:
As the sets hold whole seasons, the overlap per team name is at least 30+ games.
Most teams have the same name in both sets; differing names are the minority.
Most name differences have at least a common substring.
Both datasets contain the dates of the games.
We know a team plays only one game a day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to infer the mapping {'Atletic Club': 'Atletic Bilbao'}.
This is how I finally solved it:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a dataframe holding each team's name existing in both datasets like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...
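From that pairing you can then build a rename mapping and apply it to one of the sets before the final merge (a sketch, assuming the column names and frames used above):
# Map dataset2's names onto dataset1's naming, then merge normally
name_map = dict(zip(df_teamnames['awayteam_y'], df_teamnames['awayteam_x']))
dataset2 = dataset2.replace({'hometeam': name_map, 'awayteam': name_map})
merged = pd.merge(dataset1, dataset2, on=['date', 'hometeam', 'awayteam'])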
Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. This type of thing is always super fragile I think though, and you shouldn't really rely on it.
import pandas as pd

names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()

mapping_dict = dict()

# Names that already agree map to themselves
for common_name in set(names_1).intersection(set(names_2)):
    mapping_dict[common_name] = common_name

unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))

trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]

# Align the two datasets on games that share a home team and a date
aligned_data = trim_df1.merge(trim_df2, on=['hometeam', 'date'], how='inner', suffixes=('_1', '_2'))

for unknown_name in unknown_1:
    matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
    if len(matching_name) != 1:
        raise ValueError("Couldn't find a unique match")
    mapping_dict[unknown_name] = matching_name[0]
    unknown_2.remove(matching_name[0])

if len(unknown_2) != 0:
    raise ValueError("We have extra team names for some reason")
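With mapping_dict built, applying it is a single replace before merging (a sketch; here mapping_dict maps dataset1's names to dataset2's names, so dataset1 is the one renamed):
# Translate dataset1's team names into dataset2's naming, then merge
dataset1 = dataset1.replace({'hometeam': mapping_dict, 'awayteam': mapping_dict})
merged = dataset1.merge(dataset2, on=['date', 'hometeam', 'awayteam'])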
