I have extracted ~60,000 PubMed abstracts into a data frame using Biopython. The attributes include "Authors", "Title", "Year", "Journal", "Country", and "Abstract".
The attribute "Country" is very messy, with a mixture of countries, cities, names, addresses, free-text items (e.g., "freelance journalist with interest in Norwegian science"), faculties, etc.
I want to clean up the column so that it contains only the country, with "NA" for records that are missing the entry or contain a free-text item that does not make sense.
Currently, my clean-up process of this column is very cumbersome:
pub = df['Country']
chicago = pub.str.contains('Chicago', na=False)
df['Country'] = np.where(chicago, 'USA', pub.str.replace('-', ' '))

pub = df['Country']  # re-read the column so the previous replacement is not lost
au = pub.str.contains('#edu.au', na=False)
df['Country'] = np.where(au, 'Australia', pub)
... and so on
Are you aware of any Python libraries, or do you have ideas for a more automated way of cleaning up this column?
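Libraries such as geotext or pycountry may help with recognizing place and country names, but even without them the per-pattern rules can be consolidated into a single keyword-to-country mapping applied in one pass. A minimal sketch, where the keyword lists are deliberately incomplete illustrative assumptions:

import pandas as pd

# Illustrative keyword-to-country rules; these lists are assumptions and would need to be
# extended to cover the real data.
country_keywords = {
    'USA': ['chicago', 'boston', 'new york'],
    'Australia': ['edu.au', 'sydney', 'melbourne'],
    'Norway': ['oslo', 'trondheim'],
}

def map_country(raw):
    # Return the first country whose keyword appears in the raw affiliation text, else "NA".
    if pd.isna(raw):
        return 'NA'
    text = str(raw).lower()
    for country, keywords in country_keywords.items():
        if any(keyword in text for keyword in keywords):
            return country
    return 'NA'

df['Country'] = df['Country'].apply(map_country)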
I need to write a geo-classifier that assigns a geographical region to each row: if the search query contains the name of a city belonging to a region, the name of that region is written to the 'region' column; otherwise, 'undefined' is written.
I have the following code, which doesn't work:
import pandas as pd
data_location = pd.read_csv(r'\Users\super\Desktop\keywords.csv', sep = ',')
def sorting(row):
    keyword_set = row['keywords'].lower()
    for region, city_list in geo_data.items():
        for town in keyword_set:
            if town in city_list:
                return region
    return 'undefined'
Rules for distribution by region (Center, North-West, and Far East):
geo_location = {
'Центр': ['москва', 'тула', 'ярославль'],
'Северо-Запад': ['петербург', 'псков', 'мурманск'],
'Дальний Восток': ['владивосток', 'сахалин', 'хабаровск']
}
Link to the csv file that is used in the program https://dropmefiles.com/IurAn
I tried to sort using the function, but it doesn't work. I also had the idea of creating a template of all existing cities and running each line of the file through that template for sorting.
I apologize in advance for such an extensive question; I'm still new to this field and just learning. I would be glad to receive any tips and help.
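One way to make that idea work (a sketch only, assuming the csv really has a 'keywords' column as referenced in the function above, and using the geo_location dictionary from the rules): check whether any city name occurs as a substring of the lower-cased query, and apply the function row by row.

def sorting(row):
    query = row['keywords'].lower()
    for region, city_list in geo_location.items():
        for city in city_list:
            if city in query:  # substring check, so 'санкт-петербург' still matches 'петербург'
                return region
    return 'undefined'

data_location['region'] = data_location.apply(sorting, axis=1)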
I have a dataframe that includes US company identifiers (Instrument) as well as the company's Name, ISIN, and its CIK number.
Here is an example of my dataset:
dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
"Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
"CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
"ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
}
df = pd.DataFrame(data=dict)
df
In some cases, there is more than one entry for each Instrument, as can be seen for Instrument 5039191995. In those cases, however, most of the time there is one entry that is "superior" to the others in terms of information content.
For example, in the first of the two entries for Instrument 5039191995 no information is missing, while in the second entry the Company Name as well as the ISIN are missing. In this case I would like to keep only the first entry and drop the second one.
Overall Goal: For each entry that has duplicates in terms of the Instrument column, I only want to keep the entry with the fewest missing values. If there are duplicates with the same number of missing values, all of them should be kept.
You could use the number of null values in a row as a sort key, and keep the first (lowest) row of each Instrument:
import pandas as pd
import numpy as np
dict = { "Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
"Company Name":["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
"CIK" : ["826154", "1080667", "814181", "1566610", "1622355"],
"ISIN" : ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]
}
df = pd.DataFrame(data=dict)
(df.assign(missing=df.isnull().sum(axis=1))
   .sort_values(by='missing', ascending=True)
   .drop_duplicates(subset='Instrument', keep='first')
   .drop(columns='missing'))
Output
Instrument Company Name CIK ISIN
0 4295914485 Orrstown Financial Services Inc 826154 US6873801053
1 4295913199 Ditech Networks Inc 1080667 US25500T1088
2 4295904693 Penn Treaty American Corp 814181 US7078744007
3 5039191995 Verb Technology Company Inc 1566610 US92337U1043
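Note that drop_duplicates keeps only one row per Instrument even when several rows tie on the number of missing values. If tied rows should all be kept, as stated in the goal, here is a sketch that filters on the per-group minimum instead:

# Count missing values per row, then keep every row whose count equals the
# minimum within its Instrument group, so ties are all retained.
missing = df.isnull().sum(axis=1)
result = df[missing.eq(missing.groupby(df['Instrument']).transform('min'))]
result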
Still figuring out programming, help is appreciated! I have a single column of information that I would ultimately like to turn into a dataframe. I could transpose it, but the address information varies: it is either 2 lines or 3 lines (some have suite numbers, etc.).
It generally looks like this.
name x,
ID 1,
123-xyz,
ID 2,
abcdefg,
ACTIVITY,
ggg,
TYPE,
C,
COUNTY,
orange county,
ADDRESS,
123 stack st,
city state zip,
PHONE,
111-111-1111,
EXPIRES,
date,
name y,
ID 1,
456-abc,
ID 2,
cvbnmnb,
ACTIVITY,
ggg,
TYPE,
A,
COUNTY,
dakota county,
ADDRESS,
234 overflow st,
lot a,
city state zip,
PHONE,
000-000-0000,
EXPIRES,
date,
name z,
...,
I was thinking of creating new lists for all desired columns and conditionally appending values with a for loop.
for i in list
    if value = ID
        append previous value to name list
        append next value to ID list
    elif value = phone
        send next value to phone
    elif value = address
        evaluate 3 rows down
        if value = phone
            concatenate previous two values and append to address list
        if value != phone
            concatenate current and previous 2 values and append to address list
    else
        print error message
Would this be a decently efficient option for lists of around ~20,000 values?
I don't really know how to write this; I am using Python in a Jupyter notebook. Looking for solutions, but also looking to learn more!
-EDIT-
A user suggested a while loop, and the original data sample I gave was simplified and contained 4 fields. My actual set contains 9 fields, and I tried playing around with it but unfortunately wasn't able to figure it out on my own.
count = 0              # Pointer to the start of a cluster
lengthdf = len(df)     # Length of the existing dataframe, used as the terminating condition

while count != lengthdf:
    name = id1 = id2 = activity = type = county = address = phone = expires = ""  # Reset the fields for every cluster of information
    name = df[0][count]        # Name is always the first line of the cluster
    id1 = df[0][count+2]       # id is always the third line of the cluster
    id2 = df[0][count+4]
    activity = df[0][count+6]
    type = df[0][count+8]
    county = df[0][count+10]
    n = 11
    while df[0][count+n] != "Phone":   # While the row is not 'PHONE', everything in between is the address, appended and separated by commas.
        address = address + df[0][count+n] + ", "
        n += 1
    phone = df[0][count+n+1]   # The phone number is always the row after 'PHONE', and is only 1 line.
    expires = df[0][count+n+3]
    n += 2
    newdf = newdf.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity, 'TYPE': type, 'COUNTY': county, 'ADDRESS': address, 'Phone': phone, 'Expires': expires}, ignore_index=True)  # Append the data into the new dataframe
    count = count + n
You seem to have a good understanding of what you need to do, judging by the pseudocode you provided!
I'm assuming that your xlsx file looks something like this without the commas.
Based on your sample data, this is what I can come up with for you. I'll be referring to each user's data as a 'cluster'.
This code works under a few assumptions:
The PHONE field always has only 1 line of data.
There is complete data for all clusters (or, if there is missing data, a blank exists on the next row).
Data is always in this particular order (i.e., name, ID, address, PHONE).
count acts as a pointer to the start of a cluster, while n is the offset from count. Read the comments for the explanations.
import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)   # Import xlsx file
newdf = pd.DataFrame(columns=['name', 'id', 'address', 'phone'])   # Create a blank dataframe

count = 0              # Pointer to the start of a cluster
lengthdf = len(df)     # Length of the existing dataframe, used as the terminating condition

while count != lengthdf:
    this_add = this_name = this_id = this_phone = ""   # Reset the fields for every cluster of information
    this_name = df[0][count]     # Name is always the first line of the cluster
    this_id = df[0][count+2]     # id is always the third line of the cluster
    n = 4
    while df[0][count+n] != "PHONE":   # While the row is not 'PHONE', everything in between is the address, appended and separated by commas.
        this_add = this_add + df[0][count+n] + ", "
        n += 1
    this_phone = df[0][count+n+1]   # The phone number is always the row after 'PHONE', and is only 1 line.
    n += 2
    newdf = newdf.append({'name': this_name, 'id': this_id, 'address': this_add, 'phone': this_phone}, ignore_index=True)   # Append the data into the new dataframe
    count = count + n
As for performance, I honestly do not think there is much optimisation that can be done given the nature of the dataset (I might be wrong). You may have noticed my solution is pretty "hard-coded" to reduce the need for if-else statements, but 20,000 lines should not be a huge problem for a Jupyter notebook. It may take a couple of minutes, but that should be alright.
I hope this gets you started on tackling other scenarios you may encounter with the remaining datasets!
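Extending the same pointer-and-offset approach to the nine-field layout from the edit, here is a sketch (assuming the labels appear exactly as in the sample, e.g. "PHONE" and "EXPIRES" in upper case, and collecting rows in a list rather than appending to the dataframe inside the loop):

import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)
rows = []
count = 0
while count < len(df):
    name = df[0][count]             # name is the first line of the cluster
    id1 = df[0][count + 2]          # value after the 'ID 1' label
    id2 = df[0][count + 4]          # value after the 'ID 2' label
    activity = df[0][count + 6]
    type_ = df[0][count + 8]
    county = df[0][count + 10]
    n = 12                          # first address line comes after the 'ADDRESS' label at count + 11
    address = ""
    while df[0][count + n] != "PHONE":
        address += str(df[0][count + n]) + ", "
        n += 1
    phone = df[0][count + n + 1]    # value after the 'PHONE' label
    expires = df[0][count + n + 3]  # value after the 'EXPIRES' label
    rows.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity,
                 'TYPE': type_, 'COUNTY': county, 'ADDRESS': address.rstrip(', '),
                 'PHONE': phone, 'EXPIRES': expires})
    count = count + n + 4           # the next cluster starts after the EXPIRES value
newdf = pd.DataFrame(rows)

Building the dataframe once from a list of dicts avoids calling append inside the loop, which also tends to be faster for ~20,000 rows.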
I want to search for the values of the paper_title column within the reference column. If a title is found as whole text, I want to get the _id of the reference row where it matched (not the _id of the paper_title row) and save that _id in the paper_title_in column.
In[1]:
d ={
"_id":
[
"Y100",
"Y100",
"Y100",
"Y101",
"Y101",
"Y101",
"Y102",
"Y102",
"Y102"
]
,
"paper_title":
[
"translation using information on dialogue participants",
"translation using information on dialogue participants",
"translation using information on dialogue participants",
"#emotional tweets",
"#emotional tweets",
"#emotional tweets",
"#supportthecause: identifying motivations to participate in online health campaigns",
"#supportthecause: identifying motivations to participate in online health campaigns",
"#supportthecause: identifying motivations to participate in online health campaigns"
]
,
"reference":
[
"beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
"burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
"gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
"paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
"saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
"robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
"alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
"j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
"tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781"
]
}
import pandas as pd
df=pd.DataFrame(d)
df
Out:
Expected Results:
The final result dataframe should contain unique rows, with the paper_title_in column holding, as a list, all the _id values of the reference rows in which that title appears.
I tried the following, but it returns the _id of the paper_title row in paper_present_in rather than the _id of the reference row where the match occurs. The expected results above should make this clearer.
def return_id(paper_title, reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    else:
        return None

df['paper_present_in'] = df.apply(lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)
To solve your problem you'll need two dictionaries and a list to store some values temporarily.
# A list to store unique paper titles
unique_paper_title = []

# A dict to store the mapping of unique paper titles to their ids
mapping_dict_paper_to_id = dict()

# A dict to store the mapping of row indexes to the matched paper's id
mapping_id_to_idx = dict()

# This gives us the list of unique paper titles
unique_paper_title = df["paper_title"].unique()

# Storing values in the dict mapping_dict_paper_to_id
for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"] == value].unique()[0]

# Storing values in the dict mapping_id_to_idx
for value in unique_paper_title:
    # This gives us the indexes of the rows whose reference contains the paper_title
    idx_list = df[df['reference'].str.contains(value)].index
    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]

# This loop checks whether the index has a matched reference id and updates the paper_present_in field accordingly
for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"
The above code checks for matches and updates the dataframe accordingly.
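If the goal is instead one row per title, with every matching reference row's _id collected as a list (as in the expected results above), here is a short sketch along those lines; regex=False is used so characters such as '#' in the titles are treated literally:

# Keep one row per (_id, paper_title) pair, then for each title collect the _id values
# of all rows whose reference text contains that title.
result = df.drop_duplicates(subset=['_id', 'paper_title']).copy()
result['paper_title_in'] = result['paper_title'].apply(
    lambda title: df.loc[df['reference'].str.contains(title, regex=False), '_id'].unique().tolist()
)
result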
Problem: I would like to highlight specific countries based on some data I have. As an example, I have a list of shows and the countries where they are licensed. I would like to highlight those countries when a show is selected or searched (selecting and searching come later in the program; right now I just want to be able to highlight specific countries).
I have been following the Folium Quickstart page here https://python-visualization.github.io/folium/quickstart.html , specifically the GeoJSON and TopoJSON sections. This is the code I have right now, and it highlights every country on the map.
import pandas as pd
import folium

# Load show data into a pandas dataframe
show_data = pd.read_csv('input files/Show Licensing.csv')
show_data['Contract Expiration'] = pd.to_datetime(show_data['Contract Expiration'])

# Load country polygons and names
country_geo = (open("input files/countries.geojson", "r", encoding="utf-8-sig")).read()

folium_map = folium.Map(location=[40.738, -73.98],
                        tiles="CartoDB positron",
                        zoom_start=5)
folium.GeoJson(country_geo).add_to(folium_map)
folium_map.save("my_map.html")
Expected Results: For right now, I would like to highlight all countries found in my csv file. The end goal is to be able to search for a show and highlight the countries where the show is licensed.
This is the code I wrote which answered my question:
for country in countriesAndContinents_json['features']:
    if country['properties']['Name'].lower() == h_country.lower():
        if highlightFlag == 'License A':
            return folium.GeoJson(
                country,
                name=(showTitle + ' License A ' + h_country),
                style_function=styleLicenseA_function,
                highlight_function=highlightLicenseA_function
            )
'country', which is used as the geo_data for folium.GeoJson, is the GeoJSON feature for a specific country. So when a searched country is found in the countries.geojson data, the function returns the GeoJSON for that specific country, including the geometry needed to highlight it.
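For completeness, a hedged usage sketch showing how such a layer could be attached to the map for each licensed country in the csv. The highlight_country wrapper (standing in for the function containing the loop above) and the 'Country' column name are assumptions for illustration:

# highlight_country is assumed to wrap the loop above and return a styled folium.GeoJson
# layer for one country name, or None if that country is not found in the geojson.
for h_country in show_data['Country'].dropna().unique():   # 'Country' is an assumed column name
    layer = highlight_country(countriesAndContinents_json, h_country, showTitle, 'License A')
    if layer is not None:
        layer.add_to(folium_map)

folium.LayerControl().add_to(folium_map)   # optional: lets users toggle the per-country layers
folium_map.save("my_map.html")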