Sorting keywords from a dataframe - python

I need to write a geo-classifier that assigns a geographical region to each row: if the search query contains the name of a city from a region, that region's name is written in the 'region' column; if it does not, the value is 'undefined'.
I have the following code, which doesn't work:
import pandas as pd

data_location = pd.read_csv(r'\Users\super\Desktop\keywords.csv', sep=',')

def sorting(row):
    keyword_set = row['keywords'].lower()
    for region, city_list in geo_data.items():
        for town in keyword_set:
            if town in city_list:
                return region
    return 'undefined'
Rules for distribution by region (Center, North-West, and Far East):
geo_location = {
    'Центр': ['москва', 'тула', 'ярославль'],
    'Северо-Запад': ['петербург', 'псков', 'мурманск'],
    'Дальний Восток': ['владивосток', 'сахалин', 'хабаровск']
}
Link to the csv file that is used in the program https://dropmefiles.com/IurAn
I tried sorting with the function above, but it doesn't work. I also had the idea of creating a template of all known cities and running each line of the file through that template for sorting.
I apologize in advance for such an extensive question; I'm still new to this field and just learning. I'd be glad of any tips and help.
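For reference, a minimal working sketch of the classifier, assuming the geo_location dictionary above and that each keywords value is a space-separated query string. The snippet above has two likely culprits: it refers to geo_data instead of geo_location, and iterating over a string yields single characters rather than words, so split() is needed:

import pandas as pd

geo_location = {
    'Центр': ['москва', 'тула', 'ярославль'],
    'Северо-Запад': ['петербург', 'псков', 'мурманск'],
    'Дальний Восток': ['владивосток', 'сахалин', 'хабаровск']
}

def sorting(row):
    # split() yields words; iterating over the raw string would yield characters
    words = row['keywords'].lower().split()
    for region, city_list in geo_location.items():
        for town in words:
            if town in city_list:
                return region
    return 'undefined'

data_location = pd.read_csv(r'\Users\super\Desktop\keywords.csv', sep=',')
data_location['region'] = data_location.apply(sorting, axis=1)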

Related

Create new proper time-dataset from raw data with Python

First of all, I'm sorry if this question has already been asked, but I believe my challenge is specific enough. I'm not looking for complete answers but simply guidelines on how I can proceed.
I have a raw dataset of monitoring participants. This data includes things like income, savings, etc., and the participants have been tracked for 6 months (Jan to Jun). But the data is stored in a single Excel file with a column specifying the month, which means one participant's name appears 6 times in the file, once per month. Each participant has a unique ID.
I want to transform this data into a more workable form, and I wanted to learn to do it with Python. But I feel stuck and rusty, because it's been ages since I've coded and I'm only used to the code I write on a regular basis (printing grouped averages, etc.). Here are the steps I want to follow:
a. Start by creating a column that contains a unique list of the participants tracked, using the ID. Each participant must appear only once;
b. Each participant is recorded with an activity and sub-activity type in the original file, which will need to be added to the new dataset as well;
c. For the month of January, for example, I want to create a 'january_income' column into which the income from January is pulled from the raw dataset, and so on for each variable and each month.
Can anyone provide guidelines on how I may proceed? As I said, it doesn't have to be specific code; it can be methods or steps along with the functions I can use.
Thanks a lot already.
N.B.: I use Spyder as my working environment.
Your question is not very specific, but you can try and adjust the code below:
import csv

"""
Convert your Excel file to CSV format.
This sample assumes that you have a CSV file with the first row as header/fieldnames.
"""
with open('test.csv', 'w') as fp:
    fp.write("""ID,Name,Income,Savings,Month
1,"Sample Name",1000,100,1
""")

def format(infile='infile.csv', outfile='outfile.csv'):
    months = ['January', 'February', 'March']  # Add specific months
    target_fields = ['Income', 'Savings']  # Add your desired fields
    timestamp_field = 'Month'  # The field which indicates the month of the row
    ID_field = 'ID'  # The field which indicates the unique identifier of the participant
    # The fields which are specific to each participant; these fields won't be touched at all
    part_specific_fields = [ID_field, 'Name']
    target_combined_fields = [f'{month}_{field}' for field in target_fields for month in months]
    total_fields = part_specific_fields + target_combined_fields
    temp = {}
    with open(infile, 'r') as fpi, open(outfile, 'w') as fpo:
        reader = csv.DictReader(fpi)
        for row in reader:
            ID = int(row[ID_field])
            if ID not in temp:
                temp[ID] = {}
                for other_field in part_specific_fields:
                    # Insert the constant columns that should not be touched
                    temp[ID][other_field] = row[other_field]
            month_pos = int(row[timestamp_field]) - 1  # subtract 1 for 0-indexing
            month = months[month_pos]  # month name in plain English
            for field in target_fields:
                temp[ID][f'{month}_{field}'] = row[field]
        # All the processing is complete, now write the data
        writer = csv.DictWriter(fpo, fieldnames=total_fields)
        writer.writeheader()
        for row in temp.values():
            writer.writerow(row)
    # The file has been written successfully, now return the mapped dictionary
    return temp

print(format('test.csv'))
1. First, convert your .xls file to .csv format.
2. Process each row and map it to the specific <month>_<field> keys.
3. Write the processed data to the outfile.csv file.
Thanks for the notes. First of all, I'm sorry if my post is not specific, and thanks for initiating me into the community. Since my initial post, I've made some effort to work on my data, and with my actual knowledge of the language all I could come up with was the filtering code shown below. This lets me have a column for each variable of each month, but I'm stuck on two things. First, I had to repeat this code for each month and change the month in the labels. I wouldn't have minded that approach if I didn't face another problem: it doesn't take into account the fact that some participants were not tracked in certain months, which means that even with the data sorted by ID number there is a mismatch between the columns, because their lengths vary with the number of participants tracked that month. Now I'm looking to optimize this code by adding something that would resolve the second issue (at this point I don't mind if the code is long, but if there is optimization to be made, I'm open to it as well):
import os
import pandas as pd

os.chdir("XXXXXXX")
economique = pd.read_csv('data_economique.csv')
#JANVIER
ID_jan = economique.query("mois_de_suivi == 'Janvier'")["ID"]
nom_jan = economique.query("mois_de_suivi == 'Janvier'")["nom"]
sexe_jan = economique.query("mois_de_suivi == 'Janvier'")["sexe"]
district_jan = economique.query("mois_de_suivi == 'Janvier'")["district"]
activite_jan = economique.query("mois_de_suivi == 'Janvier'")["activite"]
CA_jan = economique.query("mois_de_suivi == 'Janvier'")["chiffre_affaire"]
charges_jan = economique.query("mois_de_suivi == 'Janvier'")["charges"]
resultat_jan = economique.query("mois_de_suivi == 'Janvier'")["benefice"]
remb_attendu_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_attendu"]
remb_effectue_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_effectue"]
remb_differe_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_remb_differe"]
epargne_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_epargne"]
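A reshape with pivot_table may solve both issues at once; here is a minimal sketch, assuming the column names from the snippet above and at most one row per participant and month. Participants not tracked in a given month simply get NaN in that month's columns, so the columns always line up:

import pandas as pd

economique = pd.read_csv('data_economique.csv')

# Columns that identify or describe a participant and do not change across months
id_cols = ['ID', 'nom', 'sexe', 'district', 'activite']
# Monthly measurements that should become one column per month
value_cols = ['chiffre_affaire', 'charges', 'benefice', 'remb_attendu',
              'remb_effectue', 'calcul_remb_differe', 'calcul_epargne']

# Long-to-wide reshape: one row per participant, one column per (variable, month)
wide = economique.pivot_table(index=id_cols, columns='mois_de_suivi',
                              values=value_cols, aggfunc='first')

# Flatten the MultiIndex columns to names like 'chiffre_affaire_Janvier'
wide.columns = [f'{var}_{mois}' for var, mois in wide.columns]
wide = wide.reset_index()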

How to clean up messy "Country" attribute from biopython pubmed extracts?

I have extracted ~60,000 PubMed abstracts into a data frame using Biopython. The attributes include "Authors", "Title", "Year", "Journal", "Country", and "Abstract".
The attribute "Country" is very messy, with a mixture of countries, cities, names, addresses, free-text items (e.g., "freelance journalist with interest in Norwegian science"), faculties, etc.
I want to clean up the column only to contain the country - and "NA" for those records that are missing the entry, or have a free-text item that does not make sense.
Currently, my clean-up process of this column is very cumbersome:
pub = df['Country']
chicago = pub.str.contains('Chicago')
df['Country'] = np.where(chicago, 'USA', pub.str.replace('-', ' '))
au = pub.str.contains('#edu.au')
df['Country'] = np.where(au, 'Australia', pub.str.replace('-', ' '))
... and so on
Are you aware of some python libraries, or have some ideas for a more automated way of cleaning up this column?
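One semi-automated option is to match each entry against a list of known country names, for example with the pycountry package. This is a sketch under the assumption that pycountry is installed; city and domain rules like the ones above still need a small manual mapping:

import pandas as pd
import pycountry

# Manual rules for entries that name a city or a domain rather than a country
manual_rules = {'Chicago': 'USA', '#edu.au': 'Australia'}
country_names = [c.name for c in pycountry.countries]

def extract_country(text):
    if pd.isna(text):
        return 'NA'
    for pattern, country in manual_rules.items():
        if pattern in text:
            return country
    for name in country_names:
        if name.lower() in text.lower():
            return name
    return 'NA'  # free text that mentions no known country

df['Country'] = df['Country'].apply(extract_country)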

Passing an array of countries to a function

I am fairly new to Python. I am leveraging Python's holidays package, which has public holidays by country. I am looking to write a function that loops over any number of countries and returns a dataframe with 3 columns:
Date, Holiday, Country
Based on my limited knowledge, I came up with this sort of implementation:
import holidays
import numpy as np
import pandas as pd

def getholidayDF(*args):
    holidayDF = pd.DataFrame(columns=['Date', 'Holiday', 'Country'])
    for country in args:
        holidayDF.append(sorted(holidays.CountryHoliday(country, years=np.arange(2014, 2030, 1)).items()))
        holidayDF['Country'] = country
    return holidayDF

holidays = getholidayDF('FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden')
This returns a blank dataframe. I am not sure how to proceed!
If you change your for-loop as shown below, it should be okay for you. The most relevant comments were made by user roganjosh. O'Reilly, Wrox, Prentice Hall, Pearson, Packt, just to name a few publishers, have some good books for you. Skip the cookbooks for now.
.. code snippet ...
for country in args:
    holidayDF = holidayDF.append(sorted(holidays.CountryHoliday(country, years=np.arange(2014, 2030, 1)).items()))
    # holidayDF['Country'] = country  # remove this from the for-loop
return holidayDF  # move out of the for-loop
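For completeness, here is a sketch of a version that also fills the Country column correctly: it builds one frame per country and concatenates them, and it avoids reassigning the name holidays to the result, which would shadow the module (the name get_holiday_df is illustrative):

import holidays
import numpy as np
import pandas as pd

def get_holiday_df(*countries):
    frames = []
    for country in countries:
        items = sorted(holidays.CountryHoliday(country, years=np.arange(2014, 2030, 1)).items())
        frame = pd.DataFrame(items, columns=['Date', 'Holiday'])
        frame['Country'] = country  # one country per frame, so the column is consistent
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

holiday_df = get_holiday_df('FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden')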

Pandas, the searching is really hard?

Here I want to search the values of the paper_title column within the reference column. If a title is found as whole text in a reference, get the _id of that reference row (not the _id of the paper_title row) and save it in the paper_title_in column.
In[1]:
d = {
    "_id": ["Y100", "Y100", "Y100", "Y101", "Y101", "Y101", "Y102", "Y102", "Y102"],
    "paper_title": (
        ["translation using information on dialogue participants"] * 3
        + ["#emotional tweets"] * 3
        + ["#supportthecause: identifying motivations to participate in online health campaigns"] * 3
    ),
    "reference": [
        "beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
        "burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
        "gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
        "paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
        "saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
        "robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
        "alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
        "j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
        "tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781"
    ]
}
import pandas as pd
df = pd.DataFrame(d)
df
Out: (the dataframe built above)
Expected results: a final dataframe with unique rows per paper, where the paper_title_in column holds, as a list, all the _id values of the rows whose reference contains that paper's title.
I tried the code below, but it returns the _id of the paper_title row being searched in paper_present_in, rather than the _id of the reference row where the title matches.
def return_id(paper_title, reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    else:
        return None

df1['paper_present_in'] = df1.apply(lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)
To solve your problem, you'll need two dictionaries and a list to store some values temporarily.
# A list to store unique paper titles
unique_paper_title = []
# A dict to store the mapping of unique papers to unique ids
mapping_dict_paper_to_id = dict()
# A dict to store the mapping of unique indexes to the ids
mapping_id_to_idx = dict()

# This gives us the list of unique paper titles
unique_paper_title = df["paper_title"].unique()

# Storing values in the dict mapping_dict_paper_to_id
for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"] == value].unique()[0]

# Storing values in the dict mapping_id_to_idx
for value in unique_paper_title:
    # This gives us the indexes of the rows whose reference contains the paper_title;
    # regex=False forces a literal match, so titles with special characters are safe
    idx_list = df[df['reference'].str.contains(value, regex=False)].index
    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]

# This loop checks if the index has any reference's id and then updates the paper_present_in field accordingly
# (df.loc avoids chained assignment, which would raise a KeyError here because the column does not exist yet)
for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"
The above code checks for matches and updates the searched values in the dataframe.
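Under the same assumptions (a literal substring match of each unique title against every reference), a shorter sketch that directly produces the deduplicated frame the expected result describes, with each title's matching reference _ids collected as a list:

# One row per unique paper title, keeping its own _id
result = df.drop_duplicates('paper_title')[['_id', 'paper_title']].copy()
# For each title, collect the _ids of the rows whose reference mentions it
result['paper_title_in'] = result['paper_title'].apply(
    lambda title: df.loc[df['reference'].str.contains(title, regex=False), '_id'].unique().tolist() or None
)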

Folium || Highlight specific countries based on some data?

Problem: I would like to highlight specific countries based on some data I have. As an example, I have a list of shows and the countries where they are licensed. I would like to highlight those countries when a show is selected or searched (selecting and searching come later in the program; right now I just want to be able to highlight specific countries).
I have been following the Folium Quickstart page here https://python-visualization.github.io/folium/quickstart.html , specifically the GeoJSON and TopoJSON section. This is the code I have right now, and it highlights every country on the map.
# Loads show data into a pandas dataframe
show_data = pd.read_csv('input files/Show Licensing.csv')
show_data['Contract Expiration'] = pd.to_datetime(show_data['Contract Expiration'])

# Loads country polygons and names
country_geo = open("input files/countries.geojson", "r", encoding="utf-8-sig").read()

folium_map = folium.Map(location=[40.738, -73.98],
                        tiles="CartoDB positron",
                        zoom_start=5)
folium.GeoJson(country_geo).add_to(folium_map)
folium_map.save("my_map.html")
Expected results: for now, I would like to highlight all countries found in my CSV file. The end goal is to be able to search for a show and highlight the countries where that show is licensed.
This is the code I wrote which answered my question:
for country in countriesAndContinents_json['features']:
    if country['properties']['Name'].lower() == h_country.lower():
        if highlightFlag == 'License A':
            return folium.GeoJson(
                country,
                name=(showTitle + ' License A ' + h_country),
                style_function=styleLicenseA_function,
                highlight_function=highlightLicenseA_function
            )
'country', which is used as the geo_data for folium.GeoJson, is the GeoJSON feature for a specific country. So when a searched country is found in the countries.geojson data, the function returns the GeoJSON feature for that specific country, including the geometry needed to highlight it.
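For the intermediate goal of highlighting every country found in the CSV, here is a self-contained sketch using a style_function; the 'Country' column name in the CSV is an assumption to adjust:

import json
import folium
import pandas as pd

show_data = pd.read_csv('input files/Show Licensing.csv')
# Hypothetical column name: adjust to however the CSV stores the licensed country
licensed = set(show_data['Country'].str.lower())

with open("input files/countries.geojson", "r", encoding="utf-8-sig") as fp:
    country_geo = json.load(fp)

def style_function(feature):
    # Fill only the countries that appear in the CSV; leave the rest transparent
    if feature['properties']['Name'].lower() in licensed:
        return {'fillColor': '#ff7800', 'color': '#ff7800', 'fillOpacity': 0.6}
    return {'fillOpacity': 0, 'weight': 1}

folium_map = folium.Map(location=[40.738, -73.98], tiles="CartoDB positron", zoom_start=5)
folium.GeoJson(country_geo, style_function=style_function).add_to(folium_map)
folium_map.save("my_map.html")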
