Passing an array of countries to a function - python

I am fairly new to Python. I am leveraging Python's holidays package which has public holidays by country. I am looking to write a function that loops over any number of countries and returns a dataframe with 3 columns:
Date, Holiday, Country
Based on my limited knowledge, I came up with this sort of implementation:
import holidays
import numpy as np
import pandas as pd

def getholidayDF(*args):
    holidayDF = pd.DataFrame(columns=['Date', 'Holiday', 'Country'])
    for country in args:
        holidayDF.append(sorted(holidays.CountryHoliday(country, years=np.arange(2014, 2030, 1)).items()))
        holidayDF['Country'] = country
    return holidayDF
holidays = getholidayDF('FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden')
This returns a blank dataframe. I am not sure how to proceed!

If you change your for-loop as shown below, it should work for you. The most relevant comments were made by user roganjosh. O'Reilly, Wrox, Prentice Hall, Pearson, Packt, just to name a few publishers, have some good books for you; skip the cookbooks for now.
Code snippet:
for country in args:
    holidayDF = holidayDF.append(sorted(holidays.CountryHoliday(country, years=np.arange(2014, 2030, 1)).items()))
    # holidayDF['Country'] = country  # remove this from the for-loop
return holidayDF  # move out of the for-loop
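As a side note, DataFrame.append was deprecated and later removed in pandas 2.0, so a version that collects per-country frames and concatenates them once may age better. A minimal sketch, assuming the holidays and pandas packages are installed (the function name get_holiday_df and the use of range instead of np.arange are my own, purely illustrative):
import holidays
import pandas as pd

def get_holiday_df(*countries):
    # Build one small frame per country, then concatenate once at the end.
    frames = []
    for country in countries:
        items = sorted(holidays.CountryHoliday(country, years=range(2014, 2030)).items())
        df = pd.DataFrame(items, columns=['Date', 'Holiday'])
        df['Country'] = country
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

holiday_df = get_holiday_df('FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden')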

Related

Sorting keywords from a dataframe

I need to write a geo-classifier that assigns a geographical region to each row. That is, if the search query contains the name of a city from a region, the name of that region is written to the 'region' column. If the search query does not contain any city name, 'undefined' is written instead.
I have the following code, which doesn't work:
import pandas as pd

data_location = pd.read_csv(r'\Users\super\Desktop\keywords.csv', sep=',')

def sorting(row):
    keyword_set = row['keywords'].lower()
    for region, city_list in geo_data.items():
        for town in keyword_set:
            if town in city_list:
                return region
    return 'undefined'
Rules of distribution by region (Center, North-West, and Far East):
geo_location = {
    'Центр': ['москва', 'тула', 'ярославль'],  # Center: Moscow, Tula, Yaroslavl
    'Северо-Запад': ['петербург', 'псков', 'мурманск'],  # North-West: Petersburg, Pskov, Murmansk
    'Дальний Восток': ['владивосток', 'сахалин', 'хабаровск']  # Far East: Vladivostok, Sakhalin, Khabarovsk
}
Link to the csv file that is used in the program https://dropmefiles.com/IurAn
I tried to classify with the function above, but it doesn't work. I also had the idea of creating a template of all existing cities and running each line of the file through that template for sorting.
I apologize in advance for such an extensive question; I'm still new to this field and just learning. I would be glad to receive any tips and help.
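No answer is recorded here, but a minimal sketch of one possible fix, assuming the CSV has a 'keywords' column of search queries (the original loop iterates over the characters of the query string and references geo_data, which is never defined):
import pandas as pd

geo_location = {
    'Центр': ['москва', 'тула', 'ярославль'],
    'Северо-Запад': ['петербург', 'псков', 'мурманск'],
    'Дальний Восток': ['владивосток', 'сахалин', 'хабаровск']
}

def classify_region(row):
    # Check whether any known city name appears in the lower-cased query.
    query = row['keywords'].lower()
    for region, city_list in geo_location.items():
        if any(city in query for city in city_list):
            return region
    return 'undefined'

data = pd.read_csv(r'\Users\super\Desktop\keywords.csv', sep=',')
data['region'] = data.apply(classify_region, axis=1)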

Getting specific values from list of key value pairs in dataframe

I've written the code below to get some citation data from an API and write it to a CSV. It works fine except that one of the columns returns a list of authors and it comes into the CSV like this:
[{'authorId': '83129125', 'name': 'June A. Sekera'}, {'authorId': '13328115', 'name': 'A. Lichtenberger'}]
How can I parse this so I get simply a comma-separated list of the authors in a single cell, ignoring the authorId?
import requests
import json
import pandas as pd
# get data from the API
r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")
json = r.json()
df = pd.DataFrame(json['data'])
# new df from the column of lists
split_df = pd.DataFrame(df['citingPaper'].tolist())
# display the resulting df
print(split_df)
split_df.to_csv("citations.csv", index=False)
Something like the below should work (do the authors "cleanup" before populating the df):
import requests
import pandas as pd

r = requests.get("https://api.semanticscholar.org/graph/v1/paper/b5bb17a53f75b48ab5e18c00fb048b783db6b1f4/citations?fields=title,authors,year,url")

data = []
if r.status_code == 200:
    entries = r.json()['data']
    for entry in entries:
        entry['citingPaper']['authors'] = ','.join(x['name'] for x in entry['citingPaper'].get('authors', []))
        data.append(entry['citingPaper'])

df = pd.DataFrame(data)
df.to_csv("citations.csv", index=False)
citations.csv
paperId,url,title,year,authors
1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,https://www.semanticscholar.org/paper/1ebccd3d83ed2fd79bc57cf0d06a8e02ba16180f,"A comparative study on deformation mechanisms, microstructures and mechanical properties of wide thin-ribbed sections formed by sideways and forward extrusion",2021,"Wenbin Zhou,Junquan Yu,Xiaona Lu,Jianguo Lin,T. Dean"
46f209486a9e8f81c77dbbb39991f4045dbc8f7d,https://www.semanticscholar.org/paper/46f209486a9e8f81c77dbbb39991f4045dbc8f7d,The low-carbon steel industry-Interactions between the hydrogen direct reduction of steel and the electricity system,2021,A. Toktarova
5638200cc8b188a48f923b75dee9793c06c99b62,https://www.semanticscholar.org/paper/5638200cc8b188a48f923b75dee9793c06c99b62,Pore-scale assessment of subsurface carbon storage potential: implications for the UK Geoenergy Observatories project,2021,"R. Payton,M. Fellgett,B. Clark,D. Chiarella,A. Kingdon,S. Hier‐Majumder"
6055ea75b377b6776f468ccb9f21551614b5f61f,https://www.semanticscholar.org/paper/6055ea75b377b6776f468ccb9f21551614b5f61f,Can Nature-Based Solutions Deliver a Win-Win for Biodiversity and Climate Change Adaptation?,2021,"Isabel Key,Alison C. Smith,B. Turner,A. Chausson,C. Girardin,Megan MacGillivray,N. Seddon"
8734d34823bcfb362f05df2a48bad19cc026b1c1,https://www.semanticscholar.org/paper/8734d34823bcfb362f05df2a48bad19cc026b1c1,Trends in air travel inequality in the UK: From the few to the many?,2021,"M. Büchs,Giulio Mattioli"
020eecb5f6edf918b6ef1120d97276b8d0748dc7,https://www.semanticscholar.org/paper/020eecb5f6edf918b6ef1120d97276b8d0748dc7,"Decarbonising the critical sectors of aviation, shipping, road freight and industry to limit warming to 1.5–2°C",2020,"M. Sharmina,O. Edelenbosch,C. Wilson,R. Freeman,D. Gernaat,P. Gilbert,A. Larkin,E. Littleton,M. Traut,D. V. van Vuuren,N. Vaughan,F. R. Wood,C. Le Quéré"
1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,https://www.semanticscholar.org/paper/1e77bf66cfe8f94463c73289e4940d0efcc2a5e4,Investments in climate-friendly materials to strengthen the recovery package JUNE 2020,2020,"F. Lettow,Olga Chiappinelli"
3916ee1df6b8e07f8798a90726407554e990847e,https://www.semanticscholar.org/paper/3916ee1df6b8e07f8798a90726407554e990847e,Pathways for Low-Carbon Transition of the Steel Industry—A Swedish Case Study,2020,"A. Toktarova,I. Karlsson,Johan Rootzén,L. Göransson,M. Odenberger,F. Johnsson"
55667e4e2e3c35d7c6fcf98021a083d4397a308d,https://www.semanticscholar.org/paper/55667e4e2e3c35d7c6fcf98021a083d4397a308d,Potentials for reducing climate impact from tourism transport behavior,2020,"Anneli Kamb,E. Lundberg,J. Larsson,Jonas Nilsson"
9fc21836d61fbc76e3923021cb88f94e3e8c5a41,https://www.semanticscholar.org/paper/9fc21836d61fbc76e3923021cb88f94e3e8c5a41,Decarbonization of construction supply chains-Achieving net-zero carbon emissions in the supply chains linked to the construction of buildings and transport infrastructure,2020,
b657008ea89807ccbb204ed1e2f4debe92ce252b,https://www.semanticscholar.org/paper/b657008ea89807ccbb204ed1e2f4debe92ce252b,Roadmap for Decarbonization of the Building and Construction Industry—A Supply Chain Analysis Including Primary Production of Steel and Cement,2020,"I. Karlsson,Johan Rootzén,A. Toktarova,M. Odenberger,F. Johnsson,L. Göransson"
c611929ef750ba845a9c87da862ad1f8c9711e64,https://www.semanticscholar.org/paper/c611929ef750ba845a9c87da862ad1f8c9711e64,"Assessing Carbon Capture: Public Policy, Science, and Societal Need",2020,"June A. Sekera,A. Lichtenberger"
The easiest way I can think of is to use apply(lambda x: ...), building a list of the values stored under the "name" key of each dictionary p in each item of the authors column.
Add this underneath split_df = pd.DataFrame(...):
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x])
split_df["authors"][0]
#Out: ['Wenbin Zhou', 'Junquan Yu', 'Xiaona Lu', 'Jianguo Lin', 'T. Dean']
Edit
To have blank "" if there are no authors:
split_df["authors"] = split_df["authors"].apply(lambda x: [p["name"] for p in x] if len(x) > 0 else "")

Bloomberg APIs - historical index members in Python

I'm trying to get index members using Bloomberg APIs in Python. I have no issues getting current constituents, but I want a historical list (for example: what were the Russell 1000 or S&P 500 constituents as of Q1 1995).
To get the current index members I can use following:
In Excel I can use INDX_MEMBERS to get the constituents:
=BDS("Index Ticker", INDX_MEMBERS)
In Python:
import pybbg

def Main():
    bbg = pybbg.Pybbg()
    IndexConst = bbg.bds('IndexName', 'INDX_MEMBERS')
or:
from tia.bbg import LocalTerminal
resp = LocalTerminal.get_reference_data(index_ticker + ' INDEX', 'INDX_MEMBERS')
members = resp.as_frame().iloc[0,0]
The question is how I can get historical index members/constituents. For example, I would generate quarterly dates and then want the list of constituents for each date:
['2020-06-30',
'2020-03-31',
'2019-12-31',
'2019-09-30',
'2019-06-30',
'2019-03-31',
'2018-12-31' ... '1980-06-30',]
I've tried many solutions, including one below where I'm getting an empty frame:
from tia.bbg import LocalTerminal
date_start = datetime.date(2010,6,28)
date_end = datetime.date(2020,6,28)
members_russell1000_3 = LocalTerminal.get_historical('RIY Index', 'INDX_MEMBERS',start=date_start, end=date_end,).as_frame()
or the solution below, where regardless of date (now or 20 years ago) I'm receiving the same list of constituents:
from xbbg import blp
members = blp.bds('RIY Index', 'INDX_MEMBERS', DVD_Start_Dt=k[1], DVD_End_Dt=k[1])
Variable Explanation to above examples:
'RIY Index' - Russell 1000 index ticker
'INDX_MEMBERS' - Bloomberg field (flds) for list of index constituents
Alternatively, I would be happy if I could get a historical list of changes to index constituents with dates (I already have the current constituents).
You need to use the INDX_MWEIGHT_PX field and the END_DATE_OVERRIDE override (date format: yyyymmdd). It is a reference data request, so probably bds rather than bdh in the Python library, but I've never used it, so I'm not 100% sure; you may need to try a few variants until you find the correct one.
I've found that the below works
blp.bds('RIY Index', "INDX_MWEIGHT", END_DATE_OVERRIDE="20210101")
and gives the same results as an Excel query
=BDS("RIY Index", "INDX_MWEIGHT_HIST", "END_DATE_OVERRIDE",'20210101')
Alternatively, using "INDX_MWEIGHT_PX" also returns the actual weight and current price values.
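To build the quarterly history the question asks about, one could loop over quarter-end dates and pass each one as the override. A rough sketch based on the xbbg call above, assuming a running Bloomberg terminal (untested here; the as_of column name is just illustrative):
import pandas as pd
from xbbg import blp

frames = []
for dt in pd.date_range('1995-03-31', '2020-06-30', freq='Q'):
    members = blp.bds('RIY Index', 'INDX_MWEIGHT', END_DATE_OVERRIDE=dt.strftime('%Y%m%d'))
    members['as_of'] = dt  # tag each membership snapshot with its as-of date
    frames.append(members)

history = pd.concat(frames, ignore_index=True)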

How to combine two sets of data with differences in merge-index strings?

I want to merge two CSV files with soccer data. They hold different data for the same and for different games (partial overlap). Normally I would do a merge with df.merge, but the problem is that the nomenclature differs for some teams in the two datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to normalize the team names across the two datasets so that I can do a simple df.merge operation on dates and team names. At the moment this would result in extra rows when a team has different names in the two sets.
So my main question is: how can I normalize the team names in the two sets easily, without having to analyse all the differences "by hand" and hardcode "replace" operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam awayteam date homeproba drawproba awayproba homexg awayxg
Manchester United Leicester 2018-08-10 22:00:00 0.2812 0.3275 0.3913 1.5137 1.73813
--Edit after first comments--
So the main question is: how could I automatically analyse the naming differences between the two datasets? Helpful facts:
As the sets hold whole seasons, the overlap per team name is at least 30+ games.
Most of the teams have the same names; name differences affect only the smaller part of the teams.
Most name differences have at least a common substring.
Both datasets have date information for the games.
We know a team plays only one game per day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to analyse that: {'Atletic Club':'Atletic Bilbao'}
So this is how I finally solved it:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a dataframe holding each team's name existing in both datasets like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...
Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. I think this type of thing is always super fragile, though, and you shouldn't really rely on it.
import pandas as pd

names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()

mapping_dict = dict()
for common_name in set(names_1).intersection(set(names_2)):
    mapping_dict[common_name] = common_name

unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))

trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]

# Merge on home team and date so the two away-team columns line up side by side.
aligned_data = trim_df1.merge(trim_df2, on=['hometeam', 'date'], how='inner', suffixes=('_1', '_2'))

for unknown_name in unknown_1:
    matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
    if len(matching_name) != 1:
        raise ValueError("Couldn't find a unique match")
    mapping_dict[unknown_name] = matching_name[0]
    unknown_2.remove(matching_name[0])

if len(unknown_2) != 0:
    raise ValueError("We have extra team names for some reason")

Cannot convert object type to string; and then filter on that string; python pandas dataframe

I am trying to pull all stock tickers from the NYSE and then filter for only those with a MarketCap above 5B.
I am running into a problem because, based on how my data load comes in, all columns are of data type "object" and I cannot find any way to convert them to anything else. See my code and comments below:
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
df = pd.DataFrame.from_csv(url_nyse)
df = df.drop(df.columns[[0, 1, 3, 6,7]], axis=1)
This is my initial data load of NYSE stocks, and then I filter for just MarketCap, Sector, and Industry.
At first I was hoping to filter MarketCap by removing anything with "M" in it and then stripping the first and last characters to get a number, which could then be filtered to keep anything above 5. However, I think because the data types are "object" and not string, I have not been able to do it directly. So I then created new columns containing only letters or only numbers, hoping that I could then convert them to string and float from there.
df['MarketCap_Num'] = df.MarketCap.str[1:-1]
df['Billion_Filter'] = df.MarketCap.str[-1:]
So the MarketCap_Num column has only the numbers (the first and last characters removed), and Billion_Filter is only the last character, from which I will remove any value equal to M.
However, even though these columns contain just numbers or just letters, I cannot find any way to convert them from the object dtype, so my filtering is not working at all. Any help is much appreciated.
I have tried .astype(float), pd.to_numeric, and type functions without success.
My filtering code would then be:
df[df.Billion_Filter.str.contains("B")]
But when I run that, nothing happens: no error, but also no filtering. When I run this code on a different table it works, so it must be the object data type that is holding it up.
Convert the MarketCap column into floats by first removing the dollar signs and then substituting B with e9 and M with e6. This should make it easy to use .astype(float) on the column to do the conversion.
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
df = pd.DataFrame.from_csv(url_nyse)
df = df.drop(df.columns[[0, 1, 3, 6,7]], axis=1)
df = df.replace({'MarketCap': {'\$': '', 'B': 'e9', 'M': 'e6', 'n/a': np.nan}}, regex=True)
df.MarketCap = df.MarketCap.astype(float)
print(df[df.MarketCap > 5000000000].head(10))
Yields:
MarketCap Sector industry
Symbol
MMM 1.419900e+11 Health Care Medical/Dental Instruments
WUBA 1.039000e+10 Technology Computer Software: Programming, Data Processing
ABB 5.676000e+10 Consumer Durables Electrical Products
ABT 9.887000e+10 Health Care Major Pharmaceuticals
ABBV 1.563200e+11 Health Care Major Pharmaceuticals
ACN 9.388000e+10 Miscellaneous Business Services
AYI 7.240000e+09 Consumer Durables Building Products
ADNT 7.490000e+09 Capital Goods Auto Parts:O.E.M.
AAP 7.370000e+09 Consumer Services Other Specialty Stores
ASX 1.083000e+10 Technology Semiconductors
You should be able to change the type of the MarketCap_Num column by using:
df['MarketCap_Num'] = df.MarketCap.str[1:-1].astype(np.float64)
You can then check the data types by df.dtypes.
As for the filter, you can simply say
df_filtered = df[df['Billion_Filter'] =="B"].copy()
since you will only have one letter in your Billion_Filter column.
Object dtype works like string. You should be able to use both str.contains and extract the number without having to convert the object type to string:
df = df[df['MarketCap'].str.contains('B')].copy()
df['MarketCap'] = df['MarketCap'].str.extract(r'(\d+\.?\d*)', expand=False)
MarketCap Sector industry
Symbol
DDD 1.12 Technology Computer Software: Prepackaged Software
MMM 141.99 Health Care Medical/Dental Instruments
WUBA 10.39 Technology Computer Software: Programming, Data Processing
EGHT 1.32 Public UtilitiesTelecommunications Equipment
AIR 1.48 Capital Goods Aerospace
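For completeness, a sketch combining the two ideas above, assuming MarketCap still holds strings like "$7.37B": keep the billion-cap rows, extract the numeric part, convert to float, and filter for values above 5 (i.e. above $5B). Column and variable names here are illustrative.
# Keep only rows whose MarketCap ends in B, then extract and convert the number.
large_caps = df[df['MarketCap'].str.contains('B', na=False)].copy()
large_caps['MarketCap_Num'] = large_caps['MarketCap'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
large_caps = large_caps[large_caps['MarketCap_Num'] > 5]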
