I'm trying to obtain a table of data containing just the country, year and value from this World Bank API, but I can't seem to filter for just the data I want. I've seen that these types of questions have already been asked, but none of the answers seemed to work.
Would really appreciate some help. Thank you!
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
url ="http://api.worldbank.org/v2/country/{}/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
country = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC""HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
html = {}
for i in country:
    url_one = url.format(i)
    html[i] = requests.get(url_one).json()

my_values = []
for i in country:
    value = html[i][1][0]['value']
    my_values.append(value)
Edit
My data currently looks like this; I'm trying to extract the country name, which is in {'country': {'id': 'AO', 'value': 'Angola'}}, plus the 'date' and the 'value'.
Edit 2
Got the data I'm looking for, but it's repeated twice for each entry.
Note: I assumed it would be useful to store information for all the years at once, not only for one year; that lets you simply filter in later processing. Also take a look: there is a missing "," between your countries "GRC""HUN".
There are different options to achieve your goal; here are two of them to point you in the right direction.
Option #1
Pick the information needed from the JSON response, create a reshaped dict and append() it to my_values:
for d in data[1]:
    my_values.append({
        'country': d['country']['value'],
        'date': d['date'],
        'value': d['value']
    })
Example
import requests
import pandas as pd
url = 'http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?format=json'
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC","HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
my_values = []
for country in countries:
    data = requests.get(url % country).json()
    try:
        for d in data[1]:
            my_values.append({
                'country': d['country']['value'],
                'date': d['date'],
                'value': d['value']
            })
    except Exception as err:
        print(f'[ERROR] country ==> {country} error ==> {err}')

pd.DataFrame(my_values).sort_values(['country', 'date'], ascending=True)
Option #2
Create dataframes directly from the JSON response, concat them and make some adjustments on the final dataframe:
for d in data[1]:
    my_values.append(pd.DataFrame(d))
...
pd.concat(my_values).loc[['value']][['country','date','value']].sort_values(['country', 'date'], ascending=True)
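Put together, a runnable sketch of Option #2 (country list shortened for brevity; the .loc[['value']] keeps one row per record, since the nested dicts otherwise contribute both an 'id' and a 'value' row):

import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?format=json'
my_values = []
for country in ['DZA', 'AGO']:  # shortened list for brevity
    data = requests.get(url % country).json()
    try:
        for d in data[1]:
            my_values.append(pd.DataFrame(d))  # one small frame per record, indexed by 'id'/'value'
    except Exception as err:
        print(f'[ERROR] country ==> {country} error ==> {err}')

result = (pd.concat(my_values)
          .loc[['value']][['country', 'date', 'value']]
          .sort_values(['country', 'date'], ascending=True))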
Output
country     date   value
Algeria     1971   341.389
Algeria     1972   442.678
Algeria     1973   554.293
Algeria     1974   818.008
Algeria     1975   936.79
...         ...    ...
Zimbabwe    2016   1464.59
Zimbabwe    2017   1235.19
Zimbabwe    2018   1254.64
Zimbabwe    2019   1316.74
Zimbabwe    2020   1214.51
Pandas' read_json method needs a valid JSON str, path object or file-like object, but what you passed it was not valid JSON.
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
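For illustration, a minimal sketch of the difference (newer pandas versions also want literal JSON text wrapped in StringIO):

import io
import pandas as pd
import requests

resp = requests.get('http://api.worldbank.org/v2/country/DZA/indicator/NY.GDP.PCAP.CD?date=2015&format=json')
data = resp.json()  # parses the body into Python objects (here: [metadata, records])

# read_json, by contrast, expects JSON text, not an already-parsed object:
df = pd.read_json(io.StringIO('[{"date": "2015", "value": 4177.9}]'))  # illustrative JSON only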
Try this:
import requests
import pandas as pd
url = "http://api.worldbank.org/v2/country/%s/indicator/NY.GDP.PCAP.CD?date=2015&format=json"
countries = ["DZA","AGO","ARG","AUS","AUT","BEL","BRA","CAN","CHL","CHN","COL","CYP", "CZE","DNK","FIN","FRA","GEO","DEU",
"GRC""HUN","ISL","IND","IDN","IRL","ISR","ITA","JPN","KAZ","KWT","LBN","LIE","MYS","MEX","MCO","MAR","NPL","NLD",
"NZL","NGA","NOR","OMN","PER","PHL","POL","PRT","QAT","ROU","SGP","ZAF","ESP","SWE","CHE","TZA","THA","TUR","UKR",
"GBR","USA","VNM","ZWE"]
datas = []
for country in countries:
    data = requests.get(url % country).json()
    try:
        values = data[1][0]
        datas.append(pd.DataFrame(values))
    except Exception as err:
        print(f"[ERROR] country ==> {country} with error ==> {err}")

df = pd.concat(datas)
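Note that pd.DataFrame(values) produces two rows per country, indexed 'id' and 'value', because of the nested indicator/country dicts; that is why the data looks "repeated twice each". A hedged way to keep one row per country:

# keep only the 'value'-indexed rows and the columns of interest
df = df.loc[['value'], ['country', 'date', 'value']].reset_index(drop=True)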
Related
I'm trying to scrape the 'Profile and investment' table from the following url: https://markets.ft.com/data/funds/tearsheet/summary?s=LU0526609390:EUR, using the following code:
import requests
import pandas as pd
# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='+ df['List']
for url in urls:
    r = requests.get(url).content
    df = pd.read_html(r)[0]
    print(df)
However, when I use the pd.read_html function, I get the following error code:
ValueError: invalid literal for int() with base 10: '100%'
because the table has entries in %. Is there a way to make Pandas accept % values?
My required output is to get a table with the following format:
Fund_ID Fund_type Income_treatment Morningstar category ......
LU0526609390:EUR ... ... ....
IE00BHBX0Z19:EUR ... ... ....
LU1076093779:EUR ... ... ....
LU1116896363:EUR ... ... ....
The issue is the site uses the 'colspan' attribute with a % value instead of an int. As AsishM mentions in the comments:
browsers are usually more lenient with things like %, but the html spec for colspan clearly mentions this should be an integer. Browsers treat 100% as 100. mdn link. It's not a pandas problem per se.
These should be in the form of an int, and while some browsers will accommodate that, pandas specifically wants the appropriate syntax of:
<td colspan="number">
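To see what the parser objects to, a small hedged illustration (assumes lxml or html5lib is installed for read_html):

import pandas as pd

good = '<table><tr><td colspan="2">x</td></tr><tr><td>a</td><td>b</td></tr></table>'
bad = '<table><tr><td colspan="100%">x</td></tr><tr><td>a</td><td>b</td></tr></table>'

pd.read_html(good)   # parses fine
# pd.read_html(bad)  # raises ValueError: invalid literal for int() with base 10: '100%'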
Ways to approach this are:
1. Use BeautifulSoup to fix those attributes.
2. Since it's not within the table you actually want to parse, use BeautifulSoup to grab that first table, and then you don't need to worry about it.
3. See if the table has a specific attribute you could add to .read_html() as a parameter so it grabs only that specific table.
I chose option 2 here:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Define all urls required for data scraping from the FT Website - if a new fund is added, simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s=' + df['List']
frames = []
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    table = soup.find('table')  # the first table is 'Profile and investment'
    frames.append(pd.read_html(str(table), index_col=0)[0].T)
# DataFrame.append was removed in pandas 2.0, so collect the frames and concat once
results = pd.concat(frames, sort=False).reset_index(drop=True)
print(results)
Output:
print(results.to_string())
0 Fund type Income treatment Morningstar category IMA sector Launch date Price currency Domicile ISIN Manager & start date Investment style (bonds) Investment style (stocks)
0 SICAV Income Global Bond - EUR Hedged -- 06 Aug 2010 GBP Luxembourg LU0526609390 Jonathan Gregory01 Nov 2012Vivek Acharya09 Dec 2015Simon Foster01 Nov 2012 NaN NaN
1 Open Ended Investment Company Income EUR Diversified Bond -- 21 Feb 2014 EUR Ireland IE00BHBX0Z19 Lorenzo Pagani12 May 2017Konstantin Veit01 Jul 2019 Credit Quality: HighInterest-Rate Sensitivity: Mod NaN
2 SICAV Income Eurozone Large-Cap Equity -- 11 Jul 2014 GBP Luxembourg LU1076093779 NaN NaN Market Cap: LargeInvestment Style: Blend
3 SICAV Income EUR Flexible Bond -- 01 Dec 2014 EUR Luxembourg LU1116896363 NaN NaN NaN
Here's how you could use BeautifulSoup to fix those colspan attributes.
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Define all urls required for data scraping from the FT Website - if new fund is added simply add the appropriate Fund ID to the List
List = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR', 'LU1076093779:EUR', 'LU1116896363:EUR']
df = pd.DataFrame(List, columns=['List'])
urls = 'https://markets.ft.com/data/funds/tearsheet/summary?s='+ df['List']
for url in urls:
    print(url)
    r = requests.get(url).content
    soup = BeautifulSoup(r, 'html.parser')
    all_colspan = soup.find_all(attrs={'colspan': True})
    for colspan in all_colspan:
        colspan.attrs['colspan'] = colspan.attrs['colspan'].replace('%', '')
    df = pd.read_html(str(soup))
I am trying to pull search results data from an API on a website and put it into a pandas dataframe. I've been able to successfully pull the info from the API into a JSON format.
The next step I'm stuck on is how to loop through the search results on a particular page and then again for each page of results.
Here is what I've tried so far:
#Step 1: Connect to an API
import requests
import json
response_API = requests.get('https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page=1')
#200
#Step 2: Get the data from API
data = response_API.text
#Step 3: Parse the data into JSON format
parse_json = json.loads(data)
#Step 4: Extract data
title = parse_json['results'][0]['title']
pub_date = parse_json['results'][0]['publication_date']
agency = parse_json['results'][0]['agencies'][0]['name']
Here is where I've tried to put this all into a loop:
import numpy as np
import pandas as pd
df = []
for page in np.arange(0, 7):
    url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
    response_API = requests.get(url)
    print(response_API.status_code)
    data = response_API.text
    parse_json = json.loads(data)
    for i in parse_json:
        title = parse_json['results'][i]['title']
        pub_date = parse_json['results'][i]['publication_date']
        agency = parse_json['results'][i]['agencies'][0]['name']
        df.append([title, pub_date, agency])

cols = ["Title", "Date", "Agency"]
df = pd.DataFrame(df, columns=cols)
I feel like I'm close to the correct answer, but I'm not sure how to move forward from here. I need to iterate through the results where I placed the i's when parsing the JSON data, but I get an error that reads, "TypeError: list indices must be integers or slices, not str". I understand I can't put the i's in those spots, but how else am I supposed to iterate through the results?
Any help would be appreciated!
Thank you!
I think you are very close!
import numpy as np
import pandas as pd
import requests
BASE_URL = "https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}"
results = []
for page in range(0, 7):
    response = requests.get(BASE_URL.format(page=page))
    if response.ok:
        resp_json = response.json()
        for res in resp_json["results"]:
            results.append(
                [
                    res["title"],
                    res["publication_date"],
                    [agency["name"] for agency in res["agencies"]]
                ]
            )

df = pd.DataFrame(results, columns=["Title", "Date", "Agencies"])
In this block of code, I used the requests library's built-in .json() method, which can automatically convert a response's text to a JSON dict (if it's in the proper format).
The if response.ok is a little less-verbose way provided by requests to check if the status code is < 400, and can prevent errors that might occur when attempting to parse the response if there was a problem with the HTTP call.
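For reference, a rough equivalent of that check spelled out (or, if you'd rather fail loudly, requests' raise_for_status()):

if response.status_code < 400:  # what response.ok evaluates
    resp_json = response.json()

# alternatively, raise requests.HTTPError on 4xx/5xx instead of skipping the page:
# response.raise_for_status()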
Finally, I'm not sure what data you need exactly for your DataFrame, but each object in the "results" list from the pages pulled from that website has "agencies" as a list of agencies. I wasn't sure if you wanted to drop all that data, so I kept the names as a list.
Edit:
In case the response objects don't contain the proper keys, we can use the .get() method of Python dictionaries.
# ...snip
for res in resp_json["results"]:
    results.append(
        [
            res.get("title"),  # returns None as a default, instead of raising a KeyError
            res.get("publication_date"),
            [
                # get 'name', falling back to 'raw_name' (or None) if 'name' doesn't exist
                agency.get("name", agency.get("raw_name"))
                for agency in res.get("agencies", [])
            ]
        ]
    )
Slightly different approach: rather than iterating through the response, read it into a dataframe, then save what you need. This saves the first agency name in the list.
df_list = []
for page in np.arange(0, 7):
    url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
    response_API = requests.get(url)
    # print(response_API.status_code)
    data = response_API.text
    parse_json = json.loads(data)
    df = pd.json_normalize(parse_json['results'])
    df['Agency'] = df['agencies'][0][0]['raw_name']
    df_list.append(df[['title', 'publication_date', 'Agency']])

df_final = pd.concat(df_list)
df_final
title publication_date Agency
0 Determination of the Promotion of Economy and ... 2021-09-28 OFFICE OF MANAGEMENT AND BUDGET
1 Corporate Average Fuel Economy Standards for M... 2021-09-03 OFFICE OF MANAGEMENT AND BUDGET
2 Public Hearing for Corporate Average Fuel Econ... 2021-09-14 OFFICE OF MANAGEMENT AND BUDGET
3 Investigation of Urea Ammonium Nitrate Solutio... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
4 Call for Nominations To Serve on the National ... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
.. ... ... ...
15 Energy Conservation Program: Test Procedure fo... 2021-09-14 DEPARTMENT OF COMMERCE
16 Self-Regulatory Organizations; The Nasdaq Stoc... 2021-09-09 DEPARTMENT OF COMMERCE
17 Regulations To Improve Administration and Enfo... 2021-09-20 DEPARTMENT OF COMMERCE
18 Towing Vessel Firefighting Training 2021-09-01 DEPARTMENT OF COMMERCE
19 Patient Protection and Affordable Care Act; Up... 2021-09-27 DEPARTMENT OF COMMERCE
[140 rows x 3 columns]
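One caveat: df['agencies'][0][0]['raw_name'] always reads the first agency of the first row, so every row on a page gets that same agency (visible in the output above). If you want each row's own first agency, a hedged tweak, assuming each agency dict carries a 'raw_name' key:

# take each row's own first agency instead of the page's first row
df['Agency'] = df['agencies'].apply(lambda a: a[0].get('raw_name') if a else None)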
I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
life, health, lifegdp = [], [], []
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])

newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for the calculation is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get the error message: 4 columns passed, passed data had 210 columns. I have 210 countries, so I assume it is somehow putting those values into the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help required! I started coding about two weeks ago, so I'm very much a novice at this.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer: this problem is a good example use case for pandas.DataFrame.diff.
Note you don't actually need to sort your example data, but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
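With the sample rows shown in the question, the first line of new_df would come out roughly as (each value is just 2016 minus 2000 for Algeria):

   Country Code   life    health   lifegdp
0  Algeria  DZA  5.786  3.114811 -8.626299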
You could do this instead
country_list = df.Country.unique().tolist()
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2016)].reset_index(drop=True)
df_2000 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2000)].reset_index(drop=True)
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)
I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) has the column headers:
Fiscal year is January-December. All values CAD millions.   2015   2016   2017   2018   2019

expense = pd.concat(df[1:2]) has the column headers:
Unnamed: 0   2015   2016   2017   2018   2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
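For example, a minimal sketch with stand-in frames (column names taken from your post; pd.concat shown since DataFrame.append was removed in pandas 2.0):

import pandas as pd

# stand-in frames mimicking the two scraped tables
revenue = pd.DataFrame({'Fiscal year is January-December. All values CAD millions.': ['Sales'],
                        '2019': [100]})
expense = pd.DataFrame({'Unnamed: 0': ['Interest expense'], '2019': [40]})

# give both frames the same first-column name, then stack them
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)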
I managed to append the dataframes with the following code. Thanks David for putting me on the right track. I admit this is not the best way to do this because, in a runtime environment, I don't know the value of the text to rename and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
Using the df=pd.read_html(url) construct, several lists are returned when scraping marketwatch financials. The below function returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = pd.concat(df[rowidx + 1:rowidx + 2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
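Note that DataFrame.append was removed in pandas 2.0; a hedged equivalent for the loop above collects the pieces in a list and concatenates once:

# inside getBalanceSheet, replacing the statement.append(...) loop:
frames = [statement]
for rowidx in range(count):
    piece = pd.concat(df[rowidx + 1:rowidx + 2]).dropna(axis='columns')
    piece = piece.rename({'Unnamed: 0': 'LineItem'}, axis=1)
    frames.append(piece)
statement = pd.concat(frames, ignore_index=True)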
I'm trying to convert the World Bank API's JSON response into pandas DataFrame format, from this API link: http://api.worldbank.org/countries/indicators/6.0.GDP_growth?per_page=100&date=2000:2015&format=json
Which has the page response of:
[{"page":1,"pages":147,"per_page":"2","total":294},[{"indicator":{"id":"6.0.GDP_growth","value":"GDP
growth (annual %)"},"country":{"id":"L5","value":"Andean
Region"},"value":null,"decimal":"0","date":"2001"},{"indicator":{"id":"6.0.GDP_growth","value":"GDP
growth (annual %)"},"country":{"id":"L5","value":"Andean
Region"},"value":null,"decimal":"0","date":"2000"}]]
I'm trying to get a data frame similar to:
Country Name GDP_growth
0 Afghanistan 14.43474129
1 Albania 1.623698601
2 Algeria 3.299991384
3 American Samoa ..
4 Andorra -1.760010328
Here are the commands I have called to date:
from urllib2 import Request, urlopen
import json
from pandas.io.json import json_normalize

request = Request('http://api.worldbank.org/countries/indicators/6.0.GDP_growth?per_page=100&date=2000:2015&format=json')
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
json_normalize(data['indicator'])
TypeError                                 Traceback (most recent call last)
----> 1 json_normalize(data['indicator'])

TypeError: list indices must be integers, not str
Would appreciate help on the last line.
Thanks!
data is a list at this point. You can see it a bit better when you pretty-print it:
from pprint import pprint
pprint(data)
The first item with an indicator field is data[1][0].
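So the records need to come from data[1], not data['indicator']. A hedged sketch of the normalization (nested fields come out as dotted columns such as 'country.value'):

from pandas.io.json import json_normalize  # pandas.json_normalize in newer versions

# data[0] is paging metadata; data[1] holds the list of records
df = json_normalize(data[1])
df = df[['country.value', 'value']].rename(
    columns={'country.value': 'Country Name', 'value': 'GDP_growth'})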