I used Python 3 and Beautiful Soup 4 to parse a webpage from the Hong Kong Stock Exchange. However, the table (i.e. "No. of listed companies", "No. of listed H shares", and so on) under "HONG KONG AND MAINLAND MARKET HIGHLIGHTS" cannot be extracted. Here is the link: "https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=10&select1=0"
Please advise.
My code:
import requests
from bs4 import BeautifulSoup
import csv
import sys
import os
result = requests.get("https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=10&select1=3")
result.raise_for_status()
result.encoding = "utf-8"
src = result.content
soup = BeautifulSoup(src, 'lxml')
print(soup.prettify())
print(" ")
print("soup.pretty() printed")
print(" ")
wait = input("PRESS ENTER TO CONTINUE.")
table = soup.find_all('table')
print(table)
print(" ")
print("TABLE printed")
print(" ")
wait2 = input("PRESS ENTER TO CONTINUE.")
There is no need to render the page first, because the data is available in JSON format. The tricky part is that the JSON describes how to render the table (td values, colspan values, etc.), so a little work is needed to iterate through it, but it is not hard to do:
import requests
import pandas as pd

url = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData'
payload = {
    'LangCode': 'en',
    'TDD': '1',
    'TMM': '11',
    'TYYYY': '2019'}
jsonData = requests.get(url, params=payload).json()

# Build one single-row DataFrame per table row, repeating each cell by its colspan
frames = []
for row in jsonData['data']:
    data_row = []
    for idx, colspan in enumerate(row['colspan']):
        colspan_int = int(colspan[0])
        data_row.append(row['td'][idx] * colspan_int)
    flat_list = [item for sublist in data_row for item in sublist]
    frames.append(pd.DataFrame([flat_list]))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
final_df = pd.concat(frames, sort=True).reset_index(drop=True)

# Keep only the "Total market capitalisation" row and its first two columns
df = final_df[final_df[0].str.contains(r'Total market capitalisation(?!$)')].iloc[:, :2].copy()
# `date` was not defined in the original snippet; it is assumed here to be the requested date
date = '{TDD}/{TMM}/{TYYYY}'.format(**payload)
df['date'] = date
df.to_csv('file.csv', index=False)
Output:
print (final_df.to_string())
0 1 2 3 4 5 6
0 Hong Kong <br>Exchange (01/11/2019 ) Hong Kong <br>Exchange (01/11/2019 ) Shanghai Stock<br>Exchange (01/11/2019 ) Shanghai Stock<br>Exchange (01/11/2019 ) Shenzhen Stock<br>Exchange (01/11/2019 ) Shenzhen Stock<br>Exchange (01/11/2019 )
1 Main Board GEM A Share B Share A Share B Share
2 No. of listed companies 2,031 383 1,488 50 2,178 47
3 No. of listed H shares 256 22 n.a. n.a. n.a. n.a.
4 No. of listed red-chips stocks 170 5 n.a. n.a. n.a. n.a.
5 Total no. of listed securities 12,573 384 n.a. n.a. n.a. n.a.
6 Total market capitalisation<br>(Bil. dollars) HKD 31,956 HKD 109 RMB 32,945 RMB 81 RMB 22,237 RMB 50
7 Total negotiable <br>capitalisation (Bil. doll... n.a. n.a. RMB 28,756 RMB 81 RMB 16,938 RMB 49
8 Average P/E ratio (Times) 11.16 19.76 13.90 9.18 24.70 9.55
9 Total turnover <br>(Mil. shares) 196,082 560 15,881 15 22,655 14
10 Total turnover <br>(Mil. dollars) HKD 79,397 HKD 160 RMB 169,934 RMB 85 RMB 260,208 RMB 57
11 Total market turnover<br>(Mil. dollars) HKD 79,557 HKD 79,557 RMB 176,232 RMB 176,232 RMB 260,264 RMB 260,264
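The colspan expansion used in the answer can be tried offline. The sketch below is only an illustration: the `sample` dictionary is a hypothetical stand-in whose shape (a `td` list of one-item lists plus a parallel `colspan` list) is assumed from the loop in the answer, and `expand_row` is a helper name invented here. It also shows one way to strip the literal `<br>` tags that appear in the output.

```python
import pandas as pd

# Hypothetical sample mimicking the assumed response shape: each row has a list of
# cells ("td", each a one-item list) and a parallel list of colspans.
sample = {
    "data": [
        {"td": [[""], ["Hong Kong <br>Exchange"], ["Shanghai Stock<br>Exchange"]],
         "colspan": ["1", "2", "2"]},
        {"td": [["No. of listed companies"], ["2,031"], ["383"], ["1,488"], ["50"]],
         "colspan": ["1", "1", "1", "1", "1"]},
    ]
}


def expand_row(row):
    """Repeat every cell as many times as its colspan says."""
    cells = []
    for td, colspan in zip(row["td"], row["colspan"]):
        cells.extend(td * int(colspan[0]))
    return cells


table = pd.DataFrame([expand_row(row) for row in sample["data"]])
# Replace the literal <br> tags (and any surrounding spaces) with a single space
table = table.replace(r"\s*<br>\s*", " ", regex=True)
print(table.to_string())
```

Expanding the spanned header cells first means every row ends up with the same number of columns, so the rows line up in one DataFrame.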
Related
I have been trying to scrape all data from the first page to the last page, but it returns only the first page as the output. How can I solve this? Below is my code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
pages = np.arange(2, 1589, 20)
for page in pages:
    page = requests.get( "https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page="+str(page))
    sleep(randint(2,10))
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('div', class_="project-card-vertical h-full flex flex-col rounded border-thin border-inactive-blue overflow-hidden pointer")
    for list in lists:
        title = list.find('p', class_ ="project-location text-body text-base mb-3").text.replace('\n', '').strip()
        location = list.find('span', class_ ="text-gray-1").text.replace('\n', '').strip()
        status = list.find('span', class_ ="text-purple-1 font-bold").text.replace('\n', '').strip()
        units = list.find('span', class_ ="text-body font-semibold").text.replace('\n', '').strip()
        info = [title,location,status,units]
        print(info)
The page is loaded dynamically through an API, so a regular GET request to the page URL will always return the first page. You need to study how the page communicates with the server (for example, in the browser's network tab) and find the request you need. I wrote an example for review:
import requests

def get_info(page):
    url = "https://services.estateintel.com/api/v2/properties"
    # Let requests encode the query string, including the "type[]" key
    params = {'type[]': 'residential', 'page': page}
    headers = {
        'accept': 'application/json',
        'authorization': 'false',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
    }
    response = requests.get(url, params=params, headers=headers)
    json_obj = response.json()
    for data in json_obj['data']:
        print(data['name'])
        print(data['area'], data['state'])
        print(data['status'])
        print(data['size']['value'], data['size']['unit'])
        print('------')

for page in range(1, 134):
    get_info(page)
You can choose whichever fields you need; this is just an example, and the records can also be added to a DataFrame. Output:
Twin Oaks Apartment
Kilimani Nairobi
Completed
0 units
------
Duchess Park
Lavington Nairobi
Completed
62 units
------
Greenvale Apartments
Kileleshwa Nairobi
Completed
36 units
------
The Urban apartments & Suites
Osu Greater Accra
Completed
28 units
------
Chateau Towers
Osu Greater Accra
Completed
120 units
------
Cedar Haus Gardens
Oluyole Oyo
Under Construction
38 units
------
10 Agoro Street
Oluyole Oyo
Completed
1 units
..............
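To put the records into a DataFrame instead of printing them, something along these lines should work. This is a minimal offline sketch: `sample` is a hypothetical payload that only mirrors the fields printed above, and `to_records` is a helper name made up for the illustration.

```python
import pandas as pd

# Hypothetical payload shaped like the fields printed above
sample = {
    "data": [
        {"name": "Duchess Park", "area": "Lavington", "state": "Nairobi",
         "status": "Completed", "size": {"value": 62, "unit": "units"}},
        {"name": "Chateau Towers", "area": "Osu", "state": "Greater Accra",
         "status": "Completed", "size": {"value": 120, "unit": "units"}},
    ]
}


def to_records(json_obj):
    """Flatten each item into one dict so pandas can build a row from it."""
    records = []
    for item in json_obj["data"]:
        size = item.get("size") or {}  # tolerate a missing or null size
        records.append({
            "name": item.get("name"),
            "area": item.get("area"),
            "state": item.get("state"),
            "status": item.get("status"),
            "size_value": size.get("value"),
            "size_unit": size.get("unit"),
        })
    return records


df = pd.DataFrame(to_records(sample))
print(df)
```

In the real loop you would extend one shared list with `to_records(response.json())` for every page and build the DataFrame once at the end.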
I think your approach works well, but the requests need time to sleep between pages. You could also select your elements more specifically, e.g. with CSS selectors, and store the information in a list of dicts instead of just printing it.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
data = []
for page in range(1,134):
    print(page)
    # Use a separate name for the response so the loop variable is not overwritten
    response = requests.get( "https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page="+str(page))
    sleep(randint(2,10))
    soup = BeautifulSoup(response.content, 'html.parser')
    for item in soup.select('div.project-grid > a'):
        data.append({
            'title' : item.h3.text.strip(),
            'location' : item.find('span', class_ ="text-gray-1").text.strip(),
            'status' : item.find('span', class_ ="text-purple-1 font-bold").text.strip(),
            'units' : item.find('span', class_ ="text-body font-semibold").text.strip()
        })
pd.DataFrame(data)
Output
    title                          location               status              units
0   Twin Oaks Apartment            Kilimani, Nairobi      Completed           Size: --
1   Duchess Park                   Lavington, Nairobi     Completed           Size: 62 units
2   Greenvale Apartments           Kileleshwa, Nairobi    Completed           Size: 36 units
3   The Urban apartments & Suites  Osu, Greater Accra     Completed           Size: 28 units
4   Chateau Towers                 Osu, Greater Accra     Completed           Size: 120 units
5   Cedar Haus Gardens             Oluyole, Oyo           Under Construction  Size: 38 units
6   10 Agoro Street                Oluyole, Oyo           Completed           Size: 1 units
7   Villa O                        Oluyole, Oyo           Completed           Size: 2 units
8   Avenue Road Apartments         Oluyole, Oyo           Completed           Size: 6 units
9   15 Alafia Street               Oluyole, Oyo           Completed           Size: 4 units
10  12 Saint Mary Street           Oluyole, Oyo           Nearing Completion  Size: 8 units
11  RATCON Estate                  Oluyole, Oyo           Completed           Size: --
12  1 Goodwill Road                Oluyole, Oyo           Completed           Size: 4 units
13  Anike's Court                  Oluyole, Oyo           Completed           Size: 3 units
14  9 Adeyemo Quarters             Oluyole, Oyo           Completed           Size: 4 units
15  Marigold Residency             Nairobi West, Nairobi  Under Construction  Size: --
16  Kings Distinction              Kilimani, Nairobi      Completed           Size: --
17  Riverview Apartments           Kyumvi, Machakos       Completed           Size: --
18  Serene Park                    Kyumvi, Machakos       Under Construction  Size: --
19  Gitanga Duplexes               Lavington, Nairobi     Under Construction  Size: 36 units
20  Westpointe Apartments          Upper Hill, Nairobi    Completed           Size: 254 units
21  10 Olaoluwa Street             Oluyole, Oyo           Under Construction  Size: 12 units
22  Rosslyn Grove                  Nairobi West, Nairobi  Under Construction  Size: 90 units
23  7 Kamoru Ajimobi Street        Oluyole, Oyo           Completed           Size: 2 units
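The units column comes back as text such as "Size: 62 units" or "Size: --". If you need a number, a small helper like the following could be applied afterwards. This is only a sketch: `parse_units` is a hypothetical name, and it assumes the text always follows one of those two patterns.

```python
import re


def parse_units(text):
    """Return the unit count as an int, or None when the size is missing ("Size: --")."""
    match = re.search(r"(\d[\d,]*)\s*units?", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


for raw in ["Size: 62 units", "Size: 1 units", "Size: --"]:
    print(raw, "->", parse_units(raw))
```

With the DataFrame from the example, this could be used as `df['units'].map(parse_units)`.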
# pip install trio httpx pandas
import trio
import httpx
import pandas as pd

allin = []
keys1 = ['name', 'area', 'state']
keys2 = ['value', 'unit']


async def scraper(client, page):
    # Pass the page per request instead of mutating the shared client params
    r = await client.get('/properties', params={'page': page})
    allin.extend([[i.get(k, 'N/A') for k in keys1] +
                  [i['size'].get(b, 'N/A') for b in keys2]
                  for i in r.json()['data']])


async def main():
    async with httpx.AsyncClient(timeout=None, base_url='https://services.estateintel.com/api/v2') as client, trio.open_nursery() as nurse:
        client.params = {
            'type[]': 'residential'
        }
        for page in range(1, 3):
            nurse.start_soon(scraper, client, page)
    # The nursery block only exits once every scraper task has finished
    df = pd.DataFrame(allin, columns=keys1 + keys2)
    print(df)

if __name__ == "__main__":
    trio.run(main)
Output:
0 Cedar Haus Gardens Oluyole Oyo 38 units
1 10 Agoro Street Oluyole Oyo 1 units
2 Villa O Oluyole Oyo 2 units
3 Avenue Road Apartments Oluyole Oyo 6 units
4 15 Alafia Street Oluyole Oyo 4 units
5 12 Saint Mary Street Oluyole Oyo 8 units
6 RATCON Estate Oluyole Oyo 0 units
7 1 Goodwill Road Oluyole Oyo 4 units
8 Anike's Court Oluyole Oyo 3 units
9 9 Adeyemo Quarters Oluyole Oyo 4 units
10 Marigold Residency Nairobi West Nairobi 0 units
11 Riverview Apartments Kyumvi Machakos 0 units
12 Socian Villa Apartments Kileleshwa Nairobi 36 units
13 Kings Pearl Residency Lavington Nairobi 55 units
14 Touchwood Gardens Kilimani Nairobi 32 units
15 Panorama Apartments Upper Hill Nairobi 0 units
16 Gitanga Duplexes Lavington Nairobi 36 units
17 Serene Park Kyumvi Machakos 25 units
18 Kings Distinction Kilimani Nairobi 48 units
19 Twin Oaks Apartment Kilimani Nairobi 0 units
20 Duchess Park Lavington Nairobi 70 units
21 Greenvale Apartments Kileleshwa Nairobi 36 units
22 The Urban apartments & Suites Osu Greater Accra 28 units
23 Chateau Towers Osu Greater Accra 120 units
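Note that the rows above are not in page order, because each task appends its results whenever its request happens to finish. If the order matters, one option is to gather the per-page results and flatten them afterwards. The sketch below shows the idea with the standard-library asyncio rather than trio/httpx, and with a hypothetical `fake_fetch` standing in for the real HTTP call, so that it runs offline.

```python
import asyncio
import random


async def fake_fetch(page):
    # Hypothetical stand-in for the real request; the random delay means
    # pages finish in an unpredictable order
    await asyncio.sleep(random.uniform(0, 0.01))
    return [f"page{page}-item{i}" for i in range(2)]


async def fetch_all(pages):
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one finished first
    results = await asyncio.gather(*(fake_fetch(p) for p in pages))
    return [item for page_items in results for item in page_items]


rows = asyncio.run(fetch_all(range(1, 4)))
print(rows)
```

With trio, a comparable effect can be had by storing each page's rows in a dict keyed by page number and sorting the keys once the nursery has exited.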
I am trying to scrape a Wikipedia table into a DataFrame. From the table, I want to drop Population density, Land area, and specifically Population (Rank). In the end I want to keep only State or territory and Population (People).
https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density
Here is my code:
wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wiki)
soup = BeautifulSoup(response.text, 'html.parser')
indiatable=soup.find('table',{'class':"wikitable"})
df=pd.read_html(str(indiatable))
df=pd.DataFrame(df[0])
data = df.drop(["Population density","Population"["Rank"],"Land area"], axis=1)
wikidata = data.rename(columns={"State or territory": "State","Population": "Population"})
print (wikidata.head())
How do I reference that specific sub-header so I can drop Rank under Population?
Note: There is no expected result in your question, so you may have to make some adjustments to your headers. I assumed you want to rename People to Population, rather than Population itself, so I changed that.
To reach your goal, simply set the header parameter when reading the HTML so that only the second header row is used; then you do not need to drop the first level separately:
df=pd.read_html(str(indiatable),header=1)[0]
df = df.rename(columns={"State or territory": "State","People": "Population"}).drop(['Rank'], axis=1)
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wiki)
soup = BeautifulSoup(response.text, 'html.parser')
indiatable=soup.find('table',{'class':"wikitable"})
df=pd.read_html(str(indiatable),header=1)[0]
df = df.rename(columns={"State or territory": "State","People": "Population"}).drop(['Rank'], axis=1)
Output
State                 Rank(all)  Rank(50 states)  permi2  perkm2  Population  Rank.1  mi2   km2
District of Columbia  1          —                11295   4361    689545      56      61    158
New Jersey            2          1                1263    488     9288994     46      7354  19046.8
Rhode Island          3          2                1061    410     1097379     51      1034  2678
Puerto Rico           4          —                960     371     3285874     49      3515  9103.8
Massachusetts         5          3                901     348     7029917     45      7800  20201.9
Connecticut           6          4                745     288     3605944     48      4842  12540.7
Guam                  7          —                733     283     153836      52      210   543.9
American Samoa        8          —                650     251     49710       55      77    199.4
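If you would rather keep both header rows, the sub-column can also be addressed directly: with two-level headers, each column key is a tuple of (top header, sub-header). The sketch below builds a small frame by hand to show this. The column tuples are an assumption about what read_html would produce for this table; the values are taken from the output above.

```python
import pandas as pd

# Hand-built frame with two header levels, standing in for the parsed table
columns = pd.MultiIndex.from_tuples([
    ("State or territory", "State or territory"),
    ("Population", "Rank"),
    ("Population", "People"),
    ("Land area", "mi2"),
])
df = pd.DataFrame(
    [["New Jersey", 46, 9288994, 7354],
     ["Rhode Island", 51, 1097379, 1034]],
    columns=columns,
)

# Select the wanted columns by their (top, sub) tuple, then flatten the names
slim = df[[("State or territory", "State or territory"), ("Population", "People")]].copy()
slim.columns = ["State", "Population"]
print(slim)
```

Selecting what you want to keep is usually simpler than dropping everything else, since you only have to spell out two tuples.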
So I'm using pandas.read_html to try to get a table from a website. For some reason it's not giving me the entire table and it's just getting the header row. How can I fix this?
Code:
import pandas as pd
term_codes = {"fall":"10", "spring":"20", "summer":"30"}
# year must be last number in school year: 2021-2022 so we pick 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = "https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code=" + term_code + "&term_subj=" + department + "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"
def findCourseTable():
    dfs = pd.read_html(url)
    print(dfs[0])
    #df = dfs[1]
    #df.to_csv(r'courses.csv', index=False)

if __name__ == "__main__":
    findCourseTable()
Output:
Empty DataFrame
Columns: [CRN, COURSE ID, CRSE ATTR, TITLE, INSTRUCTOR, CRDT HRS, MEET DAY:TIME, PROJ ENR, CURR ENR, SEATS AVAIL, STATUS]
Index: []
The page contains malformed HTML code, so use flavor="html5lib" in pd.read_html to read it correctly:
import pandas as pd
term_codes = {"fall": "10", "spring": "20", "summer": "30"}
# year must be last number in school year: 2021-2022 so we pick 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = (
"https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code="
+ term_code
+ "&term_subj="
+ department
+ "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"
)
df = pd.read_html(url, flavor="html5lib")[0]
print(df)
Prints:
CRN COURSE ID CRSE ATTR TITLE INSTRUCTOR CRDT HRS MEET DAY:TIME PROJ ENR CURR ENR SEATS AVAIL STATUS
0 16064 CSCI 100 01 C100, NEW Reading#Russia Willner, Dana; Prokhorova, Elena 4 MWF:1300-1350 10 10 0* CLOSED
1 14614 CSCI 120 01 NaN A Career in CS? And Which One? Kemper, Peter 1 M:1700-1750 36 20 16 OPEN
2 16325 CSCI 120 02 NEW Concepts in Computer Science Deverick, James 3 TR:0800-0920 36 25 11 OPEN
3 12372 CSCI 140 01 NEW, NQR Programming for Data Science Khargonkar, Arohi 4 MWF:0900-0950 36 24 12 OPEN
4 14620 CSCI 140 02 NEW, NQR Programming for Data Science Khargonkar, Arohi 4 MWF:1100-1150 36 27 9 OPEN
5 13553 CSCI 140 03 NEW, NQR Programming for Data Science Khargonkar, Arohi 4 MWF:1300-1350 36 25 11 OPEN
...and so on.
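As a side note, the search URL can be assembled with urllib.parse.urlencode instead of string concatenation, which makes it easier to switch term or department. This is a sketch only: `build_url` is a hypothetical helper, and the parameter names are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

TERM_CODES = {"fall": "10", "spring": "20", "summer": "30"}
BASE_URL = "https://courselist.wm.edu/courselist/courseinfo/searchresults"


def build_url(year, term, department):
    """Build the search URL; year is the last year of the school year, e.g. "2022"."""
    params = {
        "term_code": year + TERM_CODES[term],
        "term_subj": department,
        "attr": "0",
        "attr2": "0",
        "levl": "0",
        "status": "0",
        "ptrm": "0",
        "search": "Search",
    }
    return BASE_URL + "?" + urlencode(params)


url = build_url("2022", "fall", "CSCI")
print(url)
```

The resulting `url` can be passed to `pd.read_html(url, flavor="html5lib")` exactly as before.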
I want to scrape tennis matches results from this website
The results table I want has the columns: tournament_name match_time player_1 player_2 player_1_score player_2_score
This is an example
tournament_name match_time player_1 player_2 p1_set1 p2_set1
Roma / Italy 11:00 Krajinovic Filip Auger Aliassime Felix 6 4
Iasi (IX) / Romania 10:00 Bourgue Mathias Martineau Matteo 6 1
I can't associate each tournament name (the id="main_tour" cell) with each match row (one match is two class="match" rows or two class="match1" rows).
I tried this code:
import requests
from bs4 import BeautifulSoup
u = "http://www.tennisprediction.com/?year=2020&month=9&day=14"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
# print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
for table in soup.select('#main_tur'):
    tourn_value = [i.get_text(strip=True) for i in table.select('tr:nth-child(1)')][0].split('/')[0].strip()
    tourn_name = [i.get_text(strip=True) for i in table.select('tr td#main_tour')]
    row = [i.get_text(strip=True) for i in table.select('.match')]
    row2 = [i.get_text(strip=True) for i in table.select('.match1')]
    print(tourn_value, tourn_name)
You can use this script to save the table to CSV in your format:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
    p1 = t.find_next(class_='main_player')
    p2 = p1.find_next(class_='main_player')
    tour = t.find_previous(id='main_tour')
    scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
    scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
    all_data.append({
        'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
        'match_time': t.text,
        'player_1': p1.get_text(strip=True, separator=' '),
        'player_2': p2.get_text(strip=True, separator=' '),
    })
    all_data[-1].update(scores1)
    all_data[-1].update(scores2)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Saves data.csv:
EDIT: To add Odd, Prob columns for both players:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
    p1 = t.find_next(class_='main_player')
    p2 = p1.find_next(class_='main_player')
    tour = t.find_previous(id='main_tour')
    odd1 = t.find_next(class_='main_odds_m')
    odd2 = t.parent.find_next_sibling().find_next(class_='main_odds_m')
    prob1 = t.find_next(class_='main_perc')
    prob2 = t.parent.find_next_sibling().find_next(class_='main_perc')
    scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
    scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
    all_data.append({
        'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
        'match_time': t.text,
        'player_1': p1.get_text(strip=True, separator=' '),
        'player_2': p2.get_text(strip=True, separator=' '),
        'odd1': odd1.text,
        'prob1': prob1.text,
        'odd2': odd2.text,
        'prob2': prob2.text
    })
    all_data[-1].update(scores1)
    all_data[-1].update(scores2)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Andrej's solution is really nice and elegant. Accept his solution, but here was my go at it:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
rows=[]
for matchClass in ['match','match1']:
    # Use the loop variable so that both row classes are processed
    matches = soup.find_all('tr',{'class':matchClass})
    for idx, match in enumerate(matches):
        # Each match spans two consecutive rows; start from the first row of each pair
        if idx%2 != 0:
            continue
        row = {}
        tourny = match.find_previous('td',{'id':'main_tour'}).text
        time = match.find('td',{'class':'main_time'}).text
        p1 = match.find('td',{'class':'main_player'})
        player_1 = p1.text
        row.update({'tournament_name':tourny,'match_time':time,'player_1':player_1})
        sets = p1.find_previous('tr',{'class':matchClass}).find_all('td',{'class':'main_res'})
        for idx,each_set in enumerate(sets):
            row.update({'p1_set%d'%(idx+1):each_set.text})
        # NOTE: find_next() also searches inside `match` itself, so this returns the
        # first player's cell again, which is why player_2 repeats player_1 in the output
        p2 = match.find_next('td',{'class':'main_player'})
        player_2 = p2.text
        row.update({'player_2':player_2})
        sets = p2.find_next('tr',{'class':matchClass}).find_all('td',{'class':'main_res'})
        for idx,each_set in enumerate(sets):
            row.update({'p2_set%d'%(idx+1):each_set.text})
        rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df.head(10).to_string())
tournament_name match_time player_1 p1_set1 p1_set2 p1_set3 p1_set4 p1_set5 player_2 p2_set1 p2_set2 p2_set3 p2_set4 p2_set5
0 Roma / Italy prize / money : 5791 000 USD 11:10 Krajinovic Filip (SRB) (26) 6 7 Krajinovic Filip (SRB) (26) 4 5
1 Roma / Italy prize / money : 5791 000 USD 13:15 Dimitrov Grigor (BGR) (20) 7 6 Dimitrov Grigor (BGR) (20) 5 1
2 Roma / Italy prize / money : 5791 000 USD 13:50 Coric Borna (HRV) (32) 6 6 Coric Borna (HRV) (32) 4 4
3 Roma / Italy prize / money : 5791 000 USD 15:30 Humbert Ugo (FRA) (42) 6 7 Humbert Ugo (FRA) (42) 3 6 (5)
4 Roma / Italy prize / money : 5791 000 USD 19:00 Nishikori Kei (JPN) (34) 6 7 Nishikori Kei (JPN) (34) 4 6 (3)
5 Roma / Italy prize / money : 5791 000 USD 22:00 Travaglia Stefano (ITA) (87) 6 7 Travaglia Stefano (ITA) (87) 4 6 (4)
6 Iasi (IX) / Romania prize / money : 100 000 USD 10:05 Menezes Joao (BRA) (189) 6 6 Menezes Joao (BRA) (189) 4 4
7 Iasi (IX) / Romania prize / money : 100 000 USD 12:05 Cretu Cezar (2001) (ROU) 2 6 6 Cretu Cezar (2001) (ROU) 6 3 4
8 Iasi (IX) / Romania prize / money : 100 000 USD 14:35 Zuk Kacper (POL) (306) 6 6 Zuk Kacper (POL) (306) 2 0
9 Roma / Italy prize / money : 3452 000 USD 11:05 Pavlyuchenkova Anastasia (RUS) (32) 6 6 6 Pavlyuchenkova Anastasia (RUS) (32) 4 7 (5) 1
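The core idea in both answers is to start from the first row of a match, look backwards for the nearest tournament cell, and look at the next sibling row for the second player. Here is a self-contained sketch of that pattern on a tiny inline snippet. The HTML is a simplified, hypothetical imitation of the page's markup (class and id names are taken from the code above, everything else is assumed), so selectors may need adjusting on the real site.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: one tournament cell followed by two rows per match
HTML = """
<table>
<tr><td id="main_tour"><a>Roma</a><a>Italy</a></td></tr>
<tr class="match"><td class="main_time">11:00</td><td class="main_player">Krajinovic Filip</td><td class="main_res">6</td></tr>
<tr class="match"><td class="main_player">Auger Aliassime Felix</td><td class="main_res">4</td></tr>
<tr><td id="main_tour"><a>Iasi (IX)</a><a>Romania</a></td></tr>
<tr class="match1"><td class="main_time">10:00</td><td class="main_player">Bourgue Mathias</td><td class="main_res">6</td></tr>
<tr class="match1"><td class="main_player">Martineau Matteo</td><td class="main_res">1</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
matches = []
for t in soup.select(".main_time"):
    first_row = t.parent                             # row holding the time and player 1
    second_row = first_row.find_next_sibling("tr")   # row holding player 2
    tour = t.find_previous(id="main_tour")           # nearest tournament cell above
    matches.append({
        "tournament_name": " / ".join(a.get_text(strip=True) for a in tour.select("a")),
        "match_time": t.get_text(strip=True),
        "player_1": first_row.select_one(".main_player").get_text(strip=True),
        "player_2": second_row.select_one(".main_player").get_text(strip=True),
        "p1_set1": first_row.select_one(".main_res").get_text(strip=True),
        "p2_set1": second_row.select_one(".main_res").get_text(strip=True),
    })

for m in matches:
    print(m)
```

Scoping each lookup to its own row (rather than calling find_next from the row itself) keeps the two players from being mixed up.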
I am trying to run this script to extract data from the US Census, but the Census API is rejecting my requests. I have done a bit of work on it but am stumped. Any ideas on how to deal with this?
import pandas as pd
import requests
from pandas.compat import StringIO
#Sourced from the following site https://github.com/mortada/fredapi
from fredapi import Fred
fred = Fred(api_key='xxxx')
import StringIO
import datetime
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO as stio
else:
from io import StringIO as stio
year_list = '2013','2014','2015','2016','2017'
month_list = '01','02','03','04','05','06','07','08','09','10','11','12'
#############################################
#Get the total exports from the United States
#############################################
exports = pd.DataFrame()
for i in year_list:
    for s in month_list:
        try:
            link="https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
            str1 = ''.join([i])
            txt = '-'
            str2 = ''.join([s])
            total_link=link+str1+txt+str2
            r = requests.get(total_link, headers = {'User-agent': 'your bot 0.1'})
            df = pd.read_csv(StringIO(r.text))
            ##################### change starts here #####################
            ##################### since it is a dataframe itself, so the method to create a dataframe from a list won't work ########################
            # Drop the total sales line
            df.drop(df.index[0])
            # Rename Column name
            df.columns=['CTY_CODE','CTY_NAME','EXPORT MTH','EXPORT YR','time','UN']
            # Change the ["1234" to 1234
            df['CTY_CODE']=df['CTY_CODE'].str[2:-1]
            # Change the 2017-01] to 2017-01
            df['time']=df['time'].str[:-1]
            ##################### change ends here #####################
            exports = exports.append(df, ignore_index=False)
        except:
            print(i)
            print(s)
Here you go:
import ast
import itertools
import pandas as pd
import requests
base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
year_list = ['2013','2014','2015','2016','2017']
month_list = ['01','02','03','04','05','06','07','08','09','10','11','12']
exports = []
rejects = []
for year, month in itertools.product(year_list, month_list):
    url = '%s%s-%s' % (base, year, month)
    r = requests.get(url, headers={'User-agent': 'your bot 0.1'})
    if r.text:
        r = ast.literal_eval(r.text)
        df = pd.DataFrame(r[2:], columns=r[0])
        exports.append(df)
    else:
        rejects.append((int(year), int(month)))

exports = pd.concat(exports).reset_index().drop('index', axis=1)
Your result looks like this:
CTY_CODE CTY_NAME ALL_VAL_MO ALL_VAL_YR time
0 1010 GREENLAND 233446 233446 2013-01
1 1220 CANADA 23170845914 23170845914 2013-01
2 2010 MEXICO 17902453702 17902453702 2013-01
3 2050 GUATEMALA 425978783 425978783 2013-01
4 2080 BELIZE 17795867 17795867 2013-01
5 2110 EL SALVADOR 207606613 207606613 2013-01
6 2150 HONDURAS 429806151 429806151 2013-01
7 2190 NICARAGUA 75752432 75752432 2013-01
8 2230 COSTA RICA 598484187 598484187 2013-01
9 2250 PANAMA 1046236431 1046236431 2013-01
10 2320 BERMUDA 47156737 47156737 2013-01
11 2360 BAHAMAS 256292297 256292297 2013-01
... ... ... ... ...
13883 0024 LAFTA 27790655209 193139639307 2017-07
13884 0025 EURO AREA 15994685459 121039479852 2017-07
13885 0026 APEC 76654291110 550552655105 2017-07
13886 0027 ASEAN 6030380132 44558200533 2017-07
13887 0028 CACM 2133048149 13333440411 2017-07
13888 1XXX NORTH AMERICA 41622877949 299981278306 2017-07
13889 2XXX CENTRAL AMERICA 4697852283 30756310800 2017-07
13890 3XXX SOUTH AMERICA 8117215081 55039567414 2017-07
13891 4XXX EUROPE 25201247938 189925038230 2017-07
13892 5XXX ASIA 38329181070 274304503490 2017-07
13893 6XXX AUSTRALIA AND OC... 2389798925 16656777753 2017-07
13894 7XXX AFRICA 1809443365 13022520158 2017-07
Walkthrough:
itertools.product iterates over every (year, month) combination, and each one is joined to your base URL.
If the text of the response object is not blank (periods such as 2017-12 will be blank), create a DataFrame out of the literally evaluated text, which is a list of lists. Use the first element as the columns and ignore the second element.
Otherwise, add the (year, month) combo to rejects, a list of tuples of the periods not found.
I used exports = [] because it is much more efficient to concatenate a list of DataFrames than to append to an existing DataFrame.
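The parsing and concatenation steps can be tried without calling the API. In the sketch below, `sample_text` is a hypothetical response (the totals row uses made-up values; the Greenland and Canada rows come from the result above) and `parse_response` is a helper name invented for the example. It uses json.loads in place of ast.literal_eval, on the assumption that the response body is JSON, and converts the value columns to numbers.

```python
import itertools
import json

import pandas as pd

base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
# Every (year, month) combination appended to the base URL
urls = ['%s%s-%s' % (base, y, m) for y, m in itertools.product(['2013', '2014'], ['01', '02'])]

# Hypothetical response text: header row, a totals row (made-up values), then data rows
sample_text = '''[["CTY_CODE","CTY_NAME","ALL_VAL_MO","ALL_VAL_YR","time"],
["-","TOTAL FOR ALL COUNTRIES","0","0","2013-01"],
["1010","GREENLAND","233446","233446","2013-01"],
["1220","CANADA","23170845914","23170845914","2013-01"]]'''


def parse_response(text):
    rows = json.loads(text)
    # First element holds the column names, second is the totals line to skip
    df = pd.DataFrame(rows[2:], columns=rows[0])
    value_cols = ["ALL_VAL_MO", "ALL_VAL_YR"]
    df[value_cols] = df[value_cols].apply(pd.to_numeric)
    return df


# Collect the frames in a list and concatenate once at the end
frames = [parse_response(sample_text), parse_response(sample_text)]
exports = pd.concat(frames, ignore_index=True)
print(exports)
```

Passing `ignore_index=True` to `pd.concat` gives the same clean 0..n index as the reset_index/drop combination used in the answer.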