Scraping a table with row labels in Python using Beautiful Soup

I'm trying to scrape a table from a website that has row labels. I'm able to get the actual data from the table, but I have no idea how to get the row labels as well.
Here is my code right now:
import numpy as np
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)
html = res.read()
## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
tables = bs.find_all("table")
table = tables[0]
df = pd.DataFrame()
rows = table.find_all("tr")
#extract the first column name (Employment income groups (18))
column_names = []
header_cells = rows[0].find_all("th")
for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)
#extract the rest of the column names
header_cells = rows[1].find_all("th")
for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)
#this is an extra label
column_names.remove('Main mode of commuting (10)')
#get the data from the table
data = []
for row in rows[2:]:
    ## create an empty tuple
    dt = ()
    cells = row.find_all("td")
    for cell in cells:
        ## dp stands for "data point"
        font = cell.find("font")
        if font is not None:
            dp = font.text
        else:
            dp = cell.text
        dp = dp.strip()
        dp = dp.replace("\n", " ")
        ## add to tuple
        dt = dt + (dp,)
    data.append(dt)
df = pd.DataFrame(data, columns = column_names)
Creating the dataframe will give an error because the code above only extracts the cells with data points but does not extract the first cell of each row that contains the row label.
That is, there are 11 column names, but each tuple has only 10 values, because the row label (i.e., "Total - Employment income") sits in a "th" cell and is never extracted.
How can I get the row label and put it into the tuple as I process the rest of the data in the table?
Thank you for your help.
(The table I am trying to scrape is on this site if it's not clear from the code)

Use table.findAll('th', {'headers': 'col-0'}) to find the row labels:
lab = []
labels = table.findAll('th', {'headers': 'col-0'})
for label in labels:
    text = str(label.text).strip()
    text = text.split("($)Footnote", 1)[0]
    lab.append(text)
    #print(text)
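Once the labels are extracted, a minimal sketch (assuming lab lines up one-to-one with the data tuples built in the question) that prepends each label so the tuples match the 11 column names:
#prepend each row label to its data tuple (assumes lab[i] pairs with data[i])
labeled_data = [(label,) + row for label, row in zip(lab, data)]
df = pd.DataFrame(labeled_data, columns=column_names)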
EDIT:
Using pandas.read_html
import numpy as np
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)
html = res.read()
## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
tables = bs.find_all("table")
df = (pd.read_html(str(tables)))[0]
#print(df)
columns = ['Employment income groups (18)', 'Total - Main mode of commuting', 'Car, truck or van',
           'Driver, alone', '2 or more persons shared the ride to work',
           'Driver, with 1 or more passengers', 'Passenger, 2 or more persons in the vehicle',
           'Sustainable transportation', 'Public transit', 'Active transport', 'Other method']
df.columns = columns
Edit 2: The columns won't be accessible by index because the scraped labels are not clean strings (e.g., the "Employment income groups (18)" column label). I have edited the code again.

Related

Convert a bunch of list items (scraped from a vertical table) into a pandas dataframe with matching headers and rows, and ultimately save as CSV or Excel

I was scraping a website for data on a company, and so far the final result I get is a bunch of string items converted into lists.
Code snippet:
for tr in tables.find_all("tr"):
    for td in tr.find_all("td"):
        lists = td.text.split('\n')
Now, if I print these lists with index and value using enumerate, I get 16 items, which matches the table on the website.
Result of print(lists) using enumerate:
Index Data
0 ['XYZ']
1 ['100DL20C201961']
2 ['Capital']
3 ['12345']
4 ['Age']
5 ['16 Years']
6 ['Text']
7 ['56789']
8 ['Company Status']
9 ['Active']
10 ['Last Date']
11 ['27-11-2021']
12 ['Class']
13 ['Public Company']
14 ['Date']
15 ['31-12-2021']
However, what I want to achieve is saving this bunch of list items as CSV or Excel so that every even-indexed item becomes a column header and every odd-indexed item becomes that column's row value.
Question:
Is pandas DataFrame needed for this?
How do I convert a bunch of lists like the above (or strings) into a '.csv' or '.xlsx' table?
Summary of goal:
A table of 2 rows × 8 columns in .csv or .xlsx format.
Try:
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
d = dict((row.select_one('td:nth-child(1)').get_text(),row.select_one('td:nth-child(2)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8])
#print(d)
data.append(d)
df = pd.DataFrame(data).to_csv('out.csv',index=False)
#print(df)
Complete ResultSet
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
d = dict((row.select_one('td:nth-child(1)').get_text(),row.select_one('td:nth-child(2)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8])
#print(d)
d.update(dict((row.select_one('td:nth-child(3)').get_text(),row.select_one('td:nth-child(4)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8]))
data.append(d)
df = pd.DataFrame(data).to_csv('out2.csv',index=False)
#print(df)
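As a design note, the two comprehensions above walk the same rows twice. A minimal sketch, under the same assumptions about the page layout, that collects both (label, value) column pairs in one pass; out_combined.csv is just an illustrative filename:
d = {}
for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8]:
    cells = row.select('td')
    # cells come in (label, value) pairs: td1/td2 and td3/td4
    for label_cell, value_cell in zip(cells[::2], cells[1::2]):
        d[label_cell.get_text()] = value_cell.get_text()
pd.DataFrame([d]).to_csv('out_combined.csv', index=False)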

Getting headers from HTML (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory" which is the third diagram on the page.
Here is my code so far
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
This last line with header1 is giving me the error "list index out of range". What it is supposed to print is "U.S. state or territory ...".
I don't know anything about HTML, and everything gets me stuck and confused. The soup.find call could also be referencing the wrong part of the webpage.
Can you just use
headers = [element.text.strip() for element in data_table.find_all("th")]
to get the text in the headers?
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]
It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.
By looking at your code, I think you should fetch the title tag with find, not with find_all.
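A one-line illustration of the difference, with hypothetical variable names: find returns the first matching tag (or None), while find_all returns a list-like ResultSet, so attribute access only works on the former:
title_tag = soup.find("title")       # a single Tag (or None): title_tag.text works
title_tags = soup.find_all("title")  # a ResultSet: needs title_tags[0].text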

Python Google Finance historical data for foreign stock codes, e.g. ASX

The code below works for the American stocks APLE and BHP; however, when I replace them with the ASX codes it crashes. I thought it was due to the colon and tried str("ASX:BHP") without success. Unfortunately, Yahoo is no longer supplying historical data. Any thoughts, solutions, or alternatives would be greatly appreciated.
Thanks
import datetime
import pandas as pd
from pandas_datareader import data, wb
list = ["APLE","BHP"]
#list = ["ASX:AMP","ASX:BHP"]
df_all_stock = pd.DataFrame([])
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2017, 1, 1)
for row in list:
    row = str(row)
    df_stock = data.DataReader(row, "google", start, end)
    df_all_stock = df_all_stock.append(df_stock)
    df_all_stock['code'] = row
df_all_stock
Just use the ASX API
https://www.asx.com.au/asx/1/share/AMP
will return:
{"code":"AMP","isin_code":"AU000000AMP6","desc_full":"Ordinary Fully Paid","last_price":1.155,"open_price":1.115,"day_high_price":1.155,"day_low_price":1.11,"change_price":0.040,"change_in_percent":"3.587%","volume":24558498,"bid_price":1.15,"offer_price":1.16,"previous_close_price":1.115,"previous_day_percentage_change":"3.241%","year_high_price":1.77,"last_trade_date":"2021-08-13T00:00:00+1000","year_high_date":"2020-12-03T00:00:00+1100","year_low_price":1.038,"year_low_date":"2021-07-30T00:00:00+1000","year_open_price":4.97,"year_open_date":"2014-02-25T11:00:00+1100","year_change_price":-3.815,"year_change_in_percentage":"-76.761%","pe":32.08,"eps":0.036,"average_daily_volume":20511519,"annual_dividend_yield":0,"market_cap":3641708026,"number_of_shares":3266105853,"deprecated_market_cap":3772352000,"deprecated_number_of_shares":3266105853,"suspended":false}
Other sample queries that can give you price history, announcements, directors, dividends etc:
https://www.asx.com.au/asx/1/share/AMP/prices?interval=daily&count=255
https://www.asx.com.au/asx/1/company/AMP
https://www.asx.com.au/asx/1/company/AMP?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices
https://www.asx.com.au/asx/1/company/AMP/announcements?count=10&market_sensitive=true
https://www.asx.com.au/asx/1/company/AMP/dividends
https://www.asx.com.au/asx/1/company/AMP/dividends/history?years=10
https://www.asx.com.au/asx/1/company/AMP/people
https://www.asx.com.au/asx/1/company/AMP/options?count=1000
https://www.asx.com.au/asx/1/company/AMP/warrants?count=1000
https://www.asx.com.au/asx/1/chart/highcharts?asx_code=AMP&years=10
https://www.asx.com.au/asx/1/company/AMP/similar?compare=marketcap
I included some sample Python code here: https://stackoverflow.com/a/68790147/8459557
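For reference, a minimal sketch of calling the quote endpoint above with requests and pulling a few fields out of the JSON (field names taken from the sample response shown earlier):
import requests

resp = requests.get("https://www.asx.com.au/asx/1/share/AMP")
resp.raise_for_status()
quote = resp.json()
# field names as in the sample response above
print(quote["code"], quote["last_price"], quote["change_in_percent"])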
You will need to build a scraper to get the data out of the HTML table and then build up a pandas dataframe that resembles the one we get as output for the American stock data.
I determined the base URL for Canadian stocks on Google Finance to be: 'https://www.google.ca/finance/historical?q=TSE%3A'. To get data for a stock, we simply append its name to the end of the above base URL. For example, to see the historical stock data for 'VCN' we would need to go to the page: https://www.google.ca/finance/historical?q=TSE%3AVCN
To do the above in Python code, we simply need the following, where the stock variable can be changed to any TSE (Toronto Stock Exchange) stock of interest.
from datetime import datetime
from pandas import DataFrame
import pandas_datareader.data as web
google_historical_price_site = 'https://www.google.ca/finance/historical?q=TSE%3A'
stock = 'VCN' #sub any stock in here
historical_price_page = google_historical_price_site + stock
print(historical_price_page)
from urllib.request import urlopen
from bs4 import BeautifulSoup
#open the historical_price_page link and acquire the source code
stock_dat = urlopen(historical_price_page)
#parse the code using BeautifulSoup
historical_page = BeautifulSoup(stock_dat,'lxml')
#scrape the table
table_dat = historical_page.find('table', {'class': 'gf-table historical_price'})
#find the first cell (the date) of each row in the table
rows = table_dat.findAll('td',{'class':'lm'})
#get just the dates out of the table rows, strip the newline characters
dates = [x.get_text().rstrip() for x in rows]
#turn dates to python datetime format
datetime_dates = [datetime.strptime(x, '%b %d, %Y') for x in dates]
#next we build up the price dataframe rows
#iterate through the table, taking the siblings to the
#right of the dates and adding to the row's data
prices = []
for num, row in enumerate(rows):
    row_dat = [datetime_dates[num]] #first column is the dates
    for i in row.next_siblings:
        row_dat.append(i.get_text().rstrip()) #iterate through columns, append
    prices.append(row_dat) #add the row to the list of rows
#turn the output into the dataframe
outdat = DataFrame(prices, columns=['Date','Open','High','Low','Close','Volume'])
#make the Volume columns integers, in case we wish to use it later!
outdat["Volume"] = outdat["Volume"].apply(lambda x: int(x.replace(',','')))
#change the other columns to floating point values
for col in ['Open','High','Low','Close']:
    outdat[col] = outdat[col].apply(lambda x: float(x))
#set the index to match the american stock data
outdat = outdat.set_index('Date')
#sort the index so it is in the same orientation as the american data
outdat = outdat.sort_index()
#have a look
outdat
EXAMPLE OF downloading a Hong Kong stock as a CSV file (stock example: Tencent Holdings Ltd (HKG:0700))
from datetime import datetime
from pandas import DataFrame
import pandas_datareader.data as web
import os
google_historical_price_site='https://finance.google.com/finance/historical?q=HKG:0700'
print(google_historical_price_site)
from urllib.request import urlopen
from bs4 import BeautifulSoup
#open the historical_price_page link and acquire the source code
stock_dat = urlopen(google_historical_price_site)
#parse the code using BeautifulSoup
google_historical_price_site = BeautifulSoup(stock_dat,'lxml')
#scrape the table
table_dat = google_historical_price_site.find('table', {'class': 'gf-table historical_price'})
#find the first cell (the date) of each row in the table
rows = table_dat.findAll('td',{'class':'lm'})
#get just the dates out of the table rows, strip the newline characters
dates = [x.get_text().rstrip() for x in rows]
#turn dates to python datetime format
datetime_dates = [datetime.strptime(x, '%b %d, %Y') for x in dates]
#next we build up the price dataframe rows
#iterate through the table, taking the siblings to the
#right of the dates and adding to the row's data
prices = []
for num, row in enumerate(rows):
    row_dat = [datetime_dates[num]] #first column is the dates
    for i in row.next_siblings:
        row_dat.append(i.get_text().rstrip()) #iterate through columns, append
    prices.append(row_dat) #add the row to the list of rows
#turn the output into the dataframe
outdat = DataFrame(prices, columns=['Date','Open','High','Low','Close','Volume'])
#make the Volume columns integers, in case we wish to use it later!
outdat["Volume"] = outdat["Volume"].apply(lambda x: int(x.replace(',','')))
#change the other columns to floating point values
for col in ['Open','High','Low','Close']:
    outdat[col] = outdat[col].apply(lambda x: float(x))
#set the index to match the american stock data
outdat = outdat.set_index('Date')
#sort the index so it is in the same orientation as the american data
outdat = outdat.sort_index()
#output CSV file
df = outdat
path_d = r'C:\MA data'  # raw string so the backslash isn't treated as an escape
df.to_csv(os.path.join(path_d, 'HKGstock700.csv'))

HTML table scraping using BeautifulSoup

I am trying to scrape a table from an SEC 10-K filing. I think it is going all right except the part where pandas converts it to a dataframe; as I am new to dataframes, I think I am making a mistake in the indexing. Please help me with this, as I am getting the following error: "IndexError: index 2 is out of bounds for axis 0 with size 2"
I am using this program:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm#Item8FinancialStatementsandSupplementary'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0,2), index = [0])
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
new_table
If the dataframe issue isn't resolvable, then please suggest a substitute such as writing the data to CSV/Excel; any suggestion for extracting multiple tables at once would also be really helpful.
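A minimal sketch of one alternative, assuming the filing's tables parse cleanly with pd.read_html: it sizes each dataframe from the table itself (avoiding the fixed 1x2 DataFrame that triggers the IndexError above), extracts every table at once, and writes each to its own CSV (table_0.csv, table_1.csv, ... is just an illustrative naming scheme):
import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm'
r = requests.get(url)
# read_html returns a list of dataframes, one per <table> in the page
tables = pd.read_html(r.text)
for i, df in enumerate(tables):
    df.to_csv('table_{}.csv'.format(i), index=False)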

Get XML from a web service?

I'm trying to get data from this site
and then use some of it. Sorry for not copy-pasting it, but it's a long XML document. So far I have tried to get the data this way:
from urllib.request import urlopen
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
s = urlopen(url)
content = s.read()
print(content) looks good; now I would like to extract data from it, e.g.:
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
How can I parse this data so it's easy to use?
You can use Beautiful Soup and capture the tags you want. The code below should get you started!
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
# secure url content
response = requests.get(url).content
soup = BeautifulSoup(response)
# find each tabela_rozklad
tables = soup.find_all('tabela_rozklad')
# each tabela_rozklad appears to have 12 corresponding nested tags
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
        'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
# initialize empty dataframe
df = pd.DataFrame()
# iterate over each tabela_rozklad, extract each tag's text, and append to the pandas dataframe
for table in tables:
    values = [table.find(tag).text for tag in tags]  # a list (not a lazy map) so pandas sees 12 columns
    df = df.append([values])
# insert tags as columns
df.columns = tags
# display first 5 rows of table
df.head()
# and the shape of the data
df.shape # 665 rows, 12 columns
# and now you can get to the information using traditional pandas functionality
# for instance, count observations by rodz
df.groupby('rodz').count()
# or subset only observations where rodz = J
J = df[df.rodz == 'J']
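Since the payload is XML rather than HTML, a minimal sketch with the standard-library ElementTree parser is another option (no third-party dependency; the tag names match the sample in the question, and unlike BeautifulSoup's HTML parsing, ElementTree preserves their upper case):
import urllib.request
import xml.etree.ElementTree as ET

url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
with urllib.request.urlopen(url) as s:
    root = ET.fromstring(s.read())
# walk every tabela_rozklad element and read a few of its nested fields
for entry in root.iter('tabela_rozklad'):
    print(entry.get('data-aktualizacji'), entry.findtext('DZIEN'), entry.findtext('GODZ'))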
