Getting headers from html (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory" which is the third diagram on the page.
Here is my code so far:
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup = BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
This last line with header1 is giving me the error "list index out of range". What it is supposed to print is "U.S State or territory....."
I don't know anything about HTML, and I keep getting stuck and confused. The soup.find call could also be referencing the wrong part of the webpage.

Can you just use the following to get the text in the headers?
headers = [element.text.strip() for element in data_table.find_all("th")]
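A minimal sketch of that header extraction, run against a small made-up HTML fragment (the `mw-stack` class and the sample cells are assumptions, not the real Wikipedia markup). It also shows one plausible cause of the "list index out of range" error: `find_all` returns an empty list when a tag such as `thead` is absent, so indexing it with `[0]` fails.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the saved Wikipedia page.
html = """
<div class="mw-stack">
  <table>
    <thead>
      <tr><th>U.S. state or territory</th><th>Cases</th><th>Deaths</th></tr>
    </thead>
    <tbody>
      <tr><td>Alabama</td><td>10</td><td>1</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
data_table = soup.find("div", {"class": "mw-stack"})

# soup.find returns None when nothing matches, so check before using it.
if data_table is None:
    raise ValueError("container div not found; check the class name")

headers = [th.get_text(strip=True) for th in data_table.find_all("th")]
print(headers)

# A table without <thead>: find_all gives an empty list, and [0] would raise IndexError.
no_thead = BeautifulSoup("<table><tr><th>A</th></tr></table>", "html.parser")
missing = no_thead.find_all("thead")
print(len(missing))
```

Checking the length of each `find_all` result before indexing it is a quick way to find which step of a long chained lookup comes back empty.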
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)  # html_file: the saved page, presumably the open file (fd) from the question
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]

It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.
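Here is a self-contained sketch of the same idea on a made-up fragment (the sample rows are assumptions; only the `covid19-container` id comes from the answer above). Recent pandas versions warn when `read_html` is given a literal HTML string, so the sketch wraps it in `io.StringIO`. It also assumes that `lxml`, or `html5lib` together with bs4, is installed, since `read_html` needs one of them.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up page with two tables; only the first sits inside the target div.
html = """
<div id="covid19-container">
  <table>
    <tr><th>State</th><th>Cases</th></tr>
    <tr><td>Alabama</td><td>10</td></tr>
    <tr><td>Alaska</td><td>20</td></tr>
  </table>
</div>
<table>
  <tr><th>Other</th></tr>
  <tr><td>x</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The CSS selector picks the first table nested inside the div with that id.
table = soup.select_one("div#covid19-container table")

# read_html returns a list of DataFrames, one per table found in the input.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```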

Looking at your code, I think you should select the HTML tag with find rather than find_all for the title tag.

Related

Python: How to Webscrape All Rows from a Specific Table

For practice, I am trying to webscrape financial data from one table in this url: https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue
I'd like to save the data from the "Tesla Quarterly Revenue" table into a data frame and return two columns: Date, Revenue.
Currently the code as it runs now is grabbing data from the adjacent table, "Tesla Annual Revenue." Since the tables don't seem to have unique id's from which to separate them in this instance, how would I select elements only from the "Tesla Quarterly Revenue" table?
Any help or insight on how to remedy this would be deeply appreciated.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html5lib')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue},ignore_index=True)
tesla_revenue.head()
Below are the results when I run this code:

You can let pandas do all the work
import pandas as pd
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
tables = pd.read_html(url)
for df in tables:
    # loop over all found tables
    pass
# quarterly revenue is the second table
df = tables[1]
df.columns = ['Date', 'Revenue'] # rename the columns if you want to
print(df)
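If relying on the table's position feels fragile, `read_html` also accepts a `match` argument that keeps only tables containing the given text. Below is a sketch on a made-up fragment (the table titles mirror the question; the figures are invented placeholders), again assuming `lxml` or `html5lib` plus bs4 is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two similar tables and no ids to tell them apart.
html = """
<table>
  <thead><tr><th colspan="2">Tesla Annual Revenue</th></tr></thead>
  <tbody>
    <tr><td>2021</td><td>$300</td></tr>
    <tr><td>2020</td><td>$400</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th colspan="2">Tesla Quarterly Revenue</th></tr></thead>
  <tbody>
    <tr><td>2021-12-31</td><td>$100</td></tr>
    <tr><td>2021-09-30</td><td>$200</td></tr>
  </tbody>
</table>
"""

# Only tables whose text matches "Quarterly" are returned.
tables = pd.read_html(StringIO(html), match="Quarterly")
df = tables[0]
df.columns = ["Date", "Revenue"]
print(df)
```

With a live page, the same `match` argument can be passed alongside the URL, so the result no longer depends on the order of the tables.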

Pandas df read data source stored in ".xml?" page which is not in tables format?

I need to download the data table at "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/index.html" and export it to Excel. Inspecting the page with Chrome's inspect function shows that the data is in "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/report_en.xml?id=2". However, it is no longer in table format.
import pandas as pd
url = "http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/index.html"
table= pd.read_html(url)[2]
table.info()
print(table)
table.to_excel("GGR.xlsx")

I see that now your source web site returns the content in XML format.
To process it, you can apply BeautifulSoup. Assuming that you have installed it,
proceed as follows:
Import necessary modules:
from bs4 import BeautifulSoup
import pandas as pd
import requests
Read the source page:
page = requests.get('http://www.dicj.gov.mo/web/en/information/DadosEstat_mensal/2019/report_en.xml?id=2')
soup = BeautifulSoup(page.text, 'lxml')
Read column names and create the MultiIndex for columns in the target DataFrame:
col = soup.find_all('column')
h1 = [ col[i].contents[0] for i in range(1,3) ]
h2 = [ col[i].contents[0] for i in range(3,6) ]
cols = pd.MultiIndex.from_product([h1, h2])
Process source records, creating ind (index) and rows:
recs = soup.find_all('record')
ind = []
rows = []
for rec in recs:
    cells = rec.find_all('data')
    ind.append(cells[0].contents[0])
    rows.append([ cells[i].contents[0] for i in range(1,7) ])
And the last step - create the target DataFrame:
df = pd.DataFrame(rows, index=ind, columns=cols)
I tried to read this table from the first page you gave, using read_html,
but I failed.
Probably the final content is loaded by some JavaScript on that page,
which read_html cannot "see".
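The steps above can be tried offline with the sketch below. The XML layout here is a guess made up for illustration (two year columns, two measure columns); the real report may be organised differently. The sketch uses the built-in `html.parser` so that it does not depend on `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical XML; tag names follow the answer, the values are invented.
xml = """
<report>
  <column>Month</column>
  <column>2019</column>
  <column>2018</column>
  <column>Monthly</column>
  <column>Accumulated</column>
  <record><data>Jan</data><data>10</data><data>10</data><data>8</data><data>8</data></record>
  <record><data>Feb</data><data>12</data><data>22</data><data>9</data><data>17</data></record>
</report>
"""

soup = BeautifulSoup(xml, "html.parser")

# Column names: positions 1-2 are the top level, 3-4 the second level.
col = [c.get_text(strip=True) for c in soup.find_all("column")]
cols = pd.MultiIndex.from_product([col[1:3], col[3:5]])

ind, rows = [], []
for rec in soup.find_all("record"):
    cells = [d.get_text(strip=True) for d in rec.find_all("data")]
    ind.append(cells[0])                        # first cell is the row label
    rows.append([int(v) for v in cells[1:]])    # remaining cells are the values

df = pd.DataFrame(rows, index=ind, columns=cols)
print(df)
```

From there, `df.to_excel("GGR.xlsx")` as in the question should write the result to Excel, provided an Excel writer such as `openpyxl` is installed.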

Parse complex multi-header html table with pandas and bs4

Complex Table link
I have used the bs4, pandas and lxml libraries to parse the HTML table above, but without success. With pandas I tried skipping rows and setting header to 0; however, the result is a highly unstructured DataFrame, and some data also seems to be missing.
With the other two libraries I tried to use selectors and even the XPath from the tbody section, but I receive an empty list in both cases.
This would be what I want to retrieve:
Can anyone give me a hand with how I can scrape that data?
Thank you!

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
page = urlopen('https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&datepicker-day-offset-select-dv-date-from_input=D&dateTime.dateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YES-REE------0!BZN%7C10YES-REE------0&productionType.values=B01&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC')
soup = BeautifulSoup(page.read())
table = soup.find('tbody')
res = []
row = []
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        row.append(td.text)
    res.append(row)
    row = []
df = pd.DataFrame(data=res)
Then add column names with df.columns and drop empty columns.
EDIT: Suggest this modified for-loop. (BillBell)
>>> for tr in table.find_all('tr'):
...     for td in tr.find_all('td'):
...         row.append(td.text.strip())
...     res.append(row)
...     row = []
The original form of the for statement failed to compile.
The original form of the append left newlines and blanks in the extracted strings.
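The remaining step mentioned above (naming the columns and dropping the empty ones) might look like the sketch below. The HTML fragment and the column names `MTU`, `Biomass` and `Gas` are made up for illustration and are not taken from the ENTSO-E page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table body with padded text and one column that is always empty.
html = """
<table><tbody>
  <tr><td> 00:00 - 01:00 </td><td> 120 </td><td></td><td> 340 </td></tr>
  <tr><td> 01:00 - 02:00 </td><td> 115 </td><td></td><td> 332 </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("tbody")

# One list of stripped cell texts per table row.
res = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
df = pd.DataFrame(res)

# Keep only the columns that contain at least one non-empty string.
df = df.loc[:, (df != "").any(axis=0)]

# Hypothetical column names for the three columns that remain.
df.columns = ["MTU", "Biomass", "Gas"]
print(df)
```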

Html table scraping using beautifulsoup

I am trying to scrape a table from an SEC 10-K filing. I think it is going all right except for the part where pandas converts it to a DataFrame. I am new to DataFrames, so I think I am making a mistake in the indexing. Please help me with this, as I am getting the following error: "IndexError: index 2 is out of bounds for axis 0 with size 2"
I am using this program:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm#Item8FinancialStatementsandSupplementary'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0,2), index = [0])
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
new_table
If the DataFrame issue isn't resolvable, then please suggest a substitute, such as writing the data to CSV/Excel. Any suggestions for extracting multiple tables at once would also be really helpful.

Parsing html data into python list for manipulation

I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)

It's not a good practice to use regex for parsing html. Use BeautifulSoup parser: find the cell with rowTitle class and EPS (Basic) text in it, then iterate over next siblings with valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
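The code above targets Python 2 and BeautifulSoup 3. A rough Python 3 / bs4 equivalent is sketched below on a made-up fragment that reuses the `rowTitle` and `valueCell` class names from the answer; the real page's markup may have changed since. It also converts the strings to floats, as the question asked.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the EPS figures are the ones quoted in the question.
html = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="valueCell"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

EPS = []
for title in soup.find_all("td", class_="rowTitle"):
    if "EPS (Basic)" in title.get_text():
        # Sibling cells in the same row that carry the valueCell class.
        cells = title.find_next_siblings("td", class_="valueCell")
        # Skip empty cells, then convert the remaining text to floats.
        EPS = [float(td.get_text()) for td in cells if td.get_text(strip=True)]

print(EPS)
```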

I would take a very different approach. We use lxml for scraping HTML pages.
One of the reasons we switched was that BeautifulSoup was not being maintained (or, I should say, updated) for a while.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
Now I noticed that the first row has the column headings, so I want to separate all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [ e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do; for example, I did not test for squareness (whether the number of cells in each row is equal).
Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
I should add that every row from the table is available, and the rows are in their original order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, etc.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC Irvine.
I hope this helps.
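As a sketch of the more direct XPath route mentioned above, the snippet below picks the row by its label and pairs the values with the year headings. The table is a made-up fragment: the `Sales` row is invented, and the EPS figures are the ones shown earlier. It assumes `lxml` is installed.

```python
from lxml import html

# Made-up table: a heading row, one unrelated row, and the row we want.
page = """
<table>
  <tr><th></th><th>2008</th><th>2009</th><th>2010</th></tr>
  <tr><td>Sales</td><td>1</td><td>2</td><td>3</td></tr>
  <tr><td> EPS (Basic) </td><td>13.46</td><td>20.62</td><td>26.69</td></tr>
</table>
"""

tree = html.fromstring(page)

# Headings come from the first row; the first (empty) one labels the row titles.
years = [th.text_content().strip() for th in tree.xpath("//tr[1]/th")][1:]

# Select the row whose first cell reads "EPS (Basic)", then every cell after it.
cells = tree.xpath('//tr[normalize-space(td[1]) = "EPS (Basic)"]/td[position() > 1]')

eps = dict(zip(years, (float(c.text_content()) for c in cells)))
print(eps)
```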

from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code gives you a DataFrame built from the webpage and then uses it to create a Python dictionary.
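To show what that dictionary looks like, here is a small Python 3 sketch with made-up rows standing in for the scraped cells (the `Item` and `Value` column names are assumptions). By default, `to_dict()` nests the values by column and then by row index; `to_dict("records")` gives one dictionary per row instead.

```python
import pandas as pd

# Made-up rows standing in for the scraped cell text.
data = [["Cash", "10"], ["Debt", "5"]]
df = pd.DataFrame(data, columns=["Item", "Value"])

# Default orientation: {column name -> {row index -> value}}.
by_column = df.to_dict()

# "records" orientation: a list with one {column name -> value} dict per row.
by_row = df.to_dict("records")

print(by_column)
print(by_row)
```

One caveat about the loop above: filtering out empty cells with `if ele` can shift the remaining values to the left when only some rows have blanks, so it may be safer to keep the empty strings and clean the columns afterwards.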
