Beautiful Soup scrape table with table breaks

I'm trying to scrape a table into a DataFrame. My attempt returns only the table name, not the data in the rows for each region.
This is what I have so far:
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
Ideal outcome I'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.

If you check the HTML response (soup), you will see that the table tag returned by table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones whose td elements have the class names up, dn, d1 and s1).
So how about using the raw td tags, like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
# collect every tr on the page, not only those inside the table tag
a = soup.find_all('tr')
rows = []
subel = []
# keep only the slice of rows that holds the data we need
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []
df = pd.DataFrame(rows, columns=['Region','Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use only the a[42:50] slice of the results, because a contains every tr on the page. You can use the remaining rows too if you need them.
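
A hard-coded slice such as a[42:50] depends on the page layout at the time of writing, so a more defensive variant is to keep only the rows that contain the cells of interest. The following is a minimal sketch under assumptions: the inline HTML is a made-up stand-in for the page, and treating s1 as the region cell and d1 as the price cell is a guess, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the table is closed before the data rows start.
html = """
<table class="t4"><tr><th>Region</th><th>Price</th></tr></table>
<tr><td class="s1">New England</td><td class="d1">2.59</td></tr>
<tr><td class="s1">New York City</td><td class="d1">2.52</td></tr>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching inside the table tag only finds the header row.
rows_in_table = soup.find("table", class_="t4").find_all("tr")

# Searching the whole document and filtering by cell class finds the data rows.
records = []
for tr in soup.find_all("tr"):
    region = tr.find("td", class_="s1")  # assumed: region name cell
    price = tr.find("td", class_="d1")   # assumed: price cell
    if region is not None and price is not None:
        records.append((region.get_text(strip=True), float(price.get_text(strip=True))))

print(len(rows_in_table))
print(records)
```

The resulting list of tuples can be passed to pd.DataFrame(records, columns=['region', 'price']) to get the layout asked for in the question.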

Related

Convert table sourced from html webpage in to pandas dataframe

I'm trying to obtain a table from a webpage and convert it into a DataFrame to be used in analysis. I've used the BeautifulSoup package to scrape the URL and parse the table info, but I can't seem to export the info to a DataFrame. My code is below:
from bs4 import BeautifulSoup as bs
from urllib import request
source = request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    print(row)
By doing this I can see each row, but I'm not sure how to convert it to a DataFrame. Any ideas?

Please try this:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import pandas as pd
source = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
postal_codes = []
for tr in table_rows:
    td = tr.find_all("td")
    # strip the trailing newline from each cell
    row = [i.text.strip() for i in td]
    postal_codes.append(row)
# the header row has no td cells, so its entry is an empty list; drop it
postal_codes.pop(0)
df = pd.DataFrame(postal_codes, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(df)
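
A variation on the loop above, sketched on a made-up inline table rather than the live Wikipedia page: take the column names from the th cells instead of hard-coding them, and skip rows without td cells instead of popping the first entry. The sample rows are illustrative only.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample in the same shape as the postal code table.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").table

# Column names come from the header cells.
columns = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no td cells, so it is skipped
        records.append(cells)

df = pd.DataFrame(records, columns=columns)
print(df)
```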

You can use pandas read_html:
import pandas as pd
# read_html reads all the tables and returns them as a list; pick the table that meets your need
table_list = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(table_list[0])
   Postal Code       Borough      Neighborhood
0          M1A  Not assigned               NaN
1          M2A  Not assigned               NaN
2          M3A    North York         Parkwoods
3          M4A    North York  Victoria Village

How to get specific table from HTML

We have Form 10-K filings for several companies. We want to get the earnings tables (Item 6) from the HTML. The structure of the form varies between companies.
For example:
url1= 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm'
url2='https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'
We need to get the table in Item 6, Consolidated Financial Data.
One approach we tried is based on a string search: find "Item 6", take all the text from Item 6 to Item 7, then get the tables as follows:
import requests
import bs4 as bs
doc10K = requests.get(url2)
st6 = doc10K.text.lower().find("item 6")
end6 = doc10K.text.lower().find("item 7")
# get the text from item 6 and remove the currency sign
item6 = doc10K.text[st6:end6].replace('$','')
Tsoup = bs.BeautifulSoup(item6, 'lxml')
# Extract all tables from the response
html_tables = Tsoup.find_all('table')
This approach doesn't work for all the forms. For example, with KSS we are not able to find the string 'Item6'. The ideal output is the table given in Item 6.

petezurich is right, but the marker is not fully positioned.
# You can try this, too. The start parameter can be a list, just match any one of the above
from simplified_scrapy.simplified_doc import SimplifiedDoc
doc10K = requests.get(url2)
doc = SimplifiedDoc(doc10K.text)
start = doc.html.rfind('Selected Consolidated Financial Data')
if start<0:
    start = doc.html.rfind('Selected Financial Data')
tables = doc.getElementsByTag('table',start=start,end=['Item 7','Item 7'])
for table in tables:
    trs = table.trs
    for tr in trs:
        tds = tr.tds
        for td in tds:
            print(td.text)
            # print(td.unescape()) #Replace HTML entity

The string "item 6" seems to contain either a space or a non-breaking space.
Try this cleaned-up code:
import requests
from bs4 import BeautifulSoup
url1= 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm'
url2='https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'
doc10K = requests.get(url2)
st6 = doc10K.text.lower().find("item 6")
# found "item 6"? if not, search with an underscore
if st6 == -1:
    st6 = doc10K.text.lower().find("item_6")
end6 = doc10K.text.lower().find("item 7")
item6 = doc10K.text[st6:end6].replace('$','')
soup = BeautifulSoup(item6, 'lxml')
html_tables = soup.find_all('table')
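
One way to cover the plain space, the underscore, the non-breaking space and its HTML entities in a single search is a regular expression. This is only a sketch: the helper name slice_item6 and the sample string are made up for illustration, and like the code above it takes the first match, which in a real filing may be a table of contents entry rather than the section itself.

```python
import re

# "item", then any mix of whitespace, underscores, non-breaking spaces
# or their HTML entities, then the item number.
GAP = r"(?:\s|_|\xa0|&nbsp;|&#160;)*"
ITEM6 = re.compile("item" + GAP + "6", re.IGNORECASE)
ITEM7 = re.compile("item" + GAP + "7", re.IGNORECASE)


def slice_item6(html):
    """Return the text from the first 'Item 6' up to the next 'Item 7', or None."""
    start = ITEM6.search(html)
    if start is None:
        return None
    end = ITEM7.search(html, start.end())
    stop = end.start() if end else len(html)
    return html[start.start():stop]


# Made-up sample mixing an entity and a literal non-breaking space.
sample = (
    "<p>ITEM&nbsp;6. Selected Financial Data</p>"
    "<table><tr><td>Net sales</td></tr></table>"
    "<p>Item\xa07. Management's Discussion</p>"
)
section = slice_item6(sample)
print(section)
```

The returned slice can then be handed to BeautifulSoup and find_all('table') exactly as in the code above.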

With bs4 4.7.1+ you can use :contains and :has to specify the appropriate matching patterns for the table based on the HTML. You can use the CSS "or" syntax (a comma-separated selector list) so that either of the two patterns shown below is matched.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
urls = ['https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm','https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:contains("Item 6") ~ div:has(table) table, p:contains("Selected Consolidated Financial Data") ~ div:has(table) table')))[0]
        table.dropna(axis = 0, how = 'all',inplace= True)
        table.dropna(axis = 1, how = 'all',inplace= True)
        table.fillna(' ', inplace=True)
        table.rename(columns= table.iloc[0], inplace = True) #set headers same as row 1
        table.drop(table.index[0:2], inplace = True) #drop the first two rows
        table.reset_index(drop=True, inplace = True) #re-index
        print(table)
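
To see how the sibling and :has parts of such a selector fit together, here is a reduced sketch on made-up HTML. It uses :-soup-contains, the newer spelling of :contains in recent soupsieve releases (the selector engine behind select); the markup and the heading text are assumptions for illustration, not the structure of either filing.

```python
from bs4 import BeautifulSoup

# Made-up document: an unrelated table, a heading paragraph, then the wanted table.
html = """
<html><body>
  <div><table><tr><td>Unrelated table</td></tr></table></div>
  <p>Item 6. Selected Financial Data</p>
  <div><table><tr><td>Net sales</td><td>100</td></tr></table></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# p:-soup-contains(...)  the paragraph holding the heading text
# ~ div:has(table)       a later sibling div that contains a table
# table                  the table inside that div
selector = 'p:-soup-contains("Selected Financial Data") ~ div:has(table) table'
table = soup.select_one(selector)

rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```

The div that comes before the heading is not matched, because the ~ combinator only looks at siblings that follow the paragraph.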

How to specify table for BeautifulSoup to find?

I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable object. I've gone through tutorials for BS4, but I get so confused by the complexity of the html in this case that I can't figure out what syntax and which tags to use.
Here's a screenshot of the table and html I'm trying to get:
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format into a dataframe with pandas.

In this case I'd probably just use pandas to retrieve all the tables and then index into the list for the appropriate one:
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about the ordering changing in the future, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the table. Alternatively, use the bs4 :has and :contains selectors (bs4 4.7.1+) to identify the right table, then pass it to read_html or continue handling it with bs4:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
print(table)
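
The first suggestion, looping over the tables and testing for a unique string, might look like the sketch below. The inline HTML is a made-up stand-in for the page so that the example runs without network access, and it assumes a parser that read_html supports (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the second mentions the string we want.
html = """
<table>
  <tr><th>Expense</th><th>2017-2018</th></tr>
  <tr><td>Tuition</td><td>$9,000</td></tr>
</table>
<table>
  <tr><th>Average net price</th><th>2016-2017</th><th>2017-2018</th></tr>
  <tr><td>All students</td><td>$15,873</td><td>$16,950</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))

# Pick the first table whose printed form contains the identifying string.
net_price = next(t for t in tables if "Average net price" in t.to_string())
print(net_price)
```

read_html also accepts a match argument (a string or regular expression) that keeps only the tables containing matching text, which does the same filtering in one call.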

OK, maybe this can help you. I added pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
# the second table with class "tabular" inside that div
table = div.find_all("table", {"class": "tabular"})[1]
l = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    if td:
        row = [i.text for i in td]
        l.append(row)
df=pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME","2015-2016","2016-2017","2017-2018"])
print(df)

Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table, so I also wrote a more elaborate version that pairs the column names with the values:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
# collect the text of every header cell
header_row = desired_table.find_all('th')
headers = []
for header in header_row:
    header_text = header.get_text()
    headers.append(header_text)
# collect the text of every data cell
money_values = []
data_row = desired_table.find_all('td')
for rows in data_row:
    row_text = rows.get_text()
    money_values.append(row_text)
# pair each header with the value in the same position
for yrs,money in zip(headers,money_values):
    print(yrs,money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950
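
The two-step narrowing used above (find the container div by id, then the first table inside it) can also be written as one CSS selector. A sketch on made-up HTML: only the netprc id and the two dollar amounts are taken from the answer, and the row layout with one th and one td per row is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up page with two containers; only the second holds the wanted table.
html = """
<div id="expenses"><table><tr><th>Tuition</th><td>$9,000</td></tr></table></div>
<div id="netprc">
  <table>
    <tr><th>2016-2017</th><td>$15,873</td></tr>
    <tr><th>2017-2018</th><td>$16,950</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div#netprc table" means: a table that is a descendant of the div with id netprc.
desired_table = soup.select_one("div#netprc table")

# Pair the header and the value within each row.
net_price = {
    tr.th.get_text(strip=True): tr.td.get_text(strip=True)
    for tr in desired_table.find_all("tr")
}
print(net_price)
```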

Beautiful Soup:Scrape Table Data

I'm looking to extract table data from the url below. Specifically I would like to extract the data in first column. When I run the code below, the data in the first column repeats multiple times. How can I get the values to show only once as it appears in the table?
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table',{'id':'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    for cell in data:
        print(data[0].text)

Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table',{'id':'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if (len(data) > 0):
        cell = data[0]
        print(cell.text)

Using the requests module in combination with CSS selectors, you can also try the following:
import requests
from bs4 import BeautifulSoup
link = 'http://www.pythonscraping.com/pages/page3.html'
soup = BeautifulSoup(requests.get(link).text, 'lxml')
# skip the header row, then read the first cell of each remaining row
for row in soup.select('table#giftList tr')[1:]:
    cell = row.select_one('td').get_text(strip=True)
    print(cell)
Output:
Vegetable Basket
Russian Nesting Dolls
Fish Painting
Dead Parrot
Mystery Box
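
Both answers come down to reading exactly one cell per row instead of printing inside a loop over every cell. A compact sketch of that idea on made-up HTML (the giftList id and the two item names come from the question and the output above; the header text and prices are assumed):

```python
from bs4 import BeautifulSoup

# Made-up sample in the shape of the gift list table.
html = """
<table id="giftList">
  <tr><th>Item Title</th><th>Cost</th></tr>
  <tr><td>Vegetable Basket</td><td>$15.00</td></tr>
  <tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

first_column = []
for tr in soup.select("table#giftList tr"):
    first_cell = tr.find("td")  # None for the header row, which only has th cells
    if first_cell is not None:
        first_column.append(first_cell.get_text(strip=True))

print(first_column)
```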

How to scrape multiple tables in R?

I am a "newbie" when it comes to R, but I would really like to know how to scrape multiple tables (whose dimensions I don't know) from a site like:
https://en.wikipedia.org/wiki/World_population
Just to be specific, here is what the code looks like in Python:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url1 = "https://en.wikipedia.org/wiki/World_population"
page = urlopen(url1)
soup = BeautifulSoup(page, "html.parser")
table1 = soup.find("table", {'class' : 'wikitable sortable'})
trs = soup.find_all('tr')
tds = soup.find_all('td')
for row in trs:
    for column in tds:
        a = column.get_text().strip()
        print(a)
        break

In R,
u <- "https://en.wikipedia.org/wiki/World_population" # input
library(XML)
b <- basename(u)
download.file(u, b)
L <- readHTMLTable(b)
L is now a list of the 29 tables in u, each as an R data frame.
