Web scraping with BeautifulSoup on Wikipedia

I am new to Python and am trying to use BeautifulSoup to extract all the train station names from the third column of the wikitable on a Wikipedia page.
I have tried the code below, but it returns every cell of each row as one group of information:
contentTable = soup.find('table', { "class" : "wikitable"})
cols = contentTable.find_all('td')
for col in cols:
    soup.find_all("a")
    print(col.get_text())
Output, representing one row of the table:
CG2
TE [a]
Changi Airport
樟宜机场
சாங்கி விமானநிலையம்
8 February 2002
Changi Airport
CGA
Changi
Singapore Changi Airport, Changi Airport PTB2 Bus Terminal
Expected DataFrame column, Station Names:
Station Names
Jurong East
Bukit Batok
etc...
Can someone show me how to code this correctly?
Thank you!

Your program simply prints the text content of every 'td' tag in the wikitable.
Try this instead:
contentTable = soup.find('table', {"class": "wikitable"})
trs = contentTable.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    # Header rows have no 'td' cells, so check the length first
    if len(tds) > 2:
        print(tds[2].get_text(strip=True))
It walks each row, collects the 'td' tags on that row, and prints the third one if the row has at least three cells.

Try this:
import requests
from bs4 import BeautifulSoup
# URL to be scraped
URL = "https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations"
PAGE = requests.get(URL)
# Parse the HTML content
SOUP = BeautifulSoup(PAGE.content, 'lxml')  # lxml is faster than html.parser
contentTable = SOUP.find('table', {"class": "wikitable"})
rows = contentTable.find_all('tr')
for tr in rows:
    columns = tr.find_all('td')
    for index, td in enumerate(columns):
        if index == 2:
            print(td.text)
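
Both answers assume the station name is always the third 'td' of a row. A minimal sketch of a variation that looks the column up by its header text instead; the function name column_by_header and the small inline table are hypothetical stand-ins for the real Wikipedia markup, and merged cells (rowspan/colspan) are not handled:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real wikitable.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Code</th><th>Line</th><th>Station name</th></tr>
  <tr><td>A1</td><td>Line A</td><td><a href="#">Jurong East</a></td></tr>
  <tr><td>A2</td><td>Line A</td><td><a href="#">Bukit Batok</a></td></tr>
</table>
"""


def column_by_header(html, header_text):
    """Return the text of every cell under the header equal to header_text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # Assumption: the first row holds the column headers as <th> cells
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    if header_text not in headers:
        return []
    idx = headers.index(header_text)
    values = []
    for tr in rows[1:]:
        tds = tr.find_all("td")
        if len(tds) > idx:  # skip rows that are too short
            values.append(tds[idx].get_text(strip=True))
    return values


station_names = column_by_header(SAMPLE_HTML, "Station name")
print(station_names)
```

The resulting list can be wrapped as pandas.DataFrame({"Station Names": station_names}) to get the single-column frame the question asks for.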

Related

How to scrape td corresponding to header text in Beautifulsoup

I am trying to scrape Wikipedia using Beautiful Soup. I want to get the text inside a <td>, but only for the row with a certain header text.
For example:
I want to get the list of awards Alan Turing has received from https://en.wikipedia.org/wiki/Alan_Turing
The information I need is in the right table, in the table data corresponding to the table header with text Awards. How can I get the list of awards?
I have tried looping through the table rows and checking if table header is equal to 'Awards' but I don't know how to stop the loop in case there is no 'Awards' header in the table.
testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = requests.get(testurl)
page_content = BeautifulSoup(page.content, "html.parser")
table = page_content.find('table' ,attrs={'class':'infobox biography vcard'})
while True:
    tr = table.find('tr')
    if tr.find('th').renderContents() == 'Awards':
        td = tr.find('td')
        break
print(td)

You can use the CSS selector th:contains("Awards"), which selects the <th> tag that contains the text Awards.
Appending + td a[title] then selects the next sibling <td> and every <a> tag inside it that has a title= attribute:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Alan_Turing'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
awards = [a.text for a in soup.select('th:contains("Awards") + td a[title]')]
print(awards)
Prints:
["Smith's Prize"]
For url = 'https://en.wikipedia.org/wiki/Albert_Einstein' it will print:
['Barnard Medal', 'Nobel Prize in Physics', 'Matteucci Medal', 'ForMemRS', 'Copley Medal', 'Gold Medal of the Royal Astronomical Society', 'Max Planck Medal', 'Member of the National Academy of Sciences', 'Time Person of the Century']
Update 2021/10/31 (beautifulsoup4 version 4.10.0):
th:contains is now deprecated; use th:-soup-contains instead.
Example:
awards = [a.text for a in soup.select('th:-soup-contains("Awards") + td a[title]')]

Here is how you can access the 'Awards' part. I hope this helps.
from bs4 import BeautifulSoup
import urllib.request
testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = urllib.request.urlopen(testurl)
page_content = BeautifulSoup(page, "html.parser")
table = page_content.find('table', attrs={'class': 'infobox biography vcard'})
for header in table.find_all('th'):
    if header.get_text(strip=True) == 'Awards':
        # The awards are in the <td> that follows the matching <th>
        your_needed_variable = header.find_next_sibling('td').get_text(' ', strip=True)
        print(your_needed_variable)
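
The question also asks how to stop cleanly when the table has no 'Awards' header. A minimal sketch of one way to handle that case, assuming a simplified infobox; the helper name infobox_links and the inline markup are hypothetical, and only the class names and the "Smith's Prize" link come from the thread:

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified infobox; the real page has many more rows.
SAMPLE_HTML = """
<table class="infobox biography vcard">
  <tr><th>Known for</th><td>Example text</td></tr>
  <tr><th>Awards</th><td><a href="#" title="Smith's Prize">Smith's Prize</a></td></tr>
</table>
"""


def infobox_links(html, header_text):
    """Return link texts from the <td> next to the <th> equal to header_text."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.select("table.infobox th"):
        if th.get_text(strip=True) == header_text:
            td = th.find_next_sibling("td")
            if td is None:
                return []
            # Keep only links that carry a title= attribute
            return [a.get_text(strip=True) for a in td.find_all("a", title=True)]
    # The loop ended without a match: the header is not in the table
    return []


print(infobox_links(SAMPLE_HTML, "Awards"))
print(infobox_links(SAMPLE_HTML, "Spouse"))
```

Because the for loop runs over a finite list of headers, it ends on its own when nothing matches, so no explicit break condition is needed for the missing-header case.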

Convert table sourced from html webpage in to pandas dataframe

I'm trying to obtain a table from a webpage and convert it into a DataFrame to be used in analysis. I've used the BeautifulSoup package to scrape the URL and parse the table info, but I can't seem to export the info to a DataFrame. My code is below:
from bs4 import BeautifulSoup as bs
import urllib.request
source = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    print(row)
By doing this I can see each row, but I'm not sure how to convert it to a DataFrame. Any ideas?

Please try this:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import pandas as pd
source = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
postal_codes = []
for tr in table_rows:
    td = tr.find_all("td")
    # strip() drops the trailing newline from each cell
    row = [i.text.strip() for i in td]
    postal_codes.append(row)
# The header row has only <th> cells, so its list is empty; drop it
postal_codes.pop(0)
df = pd.DataFrame(postal_codes, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(df)
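
The column names above are typed in by hand. A minimal sketch of reading them from the table's own <th> cells instead, assuming a simple table with one header row and no merged cells; the function name table_to_dataframe and the inline markup are hypothetical:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more rows.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def table_to_dataframe(html):
    """Build a DataFrame from the first <table>, using <th> cells as column names."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    columns, records = [], []
    for tr in table.find_all("tr"):
        ths = tr.find_all("th")
        tds = tr.find_all("td")
        if ths and not tds:
            # A row made only of <th> cells is treated as the header row
            columns = [th.get_text(strip=True) for th in ths]
        elif tds:
            records.append([td.get_text(strip=True) for td in tds])
    # Fall back to default integer column labels if no header row was found
    return pd.DataFrame(records, columns=columns or None)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

If a data row has a different number of cells than the header row, pandas raises an error, which is usually a hint that the table uses rowspan or colspan.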

You can use pandas read_html:
import pandas as pd
# Reads all the tables on the page and returns them as a list; pick the table that meets your need
table_list = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(table_list[0])
   Postal Code       Borough      Neighborhood
0          M1A  Not assigned               NaN
1          M2A  Not assigned               NaN
2          M3A    North York         Parkwoods
3          M4A    North York  Victoria Village

Extracting contents from the table beautiful soup

I have been trying to extract out the contents inside a table on a website.
descriptions = []
sources = []
values = []
site = 'https://www.eia.gov/todayinenergy/prices.php' #address of the site
driver = webdriver.Chrome(executable_path=r"chromedriver.exe")
driver.execute_script("document.body.style.zoom='100%'")
driver.get(site)
soup_1 = bs(driver.page_source, 'lxml') #clean up the site using beautiful soup
tables = soup_1.find_all('tbody') #script of interest
print(len(tables)) #count the scripts
for table in tables:
    rows = table.find_all('tr')
    print(len(rows))
    for row in rows:
        description = row.find('td', class_='s1')
        descriptions.append(descri_clean)
        source = row.find('td', class_='s2')
        sources.append(source_clean)
        value = row.find('td', class_='d1') #find the row that gives the data
        values.append(value_clean) #compile it all together
driver.close()
I have been trying to get clean text from the table, but the extracted data looks like this:
<td class="s1" rowspan="3">Crude Oil<br/> ($/barrel)</td>
while I want just 'Crude Oil ($/barrel)'.
When I tried
description = row.find('td', class_='s1').text.renderContents()
descriptions.append(descri_clean)
this error showed up:
AttributeError: 'NoneType' object has no attribute 'renderContents'

You can use just requests. You can filter out your values by string matching on the expected values of certain class attributes while looping over the table rows. I set the two tables of interest into separate variables, which are lists of the rows within those tables. Each table on the page has its own distinct class identifier for the table number, e.g. t1, t2, and so on.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.eia.gov/todayinenergy/prices.php')
soup = bs(r.content, 'lxml')
table1 = soup.select('.t1 tr')
table2 = soup.select('.t2 tr')
for item in table1:
    if 'Crude Oil ($/barrel) - Nymex Apr' in item.text:
        rowInfo = [td.text for td in item.select('td')]
        print(rowInfo)
    elif 'Ethanol ($/gallon) - CBOT Apr' in item.text:
        rowInfo = [td.text for td in item.select('td')]
        print(rowInfo)
header = None
for item in table2:
    # A row with 4 cells starts a new group and carries the header cell (td.s1)
    if len(item.select('td')) == 4:
        header = item.select_one('td.s1').text
    if item.select_one('td.s2'):
        if item.select_one('td.s2').text in ['WTI','Brent','Louisiana Light','Los Angeles'] and header in ['Crude Oil ($/barrel)','Gasoline (RBOB) ($/gallon)']:
            rowInfo = [td.text for td in item.select('td')]
            print(rowInfo)
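
The AttributeError in the question suggests that row.find returned None for at least one row, which is expected when a cell with rowspan covers the rows below it. A minimal sketch of extracting clean text while guarding against that; the helper name cell_text and the wrapping rows are hypothetical, while the 's1' cell is the one quoted in the question:

```python
from bs4 import BeautifulSoup

# The cell quoted in the question, wrapped in a minimal hypothetical table.
SAMPLE_HTML = """
<table><tbody>
  <tr><td class="s1" rowspan="3">Crude Oil<br/> ($/barrel)</td><td class="s2">WTI</td></tr>
  <tr><td class="s2">Brent</td></tr>
</tbody></table>
"""


def cell_text(row, css_class):
    """Return the cleaned text of the first matching <td>, or None if absent."""
    cell = row.find("td", class_=css_class)
    if cell is None:
        return None
    # The separator keeps the words on either side of <br/> apart
    return cell.get_text(" ", strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
descriptions = [cell_text(row, "s1") for row in soup.find_all("tr")]
print(descriptions)
```

The second row yields None because its description lives in the rowspan cell of the first row; carrying the last seen value forward is one way to fill it in.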

Beautiful Soup scrape table with table breaks

I'm trying to scrape a table into a dataframe. My attempt only returns the table name and not the data within rows for each region.
This is what I have so far:
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
Ideal outcome I'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.

If you check your HTML response (soup), you will see that the table tag you get from table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones that contain the td's with the class names up dn d1 and s1).
So how about using the raw tr and td tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []
df = pd.DataFrame(rows, columns=['Region','Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results because a contains all the tr's of the page. You can use the rest too if you need to.
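
A hard-coded slice such as a[42:50] breaks as soon as the page gains or loses a row. A minimal sketch of picking rows by the classes of their cells instead; the markup is hypothetical (only the class names s1, d1 and t4 and the two sample values come from the thread), and the function name region_prices is an assumption:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a table tag that closes before its rows.
SAMPLE_HTML = """
<table class="t4"></table>
<tr><td class="s1">New England</td><td class="d1">2.59</td></tr>
<tr><td class="s1">New York City</td><td class="d1">2.52</td></tr>
<tr><td>unrelated row</td></tr>
"""


def region_prices(html):
    """Collect (region, price) pairs from rows that have both an s1 and a d1 cell."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        name = tr.find("td", class_="s1")
        price = tr.find("td", class_="d1")
        if name is None or price is None:
            continue  # skip rows that do not look like data rows
        records.append({
            "region": name.get_text(strip=True),
            "price": float(price.get_text(strip=True)),
        })
    return pd.DataFrame(records, columns=["region", "price"])


df = region_prices(SAMPLE_HTML)
print(df)
```

On the real page the same classes may appear in other tables too, so an extra filter (for example on a nearby heading) would likely be needed.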

I do not quite understand how to parse the Yahoo NHL Page

Here is my code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01")
content = url.read()
soup = BeautifulSoup(content)
print (soup.prettify)
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = td.findAll('yspscores')
        for yspscores in td:
            print (yspscores)
The problem I've been having is that the HTML for that yahoo page has the table data in this context: <td class="yspscores">
I do not quite understand how to reference it in my code. My goal is to print out the scores and name of the teams that the score corresponds to.

You grabbed the first table, but there is more than one table on that page. In fact, there are 46 tables.
You want to find the tables with the scores class:
for table in soup.find_all('table', class_='scores'):
    for row in table.find_all('tr'):
        for cell in row.find_all('td', class_='yspscores'):
            print(cell.text)
Note that searching for a specific class is done with the class_ keyword argument.
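
As a small illustration of the class_ keyword next to the equivalent CSS selector form, here is a sketch on hypothetical markup; only the class names scores and yspscores come from the thread, and the team names and numbers are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page layout is not reproduced here.
SAMPLE_HTML = """
<table class="layout"><tr><td>navigation</td></tr></table>
<table class="scores">
  <tr><td class="yspscores">Team A</td><td class="yspscores">3</td></tr>
  <tr><td class="yspscores">Team B</td><td class="yspscores">1</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword form: class_ is used because "class" is a reserved word in Python
by_keyword = [
    td.get_text(strip=True)
    for table in soup.find_all("table", class_="scores")
    for td in table.find_all("td", class_="yspscores")
]

# CSS selector form: the same cells in a single expression
by_selector = [
    td.get_text(strip=True)
    for td in soup.select("table.scores td.yspscores")
]

print(by_keyword)
print(by_keyword == by_selector)
```

Both forms skip the first table because it does not carry the scores class.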
