Python BeautifulSoup to scrape tables from a webpage

I am trying to gather information from a website that has a database for ships.
I was trying to get the information with BeautifulSoup. But at the moment it does not seem to be working. I tried searching the web and tried different solutions, but did not manage to get the code working.
I was wondering whether I have to change the line
table = soup.find_all("table", { "class" : "table1" })
as there are 5 tables with class='table1', but my code only seems to find one.
Do I have to create a loop for the tables? When I tried that, I could not get it working. Also, the next line, table_body = table.find('tbody'), gives an error:
AttributeError: 'ResultSet' object has no attribute 'find'
This looks like a conflict between my code and BeautifulSoup's source, in which ResultSet subclasses list. Do I have to iterate over that list?
from urllib import urlopen
from bs4 import BeautifulSoup

shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)
soup = BeautifulSoup(shipPage)
table = soup.find_all("table", { "class" : "table1" })
print table
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        print td
    print

A couple of things:
As Kevin mentioned, you need to use a for loop to iterate through the list returned by find_all.
Not all of the tables have a tbody, so you have to wrap the lookup in a try block.
When you print, use the .text attribute so you print the text value and not the tag itself.
Here is the revised code:
shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)
soup = BeautifulSoup(shipPage)
tables = soup.find_all("table", { "class" : "table1" })
for mytable in tables:
    table_body = mytable.find('tbody')
    try:
        rows = table_body.find_all('tr')
        for tr in rows:
            cols = tr.find_all('td')
            for td in cols:
                print td.text
    except AttributeError:
        print "no tbody"
Which produces the below output:
Register Number:
08910M
IMO Number:
9365398
Ship Name:
SUPERSTAR
Call Sign:
ESIY
.....
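For reference, the same pattern works in Python 3 with a None check instead of a try block. The HTML below is a made-up stand-in for the real page (which may have changed), with one table that has a tbody and one that does not:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: two class="table1" tables, only the first with a tbody
html = """
<table class="table1"><tbody><tr><td>Register Number:</td><td>08910M</td></tr></tbody></table>
<table class="table1"><tr><td>Ship Name:</td><td>SUPERSTAR</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = []
for table in soup.find_all("table", class_="table1"):  # ResultSet is list-like
    body = table.find("tbody") or table                # fall back if no tbody
    for tr in body.find_all("tr"):
        for td in tr.find_all("td"):
            cells.append(td.text)
print(cells)
```

The `or table` fallback means rows are still found when a table lacks a tbody, without needing the try/except at all.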

Related

Scraping wrong table

I'm trying to get the advanced stats of players into an Excel sheet, but the table being scraped is the first one instead of the advanced stats table:
ValueError: Length of passed values is 23, index implies 21
If I try to use the id instead, I get another error about tbody.
I also get an error on
lname=name.split(" ")[1]
IndexError: list index out of range
I think that has to do with 'Nene' in the list. Is there a way to fix that?
import requests
import pandas as pd
from bs4 import BeautifulSoup

playernames = ['Carlos Delfino',
               'Yao Ming',
               'Andris Biedrins',
               'Nene']
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}).find_next('tbody')
    print(table)
    columns = ['Season', 'Team', 'League', 'GP', 'GS', 'TS%', 'eFG%', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'TOV%', 'STL%', 'BLK%', 'USG%', 'Total S%', 'PPR', 'PPS', 'ORtg', 'DRtg', 'PER']
    df = pd.DataFrame(columns=columns)
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        row = [td.text.replace('\n', '') for td in tds]
        df = df.append(pd.Series(row, index=columns), ignore_index=True)
    df.to_csv('international players.csv', index=False)
Brazilian footballers often go by a single name (think Fred). If you want to use such a moniker (Nene/Fred), you need to implement exception handling for it, something like
try:
    lname = name.split(" ")[1]
except IndexError:
    lname = name
For your scraping issue, try using find_all as opposed to find. This will give you every data table on the page, and then you can pull the correct table out of the list.
Change table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}) to use find_all instead.
FYI also, the table IDs change every time you refresh the page, so you can't use an id as a search mechanism.
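Putting both fixes together, here is a minimal sketch. The HTML is a made-up stand-in for the realgm page, just to show picking a table out of the find_all list and handling single-name players:

```python
from bs4 import BeautifulSoup

# Stand-in for the realgm page: two tables, only one with class "tablesaw"
html = """
<table class="other"><tr><td>per-game stats</td></tr></table>
<table class="tablesaw"><tr><td>advanced stats</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching table; index into the list to pick one
tables = soup.find_all("table", attrs={"class": "tablesaw"})
advanced = tables[0]
print(advanced.td.text)

# Single-name players: fall back to the full name when there is no surname
names = []
for name in ["Yao Ming", "Nene"]:
    parts = name.split(" ")
    fname = parts[0]
    try:
        lname = parts[1]
    except IndexError:
        lname = name
    names.append((fname, lname))
print(names)
```

On the real page you would inspect the list (or the table headers) to find the index of the advanced stats table, since the ids are not stable.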

WebScraping for names - NoneType error on find_all

I am trying to extract names (notifier) using BeautifulSoup, but when I test run it, it gives a NoneType error:
AttributeError: 'NoneType' object has no attribute 'find_all'
Code:
import requests
from bs4 import BeautifulSoup

page_counter = 1
while page_counter < 5:
    print('page number: %d' % page_counter)
    url = requests.get('http://zone-h.org/archive/page=%d' % page_counter, timeout=8).text
    soup = BeautifulSoup(url, 'html.parser')
    table = soup.find('table')
    rows = table.find_all('tr')
    page_counter += 1
The issue is in these two lines:
table = soup.find('table')
rows = table.find_all('tr')
soup.find('table') will return None if it can't find any element with the tag table. That causes table.find_all('tr') to raise the error you got, because table was assigned None on the previous line.
Does that make sense?
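A defensive version of those two lines checks for None before calling find_all. The HTML snippet here is just an illustration of a page with no table:

```python
from bs4 import BeautifulSoup

# A page with no <table> at all, to show the failure mode
soup = BeautifulSoup("<div>no tables here</div>", "html.parser")

table = soup.find("table")  # returns None: nothing matched
if table is not None:
    rows = table.find_all("tr")
else:
    rows = []               # skip the page instead of crashing
print(len(rows))
```

The same guard applies to any find result you go on to call methods on.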

Trying to select rows in a table, always getting NavigableString error

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class":"wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr")  # <-- this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
    cols = row.findAll('td')
    for td in cols:
        if td.a:
            countries.append(td.a.text)
        elif "m (" in td.text:
            altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
File "wiki.py", line 18, in <module>
rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows directly with soup.find('tr'), but this results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?
If you go to the page source and search for tbody, you get 0 results, so that is the cause of the first problem: Wikipedia serves <table class="wikitable sortable"> without a tbody. The tbody you see in your browser's inspector is inserted by the browser, not present in the HTML that BeautifulSoup receives.
For your second problem, you need to be using find_all and not find because find just returns the first tr. So instead you want
rows = soup.find_all("tr")
Hope this helps :)
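As a sanity check, html.parser only keeps a tbody that is literally present in the markup, so you can confirm the explanation above directly (the row here is a made-up sample):

```python
from bs4 import BeautifulSoup

# html.parser does not invent a tbody the way browsers do
html = '<table class="wikitable sortable"><tr><td>Nepal</td><td>3,265 m (10,712 ft)</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})
print(table.find("tbody"))        # None: no tbody in the source
print(len(table.find_all("tr")))  # searching the table directly still works
```

This is why searching the table (or the soup) for tr directly is the reliable approach.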
The code below worked for me (note: Python 2):
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country = col[0].text.strip()
    elevation = float(''.join(map(unicode.strip, col[1].text.split("m")[0])).replace(',', ''))
    countries.append(country)
    altitudes.append(elevation)
print countries, '\n', altitudes

I do not quite understand how to parse the Yahoo NHL Page

Here is my code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01")
content = url.read()
soup = BeautifulSoup(content)
print(soup.prettify())
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = td.findAll('yspscores')
        for yspscores in td:
            print(yspscores)
The problem I've been having is that the HTML for that yahoo page has the table data in this context: <td class="yspscores">
I do not quite understand how to reference it in my code. My goal is to print out the scores and name of the teams that the score corresponds to.
You grabbed the first table, but there is more than one table on that page. In fact, there are 46 tables.
You want to find the tables with the scores class:
for table in soup.find_all('table', class_='scores'):
    for row in table.find_all('tr'):
        for cell in row.find_all('td', class_='yspscores'):
            print(cell.text)
Note that searching for a specific class is done with the class_ keyword argument.
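As a side note, class_ and the attrs dict are interchangeable spellings of the same filter; a quick illustration with a made-up score cell:

```python
from bs4 import BeautifulSoup

html = '<table class="scores"><tr><td class="yspscores">3 - 2</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Both queries match the same cell; class_ avoids clashing with the
# Python keyword "class", attrs spells the attribute name out explicitly
a = soup.find("td", class_="yspscores")
b = soup.find("td", attrs={"class": "yspscores"})
print(a.text, b.text)
```

Either form works in find and find_all alike.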

How do I find an element with a certain class in a web page with BeautifulSoup?

I have tried to find a table with class "data" in a web page with this code.
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table.data")
print rows
However, I am getting none for rows even though I am sure that a table with class "data" exists on that page. What is the proper way to find an element with class "data" on a web page with BeautifulSoup?
If you want to pick up the rows, you'll need the following:
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())

# if there's only one table with class = data
table = soup.find('table', attrs={'class': 'data'})

# if there are multiple tables with class = data and you need the n-th one
table = soup.findAll('table', attrs={'class': 'data'})[n]

rows = table.findAll('tr')  # gives all the rows; you can set attrs to filter
Then you can also iterate through the columns:
for row in rows:
    cols = row.findAll('td')
    ...
You want something like
rows = soup.find_all('table', attrs = {"class": "data"})
instead of your current line (tested). The class of an element is an attribute, so you filter by attribute in find_all. This line returns a large table element from your sample page.
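Incidentally, "table.data" is CSS selector syntax, which findAll does not understand as a tag name. In the newer bs4 package the select method does accept it (the old BeautifulSoup 3 module used above has no select); a sketch with stand-in HTML:

```python
from bs4 import BeautifulSoup

html = ('<table class="data"><tr><td>pick 1</td></tr></table>'
        '<table><tr><td>other</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

# select() interprets "table.data" as a CSS selector: <table class="data">
rows = soup.select("table.data tr")
print(len(rows), rows[0].td.text)
```

So if you are used to CSS selectors, select("table.data") does what findAll("table.data") was meant to do.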
