Trying to select rows in a table, always getting NavigableString error - python

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class":"wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr")  # <--- this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
    cols = row.findAll('td')
    for td in cols:
        if td.a:
            countries.append(td.a.text)
        elif "m (" in td.text:
            altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
File "wiki.py", line 18, in <module>
rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows straight up with soup.find('tr').
This results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?

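If you open the page source and search for tbody, you get 0 results, which explains the first problem: table.find("tbody") returns None. Wikipedia serves a plain <table class="wikitable sortable"> without a tbody; the one you see in the browser's inspector is presumably added after the page loads, so it is not in the HTML that BeautifulSoup receives.
For your second problem, use find_all rather than find. find returns only the first tr, and looping over that single tag iterates over its children, which can include bare text nodes (NavigableString) that cannot be searched with findAll. So instead you want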
rows = soup.find_all("tr")
Hope this helps :)
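To make the difference concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The two-row table is a hypothetical stand-in for the Wikipedia markup, so treat it as an illustration of the behaviour rather than the exact page structure.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical stand-in for the Wikipedia table (note: no <tbody>).
HTML = """
<table class="wikitable sortable">
<tr>
<th>Country</th>
<th>Elevation</th>
</tr>
<tr>
<td><a href="/wiki/Andorra">Andorra</a></td>
<td>1,996 m (6,549 ft)</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# There is no <tbody> in the markup, so find() returns None.
tbody = table.find("tbody")

# find() returns one Tag; iterating over it walks its children,
# which include the whitespace between cells as NavigableString.
first_row = table.find("tr")
child_types = {type(child) for child in first_row}

# find_all() returns a list of <tr> Tags that is safe to loop over.
rows = table.find_all("tr")
print(tbody, child_types, len(rows))
```

Looping over `rows` gives one `Tag` per table row, so `row.find_all('td')` works on each item.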

The code below worked for me:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
# Skip the first row, which is the header
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country = col[0].text.strip()
    # Keep the part before "m", then drop whitespace and thousands separators
    metres = col[1].text.split("m")[0]
    elevation = float(''.join(metres.split()).replace(',', ''))
    countries.append(country)
    altitudes.append(elevation)
print(countries, '\n', altitudes)
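Since the question asks for the values "in a paired fashion", the elevation parsing can be pulled into a small helper and the two lists zipped together. This is only a sketch: the `parse_elevation` name is hypothetical, the cell format "1,996 m (6,549 ft)" is assumed from the `"m ("` check in the question, and the sample strings are made up for illustration.

```python
def parse_elevation(cell_text):
    """Return the metres value from text such as '1,996 m (6,549 ft)'."""
    # Everything before the first "m" is the metres figure.
    metres_part = cell_text.split("m")[0]
    # Remove all whitespace and the thousands separator before converting.
    cleaned = "".join(metres_part.split()).replace(",", "")
    return float(cleaned)


# Made-up sample cells, not values taken from the real page.
countries = ["Exampleland", "Samplestan"]
elevation_cells = ["1,996 m (6,549 ft)", " 30 m (98 ft)\n"]

# Pair each country with its parsed elevation.
pairs = dict(zip(countries, (parse_elevation(c) for c in elevation_cells)))
print(pairs)
```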

Related

Web scrape and pull an attribute value instead of the text value out of td for the entire table beautiful soup

I am trying to scrape some data from a table, but they have the content that I actually would like in an attribute.
Example xml:
<tr data-row="0">
<th scope="row" class="left" data_append-csv="AlleRi00" data-stat="player" csk="Allen, Ricardo">
Ricardo Allen
</th>
<td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4">
<strong>O</strong>
</td>
When scraping the table I use the following code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    td = tr.find_all(['th','td'])
    row = [tr.text for tr in td]
    final_data.append(row)
df = pd.DataFrame(final_data[1:],final_data[0])
With my current code, I get a good looking dataframe with headers and all the info that is visible when looking at the table. However, I would like to get "Out: Concussion" instead of "O" within the table. I've been trying numerous ways and cannot figure it out. Please let me know if this is possible with the current process or if I am approaching it all wrong.
This should help you:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')
final_data = []
for tr in table_rows:
    cells = tr.find_all(['th', 'td'])
    # Prefer the data-tip attribute (e.g. "Out: Concussion") over the visible text
    row = [cell['data-tip'] if cell.has_attr('data-tip') else cell.text for cell in cells]
    final_data.append(row)
# The first row holds the headers; the remaining rows are the data
df = pd.DataFrame(final_data[1:], columns=final_data[0])
df.to_csv("D:\\injuries.csv", index=False)
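The attribute-or-text choice can also be written with `Tag.get`, which takes a fallback value. The sketch below runs on a trimmed inline copy of the markup from the question rather than the live site; the header row and the `cell_value` helper are hypothetical additions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, partly hypothetical copy of the injuries table markup.
HTML = """
<table id="team_injuries">
<tr><th data-stat="player">Player</th><th data-stat="week_4">Week 4</th></tr>
<tr data-row="0">
<th scope="row" data-stat="player">Ricardo Allen</th>
<td data-stat="week_4" data-tip="Out: Concussion"><strong>O</strong></td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def cell_value(cell):
    # Use the data-tip attribute when present, else the visible text.
    return cell.get("data-tip", cell.get_text(strip=True))


rows = [
    [cell_value(cell) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]
print(rows)
```

Cells without a `data-tip` attribute keep their visible text, so the header and player-name columns come through unchanged.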

How to scrape a table from any site and store it to data frame?

I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store the data in a pandas DataFrame.
I have pulled the table but am unable to pick out the columns (Postcode, Borough, Neighbourhood).
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode,Borough,Neighbourhood])
With the above code I get
TypeError: 'NoneType' object is not subscriptable
I googled it and learned that I cannot do
Postcode = row.columns[1].get_text()
because of an inline property of the function.
I also tried something else but got an "index error" message.
It should be simple: I need to traverse the rows, pick the three columns from each row, and store them in a list. But I am not able to write the code for it.
Expected output is
Postcode  Borough       Neighbourhood
M1A       Not assigned  Not assigned
M2A       Not assigned  Not assigned
M3A       North York    Parkwoods
The scraping code is wrong in the parts commented below.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    # The first row contains only <th> tags, so querying <td> returns an empty list for it.
    columns = row.find_all('td')
    # Skip the header row (and any other row without <td> cells).
    if len(columns) > 0:
        # The cells are in `columns`, not `row.columns`, and indices start at 0.
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])
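Then again, be careful: get_text returns only the text of a cell (tags such as links are stripped, their text is kept), but it leaves surrounding whitespace intact, for example the trailing newline in the last column. You might want to use get_text(strip=True) or .strip() to avoid that.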
Happy web scraping :)
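As a quick check of the header-row skip and the whitespace issue, here is a small sketch that parses a shortened inline copy of the table from the question instead of fetching the page. The variable names `raw_last` and `df` are just for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table markup shown in the question.
HTML = """<table class="wikitable sortable">
<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

# Plain get_text() keeps the newline that precedes </td>.
raw_last = table.find_all("td")[2].get_text()

df = []
for row in table.find_all("tr"):
    columns = row.find_all("td")
    if columns:  # the header row has no <td> cells and is skipped
        df.append([col.get_text(strip=True) for col in columns])
print(repr(raw_last), df)
```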
I don't know pandas, but I use this script to scrape tables. I hope it helps.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    # Column names come from the <th> cells
    "head": [th.text.strip() for th in tbl.find_all('th')],
    # Data rows are the <tr> elements that contain no <th>
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
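A head/rows dictionary of this shape can be turned into one record per row by zipping the header with each row. The sketch below uses literal values copied from the expected output in the question, so it runs without a network request; the `records` name is hypothetical.

```python
# Same shape as table_dict above, filled with the rows from the question.
table_dict = {
    "head": ["Postcode", "Borough", "Neighbourhood"],
    "rows": [
        ["M1A", "Not assigned", "Not assigned"],
        ["M2A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
    ],
}

# Pair every cell with its column name.
records = [dict(zip(table_dict["head"], row)) for row in table_dict["rows"]]

for record in records:
    print(record)
```

If pandas is available, `pd.DataFrame(table_dict["rows"], columns=table_dict["head"])` should give the DataFrame the question asks for.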
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd
url = 'valid_url'
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html(url)
print(df[0].head())

BeautifulSoup: Can't Access Info Within TD

I'm looking at the following website:
https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859
I want to extract the name of each university and the href associated with it. So for the first entry, I'd like to get Stanford and https://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=18564
I've gotten to the point where I have all of the TDs, using BeautifulSoup. I'm just having difficulty extracting the school and its href.
Here's my attempt:
def main():
    r = requests.get('https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859')
    data = r.text
    soup = BeautifulSoup(data)
    table = soup.find_all('table')[1]
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        print(cols)
When I try to access cols[0], I get:
IndexError: list index out of range
Any idea how to fix this would be awesome!
Thanks
The first two tr's are in the thead and contain no td tags, so you want to skip them:
rows = table.find_all('tr')[2:]
To get what you want, we can simplify using CSS selectors:
table = soup.find_all('table', limit=2)[1]
# skip first two tr's
rows = table.select("tr + tr + tr")
for row in rows:
    # anchor we want is inside the first td
    a = row.select_one("td a")  # or a = row.find("td").a
    print(a.text, a["href"])
Also, the href is a relative path, so you need to join it to a base URL:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3; in Python 2 this is `from urlparse import urljoin`
def main():
    base = "https://modules.ussquash.com/ssm/pages/leagues/"
    r = requests.get('https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859')
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find_all('table', limit=2)[1]
    # skip first two tr's
    rows = table.select("tr + tr + tr")
    for row in rows:
        a = row.select_one("td a")
        print(a.text, urljoin(base, a["href"]))
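How `urljoin` resolves the link depends on whether the base ends with a slash. The sketch below uses only the standard library; the relative href `Team_Information.asp?id=18564` is an assumption based on the full URL quoted in the question, not something read from the page.

```python
from urllib.parse import urljoin

# Assumed form of the relative href found in the first <td>.
href = "Team_Information.asp?id=18564"

# Base with a trailing slash: the href is appended to the directory.
with_slash = urljoin("https://modules.ussquash.com/ssm/pages/leagues/", href)

# Base without a trailing slash: the last path segment is replaced.
without_slash = urljoin("https://modules.ussquash.com/ssm/pages/leagues", href)

# The page URL itself also works as a base.
page = "https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859"
from_page = urljoin(page, href)

print(with_slash)
print(without_slash)
print(from_page)
```

Passing the URL of the page that was fetched as the base avoids having to hard-code the directory.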

Python BeautifulSoup to scrape tables from a webpage

I am trying to gather information from a website that has a database for ships.
I was trying to get the information with BeautifulSoup. But at the moment it does not seem to be working. I tried searching the web and tried different solutions, but did not manage to get the code working.
I was wondering whether I have to change the line
table = soup.find_all("table", { "class" : "table1" })
as there are 5 tables with class='table1', but my code only finds 1.
Do I have to create a loop over the tables? When I tried this I could not get it working. Also, the next line, table_body = table.find('tbody'), gives an error:
AttributeError: 'ResultSet' object has no attribute 'find'
This seems to be a conflict between my code and BeautifulSoup's source code, where ResultSet subclasses list. Do I have to iterate over that list?
from urllib import urlopen
shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)
from bs4 import BeautifulSoup
soup = BeautifulSoup(shipPage)
table = soup.find_all("table", { "class" : "table1" })
print table
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        print td
    print
A couple of things:
As Kevin mentioned, you need to use a for loop to iterate through the list returned by find_all.
Not all of the tables have a tbody, so you have to wrap the code that uses the result of the find in a try block.
When you print, you want to use the .text attribute so you print the text value and not the tag itself.
Here is the revised code:
# Python 3 version
from urllib.request import urlopen
from bs4 import BeautifulSoup
shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)
soup = BeautifulSoup(shipPage, 'html.parser')
table = soup.find_all("table", { "class" : "table1" })
for mytable in table:
    table_body = mytable.find('tbody')
    try:
        rows = table_body.find_all('tr')
        for tr in rows:
            cols = tr.find_all('td')
            for td in cols:
                print(td.text)
    except AttributeError:
        # table_body is None when the table has no <tbody>
        print("no tbody")
Which produces the below output:
Register Number:
08910M
IMO Number:
9365398
Ship Name:
SUPERSTAR
Call Sign:
ESIY
.....
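An alternative to the try block is to test for the missing tbody explicitly and fall back to the table itself. The sketch below is a variation on the answer, not the answerer's code; it uses a hypothetical two-table HTML string in place of the ship register page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table with <tbody>, one without.
HTML = """
<table class="table1"><tbody>
<tr><td>Register Number:</td><td>08910M</td></tr>
</tbody></table>
<table class="table1">
<tr><td>Ship Name:</td><td>SUPERSTAR</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all returns a ResultSet, which is a list subclass, so loop over it.
tables = soup.find_all("table", {"class": "table1"})

cells = []
for table in tables:
    body = table.find("tbody")
    if body is None:
        # No <tbody>: search for rows directly under the table.
        body = table
    for tr in body.find_all("tr"):
        cells.append([td.get_text(strip=True) for td in tr.find_all("td")])
print(cells)
```

With this approach, rows from tables without a tbody are collected instead of being reported as "no tbody".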

I do not quite understand how to parse the Yahoo NHL Page

Here is my code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01")
content = url.read()
soup = BeautifulSoup(content)
print (soup.prettify)
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = td.findAll('yspscores')
        for yspscores in td:
            print (yspscores)
The problem I've been having is that the HTML for that yahoo page has the table data in this context: <td class="yspscores">
I do not quite understand how to reference it in my code. My goal is to print out the scores and name of the teams that the score corresponds to.
You grabbed the first table, but there is more than one table on that page. In fact, there are 46 tables.
You want to find the tables with the scores class:
for table in soup.find_all('table', class_='scores'):
    for row in table.find_all('tr'):
        for cell in row.find_all('td', class_='yspscores'):
            print(cell.text)
Note that searching for a specific class is done with the class_ keyword argument, because class is a reserved word in Python.
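Here is a small self-contained sketch of the class filter, run on an inline HTML string since the 2013 scoreboard page may no longer be available. The team name, score, and second table are made up; the `attrs` dictionary form is shown as an equivalent way to write the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a navigation table and a scores table.
HTML = """
<table class="nav"><tr><td class="yspscores">menu</td></tr></table>
<table class="scores">
<tr><td class="yspscores">Boston</td><td class="yspscores">3</td><td class="other">Final</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Filter with the class_ keyword argument.
by_keyword = [
    cell.get_text(strip=True)
    for table in soup.find_all("table", class_="scores")
    for row in table.find_all("tr")
    for cell in row.find_all("td", class_="yspscores")
]

# The same filter written with an attrs dictionary.
by_attrs = [
    cell.get_text(strip=True)
    for table in soup.find_all("table", attrs={"class": "scores"})
    for row in table.find_all("tr")
    for cell in row.find_all("td", attrs={"class": "yspscores"})
]
print(by_keyword, by_attrs)
```

Cells in the navigation table are skipped even though they carry the same cell class, because the outer loop only visits tables with the scores class.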
