Extracting contents of two tables from web data - python

How can I retrieve all td information from this html data:
<h1>All staff</h1>
<h2>Manager</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Jon Staut
</td>
<td>Line Manager</td>
<td>0160 315 3832</td>
<td>
Jon.staut@strx.usc.com </td>
</tr>
</tbody>
</table>
<h2>Junior Staff</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Peter Boone
</td>
<td>Mailer</td>
<td>0160 315 3834</td>
<td>
Peter.Boone@strx.usc.com
</td>
</tr>
<tr>
<td>
John Peters
</td>
<td>Builder</td>
<td>0160 315 3837</td>
<td>
John.Peters@strx.usc.com
</td>
</tr>
</tbody>
</table>
Here's my code that generated an error:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.findAll('table', attrs={'class': 'StaffList'})
list_of_rows = []
for row in table.findAll('tr'):  # 2 rows found in table - loop through
    list_of_cells = []
    for cell in row.findAll('td'):  # each cell in a row
        text = cell.text.replace('&nbsp', '')
        list_of_cells.append(text)
    #print list_of_cells
    list_of_rows.append(list_of_cells)
#print all cells in the two rows
print list_of_rows
Error message:
AttributeError: 'ResultSet' object has no attribute 'findAll'
What do I need to do to make the code output all the information in the two web tables?

The problem starts at this line:
table = soup.findAll('table', attrs={'class': 'StaffList'})
findAll returns a ResultSet (a list of matching tags), and a list has no findAll attribute. If you only need the first matching table, simply change findAll to find:
table = soup.find('table', attrs={'class': 'StaffList'})

Alternatively, you can use a CSS selector expression to return the tr elements of every StaffList table without having to extract the tables first:
for row in soup.select('table.StaffList tr'): #2 rows found in table -loop through
......
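Applied to the question, the selector-based version might look like this (a sketch using a trimmed-down copy of the sample HTML; the skip-empty-rows check is my addition to drop the header row):

```python
from bs4 import BeautifulSoup

html = """
<table class="StaffList">
<tr><th>Name</th><th>Post title</th></tr>
<tr><td>Jon Staut</td><td>Line Manager</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
list_of_rows = []
for row in soup.select('table.StaffList tr'):
    # header rows contain only th cells, so their td list is empty
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        list_of_rows.append(cells)

print(list_of_rows)  # [['Jon Staut', 'Line Manager']]
```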

Thanks for the suggestions, guys. The problem is now solved after replacing two lines of code:
The first one:
table = soup.findAll('table', attrs={'class': 'StaffList'})
replaced with:
table = soup.findAll('tr')
The second one:
for row in table.findAll('tr'):
replaced with:
for row in table:
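For reference, here is a runnable sketch of the two replacements combined (the sample HTML is inlined instead of fetched with requests, and Python 3 print is used; the header row comes back as an empty list, matching the original logic):

```python
from bs4 import BeautifulSoup

html = """
<table class="StaffList">
<tr><th>Name</th><th>Telephone</th></tr>
<tr><td>Jon Staut</td><td>0160 315 3832</td></tr>
</table>
<table class="StaffList">
<tr><td>Peter Boone</td><td>0160 315 3834</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.findAll('tr')   # every row from both tables (a ResultSet)
list_of_rows = []
for row in table:            # iterate the ResultSet directly
    list_of_cells = [cell.get_text(strip=True) for cell in row.findAll('td')]
    list_of_rows.append(list_of_cells)

print(list_of_rows)
# [[], ['Jon Staut', '0160 315 3832'], ['Peter Boone', '0160 315 3834']]
```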

Related

Adding a new table to tbody using Beautiful Soup

I am trying to add another row to this table in my HTML page. The table has four columns.
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and inserted it into a 'tr' tag. The idea is that I can fill in the other three columns and insert them into the tr as well.
The tableBody[1] index is there because there are multiple tables on the page, none of which have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?
To create a new tag with attributes, you can use the attrs parameter of new_tag:
newVersion = soup.new_tag('td', attrs={'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs={'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1, newVersion)
tableBody = soup.select("tbody")
tableBody[0].append(newRow)  # append the new row to the first (here, only) tbody
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
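append puts the new row at the end of the tbody. If the row needs to land at a particular position instead, Tag.insert takes an integer index, or insert_after can anchor on an existing row. A small sketch with made-up cell text:

```python
from bs4 import BeautifulSoup

s = "<table><tbody><tr><td>first</td></tr><tr><td>last</td></tr></tbody></table>"
soup = BeautifulSoup(s, 'html.parser')

new_row = soup.new_tag('tr')
new_cell = soup.new_tag('td', attrs={'class': 'confluenceTd', 'colspan': '1'})
new_cell.string = 'inserted'
new_row.append(new_cell)

# anchor on an existing row: put the new row right after the first <tr>
first_tr = soup.find('tr')
first_tr.insert_after(new_row)

texts = [td.get_text() for td in soup.find_all('td')]
print(texts)  # ['first', 'inserted', 'last']
```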

Beautifulsoup iterate to get either <td>sometext</td> or url

I want to create a list of key-value pairs, with the <thead> items as keys. For the values I want the text of each <td>, except for the <td> items that contain an <a href='url'>, where I want the url instead.
Currently I am only able to get the text of every item. How do I get '/someurl' instead of Makulerad and Detaljer?
<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>
My code:
raw_html = simple_get('https://example.com/')
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class": "table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)
Try this:
Simply get the text of a <td> if it doesn't contain an <a>; otherwise get the href value.
from bs4 import BeautifulSoup
raw_html = '''<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class": "table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, [td.get_text() if not td.a else td.a['href'] for td in row.find_all("td")]))
    datasets.append(dataset)
print(datasets)
OUTPUT:
[{'Volymsenhet': 'Antal', 'Pris': '5,40', 'Valuta': 'SEK', 'Handelsplats': 'NASDAQ STOCKHOLM AB', 'url1': '/someurl', 'url2': '/someurl'}]

Select a table without tag by Beautifulsoup

Can BeautifulSoup select a table that has no attributes?
There are many tables in my HTML, but the data I want is in the only table without any attributes.
Here is my example:
There are 2 tables in the HTML.
One contains letters, and the other contains numbers.
from bs4 import BeautifulSoup
HTML2 = """
<table>
<tr>
<td class>a</td>
<td class>b</td>
<td class>c</td>
<td class>d</td>
</tr>
<tr>
<td class>e</td>
<td class>f</td>
<td class>g</td>
<td class>h</td>
</tr>
</table>
<table cellpadding="0">
<tr>
<td class>111</td>
<td class>222</td>
<td class>333</td>
<td class>444</td>
</tr>
<tr>
<td class>555</td>
<td class>666</td>
<td class>777</td>
<td class>888</td>
</tr>
"""
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.select('table[cellpadding!="0"]')  # <--- I think the key point is here.
for div in f2:
    row = ''
    rows = div.findAll('tr')
    for row in rows:
        if (row.text.find('td') != False):
            print(row.text)
I only want the data in the letters table, formatted like the following:
a b c d
e f g h
Then save it to Excel.
But my selector only matches the numbers table.
Any hints?
Thanks!
You could use find_all and select only tables that don't have a specific attribute.
f2 = soup2.find_all('table', {'cellpadding':None})
Or if you want to select tables that have absolutely no attributes:
f2 = [tbl for tbl in soup2.find_all('table') if not tbl.attrs]
Then you can build a nested list of rows from f2 and pass it to a DataFrame:
data = [
    [td.text for td in tr.find_all('td')]
    for table in f2 for tr in table.find_all('tr')
]
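If the end goal is Excel, the nested list drops straight into pandas (a sketch, assuming pandas is installed; writing .xlsx additionally needs openpyxl, and the filename is made up):

```python
import pandas as pd

# `data` as built above: one inner list per table row
data = [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
df = pd.DataFrame(data)
print(df.to_string(index=False, header=False))
# df.to_excel('letters.xlsx', index=False, header=False)  # hypothetical filename
```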
You can use the has_attr method to test whether a table has the cellpadding attribute:
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.find_all('table')
for div in f2:
    if not div.has_attr('cellpadding'):
        for row in div.findAll('tr'):
            print(row.text)

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the '2016' but I haven't managed to. The furthest I've got is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
Neither of them works.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data)
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print span.next_sibling.strip() # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text; I think it is more readable):
span = soup.find('span', text=re.compile('Expiry Date'))
print span.next_sibling.strip() # prints 2016
re.compile() is needed since the text can span multiple lines and have additional spaces around it. Do not forget to import the re module.
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data)
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print exp_date # 2016
To learn about BeautifulSoup, I recommend you read the documentation.

Getting the nth element using BeautifulSoup

From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Is findNextSibling and an incrementing counter the way to go?
You could also use findAll to get all the rows in a list and after that just use the slice syntax to access the elements that you need:
rows = soup.findAll('tr')[4::5]
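As a quick illustration of what that slice does (zero-based index 4 is the fifth row, then every fifth one after it):

```python
rows = ['row%d' % n for n in range(1, 21)]  # stand-in for soup.findAll('tr')
print(rows[4::5])  # ['row5', 'row10', 'row15', 'row20']
```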
This can be done easily with select in Beautiful Soup if you know the row numbers to select. (Note: this is bs4.)
row = 5
while True:
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element is your desired row element, do what you want with it
        row += 5
    else:
        break
As a general solution, you can convert the table to a nested list and iterate...
import BeautifulSoup

def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.findAll('tr')
    for row in rows:
        result.append([])
        cols = row.findAll('td')
        for col in cols:
            strings = [_string.encode('utf8') for _string in col.findAll(text=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result

if __name__ == "__main__":
    """Build a small table with one column and ten rows, then parse into a list"""
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup.BeautifulSoup(htstring)
    for idx, ii in enumerate(listify(soup)):
        if (idx + 1) % 5 > 0:
            continue
        print ii
Running that...
[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$
Another option, if you prefer raw html...
"""Build a small table with one column and ten rows, then parse it into a list"""
htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
result = [html_tr for idx, html_tr in enumerate(soup.findAll('tr'))
          if (idx + 1) % 5 == 0]
print result
Running that...
[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$
Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:
from gazpacho import Soup
url = "https://en.wikipedia.org/wiki/List_of_probability_distributions"
soup = Soup.get(url)
a_tags = soup.find("a", {"href": "distribution"})
links = ["https://en.wikipedia.org" + a.attrs["href"] for a in a_tags]
links[4::5]  # start at index 4 (the fifth link) and stride by 5
