I have some HTML text like this:
<tr>
<td><strong>Turnover</strong></td>
<td width="20%" class="currency">£348,191</td>
<td width="20%" class="currency">£856,723</td>
<td width="20%" class="currency">£482,177</td>
</tr>
<tr>
<td> Cost of sales</td>
<td width="20%" class="currency">£275,708</td>
<td width="20%" class="currency">£671,345</td>
<td width="20%" class="currency">£357,587</td>
</tr>
<tr>
There's lots of HTML before and after it. I'd like to parse the numbers. There can be a varying number of td columns, so I'd like to parse all of them. In this case there are three columns, so the result I'm looking for is:
[348191, 856723, 482177]
Ideally, I'd like to parse the Turnover and Cost of sales data separately into different variables.
You can use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """ <tr>
... <td><strong>Turnover</strong></td>
... <td width="20%" class="currency">£348,191</td>
... <td width="20%" class="currency">£856,723</td>
... <td width="20%" class="currency">£482,177</td>
... </tr>
... <tr>
... <td> Cost of sales</td>
... <td width="20%" class="currency">£275,708</td>
... <td width="20%" class="currency">£671,345</td>
... <td width="20%" class="currency">£357,587</td>
... </tr>"""
>>> soup = BS(html, "html.parser")
>>> for i in soup.find_all('tr'):
...     if i.find('td').text == "Turnover":
...         for x in i.find_all('td', {'class':'currency'}):
...             print(x.text)
...
£348,191
£856,723
£482,177
Explanation
First we parse the HTML into a BeautifulSoup object, which we can easily navigate through. find_all, no prizes for guessing what it does, finds all the <tr>s.
We loop through each tr and if the first <td> is Turnover, we then go through the rest of the <td>s.
We loop through each td with class="currency" and print its content.
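To get the integers the question asks for, strip the currency symbol and thousands separators before converting. Here's a minimal sketch; parse_currency_row is just an illustrative helper, not part of bs4:
def parse_currency_row(soup, label):
    # find the row whose first <td> text matches the label, then
    # turn "£348,191"-style strings into plain ints
    for tr in soup.find_all('tr'):
        first = tr.find('td')
        if first and first.get_text(strip=True) == label:
            return [int(td.get_text(strip=True).lstrip('£').replace(',', ''))
                    for td in tr.find_all('td', {'class': 'currency'})]
    return []

turnover = parse_currency_row(soup, 'Turnover')
cost_of_sales = parse_currency_row(soup, 'Cost of sales')
print(turnover)       # [348191, 856723, 482177]
print(cost_of_sales)  # [275708, 671345, 357587]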
I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me to understand this better.
I have extracted the HTML, which is shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to parse it with find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
cols = row.find_all('td')
cols = [ele.text.strip() for ele in cols]
data.append([ele for ele in cols if ele])values
I'm trying to save the values in the "listmaintext" class.
Error message
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
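The error message itself points at the fix: find_all() returns a ResultSet (a list of tags), which has no find() method. Either index into the result or use find() to get a single tag. A minimal sketch of the corrected lookup:
table = soup.find('table', {'id': 'ContentPlaceHolder1_dlDetails'})
# or equivalently: table = soup.find_all('table', {'id': 'ContentPlaceHolder1_dlDetails'})[0]
rows = table.select('tr')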
Another way to do this, using next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
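Alternatively, you can select every cell with the listmaintext class and pair the cells up with zip: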
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
I have a webpage with a table that only appears when I click 'Inspect Element' and is not visible through the View Source page. The table contains only two rows with several cells each and looks similar to this:
<table class="datadisplaytable">
<tbody>
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</tbody>
</table>
What I'm trying to do is to iterate through the rows and return the text contained in each cell. I can't really seem to do it with Selenium. The elements contain no IDs and I'm not sure how else to get them. I'm not very familiar with using xpaths and such.
Here is a debugging attempt that returns a TypeError:
def check_grades(self):
    table = []
    for i in self.driver.find_element_by_class_name("dddefault"):
        table.append(i)
    print(table)
What is an easy way to get the text from the rows?
XPath is fragile. It's better to use CSS selectors or classes:
mytable = driver.find_element_by_css_selector('table.datadisplaytable')
for row in mytable.find_elements_by_css_selector('tr'):
    for cell in row.find_elements_by_tag_name('td'):
        print(cell.text)
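To build the table list the question's debugging attempt was aiming for, you can nest the same calls in a list comprehension (a sketch, assuming driver is your live WebDriver):
mytable = driver.find_element_by_css_selector('table.datadisplaytable')
table = [[cell.text for cell in row.find_elements_by_tag_name('td')]
         for row in mytable.find_elements_by_css_selector('tr')]
print(table)  # e.g. [['16759', 'MATH', '123', '001', 'Calculus', '', '', ''], ...]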
If you want to go row by row using an xpath, you can use the following:
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault'][text()])
Which outputs:
['16759', 'MATH', '123', '001', 'Calculus']
['16449', 'PHY', '456', '002', 'Physics']
Using td[text()] will avoid getting any Nones returned for the td's that hold no text.
So to do the same using selenium you would:
table = driver.find_element_by_xpath("//table[@class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
    print([td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault'][1]")])
For multiple tables:
def get_row_data(table):
    for row in table.find_elements_by_xpath(".//tr"):
        yield [td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault'][text()]")]

for table in driver.find_elements_by_xpath("//table[@class='datadisplaytable']"):
    for data in get_row_data(table):
        print(data)  # use the data
Correction of the Selenium part of @Padraic Cunningham's answer:
table = driver.find_element_by_xpath("//table[@class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
    print([td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault']")])
Note: the originally posted snippet was missing a closing round bracket at the end; this version also removes the [1] index, to match the first lxml example.
Another note: the variant with the [1] index is still worth keeping, since it shows how to extract an individual cell from each row.
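For example, keeping the [1] index pulls just the first matching cell from each row (a sketch reusing the table element from the snippet above):
first_cells = [row.find_element_by_xpath(".//td[@class='dddefault'][1]").text
               for row in table.find_elements_by_xpath(".//tr")]
# e.g. ['16759', '16449']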
Another Version (modified and corrected post by Padraic Cunningham):
Tested with Python 3.x
#!/usr/bin/python
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault']")])
I'm trying to parse html like the following:
<tbody>
<tr class data-row="0">
<td align="right"></td>
</tr>
<tr class data-row="1">
<td align="right"></td>
</tr>
<tr class="thead over_theader" data-row="2">
<td align="right"></td>
</tr>
<tr class="thead" data-row="3">
<td align="right"></td>
</tr>
<tr class data-row="4">
<td align="right"></td>
</tr>
<tr class data-row="5">
<td align="right"></td>
</tr>
</tbody>
I want to obtain all tr tags (and their children) where class is not specified. For the example above, that means I want the tr tags where data-row is not 2 or 3.
How do I do this using Beautiful Soup 4?
I tried
tableBody = soup.findAll('tbody')
rows = tableBody[0].findAll(attrs={"class":""})
but this returned a type bs4.element.ResultSet of length 8 (i.e. it included the tr children with td tags) when I wanted a bs4.element.ResultSet of length 4 (one for each tr tag with class = "").
Your method actually works for me when I specify the tr tag name:
>>> from bs4 import BeautifulSoup
>>> data = """
... <tbody>
... <tr class data-row="0">
... <td align="right"></td>
... </tr>
... <tr class data-row="1">
... <td align="right"></td>
... </tr>
... <tr class="thead over_theader" data-row="2">
... <td align="right"></td>
... </tr>
... <tr class="thead" data-row="3">
... <td align="right"></td>
... </tr>
... <tr class data-row="4">
... <td align="right"></td>
... </tr>
... <tr class data-row="5">
... <td align="right"></td>
... </tr>
... </tbody>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> len(soup.find_all("tr", class_=""))
4
Alternatively, you can use a tr[class=""] CSS selector:
>>> len(soup.select('tr[class=""]'))
4
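As a quick sanity check that these are the rows the question wants, the data-row attributes of the matches are 0, 1, 4 and 5:
>>> [tr["data-row"] for tr in soup.select('tr[class=""]')]
['0', '1', '4', '5']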
find_all will, by default, search recursively. So the td tags are valid matches.
Docs:
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False
So you might write, for example:
tableBody = soup.findAll('tbody')
rows = tableBody[0].find_all(attrs={"class":""}, recursive=False)
print(len(rows))
for r in rows:
    print('---')
    print(r)
Output:
4
---
<tr class="" data-row="0">
<td align="right"></td>
</tr>
---
<tr class="" data-row="1">
<td align="right"></td>
</tr>
---
<tr class="" data-row="4">
<td align="right"></td>
</tr>
---
<tr class="" data-row="5">
<td align="right"></td>
</tr>
I want to scrape some data prices out of a bunch of html tables. The tables contain all sorts of prices, and of course the table data tags don't contain anything useful.
<div id="item-price-data">
<table>
<tbody>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
<tr>
<td class="some-class">Member Price:</td>
<td class="another-class">$90.00</td>
</tr>
<tr>
<td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td>
</tr>
<tr>
<td class="some-class">You save:</td>
<td class="another-class">$20.00</td>
</tr>
</tbody>
</table>
</div>
The only prices that I care about are those that are paired with an element that has "Normal Price" as its text.
What I'd like to be able to do is scan the table's descendants, find the <td> tag that has that text, then pull the text from its sibling.
The problem I'm having is that in BeautifulSoup the descendants attribute returns a list of NavigableString, not Tag.
So if I do this:
from bs4 import BeautifulSoup
from urllib import request
html = request.urlopen(url)
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'id': 'item-price-data'})
table_data = div.find_all('td')
for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling
        print(price)
I get nothing. Is there an easy way to get the string value back?
You can use the find_next() method; you may also need a bit of regex:
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """<div id="item-price-data">
... <table>
... <tbody>
... <tr>
... <td class="some-class">Normal Price:</td>
... <td class="another-class">$100.00</td>
... </tr>
... <tr>
... <td class="some-class">Member Price:</td>
... <td class="another-class">$90.00</td>
... </tr>
... <tr>
... <td class="some-class">Sale Price:</td>
... <td class="another-class">$80.00</td>
... </tr>
... <tr>
... <td class="some-class">You save:</td>
... <td class="another-class">$20.00</td>
... </tr>
... </tbody>
... </table>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', {'id': 'item-price-data'})
>>> for element in div.find_all('td', text=re.compile('Normal Price')):
...     price = element.find_next('td')
...     print(price)
...
<td class="another-class">$100.00</td>
If you don't want to bring regex into this then the following will work for you.
>>> table_data = div.find_all('td')
>>> for element in table_data:
...     if 'Normal Price' in element.get_text():
...         price = element.find_next('td')
...         print(price)
...
<td class="another-class">$100.00</td>
I have searched a lot about BeautifulSoup, and some have suggested lxml as the future of BeautifulSoup; while that makes sense, I am having a tough time parsing the following table from a whole list of tables on the webpage.
I am interested in the three columns, with a varied number of rows depending on the page and the time it was checked. A BeautifulSoup or lxml solution would be well appreciated; that way I can ask the admin to install lxml on the dev machine.
Desired output:
Website Last Visited Last Loaded
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
...... more if present
Following is a code sample from a messy web page:
<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"</a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
<a href="http://stackoverflow.com"</a>
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>
>>> from lxml import html
>>> table_html = """"
... <table border="2" width="100%">
... <tbody><tr>
... <td width="33%" class="BoldTD">Website</td>
... <td width="33%" class="BoldTD">Last Visited</td>
... <td width="34%" class="BoldTD">Last Loaded</td>
... </tr>
... <tr>
... <td width="33%">
... <a href="http://google.com"</a>
... </td>
... <td width="33%">01/14/2011
... </td>
... <td width="34%">
... </td>
... </tr>
... <tr>
... <td width="33%">
... <a href="http://stackoverflow.com"</a>
... </td>
... <td width="33%">01/10/2011
... </td>
... <td width="34%">
... </td>
... </tr>
... </tbody></table>"""
>>> table = html.fromstring(table_html)
>>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
...         print(column.strip(), end=' ')
...     print()
...
Website Last Visited Last Loaded
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
>>>
Voilà ;)
Of course, instead of printing, you can add your values to nested lists or dicts ;)
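For instance, here is a sketch that collects each data row into a dict keyed by the column names (the keys simply mirror the sample table's header cells):
records = []
for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr')[1:]:
    href = row.xpath('./td[1]/a/@href')
    records.append({
        'Website': href[0] if href else '',
        'Last Visited': ''.join(row.xpath('./td[2]/text()')).strip(),
        'Last Loaded': ''.join(row.xpath('./td[3]/text()')).strip(),
    })
print(records)
# [{'Website': 'http://google.com', 'Last Visited': '01/14/2011', 'Last Loaded': ''}, ...]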
Here's a version that uses ElementTree and the limited XPath it provides:
from xml.etree.ElementTree import ElementTree
doc = ElementTree().parse('table.html')
for t in doc.findall('.//table'):
    # there may be multiple tables, so check we have the right one
    if t.find('./tbody/tr/td').text == 'Website':
        for tr in t.findall('./tbody/tr')[1:]:  # skip the header row
            tds = tr.findall('./td')
            print(tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip())
Results:
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.
from html.parser import HTMLParser  # on Python 2: from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.line = ""
        self.in_tr = False
        self.in_table = False

    def handle_starttag(self, tag, attrs):
        # a new row inside the target table starts a fresh line
        if self.in_table and tag == "tr":
            self.line = ""
            self.in_tr = True
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    self.line += attr[1] + " "

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_tr = False
            if len(self.line):
                print(self.line)
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        # the "Website" header cell marks the table we want
        if data == "Website":
            self.in_table = True
        elif self.in_tr:
            data = data.strip()
            if data:
                self.line += data + " "

if __name__ == '__main__':
    myp = MyParser()
    myp.feed(open('table.html').read())
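Run against the sample table saved as table.html, this prints the same two data rows; the header row is skipped because in_tr is still False when its cells are read:
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011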
Hopefully this addresses everything you need and you can accept this as the answer.
Updated as requested.