Find Text of Sibling Element, Where Original Element Matches Specific String - python

I want to scrape some data prices out of a bunch of html tables. The tables contain all sorts of prices, and of course the table data tags don't contain anything useful.
<div id="item-price-data">
<table>
<tbody>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
<tr>
<td class="some-class">Member Price:</td>
<td class="another-class">$90.00</td>
</tr>
<tr>
<td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td>
</tr>
<tr>
<td class="some-class">You save:</td>
<td class="another-class">$20.00</td>
</tr>
</tbody>
</table>
</div>
The only prices that I care about are those paired with an element that has "Normal Price" as its text.
What I'd like to be able to do is scan the table's descendants, find the <td> tag that has that text, then pull the text from its sibling.
The problem I'm having is that in BeautifulSoup the descendants attribute returns a list of NavigableString, not Tag.
So if I do this:
from bs4 import BeautifulSoup
from urllib import request
html = request.urlopen(url)
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'id': 'item-price-data'})
table_data = div.find_all('td')
for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling
        print(price)
I get nothing. Is there an easy way to get the string value back?

You can use the find_next() method; you may also need a bit of regex:
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """<div id="item-price-data">
... <table>
... <tbody>
... <tr>
... <td class="some-class">Normal Price:</td>
... <td class="another-class">$100.00</td>
... </tr>
... <tr>
... <td class="some-class">Member Price:</td>
... <td class="another-class">$90.00</td>
... </tr>
... <tr>
... <td class="some-class">Sale Price:</td>
... <td class="another-class">$80.00</td>
... </tr>
... <tr>
... <td class="some-class">You save:</td>
... <td class="another-class">$20.00</td>
... </tr>
... </tbody>
... </table>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', {'id': 'item-price-data'})
>>> for element in div.find_all('td', text=re.compile('Normal Price')):
...     price = element.find_next('td')
...     print(price)
...
<td class="another-class">$100.00</td>
If you don't want to bring regex into this then the following will work for you.
>>> table_data = div.find_all('td')
>>> for element in table_data:
... if 'Normal Price' in element.get_text():
... price = element.find_next('td')
... print(price)
...
<td class="another-class">$100.00</td>
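If you only want the price string itself rather than the whole tag, call get_text() on the cell that find_next() returns. A minimal sketch against a trimmed copy of the question's table:

```python
from bs4 import BeautifulSoup

html = """<div id="item-price-data">
<table><tbody>
<tr><td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td></tr>
<tr><td class="some-class">Member Price:</td>
<td class="another-class">$90.00</td></tr>
</tbody></table>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'id': 'item-price-data'})

for cell in div.find_all('td'):
    if 'Normal Price' in cell.get_text():
        # find_next('td') jumps to the following cell; get_text() drops the tag
        price = cell.find_next('td').get_text(strip=True)
        print(price)  # $100.00
```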

Related

How to extract <td> elements with no attributes?

I am a newbie in web scraping and I have been stuck on an issue; I have tried and searched but found nothing. I want to extract data from the table. The problem is that the tr and td elements don't have distinguishing attributes, just classes, and <td class="review-value"> is the same for different values, so I don't know how to separate them. I need lists with every single value, for example:
list_1 = [aircraft1, aircraft2, aircraft3, ..]
list_2 = [type_of_traveller1, type_of_traveller2, ...]
list_3 = [cabin_flown1, cabin_flown2,...]
list_4 = [route1, route2,...]
list_5 = [date_flown1, date_flown2, ..]
This is the table code:
<table class="review-ratings">
<tr>
<td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td>
</tr>
<tr>
<td class="review-rating-header type_of_traveller">Type Of Traveller</td>
<td class="review-value">Couple Leisure</td>
</tr>
<tr>
<td class="review-rating-header cabin_flown">Seat Type</td>
<td class="review-value">Business Class</td>
</tr>
<tr>
<td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td>
</tr>
<tr>
<td class="review-rating-header date_flown">Date Flown</td>
<td class="review-value">February 2023</td>
</tr>
</table>
I am using BeautifulSoup:
page = requests.get(url)
table = soup.find('article')
review_table = table.find_all('table', class_ = 'review- ratings')
find_tr_header = table.find_all('td', class_ = 'review-rating-header')
headers = []
for i in find_tr_header:
    headers.append(i.text.strip())
And I don't know what to do with class="review-value".
As I can see in your table, each field has a .review-value cell following it (a direct sibling).
So what you can do is use the adjacent-sibling selector + in CSS.
For instance, .aircraft + .review-value will give you the value of the aircraft.
In Beautiful Soup you can even avoid this type of selector, since there are built-in methods available for you: check find_next_sibling().
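Both ideas can be sketched on a trimmed copy of the question's table (class names taken from the snippet above; the dict is just one way to collect the values):

```python
from bs4 import BeautifulSoup

html = """<table class="review-ratings">
<tr><td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td></tr>
<tr><td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td></tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')

# CSS adjacent-sibling selector: the .review-value right after .aircraft
aircraft = soup.select_one('.aircraft + .review-value').get_text(strip=True)

# Built-in navigation: find each header cell, then step to its sibling cell
values = {}
for header in soup.find_all('td', class_='review-rating-header'):
    value = header.find_next_sibling('td', class_='review-value')
    values[header.get_text(strip=True)] = value.get_text(strip=True)

print(aircraft)         # Boeing 787-9
print(values['Route'])  # Mexico City to London
```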

How to use Beautiful Soup 4 to find attribute

I'm trying to parse html like the following:
<tbody>
<tr class data-row="0">
<td align="right"></td>
</tr>
<tr class data-row="1">
<td align="right"></td>
</tr>
<tr class="thead over_theader" data-row="2">
<td align="right"></td>
</tr>
<tr class="thead" data-row="3">
<td align="right"></td>
</tr>
<tr class data-row="4">
<td align="right"></td>
</tr>
<tr class data-row="5">
<td align="right"></td>
</tr>
</tbody>
I want to obtain all tr tags (and their children) where class is not specified. For the example above, that means I want the tr tags where data-row is not 2 or 3.
How do I do this using Beautiful Soup 4?
I tried
tableBody = soup.findAll('tbody')
rows = tableBody[0].findAll(attrs={"class":""})
but this returned a type bs4.element.ResultSet of length 8 (i.e. it included the tr children with td tags) when I wanted a bs4.element.ResultSet of length 4 (one for each tr tag with class = "").
Your method actually works for me when I specify the tr tag name:
>>> from bs4 import BeautifulSoup
>>> data = """
... <tbody>
... <tr class data-row="0">
... <td align="right"></td>
... </tr>
... <tr class data-row="1">
... <td align="right"></td>
... </tr>
... <tr class="thead over_theader" data-row="2">
... <td align="right"></td>
... </tr>
... <tr class="thead" data-row="3">
... <td align="right"></td>
... </tr>
... <tr class data-row="4">
... <td align="right"></td>
... </tr>
... <tr class data-row="5">
... <td align="right"></td>
... </tr>
... </tbody>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> len(soup.find_all("tr", class_=""))
4
Alternatively, you can use a tr[class=""] CSS selector:
>>> len(soup.select('tr[class=""]'))
4
find_all will, by default, search recursively. So the td tags are valid matches.
Docs:
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False
So you might write, for example:
tableBody = soup.findAll('tbody')
rows = tableBody[0].find_all(attrs={"class": ""}, recursive=False)
print(len(rows))
for r in rows:
    print('---')
    print(r)
Output:
4
---
<tr class="" data-row="0">
<td align="right"></td>
</tr>
---
<tr class="" data-row="1">
<td align="right"></td>
</tr>
---
<tr class="" data-row="4">
<td align="right"></td>
</tr>
---
<tr class="" data-row="5">
<td align="right"></td>
</tr>

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’, but I am failing. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
neither of them works out.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data)
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print span.next_sibling.strip() # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find span by text; think it is more readable)
span = soup.find('span', text=re.compile('Expiry Date'))
print span.next_sibling.strip() # prints 2016
re.compile() is needed since the text can span multiple lines and have additional whitespace around it. Don't forget to import the re module.
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data)
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print exp_date # 2016
To learn about BeautifulSoup, I recommend you read the documentation.
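Note that the snippets above are Python 2 (print statements). A rough Python 3 sketch of option #2, using string= (the newer name Beautiful Soup gives the text= argument), might look like:

```python
import re
from bs4 import BeautifulSoup

data = """<tr class="odd"><td><table class="zp"><tbody>
<tr><td><span>Expiry Date</span>2016</td></tr>
</tbody></table></td></tr>"""

soup = BeautifulSoup(data, 'html.parser')

# Find the span by its text, then take the text node right after it
span = soup.find('span', string=re.compile('Expiry Date'))
print(span.next_sibling.strip())  # 2016
```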

match numbers in multiple lines

I have an HTML text like this
<tr>
<td><strong>Turnover</strong></td>
<td width="20%" class="currency">£348,191</td>
<td width="20%" class="currency">£856,723</td>
<td width="20%" class="currency">£482,177</td>
</tr>
<tr>
<td> Cost of sales</td>
<td width="20%" class="currency">£275,708</td>
<td width="20%" class="currency">£671,345</td>
<td width="20%" class="currency">£357,587</td>
</tr>
<tr>
There's lots of html before and after it. I'd like to parse the numbers. There can be varying number of td columns, so I'd like to parse all of them. In this case, there are three columns, so the result I'm looking for is:
[348191, 856723, 482177]
Ideally, I'd like to parse the Turnover and Cost of sales data separately into different variables.
You can use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """ <tr>
... <td><strong>Turnover</strong></td>
... <td width="20%" class="currency">£348,191</td>
... <td width="20%" class="currency">£856,723</td>
... <td width="20%" class="currency">£482,177</td>
... </tr>
... <tr>
... <td> Cost of sales</td>
... <td width="20%" class="currency">£275,708</td>
... <td width="20%" class="currency">£671,345</td>
... <td width="20%" class="currency">£357,587</td>
... </tr>"""
>>> soup = BS(html)
>>> for i in soup.find_all('tr'):
...     if i.find('td').text == "Turnover":
...         for x in i.find_all('td', {'class': 'currency'}):
...             print x.text
...
£348,191
£856,723
£482,177
Explanation
First we convert the HTML into a bs4 object that we can easily navigate. find_all, no prizes for guessing what it does, finds all the <tr>s.
We loop through each tr, and if the text of its first <td> is Turnover, we then go through the rest of the <td>s.
We loop through each td with class="currency" and print its content.
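If you want the integers from the desired output ([348191, 856723, 482177]) rather than the raw text, strip the pound sign and commas before converting. A sketch that also keys each row by the label in its first cell, so Turnover and Cost of sales land in separate entries:

```python
from bs4 import BeautifulSoup

html = """<tr>
<td><strong>Turnover</strong></td>
<td width="20%" class="currency">£348,191</td>
<td width="20%" class="currency">£856,723</td>
<td width="20%" class="currency">£482,177</td>
</tr>
<tr>
<td> Cost of sales</td>
<td width="20%" class="currency">£275,708</td>
<td width="20%" class="currency">£671,345</td>
<td width="20%" class="currency">£357,587</td>
</tr>"""

soup = BeautifulSoup(html, 'html.parser')

rows = {}
for tr in soup.find_all('tr'):
    label = tr.find('td').get_text(strip=True)
    # '£348,191' -> 348191
    rows[label] = [int(td.get_text().lstrip('£').replace(',', ''))
                   for td in tr.find_all('td', class_='currency')]

print(rows['Turnover'])  # [348191, 856723, 482177]
```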

Please help parse this html table using BeautifulSoup and lxml the pythonic way

I have searched a lot about BeautifulSoup, and some have suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table from a whole list of tables on the webpage.
I am interested in the three columns, with a varied number of rows depending on the page and the time it was checked. A BeautifulSoup or lxml solution would be well appreciated; that way I can ask the admin to install lxml on the dev machine.
Desired output :
Website Last Visited Last Loaded
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
...... more if present
Following is a code sample from a messy web page :
<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"</a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
<a href="http://stackoverflow.com"</a>
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>
>>> from lxml import html
>>> table_html = """
... <table border="2" width="100%">
... <tbody><tr>
... <td width="33%" class="BoldTD">Website</td>
... <td width="33%" class="BoldTD">Last Visited</td>
... <td width="34%" class="BoldTD">Last Loaded</td>
... </tr>
... <tr>
... <td width="33%">
... <a href="http://google.com"</a>
... </td>
... <td width="33%">01/14/2011
... </td>
... <td width="34%">
... </td>
... </tr>
... <tr>
... <td width="33%">
... <a href="http://stackoverflow.com"</a>
... </td>
... <td width="33%">01/10/2011
... </td>
... <td width="34%">
... </td>
... </tr>
... </tbody></table>"""
>>> table = html.fromstring(table_html)
>>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
...         print column.strip(),
...     print
...
Website Last Visited Last Loaded
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
>>>
Voilà ;)
Of course, instead of printing, you can add your values to nested lists or dicts.
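For instance, here is a sketch that collects each data row into a dict instead of printing (the key names are illustrative, and the fragment below is a cleaned-up copy of the table):

```python
from lxml import html

table_html = """<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%"><a href="http://google.com"></a></td>
<td width="33%">01/14/2011</td>
<td width="34%"></td>
</tr>
<tr>
<td width="33%"><a href="http://stackoverflow.com"></a></td>
<td width="33%">01/10/2011</td>
<td width="34%"></td>
</tr>
</tbody></table>"""

table = html.fromstring(table_html)
records = []
for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr')[1:]:  # skip header
    href = row.xpath('./td[1]/a/@href')
    visited = row.xpath('normalize-space(./td[2])')
    loaded = row.xpath('normalize-space(./td[3])')
    records.append({'website': href[0] if href else None,
                    'last_visited': visited or None,
                    'last_loaded': loaded or None})

print(records[0]['website'])  # http://google.com
```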
Here's a version that uses elementtree and the limited XPath it provides:
from xml.etree.ElementTree import ElementTree

doc = ElementTree().parse('table.html')
for t in doc.findall('.//table'):
    # there may be multiple tables, check we have the right one
    if t.find('./tbody/tr/td').text == 'Website':
        for tr in t.findall('./tbody/tr/')[1:]:  # skip the header row
            tds = tr.findall('./td')
            print tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip()
Results:
http://google.com 01/14/2011
http://stackoverflow.com 01/10/2011
Here's a version that uses HTMLParser. I tried against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.
from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.line = ""
        self.in_tr = False
        self.in_table = False

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == "tr":
            self.line = ""
            self.in_tr = True
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    self.line += attr[1] + " "

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_tr = False
            if len(self.line):
                print self.line
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if data == "Website":
            self.in_table = True
        elif self.in_tr:
            data = data.strip()
            if data:
                self.line += data.strip() + " "

if __name__ == '__main__':
    myp = MyParser()
    myp.feed(open('table.html').read())
Hopefully this addresses everything you need and you can accept this as the answer.
Updated as requested.
