Use beautiful soup to find elements by textual contents, not text?


Similar to the use of .renderContents in the question "Beautiful Soup [Python] and the extracting of text in a table", I want to search for an element by that rendered text value.
Sample HTML:
<table>
<tr>
<td>
This is garbage
</td>
<td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old" border="0" title="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>
This is garbage
</td>
</tr>
</table>
What I tried:
soup.find_all("td", text = re.compile('(AM|PM)'))[0].get_text().strip()
However, the text parameter of find_all seems to not work for this application: IndexError: list index out of range
What do I need to do?

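Don't specify the tag name at all and let it find the desired text node. When a tag name is given, the text filter is compared against the tag's .string, which is None for a <td> that has several children (here two comments, an <a>, and the date text), so nothing matches. In Beautiful Soup 4.4.0 and later the same argument is also available as string=. Works for me:
soup.find(text=re.compile('(AM|PM)')).strip()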
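A minimal sketch of the difference, assuming Beautiful Soup 4.4.0 or later (where the filter is spelled string=) and a trimmed copy of the sample HTML. The find_parent() call is not part of the answer; it is shown only as one way to get from the text node back to its enclosing cell.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question.
html = """
<table><tr>
<td>This is garbage</td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>This is garbage</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\b(AM|PM)\b")

# With a tag name, the filter is tested against td.string, which is None
# for a cell with several children, so the result is empty.
by_tag = soup.find_all("td", string=pattern)

# Without a tag name, the filter is tested against every text node.
node = soup.find(string=pattern)
date_text = node.strip()

# Climb back up to the enclosing cell if the tag itself is needed.
cell = node.find_parent("td")

print(by_tag, date_text, cell["class"])
```

Indexing by_tag[0] on the empty list is what produces the IndexError in the question.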

Related

insert element above specific table row beautifulsoup python

I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables do not make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find that tr in which it belongs and set focus to it so that I can dynamically insert a new table row directly above it.
I've tried something like this to capture the enclosing table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?

Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all() returns every matching <span>, so the loop visits each match once
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
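The answer inserts a plain string to keep the example short. A hedged sketch of inserting an actual row instead, assuming Beautiful Soup 4.4.0 or later; the cell text "Inserted row" is a made-up placeholder, and find_parent("tr") is used here as an alternative to find_previous("tr") because it only looks at ancestors of the match:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td><p><span>Sample Text</span></p></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Locate the <span> by its text, then climb to the row that contains it.
span = soup.find("span", string="Sample Text")
row = span.find_parent("tr")

# Build a new <tr><td>...</td></tr> (placeholder content) and put it above.
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "Inserted row"
new_row.append(new_cell)
row.insert_before(new_row)

rows = [tr.get_text(strip=True) for tr in soup.find_all("tr")]
print(rows)
```

This avoids counting .parent hops by hand, so it keeps working if the nesting between the text and the row changes.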

BeautifulSoup how to only return class objects

I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
I have used the code below, but it also returns the text from the first tr, which has no class, and I need to ignore it:
soup.findAll('tr').text
Also, when I try to do just a class, this doesn't seem to be valid python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.

To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
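If the real requirement is "rows that have a class attribute" rather than "every row except the first", a possible alternative is to filter on the attribute's presence. This is a sketch, not the answer's method; it relies on Beautiful Soup treating class_=True as "the attribute exists", and the html string is a condensed copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Condensed copy of the question's markup.
html = """
<div class='product'><table>
<tr>random stuff here</tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
<tr class='line3'><td class='row'><span>EVEN MORE TEXT I NEED</span></td></tr>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_=True keeps only <tr> tags that carry a class attribute at all.
rows_with_class = soup.find_all("tr", class_=True)
texts = [tr.get_text(strip=True) for tr in rows_with_class]
print(texts)
```

Note that find_all() returns a list, so .text has to be read from each element rather than from the result itself.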

Python xPath looping for TR table rows but object empty

We are running the code below and it is driving me crazy.
We capture a data table in table; this works.
We then grab every th and its text in sizes; this works.
Then we want to grab all the underlying rows (tr) and loop over the columns in each row; this does not work. The color_rows object is always empty, yet the same XPath works when tested in the browser. Why?
My question is: how can I grab the tbody/tr rows?
Expected flow
loop over TR's
Access, TR 1 by 1, get 1st TD
Get all TD's data that have class form-control
table = response.xpath('//div[@class="content"]//table[contains(@class,"table")]')
sizes = table.xpath('./thead//th/text()').getall()[1:] #works!
color_rows = table.xpath('./tbody/tr') #does not work! object empty
for color_row in color_rows:
    color = color_row.xpath('/td[1]/b/text()').get().strip()
    print(color)
    stocks = color_row.xpath('/td/div[input[@class="form-control"]]/div//text()').getall()
    for size, stock in zip(sizes, stocks)
Our html data looks like this
<table class="table">
<thead>
<tr>
<th id="ctl00_cphCEShop_colColore" class="text-left" colspan="2">Colore</th>
<th>S</th>
<th>M</th>
<th>L</th>
</tr>
</thead>
<tbody>
<tr>
<td id="x">
<b>White</b>
<input type="hidden" name="data" value="3230/201">
</td>
<td id="avail">
Avail:
</td>
<td id="1">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">179</div>
</div>
</td>
<td id="2">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">360</div>
</div>
</td>
etc etc

Apparently tbody tags are often omitted in the HTML source but added by the browser when it builds the DOM.
In this case there was no real tbody tag in the response, so the XPath ./tbody/tr matched nothing.
Hence the trouble with XPath when you assume, from the browser's inspector, that the tbody tag is there.
See also: Why do browsers insert tbody element into table elements?
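A small sketch of the effect, using the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector (an assumption made so the example runs without a crawl). The markup is a made-up, well-formed reduction of the table as the server might send it, without tbody.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw response: rows sit directly under <table>, no <tbody>.
raw = """
<table class="table">
  <thead><tr><th>Colore</th><th>S</th><th>M</th></tr></thead>
  <tr><td><b>White</b></td><td>179</td><td>360</td></tr>
  <tr><td><b>Black</b></td><td>12</td><td>0</td></tr>
</table>
"""
table = ET.fromstring(raw.strip())

# The path copied from the browser inspector finds nothing here.
via_tbody = table.findall("./tbody/tr")

# Matching rows at any depth that contain a <td> works with or without tbody.
data_rows = table.findall(".//tr[td]")
colors = [row.find("./td/b").text for row in data_rows]

print(len(via_tbody), colors)
```

In the Scrapy code the analogous change would be to select rows with a path that does not depend on tbody, for example './/tr[td]'.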

How to parse HTML file in .TXT format (un-tabbed) in Python?

I have encountered a problem in my programming that has me stumped.
I'm trying to access data stored in a wealth of old HTML-formatted-saved-as-text files. However, when the files were saved, the HTML lost its indentation, tabs, hierarchy, whatever you wish to call it. An example of this can be found below.
......
<tr class="ro">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_RevenueFromContractWithCustomerExcludingAssessedTax', window );">Net sales</a></td>
<td class="nump">$ 123,897<span></span>
</td>
<td class="nump">$ 122,136<span></span>
</td>
<td class="nump">$ 372,586<span></span>
</td>
<td class="nump">$ 360,611<span></span>
</td>
</tr>
<tr class="re">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherIncome', window );">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
<td class="nump">3,026<span></span>
</td>
<td class="nump">3,465<span></span>
</td>
</tr>
<tr class="rou">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Revenues', window );">Total revenues</a></td>
<td class="nump">124,894<span></span>
</td>
<td class="nump">123,179<span></span>
</td>
<td class="nump">375,612<span></span>
</td>
<td class="nump">364,076<span></span>
</td>
</tr>
I typically would employ Beautiful Soup here and get to work parsing the data that way, but I've not found a good workflow since technically there is no hierarchy here; I can't tell BS to look within something other than the document itself, which is huge and might be way too time consuming (see next statement).
I also need to find a thorough solution and not a quick-fix because I have hundreds, if not thousands, of these same HTML-to-text files to parse.
So my question here is, if I want to return, in all the files, the first number for "Membership and other Income" (997 in this case), how could I go about doing that?
Two samples files can be found here:
(https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt) (https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt)
EDIT - 4/16
Thanks for the replies everyone! I've written some code that returns the tags I'm looking for.
import requests
from bs4 import BeautifulSoup
data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
# load the data
soup = BeautifulSoup(data.text, 'html.parser')
# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)
The problem is there are a TON of returns and most contain nothing of use. Is there a way to filter based on these tags' grandparent? I've tried the same approach as above using head, title, body, etc. but I can't quite get BS to identify the FILENAME..
<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
**<FILENAME>R2.htm**
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text"> <span></span>
</td>
.....removed for brevity
</tr>

Just so you are aware, HTML does not care about indentation. If you really wanted to, it could all be on the same line with no spaces in between. An HTML parser will just look at the structure of the tags.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all is a method, so it is called with parentheses
soup.find_all('<tag you are looking for>')[0]
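For the specific value asked about, one possible approach is to find the row label first and then read the first numeric cell in the same row. The sketch below assumes Beautiful Soup 4.4.0 or later and uses a shortened copy of the question's markup as html_doc; whether every filing uses the same label text and the nump class is an assumption that would need checking against the real files.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the un-indented markup from the question.
html_doc = """<tr class="re">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Find the link whose text is the row label (case-insensitive).
label = soup.find("a", string=re.compile(r"membership and other income", re.I))

first_value = None
if label is not None:
    # Climb to the enclosing row, then take its first numeric cell.
    row = label.find_parent("tr")
    first_value = row.find("td", class_="nump").get_text(strip=True)

print(first_value)
```

Wrapping this in a function and calling it once per file would cover the "hundreds of files" case, with None signalling that the label was not found.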

xPath: Difficulties matching expression with actual source code

From this Deutsche Börse web page, under the table header Issuer I want to get the string content 'db X-trackers' in the cell next to the one with Name in it.
Using my web browser, I inspect that table area and get the code, which I've pasted into this XML tree just so that I can test my xPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my xPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
    try:
        find_value = driver.find_element_by_xpath(my_xpath).text
    except NoSuchElementException:
        return None
    else:
        value = re.search(r"(.+)", find_value).group()
        return value
I don't believe anything is wrong in the function itself, so either the xPath must be faulty or there is something in the actual web page source code that throws it off.
When studying the actual Source code in Chrome, it looks a bit messier than what I see with Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my xPath (or function) wrong?

I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element, then get the following table element. In the table element we look for the td element with Name text and then get the following td sibling.
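One detail worth checking against the real page source: the cell text there is surrounded by line breaks and spaces, and . = "Name" compares the whole string value, whitespace included. The sketch below uses lxml instead of Selenium (an assumption, so it can run offline) on a reduced copy of the page source, and swaps in normalize-space() for the comparison.

```python
from lxml import html as lxml_html

# Reduced copy of the page source, keeping the whitespace around cell text.
page = """
<div class="box">
  <div class="row"><div class="col-lg-12"><h2>Issuer</h2></div></div>
  <div class="table-responsive">
    <table class="table"><tbody>
      <tr>
        <td >
          Name
        </td>
        <td class="text-right" >
          db X-trackers
        </td>
      </tr>
    </tbody></table>
  </div>
</div>
"""
root = lxml_html.fromstring(page)

# Exact comparison fails because of the surrounding whitespace.
strict = root.xpath('.//td[. = "Name"]')

# normalize-space() trims and collapses whitespace before comparing.
xpath = (
    './/h2[normalize-space(.)="Issuer"]/following::table[1]'
    '//td[normalize-space(.)="Name"]/following-sibling::td[1]'
)
cells = root.xpath(xpath)
value = cells[0].text_content().strip() if cells else None

print(len(strict), value)
```

The expression selects the td element rather than ending in /text(), since an element lookup such as Selenium's expects an element and the text can be read from it afterwards.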
