I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables do not make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find the tr it belongs to, and set focus on it so that I can dynamically insert a new table row directly above it.
I've tried something like this to capture the enclosing table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
This doesn't seem robust, though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
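If the goal is a real table row rather than a text placeholder, the same idea can be sketched with find_parent() and new_tag(). This is a minimal sketch and not part of the original answer: the cell text "Inserted row" is made up for illustration, and string= is used here as the newer name for the text= argument.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the text node itself, then climb to the nearest enclosing <tr>.
text_node = soup.find(string="Sample Text")
target_row = text_node.find_parent("tr")

# Build a new row (placeholder content, chosen only for this example).
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "Inserted row"
new_row.append(new_cell)

# Place the new row directly above the row that holds the text.
target_row.insert_before(new_row)

rows = soup.find_all("tr")
print(soup.prettify())
```

find_parent("tr") only walks up through ancestors, so it returns None instead of an unrelated earlier row when the text is not inside any tr.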
Related
I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
So I have used this code, but I am getting the text from the first tr, which has no class, and I need to ignore it:
soup.findAll('tr').text
Also, when I try to match on just the class, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output:
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
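Since the question also asks about matching rows "by class", here is an alternative sketch that keeps only the tr tags carrying a class attribute, whatever its value. It assumes the unwanted row is exactly the one without a class, as in the sample; the inline html string is a condensed copy of the question's markup.

```python
from bs4 import BeautifulSoup

html = """
<div class='product'>
<table>
<tr>random stuff here</tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
<tr class='line3'><td class='row'><span>EVEN MORE TEXT I NEED</span></td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_=True matches any <tr> that has a class attribute at all,
# so the class-less first row is skipped.
texts = [tr.get_text(strip=True) for tr in soup.find_all("tr", class_=True)]

for text in texts:
    print(text)
```

Note that find_all() returns a list, which is why calling .text on its result directly fails; the text has to be read from each tag in turn.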
I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration value from the tbody of the table with class "table".
The HTML is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the tags that match the selector. The one you want is the second, so take index 1 and read its text.
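Relying on position 1 breaks if the column order changes, so one option is to pair each header with its cell and look the value up by name. The following is a sketch under the assumption that the table has one header row and that the first body row is the one of interest; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("div.table-responsive table")

# Column names from the header row.
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Cells of the first body row, with surrounding whitespace removed.
first_row = table.select_one("tbody tr")
cells = [td.get_text(" ", strip=True) for td in first_row.find_all("td")]

# Map each header to the cell in the same column.
row = dict(zip(headers, cells))
print(row["Duration"])
```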
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code I use the lxml library to parse the data. If you don't have it, install it with pip install lxml, or swap in the parser you prefer in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code prints the first table in the page.
Take care
I'm currently learning how to use Selenium in Python. I have a table and want to retrieve elements from it, but I'm running into some trouble.
<table class="table" id="SearchTable">
<thead>..</thead>
<tfoot>..</tfoot>
<tbody>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here</b>
</a>
</td>
</tr>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here as well</b>
</a>
</td>
</tr>
</table>
Inside this table, I want to access the text in the bold tag, but my program isn't returning the correct number of tr elements. In fact, I'm not even sure it's searching for the right thing.
I have traced my problem back from the final text and found that the errors start at the line marked with a comment. (I think the code after it is wrong as well, but I'm focusing on getting the correct table rows first.)
My code is:
search_table = driver.find_element_by_id("SearchTable")
search_table_body = search_table.find_element(By.TAG_NAME, "tbody")
trs = search_table_body.find_elements(By.TAG_NAME, "tr")
print(trs)  # this does not return the correct number of tr
for tr in trs:
    tds = tr.find_elements(By.TAG_NAME, "td")
    for td in tds:
        href = td.find_element_by_class_name("qtooltip")
        print(href.get_attribute("innerHtml"))
I need the correct tr count so that I can return the text in the anchor tag, but I am stuck. Any help is appreciated. Thanks!
You can get all <b> tags that are children of an <a> tag with the class attribute qtooltip inside a table using a single XPath selector:
//table/descendant::a[@class='qtooltip']/b
Example code:
elements = driver.find_elements_by_xpath("//table/descendant::a[@class='qtooltip']/b")
for element in elements:
    print(element.text)
References:
XPath Tutorial
XPath Axes
XPath Operators & Functions
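Selenium needs a running browser, so a quick way to sanity-check the XPath expression itself is to run it against the sample markup offline. The sketch below swaps in lxml purely for that check; it is not what the answer uses, and the html string is a condensed copy of the question's table.

```python
import lxml.html

html = """
<table class="table" id="SearchTable">
  <tbody>
    <tr>
      <td class="icon">..</td>
      <td class="title"><a class="qtooltip"><b>I want to get the text here</b></a></td>
    </tr>
    <tr>
      <td class="icon">..</td>
      <td class="title"><a class="qtooltip"><b>I want to get the text here as well</b></a></td>
    </tr>
  </tbody>
</table>
"""

doc = lxml.html.fromstring(html)

# Same expression as in the Selenium call: every <b> directly inside
# an <a class="qtooltip"> that sits somewhere below a <table>.
xpath = "//table/descendant::a[@class='qtooltip']/b"
texts = [b.text_content().strip() for b in doc.xpath(xpath)]

print(texts)
```

One <b> is matched per table row, so the length of the result also gives the row count the question was after.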
I'm running into an issue when trying to get the parent node of a tr element whilst iterating through them all.
Here's a basic table that I'm working with.
<table border=1>
<tbody>
<tr>
<td>
<p>Some text</p>
</td>
<td>
<p>Some more text</p>
</td>
</tr>
<tr>
<td>
<p> Some more text</p>
</td>
<td>
<p> Some more text</p>
</td>
</tr>
<tr>
<td>
<p> Some more text</p>
</td>
<td>
<p> Some more text</p>
</td>
</tr>
</tbody>
</table>
And here's my Python script to get the parent node using lxml
import lxml.html
htm = lxml.html.parse('plaintable.htm')
tr = htm.xpath('//tr')
for x in tr:
    tbody = tr.getparent()
    if tbody.index(tr) == 1:
        print ('Success!')
print ('Finished')
I'm getting this error when I run the script:
AttributeError: 'list' object has no attribute 'getparent'
I'm quite new to Python, so it could be something simple I'm messing up. I read through the lxml documentation and couldn't find an answer.
Any help would be great!
tr is actually a list of xpath matches. x corresponds to individual tr elements - call getparent() method on it instead:
tr = htm.xpath('//tr')
for x in tr:
    tbody = x.getparent()
    # ...
Though I don't see much sense in getting the same parent over and over again in a loop if you have a single table and tbody element. Why not locate it beforehand:
tbody = htm.xpath("//tbody")[0]
for x in tbody.xpath(".//tr"):
    # ...
"I need to find the first tr in every table to build it properly"
As for this, I would iterate over all table elements and find the first tr element in each:
tables = htm.xpath("//table")
for table in tables:
    first_tr = table.xpath(".//tr")[0]
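To make the two points above concrete, here is a self-contained sketch that collects the first row of every table and also shows the corrected getparent()/index() check from the question. The two-table html string is invented for the example, and it parses from a string rather than from plaintable.htm.

```python
import lxml.html

html = """
<html><body>
<table>
  <tbody>
    <tr><td>A1</td></tr>
    <tr><td>A2</td></tr>
  </tbody>
</table>
<table>
  <tr><td>B1</td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(html)

# First <tr> of each table, wherever it sits below the table element.
first_rows = []
for table in doc.xpath("//table"):
    rows = table.xpath(".//tr")
    if rows:  # guard against tables without any rows
        first_rows.append(rows[0].text_content().strip())

# Corrected version of the original check: call getparent() on each
# element, not on the list, and keep rows at index 1 of their parent.
second_in_parent = [
    tr.text_content().strip()
    for tr in doc.xpath("//tr")
    if tr.getparent().index(tr) == 1
]

print(first_rows, second_in_parent)
```

The guard on rows avoids an IndexError for a table that contains no tr at all, which the bare [0] lookup would raise.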
Similar to .renderContents here, I want to search by that value: Beautiful Soup [Python] and the extracting of text in a table
Sample HTML:
<table>
<tr>
<td>
This is garbage
</td>
<td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old" border="0" title="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>
This is garbage
</td>
</tr>
</table>
What I tried:
soup.find_all("td", text = re.compile('(AM|PM)'))[0].get_text().strip()
However, the text parameter of find_all doesn't seem to work for this case; I get IndexError: list index out of range
What do I need to do?
Don't specify the tag name at all and let it find the desired text node. Works for me:
soup.find(text=re.compile('(AM|PM)')).strip()
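The reason the original attempt returns nothing is that, when a tag name is given together with a text filter, Beautiful Soup compares the pattern against the tag's .string, and .string is None for a td that has several children (comments, an a tag and the date text). The sketch below shows both cases on a tidied copy of the sample; it uses string=, the newer name for text=, and the word-boundary pattern is a small tightening chosen for this example.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td>This is garbage</td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old" border="0" title="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>This is garbage</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\b(AM|PM)\b")

# With a tag name, the pattern is tested against td.string, which is
# None for the multi-child cell, so no tag matches.
tag_matches = soup.find_all("td", string=pattern)

# Without a tag name, the search runs over the text nodes themselves.
node = soup.find(string=pattern)
stamp = node.strip()

# The enclosing cell is still reachable from the text node if needed.
cell = node.find_parent("td")

print(tag_matches, stamp, cell["class"])
```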