I'm running into an issue when trying to get the parent node of a tr element whilst iterating through them all.
Here's a basic table that I'm working with.
<table border=1>
<tbody>
<tr>
<td>
<p>Some text</p>
</td>
<td>
<p>Some more text</p>
</td>
</tr>
<tr>
<td>
<p> Some more text</p>
</td>
<td>
<p> Some more text</p>
</td>
</tr>
<tr>
<td>
<p> Some more text</p>
</td>
<td>
<p> Some more text</p>
</td>
</tr>
</tbody>
</table>
And here's my Python script to get the parent node using lxml
import lxml.html

htm = lxml.html.parse('plaintable.htm')
tr = htm.xpath('//tr')
for x in tr:
    tbody = tr.getparent()
    if tbody.index(tr) == 1:
        print('Success!')
print('Finished')
I'm getting this error when I run the script:
AttributeError: 'list' object has no attribute 'getparent'
I'm quite new to Python, so it could be something simple I'm messing up. I read through the lxml documentation and couldn't find an answer.
Any help would be great!
tr is actually a list of XPath matches. x corresponds to an individual tr element - call the getparent() method on it instead:
tr = htm.xpath('//tr')
for x in tr:
    tbody = x.getparent()
    # ...
Though, I don't see much sense in getting the same parent over and over again in a loop if you have a single table and tbody element. Why not locate it beforehand:
tbody = htm.xpath("//tbody")[0]
for x in tbody.xpath(".//tr"):
    # ...
I need to find the first tr in every table to build it properly
As for this - I would iterate over all table elements and find the first tr element:
tables = htm.xpath("//table")
for table in tables:
    first_tr = table.xpath(".//tr")[0]
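For completeness, here is a minimal sketch tying this back to the original getparent()/index() idea, assuming the same plaintable.htm as above; note that index() is zero-based, so the first row sits at index 0:

import lxml.html

htm = lxml.html.parse('plaintable.htm')
for table in htm.xpath('//table'):
    first_tr = table.xpath('.//tr')[0]
    parent = first_tr.getparent()       # the tbody (or the table itself if there is no tbody)
    if parent.index(first_tr) == 0:     # index() is zero-based, so the first row is 0
        print('Success!')
print('Finished')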
Related
I tried to get all the tbody tags in a table and click on each one.
When I iterate using the command below, I find them all and store them in a list, then I iterate over the list and click on each element. When I try to click I get the error: stale element reference: element is not attached to the page document
WebDriverWait(self._browser, 15).until(
    EC.visibility_of_all_elements_located((By.CLASS_NAME, 'exactMatch')))
I think that while I store all the found tbody tags in the list, the page might be updated, so the references in the list are no longer valid. I therefore tried to click on each tbody tag as soon as I find it by index. But my code doesn't work:
numberOfRows = len(WebDriverWait(self._browser, 15).until(
    EC.visibility_of_all_elements_located((By.CLASS_NAME, 'exactMatch'))))
for iRow in range(numberOfRows):
    currentRow = WebDriverWait(self._browser, 5).until(EC.presence_of_element_located(
        (By.XPATH, "(//table[@id='searchResultsTable']/tbody)[" + str(iRow + 1) + "]")))
    currentRow.click()
Example of HTML:
<table class="searchResultsTable" ng-class="{searchIsStopped: ctrl.isSearchStopped()}" cellspacing="0">
<tbody class="resultItem exactMatch">
<tr id="LS4KvUFe" class="qa-match-row resultSummary unread">
<td class="new"> </td> </tr>
</tbody>
<tbody class="resultItem exactMatch">
<tr id="LS4KvUFd" class="qa-match-row resultSummary unread">
<td class="new"> </td> </tr>
</tbody>
</table>
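Re-finding the row by its index on every pass, as attempted above, is a sound way to dodge the stale reference. A minimal sketch, assuming driver stands for self._browser, and noting that in the HTML shown searchResultsTable is a class rather than an id:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

rows = WebDriverWait(driver, 15).until(
    EC.visibility_of_all_elements_located((By.CLASS_NAME, 'exactMatch')))
number_of_rows = len(rows)

for i in range(number_of_rows):
    # Re-locate the i-th result row just before clicking so the reference is never stale
    row_xpath = ("(//table[contains(@class, 'searchResultsTable')]"
                 "/tbody[contains(@class, 'exactMatch')])[" + str(i + 1) + "]")
    current_row = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, row_xpath)))
    current_row.click()
    # If the click navigates away or redraws the table, go back / re-wait here
    # before the next iteration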
I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables don't make use of names or ids and cannot be modified.
My goal is to locate "Sample Text", find the tr it belongs to, and get a handle on it so that I can dynamically insert a new table row directly above it.
I've tried something like the following in order to capture the enclosing table row:
for elm in index(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", text="Sample Text"):
    tag.find_previous("tr").insert_before("MY NEW TAG")
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
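If you would rather anchor on the row that contains the text than on the previous <tr> in document order, find_parent() is an alternative. A minimal sketch that inserts a real <tr> built with new_tag() rather than a bare string (the cell content here is just a placeholder):

row = soup.find("span", text="Sample Text").find_parent("tr")

new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "MY NEW ROW"   # placeholder content for the inserted cell
new_row.append(new_cell)

row.insert_before(new_row)
print(soup.prettify())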
Hi, we are running this code and it is driving me crazy.
We capture a data table in table: this works.
Then we grab all the th elements and their text in sizes: this works.
Then we want to grab all the underlying rows in tr, and afterwards loop over the columns in each row: this does not work! The color_rows object is always empty, yet when testing the XPath in the browser it does work. Why? How?
My question is: how can I grab the tbody/tr elements?
Expected flow:
Loop over the tr elements.
Access each tr one by one and get the first td.
Get the data of all the td elements that have the class form-control.
table = response.xpath('//div[@class="content"]//table[contains(@class,"table")]')
sizes = table.xpath('./thead//th/text()').getall()[1:]  # works!
color_rows = table.xpath('./tbody/tr')  # does not work! object empty
for color_row in color_rows:
    color = color_row.xpath('/td[1]/b/text()').get().strip()
    print(color)
    stocks = color_row.xpath('/td/div[input[@class="form-control"]]/div//text()').getall()
    for size, stock in zip(sizes, stocks):
Our HTML data looks like this:
<table class="table">
<thead>
<tr>
<th id="ctl00_cphCEShop_colColore" class="text-left" colspan="2">Colore</th>
<th>S</th>
<th>M</th>
<th>L</th>
</tr>
</thead>
<tbody>
<tr>
<td id="x">
<b>White</b>
<input type="hidden" name="data" value="3230/201">
</td>
<td id="avail">
Avail:
</td>
<td id="1">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">179</div>
</div>
</td>
<td id="2">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">360</div>
</div>
</td>
etc etc
Apparently tbody tags are often omitted in HTML but added by the browser.
In this case there was no (real) tbody tag in the source, which made the XPath miss!
Hence the trouble with the XPath (if you really think the tbody tag is there):
Why do browsers insert tbody element into table elements?
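Given that, one way to make the row selector robust is to match the rows whether or not a tbody is actually present in the raw markup. A sketch against the snippet above (inside the same parse callback, with the per-row td paths made relative as well):

table = response.xpath('//div[@class="content"]//table[contains(@class,"table")]')
sizes = table.xpath('./thead//th/text()').getall()[1:]

# Matches the data rows with or without an intervening <tbody> in the raw source
color_rows = table.xpath('./tbody/tr | ./tr[td]')

for color_row in color_rows:
    color = color_row.xpath('./td[1]/b/text()').get().strip()
    stocks = color_row.xpath(
        './td/div[input[@class="form-control"]]/div//text()').getall()
    for size, stock in zip(sizes, stocks):
        print(color, size, stock.strip())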
I have the following XML portion:
<table>
<tr>
<td>Hello</td>
<td>Hello</td>
<td>
<p>Hello already in P</p>
</td>
<td>
This one has some naked text
<span>and some span wrapped text</span>
</td>
</tr>
</table>
I would like to wrap (in a p tag) the contents of each cell that is not already wrapped in a p tag. So that the output is:
<table>
<tr>
<td><p>Hello</p></td>
<td><p>Hello</p></td>
<td>
<p>Hello already in P</p>
</td>
<td>
<p>
This one has some naked text
<span>and some span wrapped text</span>
</p>
</td>
</tr>
</table>
I'm using lxml etree in my project but the library doesn't seem to have a "wrap" method or something similar.
Now I'm thinking maybe this is a job for XSLT transformations but I'd like to avoid adding another layer of complexity + other dependencies in my Python project.
The content of the td elements can be of any depth.
I don't use the lxml package myself, but try the following:
from lxml import etree

def wrap(root):
    # Find <td> elements that do not have a <p> child
    cells = etree.XPath("//td[not(p)]")(root)
    for cell in cells:
        # Create a new <p> element
        e = etree.Element("p")
        # Set the <p> element's text from the parent cell
        e.text = cell.text
        # Clear the cell's text because it is now in the <p> element
        cell.text = None
        # Move the cell's children and make them the <p> element's children
        # (because the span in the last cell of the input should be nested)
        for child in cell.getchildren():
            # This actually moves the child from the <td> element to the <p> element
            e.append(child)
        # Set the new <p> element as the cell's only child
        cell.append(e)
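A minimal usage sketch, assuming the table from the question is held in a string (only two cells shown here):

from lxml import etree

xml = """<table>
  <tr>
    <td>Hello</td>
    <td>
      This one has some naked text
      <span>and some span wrapped text</span>
    </td>
  </tr>
</table>"""

root = etree.fromstring(xml)
wrap(root)
print(etree.tostring(root, pretty_print=True).decode())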
I have this HTML code:
<table>
<tr>
<td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<tr>
<td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
I want to find all the tr elements in the first table.
I usually use
for tr in soup.findAll('tr'):
but then I will get all the tr elements (those in the main table and those in the sub-tables). How do I get only the tr elements in the main table?
How about this?
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<table>
<tr>
<td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<tr>
<td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
""")
for tr in soup.find('table').find_all('tr', recursive=False):
    print(tr)
recursive=False limits find_all() to direct children, so only the outer table's own rows are matched (see the docs).
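For comparison, the same direct-children-only idea can be written as a child step in XPath with lxml, which (unlike a browser) doesn't add tbody elements here. A sketch, assuming the markup above is stored in a string named html:

import lxml.html

doc = lxml.html.fromstring(html)
# '(//table)[1]/tr' keeps only the tr elements that are direct children of the first table
for tr in doc.xpath('(//table)[1]/tr'):
    print(lxml.html.tostring(tr, encoding='unicode'))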