I have a complex html DOM tree of the following nature:
<table>
...
<tr>
<td>
...
</td>
<td>
<table>
<tr>
<td>
<!-- inner most table -->
<table>
...
</table>
<h2>This is hell!</h2>
</td>
</tr>
</table>
</td>
</tr>
</table>
I have some logic to find the innermost table. But after having found it, I need to get its next sibling element (the h2). Is there any way to do this?
If tag is the innermost table, then
tag.findNextSibling('h2')
will be
<h2>This is hell!</h2>
To literally get the next sibling, you could use tag.nextSibling, which in this case is u'\n'.
If you want the next sibling that is not a NavigableString (such as u'\n'), then you could use
tag.findNextSibling(text=None)
If you want the second sibling (no matter what it is), you could use
tag.nextSibling.nextSibling
(but note that if tag does not have a next sibling, then tag.nextSibling will be None, and tag.nextSibling.nextSibling will raise an AttributeError.)
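For concreteness, here is a minimal runnable sketch of the above, assuming html holds the markup from the question (BeautifulSoup 4 also spells these methods find_next_sibling and next_sibling; the old camelCase names remain as aliases). The one-line innermost-table search is only an illustration, since the question's own logic for finding it isn't shown:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# One simple way to get the innermost table: the <table> containing no other <table>.
tag = next(t for t in soup.find_all("table") if t.find("table") is None)
print(tag.find_next_sibling("h2"))  # <h2>This is hell!</h2>
print(repr(tag.next_sibling))       # '\n' -- the literal next sibling is whitespace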
Every tag object has a nextSibling attribute that's exactly what you're looking for -- the next sibling (or None for a tag that's the last child of its parent tag, of course).
I'm working with BeautifulSoup 4 and want to find a specific table row and insert a row element above it.
Take the html as a sample:
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
There are many more tables in the document, but this is a typical structure. The tables make no use of names or ids, and the markup cannot be modified.
My goal is to locate "Sample Text", find the tr it belongs to, and get a handle on that row so that I can dynamically insert a new table row directly above it.
I've tried something like the following in order to capture the top-level table row:
for elm in soup.find_all(text='Sample Text'):
    elm.parent.parent.parent.parent
Doesn't seem robust though. Any suggestions for a cleaner approach?
Locate the text "Sample Text" using the text= argument.
Find the previous <tr> using find_previous().
Use insert_before() to add a new element to the soup.
from bs4 import BeautifulSoup
html = """
<table>
<tr>
<td>
<p>
<span>Sample Text</span>
</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", text="Sample Text"):
tag.find_previous("tr").insert_before("MY NEW TAG")
print(soup.prettify())
Output:
<table>
MY NEW TAG
<tr>
<td>
<p>
<span>
Sample Text
</span>
</p>
</td>
</tr>
</table>
I have the following XML portion:
<table>
<tr>
<td>Hello</td>
<td>Hello</td>
<td>
<p>Hello already in P</p>
</td>
<td>
This one has some naked text
<span>and some span wrapped text</span>
</td>
</tr>
</table>
I would like to wrap the contents of each cell in a p tag, unless they are already wrapped in one, so that the output is:
<table>
<tr>
<td><p>Hello</p></td>
<td><p>Hello</p></td>
<td>
<p>Hello already in P</p>
</td>
<td>
<p>
This one has some naked text
<span>and some span wrapped text</span>
</p>
</td>
</tr>
</table>
I'm using lxml etree in my project, but the library doesn't seem to have a "wrap" method or anything similar.
Now I'm thinking maybe this is a job for XSLT transformations, but I'd like to avoid adding another layer of complexity and more dependencies to my Python project.
Note that the content of the <td> elements can be nested to any depth.
I don't use the lxml package myself, but try the following:
from lxml import etree

def wrap(root):
    # Find <td> elements that do not already contain a <p> element.
    cells = etree.XPath("//td[not(p)]")(root)
    for cell in cells:
        # Create a new <p> element.
        e = etree.Element("p")
        # Move the cell's leading text into the <p> element.
        e.text = cell.text
        cell.text = None
        # Move the cell's children into the <p> element
        # (so the <span> in the last cell ends up nested inside it).
        # append() detaches each child from the <td> and attaches it to the <p>.
        for child in list(cell):
            e.append(child)
        # Attach the new <p> element as the cell's only child.
        cell.append(e)
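A quick way to try it out (a sketch; xml is assumed to hold the <table> snippet from the question, and remove_blank_text just keeps pretty_print tidy):
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)
wrap(root)
print(etree.tostring(root, pretty_print=True).decode())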
I am new to scrapy, and I am trying to get the text value of the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id="yw1">
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the text value of the title attribute is not extracted. What am I doing wrong?
Cheers!
It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
).extract()
This will return a list of strings.
text() is not useful for getting attribute values: @title already selects the attribute's value, and an attribute has no text nodes for text() to match.
Note that extract() is what turns the selectors into plain strings, so keep it unless you actually want Selector objects in your item.
EDIT:
If the above still does not work, another possibility is the tbody problem: browsers insert <tbody> elements that may not exist in the HTML the server actually sends (see http://doc.scrapy.org/en/latest/topics/firefox.html). You can try:
td[1]/table//tr[1]/td[1]/img/@title
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)
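Putting the fixes together, the parse() from the question would look something like this (a sketch; TransfermarktItem and the field names come from the question, and the /tbody/ steps assume the served HTML really contains tbody elements, as the sample above does):
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        # @title selects the attribute value directly; no text() step is needed.
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item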
I'm reading an HTML file with BeautifulSoup. I have a table in the HTML from which I need to read data, but the HTML contains more than one table.
To distinguish between the tables, I need to know the number of columns in each row by counting <td> tags.
I count like this:
for row in soup.find_all('tr'):
    for cell in row.find_all_next('td'):
This returns all <td> tags after the <tr>, all the way to the end of the document. But I need the number of <td> tags between the start of a row (<tr>) and the end of that row (</tr>).
<tr> <!-- Should return 2 columns, but will return 4 in script. -->
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
Replace find_all_next with find_all.
find_all_next gives you all matching tags after the <tr>, to the end of the document, as you said.
find_all searches only within the row itself.
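A minimal sketch of the fixed loop, assuming html holds the two sample rows above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all('tr'):
    # find_all searches only inside this row, so each row reports its own cells
    print(len(row.find_all('td')))  # prints 2, then 2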
Let's say I have an HTML structure like this:
<html><head></head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<td>Left</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Center</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Right</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I would like to construct CSS selectors to access the three sub tables, which are only distinguished by the contents of a table data item in their first row.
How can I do this?
I don't think CSS selectors offer a way to match on an element's inner text.
You can achieve this with an XPath or jQuery locator instead.
XPath:
"//td[contains(text(),'Left')]"
or
"//td[text()='Right']"
jQuery:
jQuery("td:contains('Center')")
Using logic like the below, you can execute jQuery locators from WebDriver automation (the script must return the raw DOM node, hence the return and .get(0), and the page itself needs jQuery loaded):
String locator = "return jQuery(\"td:contains('Center')\").get(0);";
JavascriptExecutor js = (JavascriptExecutor) driver;
WebElement element = (WebElement) js.executeScript(locator);
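If what you ultimately need is the sub-table element rather than the matching cell, one option is to find the cell by its text and climb back up with XPath's ancestor axis. A Python/Selenium sketch, assuming driver is a live WebDriver on the page above (the find_element_by_xpath spelling matches the answer below):
cell = driver.find_element_by_xpath("//td[text()='Left']")
sub_table = cell.find_element_by_xpath("./ancestor::table[1]")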
The .text attribute of an element returns the text of that element.
# `page` can be the driver itself or a parent element to search within.
tables = page.find_elements_by_xpath('.//table')
contents = "Left Center Right".split()
results = []
for table in tables:
    # find_element (singular) returns only the first matching <td>
    if table.find_element_by_xpath('.//td').text in contents:
        results.append(table)
You can narrow the search field by setting 'page' to the first 'table' element and then running your search over that; there are all kinds of ways to improve performance like this. Note that this method will be fairly slow if there are a lot of extraneous tables present. Each webpage has its quirks in how it chooses to represent information, so make sure you work around those to gain efficiency.
You can also use list comprehension to return your results.
results = [t for t in tables if t.find_element_by_xpath('.//td').text in contents]