Strange results of lxml xpath element search using string-value


I can't understand the results of the following XPath query:
from lxml import etree
from io import StringIO
s = '<aaa><bbb>f<ccc>e</ccc>d</bbb></aaa>'
tree = etree.parse(StringIO(s))
print(tree.xpath('//bbb[.="fed"]')) #prints an empty list!
According to the XPath specification,
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
So I expect to get the bbb element.
Even more puzzling, each of the following queries returns bbb:
tree.xpath('//bbb[contains(.,"fed")]')
tree.xpath('//bbb[normalize-space(.)="fed"]')
tree.xpath('//bbb[string-length(.)=3]')
Where am I wrong? Or is it a bug in lxml?

The XPath //bbb[.="fed"] selects a bbb element whose string-value is fed.
Check that your XPath is correct and that it does not match more than one node.
Post your DOM or a link if you would like us to write an XPath for you.
Hope this helps :)

Turned out to be a bug. Now fixed (checked for lxml v. 4.5.1).
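For reference, the string-value described by the quoted rule can be inspected directly. The sketch below reuses the document from the question. It only shows that the concatenated text is fed; it makes no claim about which library versions mishandle the .="fed" predicate.

```python
from io import StringIO
from lxml import etree

s = '<aaa><bbb>f<ccc>e</ccc>d</bbb></aaa>'
tree = etree.parse(StringIO(s))
bbb = tree.find('bbb')

# XPath string() returns the string-value of the first node in the node-set.
xpath_value = tree.xpath('string(//bbb)')

# The same concatenation done in Python: all text below bbb, in document order.
python_value = ''.join(bbb.itertext())

print(xpath_value, python_value)
```

Both values are the three-character string the question expected the predicate to match.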

Related

Parsing a wiki-styled web page, XPath error

I am new to XPath, and I am failing to parse a simple wiki-styled web page with lxml.
I have the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))
It works fine, but I need to exclude children whose class is "reference", and I get an lxml.etree.XPathEvalError with the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))
What is the right XPath expression? Thanks in advance :)

The error probably occurred because of .text() instead of /text().
If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis:
//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
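A minimal sketch of that expression in use, assuming a simplified, made-up stand-in for the wiki markup (the real page will be more deeply nested). The second expression, strict_expr, is not from the answer above; it is an alternative that filters the text nodes by their ancestors.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the wiki page markup.
page = etree.fromstring(
    '<html><body><div id="mw-content-text"><div>'
    '<p>Python is a language<sup class="reference">[1]</sup> for scripting.</p>'
    '</div></div></body></html>'
)

# Expression from the answer: skip elements whose class is "reference".
expr = ('//*[@id="mw-content-text"]/div[1]/p'
        '/descendant-or-self::*[not(@class="reference")]/text()')
text = ''.join(page.xpath(expr))

# Alternative (an assumption, not from the answer): drop any text node that
# sits anywhere inside a "reference" element.
strict_expr = ('//*[@id="mw-content-text"]/div[1]/p'
               '//text()[not(ancestor::*[@class="reference"])]')
strict_text = ''.join(page.xpath(strict_expr))

print(text)
print(strict_text)
```

The first expression only checks the element that directly owns each text node, so text inside a child of a reference element (for example a link nested in the sup) would still be returned. The ancestor-based variant excludes that text as well.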

search entire tree with etree

I am using xml.etree.ElementTree as ET, this seems like the go-to library but if there is something else/better for the job I'm intrigued.
Let's say I have a tree like:
doc = """
<top>
<second>
<third>
<subthird></subthird>
<subthird2>
<subsubthird>findme</subsubthird>
</subthird2>
</third>
</second>
</top>"""
and for the sake of this problem, let's say this is already in an ElementTree named myTree.
I want to update findme to found, is there a simple way to do it other than iterating like:
myTree.getroot().getchildren()[0].getchildren()[0].getchildren() \
[1].getchildren()[0].text = 'found'
The issue is I have a large xml tree and I want to update these values and I can't find a clear and pythonic way to do this.

You can use XPath expressions to get a specific tagname like this:
for el in myTree.getroot().findall(".//subsubthird"):
    el.text = 'found'
If you need to find all tags with a specific text value, take a look at this answer: Find element by text with XPath in ElementTree.
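As a rough sketch of the text-based case using only the standard library, the loop below walks every subsubthird element and updates the ones whose text is findme. The flattened one-line document is a shortened copy of the one in the question. Note that getchildren(), used in the question, was removed from xml.etree.ElementTree in Python 3.9.

```python
import xml.etree.ElementTree as ET

doc = ('<top><second><third><subthird/>'
       '<subthird2><subsubthird>findme</subsubthird></subthird2>'
       '</third></second></top>')
root = ET.fromstring(doc)

# iter() visits matching elements at any depth, so no index chains are needed.
for el in root.iter('subsubthird'):
    if el.text == 'findme':
        el.text = 'found'

result = ET.tostring(root, encoding='unicode')
print(result)
```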

I use lxml with XPath expressions. ElementTree has an abbreviated XPath syntax, but since I don't use it, I don't know how extensive it is. The thing about XPath is that you can write as complex an element selector as you need. In this example, it's based on nesting:
import lxml.etree
doc = """
<top>
<second>
<third>
<subthird></subthird>
<subthird2>
<subsubthird>findme</subsubthird>
</subthird2>
</third>
</second>
</top>"""
root = lxml.etree.XML(doc)
for elem in root.xpath('second/third/subthird2/subsubthird'):
    elem.text = 'found'
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode'))
But suppose there was something else identifying, such as a unique attribute,
<subthird2 class="foo"><subsubthird>findme</subsubthird></subthird2>
then your XPath would be //subthird2[@class="foo"]/subsubthird.
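A short sketch of the attribute-based selector, assuming a variant of the document with two subthird2 elements (the class="bar" sibling is made up to show that only the matching one changes):

```python
import lxml.etree

doc = """
<top>
  <second>
    <third>
      <subthird2 class="bar"><subsubthird>leave me</subsubthird></subthird2>
      <subthird2 class="foo"><subsubthird>findme</subsubthird></subthird2>
    </third>
  </second>
</top>"""
root = lxml.etree.XML(doc)

# Only the subsubthird under the class="foo" parent is selected.
for elem in root.xpath('//subthird2[@class="foo"]/subsubthird'):
    elem.text = 'found'

texts = root.xpath('//subsubthird/text()')
print(texts)
```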

What does etree.xpath function of python lxml library do?

I understand that the etree object from the lxml library is a tree representation of an XML document. It is not clear to me what the .xpath function does. I just need to know how to interpret its argument and its output. I saw the following examples of use:
tree.xpath('.//' + tagname)
html.xpath("string()")
html.xpath("//text()")
What do all these string() and //text() mean?

What you see there is XPath. https://de.wikipedia.org/wiki/XPath
The xpath() function will take a valid XPath and return the result. You could say XPath is a query language for xml documents.
This should clear things up: http://lxml.de/xpathxslt.html#the-xpath-method
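To make the three expressions from the question concrete, here is a small sketch on a made-up one-line document. The return type depends on the expression: a path that selects elements gives a list of elements, string() gives a single string, and //text() gives a list of text nodes.

```python
from lxml import etree

root = etree.fromstring('<p>Hello <b>big</b> world</p>')

# './/' + tagname: all descendant elements with that tag name.
elements = root.xpath('.//b')

# string(): the concatenated text of the context node and its descendants.
whole_text = root.xpath('string()')

# //text(): every text node in the document, as separate strings.
text_nodes = root.xpath('//text()')

print(elements, whole_text, text_nodes)
```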

XPath failing using lxml

I have used xpaths to great effect with both HTML and XML before, but can't seem to get any results this time.
The data is from http://www.ahrefs.com/api/, under "Example of an answer", saved to an .xml file
My code:
from lxml import etree
doc = etree.XML(open('example.xml').read())
print(doc.xpath('//result'))
which doesn't give any results.
Where am I going wrong?

You need to take the namespace of the document into account:
from lxml import etree
doc = etree.parse('example.xml')
print(doc.xpath('//n:result',
                namespaces={'n': "http://ahrefs.com/schemas/api/links/1"}))
=>
[<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d670>,
<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d698>]

My experience is from using XPath in C#, but I believe the XML namespace is causing your query to fail. You'll need to either use some variation of the local-name() function, or check your documentation for some way of defining the namespace beforehand.
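Both approaches can be compared in a small sketch. The namespace URI is the one from the answer above; the links root element and the result contents are made up for illustration and are not the real API response.

```python
from lxml import etree

NS = 'http://ahrefs.com/schemas/api/links/1'

# Hypothetical document: a default namespace puts every element in NS.
xml = ('<links xmlns="%s">' % NS +
       '<result>a</result><result>b</result></links>')
doc = etree.fromstring(xml)

# An unprefixed name in XPath 1.0 only matches elements in no namespace.
plain = doc.xpath('//result')

# Bind a prefix of your choice to the namespace URI.
prefixed = doc.xpath('//n:result', namespaces={'n': NS})

# Or ignore namespaces and compare the local part of the name.
by_local_name = doc.xpath('//*[local-name()="result"]')

print(len(plain), len(prefixed), len(by_local_name))
```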

what is the simplest way to get element content by XPath with Python?

I need to get content for this XPath:
/html/body/div/table[2]/tbody/tr/td[2]
It's copied from FireBug. How can I do this? I have a very large HTML document, so I don't want (and don't know how:) ) to grep it. Thanks.

lxml can handle html (and provides pretty good xpath support):
>>> import lxml.html
>>> tree = lxml.html.parse('test.html')
>>> for node in tree.xpath('/html/body/div/table[2]/tbody/tr/td[2]'):
...     print(node.text)
...
first row, second column
second row, second column
Just make sure that you use its HTML parser.

import lxml.html as h
tree = h.parse("keys_results.html")
text = tree.xpath("string(//*[contains(text(),'needed_text')])")
print(text)
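One detail worth knowing when a cell contains nested markup: node.text only holds the text before the first child element, while text_content() (available on lxml.html elements) returns all of the text inside the element. A minimal sketch, assuming a made-up page whose structure matches the XPath from the question:

```python
import lxml.html

# Hypothetical page; the second table's second cell contains a nested <b>.
page = lxml.html.fromstring(
    '<html><body><div>'
    '<table><tbody><tr><td>other table</td></tr></tbody></table>'
    '<table><tbody><tr>'
    '<td>first row, first column</td>'
    '<td>first row, <b>second</b> column</td>'
    '</tr></tbody></table>'
    '</div></body></html>'
)

cell = page.xpath('/html/body/div/table[2]/tbody/tr/td[2]')[0]

print(cell.text)            # text before the first child element only
print(cell.text_content())  # all text inside the cell
```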
