What does the etree.xpath function of the Python lxml library do?

I understand that an etree object from the lxml library is a tree representation of an XML document, but it is not clear to me what the .xpath function does. I just need to know how to interpret its argument and its output. I saw the following examples of use:
tree.xpath('.//' + tagname)
html.xpath("string()")
html.xpath("//text()")
What do all these string() and //text() mean?

What you see there is XPath. https://de.wikipedia.org/wiki/XPath
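The xpath() function takes a valid XPath expression and returns the result of evaluating it. You could say XPath is a query language for XML documents. In the examples from the question, './/' + tagname selects every descendant element with that tag name, string() returns the concatenated text of the context node and all its descendants, and //text() selects every text node in the document.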
This should clear things up: http://lxml.de/xpathxslt.html#the-xpath-method
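A minimal sketch of the three calls from the question, run against a small made-up document (the element names doc, p and b and the variable names are assumptions for illustration only):

```python
from lxml import etree

# Hypothetical sample document, not taken from the question.
root = etree.fromstring("<doc><p>Hello <b>big</b> world</p><p>Bye</p></doc>")

tagname = "p"

# A node-set expression returns a list of Element objects.
paragraphs = root.xpath(".//" + tagname)

# string() returns one string: the text of the context node and all descendants.
whole_text = root.xpath("string()")   # 'Hello big worldBye'

# //text() returns a list with every text node in document order.
text_nodes = root.xpath("//text()")   # ['Hello ', 'big', ' world', 'Bye']

print(len(paragraphs), repr(whole_text), text_nodes)
```

So the return type depends on the expression: a list for node-sets, a plain string for string functions.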

Related

Strange results of lxml xpath element search using string-value

I can't understand the results of the following XPath query:
from lxml import etree
from io import StringIO
s = '<aaa><bbb>f<ccc>e</ccc>d</bbb></aaa>'
tree = etree.parse(StringIO(s))
print(tree.xpath('//bbb[.="fed"]')) #prints an empty list!
According to the XPath specification,
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
So I expect to get bbb element.
Even more puzzling is that each of the following queries returns bbb:
tree.xpath('//bbb[contains(.,"fed")]')
tree.xpath('//bbb[normalize-space(.)="fed"]')
tree.xpath('//bbb[string-length(.)=3]')
Where am I wrong? Or is it a bug in lxml?
The XPath //bbb[.="fed"] selects bbb elements whose string-value is fed.
Check that your XPath is correct and that it does not return multiple values.
Post your DOM or a link if you would like us to write an XPath for you.
Hope this helps :)
Turned out to be a bug. Now fixed (checked for lxml v. 4.5.1).
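One way to check what the string-value of bbb actually is, independently of the equality predicate, is to ask for it directly. This is only a diagnostic sketch using the document from the question; the variable names are assumptions:

```python
from io import StringIO
from lxml import etree

tree = etree.parse(StringIO("<aaa><bbb>f<ccc>e</ccc>d</bbb></aaa>"))

# The string-value of bbb: all descendant text nodes concatenated.
string_value = tree.xpath("string(//bbb)")

# XPath numbers come back as Python floats.
length = tree.xpath("string-length(//bbb)")

# The query from the question; on lxml versions where the bug is fixed
# this is expected to contain the bbb element.
matches = tree.xpath('//bbb[.="fed"]')

print(repr(string_value), length, matches)
```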

Python lxml: how to get human-readable XPath for XML element?

I have a short XML document:
<tag1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://example.com/2009/namespace">
<tag2>
<tag3/>
<tag3/>
</tag2>
</tag1>
A short Python program loads this XML file like this:
from lxml import etree
f = open( 'myxml.xml' )
tree = etree.parse(f)
MY_NAMESPACE = 'http://example.com/2009/namespace'
xpath = etree.XPath( '/f:tag1/f:tag2/f:tag3', namespaces = { 'f': MY_NAMESPACE } )
# get first element that matches xpath
elem = xpath(tree)[0]
# get xpath for an element
print tree.getpath(elem)
I expect to get a meaningful, human-readable XPath from this code; instead I get a string like /*/*/*[1].
Any idea what could be causing this and how I can diagnose this issue?
Note: Using Python 2.7.9 and lxml 2.3
It looks like getpath() (which calls libxml2's xmlGetNodePath underneath) produces a positional XPath expression for namespaced documents.
User mzjn pointed out in the comments that since lxml v3.4.0 the function getelementpath() produces a human-readable path with fully qualified tag names (using "Clark notation"). It builds the path by traversing the tree from the node up to the root instead of calling the libxml2 API.
Similarly, if lxml v3.4+ is not available, one can write a tree traversal function of one's own.
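A sketch of both options, using the document from the question inlined as a string. The helper readable_path and its prefixes argument are hypothetical names written for this illustration, not part of lxml:

```python
from lxml import etree

MY_NAMESPACE = "http://example.com/2009/namespace"
XML = (
    '<tag1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns="http://example.com/2009/namespace">'
    "<tag2><tag3/><tag3/></tag2></tag1>"
)
tree = etree.ElementTree(etree.fromstring(XML))
elem = tree.xpath("/f:tag1/f:tag2/f:tag3", namespaces={"f": MY_NAMESPACE})[0]

# Option 1 (lxml >= 3.4): an ElementPath expression in Clark notation,
# usable with tree.find().
clark_path = tree.getelementpath(elem)


def readable_path(node, prefixes):
    """Hypothetical helper: build a prefixed XPath by walking up to the root.

    prefixes maps namespace URI -> prefix chosen by the caller.
    """
    steps = []
    while node is not None:
        qname = etree.QName(node)
        prefix = prefixes.get(qname.namespace)
        step = f"{prefix}:{qname.localname}" if prefix else qname.localname
        parent = node.getparent()
        if parent is not None:
            # Add a 1-based position only when siblings share the same tag.
            same_tag = list(parent.iterchildren(node.tag))
            if len(same_tag) > 1:
                step += f"[{same_tag.index(node) + 1}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


pretty_path = readable_path(elem, {MY_NAMESPACE: "f"})
print(clark_path)
print(pretty_path)
```

The prefixed path can be passed back to tree.xpath() together with the same prefix mapping, while the Clark-notation path round-trips through tree.find().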

lxml: keep the same tags in fromstring and tostring

I'm using lxml to parse an XML message. What I want to do is convert the string into an XML message, extract some information with XPath expressions, edit a few attributes and then dump the XML into a string again.
lxml does a wonderful job of this, except for one thing: it does not preserve the form of the tags as originally provided. What I mean is that if your input is:
xml_str = "<root><tag><tutu/></tag></root>"
or
xml_str = "<root><tag><tutu></tutu></tag></root>"
The following code will return the same thing:
>>> from lxml import etree
>>> root = etree.XML(xml_str)
>>> print etree.tostring(root)
<root><tag><tutu/></tag></root>
The tutu tag will be rendered as <tutu/> no matter what.
I found here that by setting the text of the element to '' we can force the closing tag to be explicitly rendered.
My issue is the following: I need the tags to be rendered exactly the same before and after going through lxml, because an external program performs a string comparison on both strings and would flag <tutu/> versus <tutu></tutu> as a mismatch.
I know we can create a custom ElementTree class as well as a custom parser. My idea was to record, while parsing the string, which type of tag each element used (short or extended) in the custom ElementTree, and then, before calling tostring, to update the tree and set each element's text to None or '' so that the output keeps the same type of tag as the input.
The question is: how can I find out which type of tag I have? Or do you have any other idea for solving this issue?
Thanks a lot for your help
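The behaviour described in the question can be reproduced in a few lines. This is only a sketch of the text = '' trick mentioned above (written for Python 3, where tostring returns bytes); it does not by itself recover which form the input used, and the variable names are assumptions:

```python
from lxml import etree

# Both input forms parse to the same tree: the element's text is None.
short_in = etree.XML("<root><tag><tutu/></tag></root>")
long_in = etree.XML("<root><tag><tutu></tutu></tag></root>")
same_text = short_in.find("tag/tutu").text is None and long_in.find("tag/tutu").text is None

# Default serialization uses the short form.
short_out = etree.tostring(long_in)

# Setting the text to an empty string forces an explicit closing tag.
long_in.find("tag/tutu").text = ""
long_out = etree.tostring(long_in)

print(same_text, short_out, long_out)
```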

Can Etree handle these kinds of XPath Queries

Can Python's XML parsing library ElementTree (etree) handle complex XPath queries like the following?
# Note the "[text()=\"USER_4D\"]"
assert root.find("Group/EnvConfig/Overrides/Override/Key[text()=\"USER_4D\"]") != None
I am getting the error SyntaxError: invalid predicate on the above line. If I remove the predicate [text()=\"USER_4D\"], no exception is raised.
What's causing this error? Is my XPath incorrectly formatted, or can ElementTree not perform these kinds of XPath queries? Can you provide advice on how to fix this?
I hope I don't need to use a custom XML parsing library, because I am just trying to write some simple unit tests using Python's built-in XML parsing libraries. Are there other native Python libraries that can handle this XPath query?
ElementTree does not support the kind of XPath expression used in the question. See http://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support.
To assert the presence of a Key element with the text content USER_4D, you could use something like this:
key = root.find("Group/EnvConfig/Overrides/Override/Key")
assert key is not None  # Element exists
assert key.text == "USER_4D"  # Element has specific content
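Note that find() returns only the first Key, so the check above fails if the matching Key is not the first one. A sketch that checks every Key, using only the standard library; the sample document is made up for illustration, and the [.='text'] predicate assumes Python 3.7 or later:

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the path used in the question.
xml_text = """<Root><Group><EnvConfig><Overrides>
<Override><Key>OTHER</Key></Override>
<Override><Key>USER_4D</Key></Override>
</Overrides></EnvConfig></Group></Root>"""
root = ET.fromstring(xml_text)

# Portable approach: collect all Key elements and test their text in Python.
keys = root.findall("Group/EnvConfig/Overrides/Override/Key")
has_key = any(k.text == "USER_4D" for k in keys)

# Python 3.7+: ElementTree accepts [.='text'] (but still not text()=...).
match = root.find("Group/EnvConfig/Overrides/Override/Key[.='USER_4D']")

print(has_key, match is not None)
```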

XPath failing using lxml

I have used xpaths to great effect with both HTML and XML before, but can't seem to get any results this time.
The data is from http://www.ahrefs.com/api/, under "Example of an answer", saved to an .xml file
My code:
from lxml import etree
doc = etree.XML(open('example.xml').read())
print doc.xpath('//result')
which doesn't give any results.
Where am I going wrong?
You need to take the namespace of the document into account:
from lxml import etree
doc = etree.parse('example.xml')
print doc.xpath('//n:result',
namespaces={'n': "http://ahrefs.com/schemas/api/links/1"})
=>
[<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d670>,
<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d698>]
My experience is from using XPath in C#, but I believe the XML namespace is causing your query to fail. You'll need to either use some variation of the local-name() function or check the documentation for a way to define the namespace beforehand.
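Both suggestions can be sketched side by side. The sample document below is a made-up stand-in for the API response (the root element name links is an assumption); only the namespace URI and the result element name come from the answers above:

```python
from lxml import etree

NS = "http://ahrefs.com/schemas/api/links/1"

# Hypothetical stand-in for example.xml with a default namespace.
doc = etree.fromstring(f'<links xmlns="{NS}"><result/><result/></links>')

# Without a namespace the name test matches nothing.
unqualified = doc.xpath("//result")

# Option 1: bind a prefix of your choice to the namespace URI.
by_prefix = doc.xpath("//n:result", namespaces={"n": NS})

# Option 2: ignore the namespace and compare the local name only.
by_local_name = doc.xpath("//*[local-name()='result']")

print(len(unqualified), len(by_prefix), len(by_local_name))
```

The prefix form is stricter, since local-name() would also match a result element from any other namespace.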
