I have an XML file like this:
$ cat sample.xml
<Requests>
<Request>
<ID>123</ID>
<Items>
<Item>a item</Item>
<Item>b item</Item>
<Item>c item</Item>
</Items>
</Request>
<Request>
<ID>456</ID>
<Items>
<Item>d item</Item>
<Item>e item</Item>
</Items>
</Request>
</Requests>
I simply want to extract the XML of the Request elements that have a certain value in their grandchild element Item. Here is the code:
bash-4.2$ cat xsearch.py
import sys
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    tree = ET.parse(sys.argv[1])
    root = tree.getroot()
    for request in root.findall(".//Item[.='c item']/../.."):
    #for request in root.findall(".//Request[Items/Item = 'c item']"):
        print(request)
I got "invalid predicate" error:
bash-4.2$ python3 xsearch.py sample.xml
Traceback (most recent call last):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 263, in iterfind
selector = _cache[cache_key]
KeyError: (".//Item[.='c item']/../..", None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xsearch.py", line 8, in <module>
for request in root.findall(".//Item[.='c item']/../.."):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 304, in findall
return list(iterfind(elem, path, namespaces))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 277, in iterfind
selector.append(ops[token[0]](next, token))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 233, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
Could anyone point out where I got it wrong?
In general, an XPath invalid predicate error means something is syntactically wrong with one of the XPath's predicates, the code between the [ and ].
Specifically in your case, there are two issues:
The SyntaxError("invalid predicate") is because there's an extra ) in the predicate:
for request in root.findall(".//Item[.='c item')]/../.."):
^
Note also that you can hoist the predicate to avoid navigating down and then back up (../..):
Instead of
.//Item[.='c item']/../..
consider
.//Request[Items/Item = 'c item']
to select the Request element with the targeted Item.
The XPath library you're using, ElementTree, is not a full implementation of the XPath standard. You can waste a lot of time trying to identify what ElementTree does support (".//Items[Item='c item']/.." happens to work here) and does not support, but it'd be better to just use a more compliant library such as lxml.
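If you need to stay with the standard library, here is a minimal sketch of two ways to get the matching Request elements and print their XML. It assumes the same structure as sample.xml (inlined here so the snippet runs on its own). The first uses the ElementTree-compatible path mentioned above; the second avoids predicates entirely and filters in Python, which should not depend on which XPath subset your Python version supports.

```python
import xml.etree.ElementTree as ET

# Same structure as sample.xml in the question, inlined so the sketch is self-contained.
SAMPLE = """<Requests>
  <Request><ID>123</ID><Items><Item>a item</Item><Item>b item</Item><Item>c item</Item></Items></Request>
  <Request><ID>456</ID><Items><Item>d item</Item><Item>e item</Item></Items></Request>
</Requests>"""

root = ET.fromstring(SAMPLE)

# Option 1: the ElementTree-compatible path mentioned above.
by_path = root.findall(".//Items[Item='c item']/..")

# Option 2: no predicate at all, just filter the Request elements in Python.
by_filter = [
    request
    for request in root.iter("Request")
    if any(item.text == "c item" for item in request.iter("Item"))
]

for request in by_filter:
    # tostring() serializes the element; printing the element itself only shows its repr.
    print(ET.tostring(request, encoding="unicode"))
```

Both options select the same Request element here; the Python filter is longer but easier to extend to conditions that ElementTree's path language cannot express.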
Related
I am trying to parse an xpath but it is giving Invalid expression error.
The code that should work:
x = tree.xpath("//description/caution[1]/preceding-sibling::*/name()!='warning'")
print(x)
Expected result is a boolean value but it is showing error:
Traceback (most recent call last):
File "poc_xpath2.0_v1.py", line 9, in <module>
x = tree.xpath("//description/caution[1]/preceding-sibling::*/name()!='warning'")
File "src\lxml\etree.pyx", line 2276, in lxml.etree._ElementTree.xpath
File "src\lxml\xpath.pxi", line 359, in lxml.etree.XPathDocumentEvaluator.__call__
File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
The exception is because name() isn't a valid node type. Your XPath would only be valid as XPath 2.0 or greater. lxml only supports XPath 1.0.
You would need to move the name() != 'warning' into a predicate.
Also, if you want a True/False result, wrap the xpath in boolean()...
tree.xpath("boolean(//description/caution[1]/preceding-sibling::*[name()!='warning'])")
Full example...
from lxml import etree
xml = """
<doc>
<description>
<warning></warning>
<caution></caution>
</description>
</doc>"""
tree = etree.fromstring(xml)
x = tree.xpath("boolean(//description/caution[1]/preceding-sibling::*[name()!='warning'])")
print(x)
This would print False.
My XML file looks something like this:
<SCAN_LIST_OUTPUT>
<RESPONSE>
<DATETIME>2018-05-21T11:29:05Z</DATETIME>
<SCAN_LIST>
<SCAN>
<REF>scan/1526727908.25005</REF>
<TITLE><![CDATA[ACRS_Scan]]></TITLE>
<LAUNCH_DATETIME>2018-05-19T11:05:08Z</LAUNCH_DATETIME>
</SCAN>
<SCAN>
<REF>scan/1526549903.07613</REF>
<TITLE><![CDATA[testScan]]></TITLE>
<LAUNCH_DATETIME>2018-05-17T09:38:23Z</LAUNCH_DATETIME>
</SCAN>
</SCAN_LIST>
</RESPONSE>
</SCAN_LIST_OUTPUT>
Now, when I try to find the REF element of the first SCAN element, using a path based on its known LAUNCH_DATETIME, I get an "invalid predicate" error.
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(response))
groot = tree.getroot()
path = './/REF[../LAUNCH_DATETIME="2018-05-19T11:05:08Z"]'
scan_id = tree.find(path)
Here is the traceback:
KeyError: ('.//REF[../LAUNCH_DATETIME="2018-05-19T11:05:08Z"]', None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/doomsday/PycharmProjects/untitled/venv/ScanList.py", line 44, in <module>
scan_id = tree.find(path)
File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 651, in find
return self._root.find(path, namespaces)
File "/usr/lib/python3.5/xml/etree/ElementPath.py", line 298, in find
return next(iterfind(elem, path, namespaces), None)
File "/usr/lib/python3.5/xml/etree/ElementPath.py", line 277, in iterfind
selector.append(ops[token[0]](next, token))
File "/usr/lib/python3.5/xml/etree/ElementPath.py", line 233, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
When I use the same path in an online XPath evaluator, it gives me the desired output, but in my code it fails. It would be great if anyone could tell me what the problem is and how it can be resolved.
Thanks in advance.
ElementTree's xpath support is limited. Instead of trying to go back up the tree with .. in a predicate on REF, add the predicate to SCAN.
Example...
path = './/SCAN[LAUNCH_DATETIME="2018-05-19T11:05:08Z"]/REF'
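For reference, a small self-contained sketch of that path in use. The RESPONSE string is a trimmed copy of the XML from the question (TITLE and DATETIME omitted), and the None check is an addition in case no SCAN matches.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question.
RESPONSE = """<SCAN_LIST_OUTPUT>
  <RESPONSE>
    <SCAN_LIST>
      <SCAN>
        <REF>scan/1526727908.25005</REF>
        <LAUNCH_DATETIME>2018-05-19T11:05:08Z</LAUNCH_DATETIME>
      </SCAN>
      <SCAN>
        <REF>scan/1526549903.07613</REF>
        <LAUNCH_DATETIME>2018-05-17T09:38:23Z</LAUNCH_DATETIME>
      </SCAN>
    </SCAN_LIST>
  </RESPONSE>
</SCAN_LIST_OUTPUT>"""

root = ET.fromstring(RESPONSE)

# The predicate sits on SCAN (the parent), so no ".." is needed inside it.
path = './/SCAN[LAUNCH_DATETIME="2018-05-19T11:05:08Z"]/REF'
ref = root.find(path)

# find() returns None when nothing matches, so guard before reading .text.
scan_id = ref.text if ref is not None else None
print(scan_id)
```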
I'm trying to write a simple script that parses my XML document to get the name attribute from all <xs:element> tags. I'm using minidom (is there a better way?). Here is my code so far:
import csv
from xml.dom import minidom
xmldoc = minidom.parse('core.xml')
core = xmldoc.getElementsByTagName('xs:element')
print(len(core))
print(core[0].attributes['name'].value)
for x in core:
    print(x.attributes['name'].value)
I'm getting this error:
Traceback (most recent call last):
File "C:/Users/user/Desktop/XML Parsing/test.py", line 9, in <module>
print(core[0].attributes['name'].value)
File "C:\Python27\lib\xml\dom\minidom.py", line 522, in __getitem__
return self._attrs[attname_or_tuple]
KeyError: 'name'
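The KeyError means that the first xs:element in your document has no name attribute (for example, it may use ref instead).
getElementsByTagName returns a NodeList, so you still have to index or iterate it; just skip the elements that lack the attribute:
names = [x.getAttribute('name') for x in core if x.hasAttribute('name')]
print(names)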
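On the "is there a better way?" part, one option is ElementTree with a namespace map. This is only a sketch: the schema fragment below is made up for illustration (core.xml was not shown), and it assumes the xs prefix is bound to the usual XML Schema namespace. Note that the first element deliberately has no name attribute, which is the situation that produces the KeyError above.

```python
import xml.etree.ElementTree as ET

# Made-up schema fragment; the first element uses ref= and has no name=.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element ref="other"/>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="author" type="xs:string"/>
</xs:schema>"""

# Prefix-to-URI map used to resolve "xs:" in the search path.
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

root = ET.fromstring(XSD)

# Element.get() returns None for a missing attribute instead of raising.
names = [
    element.get("name")
    for element in root.iterfind(".//xs:element", NS)
    if element.get("name") is not None
]
print(names)
```

Unlike getElementsByTagName('xs:element'), this matches on the namespace URI rather than the literal prefix, so it keeps working if the document uses a different prefix such as xsd.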
I have some xml pieces like this:
<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
<player_birthday>1979-09-23</player_birthday>
<player_name>Orene Ai'i</player_name>
<player_team>Blues</player_team>
<player_id>453</player_id>
<player_height>170</player_height>
<player_position>F&W</player_position> <---- a '&' here.
<player_weight>75</player_weight>
</record>
Is there any way to validate whether the XML pieces are well-formed?
Is there any way to validate the XML against a DTD or XML Schema?
For various reasons I can't use any third-party packages.
E.g. the XML above is not correct since it has a '&' in it. Note that the DOCTYPE declaration refers to a DTD.
Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring) - it will raise an error if the XML is not well formed.
>>> a = """<record>
... <player_birthday>1979-09-23</player_birthday>
... <player_name>Orene Ai'i</player_name>
... <player_team>Blues</player_team>
... <player_id>453</player_id>
... <player_height>170</player_height>
... <player_position>F&W</player_position> <---- a '&' here.
... <player_weight>75</player_weight>
... </record>"""
>>>
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
parser.feed(text)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24
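Wrapped up as a reusable check, that could look like the sketch below. The helper name is_well_formed is made up for illustration. As far as I know the standard library parsers are non-validating, so this only covers well-formedness, not validation against the DTD or a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser.
        print("not well-formed at line %d, column %d" % err.position)
        return False
    return True


# A bare '&' is rejected; the escaped form '&amp;' is accepted.
bad = "<record><player_position>F&W</player_position></record>"
good = "<record><player_position>F&amp;W</player_position></record>"
print(is_well_formed(bad))
print(is_well_formed(good))
```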
You can use python's xml.dom.minidom XML parser (which is in the standard library, but isn't as powerful as alternatives such as lxml).
Just do:
import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/><XML/><My/>')
You will get an xml.parsers.expat.ExpatError if the XML is not well-formed.
I have this very simple python code to read xml for the wikipedia api:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
xmldoc=minidom.parse(usock)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I am just learning Python. Is there a way to get an error with more detail? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above in a browser. Try adding a format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
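As for getting more detail out of the error, here is a rough sketch in Python 3 syntax (the question's code is Python 2). It builds the URL with format=xml and, when parsing fails, reports where the parser stopped along with the start of the data, which makes an HTML response easy to spot. The helper names build_url and parse_with_context are made up for illustration, and the snippet makes no network request itself.

```python
from urllib.parse import urlencode
from xml.dom import minidom
from xml.parsers.expat import ExpatError

API = "http://en.wikipedia.org/w/api.php"


def build_url(title):
    """Build the query URL from the question, with format=xml added (hypothetical helper)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": 500,
        "format": "xml",
    }
    return API + "?" + urlencode(params)


def parse_with_context(data):
    """Parse XML text; on failure, report the error location and return None."""
    try:
        return minidom.parseString(data)
    except ExpatError as err:
        # ExpatError carries the line number and column offset of the failure.
        print("parse failed at line %d, column %d" % (err.lineno, err.offset))
        print("data starts with: %r" % data[:80])
        return None


print(build_url("Fractal"))
# An HTML-like page is generally not well-formed XML, so this prints the error context.
parse_with_context("<!DOCTYPE html><html><p>unclosed</html>")
```

To use it against the live API you would pass the text read from urllib.request.urlopen(build_url("Fractal")) to parse_with_context.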