I am trying to parse an XML file like this:
<document>
<pages>
<page>
<paragraph>XBV</paragraph>
<paragraph>GHF</paragraph>
</page>
<page>
<paragraph>ash</paragraph>
<paragraph>lplp</paragraph>
</page>
</pages>
</document>
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("../../xml/test.xml")
root = tree.getroot()
path="./pages/page/paragraph[text()='GHF']"
print root.findall(path)
But I get an error:
print root.findall(path)
File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
return ElementPath.findall(self, path, namespaces)
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
return list(iterfind(elem, path, namespaces))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
selector.append(ops[token[0]](next, token))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
What is wrong with my XPath?
Follow up
Thanks falsetru, your solution worked. I have a follow-up question. Now I want to get all the paragraph elements that come before the paragraph with the text GHF. In this case I only need the XBV element, and I want to ignore ash and lplp. I guess one way to do this would be:
result = []
for para in root.findall('./pages/page/'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)
But is there a better way to do this?
ElementTree's XPath support is limited. Use another library such as lxml:
import lxml.etree
root = lxml.etree.parse('test.xml')
path = "./pages/page/paragraph[text()='GHF']"
print(root.xpath(path))
As @falsetru mentioned, ElementTree doesn't support the text() predicate, but it does support matching a child element by its text. In this example, you can search for a page that has a paragraph with specific text using the path ./pages/page[paragraph='GHF']. The problem here is that a page contains multiple paragraph tags, so you would still have to iterate over them to find the specific paragraph. In my case, I needed to find the version of a dependency in a Maven pom.xml, where there is only a single version child, so the following worked:
In [1]: import xml.etree.ElementTree as ET
In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}
In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'
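For the original document with several paragraphs per page, the iteration mentioned above could look roughly like this. It is only a sketch using the standard library: paragraphs_before is a made-up helper name, the sample XML is inlined from the question, and the search text is assumed not to contain a quote character.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<document><pages>
<page><paragraph>XBV</paragraph><paragraph>GHF</paragraph></page>
<page><paragraph>ash</paragraph><paragraph>lplp</paragraph></page>
</pages></document>"""


def paragraphs_before(root, text):
    """Return the paragraphs that precede the one with `text` on the same page."""
    # The [tag='text'] predicate is supported by ElementTree; text() is not.
    page = root.find("./pages/page[paragraph='%s']" % text)
    if page is None:
        return []
    before = []
    for para in page.findall("paragraph"):
        if para.text == text:
            break
        before.append(para)
    return before


root = ET.fromstring(XML)
print([p.text for p in paragraphs_before(root, "GHF")])  # prints ['XBV']
```

Selecting the page first keeps paragraphs from other pages (ash, lplp) out of the result without any extra checks.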
Related
I have an XML file with the following data:
<?xml version="1.0" encoding="utf-8"?>
<metadata>
<filter>
<regex>ATL|LAX|DFW</regex >
<start_char>3</start_char>
<end_char></end_char>
<action>remove</action>
</filter>
<filter>
<regex>DFW.+\.$</regex >
<start_char>3</start_char>
<end_char>-1</end_char>
<action>remove</action>
</filter>
<filter>
<regex>\-</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
<filter>
<regex>\s</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
</metadata>
I am trying to read the XML file into my Python code, loop through all the filter tags, and check whether the action tag is 'remove'. If it is, I want to remove the part of mfn_pn that matches the text within the regex tag.
Next, I want it to see if the action tag is 'substitute'. If it is 'substitute', I want it to substitute the text within the regex tag with what's in the replacement tag.
However, I keep getting the error
File "C:\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 50, column 13
Not sure what "not well-formed (invalid token)" is referring to.
from xml.etree.ElementTree import ElementTree
# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")
It looks like the error occurs in the first 4 lines of your script. As such, the rest of the script is not needed for a minimal reproducible example.
Having said that, interestingly the example from the documentation yields the same error.
Finally, I managed to resolve the issue by following the solution provided here.
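If it helps anyone hitting the same message: the line and column in the error point into the XML file, not into the Python script. A small sketch for locating the offending line is below. The broken input is a made-up example (an unescaped & is one common cause of this error, but not necessarily the cause in the question), and locate_parse_error is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the raw "&" would have to be written as "&amp;" in XML.
BROKEN = "<filter>\n<regex>ATL & LAX</regex>\n</filter>"


def locate_parse_error(xml_text):
    """Return (line, column, offending_line) for malformed XML, or None if it parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line numbers start at 1
        return line, column, xml_text.splitlines()[line - 1]
    return None


print(locate_parse_error(BROKEN))
print(locate_parse_error("<filter><regex>ATL &amp; LAX</regex></filter>"))
```

Printing the offending line next to the position usually makes the invalid character easy to spot.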
I would like to parse the following XML file using the Python xml ElementTree API.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<foos>
<foo_table>
<!-- bar -->
<fooelem>
<fname>BBBB</fname>
<group>SOMEGROUP</group>
<module>some module</module>
</fooelem>
<fooelem>
<fname>AAAA</fname>
<group>other group</group>
<module>other module</module>
</fooelem>
<!-- bar -->
</foo_table>
</foos>
In this example code I try to find all the elements under /foos/foo_table/fooelem/fname but obviously findall doesn't find anything when running this code.
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file="min.xml")
for i in tree.findall("./foos/foo_table/fooelem/fname"):
    print i
root = tree.getroot()
for i in root.findall("./foos/foo_table/fooelem/fname"):
    print i
I am not experienced with the ElementTree API, but I've used the example under https://docs.python.org/2/library/xml.etree.elementtree.html#example. Why is it not working in my case?
foos is your root, so you need to start findall one level below it, e.g.:
root = tree.getroot()
for i in root.findall("foo_table/fooelem/fname"):
    print i.text
Output:
BBBB
AAAA
This is because the path you are using begins BEFORE the root element (foos).
Use this instead: foo_table/fooelem/fname
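A short sketch comparing the three path forms on a trimmed copy of the XML from the question (inlined here so it runs on its own):

```python
import xml.etree.ElementTree as ET

XML = """<foos><foo_table>
<fooelem><fname>BBBB</fname></fooelem>
<fooelem><fname>AAAA</fname></fooelem>
</foo_table></foos>"""

root = ET.fromstring(XML)  # root is the <foos> element itself

# Looks for a <foos> child *inside* <foos>, so nothing matches.
with_root = root.findall("./foos/foo_table/fooelem/fname")
# Starts at the children of <foos>.
relative = [e.text for e in root.findall("foo_table/fooelem/fname")]
# Matches <fname> at any depth below the root.
anywhere = [e.text for e in root.findall(".//fname")]

print(with_root, relative, anywhere)
```

The first list is empty and the other two both contain BBBB and AAAA.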
findall doesn't work, but this does:
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse(myfile3).getroot()
mylist = list(e.iter('checksum'))
print(len(mylist))
mylist will have the proper length.
I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in the second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from the second line (keeping the rest of the text), I am able to get the accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
    print sensor_val.find('accelx').text
    print sensor_val.find('accely').text
I asked a related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (a namespace declared without a prefix):
xmlns="http://schemas.datacontract.org/Storage"
Note that descendant elements without a prefix implicitly inherit the default namespace from their ancestor. To reference an element in a namespace, you need to map a prefix to the namespace URI and use that prefix in your XPath:
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
    print sensor_val.find('d:accelx', ns).text
    print sensor_val.find('d:accely', ns).text
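To see why the unprefixed path finds nothing, it can help to print the tag the parser actually stores. The sketch below uses the standard library ElementTree rather than lxml (both accept a prefix mapping in findall) on a trimmed copy of the file; the prefix d is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

XML = """<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance"
 i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<Values>
<SensorValue><accelx>-3.593643144</accelx><accely>7.316485176</accely></SensorValue>
<SensorValue><accelx>0.31103436</accelx><accely>7.70408184</accely></SensorValue>
</Values>
</ArrayOfRecord>"""

root = ET.fromstring(XML)
print(root.tag)  # the parser stores the tag as {namespace-uri}localname

URI = "http://schemas.datacontract.org/Storage"
ns = {"d": URI}  # "d" is an arbitrary prefix chosen for this sketch

# Prefix form, resolved through the ns mapping.
by_prefix = [s.find("d:accelx", ns).text
             for s in root.findall("./d:Values/d:SensorValue", ns)]
# Equivalent fully qualified form, no mapping needed.
by_uri = [e.text for e in root.iter("{%s}accelx" % URI)]

print(by_prefix, by_uri)
```

Both lookups return the same two accelx values, so either form can be used depending on which reads better in your code.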
I am new to Python. As per a project requirement, I want to launch web requests for different test cases. For example (see Employee_req.xml below), for one test case I want to launch the web service with all organizations, but for another one I want to launch it with all first-name tags removed. I am using ElementTree in Python to deal with the XML. Please find the code segment below. Modifying tag and attribute values works without any issue, but removing certain tags throws an error. I am not good with XPath, so could you please suggest possible ways to do this?
Emp_req.xml
<request>
<orgaqnization>
<name>org1</name>
<employee>
<first-name>abc</first-name>
<last-name>def</last-name>
<dob>19870909</dob>
</employee>
</orgaqnization>
<orgaqnization>
<name>org2</name>
<employee>
<first-name>abc2</first-name>
<last-name>def2</last-name>
<dob>19870909</dob>
</employee>
</orgaqnization>
<orgaqnization>
<name>org3</name>
<employee>
<first-name>abc3</first-name>
<last-name>def3</last-name>
<dob>19870909</dob>
</employee>
</orgaqnization>
</request>
Python:: Test.py
test.modifiy_query("Remove",tagname=".//first-name")
import xml.etree.ElementTree as query_xml
def modifiy_query(self,*args,**kwargs):
    root = query_tree.getroot()
    operation_type=args[0]
    tag_name=kwargs['tagname']
    try:
        if operation_type=="Remove":
            logger.info("Removing %s Tag from XML" % tag_name)
            root.remove(tag_name)
        elif operation_type=="Insert":
            logger.info("Inserting %s tag to xml" % tag_name)
        else:
            raise InvalidXMLOperationError("Operation " + operation_type + " is invalid")
    except InvalidXMLOperationError,e:
        logger.error("Invalid XML operation %s" % operation_type)
The error message (the flow may differ because I am running this code from another program):
File "Test.py", line 161, in <module>
    testsuite.scheduler()
File "Test.py", line 91, in scheduler
    self.launched_query_with("Without_date_range")
File "Test.py", line 55, in launched_query_with
    test.modifiy_query("Remove",tagname='.//first-name')
File "/home/XXX/YYYY/common.py", line 287, in modifiy_query
    parent.remove(child)
File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 337, in remove
    self._children.remove(element)
ValueError: list.remove(x): x not in list
Thanks,
Priyank Shah
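remove() takes an element as its parameter, not an XPath, and it only removes direct children of the element it is called on.
Instead of:
root.remove(tag_name)
you should find the matching elements and remove each one from its own parent:
for parent in root.findall(tag_name + "/.."):
    for element in parent.findall(tag_name.split("/")[-1]):
        parent.remove(element)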
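Because remove() only deletes direct children, elements nested deeper (like first-name inside employee) have to be removed through their own parent, otherwise you get the same "x not in list" ValueError. A minimal sketch on a trimmed copy of the request XML above; remove_everywhere is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

XML = """<request>
<orgaqnization><name>org1</name>
<employee><first-name>abc</first-name><last-name>def</last-name></employee>
</orgaqnization>
<orgaqnization><name>org2</name>
<employee><first-name>abc2</first-name><last-name>def2</last-name></employee>
</orgaqnization>
</request>"""


def remove_everywhere(root, tag):
    """Remove every element named `tag`, at any depth, and return the count."""
    removed = 0
    # list() takes a snapshot so the tree is not modified while being walked.
    for parent in list(root.iter()):
        for child in parent.findall(tag):  # a bare tag matches direct children only
            parent.remove(child)
            removed += 1
    return removed


root = ET.fromstring(XML)
print(remove_everywhere(root, "first-name"))  # prints 2
print(root.find(".//first-name"))             # prints None
```

Walking every element as a potential parent is simple but visits the whole tree; for large documents it may be worth narrowing the search to the known parent tag first.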
In Python 2.6 using ElementTree, what's a good way to fetch the XML (as a string) inside a particular element, like what you can do in HTML and javascript with innerHTML?
Here's a simplified sample of the XML node I am starting with:
<label attr="foo" attr2="bar">This is some text and a link in embedded HTML</label>
I'd like to end up with this string:
This is some text and a link in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of the children, but that gave me only the subnodes:
# returns only subnodes (e.g. and a link)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I can hack up a solution using regular expressions, but was hoping there'd be something less hacky than this:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))
How about:
from xml.etree import ElementTree as ET
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
root = ET.fromstring(xml)
def content(tag):
    return tag.text + ''.join(ET.tostring(e) for e in tag)
print content(root)
print content(root.find('child2'))
Resulting in:
start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here
here as well<sub2 /><sub3 />
This is based on the other solutions, but the other solutions did not work in my case (resulted in exceptions) and this one worked:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
def inner_xml(element: Element):
    return (element.text or '') + ''.join(ElementTree.tostring(e, 'unicode') for e in element)
Use it the same way as in Mark Tolonen's answer.
The following worked for me:
from xml.etree import ElementTree as etree
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
dom = etree.XML(xml)
(dom.text or '') + ''.join(map(etree.tostring, dom)) + (dom.tail or '')
# 'start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here'
dom.text or '' is used to get the text at the start of the root element. If there is no text, dom.text is None.
Note that the result is not valid XML, since valid XML must have exactly one root element.
Have a look at the ElementTree docs about mixed content.
Using Python 2.6.5, Ubuntu 10.04