Element Tree: How to parse subElements of child nodes - python

I have an XML tree that I'd like to parse using ElementTree. My XML looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
<Ack>Success</Ack>
<Version>857</Version>
<Build>E857_INTL_APIXO_16643800_R1</Build>
<PaginationResult>
<TotalNumberOfPages>1</TotalNumberOfPages>
<TotalNumberOfEntries>2</TotalNumberOfEntries>
</PaginationResult>
<HasMoreOrders>false</HasMoreOrders>
<OrderArray>
<Order>
<OrderID>221362908003-1324471823012</OrderID>
<CheckoutStatus>
<eBayPaymentStatus>NoPaymentFailure</eBayPaymentStatus>
<LastModifiedTime>2014-02-03T12:08:51.000Z</LastModifiedTime>
<PaymentMethod>PaisaPayEscrow</PaymentMethod>
<Status>Complete</Status>
<IntegratedMerchantCreditCardEnabled>false</IntegratedMerchantCreditCardEnabled>
</CheckoutStatus>
</Order>
<Order> ...
</Order>
<Order> ...
</Order>
</OrderArray>
</GetOrdersResponse>
I want to parse the 6th child of the XML root (OrderArray). I am able to get the values of subelements by index. E.g. if I want the OrderID of the first order, I can use root[5][0][0].text. But I would like to get the values of subelements by name. I tried the following code, but it does not print anything:
tree = ET.parse('response.xml')
root = tree.getroot()
for child in root:
    try:
        for ids in child.find('Order').find('OrderID'):
            print ids.text
    except:
        continue
Could someone please help me with this? Thanks.

Since the XML document has a namespace declaration (xmlns="urn:ebay:apis:eBLBaseComponents"), you have to use universal names when referring to elements in the document. For example, you need {urn:ebay:apis:eBLBaseComponents}OrderID instead of just OrderID.
This snippet prints all OrderIDs in the document:
from xml.etree import ElementTree as ET
NS = "urn:ebay:apis:eBLBaseComponents"
tree = ET.parse('response.xml')
for elem in tree.iter("*"):  # Use tree.getiterator("*") in Python 2.5 and 2.6
    if elem.tag == '{%s}OrderID' % NS:
        print elem.text
See http://effbot.org/zone/element-namespaces.htm for details about ElementTree and namespaces.
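As an alternative to comparing fully qualified tags by hand, find() and findall() accept a prefix-to-URI mapping. The following is a minimal Python 3 sketch, not the answerer's code: the inline SAMPLE string stands in for response.xml, the second OrderID is made up for illustration, and the prefix "ebay" is an arbitrary local name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for response.xml; the second OrderID is invented.
SAMPLE = """<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
  <Ack>Success</Ack>
  <OrderArray>
    <Order><OrderID>221362908003-1324471823012</OrderID></Order>
    <Order><OrderID>110000000000-0000000000001</OrderID></Order>
  </OrderArray>
</GetOrdersResponse>"""

# The prefix is only a local alias; what matters is the namespace URI.
NS = {"ebay": "urn:ebay:apis:eBLBaseComponents"}

root = ET.fromstring(SAMPLE)
order_ids = [
    elem.text
    for elem in root.findall("ebay:OrderArray/ebay:Order/ebay:OrderID", NS)
]
print(order_ids)
```

The path is relative to the root element, so it starts at OrderArray rather than at GetOrdersResponse.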

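Try to avoid chaining your finds. If the first find does not find anything, it returns None, and calling find on None raises an AttributeError. Check each result instead (the tag names must also carry the document's namespace):
NS = '{urn:ebay:apis:eBLBaseComponents}'
for child in root:
    order = child.find(NS + 'Order')
    if order is not None:
        ids = order.find(NS + 'OrderID')
        if ids is not None:
            print ids.text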
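Another way to sidestep the None checks is findtext(), which returns a default value when the element is missing. This is a small Python 3 sketch under assumed sample data (the order ID "A-1" and the empty second Order are made up to show both outcomes):

```python
import xml.etree.ElementTree as ET

NS = {"ebay": "urn:ebay:apis:eBLBaseComponents"}

# Hypothetical sample: the second Order has no OrderID child.
root = ET.fromstring(
    '<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">'
    "<OrderArray><Order><OrderID>A-1</OrderID></Order><Order/></OrderArray>"
    "</GetOrdersResponse>"
)

# findtext() returns the default instead of None when nothing matches,
# so there is no intermediate result to check.
ids = [
    order.findtext("ebay:OrderID", default="<missing>", namespaces=NS)
    for order in root.iterfind("ebay:OrderArray/ebay:Order", NS)
]
print(ids)  # ['A-1', '<missing>']
```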

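You can find the OrderArray first and then iterate over its children by name. Because the document declares a default namespace, pass a prefix mapping to find and findall:
tree = ET.parse('response.xml')
root = tree.getroot()
ns = {'e': 'urn:ebay:apis:eBLBaseComponents'}
order_array = root.find('e:OrderArray', ns)
for order in order_array.findall('e:Order', ns):
    order_id_element = order.find('e:OrderID', ns)
    if order_id_element is not None:
        print order_id_element.text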
A side note. Never ever use except: continue. It hides any exception you get and makes debugging really hard.

Related

How to find if there are empty attributes in XML?

Having a XML like this one (located in /home/user/):
<?xml version="1.0" ?>
<DataClient xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cnmc="http://www.example.com/Tipos_DataClient" xmlns="http://www.example.com/DataClient">
<PersonalData Operation="3" Date="2022-09-06">
<ExtendedData>
<Person Code="XXX" OtherCode="Y12354"/>
</ExtendedData>
<Home Type="Street" Num="10" Code="12003" Poblation="Imaginary street"/>
</PersonalData>
</DataClient>
How could I identify if the "Num" attribute is empty? And then generate a list of all those elements that have the "Num" empty...
I tried to count all those with "None" as value, but it always returns 0:
#! /usr/bin/python3
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
root = tree.getroot()
b = None
a = sum(1 for s in root.findall('./DataClient/PersonalData/ExtendedData/Num') if s.b)
print (a)
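Since ElementTree exposes attributes as a dictionary, use dict.get to check for a specific attribute. You also need to pass the namespaces argument to findall, because the XML declares a default namespace. Two details matter here: the path is evaluated relative to the root element (DataClient), so it must not name the root again, and an empty attribute (Num="") yields an empty string rather than None, so test with not to cover both the empty and the missing case.
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
nmsp = {"doc": "http://www.example.com/DataClient"}
xpath = "./doc:PersonalData/doc:Home"
homes = [node for node in tree.findall(xpath, nmsp) if not node.attrib.get("Num")]
a = len(homes)
print(a)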
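If it matters whether Num is present but empty or absent altogether, the two cases can be separated. The sketch below is only an illustration in Python 3: the inline SAMPLE and its extra Home elements (codes 12004 and 12005) are assumptions added to exercise both cases.

```python
import xml.etree.ElementTree as ET

# Assumed sample: one Home with a value, one with Num="", one without Num.
SAMPLE = """<DataClient xmlns="http://www.example.com/DataClient">
  <PersonalData Operation="3" Date="2022-09-06">
    <Home Type="Street" Num="10" Code="12003"/>
    <Home Type="Street" Num="" Code="12004"/>
    <Home Type="Avenue" Code="12005"/>
  </PersonalData>
</DataClient>"""

ns = {"doc": "http://www.example.com/DataClient"}
root = ET.fromstring(SAMPLE)
homes = root.findall("doc:PersonalData/doc:Home", ns)

# Attribute exists but holds an empty string.
present_but_empty = [h.get("Code") for h in homes if h.get("Num") == ""]
# Attribute is not there at all (get() would return None).
missing = [h.get("Code") for h in homes if "Num" not in h.attrib]

print(present_but_empty)  # ['12004']
print(missing)            # ['12005']
```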

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</rot>
</boundary_conditions>
</data>
I actually know how to extract data corresponding to each instance. So I can do this for the names tag as follows:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
names= tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
    print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names= tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
    print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
Which combines all data corresponding to parts/name together. I want output that gives me the 'parts' sub-element for each instance as separate lists. So this is what I want to get:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.
You've got to first find all parts elements, then from each parts element find all name tags.
Take a look:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
    for val in part.findall("name"):
        print(val.text)
    print()
instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']
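When the number of instances is not known in advance, indexing parts[0] and parts[1] does not scale. One possible variation, sketched here with the sample data inlined as a string (an assumption, since the original reads file.xml), keys each parts list by the instance's own name:

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample data, with the closing tags completed.
SAMPLE = """<data><boundary_conditions><rot>
  <rot_instance><name>BC_1</name><rpm>200</rpm>
    <parts><name>rim_FL</name><name>tire_FL</name>
           <name>disk_FL</name><name>center_FL</name></parts>
  </rot_instance>
  <rot_instance><name>BC_2</name><rpm>100</rpm>
    <parts><name>tire_FR</name><name>disk_FR</name></parts>
  </rot_instance>
</rot></boundary_conditions></data>"""

root = ET.fromstring(SAMPLE)

# findtext("name") only looks at direct children, so it picks the
# instance name and not the names nested under <parts>.
parts_by_instance = {
    inst.findtext("name"): [n.text for n in inst.findall("parts/name")]
    for inst in root.iterfind("boundary_conditions/rot/rot_instance")
}
print(parts_by_instance)
```

This gives a dictionary such as {'BC_1': [...], 'BC_2': [...]} instead of separately named variables.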

Python3 parse XML into dictionary

It seems the original post was too vague, so I'm narrowing down the focus of this post. I have an XML file from which I want to pull values from specific branches, and I am having difficulty in understanding how to effectively navigate the XML paths. Consider the XML file below. There are several <mi> branches. I want to store the <r> value of certain branches, but not others. In this example, I want the <r> values of counter1 and counter3, but not counter2.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="Data.xsl" ?>
<!DOCTYPE mdc SYSTEM "Data.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<mfh>
<vn>TEST</vn>
<cbt>20140126234500.0+0000</cbt>
</mfh>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter1</mt>
<mv>
<moid>DEFAULT</moid>
<r>58</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter2</mt>
<mv>
<moid>DEFAULT</moid>
<r>100</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter3</mt>
<mv>
<moid>DEFAULT</moid>
<r>7</r>
</mv>
</mi>
</mdc>
From that I would like to build a tuple with the following:
('20140126234500.0+0000', 58, 7)
where 20140126234500.0+0000 is taken from <cbt>, 58 is taken from the <r> value of the <mi> element that has <mt>counter1</mt> and 7 is taken from the <mi> element that has <mt>counter3</mt>.
I would like to use xml.etree.cElementTree since it seems to be standard and should be more than capable for my purposes. But I am having difficulty in navigating the tree and extracting the values I need. Below is some of what I have tried.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='Data.xml')
root = tree.getroot()
for mi in root.iter('mi'):
    print(mi.tag)
    for mt in mi.findall("./mt") if mt.value == 'counter1':
        print(mi.find("./mv/r").value) #I know this is invalid syntax, but it's what I want to do :)
From a pseudo code standpoint, what I am wanting to do is:
find the <cbt> value and store it in the first position of the tuple.
find the <mi> element where <mt>counter1</mt> exists and store the <r> value in the second position of the tuple.
find the <mi> element where <mt>counter3</mt> exists and store the <r> value in the third position of the tuple.
I'm not clear when to use element.iter() or element.findall(). Also, I'm not having the best of luck with using XPath within the functions, or being able to extract the info I'm needing.
Thanks,
Rusty
Starting with:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
xml_data = """<?xml version="1.0"?> and the rest of your XML here"""
tree = ET.fromstring(xml_data)  # returns the root element; or use ET.parse(<filename>).getroot()
xml_dict = {}
Now tree holds the root element of the parsed XML, and xml_dict is the dictionary that will collect the results.
# first get the key & val for 'cbt'
cbt_val = tree.find('mfh').find('cbt').text
xml_dict['cbt'] = cbt_val
The counters are in 'mi':
for elem in tree.findall('mi'):
    counter_name = elem.find('mt').text  # key
    counter_val = elem.find('mv').find('r').text  # value
    xml_dict[counter_name] = counter_val
At this point, xml_dict is:
>>> xml_dict
{'counter2': '100', 'counter1': '58', 'cbt': '20140126234500.0+0000', 'counter3': '7'}
This can be shortened, though possibly at the cost of readability. The body of the for elem in tree.findall('mi'): loop can be:
xml_dict[elem.find('mt').text] = elem.find('mv').find('r').text
# that combines the key/value extraction to one line
Or further, building the xml_dict can be done in just two lines with the counters first and cbt after:
xml_dict = {elem.find('mt').text: elem.find('mv').find('r').text for elem in tree.findall('mi')}
xml_dict['cbt'] = tree.find('mfh').find('cbt').text
Edit:
From the docs: Element.findall() with a plain tag name matches only direct children of the current element (a path such as './/r' searches deeper).
find() returns only the first match, or None if there is none.
iter() walks the element and all of its descendants recursively.
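To get from the dictionary approach to the tuple the question asks for, one option is to collect the counters with findtext() and then pick the wanted ones in order. This is a sketch under assumptions: the XML is inlined without the DOCTYPE and stylesheet lines, and the name wanted is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Inline, simplified copy of the sample document.
SAMPLE = """<mdc>
  <mfh><vn>TEST</vn><cbt>20140126234500.0+0000</cbt></mfh>
  <mi><mt>counter1</mt><mv><moid>DEFAULT</moid><r>58</r></mv></mi>
  <mi><mt>counter2</mt><mv><moid>DEFAULT</moid><r>100</r></mv></mi>
  <mi><mt>counter3</mt><mv><moid>DEFAULT</moid><r>7</r></mv></mi>
</mdc>"""

root = ET.fromstring(SAMPLE)
wanted = ("counter1", "counter3")  # hypothetical selection, in output order

# Map every counter name to its integer <r> value.
values = {mi.findtext("mt"): int(mi.findtext("mv/r")) for mi in root.iter("mi")}

result = (root.findtext("mfh/cbt"),) + tuple(values[name] for name in wanted)
print(result)  # ('20140126234500.0+0000', 58, 7)

# A single counter can also be selected with a child-text predicate.
print(root.findtext("mi[mt='counter1']/mv/r"))  # 58
```

The predicate form is handy for one-off lookups; the dictionary avoids re-scanning the tree when several counters are needed.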

Where should I be looking for reverse tree traversal of an lxml.etree?

I'm not sure of the name of the exact type of traversal I want to do, but basically I want to read the document element by element in reverse order from the current element.
The iterdescendants() method doesn't seem to do anything, and the iterancestors() method doesn't walk into the subelements, it just steps up and out, if you know what I mean.
Maybe something like this?
import lxml.etree as et
xmldata = """
<data>
<mango>1</mango>
<kiwi>2</kiwi>
<banana>3</banana>
<plum>4</plum>
</data>
"""
tree = et.fromstring(xmldata)
el = tree.find('plum')
print el.text
while True:
    el = el.getprevious()
    if el is None:
        break
    print el.text
Result:
4
3
2
1
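Note that getprevious() only steps through preceding siblings. If the goal is to walk backwards through every element that comes earlier in the document, including nested ones, one possible approach is to materialise the document order once and reverse the slice before the starting element. The sketch below uses the standard library's xml.etree.ElementTree so it runs without extra packages; the same calls (iter, find) should also be available on lxml.etree elements. The nested <seed> element is an addition to the sample to show that sub-elements are visited.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data><mango>1</mango><kiwi><seed>2</seed></kiwi>"
    "<banana>3</banana><plum>4</plum></data>"
)
start = root.find("plum")

# iter() yields elements in document order: data, mango, kiwi, seed, banana, plum
in_document_order = list(root.iter())
position = next(i for i, el in enumerate(in_document_order) if el is start)

# Everything before the start element, nearest first. This includes
# ancestors (here: data), since they open earlier in the document.
walked_back = [el.tag for el in reversed(in_document_order[:position])]
print(walked_back)  # ['banana', 'seed', 'kiwi', 'mango', 'data']
```

For very large documents, building the full list costs memory; this is a trade-off of the sketch, not a recommendation over a streaming approach.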

Empty XML element handling in Python

I'm puzzled by minidom parser handling of empty element, as shown in following code section.
import xml.dom.minidom
doc = xml.dom.minidom.parseString('<value></value>')
print doc.firstChild.nodeValue.__repr__()
# Out: None
print doc.firstChild.toxml()
# Out: <value/>
doc = xml.dom.minidom.Document()
v = doc.appendChild(doc.createElement('value'))
v.appendChild(doc.createTextNode(''))
print v.firstChild.nodeValue.__repr__()
# Out: ''
print doc.firstChild.toxml()
# Out: <value></value>
How can I get consistent behavior? I'd like to receive empty string as value of empty element (which IS what I put in XML structure in the first place).
Cracking open xml.dom.minidom and searching for "/>", we find this:
# Method of the Element(Node) class.
def writexml(self, writer, indent="", addindent="", newl=""):
    # [snip]
    if self.childNodes:
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))
We can deduce from this that the short-end-tag form only occurs when childNodes is an empty list. Indeed, this seems to be true:
>>> doc = Document()
>>> v = doc.appendChild(doc.createElement('v'))
>>> v.toxml()
'<v/>'
>>> v.childNodes
[]
>>> v.appendChild(doc.createTextNode(''))
<DOM Text node "''">
>>> v.childNodes
[<DOM Text node "''">]
>>> v.toxml()
'<v></v>'
As pointed out by Lloyd, the XML spec makes no distinction between the two. If your code does make the distinction, that means you need to rethink how you want to serialize your data.
xml.dom.minidom simply displays something differently because it's easier to code. You can, however, get consistent output. Simply inherit the Element class and override the toxml method such that it will print out the short-end-tag form when there are no child nodes with non-empty text content. Then monkeypatch the module to use your new Element class.
value = thing.firstChild.nodeValue or ''
The XML spec does not distinguish these two cases.
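Since both forms mean the same thing, the reading side can normalise them. Here is a minimal Python 3 sketch of such a normalisation; the helper name element_text is hypothetical and not part of minidom.

```python
import xml.dom.minidom


def element_text(element):
    """Join the element's direct text/CDATA children; '' when it has none."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    )


# Case 1: parsed from '<value></value>' (the element has no child nodes).
parsed = xml.dom.minidom.parseString("<value></value>").documentElement

# Case 2: built by hand with an explicit empty text node.
built_doc = xml.dom.minidom.Document()
built = built_doc.appendChild(built_doc.createElement("value"))
built.appendChild(built_doc.createTextNode(""))

print(repr(element_text(parsed)), repr(element_text(built)))  # '' ''
```

Unlike firstChild.nodeValue, the helper does not fail when the element has no children at all, because it iterates over childNodes instead of dereferencing the first one.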
