Python3 parse XML into dictionary

It seems the original post was too vague, so I'm narrowing down the focus of this post. I have an XML file from which I want to pull values from specific branches, and I am having difficulty in understanding how to effectively navigate the XML paths. Consider the XML file below. There are several <mi> branches. I want to store the <r> value of certain branches, but not others. In this example, I want the <r> values of counter1 and counter3, but not counter2.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="Data.xsl" ?>
<!DOCTYPE mdc SYSTEM "Data.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<mfh>
<vn>TEST</vn>
<cbt>20140126234500.0+0000</cbt>
</mfh>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter1</mt>
<mv>
<moid>DEFAULT</moid>
<r>58</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter2</mt>
<mv>
<moid>DEFAULT</moid>
<r>100</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter3</mt>
<mv>
<moid>DEFAULT</moid>
<r>7</r>
</mv>
</mi>
</mdc>
From that I would like to build a tuple with the following:
('20140126234500.0+0000', 58, 7)
where 20140126234500.0+0000 is taken from <cbt>, 58 is taken from the <r> value of the <mi> element that has <mt>counter1</mt> and 7 is taken from the <mi> element that has <mt>counter3</mt>.
I would like to use xml.etree.cElementTree since it seems to be standard and should be more than capable for my purposes. But I am having difficulty in navigating the tree and extracting the values I need. Below is some of what I have tried.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='Data.xml')
root = tree.getroot()
for mi in root.iter('mi'):
    print(mi.tag)
    for mt in mi.findall("./mt") if mt.value == 'counter1':
        print(mi.find("./mv/r").value) #I know this is invalid syntax, but it's what I want to do :)
From a pseudo code standpoint, what I am wanting to do is:
find the <cbt> value and store it in the first position of the tuple.
find the <mi> element where <mt>counter1</mt> exists and store the <r> value in the second position of the tuple.
find the <mi> element where <mt>counter3</mt> exists and store the <r> value in the third position of the tuple.
I'm not clear when to use element.iter() or element.findall(). Also, I'm not having the best of luck with using XPath within the functions, or being able to extract the info I'm needing.
Thanks,
Rusty

Starting with:
import xml.etree.ElementTree as ET # cElementTree is deprecated since Python 3.3 and removed in 3.9
xml_data = """<?xml version="1.0"?> and the rest of your XML here"""
tree = ET.fromstring(xml_data) # or `ET.parse(<filename>).getroot()`
xml_dict = {}
Now tree holds the root element of the XML tree, and xml_dict will be the dictionary that collects the result.
# first get the key & val for 'cbt'
cbt_val = tree.find('mfh').find('cbt').text
xml_dict['cbt'] = cbt_val
The counters are in 'mi':
for elem in tree.findall('mi'):
    counter_name = elem.find('mt').text # key
    counter_val = elem.find('mv').find('r').text # value
    xml_dict[counter_name] = counter_val
At this point, xml_dict is:
>>> xml_dict
{'cbt': '20140126234500.0+0000', 'counter1': '58', 'counter2': '100', 'counter3': '7'}
Some shortening, though possibly not as readable: the body of the for elem in tree.findall('mi'): loop can be:
xml_dict[elem.find('mt').text] = elem.find('mv').find('r').text
# that combines the key/value extraction to one line
Or further, building the xml_dict can be done in just two lines with the counters first and cbt after:
xml_dict = {elem.find('mt').text: elem.find('mv').find('r').text for elem in tree.findall('mi')}
xml_dict['cbt'] = tree.find('mfh').find('cbt').text
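Edit:
From the docs: Element.findall() with a plain tag name finds only elements with that tag which are direct children of the current element (a path such as 'mfh/cbt' or './/r' can reach deeper).
find() works like findall() but returns only the first match, or None if nothing matches.
iter() iterates over the element and all elements below it recursively, optionally filtered by tag.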
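Since the question asks for a tuple rather than a dictionary, here is a minimal sketch of the last step. It assumes a trimmed copy of the XML from the question (declaration, stylesheet and DOCTYPE left out); the helper name build_tuple and the wanted argument are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML; only the parts the lookup needs.
XML_DATA = """<mdc>
  <mfh><vn>TEST</vn><cbt>20140126234500.0+0000</cbt></mfh>
  <mi><mt>counter1</mt><mv><moid>DEFAULT</moid><r>58</r></mv></mi>
  <mi><mt>counter2</mt><mv><moid>DEFAULT</moid><r>100</r></mv></mi>
  <mi><mt>counter3</mt><mv><moid>DEFAULT</moid><r>7</r></mv></mi>
</mdc>"""


def build_tuple(root, wanted):
    """Return (cbt, r of wanted[0], r of wanted[1], ...) -- hypothetical helper."""
    # Map every counter name to its <r> text; findtext accepts a path.
    counters = {mi.findtext("mt"): mi.findtext("mv/r") for mi in root.findall("mi")}
    # Keep only the requested counters, in the requested order, as integers.
    return (root.findtext("mfh/cbt"),) + tuple(int(counters[name]) for name in wanted)


root = ET.fromstring(XML_DATA)
result = build_tuple(root, ("counter1", "counter3"))
print(result)  # ('20140126234500.0+0000', 58, 7)
```

A KeyError from counters[name] means a requested counter is not in the file, which is probably what you want to hear about rather than silently skip.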

Related

How to find if there are empty attributes in XML?

Having a XML like this one (located in /home/user/):
<?xml version="1.0" ?>
<DataClient xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cnmc="http://www.example.com/Tipos_DataClient" xmlns="http://www.example.com/DataClient">
<PersonalData Operation="3" Date="2022-09-06">
<ExtendedData>
<Person Code="XXX" OtherCode="Y12354"/>
</ExtendedData>
<Home Type="Street" Num="10" Code="12003" Poblation="Imaginary street"/>
</PersonalData>
</DataClient>
How could I identify if the "Num" attribute is empty? And then generate a list of all those elements that have the "Num" empty...
I tried to count all those with "None" as value, but it always returns 0:
#! /usr/bin/python3
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
root = tree.getroot()
b = None
a = sum(1 for s in root.findall('./DataClient/PersonalData/ExtendedData/Num') if s.b)
print (a)

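Since Python's etree API maps attributes to dictionaries, consider dict.get to check for a specific attribute. Also, you need to use the namespaces argument of findall since the XML contains a default namespace. Note that DataClient is the root element, so the path starts at its children, and that a missing attribute comes back as None while an empty one (Num="") comes back as ''.
import xml.etree.ElementTree as ET
tree = ET.parse('/home/user/file.xml')
nmsp = {"doc": "http://www.example.com/DataClient"}
xpath = "./doc:PersonalData/doc:Home" # relative to the root element, DataClient
# count the Home elements whose Num is missing (None) or empty ('')
a = sum(1 for node in tree.findall(xpath, nmsp) if not node.attrib.get("Num"))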
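The question also asks for a list of the affected elements. A small sketch of that, assuming an inline sample where the second and third PersonalData entries are invented to show the empty and the missing case:

```python
import xml.etree.ElementTree as ET

# Sample data: only the first Home comes from the question, the rest is invented.
XML_DATA = """<DataClient xmlns="http://www.example.com/DataClient">
  <PersonalData Operation="3" Date="2022-09-06">
    <Home Type="Street" Num="10" Code="12003"/>
  </PersonalData>
  <PersonalData Operation="3" Date="2022-09-07">
    <Home Type="Street" Num="" Code="12004"/>
  </PersonalData>
  <PersonalData Operation="3" Date="2022-09-08">
    <Home Type="Street" Code="12005"/>
  </PersonalData>
</DataClient>"""

NS = {"doc": "http://www.example.com/DataClient"}
root = ET.fromstring(XML_DATA)
homes = root.findall("./doc:PersonalData/doc:Home", NS)

# Element.get returns None for a missing attribute and '' for an empty one.
empty_num = [home for home in homes if home.get("Num") == ""]
missing_num = [home for home in homes if home.get("Num") is None]

print([home.get("Code") for home in empty_num])    # ['12004']
print([home.get("Code") for home in missing_num])  # ['12005']
```

Keeping the two lists apart makes it easy to decide later whether a missing Num should count as empty.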

Get children elements of multiple instances of the same name tag using ElementTree

I have an xml file looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<boundary_conditions>
<rot>
<rot_instance>
<name>BC_1</name>
<rpm>200</rpm>
<parts>
<name>rim_FL</name>
<name>tire_FL</name>
<name>disk_FL</name>
<name>center_FL</name>
</parts>
</rot_instance>
<rot_instance>
<name>BC_2</name>
<rpm>100</rpm>
<parts>
<name>tire_FR</name>
<name>disk_FR</name>
</parts>
</rot_instance>
</rot>
</boundary_conditions>
</data>
I actually know how to extract data corresponding to each instance. So I can do this for the names tag as follows:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
names= tree.findall('.//boundary_conditions/rot/rot_instance/name')
for val in names:
    print(val.text)
which gives me:
BC_1
BC_2
But if I do the same thing for the parts tag:
names= tree.findall('.//boundary_conditions/rot/rot_instance/parts/name')
for val in names:
    print(val.text)
It will give me:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
Which combines all data corresponding to parts/name together. I want output that gives me the 'parts' sub-element for each instance as separate lists. So this is what I want to get:
instance_BC_1 = ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
instance_BC_2 = ['tire_FR', 'disk_FR']
Any help is appreciated,
Thanks.

You've got to first find all parts elements, then from each parts element find all name tags.
Take a look:
parts = tree.findall('.//boundary_conditions/rot/rot_instance/parts')
for part in parts:
    for val in part.findall("name"):
        print(val.text)
    print()
instance_BC_1 = [val.text for val in parts[0].findall("name")]
instance_BC_2 = [val.text for val in parts[1].findall("name")]
print(instance_BC_1)
print(instance_BC_2)
Output:
rim_FL
tire_FL
disk_FL
center_FL
tire_FR
disk_FR
['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
['tire_FR', 'disk_FR']
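If the number of instances is not known in advance, indexing parts[0] and parts[1] by hand does not scale. One possible variation is to key the lists by each instance's own name. This is only a sketch; it assumes the XML from the question with the missing closing tags added, and parts_by_instance is a name chosen for the example.

```python
import xml.etree.ElementTree as ET

# The question's XML, condensed, with </rot> and </boundary_conditions> closed.
XML_DATA = """<data><boundary_conditions><rot>
  <rot_instance><name>BC_1</name><rpm>200</rpm>
    <parts><name>rim_FL</name><name>tire_FL</name><name>disk_FL</name><name>center_FL</name></parts>
  </rot_instance>
  <rot_instance><name>BC_2</name><rpm>100</rpm>
    <parts><name>tire_FR</name><name>disk_FR</name></parts>
  </rot_instance>
</rot></boundary_conditions></data>"""

root = ET.fromstring(XML_DATA)

# findtext("name") only looks at direct children, so it returns the instance
# name (BC_1, BC_2) and not one of the names nested under <parts>.
parts_by_instance = {
    inst.findtext("name"): [part.text for part in inst.findall("parts/name")]
    for inst in root.findall("./boundary_conditions/rot/rot_instance")
}

print(parts_by_instance["BC_1"])  # ['rim_FL', 'tire_FL', 'disk_FL', 'center_FL']
print(parts_by_instance["BC_2"])  # ['tire_FR', 'disk_FR']
```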

Extract the Kth tag data in XML using ElementTree

Below is my current XML file (output.xml), and I hope that I can get its tag value using Python.
It is an XML file with namespace.
<data xmlns="urn:ietf:params:ns:netconf:base:1.0">
<interfaces xmlns="http://namespace.net">
<interface>
<name>Interface0</name>
</interface>
<interface>
<name>Interface1</name>
</interface>
<interface>
<name>Interface2</name>
</interface>
</interfaces>
</data>
And...below is my Python code to extract the value of tag <interface>:
from xml.etree import ElementTree as ET # cElementTree was removed in Python 3.9
tree = ET.ElementTree(file="output.xml")
root = tree.getroot()
nsmap = {'':'http://namespace.net'} # namespace
for name in root.iterfind('./interfaces/interface/name', namespaces=nsmap):
    print(name.text)
My question is:
Is it possible to only fetch "Interface0", "Interface1", or "Interface2"?
If there are multiple <interface> tags, can I only fetch the values of the tags within the kth <interface>?

Use enumerate to get the index of each match alongside the element.
for index, name in enumerate(root.iterfind('./interfaces/interface/name', namespaces=nsmap)):
    # if index <= kth:
    if index == 0: # or 1, 2, ...
        print(name.text)

If Interface0, Interface1, ..., InterfaceN always come in sorted order, you don't need the sorted() call; just remove it in that case.
To pick the Kth value:
import xml.etree.ElementTree as ET
tree = ET.parse("output.xml")
root = tree.getroot()
nsmap = {'': 'http://namespace.net'} # same namespace map as in the question
k = 1 # zero-based index; must be smaller than the number of <interface> elements
names = sorted(name.text for name in root.findall("./interfaces/interface/name", namespaces=nsmap))
print(names[k])
To get the first K values:
import xml.etree.ElementTree as ET
tree = ET.parse("output.xml")
root = tree.getroot()
nsmap = {'': 'http://namespace.net'} # same namespace map as in the question
k = 2 # number of values to keep, counted from the start
names = sorted(name.text for name in root.findall("./interfaces/interface/name", namespaces=nsmap))
print(names[:k])
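For the second part of the question (all values inside the kth interface, not just its name), one option is to select the interface elements themselves and index the resulting list. This is a rough sketch; the prefix ns and the helper kth_interface_values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XML_DATA = """<data xmlns="urn:ietf:params:ns:netconf:base:1.0">
  <interfaces xmlns="http://namespace.net">
    <interface><name>Interface0</name></interface>
    <interface><name>Interface1</name></interface>
    <interface><name>Interface2</name></interface>
  </interfaces>
</data>"""

NS = {"ns": "http://namespace.net"}  # "ns" is an arbitrary prefix


def kth_interface_values(root, k):
    """Return {local tag name: text} for the kth <interface> (zero-based), or None."""
    interfaces = root.findall("./ns:interfaces/ns:interface", NS)
    if not 0 <= k < len(interfaces):
        return None
    # Tags look like '{http://namespace.net}name'; keep only the local part.
    return {child.tag.split("}", 1)[-1]: child.text for child in interfaces[k]}


root = ET.fromstring(XML_DATA)
print(kth_interface_values(root, 1))  # {'name': 'Interface1'}
print(kth_interface_values(root, 5))  # None
```

Returning None for an out-of-range k is a design choice of this sketch; raising IndexError would be just as reasonable.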

Merge two XML files by matching elements by attribute value

I have two XML files that I'm trying to merge. I looked at other previous questions, but I don't feel like I can solve my problem from reading those. What I think makes my situation unique is that I have to find elements by attribute value and then merge to the opposite file.
I have two files. One is an English translation catalog and the second is a Japanese translation catalog. Please see below.
In the code below you'll see the XML has three elements which I will be merging children on - MessageCatalogueEntry, MessageCatalogueFormEntry, and MessageCatalogueFormItemEntry. I have hundreds of files and each file has thousands of lines. There may be more elements than the three I just listed, but I know for sure that all the elements have a "key" attribute.
My plan:
Iterate through File 1 and create a list of all the values of the "key" attribute.
In this example, the list would be key_values = [321, 260, 320]
Next, I'll go through the key_values list one by one.
I'll search File 1 for an element with attribute key=321.
Next, grab the child of the element with key=321 from File 1.
Next, in File 2, find the element with key=321 and add the child element I previously grabbed from File 1.
Next I'll continue the same process looping through the key_values list.
Next, I'll write the new xml root to a file being careful to keep the utf8 encoding.
File 1:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
File 2:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue[]>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
I'm having trouble even grabbing elements, never mind grabbing them by key value. For example, I've been playing with the elementtree library and I wrote this code hoping to get just the MessageCatalogueEntry elements, but I'm only getting their children:
from xml.etree import ElementTree as et
tree_japanese = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue_JA.xml')
root_japanese = tree_japanese.getroot()
MC_japanese = root_japanese.findall("MessageCatalogue")
for x in MC_japanese:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
tree_english = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue.xml')
root_english = tree_english.getroot()
MC_english = root_english.findall("MessageCatalogue")
for x in MC_english:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
Any help would be appreciated. I've been at this for a few work days now and I'm not any closer to finishing than I was when I first started!

Actually, you are getting the MessageCatalogueEntry elements. The problem is in the print statement. An element acts like a list of its children, so m[0] is the first child of the MessageCatalogueEntry. In
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
    print et.tostring(m[0], encoding='utf8')
change the print to print et.tostring(m, encoding='utf8') to see the right element.
I personally prefer lxml to elementtree. Assuming you want to associate entries by the 'key' attribute, you could use xpath to index one of the docs and then pull its entries into the other doc.
import lxml.etree
tree_english = lxml.etree.parse('english.xml')
tree_japanese = lxml.etree.parse('japanese.xml')
# index the japanese catalog
j_index = {}
for catalog in tree_japanese.xpath('MessageCatalogue/*[@key]'):
    j_index[catalog.get('key')] = catalog
# find catalog entries in english and merge the japanese
for catalog in tree_english.xpath('MessageCatalogue/*[@key]'):
    j_catalog = j_index.get(catalog.get('key'))
    if j_catalog is not None:
        print('found match')
        # lxml moves a child on append, so iterate over a copy of the children
        for child in list(j_catalog):
            print('add one')
            catalog.append(child)
print(lxml.etree.tostring(tree_english, pretty_print=True, encoding='utf8').decode('utf8'))
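If you would rather stay with the standard library, the same index-then-merge idea can be sketched with xml.etree.ElementTree, which supports the [@key] predicate in its limited XPath dialect. The inline documents below are cut-down stand-ins for the two files, the key="999" entry is invented to show an unmatched key, and merge_by_key is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ENGLISH = """<PackageEntry><MessageCatalogue name="en">
  <MessageCatalogueEntry key="321"><MessageCatalogueEntry_loc locale="" message="active"/></MessageCatalogueEntry>
  <MessageCatalogueFormEntry key="260"><MessageCatalogueFormEntry_loc locale="" title="Spider Configuration"/></MessageCatalogueFormEntry>
</MessageCatalogue></PackageEntry>"""

JAPANESE = """<PackageEntry><MessageCatalogue name="">
  <MessageCatalogueEntry key="321"><MessageCatalogueEntry_loc locale="ja" message="アクティブ"/></MessageCatalogueEntry>
  <MessageCatalogueFormEntry key="260"><MessageCatalogueFormEntry_loc locale="ja" title="スパイダー設定"/></MessageCatalogueFormEntry>
  <MessageCatalogueEntry key="999"><MessageCatalogueEntry_loc locale="ja" message="unused"/></MessageCatalogueEntry>
</MessageCatalogue></PackageEntry>"""


def merge_by_key(target_root, source_root):
    """Append the children of every keyed source entry to the target entry
    with the same key. Returns the number of children appended."""
    index = {entry.get("key"): entry
             for entry in source_root.findall("MessageCatalogue/*[@key]")}
    appended = 0
    for entry in target_root.findall("MessageCatalogue/*[@key]"):
        match = index.get(entry.get("key"))
        if match is not None:
            children = list(match)
            entry.extend(children)
            appended += len(children)
    return appended


english_root = ET.fromstring(ENGLISH)
japanese_root = ET.fromstring(JAPANESE)
appended = merge_by_key(english_root, japanese_root)
print(appended)  # 2
print(ET.tostring(english_root, encoding="unicode"))
```

To save the result as UTF-8, ET.ElementTree(english_root).write(path, encoding="utf-8", xml_declaration=True) should do. Be aware that ElementTree does not keep the DOCTYPE line, so it would have to be written back separately if you need it.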

Element Tree: How to parse subElements of child nodes

I have an XML tree, which I'd like to parse using Elementtree. My XML looks something like
<?xml version="1.0" encoding="UTF-8"?>
<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
<Ack>Success</Ack>
<Version>857</Version>
<Build>E857_INTL_APIXO_16643800_R1</Build>
<PaginationResult>
<TotalNumberOfPages>1</TotalNumberOfPages>
<TotalNumberOfEntries>2</TotalNumberOfEntries>
</PaginationResult>
<HasMoreOrders>false</HasMoreOrders>
<OrderArray>
<Order>
<OrderID>221362908003-1324471823012</OrderID>
<CheckoutStatus>
<eBayPaymentStatus>NoPaymentFailure</eBayPaymentStatus>
<LastModifiedTime>2014-02-03T12:08:51.000Z</LastModifiedTime>
<PaymentMethod>PaisaPayEscrow</PaymentMethod>
<Status>Complete</Status>
<IntegratedMerchantCreditCardEnabled>false</IntegratedMerchantCreditCardEnabled>
</CheckoutStatus>
</Order>
<Order> ...
</Order>
<Order> ...
</Order>
</OrderArray>
</GetOrdersResponse>
I want to parse the 6th child of the XML root (<OrderArray>). I am able to get the value of subelements by index. E.g. if I want the OrderID of the first order, I can use root[5][0][0].text. But I would like to get the values of subelements by name. I tried the following code, but it does not print anything:
tree = ET.parse('response.xml')
root = tree.getroot()
for child in root:
    try:
        for ids in child.find('Order').find('OrderID'):
            print ids.text
    except:
        continue
Could someone please help me with this? Thanks

Since the XML document has a namespace declaration (xmlns="urn:ebay:apis:eBLBaseComponents"), you have to use universal names when referring to elements in the document. For example, you need {urn:ebay:apis:eBLBaseComponents}OrderID instead of just OrderID.
This snippet prints all OrderIDs in the document:
from xml.etree import ElementTree as ET
NS = "urn:ebay:apis:eBLBaseComponents"
tree = ET.parse('response.xml')
for elem in tree.iter("*"): # Use tree.getiterator("*") in Python 2.5 and 2.6
    if elem.tag == '{%s}OrderID' % NS:
        print(elem.text)
See http://effbot.org/zone/element-namespaces.htm for details about ElementTree and namespaces.
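As an alternative to spelling out universal names, find, findall and findtext accept a namespaces mapping, so the path can use a short prefix instead. A minimal sketch, assuming a shortened copy of the response; the prefix e and the second order's ID are made up for the example.

```python
import xml.etree.ElementTree as ET

XML_DATA = """<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
  <Ack>Success</Ack>
  <OrderArray>
    <Order>
      <OrderID>221362908003-1324471823012</OrderID>
      <CheckoutStatus><Status>Complete</Status></CheckoutStatus>
    </Order>
    <Order>
      <OrderID>hypothetical-second-id</OrderID>
    </Order>
  </OrderArray>
</GetOrdersResponse>"""

NS = {"e": "urn:ebay:apis:eBLBaseComponents"}  # "e" is an arbitrary prefix

root = ET.fromstring(XML_DATA)
orders = [
    (
        order.findtext("e:OrderID", namespaces=NS),
        # findtext returns the default instead of failing when nothing matches
        order.findtext("e:CheckoutStatus/e:Status", default="unknown", namespaces=NS),
    )
    for order in root.findall("e:OrderArray/e:Order", NS)
]
print(orders)
# [('221362908003-1324471823012', 'Complete'), ('hypothetical-second-id', 'unknown')]
```

Because findtext takes a whole path and a default, there is no chain of find calls that could hit None halfway.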

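Try to avoid chaining your finds. If the first find does not match anything, it returns None, and calling find on None raises an AttributeError. (For this document the tag names also need the namespace, as explained in the answer above.)
for child in root:
    order = child.find('Order')
    if order is not None:
        ids = order.find('OrderID')
        if ids is not None:
            print(ids.text)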

You can find the OrderArray first and then iterate over its children by name (for this document the names have to be namespace-qualified):
NS = '{urn:ebay:apis:eBLBaseComponents}'
tree = ET.parse('response.xml')
root = tree.getroot()
order_array = root.find(NS + 'OrderArray')
for order in order_array.findall(NS + 'Order'):
    order_id_element = order.find(NS + 'OrderID')
    if order_id_element is not None:
        print(order_id_element.text)
A side note: never use a bare except: continue. It hides every exception you get and makes debugging really hard.
