Python - Deep XML file for loop

I am working with an XML file that looks like the code below; the real one has many more spreekbeurt sessions, but I trimmed it to keep it readable. My goal is to get the text of the voorvoegsels and achternaam elements from every spreekbeurt session.
<?xml version="1.0" encoding="utf-8"?>
<officiele-publicatie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://technische-documentatie.oep.overheid.nl/schema/op-xsd-2012-2">
<metadata>
<meta name="OVERHEIDop.externMetadataRecord" scheme="" content="https://zoek.officielebekendmakingen.nl/h-tk-20122013-4-2/metadata.xml" />
</metadata>
<handelingen>
<agendapunt>
<spreekbeurt nieuw="ja">
<spreker>
<voorvoegsels>De heer</voorvoegsels>
<naam>
<achternaam>Marcouch</achternaam>
</naam> (<politiek>PvdA</politiek>):</spreker>
<tekst status="goed">
<al>Sample Text</al>
</tekst>
</spreekbeurt>
</agendapunt>
</handelingen>
</officiele-publicatie>
I use a for loop to loop through all the spreekbeurt elements in my XML file. But how do I print out the voorvoegsels and achternaam for every spreekbeurt in my XML file?
import xml.etree.ElementTree as ET
tree = ET.parse('...\directory')
root = tree.getroot()
for spreekbeurt in root.iter('spreekbeurt'):
    print spreekbeurt.attrib
This code prints:
{'nieuw': 'nee'}
{'nieuw': 'ja'}
{'nieuw': 'nee'}
{'nieuw': 'nee'}
But how do I print the children of each spreekbeurt?
Thanks in advance!

You can use find(), passing a path* to the target element, to find an individual element within a parent/ancestor, for example:
>>> for spreekbeurt in root.iter('spreekbeurt'):
...     v = spreekbeurt.find('spreker/voorvoegsels')
...     a = spreekbeurt.find('spreker/naam/achternaam')
...     print v.text, a.text
...
De heer Marcouch
*) In fact, find() supports more than simple paths: it accepts a subset of XPath 1.0 expressions.
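A self-contained sketch of the same idea, written for Python 3 and using a trimmed inline copy of the sample document. The SAMPLE string, the speaker_names helper and the second spreekbeurt (which has no voorvoegsels) are illustrative assumptions, not part of the question. Because find() returns None when nothing matches, the sketch uses findtext() with a default so a missing element does not raise:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question. The second <spreekbeurt>
# is made up to show what happens when <voorvoegsels> is missing.
SAMPLE = """<handelingen>
  <spreekbeurt nieuw="ja">
    <spreker>
      <voorvoegsels>De heer</voorvoegsels>
      <naam><achternaam>Marcouch</achternaam></naam>
      (<politiek>PvdA</politiek>):
    </spreker>
  </spreekbeurt>
  <spreekbeurt nieuw="nee">
    <spreker>
      <naam><achternaam>Voorzitter</achternaam></naam>
    </spreker>
  </spreekbeurt>
</handelingen>"""


def speaker_names(root):
    """Return a (voorvoegsels, achternaam) pair for every spreekbeurt under root."""
    names = []
    for spreekbeurt in root.iter('spreekbeurt'):
        # findtext() returns the default instead of None when the path is missing.
        prefix = spreekbeurt.findtext('spreker/voorvoegsels', default='')
        surname = spreekbeurt.findtext('spreker/naam/achternaam', default='')
        names.append((prefix, surname))
    return names


root = ET.fromstring(SAMPLE)
for prefix, surname in speaker_names(root):
    print(prefix, surname)
```

With a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).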

Related

Adding subElement at a specific location with xml.dom.minidom (appendChild)

I intend to insert a sub element at a specified location. However, I do not know how to do that using appendChild in xml.dom
Here is my xml code:
<?xml version='1.0' encoding='UTF-8'?>
<VOD>
<root>
<ab>sdsd
<pp>pras</pp>
<ps>sinha</ps>
</ab>
<ab>prashu</ab>
<ab>sakshi</ab>
<cd>dfdf</cd>
</root>
<root>
<ab>pratik</ab>
</root>
<root>
<ab>Mum</ab>
</root>
</VOD>
I would like to insert another sub element "new" in first "root" element just before the "cd" tag. The result should look like this:
<ab>prashu</ab>
<ab>sakshi</ab>
<new>Anydata</new>
<cd>dfdf</cd>
The code I used for this is:
import xml.dom.minidom as m
doc = m.parse("file_notes.xml")
root=doc.getElementsByTagName("root")
valeurs = doc.getElementsByTagName("root")[0]
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
valeurs.appendChild(element)
doc.writexml(open("newxmlfile.xml","w"))
In what way can I achieve my goal?
Thank you in advance..!!
Try using insertBefore instead. Something along these lines:
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))
cd = doc.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)
To insert new nodes based on an index you can just do:
cd_list = doc.getElementsByTagName("cd")
cd_list[0].parentNode.insertBefore(element, cd_list[0])
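Putting the pieces together, here is a minimal Python 3 sketch that can be run as-is. It parses a shortened inline copy of the document (the SOURCE string is an assumption standing in for file_notes.xml) and limits the search for cd to the first root element, so a cd tag elsewhere in the file would not be picked up by accident:

```python
import xml.dom.minidom as m

# Shortened version of the document from the question, without whitespace
# between tags so that no extra text nodes appear in the output.
SOURCE = (
    "<VOD><root><ab>prashu</ab><ab>sakshi</ab><cd>dfdf</cd></root>"
    "<root><ab>pratik</ab></root></VOD>"
)

doc = m.parseString(SOURCE)

# Build <new>Anydata</new>.
element = doc.createElement("new")
element.appendChild(doc.createTextNode("Anydata"))

# Look for <cd> only inside the first <root>, then insert before it.
first_root = doc.getElementsByTagName("root")[0]
cd = first_root.getElementsByTagName("cd")[0]
cd.parentNode.insertBefore(element, cd)

result = doc.documentElement.toxml()
print(result)
```

To write the result to disk, opening the output file in a with block and passing it to doc.writexml() makes sure the file is closed afterwards.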

Python3 parse XML into dictionary

It seems the original post was too vague, so I'm narrowing down the focus of this post. I have an XML file from which I want to pull values from specific branches, and I am having difficulty in understanding how to effectively navigate the XML paths. Consider the XML file below. There are several <mi> branches. I want to store the <r> value of certain branches, but not others. In this example, I want the <r> values of counter1 and counter3, but not counter2.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="Data.xsl" ?>
<!DOCTYPE mdc SYSTEM "Data.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<mfh>
<vn>TEST</vn>
<cbt>20140126234500.0+0000</cbt>
</mfh>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter1</mt>
<mv>
<moid>DEFAULT</moid>
<r>58</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter2</mt>
<mv>
<moid>DEFAULT</moid>
<r>100</r>
</mv>
</mi>
<mi>
<mts>20140126235000.0+0000</mts>
<mt>counter3</mt>
<mv>
<moid>DEFAULT</moid>
<r>7</r>
</mv>
</mi>
</mdc>
From that I would like to build a tuple with the following:
('20140126234500.0+0000', 58, 7)
where 20140126234500.0+0000 is taken from <cbt>, 58 is taken from the <r> value of the <mi> element that has <mt>counter1</mt> and 7 is taken from the <mi> element that has <mt>counter3</mt>.
I would like to use xml.etree.cElementTree since it seems to be standard and should be more than capable for my purposes. But I am having difficulty in navigating the tree and extracting the values I need. Below is some of what I have tried.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='Data.xml')
root = tree.getroot()
for mi in root.iter('mi'):
    print(mi.tag)
    for mt in mi.findall("./mt") if mt.value == 'counter1':
        print(mi.find("./mv/r").value) #I know this is invalid syntax, but it's what I want to do :)
From a pseudo code standpoint, what I am wanting to do is:
find the <cbt> value and store it in the first position of the tuple.
find the <mi> element where <mt>counter1</mt> exists and store the <r> value in the second position of the tuple.
find the <mi> element where <mt>counter3</mt> exists and store the <r> value in the third position of the tuple.
I'm not clear when to use element.iter() or element.findall(). Also, I'm not having the best of luck with using XPath within the functions, or being able to extract the info I'm needing.
Thanks,
Rusty
Starting with:
import xml.etree.cElementTree as ET # or with try/except as per your edit
xml_data = """<?xml version="1.0"?> and the rest of your XML here"""
tree = ET.fromstring(xml_data) # or `ET.parse(<filename>)`
xml_dict = {}
Now tree holds the parsed XML tree, and xml_dict is the dictionary that will collect the result.
# first get the key & val for 'cbt'
cbt_val = tree.find('mfh').find('cbt').text
xml_dict['cbt'] = cbt_val
The counters are in 'mi':
for elem in tree.findall('mi'):
    counter_name = elem.find('mt').text # key
    counter_val = elem.find('mv').find('r').text # value
    xml_dict[counter_name] = counter_val
At this point, xml_dict is:
>>> xml_dict
{'counter2': '100', 'counter1': '58', 'cbt': '20140126234500.0+0000', 'counter3': '7'}
Some shortening, though possibly not as readable: the body of the for elem in tree.findall('mi'): loop can be:
xml_dict[elem.find('mt').text] = elem.find('mv').find('r').text
# that combines the key/value extraction to one line
Or further, building the xml_dict can be done in just two lines with the counters first and cbt after:
xml_dict = {elem.find('mt').text: elem.find('mv').find('r').text for elem in tree.findall('mi')}
xml_dict['cbt'] = tree.find('mfh').find('cbt').text
Edit:
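From the docs: Element.findall() returns all matching elements; given a bare tag name, it only looks at direct children of the current element.
find() returns only the first match, again among direct children when given a bare tag name.
iter() iterates recursively over the element itself and all elements below it.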
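To get from the dictionary to the tuple asked for in the question, one possible sketch (Python 3) follows. XML_DATA is a shortened inline copy of the sample, and WANTED is a hypothetical name for the list of counters to keep. The last lines show the same lookup done with a path predicate, which ElementTree's XPath subset supports:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (stylesheet and DOCTYPE lines omitted).
XML_DATA = """<mdc>
  <mfh><vn>TEST</vn><cbt>20140126234500.0+0000</cbt></mfh>
  <mi><mt>counter1</mt><mv><moid>DEFAULT</moid><r>58</r></mv></mi>
  <mi><mt>counter2</mt><mv><moid>DEFAULT</moid><r>100</r></mv></mi>
  <mi><mt>counter3</mt><mv><moid>DEFAULT</moid><r>7</r></mv></mi>
</mdc>"""

WANTED = ('counter1', 'counter3')  # counters to keep, in output order

root = ET.fromstring(XML_DATA)

# Map every counter name to its integer <r> value.
counters = {mi.findtext('mt'): int(mi.findtext('mv/r'))
            for mi in root.findall('mi')}

# <cbt> first, then the selected counters.
result = (root.findtext('mfh/cbt'),) + tuple(counters[name] for name in WANTED)
print(result)

# Alternative: select one <mi> directly by the text of its <mt> child.
counter1 = root.findtext("mi[mt='counter1']/mv/r")
print(counter1)
```

The dictionary approach reads every mi once, which is convenient when several counters are needed; the predicate form is handy for a single lookup.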

Extract all the text from xml data with python

I'm new to xml data processing. I want to extract the text data in the following xml file:
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
so that expected result is:
['12345','45667', 'abcde']
Currently I have tried:
tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]
But the result only shows ['12345','45667'] . 'abcde' is missing. Can someone help me? Thanks in advance!
Try doing this using xpath and lxml :
import lxml.etree as etree
string = '''
<data>
<p>12345<strong>45667</strong>abcde</p>
</data>
'''
tree = etree.fromstring(string)
print(tree.xpath('//p//text()'))
The XPath expression means: "select every text node that is a descendant of any p element"
OUTPUT:
['12345', '45667', 'abcde']
getiterator() (or its replacement, iter()) iterates over elements, whereas abcde is a text node: it is the tail of the strong element.
You can use itertext() method:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
print list(tree.find('p').itertext())
Prints:
['12345', '45667', 'abcde']
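To see where each piece of text lives, here is a small Python 3 sketch using only the standard library and the sample from the question parsed from a string (rather than from test.xml). It shows that abcde is stored on the strong element's tail, which is why collecting only .text misses it:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><p>12345<strong>45667</strong>abcde</p></data>")
p = root.find('p')
strong = p.find('strong')

# Text before the first child is stored on the parent's .text;
# text after a child's closing tag is stored on that child's .tail.
print(p.text)       # text of <p> up to <strong>
print(strong.text)  # text inside <strong>
print(strong.tail)  # text between </strong> and </p>

# itertext() yields .text and .tail values in document order.
texts = list(p.itertext())
print(texts)
```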

Merge two XML files by matching elements by attribute value

I have two XML files that I'm trying to merge. I looked at other previous questions, but I don't feel like I can solve my problem from reading those. What I think makes my situation unique is that I have to find elements by attribute value and then merge to the opposite file.
I have two files. One is an English translation catalog and the second is a Japanese translation catalog. Please see below.
In the code below you'll see the XML has three elements which I will be merging children on - MessageCatalogueEntry, MessageCatalogueFormEntry, and MessageCatalogueFormItemEntry. I have hundreds of files and each file has thousands of lines. There may be more elements than the three I just listed, but I know for sure that all the elements have a "key" attribute.
My plan:
Iterate through File 1 and create a list of all the values of the "key" attribute.
In this example, the list would be key_values = [321, 260, 320]
Next, I'll go through the key_values list one by one.
I'll search File 1 for an element with attribute key=321.
Next, grab the child of the element with key=321 from File 1.
Next, in File 2, find the element with key=321 and add the child element I previously grabbed from File 1.
Next, I'll continue the same process, looping through the key_values list.
Finally, I'll write the new XML root to a file, being careful to keep the UTF-8 encoding.
File 1:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
File 2:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue[]>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
I'm having trouble even grabbing elements, never mind grabbing them by key value. For example, I've been playing with the elementtree library and I wrote this code hoping to get just the MessageCatalogueEntry elements, but I'm only getting their children:
from xml.etree import ElementTree as et
tree_japanese = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue_JA.xml')
root_japanese = tree_japanese.getroot()
MC_japanese = root_japanese.findall("MessageCatalogue")
for x in MC_japanese:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
tree_english = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue.xml')
root_english = tree_english.getroot()
MC_english = root_english.findall("MessageCatalogue")
for x in MC_english:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
Any help would be appreciated. I've been at this for a few work days now and I'm not any closer to finishing than I was when I first started!
Actually, you are getting the MessageCatalogueEntry elements. The problem is in the print statement. An element acts like a list, so m[0] is the first child of the MessageCatalogueEntry. In
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
    print et.tostring(m[0], encoding='utf8')
change the print to print et.tostring(m, encoding='utf8') to see the right element.
I personally prefer lxml to elementtree. Assuming you want to associate entries by the 'key' attribute, you could use xpath to index one of the docs and then pull them into other doc.
import lxml.etree
tree_english = lxml.etree.parse('english.xml')
tree_japanese = lxml.etree.parse('japanese.xml')
# index the japanese catalog
j_index = {}
for catalog in tree_japanese.xpath('MessageCatalogue/*[@key]'):
    j_index[catalog.get('key')] = catalog
# find catalog entries in english and merge the japanese
for catalog in tree_english.xpath('MessageCatalogue/*[@key]'):
    j_catalog = j_index.get(catalog.get('key'))
    if j_catalog is not None:
        print 'found match'
        # copy the child list first: append() moves the node out of j_catalog
        for child in list(j_catalog):
            print 'add one'
            catalog.append(child)
print lxml.etree.tostring(tree_english, pretty_print=True, encoding='utf8')
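If installing lxml is not an option, the same index-and-merge idea can be sketched with the standard library's ElementTree, whose path subset also understands *[@key]. This is a Python 3 sketch under assumptions: ENGLISH and JAPANESE are shortened inline stand-ins for the two files, and merge_by_key is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for File 1 and File 2 from the question.
ENGLISH = """<PackageEntry><MessageCatalogue>
<MessageCatalogueEntry key="321"><MessageCatalogueEntry_loc locale="" message="active"/></MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260"><MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/></MessageCatalogueFormEntry>
</MessageCatalogue></PackageEntry>"""

JAPANESE = """<PackageEntry><MessageCatalogue>
<MessageCatalogueEntry key="321"><MessageCatalogueEntry_loc locale="ja" message="アクティブ"/></MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260"><MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定"/></MessageCatalogueFormEntry>
</MessageCatalogue></PackageEntry>"""


def merge_by_key(target_root, source_root):
    """Append the children of each keyed source entry to the target entry with the same key."""
    # Index every source element that carries a "key" attribute.
    source_index = {entry.get('key'): entry
                    for entry in source_root.findall('MessageCatalogue/*[@key]')}
    for entry in target_root.findall('MessageCatalogue/*[@key]'):
        match = source_index.get(entry.get('key'))
        if match is not None:
            entry.extend(list(match))
    return target_root


merged = merge_by_key(ET.fromstring(ENGLISH), ET.fromstring(JAPANESE))
print(ET.tostring(merged, encoding='unicode'))
```

For writing the result, ET.ElementTree(merged).write(path, encoding='utf-8', xml_declaration=True) keeps the UTF-8 encoding. As far as I know, ElementTree does not preserve the DOCTYPE line, so it would have to be re-added by hand if it matters.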

Element Tree: How to parse subElements of child nodes

I have an XML tree, which I'd like to parse using Elementtree. My XML looks something like
<?xml version="1.0" encoding="UTF-8"?>
<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
<Ack>Success</Ack>
<Version>857</Version>
<Build>E857_INTL_APIXO_16643800_R1</Build>
<PaginationResult>
<TotalNumberOfPages>1</TotalNumberOfPages>
<TotalNumberOfEntries>2</TotalNumberOfEntries>
</PaginationResult>
<HasMoreOrders>false</HasMoreOrders>
<OrderArray>
<Order>
<OrderID>221362908003-1324471823012</OrderID>
<CheckoutStatus>
<eBayPaymentStatus>NoPaymentFailure</eBayPaymentStatus>
<LastModifiedTime>2014-02-03T12:08:51.000Z</LastModifiedTime>
<PaymentMethod>PaisaPayEscrow</PaymentMethod>
<Status>Complete</Status>
<IntegratedMerchantCreditCardEnabled>false</IntegratedMerchantCreditCardEnabled>
</CheckoutStatus>
</Order>
<Order> ...
</Order>
<Order> ...
</Order>
</OrderArray>
</GetOrdersResponse>
I want to parse the 6th child of the XML (OrderArray). I am able to get the value of subelements by index. E.g., if I want the OrderID of the first order, I can use root[5][0][0].text. But I would like to get the values of subelements by name. I tried the following code, but it does not print anything:
tree = ET.parse('response.xml')
root = tree.getroot()
for child in root:
    try:
        for ids in child.find('Order').find('OrderID'):
            print ids.text
    except:
        continue
Could someone please help me with this? Thanks
Since the XML document has a namespace declaration (xmlns="urn:ebay:apis:eBLBaseComponents"), you have to use universal names when referring to elements in the document. For example, you need {urn:ebay:apis:eBLBaseComponents}OrderID instead of just OrderID.
This snippet prints all OrderIDs in the document:
from xml.etree import ElementTree as ET
NS = "urn:ebay:apis:eBLBaseComponents"
tree = ET.parse('response.xml')
for elem in tree.iter("*"): # Use tree.getiterator("*") in Python 2.5 and 2.6
    if elem.tag == '{%s}OrderID' % NS:
        print elem.text
See http://effbot.org/zone/element-namespaces.htm for details about ElementTree and namespaces.
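As an alternative to comparing full tag names, find(), findall() and findtext() accept a prefix-to-URI mapping, so paths can be written with a short prefix. A minimal Python 3 sketch, assuming a shortened inline copy of the response (RESPONSE, with a made-up second OrderID) and an arbitrary prefix e:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response; the second OrderID is a placeholder.
RESPONSE = """<GetOrdersResponse xmlns="urn:ebay:apis:eBLBaseComponents">
  <Ack>Success</Ack>
  <OrderArray>
    <Order><OrderID>221362908003-1324471823012</OrderID></Order>
    <Order><OrderID>hypothetical-second-id</OrderID></Order>
  </OrderArray>
</GetOrdersResponse>"""

# The prefix "e" is arbitrary; only the URI has to match the document.
NS = {'e': 'urn:ebay:apis:eBLBaseComponents'}

root = ET.fromstring(RESPONSE)

# Every element in the path needs the prefix, because the default
# namespace applies to all of them.
order_ids = [order.findtext('e:OrderID', namespaces=NS)
             for order in root.findall('e:OrderArray/e:Order', NS)]
print(order_ids)
```

Without the namespace mapping, the same paths return no matches, which is why the code in the question prints nothing.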
Try to avoid chaining your finds. If your first find does not find anything, it will return None.
for child in root:
    order = child.find('Order')
    if order is not None:
        ids = order.find('OrderID')
        print ids.text
You can find an OrderArray first and then just iterate its children by name:
tree = ET.parse('response.xml')
root = tree.getroot()
order_array = root.find("OrderArray")
for order in order_array.findall('Order'):
    order_id_element = order.find('OrderID')
    if order_id_element is not None:
        print order_id_element.text
A side note. Never ever use except: continue. It hides any exception you get and makes debugging really hard.
