Recursive XML parsing python using ElementTree

Recursive XML parsing python using ElementTree - python

I'm trying to parse below XML using Python ElementTree to product output as below. I'm trying to write modules for top elements to print them. However It is slightly tricky as category element may or may not have property and cataegory element may have a category element inside.
I've referred to previous question in this topic, but they did not consist of nested elements with same name
My Code:
http://pastebin.com/Fsv2Xzqf
work.xml:
<suite id="1" name="MainApplication">
<displayNameKey>my Application</displayNameKey>
<displayName>my Application</displayName>
<application id="2" name="Sub Application1">
<displayNameKey>sub Application1</displayNameKey>
<displayName>sub Application1</displayName>
<category id="2423" name="about">
<displayNameKey>subApp.about</displayNameKey>
<displayName>subApp.about</displayName>
<category id="2423" name="comms">
<displayNameKey>subApp.comms</displayNameKey>
<displayName>subApp.comms</displayName>
<property id="5909" name="copyright" type="string_property" width="40">
<value>2014</value>
</property>
<property id="5910" name="os" type="string_property" width="40">
<value>Linux 2.6.32-431.29.2.el6.x86_64</value>
</property>
</category>
<property id="5908" name="releaseNumber" type="string_property" width="40">
<value>9.1.0.3.0.54</value>
</property>
</category>
</application>
</suite>
Output should be as below:
Suite: MainApplication
Application: Sub Application1
Category: about
property: releaseNumber | 9.1.0.3.0.54
category: comms
property: copyright | 2014
property: os | Linux 2.6.32-431.29.2.el6.x86_64
Any pointers in right direction would be of help.

import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='work.xml')
indent = 0
ignoreElems = ['displayNameKey', 'displayName']
def printRecur(root):
"""Recursively prints the tree."""
if root.tag in ignoreElems:
return
print ' '*indent + '%s: %s' % (root.tag.title(), root.attrib.get('name', root.text))
global indent
indent += 4
for elem in root.getchildren():
printRecur(elem)
indent -= 4
root = tree.getroot()
printRecur(root)
OUTPUT:
Suite: MainApplication
Application: Sub Application1
Category: about
Category: comms
Property: copyright
Value: 2014
Property: os
Value: Linux 2.6.32-431.29.2.el6.x86_64
Property: releaseNumber
Value: 9.1.0.3.0.54
This is closest I could get in 5 minutes. You should just recursively call a processor function and that would take care. You can improve on from this point :)
You can also define handler function for each tag and put all of them in a dictionary for easy lookup. Then you can check if you have an appropriate handler function for that tag, then call that else just continue with blindly printing. For example:
HANDLERS = {
'property': 'handle_property',
<tag_name>: <handler_function>
}
def handle_property(root):
"""Takes property root element and prints the values."""
data = ' '*indent + '%s: %s ' % (root.tag.title(), root.attrib['name'])
values = []
for elem in root.getchildren():
if elem.tag == 'value':
values.append(elem.text)
print data + '| %s' % (', '.join(values))
# printRecur would get modified accordingly.
def printRecur(root):
"""Recursively prints the tree."""
if root.tag in ignoreElems:
return
global indent
indent += 4
if root.tag in HANDLERS:
handler = globals()[HANDLERS[root.tag]]
handler(root)
else:
print ' '*indent + '%s: %s' % (root.tag.title(), root.attrib.get('name', root.text))
for elem in root.getchildren():
printRecur(elem)
indent -= 4
Output with above:
Suite: MainApplication
Application: Sub Application1
Category: about
Category: comms
Property: copyright | 2014
Property: os | Linux 2.6.32-431.29.2.el6.x86_64
Property: releaseNumber | 9.1.0.3.0.54
I find this very useful rather than putting tons of if/else in the code.

If you want a barebones XML recursive tree parser snippet:
from xml.etree import ElementTree
tree = ElementTree.parse('english_saheeh.xml')
root = tree.getroot()
def walk_tree_recursive(root):
#do whatever with .tags here
for child in root:
walk_tree_recursive(child)
walk_tree_recursive(root)

if you want a kind of universal xml importer, creating a record per xml element
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
def rij(elem,level,tags,rtag,mtag,keys,rootkey,data):
otag=mtag
mtag=elem.tag
mtag=mtag[mtag.rfind('}')+1:]
tags.append(mtag)
if level==1:
rtag=mtag
if elem.keys() is not None:
mkey=[]
if len(elem.keys())>1:
for key in elem.keys():
mkey.append(elem.attrib.get(key))
rootkey=mkey
else:
for key in elem.keys():
rootkey=elem.attrib.get(key)
else:
if elem.keys() is not None:
mkey=[]
lkey=[]
for key in elem.keys():
if len(elem.keys())>1:
mkey.append(elem.attrib.get(key))
keys=mkey
else:
for key in elem.keys():
keys=elem.attrib.get(key)
lkey=key
if elem.text is not None:
if elem.text!='\n ':
data.append([rootkey,tags,rtag,otag,mtag,lkey,keys,elem.text])
else:
data.append([rootkey,tags,rtag,otag,mtag,lkey,keys,''])
#print(data)
level+=1
for chil in elem.getchildren():
data = rij(chil, level,tags,rtag,mtag, keys,rootkey,data)
level-=1
mtag=elem.tag
mtag=mtag[mtag.rfind('}')+1:]
tags.remove(mtag)
return data
data = rij(root,0,[],'','', [],[],[])

Related

How to find a XML child element which has a default namespace in Python？

My goal is to find the XML child element which has a default name.
XML:
<?xml version='1.0' encoding='UTF-8'?>
<all:config xmlns:all="urn:base:1.0">
<interfaces xmlns="urn:ietf-interfaces">
<interface>
<name>eth0</name>
<enabled>true</enabled>
<ipv4 xmlns="urn:b-ip">
<enabled>true</enabled>
</ipv4>
<tagging xmlns="urn:b:interfaces:1.0">true</tagging>
<mac xmlns="urn:b:interfaces:1.0">00:00:10:00:00:11</mac>
</interface>
</interfaces>
</all:config>
I want to find the following element:
<mac xmlns="urn:b:interfaces:1.0">00:00:10:00:00:11</mac>
and change mac's text.
I have the following questions:
What is the xpath of mac?
How can I find "mac" using xpath since it has the default namespace?
My code does not work:
def set_element_value(file_name, element, new_value, order):
filename = file_name
tree = etree.parse(filename)
root = tree.getroot()
xml_string = etree.tostring(tree).decode('utf-8')
my_own_namespace_mapping = {'prefix': 'urn:b:interfaces:1.0'}
myele = root.xpath('.//prefix:mac', namespaces=my_own_namespace_mapping)
myele[0].text = "aaa"
for ele in root.xpath('.//prefix:mac', namespaces=my_own_namespace_mapping):
if count_order == order:
ele.text = str(new_value)
count_order += 1
def main():
filename ="./template/b.xml"
element = ".//interfaces/interface/mac"
new_value = "10"
order = 0
set_element_value(filename, element, new_value, order)
if __name__ == '__main__':
main()
I tried to dig out in the stackoverflow, but no similar answer.
Could you please give me some tips?
Thank you!

Thanks to Jack's methods, I fixed this issue:
The new code:
def set_element_value(file_name, element, new_value, order):
filename = file_name
tree = etree.parse(filename)
tag_list = tree.xpath('.//*[local-name()="mac"]')
print("tag:", tag_list, " and tag value:", tag_list[0].text)
tag_list[0].text = "10"
xml_string = etree.tostring(tree).decode('utf-8')
print(xml_string)
def main():
filename ="./template/b.xml"
element = "mac"
new_value = "10"
order = 1
set_element_value(filename, element, new_value, order)
if __name__ == '__main__':
main()
output:
tag: [<Element {urn:ietf-interfaces}mac at 0x298fc4b69c0>] and tag value: 10
<all:config xmlns:all="urn:base:1.0">
<interfaces xmlns="urn:ietf-interfaces">
<interface>
<name>eth0</name>
<enabled>true</enabled>
<ipv4 xmlns="urn:b-ip">
<enabled>true</enabled>
</ipv4>
<tagging xmlns="urn:b:interfaces:1.0">true</tagging>
<mac xmln="urn:b:interfaces:1.0">10</mac>
</interface>
</interfaces>
</all:config>

Your code seems to be a little too complicated than necessary. Try the following to get to the mac address:
ns = {"x":"urn:b:interfaces:1.0"}
root.xpath('//x:mac/text()',namespaces=ns)[0]
or if you don't want to deal with namespaces:
root.xpath('//*[local-name()="mac"]/text()')[0]
Output in either case is
00:00:10:00:00:11

Write 'xsi:' in front of attribute with lxml for python 3

I'm adding elements to an xml file.
The document's root is as follows
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
And elements to add look like
<Element xsi:type="some type">
<Sub1>Some text</Sub1>
<Sub2>More text</Sub2>
...
</Element>
I'm trying to find a way for lxml to write 'xsi:' in front of my Element's attibute. This xml file is used by a program to which's source code I do not have access to. I read in a few other questions how to do it by declaring the nsmap of the xml's root, and then again in the child's attribute, which I tried but it didn't work. So far I have (that's what didn't work, the ouput file did not contain the xsi prefix):
element = SubElement(_parent=parent,
_tag='some tag',
attrib={'{%s}type' % XSI: 'some type'}
nsmap={'xsi': XSI}) # Where XSI = namespace address
The namespace is declared properly in the xml file I parse, so I don't know why this isn't working.
The output I get is the element as shown above without the 'xsi:' prefix and all on one line:
<Element type="some type"><Sub1>Some text</Sub1><Sub2>More text</Sub2>...</Element>
If anyone can also point out why in this line
self.tree.write(self.filename, pretty_print=True, encoding='utf-8')
the 'pretty_print' option doesn't work (all printed out in one line), it would be greatly appreciated.
Here is a code example of my script:
from math import floor
from lxml import etree
from lxml.etree import SubElement
def Element(root, sub1: str):
if not isinstance(sub1, str):
raise TypeError
else:
element = SubElement(root, 'Element')
element_sub1 = SubElement(element, 'Sub1')
element_sub1.text = sub1
# ...
# Omitted additional SubElements
# ...
return element
def Sub(root, sub5_sub: str):
XSI = "http://www.w3.org/2001/XMLSchema-instance"
if not isinstance(sub5_sub, str):
raise TypeError
else:
sub = SubElement(root, 'Sub5_Sub', {'{%s}type' % XSI: 'SomeType'}, nsmap={'xsi': XSI})
# ...
# Omitted additional SubElements
# ...
return sub
class Generator:
def __init__(self) -> None:
self.filename = None
self.csv_filename = None
self.csv_content = []
self.tree = None
self.root = None
self.panel = None
self.panels = None
def mainloop(self) -> None:
"""App's mainloop"""
while True:
# Getting files from user
xml_filename = input('Enter path to xml file : ')
# Parsing files
csv_content = [{'field1': 'ElementSub1', 'field2': 'something'},
{'field1': 'ElementSub1', 'field2': 'something'},
{'field1': 'ElementSub2', 'field2': 'something'}] # Replaces csv file that I use
tree = etree.parse(xml_filename)
root = tree.getroot()
elements = root.find('Elements')
for element in elements:
if element.find('Sub1').text in ['ElementSub1', 'ElementSub2']:
for line in csv_content:
if element.find('Sub5') is not None:
Sub(root=element.find('Sub5'),
sub5_sub=line['field2'])
tree.write(xml_filename, pretty_print=True, encoding='utf-8')
if input('Continue? (Y) Quit (n)').upper().startswith('Y'):
elements.clear()
continue
else:
break
#staticmethod
def get_x(x: int) -> str:
if not isinstance(x, int):
x = int(x)
return str(int(floor(9999 / 9 * x)))
#staticmethod
def get_y(y: int) -> str:
if not isinstance(y, int):
y = int(y)
return str(int(floor(999 / 9 * y)))
def quit(self) -> None:
quit()
if __name__ == "__main__":
app = Generator()
app.mainloop()
app.quit()
Here is what it outputs:
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Elements>
<Element>
<Sub1>ElementSub1</Sub1>
<Sub5>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/></Sub5>
</Element>
<Element>
<Sub1>ElementSub1</Sub1>
<Sub5>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/>
<Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/><Sub5_Sub xsi:type="SomeType"/></Sub5>
</Element>
<Element>
<Sub1>ElementSub1</Sub1>
</Element>
</Elements>
</Root>
For some reason, this piece of code does what I want but my real code doesn't. I've come to realize that it does put a prefix on some sub elements with the type attribute, but not all and on those it puts the prefix, it isn't always just 'xsi:'. I found a quick and dirty way to fix this problem which is less than ideal (find and replace through the file for xsi-type -> accepted by lxml's api to xsi:type). What still isn't working though is that it's all printed out in one line despite the pretty_print parameter being true.

I just recently encountered this scenario and was able to successfully create an attribute with the xsi:
qname = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "type")
element = etree.Element('Element', {qname: "some type")
root.append(element)
this outputs something like
<Element xsi:type="some type">

Return all tags and values of a given object within xml document in Python

I have xml which looks like this...
<CONFIG2>
<OBJECT id="{2D3474AA-9A0F-4696-979C-1BCE9350F3BD}" type="3" name="Test2" rev="1">
<RULEITEM>
<ID>{BF7D00C5-57BC-4187-9B07-064BA5744A12}</ID>
<PROPERTIES>
<columnname/>
<columnvalue/>
<days>|</days>
<enabled>1</enabled>
<eventdate>0001-01-01</eventdate>
<eventtime>00:00:00</eventtime>
<function>average</function>
<parameters/>
<stattype>standard</stattype>
</PROPERTIES>
<ITEM>
<ID>{61C82F62-8F31-4754-A705-7CCBB34C6FD4}</ID>
<PROPERTIES>
<actionindex>0</actionindex>
<actiontype>eventlog</actiontype>
<severity>error</severity>
</PROPERTIES>
</ITEM>
</RULEITEM>
<PROPERTIES>
<groups>|</groups>
</PROPERTIES>
</OBJECT>
I am trying to return everything within OBJECT and /OBJECT.
For example, I would like to return all tags and values for the OBJECT where type="3" and name="test2".
Here is my current python script...
import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Users\User\Desktop\xml\config.xml')
root = tree.getroot()
objectType = input("What object type are you looking for?: ")
for item in root.findall('OBJECT'):
if item.attrib['type'] == objectType:
print(item.get('name'))
objectName = input("What is the name of the object you are looking for?: ")
for item in root.findall('OBJECT'):
if item.attrib['type'] == objectType and item.get('name') == objectName:
print(list(item))
This returns...
<Element "RULEITEM' at 0x00064F3C0>, <Element 'PROPERTIES' at 0x0064FB40>
I like it to return the entire object and all tags and values. Does anyone know how I could do this?
Thanks!

Consider adjusting second loop by using another findall() iteration on all of OBJECT's children and grandchildren and so forth using (//*). Also, consider appending to a pre-defined list instead of list() (which would have broken each character of the loops' text item) and condition it to remove None values (i.e., empty nodes). Finally, use the tag and text property of the node:
... same code as above...
values = []
for item in root.findall('OBJECT'):
if item.attrib['type'] == objectType and item.get('name') == objectName:
for data in root.findall("OBJECT//*"):
if data.text is not None:
values.append(data.tag + ': ' + data.text)
for v in values:
print(v)
#What object type are you looking for?: 3
#Test2
#What is the name of the object you are looking for?: Test2
#RULEITEM:
#
#ID: {BF7D00C5-57BC-4187-9B07-064BA5744A12}
#PROPERTIES:
#
#days: |
#enabled: 1
#eventdate: 0001-01-01
#eventtime: 00:00:00
#function: average
#stattype: standard
#ITEM:
#
#ID: {61C82F62-8F31-4754-A705-7CCBB34C6FD4}
#PROPERTIES:
#
#actionindex: 0
#actiontype: eventlog
#severity: error
#PROPERTIES:
#
#groups: |

Reading Maven Pom xml in Python

I have a pom file that has the following defined:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.welsh</groupId>
<artifactId>my-site</artifactId>
<version>1.0.0</version>
<packaging>pom</packaging>
<profiles>
<profile>
<build>
<plugins>
<plugin>
<groupId>org.welsh.utils</groupId>
<artifactId>site-tool</artifactId>
<version>1.0</version>
<executions>
<execution>
<configuration>
<mappings>
<property>
<name>homepage</name>
<value>/content/homepage</value>
</property>
<property>
<name>assets</name>
<value>/content/assets</value>
</property>
</mappings>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
And I am looking to build a dictionary off the name & value elements under property under the mappings element.
So what I'm trying to figure out how to get all possible mappings elements (Incase of multiple build profiles) so I can get all property elements under it and from reading about Supported XPath syntax the following should print out all possible text/value elements:
import xml.etree.ElementTree as xml
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
for mapping in root.findall('*/mappings'):
for prop in mapping.findall('.//property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which is returning nothing. I tried just printing out all the mappings elements and get:
>>> print root.findall('*/mappings')
[]
And when I print out the everything from root I get:
>>> print root.findall('*')
[<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x10b38bd50>, <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x10b38bd90>, <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x10b38bf10>, <Element '{http://maven.apache.org/POM/4.0.0}version' at 0x10b3900d0>, <Element '{http://maven.apache.org/POM/4.0.0}packaging' at 0x10b390110>, <Element '{http://maven.apache.org/POM/4.0.0}name' at 0x10b390150>, <Element '{http://maven.apache.org/POM/4.0.0}properties' at 0x10b390190>, <Element '{http://maven.apache.org/POM/4.0.0}build' at 0x10b390310>, <Element '{http://maven.apache.org/POM/4.0.0}profiles' at 0x10b390390>]
Which made me try to print:
>>> print root.findall('*/{http://maven.apache.org/POM/4.0.0}mappings')
[]
But that's not working either.
Any suggestions would be great.
Thanks,

The main issues of the code in the question are
that it doesn't specify namespaces, and
that it uses */ instead of // which only matches direct children.
As you can see at the top of the XML file, Maven uses the namespace http://maven.apache.org/POM/4.0.0. The attribute xmlns in the root node defines the default namespace. The attribute xmlns:xsi defines a namespace that is only used for xsi:schemaLocation.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
To specify tags like profile in methods like find, you have to specify the namespace as well. For example, you could write the following to find all profile-tags.
import xml.etree as xml
pom = xml.parse('pom.xml')
for profile in pom.findall('//{http://maven.apache.org/POM/4.0.0}profile'):
print(repr(profile))
Also note that I'm using //. Using */ would have the same result for your specific xml file above. However, it would not work for other tags like mappings. Since * represents only one level, */child can be expanded to parent/tag or xyz/tag but not to xyz/parent/tag.
Now, you should be able to come up with something like this to find all mappings:
pom = xml.parse('pom.xml')
map = {}
for mapping in pom.findall('//{http://maven.apache.org/POM/4.0.0}mappings'
'/{http://maven.apache.org/POM/4.0.0}property'):
name = mapping.find('{http://maven.apache.org/POM/4.0.0}name').text
value = mapping.find('{http://maven.apache.org/POM/4.0.0}value').text
map[name] = value
Specifying the namespaces like this is quite verbose. To make it easier to read, you can define a namespace map and pass it as second argument to find and findall:
# ...
nsmap = {'m': 'http://maven.apache.org/POM/4.0.0'}
for mapping in pom.findall('//m:mappings/m:property', nsmap):
name = mapping.find('m:name', nsmap).text
value = mapping.find('m:value', nsmap).text
map[name] = value

Ok, found out that when I remove maven stuff from the project element so its just <project> I can do this:
for mapping in root.findall('*//mappings'):
logging.info(mapping)
for prop in mapping.findall('./property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which would result in:
INFO:root:<Element 'mappings' at 0x10d72d350>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, if I leave the Maven stuff in at the top I can do this:
for mapping in root.findall('*//{http://maven.apache.org/POM/4.0.0}mappings'):
logging.info(mapping)
for prop in mapping.findall('./{http://maven.apache.org/POM/4.0.0}property'):
logging.info(prop.find('{http://maven.apache.org/POM/4.0.0}name').text + " => " + prop.find('{http://maven.apache.org/POM/4.0.0}value').text)
Which results in:
INFO:root:<Element '{http://maven.apache.org/POM/4.0.0}mappings' at 0x10aa7f310>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, I'd love to be able to figure out how to avoid having to account for the maven stuff since it locks me into this one format.
EDIT:
Ok, I managed to get something a bit more verbose:
import xml.etree.ElementTree as xml
def getMappingsNode(node, nodeName):
if node.findall('*'):
for n in node.findall('*'):
if nodeName in n.tag:
return n
else:
return getMappingsNode(n, nodeName)
def getMappings(rootNode):
mappingsNode = getMappingsNode(rootNode, 'mappings')
mapping = {}
for prop in mappingsNode.findall('*'):
key = ''
val = ''
for child in prop.findall('*'):
if 'name' in child.tag:
key = child.text
if 'value' in child.tag:
val = child.text
if val and key:
mapping[key] = val
return mapping
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
mappings = getMappings(root)
print mappings

Accessing XMLNS attribute with Python Elementree?

How can one access NS attributes through using ElementTree?
With the following:
<data xmlns="http://www.foo.net/a" xmlns:a="http://www.foo.net/a" book="1" category="ABS" date="2009-12-22">
When I try to root.get('xmlns') I get back None, Category and Date are fine, Any help appreciated..

I think element.tag is what you're looking for. Note that your example is missing a trailing slash, so it's unbalanced and won't parse. I've added one in my example.
>>> from xml.etree import ElementTree as ET
>>> data = '''<data xmlns="http://www.foo.net/a"
... xmlns:a="http://www.foo.net/a"
... book="1" category="ABS" date="2009-12-22"/>'''
>>> element = ET.fromstring(data)
>>> element
<Element {http://www.foo.net/a}data at 1013b74d0>
>>> element.tag
'{http://www.foo.net/a}data'
>>> element.attrib
{'category': 'ABS', 'date': '2009-12-22', 'book': '1'}
If you just want to know the xmlns URI, you can split it out with a function like:
def tag_uri_and_name(elem):
if elem.tag[0] == "{":
uri, ignore, tag = elem.tag[1:].partition("}")
else:
uri = None
tag = elem.tag
return uri, tag
For much more on namespaces and qualified names in ElementTree, see effbot's examples.

Look at the effbot namespaces documentation/examples; specifically the parse_map function. It shows you how to add an *ns_map* attribute to each element which contains the prefix/URI mapping that applies to that specific element.
However, that adds the ns_map attribute to all the elements. For my needs, I found I wanted a global map of all the namespaces used to make element look up easier and not hardcoded.
Here's what I came up with:
import elementtree.ElementTree as ET
def parse_and_get_ns(file):
events = "start", "start-ns"
root = None
ns = {}
for event, elem in ET.iterparse(file, events):
if event == "start-ns":
if elem[0] in ns and ns[elem[0]] != elem[1]:
# NOTE: It is perfectly valid to have the same prefix refer
# to different URI namespaces in different parts of the
# document. This exception serves as a reminder that this
# solution is not robust. Use at your own peril.
raise KeyError("Duplicate prefix with different URI found.")
ns[elem[0]] = "{%s}" % elem[1]
elif event == "start":
if root is None:
root = elem
return ET.ElementTree(root), ns
With this you can parse an xml file and obtain a dict with the namespace mappings. So, if you have an xml file like the following ("my.xml"):
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"\
>
<feed>
<item>
<title>Foo</title>
<dc:creator>Joe McGroin</dc:creator>
<description>etc...</description>
</item>
</feed>
</rss>
You will be able to use the xml namepaces and get info for elements like dc:creator:
>>> tree, ns = parse_and_get_ns("my.xml")
>>> ns
{u'content': '{http://purl.org/rss/1.0/modules/content/}',
u'dc': '{http://purl.org/dc/elements/1.1/}'}
>>> item = tree.find("/feed/item")
>>> item.findtext(ns['dc']+"creator")
'Joe McGroin'

Try this:
import xml.etree.ElementTree as ET
import re
import sys
with open(sys.argv[1]) as f:
root = ET.fromstring(f.read())
xmlns = ''
m = re.search('{.*}', root.tag)
if m:
xmlns = m.group(0)
print(root.find(xmlns + 'the_tag_you_want').text)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Recursive XML parsing python using ElementTree - python

If you want a barebones XML recursive tree parser snippet: from xml.etree import ElementTree tree = ElementTree.parse('english_saheeh.xml') root = tree.getroot() def walk_tree_recursive(root): #do whatever with .tags here for child in root: walk_tree_recursive(child) walk_tree_recursive(root)

Related

How to find a XML child element which has a default namespace in Python？

Write 'xsi:' in front of attribute with lxml for python 3

Return all tags and values of a given object within xml document in Python

Reading Maven Pom xml in Python

Accessing XMLNS attribute with Python Elementree?

Categories

Resources