Parsing XML with ElementTree in Python


I have XML like this:
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
I want to parse the XML and extract the <value> entry which is just below the <name> entry marked 'swisspro'. I.e. I want to parse and extract the 'Q8H6N2' value.
How would I do this using ElementTree?

It would be much easier to do via lxml, but here's a solution using the ElementTree library:
import xml.etree.ElementTree as ET
data = """<parameters>
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
</parameter>
</parameters>"""
tree = ET.fromstring(data)
for parameter in tree.iter(tag='parameter'):
    name = parameter.find('name')
    if name is not None and name.text == 'swisspro':
        print(parameter.find('value').text)
        break
prints:
Q8H6N2
The idea is pretty simple: iterate over all parameter tags, check the value of the name tag and if it is equal to swisspro, get the value element.
Hope that helps.
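The loop above can also be collapsed into a single path expression, since ElementTree's limited XPath support includes a child-text predicate. This is only a sketch: the helper name value_for and the shortened sample data are my own, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Sample document shaped like the one in the question (wrapped in a single root).
DATA = """<parameters>
  <parameter>
    <name>ec_num</name>
    <value>none</value>
  </parameter>
  <parameter>
    <name>swisspro</name>
    <value>Q8H6N2</value>
  </parameter>
</parameters>"""


def value_for(root, wanted_name):
    """Return the <value> text of the <parameter> whose <name> equals wanted_name, or None."""
    # [name='...'] keeps only <parameter> elements that have a <name> child with that text.
    # wanted_name is interpolated into the path, so it must not contain a quote character.
    node = root.find(".//parameter[name='{}']/value".format(wanted_name))
    return None if node is None else node.text


root = ET.fromstring(DATA)
print(value_for(root, "swisspro"))
```

The explicit loop is still the better choice if the name you are looking for may contain quotes or comes from untrusted input.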

Here is an example:
xml file
<?xml version="1.0" encoding="utf-8"?>
<root>
<person age="18">
<name>hzj</name>
<sex>man</sex>
</person>
<person age="19" des="hello">
<name>kiki</name>
<sex>female</sex>
</person>
</root>
parse method
from xml.etree import ElementTree

def print_node(node):
    '''print basic info'''
    print("==============================================")
    print("node.attrib:%s" % node.attrib)
    if "age" in node.attrib:
        print("node.attrib['age']:%s" % node.attrib['age'])
    print("node.tag:%s" % node.tag)
    print("node.text:%s" % node.text)

def read_xml(text):
    '''read xml file'''
    # root = ElementTree.parse(r"D:/test.xml").getroot()  # first method
    root = ElementTree.fromstring(text)  # second method
    # get element
    # 1. by iter (getiterator was removed in Python 3.9)
    lst_node = list(root.iter("person"))
    for node in lst_node:
        print_node(node)
    # 2. by indexing (getchildren was removed in Python 3.9)
    lst_node_child = lst_node[0][0]
    print_node(lst_node_child)
    # 3. by find
    node_find = root.find('person')
    print_node(node_find)
    # 4. by findall
    node_findall = root.findall("person/name")[1]
    print_node(node_findall)

if __name__ == '__main__':
    # read bytes so the parser honours the encoding declaration
    with open("test.xml", "rb") as f:
        read_xml(f.read())

Related

how to parse XML with namespace and attribute in Python?

Hi, I am trying to parse XML that uses namespaces and attributes.
I am close, using root.findall() and .get(),
but I am still struggling to get the right values out of the XML file.
How do I get the XML attribute values?
Input:
<?xml version="1.0" encoding="UTF-8"?><message:GenericData
xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic"
xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message https://sdw-wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXMessage.xsd
http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common https://sdw-wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXCommon.xsd
http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic https://sdw-wsrest.ecb.europa.eu:443/vocabulary/sdmx/2_1/SDMXDataGeneric.xsd">
<generic:Obs>
<generic:ObsDimension value="1999-01"/>
<generic:ObsValue value="0.7029125"/>
</generic:Obs>
<generic:Obs>
<generic:ObsDimension value="1999-02"/>
<generic:ObsValue value="0.688505"/>
</generic:Obs>
Code:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
for x in root.findall('.//'):
    print(x.tag, " ", x.get('value'))
Output:
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}Obs None
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsDimension 1999-01
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsValue 0.7029125
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}Obs None
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsDimension 1999-02
{http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic}ObsValue 0.688505
Expected_Output:
1999-01 0.7029125
1999-02 0.688505

How about this:
for parent in root:
    print(' '.join([child.get('value', "") for child in parent]))
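The loop above relies on every child of the root being an Obs element. If the real file nests the observations deeper or mixes in other elements, a namespace-aware lookup is more robust. Here is a minimal sketch, assuming a trimmed and properly closed copy of the input; the prefix g and the function name observations are my own choices.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the input from the question, closed so that it parses.
DATA = """<message:GenericData
    xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
    xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic">
  <generic:Obs>
    <generic:ObsDimension value="1999-01"/>
    <generic:ObsValue value="0.7029125"/>
  </generic:Obs>
  <generic:Obs>
    <generic:ObsDimension value="1999-02"/>
    <generic:ObsValue value="0.688505"/>
  </generic:Obs>
</message:GenericData>"""

# The prefix on the left is local to this script; only the URI has to match the document.
NS = {"g": "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic"}


def observations(root):
    """Return a list of (period, value) string pairs, one per Obs element."""
    rows = []
    for obs in root.iterfind(".//g:Obs", NS):
        dim = obs.find("g:ObsDimension", NS)
        val = obs.find("g:ObsValue", NS)
        if dim is not None and val is not None:
            rows.append((dim.get("value"), val.get("value")))
    return rows


root = ET.fromstring(DATA)
for period, value in observations(root):
    print(period, value)
```

Returning pairs instead of printing directly keeps the period and the value together, which helps if you want to load them into a table later.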

How to find an XML child element which has a default namespace in Python?

My goal is to find an XML child element that is in a default namespace.
XML:
<?xml version='1.0' encoding='UTF-8'?>
<all:config xmlns:all="urn:base:1.0">
<interfaces xmlns="urn:ietf-interfaces">
<interface>
<name>eth0</name>
<enabled>true</enabled>
<ipv4 xmlns="urn:b-ip">
<enabled>true</enabled>
</ipv4>
<tagging xmlns="urn:b:interfaces:1.0">true</tagging>
<mac xmlns="urn:b:interfaces:1.0">00:00:10:00:00:11</mac>
</interface>
</interfaces>
</all:config>
I want to find the following element:
<mac xmlns="urn:b:interfaces:1.0">00:00:10:00:00:11</mac>
and change mac's text.
I have the following questions:
What is the xpath of mac?
How can I find "mac" using xpath since it has the default namespace?
My code does not work:
def set_element_value(file_name, element, new_value, order):
    filename = file_name
    tree = etree.parse(filename)
    root = tree.getroot()
    xml_string = etree.tostring(tree).decode('utf-8')
    my_own_namespace_mapping = {'prefix': 'urn:b:interfaces:1.0'}
    myele = root.xpath('.//prefix:mac', namespaces=my_own_namespace_mapping)
    myele[0].text = "aaa"
    for ele in root.xpath('.//prefix:mac', namespaces=my_own_namespace_mapping):
        if count_order == order:
            ele.text = str(new_value)
        count_order += 1

def main():
    filename ="./template/b.xml"
    element = ".//interfaces/interface/mac"
    new_value = "10"
    order = 0
    set_element_value(filename, element, new_value, order)

if __name__ == '__main__':
    main()
I searched Stack Overflow but did not find a similar answer.
Could you please give me some tips?
Thank you!

Thanks to Jack's methods, I fixed this issue:
The new code:
def set_element_value(file_name, element, new_value, order):
    filename = file_name
    tree = etree.parse(filename)
    tag_list = tree.xpath('.//*[local-name()="mac"]')
    print("tag:", tag_list, " and tag value:", tag_list[0].text)
    tag_list[0].text = "10"
    xml_string = etree.tostring(tree).decode('utf-8')
    print(xml_string)

def main():
    filename ="./template/b.xml"
    element = "mac"
    new_value = "10"
    order = 1
    set_element_value(filename, element, new_value, order)

if __name__ == '__main__':
    main()
output:
tag: [<Element {urn:ietf-interfaces}mac at 0x298fc4b69c0>] and tag value: 10
<all:config xmlns:all="urn:base:1.0">
<interfaces xmlns="urn:ietf-interfaces">
<interface>
<name>eth0</name>
<enabled>true</enabled>
<ipv4 xmlns="urn:b-ip">
<enabled>true</enabled>
</ipv4>
<tagging xmlns="urn:b:interfaces:1.0">true</tagging>
<mac xmlns="urn:b:interfaces:1.0">10</mac>
</interface>
</interfaces>
</all:config>

Your code seems more complicated than necessary. Try the following to get the MAC address:
ns = {"x":"urn:b:interfaces:1.0"}
root.xpath('//x:mac/text()',namespaces=ns)[0]
or if you don't want to deal with namespaces:
root.xpath('//*[local-name()="mac"]/text()')[0]
Output in either case is
00:00:10:00:00:11
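If lxml is not available, the same lookup and update can be sketched with the standard library's xml.etree.ElementTree, which has no xpath() method but accepts a prefix map in find/findall. Note that this swaps in a different library than the one used above; the prefix b, the helper set_mac and the shortened document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
DATA = """<all:config xmlns:all="urn:base:1.0">
  <interfaces xmlns="urn:ietf-interfaces">
    <interface>
      <name>eth0</name>
      <mac xmlns="urn:b:interfaces:1.0">00:00:10:00:00:11</mac>
    </interface>
  </interfaces>
</all:config>"""

# Any prefix works here; what matters is the namespace URI declared on <mac>.
NS = {"b": "urn:b:interfaces:1.0"}


def set_mac(root, new_value, index=0):
    """Set the text of the index-th <mac> element and return its previous text."""
    macs = root.findall(".//b:mac", NS)
    old = macs[index].text
    macs[index].text = new_value
    return old


root = ET.fromstring(DATA)
old = set_mac(root, "10")
print(old, "->", root.find(".//b:mac", NS).text)
```

The key point is the same in both libraries: an element that inherits or declares a default namespace has to be addressed with that namespace, even though no prefix appears in the file.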

Reshape xml using python?

I have XML like this:
<data>
<B>Head1</B>
<I>Inter1</I>
<I>Inter2</I>
<I>Inter3</I>
<I>Inter4</I>
<I>Inter5</I>
<O>,</O>
<B>Head2</B>
<I>Inter6</I>
<I>Inter7</I>
<I>Inter8</I>
<I>Inter9</I>
<O>,</O>
<O> </O>
</data>
and I want the XML to look like
<data>
<combined>Head1 Inter1 Inter2 Inter3 Inter4 Inter5</combined>,
<combined>Head2 Inter6 Inter7 Inter8 Inter9</combined>
</data>
I tried to get all values of "B"
for value in mod.getiterator(tag='B'):
    print (value.text)
Head1
Head2
for value in mod.getiterator(tag='I'):
    print (value.text)
Inter1
Inter2
Inter3
Inter4
Inter5
Inter6
Inter7
Inter8
Inter9
Now, how should I save the values from the first group in one tag and those from the second group in a different tag? That is, how do I make the iteration start at a "B" tag, collect all the "I" tags that follow it, and then start again in a new tag when I reach the next "B" tag?
An "O" tag will always be present at the end.

You can use ElementTree module from xml.etree:
from xml.etree import ElementTree

struct = """
<data>
{}
</data>
"""

def reformat(tree):
    root = tree.getroot()
    seen = []
    for neighbor in root.iter('data'):
        # iterate over the element directly (getchildren was removed in Python 3.9)
        for child in neighbor:
            tx = child.text
            if tx == ',':
                # a comma closes the current block
                yield "<combined>{}</combined>".format(' '.join(seen))
                seen = []
            else:
                seen.append(tx)

with open('test.xml') as f:
    tree = ElementTree.parse(f)
print(struct.format(',\n'.join(reformat(tree))))
result:
<data>
<combined>Head1 Inter1 Inter2 Inter3 Inter4 Inter5</combined>,
<combined>Head2 Inter6 Inter7 Inter8 Inter9</combined>
</data>
Note that if you're not sure all the blocks are separated by a comma, you can change the condition if tx == ',': to match your file format. Alternatively, check whether tx starts with 'Head': if it does and seen is not empty, yield the contents of seen and start a new block with tx; otherwise append tx and continue.
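The header-based variant described above can also key on the tag name instead of the text. Below is a rough sketch that builds a new element tree rather than formatting strings; it assumes every B and I element has text, and it drops the O separators entirely, so the commas from the desired output are not reproduced. The function name combine and the shortened sample are mine.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same B / I / O layout as the question.
DATA = """<data>
<B>Head1</B>
<I>Inter1</I>
<I>Inter2</I>
<O>,</O>
<B>Head2</B>
<I>Inter3</I>
<O>,</O>
<O> </O>
</data>"""


def combine(root):
    """Build a new <data> element with one <combined> child per B-led group."""
    out = ET.Element("data")
    current = None
    for child in root:
        if child.tag == "B":
            # A <B> element opens a new group.
            current = ET.SubElement(out, "combined")
            current.text = child.text
        elif child.tag == "I" and current is not None:
            # An <I> element extends the group opened by the most recent <B>.
            current.text += " " + child.text
        # <O> separators are ignored.
    return out


result = combine(ET.fromstring(DATA))
print(ET.tostring(result, encoding="unicode"))
```

Building elements instead of strings means special characters in the text are escaped for you when the tree is serialized.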

Remove XML node if childnode's childnode contains specific value

I need to filter an XML file for certain values, if the node contains this value, the node should be removed.
<?xml version="1.0" encoding="utf-8" ?>
<ogr:FeatureCollection
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ogr.maptools.org/ TZwards.xsd"
xmlns:ogr="http://ogr.maptools.org/"
xmlns:gml="http://www.opengis.net/gml">
<gml:boundedBy></gml:boundedBy>
<gml:featureMember>
<ogr:TZwards fid="F0">
<ogr:Region_Nam>TARGET</ogr:Region_Nam>
<ogr:District_N>Kondoa</ogr:District_N>
<ogr:Ward_Name>Bumbuta</ogr:Ward_Name>
</ogr:TZwards>
</gml:featureMember>
<gml:featureMember>
<ogr:TZwards fid="F1">
<ogr:Region_Nam>REMOVE</ogr:Region_Nam>
<ogr:District_N>Kondoa</ogr:District_N>
<ogr:Ward_Name>Pahi</ogr:Ward_Name>
</ogr:TZwards>
</gml:featureMember>
</ogr:FeatureCollection>
The Python script should keep the <gml:featureMember> node if the <ogr:Region_Nam> contains TARGET and remove all other nodes.
from xml.dom import minidom
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml').getroot()
removeList = list()
for child in tree.iter('gml:featureMember'):
    if child.tag == 'ogr:TZwards':
        name = child.find('ogr:Region_Nam').text
        if (name == 'TARGET'):
            removeList.append(child)
for tag in removeList:
    parent = tree.find('ogr:TZwards')
    parent.remove(tag)
out = ET.ElementTree(tree)
out.write(outputfilepath)
Desired output:
<?xml version="1.0" encoding="utf-8" ?>
<ogr:FeatureCollection>
<gml:boundedBy></gml:boundedBy>
<gml:featureMember>
<ogr:TZwards fid="F0">
<ogr:Region_Nam>TARGET</ogr:Region_Nam>
<ogr:District_N>Kondoa</ogr:District_N>
<ogr:Ward_Name>Bumbuta</ogr:Ward_Name>
</ogr:TZwards>
</gml:featureMember>
</ogr:FeatureCollection>
My output still contains all nodes.

You need to declare the namespaces in the Python code:
import xml.etree.ElementTree as ET

tree = ET.parse('/tmp/input.xml').getroot()
namespaces = {'gml': 'http://www.opengis.net/gml', 'ogr': 'http://ogr.maptools.org/'}
# findall returns a list, so removing children inside the loop is safe
for child in tree.findall('gml:featureMember', namespaces=namespaces):
    ward = child.find('ogr:TZwards', namespaces=namespaces)
    # find returns None when nothing matches, so check before calling len
    if ward is not None and len(ward):
        name = ward.find('ogr:Region_Nam', namespaces=namespaces).text
        if name != 'TARGET':
            tree.remove(child)
out = ET.ElementTree(tree)
out.write("/tmp/out.xml")
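One thing to watch with the code above: when ElementTree writes the file, it replaces the original prefixes with generated ones such as ns0 and ns1 unless they are registered first. Here is a self-contained sketch of the same filter on an in-memory, shortened copy of the document, with the prefixes registered. The helper keep_region is hypothetical, and unlike the code above it also removes members that have no Region_Nam at all.

```python
import xml.etree.ElementTree as ET

NS = {"gml": "http://www.opengis.net/gml", "ogr": "http://ogr.maptools.org/"}

# Without this, ElementTree serializes with generated prefixes such as ns0 and ns1.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

# Shortened copy of the document from the question.
DATA = """<ogr:FeatureCollection
    xmlns:ogr="http://ogr.maptools.org/"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <ogr:TZwards fid="F0"><ogr:Region_Nam>TARGET</ogr:Region_Nam></ogr:TZwards>
  </gml:featureMember>
  <gml:featureMember>
    <ogr:TZwards fid="F1"><ogr:Region_Nam>REMOVE</ogr:Region_Nam></ogr:TZwards>
  </gml:featureMember>
</ogr:FeatureCollection>"""


def keep_region(root, region):
    """Remove every featureMember whose Region_Nam is not region; return how many were removed."""
    removed = 0
    # findall returns a list, so removing children while looping is safe.
    for member in root.findall("gml:featureMember", NS):
        if member.findtext("ogr:TZwards/ogr:Region_Nam", namespaces=NS) != region:
            root.remove(member)
            removed += 1
    return removed


root = ET.fromstring(DATA)
removed = keep_region(root, "TARGET")
print(removed)
print(ET.tostring(root, encoding="unicode"))
```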

Parsing subelements with elementTree

I have the following in an XML file, which I parse using ET.parse:
<VIAFCluster xmlns="http://viaf.org/viaf/terms#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:void="http://rdfs.org/ns/void#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<viafID>15</viafID>
<nameType>Personal</nameType>
</VIAFCluster>
<mainHeadings>
<data>
<text>
Gondrin de Pardaillan de Montespan, Louis-Antoine de, 1665-1736
</text>
</data>
</mainHeadings>
and I want to parse it as:
[15, "Personal", "Gondrin etc."]
I can't seem to print any of the string information with:
import xml.etree.ElementTree as ET
tree = ET.parse('/Users/user/Documents/work/oneline.xml')
root = tree.getroot()
for node in tree.iter():
    name = node.find('nameType')
    print(name)
as it appears as 'None' ... what am I doing wrong?

I'm still not sure exactly what you want to do, but hopefully running the code below will help get you on your way. Iterating through all the elements lets you see what's going on, and you can pick up the pieces you want as you come to them:
import xml.etree.ElementTree as et
xml = '''
<VIAFCluster xmlns="http://viaf.org/viaf/terms#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<viafID>15</viafID>
<nameType>Personal</nameType>
<mainHeadings>
<data>
<text>
Gondrin de Pardaillan de Montespan, Louis-Antoine de, 1665-1736
</text>
</data>
</mainHeadings>
</VIAFCluster>
'''
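tree = et.fromstring(xml)
lst = []
# iter replaces getiterator, which was removed in Python 3.9
for i in tree.iter():
    # text is None for empty elements, so fall back to an empty string
    t = (i.text or '').strip()
    if t:
        lst.append(t)
        print(i.tag)
        print(t)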
You will end up with a list as you wanted. I had to clean up your xml because you had more than one top level element, which is a no-no. Maybe that was your problem all along.
good luck, Mike
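For what it's worth, the reason find('nameType') returned None in the question is the default namespace: the parsed tag is really {http://viaf.org/viaf/terms#}nameType. If you only want the three values rather than every text node, a namespace-aware lookup gets them directly. This is a minimal sketch on the cleaned-up, single-root XML; the prefix v and the function name summarize are my own.

```python
import xml.etree.ElementTree as ET

# Cleaned-up document with a single root element.
DATA = """<VIAFCluster xmlns="http://viaf.org/viaf/terms#">
  <viafID>15</viafID>
  <nameType>Personal</nameType>
  <mainHeadings>
    <data>
      <text>
        Gondrin de Pardaillan de Montespan, Louis-Antoine de, 1665-1736
      </text>
    </data>
  </mainHeadings>
</VIAFCluster>"""

# Map a local prefix to the document's default namespace.
NS = {"v": "http://viaf.org/viaf/terms#"}


def summarize(root):
    """Return [viafID as int, nameType, first main heading text]."""
    viaf_id = int(root.findtext("v:viafID", namespaces=NS))
    name_type = root.findtext("v:nameType", namespaces=NS)
    heading = root.findtext("v:mainHeadings/v:data/v:text", default="", namespaces=NS)
    return [viaf_id, name_type, heading.strip()]


root = ET.fromstring(DATA)
print(summarize(root))
```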
