Generate XML Document in Python 3 using Namespaces and ElementTree - python

I am having problems generating an XML document using the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate an XML document only by adding the namespace to each element, like a=Element("{full_namespace_URI}element_name"), which is tedious.
How do I set up a default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")

I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") will make that the default namespace in the document, assign it to topNode, and not prefix your tag names.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
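If the child elements should also carry the namespace in the tree itself (not only in the serialized text), one option is a small helper that builds the qualified tag, so the URI is written only once. This is a minimal sketch: the helper name qname is hypothetical and not part of ElementTree, and the URN is the one from the question.

```python
from xml.etree import ElementTree as ET

NS = "urn:dslforum-org:service-1-0"

# Register the URI with an empty prefix so it is written as xmlns="...".
# Note: the registry is global for the whole process.
ET.register_namespace("", NS)


def qname(tag):
    """Hypothetical helper: return '{NS}tag' for the single namespace used here."""
    return f"{{{NS}}}{tag}"


root = ET.Element(qname("topNode"))
child = ET.SubElement(root, qname("childNode"))
child.text = "content"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

With every element qualified this way, the serializer should emit a single xmlns declaration on the root and no ns0: prefixes.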

Related

store content of a tag in a string with elementtree in python3

I'm using Python 3.7.2 and elementtree to copy the content of a tag in an XML file.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.003.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu</MsgId>
<CreDtTm>2016-11-10T12:00:00.000+01:00</CreDtTm>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>6</CtrlSum>
<InitgPty>
<Nm>TC 03000 Kunde 55 Protokollrückführung</Nm>
</InitgPty>
</GrpHdr>
</CstmrCdtTrfInitn>
</Document>
I want to copy the content of the 'MsgId' tag and save it as a string.
I've managed to do this with minidom before, but due to new circumstances, I have to settle for elementtree for now.
This is that code with minidom:
import xml.dom.minidom
dom = xml.dom.minidom.parse('H:\\app_python/in_spsh/{}'.format(filename_string))
message = dom.getElementsByTagName('MsgId')
for MsgId in message:
    print(MsgId.firstChild.nodeValue)
Now I want to do the exact same thing with elementtree. How can I achieve this?
To get the text value of a single element, you can use the findtext() method. The namespace needs to be taken into account.
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml") # Your XML document
msgid = tree.findtext('.//{urn:iso:std:iso:20022:tech:xsd:pain.001.003.03}MsgId')
With Python 3.8 and later, it is possible to use a wildcard for the namespace:
msgid = tree.findtext('.//{*}MsgId')
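As an alternative to spelling out the URI in braces, findtext() also accepts a prefix-to-URI mapping. The sketch below is only an illustration: it uses a shortened inline copy of the document from the question instead of a file, and the prefix p is an arbitrary choice.

```python
from xml.etree import ElementTree as ET

# Shortened inline copy of the question's document (assumption: same structure).
XML = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.003.03">
  <CstmrCdtTrfInitn>
    <GrpHdr>
      <MsgId>nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu</MsgId>
    </GrpHdr>
  </CstmrCdtTrfInitn>
</Document>"""

root = ET.fromstring(XML)

# "p" is an arbitrary prefix; only the URI has to match the document.
ns = {"p": "urn:iso:std:iso:20022:tech:xsd:pain.001.003.03"}

msgid = root.findtext(".//p:MsgId", namespaces=ns)

# A default value avoids getting None when the element is absent.
missing = root.findtext(".//p:NoSuchTag", default="", namespaces=ns)

print(msgid)
```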

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that reads an XML file and changes it. But when I try to access any part of it, the tag comes back with an extra string attached.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). Once defined, it applies to that element and to every unprefixed descendant, so lxml reports each tag as {namespace-URI}name.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
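The slicing above can be wrapped in a small function that also leaves un-namespaced tags alone. This is a sketch under assumptions: the name local_name is hypothetical, and the standard-library xml.etree.ElementTree is used here only so the example runs without third-party packages; it reports tags in the same {uri}name form as lxml.

```python
from xml.etree import ElementTree as ET


def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' part, if there is one."""
    return tag.rpartition("}")[2]


# Shortened inline copy of the question's package.xml.
XML = (
    '<Package xmlns="http://soap.sforce.com/2006/04/metadata">'
    "<types><members>sbaa__ApprovalChain__c.ExternalID__c</members></types>"
    "</Package>"
)

root = ET.fromstring(XML)
tag = root[0][0].tag

print(tag)
print(local_name(tag))
```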

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in the second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from the second line (keeping the rest of the text), I am able to get the accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
I asked a related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion is caused by the following default namespace (a namespace declared without a prefix):
xmlns="http://schemas.datacontract.org/Storage"
Note that descendant elements without a prefix implicitly inherit the default namespace from their ancestor. To reference an element in a namespace, map a prefix to the namespace URI and use that prefix in your XPath:
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
    print(sensor_val.find('d:accelx', ns).text)
    print(sensor_val.find('d:accely', ns).text)
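The same prefix mapping can be used to collect the values as numbers rather than printing them. The following is a sketch only: it uses the standard-library ElementTree (which accepts the same mapping) and a trimmed inline copy of the file from the question; the variable name readings is made up for the illustration.

```python
from xml.etree import ElementTree as ET

# Trimmed inline copy of the question's file (assumption: same structure).
XML = """<ArrayOfRecord xmlns="http://schemas.datacontract.org/Storage">
  <Values>
    <SensorValue><accelx>-3.593643144</accelx><accely>7.316485176</accely></SensorValue>
    <SensorValue><accelx>0.31103436</accelx><accely>7.70408184</accely></SensorValue>
  </Values>
</ArrayOfRecord>"""

ns = {"d": "http://schemas.datacontract.org/Storage"}
root = ET.fromstring(XML)

# One (accelx, accely) tuple per SensorValue element.
readings = [
    (
        float(sensor.findtext("d:accelx", namespaces=ns)),
        float(sensor.findtext("d:accely", namespaces=ns)),
    )
    for sensor in root.findall("./d:Values/d:SensorValue", ns)
]

print(readings)
```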

Python module xml.etree.ElementTree modifies xml namespace keys automatically

I've noticed that Python's ElementTree module changes the XML data in the following simple example:
import xml.etree.ElementTree as ET
tree = ET.parse("./input.xml")
tree.write("./output.xml")
I wouldn't expect it to change, since this is a plain read and write without any modification. However, the result tells a different story, especially in the namespace prefixes (no prefix --> ns0, d3p1 --> ns1, i --> ns2):
input.xml:
<?xml version="1.0" encoding="utf-8"?>
<ServerData xmlns:i="http://www.a.org" xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">
<CreationDate>0001-01-01T00:00:00</CreationDate>
<Processes>
<Processes xmlns:d3p1="http://schemas.datacontract.org/2004/07/Management.Interfaces">
<d3p1:ProtectedProcess>
<d3p1:Description>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Description>
<d3p1:DiscoveredMachine i:nil="true" />
<d3p1:Id>0</d3p1:Id>
<d3p1:Name>/applications/safari.app/contents/macos/safari</d3p1:Name>
<d3p1:Path>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Path>
<d3p1:ProcessHashes xmlns:d5p1="http://schemas.datacontract.org/2004/07/Management.Interfaces.WildFire" />
<d3p1:Status>1</d3p1:Status>
<d3p1:Type>Protected</d3p1:Type>
</d3p1:ProtectedProcess>
</Processes>
</Processes>
</ServerData>
and output.xml:
<ns0:ServerData xmlns:ns0="http://schemas.xxx/2004/07/Server.Facades.ImportExport" xmlns:ns1="http://schemas.datacontract.org/2004/07/Management.Interfaces" xmlns:ns2="http://www.a.org">
<ns0:CreationDate>0001-01-01T00:00:00</ns0:CreationDate>
<ns0:Processes>
<ns0:Processes>
<ns1:ProtectedProcess>
<ns1:Description>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Description>
<ns1:DiscoveredMachine ns2:nil="true" />
<ns1:Id>0</ns1:Id>
<ns1:Name>/applications/safari.app/contents/macos/safari</ns1:Name>
<ns1:Path>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Path>
<ns1:ProcessHashes />
<ns1:Status>1</ns1:Status>
<ns1:Type>Protected</ns1:Type>
</ns1:ProtectedProcess>
</ns0:Processes>
</ns0:Processes>
</ns0:ServerData>
Before reading or writing the XML, you need to register each namespace, together with its prefix, using the ElementTree.register_namespace function. Example:
import xml.etree.ElementTree as ET
ET.register_namespace('','http://schemas.xxx/2004/07/Server.Facades.ImportExport')
ET.register_namespace('i','http://www.a.org')
ET.register_namespace('d3p1','http://schemas.datacontract.org/2004/07/Management.Interfaces')
tree = ET.parse("./input.xml")
tree.write("./output.xml")
Without this, ElementTree generates its own prefixes for the namespaces, which is what happened in your case.
The documentation describes this:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. prefix is a namespace prefix. uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
(Emphasis mine)
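When the prefixes are not known in advance, they can be read from the input itself: iterparse() reports each namespace declaration as a start-ns event. The sketch below rests on a few assumptions: the function name register_all_namespaces is hypothetical, the document is a shortened inline stand-in for input.xml, and no prefix in the file looks like ns0, ns1, ... (register_namespace rejects those reserved names with ValueError).

```python
import io
from xml.etree import ElementTree as ET


def register_all_namespaces(source):
    """Hypothetical helper: register every prefix/URI pair declared in source.

    source may be a file name or a binary file object. If one prefix is bound
    to several URIs in the document, the last declaration wins, because the
    registry is global and keeps one mapping per prefix.
    """
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        ET.register_namespace(prefix, uri)


# Shortened stand-in for input.xml.
XML = (
    b'<ServerData xmlns:i="http://www.a.org" '
    b'xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">'
    b'<Item i:nil="true" />'
    b"</ServerData>"
)

register_all_namespaces(io.BytesIO(XML))
root = ET.parse(io.BytesIO(XML)).getroot()
out = ET.tostring(root, encoding="unicode")
print(out)
```

The file is read twice here, once to collect the declarations and once to build the tree, which keeps the sketch short at the cost of a second pass.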

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.
I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to strip the namespace from the root tag:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
def xmlns(tag):
    # Split on '{' and drop everything up to the matching '}' in each piece
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI in braces
print(xmlns(root.tag))  # root tag without the namespace
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
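The same idea can be applied to the whole tree, so that later find() and findall() calls need no namespace handling at all. This is a minimal sketch, not a general-purpose tool: the function name strip_namespaces is borrowed from the question, the inline document is a trimmed copy of sample.xml, and it assumes that no two names become identical once their namespaces are removed.

```python
from xml.etree import ElementTree as ET


def strip_namespaces(root):
    """Remove the '{uri}' part from every tag and attribute name, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str):  # skip comments / processing instructions
            elem.tag = elem.tag.rpartition("}")[2]
        for name in list(elem.attrib):
            local = name.rpartition("}")[2]
            if local != name:
                elem.attrib[local] = elem.attrib.pop(name)
    return root


# Trimmed inline copy of sample.xml.
XML = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
  <gnc:bill>
    <gnc:billAccountNumber>1234</gnc:billAccountNumber>
  </gnc:bill>
</gnc:prodinfo>"""

root = strip_namespaces(ET.fromstring(XML))
account = root.findtext("./bill/billAccountNumber")

print(root.tag)
print(account)
```

Stripping namespaces throws information away, so this suits read-only extraction better than round-tripping a file back to disk.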
