Python module xml.etree.ElementTree modifies xml namespace keys automatically

I've noticed that Python's ElementTree module changes the XML data in the following simple example:
import xml.etree.ElementTree as ET
tree = ET.parse("./input.xml")
tree.write("./output.xml")
I wouldn't expect the file to change, since this is a plain read-and-write round trip with no modifications. However, the result tells a different story, especially in the namespace prefixes (default namespace --> ns0, d3p1 --> ns1, i --> ns2):
input.xml:
<?xml version="1.0" encoding="utf-8"?>
<ServerData xmlns:i="http://www.a.org" xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">
<CreationDate>0001-01-01T00:00:00</CreationDate>
<Processes>
<Processes xmlns:d3p1="http://schemas.datacontract.org/2004/07/Management.Interfaces">
<d3p1:ProtectedProcess>
<d3p1:Description>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Description>
<d3p1:DiscoveredMachine i:nil="true" />
<d3p1:Id>0</d3p1:Id>
<d3p1:Name>/applications/safari.app/contents/macos/safari</d3p1:Name>
<d3p1:Path>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Path>
<d3p1:ProcessHashes xmlns:d5p1="http://schemas.datacontract.org/2004/07/Management.Interfaces.WildFire" />
<d3p1:Status>1</d3p1:Status>
<d3p1:Type>Protected</d3p1:Type>
</d3p1:ProtectedProcess>
</Processes>
</Processes>
and output.xml:
<ns0:ServerData xmlns:ns0="http://schemas.xxx/2004/07/Server.Facades.ImportExport" xmlns:ns1="http://schemas.datacontract.org/2004/07/Management.Interfaces" xmlns:ns2="http://www.a.org">
<ns0:CreationDate>0001-01-01T00:00:00</ns0:CreationDate>
<ns0:Processes>
<ns0:Processes>
<ns1:ProtectedProcess>
<ns1:Description>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Description>
<ns1:DiscoveredMachine ns2:nil="true" />
<ns1:Id>0</ns1:Id>
<ns1:Name>/applications/safari.app/contents/macos/safari</ns1:Name>
<ns1:Path>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Path>
<ns1:ProcessHashes />
<ns1:Status>1</ns1:Status>
<ns1:Type>Protected</ns1:Type>
</ns1:ProtectedProcess>
</ns0:Processes>
</ns0:Processes>

You need to register the namespaces used in your XML, together with their prefixes, before reading or writing the file with ElementTree. Use the ET.register_namespace function for this. Example:
import xml.etree.ElementTree as ET
ET.register_namespace('','http://schemas.xxx/2004/07/Server.Facades.ImportExport')
ET.register_namespace('i','http://www.a.org')
ET.register_namespace('d3p1','http://schemas.datacontract.org/2004/07/Management.Interfaces')
tree = ET.parse("./input.xml")
tree.write("./output.xml")
Without this, ElementTree creates its own prefixes for the corresponding namespaces, which is what happens in your case.
This is described in the documentation:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. prefix is a namespace prefix. uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
(Emphasis mine)
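If the prefixes are not known in advance, they can be collected from the file itself. The following is a minimal sketch and not part of the original answer: it assumes that reading the input twice is acceptable, uses a made-up sample document in place of input.xml, and the helper name register_document_namespaces is hypothetical. Because the registry is global and any existing mapping for a prefix or URI is replaced, a document that reuses one prefix for several URIs would only keep the last mapping.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for input.xml (URIs are placeholders).
SAMPLE = b"""<ServerData xmlns:i="http://www.a.org" xmlns="http://schemas.example/Server">
  <Processes xmlns:d3p1="http://schemas.example/Interfaces">
    <d3p1:ProtectedProcess i:nil="true" />
  </Processes>
</ServerData>"""


def register_document_namespaces(source):
    """Register every prefix declared in `source`; return {prefix: uri}."""
    declared = {}
    # The "start-ns" event yields a (prefix, uri) pair for each xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        declared[prefix] = uri
        ET.register_namespace(prefix, uri)
    return declared


declared = register_document_namespaces(io.BytesIO(SAMPLE))

# Round trip: parse again and serialize with the registered prefixes.
tree = ET.parse(io.BytesIO(SAMPLE))
out = io.BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True)
result = out.getvalue().decode("utf-8")
print(result)
```

With a real file, the same path would be passed to both register_document_namespaces and ET.parse. Note that ElementTree still moves all xmlns declarations to the root element on output.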

Related

Generate XML Document in Python 3 using Namespaces and ElementTree

I am having problems generating an XML document using the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate an XML document only by adding the namespace to each element, like a=Element("{full_namespace_URI}element_name"), which seems tedious.
How do I set up the default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")

I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") call makes that URI the default namespace of the serialized document, so topNode is written with an xmlns declaration and the tag names carry no prefix.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
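To avoid typing the URI for every element, a small helper can build the qualified names. This is only a sketch: the helper name qualified is hypothetical, and unlike the snippet above it places childNode in the namespace as well, so that the in-memory tree matches what a parser would read back from the serialized document.

```python
import xml.etree.ElementTree as ET

NS = "urn:dslforum-org:service-1-0"
ET.register_namespace("", NS)  # serialize this URI as the default namespace


def qualified(tag):
    """Return the tag in ElementTree's {uri}local notation."""
    return f"{{{NS}}}{tag}"


top = ET.Element(qualified("topNode"))
child = ET.SubElement(top, qualified("childNode"))
child.text = "content"

xml_text = ET.tostring(top, encoding="unicode")
print(xml_text)
```

ElementTree has no notion of a "current" namespace when building elements, so every tag still needs its URI; the helper only hides the repetition.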

Parse XML with Python resolving an external ENTITY reference

In my S1000D xml, it specifies a DOCTYPE with a reference to a public URL that contains references to a number of other files that contain all the valid character entities. I've used xml.etree.ElementTree and lxml to try to parse it and get a parse error with both indicating:
undefined entity &minus;: line 82, column 652
Even though &minus; is a valid entity according to the ENTITY reference specified.
The top of the XML is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
If you go out and get http://www.s1000d.org/S1000D_4-1/ent/ISOEntities, it will include 20 other ent files with one called iso-tech.ent which contains the line:
<!ENTITY minus "−"> <!-- MINUS SIGN -->
Line 82 of the XML file, near column 652, contains the following:
....Refer to 70&minus;41....
How can I run a Python script to parse this file without getting the undefined entity error?
Sorry, I don't want to specify parser.entity['minus'] = chr(0x2212), for example. I did that as a quick fix, but there are many character entity references.
I would like the parser to check Entity reference that is specified in the xml.
I'm surprised but I've gone around the sun and back and haven't found how to do this (or maybe I have but couldn't follow it).
If I update my XML file and add
<!ENTITY minus "−">
it won't fail, so it's not the XML.
It fails on the parse. Here's code I use for ElementTree
import os
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
fl = os.path.join(pth, fn)
try:
    root = ET.parse(fl)
except ParseError as p:
    print("ParseError : ", p)
Here's the code I use for lxml
import os
from lxml import etree
fl = os.path.join(pth, fn)
try:
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    root = etree.parse(fl, parser=parser)
except etree.XMLSyntaxError as pe:
    print("lxml XMLSyntaxError: ", pe)
I would like the parser to load the ENTITY reference so that it knows that &minus; and all the other character entities specified in all the files are valid entity characters.
Thank you so much for your advice and help.

I'm going to answer for lxml. No reason to consider ElementTree if you can use lxml.
I think the piece you're missing is no_network=False in the XMLParser; it's True by default.
Example...
XML Input (test.xml)
<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
<test>Here's a test of minus: &minus;</test>
</doc>
Python
from lxml import etree
parser = etree.XMLParser(load_dtd=True,
                         no_network=False)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: −</test>
</doc>
If you wanted the entity reference retained, add resolve_entities=False to the XMLParser.
Also, instead of going out to an external location to resolve the parameter entity, consider setting up an XML Catalog. This will let you resolve public and/or system identifiers to local versions.
Example using same XML input above...
XML Catalog ("catalog.xml" in the directory "catalog test" (space used in directory name for testing))
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- The path in @uri is relative to this file (catalog.xml). -->
<uri name="http://www.s1000d.org/S1000D_4-1/ent/ISOEntities" uri="./ents/ISOEntities_stackoverflow.ent"/>
</catalog>
Entity File ("ISOEntities_stackoverflow.ent" in the directory "catalog test/ents". Changed the value to "BAM!" for testing)
<!ENTITY minus "BAM!">
Python (Changed no_network to True for additional evidence that the local version of http://www.s1000d.org/S1000D_4-1/ent/ISOEntities is being used.)
import os
from urllib.request import pathname2url
from lxml import etree
# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
try:
    xcf_env = os.environ['XML_CATALOG_FILES']
except KeyError:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog test/catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path
parser = etree.XMLParser(load_dtd=True,
                         no_network=True)
tree = etree.parse("test.xml", parser=parser)
etree.dump(tree.getroot())
Output
<doc>
<test>Here's a test of minus: BAM!</test>
</doc>
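The catalog example above needs the catalog path as a URL, including the escaped space in the directory name. As an alternative sketch, pathlib can build that URL; this assumes libxml2 accepts the file:/// form that Path.as_uri() produces, and that the variable is set before the first document is parsed. The directory layout is the hypothetical one used above.

```python
import os
from pathlib import Path

# Hypothetical location, matching the "catalog test" directory used above.
catalog_file = Path.cwd() / "catalog test" / "catalog.xml"

# as_uri() needs an absolute path and percent-encodes the space.
catalog_url = catalog_file.as_uri()

# Only set the variable if the environment does not already define it.
os.environ.setdefault("XML_CATALOG_FILES", catalog_url)
print(catalog_url)
```

Using setdefault keeps any catalog the user has already configured, which mirrors the try/except KeyError logic in the answer.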

ElementTree find() always returns None

I'm using ElementTree with Python to parse an XML file to find the contents of a subchild
This is the XML file I'm trying to parse:
<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://nvd.nist.gov/feeds/cve/1.2" nvd_xml_version="1.2" pub_date="2016-02-10" xsi:schemaLocation="http://nvd.nist.gov/feeds/cve/1.2 http://nvd.nist.gov/schema/nvdcve_1.2.1.xsd">
<entry type="CVE" name="CVE-1999-0001" seq="1999-0001" published="1999-12-30" modified="2010-12-16" severity="Medium" CVSS_version="2.0" CVSS_score="5.0" CVSS_base_score="5.0" CVSS_impact_subscore="2.9" CVSS_exploit_subscore="10.0" CVSS_vector="(AV:N/AC:L/Au:N/C:N/I:N/A:P)">
<desc>
<descript source="cve">ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.</descript>
</desc>
<loss_types>
<avail/>
</loss_types>
<range>
<network/>
</range>
<refs>
<ref source="OSVDB" url="http://www.osvdb.org/5707">5707</ref>
<ref source="CONFIRM" url="http://www.openbsd.org/errata23.html#tcpfix">http://www.openbsd.org/errata23.html#tcpfix</ref>
</refs>
This is my code:
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    tree = ET.parse('nvdcve-modified.xml')
    root = tree.getroot()
    print(root.find('entry'))
    print(root[0].find('desc'))
The output for both lines is None.

Your XML has a default namespace defined at the root element level:
xmlns="http://nvd.nist.gov/feeds/cve/1.2"
Descendant elements without a prefix implicitly inherit the ancestor's default namespace. To find an element in a namespace, you can map a prefix to the namespace URI and use the prefix like so:
ns = {'d': 'http://nvd.nist.gov/feeds/cve/1.2'}
root.find('d:entry', ns)
or use the namespace URI directly:
root.find('{http://nvd.nist.gov/feeds/cve/1.2}entry')
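A self-contained sketch of the difference, using a shortened, made-up version of the feed (the description text is a placeholder). Note that the prefix has to be repeated on every step of a path, because each descendant is in the default namespace too.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for nvdcve-modified.xml.
SAMPLE = """<nvd xmlns="http://nvd.nist.gov/feeds/cve/1.2">
  <entry name="CVE-1999-0001">
    <desc><descript source="cve">example description</descript></desc>
  </entry>
</nvd>"""

root = ET.fromstring(SAMPLE)
ns = {"d": "http://nvd.nist.gov/feeds/cve/1.2"}

unqualified = root.find("entry")          # no namespace given: nothing matches
entry = root.find("d:entry", ns)          # prefix mapped to the default namespace
descript = root.find("d:entry/d:desc/d:descript", ns)

print(unqualified)
print(entry.get("name"))
print(descript.text)
```

The prefix d is arbitrary; only the URI has to match the document. On Python 3.8 and later, a wildcard such as root.find('{*}entry') should also match regardless of namespace.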

Remove namespace with xmltodict in Python

xmltodict converts XML to a Python dictionary. It supports namespaces. I can follow the example on the homepage and successfully remove a namespace. However, I cannot remove the namespace from my own XML and cannot work out why. Here is my XML:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
mystatus:field1="data1"
mystatus:field2="data2" />
<section2
mystatus:lineA="outputA"
mystatus:lineB="outputB" />
</status>
And using:
xmltodict.parse(xml,process_namespaces=True,namespaces={'http://localhost/mystatus':None})
I get:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@http://localhost/mystatus:field1', u'data1'), (u'@http://localhost/mystatus:field2', u'data2')])), (u'section2', OrderedDict([(u'@http://localhost/mystatus:lineA', u'outputA'), (u'@http://localhost/mystatus:lineB', u'outputB')]))]))])
instead of:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'field1', u'data1'), (u'field2', u'data2')])), (u'section2', OrderedDict([(u'lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Am I making some simple mistake, or is there something about my XML that prevents the process_namespace modification from working correctly?

xmltodict is based on expat, so namespaces should be applied to the element name, not to the attribute names:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<mystatus:section1 field1="data1" field2="data2" />
<mystatus:section2 lineA="outputA" lineB="outputB" />
</status>
When parsed with:
foo = xmltodict.parse(xml,
process_namespaces=True,
namespaces={'http://localhost/mystatus':None})
outputs:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@field1', u'data1'), (u'@field2', u'data2')])), (u'section2', OrderedDict([(u'@lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Accessing it is easy:
# Get attribute 'lineA' of element 'section2' inside element 'status'
>>> foo.get('status').get('section2').get('@lineA')
u'outputA'
Attribute namespaces are only required when you have multiple attributes of the same name (e.g. multiple ids or multiple prices); in that case, I couldn't get expat or xmltodict to parse them correctly. YMMV though.
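If the original XML cannot be changed, one alternative is to strip the namespace part from the attribute names yourself. The sketch below does not use xmltodict at all; it swaps in the standard library's ElementTree, and the helper name local_name is hypothetical. It relies on ElementTree reporting namespaced names as {uri}local.

```python
import xml.etree.ElementTree as ET

XML = """<status xmlns:mystatus="http://localhost/mystatus">
  <section1 mystatus:field1="data1" mystatus:field2="data2" />
  <section2 mystatus:lineA="outputA" mystatus:lineB="outputB" />
</status>"""


def local_name(qname):
    """Drop a leading {uri} part, if any, from an ElementTree name."""
    return qname.rsplit("}", 1)[-1]


root = ET.fromstring(XML)

# One dict of plain attribute names per child element.
status = {
    local_name(child.tag): {local_name(k): v for k, v in child.attrib.items()}
    for child in root
}
print(status)
```

This discards the namespace entirely, so two attributes that differ only by namespace would collide; for the document above that is not a concern.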

How to use register_namespace multiple times for same URL with different "anchor tag"?

I am updating an xml file and want to preserve multiple namespaces with same URI but different anchor tag using ET.register_namespace
Following code is what I've tried :
import copy
import xml.etree.ElementTree as ET
ET.register_namespace('', "http://oval.mitre.org/XMLSchema/oval-definitions-5")
ET.register_namespace('', "http://oval.mitre.org/XMLSchema/oval-definitions-5#windows")
ET.register_namespace('', "http://oval.mitre.org/XMLSchema/oval-definitions-5#independent")
ns = "{http://oval.mitre.org/XMLSchema/oval-definitions-5}"
f = open("def_ex.xml", "r")
tree = ET.parse(f)
root = tree.getroot()
for defn in root.iter('%stag' % ns):
    if "patch" in defn.get("class"):  # pick id attrib where class attrib is "patch"
        print(defn.get("id"))
        mirr_def = copy.deepcopy(defn)
        defn.append(mirr_def)
tree.write("def_ex.xml")
exit()
The problem is that the third registration overwrites the first two, as shown in the following output of the code:
<ns0:tag>
.......
.......
</ns0:tag>
<ns1:tag1>
........
........
</ns1:tag1>
<tag2>
......
......
</tag2>
My question is: how do I preserve all the namespaces, without them overwriting each other, when the URIs differ only in their "anchor tag"?
Updated: def_ex.xml
<oval_definitions xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oval="http://oval.mitre.org/XMLSchema/oval-common-5" xmlns:oval-def="http://oval.mitre.org/XMLSchema/oval-definitions-5" xmlns:windows-def="http://oval.mitre.org/XMLSchema/oval-definitions-5#windows" xmlns:independent-def="http://oval.mitre.org/XMLSchema/oval-definitions-5#independent" xsi:schemaLocation=" http://oval.mitre.org/XMLSchema/oval-definitions-5#windows windows-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5#independent independent-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5 oval-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-common-5 oval-common-schema.xsd">
<tag id="oval:def:1" class="inventory">
...........
...........
...........
</tag>
<tag1 xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5#windows" id="oval:tst:1" version="1">
............
............
</tag1>
<tag2 xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5#independent" id="oval:tst:2" version="1">
............
............
</tag2>
</oval_definitions>

As @mu 無 has said, you will not be able to achieve what you want using register_namespace, which explicitly guards against duplicate prefixes.
I am not sure whether what you are trying to do is legal XML or supported by the library, but one way that might achieve what you want is to implement the behaviour of register_namespace directly:
xml.etree.ElementTree._namespace_map[uri] = prefix # Replace uri and prefix.
And as a function (modified from original python library source code):
import re
import xml.etree.ElementTree
def register_namespace(prefix, uri):
    # Raw string so that \d is a regex digit class, not a string escape.
    if re.match(r"ns\d+$", prefix):
        raise ValueError("Prefix format reserved for internal use")
    xml.etree.ElementTree._namespace_map[uri] = prefix
I do not recommend doing this because it could break the library elsewhere in unexpected ways.
Disclaimer: My code has not been tested.

You are using the same prefix to define all the 3 URIs. As mentioned in the docs, the namespace registry is global, and hence the values are being overwritten.
From the docs:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. prefix is a namespace prefix. uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
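I would suggest registering a distinct prefix for each of the URIs, as follows, and using them accordingly. Prefixes of the form ns followed by digits are reserved by ElementTree and raise a ValueError, so the prefixes already declared in your document are used here:
namespaces = {'oval-def': 'http://oval.mitre.org/XMLSchema/oval-definitions-5',
              'windows-def': 'http://oval.mitre.org/XMLSchema/oval-definitions-5#windows',
              'independent-def': 'http://oval.mitre.org/XMLSchema/oval-definitions-5#independent'}
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)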
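A small sketch of how distinct prefixes behave on output. The element names come from the question's def_ex.xml, but the tree is built by hand here rather than parsed, and the prefix names are taken from that file's xmlns declarations. It also shows that a prefix such as ns1 is rejected by register_namespace.

```python
import xml.etree.ElementTree as ET

BASE = "http://oval.mitre.org/XMLSchema/oval-definitions-5"
PREFIXES = {
    "oval-def": BASE,
    "windows-def": BASE + "#windows",
    "independent-def": BASE + "#independent",
}
for prefix, uri in PREFIXES.items():
    ET.register_namespace(prefix, uri)

# Hand-built stand-in for the parsed document.
root = ET.Element(f"{{{BASE}}}oval_definitions")
ET.SubElement(root, f"{{{BASE}#windows}}tag1")
ET.SubElement(root, f"{{{BASE}#independent}}tag2")
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Prefixes matching ns<digits> are reserved for ElementTree's generated names.
try:
    ET.register_namespace("ns1", BASE)
except ValueError as exc:
    reserved_error = str(exc)
    print(reserved_error)
```

Each URI keeps its own prefix because no prefix is registered twice. The output uses prefixed tags instead of nested default-namespace declarations, which is equivalent XML but not byte-identical to the input.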
