ElementTree find() always returns None - python

I'm using ElementTree with Python to parse an XML file and find the contents of a nested child element.
This is the XML file I'm trying to parse:
<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://nvd.nist.gov/feeds/cve/1.2" nvd_xml_version="1.2" pub_date="2016-02-10" xsi:schemaLocation="http://nvd.nist.gov/feeds/cve/1.2 http://nvd.nist.gov/schema/nvdcve_1.2.1.xsd">
<entry type="CVE" name="CVE-1999-0001" seq="1999-0001" published="1999-12-30" modified="2010-12-16" severity="Medium" CVSS_version="2.0" CVSS_score="5.0" CVSS_base_score="5.0" CVSS_impact_subscore="2.9" CVSS_exploit_subscore="10.0" CVSS_vector="(AV:N/AC:L/Au:N/C:N/I:N/A:P)">
<desc>
<descript source="cve">ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.</descript>
</desc>
<loss_types>
<avail/>
</loss_types>
<range>
<network/>
</range>
<refs>
<ref source="OSVDB" url="http://www.osvdb.org/5707">5707</ref>
<ref source="CONFIRM" url="http://www.openbsd.org/errata23.html#tcpfix">http://www.openbsd.org/errata23.html#tcpfix</ref>
</refs>
This is my code:
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    tree = ET.parse('nvdcve-modified.xml')
    root = tree.getroot()
    print(root.find('entry'))
    print(root[0].find('desc'))
The output for both lines is None.

Your XML has a default namespace defined on the root element:
xmlns="http://nvd.nist.gov/feeds/cve/1.2"
Descendant elements without a prefix implicitly inherit the default namespace of their ancestor. To find an element in that namespace, you can map a prefix to the namespace URI and use the prefix like so:
ns = {'d': 'http://nvd.nist.gov/feeds/cve/1.2'}
root.find('d:entry', ns)
or use the namespace URI directly:
root.find('{http://nvd.nist.gov/feeds/cve/1.2}entry')
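Both lookups can be tried on a trimmed copy of the feed. This is only a sketch: the inline XML string stands in for the nvdcve-modified.xml file from the question, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the NVD feed in the question (one entry only).
XML = """<nvd xmlns="http://nvd.nist.gov/feeds/cve/1.2">
  <entry type="CVE" name="CVE-1999-0001">
    <desc>
      <descript source="cve">ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.</descript>
    </desc>
  </entry>
</nvd>"""

root = ET.fromstring(XML)
ns = {'d': 'http://nvd.nist.gov/feeds/cve/1.2'}

# The unqualified name does not match, because <entry> lives in the default namespace.
unqualified = root.find('entry')

# Prefix form: every step of the path needs the prefix.
entry = root.find('d:entry', ns)
descript = entry.find('d:desc/d:descript', ns)

# Clark notation form: the URI is written inline, no mapping needed.
entry_clark = root.find('{http://nvd.nist.gov/feeds/cve/1.2}entry')

print(unqualified)
print(entry.get('name'))
print(descript.text)
```

Note that the prefix d is arbitrary; only the URI it maps to has to match the document.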

Related

Generate XML Document in Python 3 using Namespaces and ElementTree

I am having problems generating an XML document using the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate an XML document only by adding the namespace to each element, like a=Element("{full_namespace_URI}element_name"), which seems tedious.
How do I set up the default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")
I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") will make that the default namespace in the document, assign it to topNode, and not prefix your tag names.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
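If the child elements should also be in the namespace inside the in-memory tree (not only in the serialized text), a small helper can build the qualified names so the URI is typed once. A minimal sketch, assuming Python 3; the helper qn and the constant DSL_NS are made-up names for illustration, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

DSL_NS = "urn:dslforum-org:service-1-0"

# An empty prefix makes this URI the default namespace when serializing.
ET.register_namespace("", DSL_NS)


def qn(tag):
    """Hypothetical helper: return the tag in {namespace-uri}tag form."""
    return "{%s}%s" % (DSL_NS, tag)


top = ET.Element(qn("topNode"))
child = ET.SubElement(top, qn("childNode"))
child.text = "content"

xml_text = ET.tostring(top, encoding="unicode")
print(xml_text)
```

Qualifying the child as well keeps the tree consistent with what a parser would produce when reading the output back, since an unprefixed childNode inside topNode inherits the default namespace.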

Parse attributes of root node having namespace in XML

I have the following XML and want to get the value of an attribute of the root node, namely the xmlns:n1 value. However, I get a KeyError with the code below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Level-1C_Tile_ID xmlns:n1="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd /dpc/app/s2ipf/FORMAT_METADATA_TILE_L1C/02.10.02/scripts/../../../schemas/02.12.05/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd">
<n1:General_Info>
<TILE_ID metadataLevel="Brief">S2A_OPER_MSI_L1C_TL_MTI__20161111T091803_A007252_T43QBA_N02.04</TILE_ID>
<DATASTRIP_ID metadataLevel="Standard">S2A_OPER_MSI_L1C_DS_MTI__20161111T091803_S20161111T053350_N02.04</DATASTRIP_ID>
<DOWNLINK_PRIORITY metadataLevel="Standard">NOMINAL</DOWNLINK_PRIORITY>
<SENSING_TIME metadataLevel="Standard">2016-11-11T05:33:50.068Z</SENSING_TIME>
<Archiving_Info metadataLevel="Expertise">
<ARCHIVING_CENTRE>MTI_</ARCHIVING_CENTRE>
<ARCHIVING_TIME>2016-11-11T10:53:22.600451Z</ARCHIVING_TIME>
</Archiving_Info>
</n1:General_Info>
</n1:Level-1C_Tile_ID>
Code :
from lxml import etree
tree = etree.parse(XML_FILE_CONTENT_PASTED_ABOVE)
root = tree.getroot()
print(root.attrib['xmlns:n1'])
Error :
KeyError: 'xmlns:n1'
Desired output :
https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd
A namespace declaration (xmlns:n1='...') looks like an attribute, but it is not part of the attrib dictionary of an element.
To get the namespace URI associated with the n1 prefix, use nsmap:
print(root.nsmap["n1"])
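nsmap is specific to lxml. If only the standard library is available, xml.etree.ElementTree elements have no nsmap, but iterparse can report namespace declarations through start-ns events. The following is a sketch of that alternative (not lxml's API), using a trimmed copy of the document above; the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the metadata file from the question.
XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Level-1C_Tile_ID xmlns:n1="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<n1:General_Info>
<DOWNLINK_PRIORITY metadataLevel="Standard">NOMINAL</DOWNLINK_PRIORITY>
</n1:General_Info>
</n1:Level-1C_Tile_ID>"""

namespaces = {}
# Each start-ns event carries a (prefix, uri) pair for one declaration.
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(XML), events=("start-ns",)):
    # Keep the first (outermost) declaration seen for each prefix.
    namespaces.setdefault(prefix, uri)

print(namespaces["n1"])
```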

How to include the namespaces into a xml file using lxml?

I am creating a new XML file from scratch using Python and the lxml library.
<route xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.xxxx" version="1.1"
xmlns:stm="http://xxxx/1/0/0"
xsi:schemaLocation="http://xxxx/1/0/0 stm_extensions.xsd">
I need to include this namespace information in the root tag, as attributes of the route element.
I can't get this information into the root element declaration.
from lxml import etree
root = etree.Element("route",
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance",
xmlns = "http://www.xxxxx",
version = "1.1",
xmlns: stm = "http://xxxxx/1/0/0"
)
This raises SyntaxError: invalid syntax.
How can I do that?
Here is how it can be done:
from lxml import etree
# Qualified name for the xsi:schemaLocation attribute
attr_qname = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "schemaLocation")
# Prefix-to-URI mapping; the None key declares the default namespace
nsmap = {None: "http://www.xxxx",
         "stm": "http://xxxx/1/0/0",
         "xsi": "http://www.w3.org/2001/XMLSchema-instance"}
root = etree.Element("route",
                     {attr_qname: "http://xxxx/1/0/0 stm_extensions.xsd"},
                     version="1.1",
                     nsmap=nsmap)
print(etree.tostring(root).decode())
Output from this code (line breaks have been added for readability):
<route xmlns:stm="http://xxxx/1/0/0"
xmlns="http://www.xxxx"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://xxxx/1/0/0 stm_extensions.xsd"
version="1.1"/>
The main "trick" is to use QName to create the xsi:schemaLocation attribute. An attribute with a colon in its name cannot be used as the name of a keyword argument.
I've added the declaration of the xsi prefix to nsmap, but it can actually be omitted. lxml defines default prefixes for some well-known namespace URIs, including xsi for http://www.w3.org/2001/XMLSchema-instance.
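For comparison, roughly the same element can be built with the standard library's xml.etree.ElementTree. This is a sketch of a different technique, not the lxml API: there is no nsmap argument, so the default namespace is registered globally with register_namespace, and the attribute name is written in {uri}name form. ElementTree only writes declarations for namespaces that are actually used by a tag or attribute, so the unused stm prefix does not appear in this variant. The constant names are illustrative.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
DEFAULT_NS = "http://www.xxxx"  # placeholder URI taken from the question

# Global registry: empty prefix means "serialize as the default namespace".
ET.register_namespace("", DEFAULT_NS)
ET.register_namespace("xsi", XSI)

root = ET.Element(
    "{%s}route" % DEFAULT_NS,
    {
        # Namespaced attribute in {uri}name form; serialized as xsi:schemaLocation.
        "{%s}schemaLocation" % XSI: "http://xxxx/1/0/0 stm_extensions.xsd",
        "version": "1.1",
    },
)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```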

Python module xml.etree.ElementTree modifies xml namespace keys automatically

I've noticed that Python's ElementTree module changes the XML data in the following simple example:
import xml.etree.ElementTree as ET
tree = ET.parse("./input.xml")
tree.write("./output.xml")
I wouldn't expect it to change, as this is a simple read-and-write test without any modification. However, the result tells a different story, especially in the namespace prefixes (no prefix --> ns0, d3p1 --> ns1, i --> ns2):
input.xml:
<?xml version="1.0" encoding="utf-8"?>
<ServerData xmlns:i="http://www.a.org" xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">
<CreationDate>0001-01-01T00:00:00</CreationDate>
<Processes>
<Processes xmlns:d3p1="http://schemas.datacontract.org/2004/07/Management.Interfaces">
<d3p1:ProtectedProcess>
<d3p1:Description>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Description>
<d3p1:DiscoveredMachine i:nil="true" />
<d3p1:Id>0</d3p1:Id>
<d3p1:Name>/applications/safari.app/contents/macos/safari</d3p1:Name>
<d3p1:Path>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Path>
<d3p1:ProcessHashes xmlns:d5p1="http://schemas.datacontract.org/2004/07/Management.Interfaces.WildFire" />
<d3p1:Status>1</d3p1:Status>
<d3p1:Type>Protected</d3p1:Type>
</d3p1:ProtectedProcess>
</Processes>
</Processes>
and output.xml:
<ns0:ServerData xmlns:ns0="http://schemas.xxx/2004/07/Server.Facades.ImportExport" xmlns:ns1="http://schemas.datacontract.org/2004/07/Management.Interfaces" xmlns:ns2="http://www.a.org">
<ns0:CreationDate>0001-01-01T00:00:00</ns0:CreationDate>
<ns0:Processes>
<ns0:Processes>
<ns1:ProtectedProcess>
<ns1:Description>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Description>
<ns1:DiscoveredMachine ns2:nil="true" />
<ns1:Id>0</ns1:Id>
<ns1:Name>/applications/safari.app/contents/macos/safari</ns1:Name>
<ns1:Path>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Path>
<ns1:ProcessHashes />
<ns1:Status>1</ns1:Status>
<ns1:Type>Protected</ns1:Type>
</ns1:ProtectedProcess>
</ns0:Processes>
</ns0:Processes>
You need to register the namespaces used in your XML, together with their prefixes, using the ElementTree.register_namespace function before reading and writing the XML. Example:
import xml.etree.ElementTree as ET
ET.register_namespace('','http://schemas.xxx/2004/07/Server.Facades.ImportExport')
ET.register_namespace('i','http://www.a.org')
ET.register_namespace('d3p1','http://schemas.datacontract.org/2004/07/Management.Interfaces')
tree = ET.parse("./input.xml")
tree.write("./output.xml")
Without this, ElementTree creates its own prefixes for the corresponding namespaces, which is what happens in your case.
This is described in the documentation:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. prefix is a namespace prefix. uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
(Emphasis mine)
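The effect can be checked without touching the file system. A minimal sketch, assuming a cut-down version of input.xml held in a string (the SOURCE constant is illustrative):

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for input.xml from the question.
SOURCE = (
    '<ServerData xmlns:i="http://www.a.org" '
    'xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">'
    '<CreationDate i:nil="true" />'
    '</ServerData>'
)

# Register the prefixes first; the registry is global to the module.
ET.register_namespace('', 'http://schemas.xxx/2004/07/Server.Facades.ImportExport')
ET.register_namespace('i', 'http://www.a.org')

root = ET.fromstring(SOURCE)
round_tripped = ET.tostring(root, encoding="unicode")
print(round_tripped)
```

The registration must happen before serialization. Prefix declarations that no tag or attribute uses, such as d5p1 in the question's input, are still dropped on output, so the round trip is not guaranteed to be byte-for-byte identical.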

Why xml.etree.ElementTree does not support namespace URI case change

I am trying to parse XML where the URI for the same namespace does not always use the same case (some XML owners decided to lower-case their URIs). If I parse data with one form of the URI followed by data with the other form, the parser fails to find my data, even though I update the ns dictionary to match the document's URI. Here is an example:
from cStringIO import StringIO
import xml.etree.ElementTree as ET
DATA_lc = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/lower/case/bug">
<item>
<roktatar:author>Boby Mac Gallinger</roktatar:author>
</item>
</container>'''
DATA_UC = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">
<item>
<roktatar:author>John-John Le Grandiosant</roktatar:author>
</item>
</container>'''
tree = ET.parse(StringIO(DATA_lc))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/lower/case/bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()
tree = ET.parse(StringIO(DATA_UC))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()
If each parsing block is run on its own, the data is collected properly, but if they run one after the other, the second always fails. Am I missing some reset or cleanup of the parser between documents? Is this a bug?
Thanks
The ElementTree search code parses arguments to find() and related functions for XPath expressions, and caches the resulting closed-over functions for reuse.
When you search for a roktatar:author, that expression is cached as a search for '{http://www.example.com/lower/case/bug}author', but in your second document the binding changed.
In other words, ElementTree assumes that the same namespace prefix will always map to the same namespace URI.
The better solution to this problem is to use a different prefix here, like roktatar_uc for the title-case version of the URL:
ns = {'roktatar_uc': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar_uc:author', namespaces=ns).text.strip()
but if that is not an option, you'll have to clear the cache instead:
from xml.etree import ElementPath
ElementPath._cache.clear()
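The distinct-prefix approach can be sketched end to end as follows. Because each path string is tied to exactly one URI, the result does not depend on how find() caches compiled expressions. The function name authors and the combined NS dictionary are illustrative choices, and the documents are shortened copies of the ones in the question.

```python
import xml.etree.ElementTree as ET

DATA_LC = """<container xmlns:roktatar="http://www.example.com/lower/case/bug">
<item><roktatar:author>Boby Mac Gallinger</roktatar:author></item>
</container>"""

DATA_UC = """<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">
<item><roktatar:author>John-John Le Grandiosant</roktatar:author></item>
</container>"""

# One mapping for both spellings; the prefixes here are ours, not the documents'.
NS = {
    'roktatar': 'http://www.example.com/lower/case/bug',
    'roktatar_uc': 'http://www.example.com/Lower/Case/Bug',
}


def authors(xml_text, prefix):
    """Return the author names found under each <item>, using the given prefix."""
    root = ET.fromstring(xml_text)
    return [
        item.find(prefix + ':author', NS).text.strip()
        for item in root.iter('item')
    ]


lc_authors = authors(DATA_LC, 'roktatar')
uc_authors = authors(DATA_UC, 'roktatar_uc')
print(lc_authors)
print(uc_authors)
```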
