Parse attributes of root node having namespace in XML - python

I have the following xml which i was trying to parse and wanted to get the value of attributes of root node i.e. xmlns:n1 value. But i am getting the key error using the following value.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Level-1C_Tile_ID xmlns:n1="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd /dpc/app/s2ipf/FORMAT_METADATA_TILE_L1C/02.10.02/scripts/../../../schemas/02.12.05/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd">
<n1:General_Info>
<TILE_ID metadataLevel="Brief">S2A_OPER_MSI_L1C_TL_MTI__20161111T091803_A007252_T43QBA_N02.04</TILE_ID>
<DATASTRIP_ID metadataLevel="Standard">S2A_OPER_MSI_L1C_DS_MTI__20161111T091803_S20161111T053350_N02.04</DATASTRIP_ID>
<DOWNLINK_PRIORITY metadataLevel="Standard">NOMINAL</DOWNLINK_PRIORITY>
<SENSING_TIME metadataLevel="Standard">2016-11-11T05:33:50.068Z</SENSING_TIME>
<Archiving_Info metadataLevel="Expertise">
<ARCHIVING_CENTRE>MTI_</ARCHIVING_CENTRE>
<ARCHIVING_TIME>2016-11-11T10:53:22.600451Z</ARCHIVING_TIME>
</Archiving_Info>
</n1:General_Info>
</n1:Level-1C_Tile_ID>
Code :
from lxml import etree
tree = etree.parse(XML_FILE_CONTENT_PASTED_ABOVE)
root = tree.getroot()
print(root.attrib['xmlns:n1'])
Error :
KeyError: 'xmlns:n1'
Desired output :
https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd

A namespace declaration (xmlns:n1='...') looks like an attribute, but it is not part of the attrib dictionary of an element.
To get the namespace URI associated with the n1 prefix, use nsmap:
print(root.nsmap["n1"])

Related

xml parsing in python with XPath

I am trying to parse an XML file in Python with the built in xml module and Elemnt tree, but what ever I try to do according to the documentation, it does not give me what I need.
I am trying to extract all the value tags into a list
<?xml version="1.0" encoding="UTF-8"?>
<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
<fullName>testPicklist__c</fullName>
<externalId>false</externalId>
<label>testPicklist</label>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<type>Picklist</type>
<valueSet>
<restricted>true</restricted>
<valueSetDefinition>
<sorted>false</sorted>
<value>
<fullName>a 32</fullName>
<default>false</default>
<label>a 32</label>
</value>
<value>
<fullName>23 432;:</fullName>
<default>false</default>
<label>23 432;:</label>
</value>
and here is the example code that I cant get to work. It's very basic and all I have issues is the xpath.
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
print(root.findall(".//value")
print(root.findall(".//*/value")
print(root.findall("./*/value")
Since the root element has attribute xmlns="http://soap.sforce.com/2006/04/metadata", every element in the document will belong to this namespace. So you're actually looking for {http://soap.sforce.com/2006/04/metadata}value elements.
To search all <value> elements in this document you have to specify the namespace argument in the findall() function
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
# get the namespace of root
ns = root.tag.split('}')[0][1:]
# create a dictionary with the namespace
ns_d = {'my_ns': ns}
# get all the values
values = root.findall('.//my_ns:value', namespaces=ns_d)
# print the values
for value in values:
print(value)
Outputs:
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043ba0>
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043e20>
Alternatively you can just search for the {http://soap.sforce.com/2006/04/metadata}value
# get all the values
values = root.findall('.//{http://soap.sforce.com/2006/04/metadata}value')

python, xml: how to access the 3rd child by element' name

Would you help me, pleace, to get an access to elemnt with name 'id' by the following construction in Python (i have lxml and xml.etree.ElementTree libraries).
Desirable result: '0000000'
Desirable method:
Search in xml-document a child, where it's name is fcsProtocolEF3.
Search in fcsProtocolEF3 an element with name 'id'.
It is crucial to search by element name. Not by ordinal position.
I tried to use something like this: tree.findall('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')[0].findall('{http://zakupki.gov.ru/oos/types/1}id')[0].text
it works, but it requires to input namespaces. XML-document have different namespaces and I don't know how to define them beforehand.
Thank you.
That would be great to use something like XQuery in SQL:
value('(/*:export/*:fcsProtocolEF3/*:id)[1]', 'nvarchar(21)')) AS [id],
XML-document:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
lxml solution:
xml = '''<?xml version="1.0"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>'''
from lxml import etree as et
root = et.fromstring(xml)
text = root.xpath('//*[local-name()="export"]/*[local-name()="fcsProtocolEF3"]/*[local-name()="id"]/text()')[0]
print(text)
Below is ET based solution. NS are in use.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
'''
def get_id_text():
root = ET.fromstring(xml)
fcs = root.find('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')
# assuming there is one fcs element and one id under fcs
return fcs.find('{http://zakupki.gov.ru/oos/types/1}id').text
print(get_id_text())
output
0000000

ElementTree appends extra information when getting root element

I have an xml like shown below
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dtbook PUBLIC "-//INFO//INFO info 2005-3//EN" "http://url">
<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" version="2005-3" xml:lang="ml">
<head>....
</dtbook>
I open the file like so,
with open("filename.xml") as f:
tree = ET.parse(f)
root = tree.getroot()
When I try to get the root tag, I get,
print(root.tag)
{http://www.daisy.org/z3986/2005/dtbook/}dtbook
whereas if I remove all the attributes from the root tag i.e. dtbook, I get the correct output i.e. dtbook
print(root.tag)
dtbook
I cannot remove the attributes. Is there a way to get this working without removing the attributes??
This is called a namespace and is supposed to be in front. You can simply remove the namespace by splitting your string at {}

Reading xml with lxml lib geting strange string from xmlns tag

I am writing program to work on xml file and change it. But when I try to get to any part of it I get some extra part.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.

ElementTree find() always returns None

I'm using ElementTree with Python to parse an XML file to find the contents of a subchild
This is the XML file I'm trying to parse:
<?xml version='1.0' encoding='UTF-8'?>
<nvd xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://nvd.nist.gov/feeds/cve/1.2" nvd_xml_version="1.2" pub_date="2016-02-10" xsi:schemaLocation="http://nvd.nist.gov/feeds/cve/1.2 http://nvd.nist.gov/schema/nvdcve_1.2.1.xsd">
<entry type="CVE" name="CVE-1999-0001" seq="1999-0001" published="1999-12-30" modified="2010-12-16" severity="Medium" CVSS_version="2.0" CVSS_score="5.0" CVSS_base_score="5.0" CVSS_impact_subscore="2.9" CVSS_exploit_subscore="10.0" CVSS_vector="(AV:N/AC:L/Au:N/C:N/I:N/A:P)">
<desc>
<descript source="cve">ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.</descript>
</desc>
<loss_types>
<avail/>
</loss_types>
<range>
<network/>
</range>
<refs>
<ref source="OSVDB" url="http://www.osvdb.org/5707">5707</ref>
<ref source="CONFIRM" url="http://www.openbsd.org/errata23.html#tcpfix">http://www.openbsd.org/errata23.html#tcpfix</ref>
</refs>
this is my code:
import xml.etree.ElementTree as ET
if __name__ == '__main__':
tree = ET.parse('nvdcve-modified.xml')
root = tree.getroot()
print root.find('entry')
print root[0].find('desc')
the output for both lines in None
Your XML has default namespace defined at the root element level :
xmlns="http://nvd.nist.gov/feeds/cve/1.2"
Descendant elements without prefix inherits ancestor's default namespace implicitly. To find element in namespace, you can map a prefix to the namespace URI and use the prefix like so :
ns = {'d': 'http://nvd.nist.gov/feeds/cve/1.2'}
root.find('d:entry', ns)
or use the namespace URI directly :
root.find('{http://nvd.nist.gov/feeds/cve/1.2}entry')

Categories

Resources