Remove namespace with xmltodict in Python

xmltodict converts XML to a Python dictionary, and it supports namespaces. I can follow the example on the homepage and successfully remove a namespace there. However, I cannot remove the namespace from my own XML, and I cannot work out why. Here is my XML:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
mystatus:field1="data1"
mystatus:field2="data2" />
<section2
mystatus:lineA="outputA"
mystatus:lineB="outputB" />
</status>
And using:
xmltodict.parse(xml,process_namespaces=True,namespaces={'http://localhost/mystatus':None})
I get:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'#http://localhost/mystatus:field1', u'data1'), (u'#http://localhost/mystatus:field2', u'data2')])), (u'section2', OrderedDict([(u'#http://localhost/mystatus:lineA', u'outputA'), (u'#http://localhost/mystatus:lineB', u'outputB')]))]))])
instead of:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'field1', u'data1'), (u'field2', u'data2')])), (u'section2', OrderedDict([(u'lineA', u'outputA'), (u'#lineB', u'outputB')]))]))])
Am I making some simple mistake, or is there something about my XML that prevents the process_namespaces option from working correctly?

xmltodict is based on expat, so the namespace should be applied to the element names, not to the attribute names:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<mystatus:section1 field1="data1" field2="data2" />
<mystatus:section2 lineA="outputA" lineB="outputB" />
</status>
When parsed with:
foo = xmltodict.parse(xml,
process_namespaces=True,
namespaces={'http://localhost/mystatus':None})
outputs:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'#field1', u'data1'), (u'#field2', u'data2')])), (u'section2', OrderedDict([(u'#lineA', u'outputA'), (u'#lineB', u'outputB')]))]))])
Accessing it is easy:
# Get attribute 'lineA' of element 'section2' inside element 'status'
>>> foo.get('status').get('section2').get('#lineA')
u'outputA'
Attribute namespaces are only required when an element has several attributes with the same local name (e.g. multiple ids or multiple prices). In that case I couldn't get expat or xmltodict to parse them correctly. YMMV though.
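If the XML itself cannot be changed, one possible workaround is to clean up the keys after parsing. The sketch below is only an illustration: strip_namespace is a hypothetical helper, not part of xmltodict, and it assumes the keys have the shape shown in the question's output (an attribute prefix followed by the namespace URI and a colon). It works on plain dicts, so it runs without xmltodict installed.

```python
def strip_namespace(node, uri):
    """Return a copy of node with the 'uri:' part removed from every key."""
    marker = uri + ":"
    if isinstance(node, dict):
        return {
            key.replace(marker, "", 1): strip_namespace(value, uri)
            for key, value in node.items()
        }
    if isinstance(node, list):
        # Repeated elements are represented as lists.
        return [strip_namespace(item, uri) for item in node]
    return node  # plain text value, nothing to rename


# Same shape as the question's output (OrderedDict is a dict subclass).
parsed = {
    "status": {
        "section1": {
            "#http://localhost/mystatus:field1": "data1",
            "#http://localhost/mystatus:field2": "data2",
        },
        "section2": {
            "#http://localhost/mystatus:lineA": "outputA",
            "#http://localhost/mystatus:lineB": "outputB",
        },
    }
}

cleaned = strip_namespace(parsed, "http://localhost/mystatus")
print(cleaned["status"]["section2"]["#lineA"])
```

Note that the helper returns plain dicts rather than OrderedDicts, and that two attributes differing only by namespace would collide once the namespace is stripped.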

Related

Validate XML with namespaces against Schematron using lxml in Python

I am not able to get the lxml Schematron validator to recognize namespaces. Validation works fine when no namespaces are involved.
This is with Python 3.7.4 and lxml 4.4.0 on macOS 10.15.
Here is the Schematron file:
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:ns1="http://foo">
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
and here is the xml file
<?xml version="1.0" encoding="UTF-8"?>
<zip xmlns:ns1="http://foo">
<ns1:bar>3</ns1:bar>
</zip>
here is the python code
from lxml import etree, isoschematron
from plumbum import local
schematron_doc = etree.parse(local.path('rules.sch'))
schematron = isoschematron.Schematron(schematron_doc)
xml_doc = etree.parse(local.path('test.xml'))
is_valid = schematron.validate(xml_doc)
assert not is_valid
What I get: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '//ns1:bar'
If I remove ns1 from both the XML file and the Schematron file, the example works perfectly, with no error message.
There must be a trick to registering namespaces in lxml Schematron that I am missing. Has anyone done this?
As it turns out, there is a specific way to register namespaces in Schematron. It is described in the Schematron ISO standard.
It only required a small change to the Schematron file: adding an "ns" element, as follows:
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<ns uri="http://foo" prefix="ns1"/>
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
I won't remove the question, since there is a dearth of examples of Schematron rules using namespaces. Hopefully it can be helpful to someone.
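As a rough pre-flight check for this mistake, the sketch below lists prefixes that appear in rule contexts but have no matching ns element. It is a heuristic, not part of lxml: undeclared_prefixes is a hypothetical helper, it uses the standard library's xml.etree.ElementTree, it finds prefixes with a simple regular expression (so a colon inside a string literal would be misread), and it only inspects the context attribute of rule elements.

```python
import re
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"
# A name followed by a single colon; '::' (an XPath axis) is skipped.
PREFIX_RE = re.compile(r"([A-Za-z_][\w.-]*):(?!:)")


def undeclared_prefixes(schematron_text):
    """Return prefixes used in rule contexts that lack an <ns> declaration."""
    root = ET.fromstring(schematron_text)
    declared = {ns.get("prefix") for ns in root.iter(f"{{{SCH_NS}}}ns")}
    used = set()
    for rule in root.iter(f"{{{SCH_NS}}}rule"):
        used.update(PREFIX_RE.findall(rule.get("context", "")))
    return used - declared


BROKEN = """<schema xmlns="http://purl.oclc.org/dsdl/schematron" xmlns:ns1="http://foo">
<pattern><rule context="//ns1:bar"><assert test="number(.) = 2">bar must be 2</assert></rule></pattern>
</schema>"""

FIXED = """<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<ns uri="http://foo" prefix="ns1"/>
<pattern><rule context="//ns1:bar"><assert test="number(.) = 2">bar must be 2</assert></rule></pattern>
</schema>"""

print(undeclared_prefixes(BROKEN))
print(undeclared_prefixes(FIXED))
```

The first schema declares ns1 only through xmlns:ns1, which is the form that failed in the question, so the check reports it. The second uses the ns element and reports nothing.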

Element is not an element of the schema

I want to validate an XML file from my bank against an ISO 20022 XSD, but validation fails, claiming that the first element (Document) is not an element of the schema. I can see the 'Document' element defined in the XSD though.
I downloaded the XSD mentioned in the header from here: https://www.iso20022.org/documents/messages/camt/schemas/camt.052.001.06.zip Then I wrote a little script to validate the XML file:
import xmlschema
schema = xmlschema.XMLSchema('camt.052.001.06.xsd')
schema.validate('minimal_example.xml')
(use 'pip install xmlschema' to install the xmlschema package)
minimal_example.xml is just the first element of my bank's XML file without any children.
<?xml version="1.0" ?>
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
The above script fails, claiming that Document is not an element of the XSD:
xmlschema.validators.exceptions.XMLSchemaValidationError: failed validating <Element 'Document' at 0x7fbda11e4138> with XMLSchema10(basename='camt.052.001.06.xsd', namespace='urn:iso:std:iso:20022:tech:xsd:camt.052.001.06'):
Reason: <Element 'Document' at 0x7fbda11e4138> is not an element of the schema
Instance:
<Document>
</Document>
But the document element is defined right at the top of camt.052.001.06.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Standards Editor (build:R1.6.5.6) on 2016 Feb 12 18:17:13, ISO 20022 version : 2013-->
<xs:schema xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
<xs:element name="Document" type="Document"/>
[...]
Why does the validation fail and how can I correct this?
The XSD has
targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"
on the xs:schema element, indicating that it governs that namespace.
Your XML has a root element,
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
which only declares the ns2 prefix without using it, so Document is in no namespace. To place it in the namespace governed by the XSD, change it to
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
or
<ns2:Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</ns2:Document>
See also xmlns, xmlns:xsi, xsi:schemaLocation, and targetNamespace?
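The difference between the three spellings can be checked without any schema. The following is a small sketch using the standard library's xml.etree.ElementTree (not the xmlschema package from the question); root_namespace is a hypothetical helper that relies on ElementTree's "{uri}local" tag notation.

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    tag = ET.fromstring(xml_text).tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


prefix_only = f'<Document xmlns:ns2="{NS}"></Document>'       # prefix declared, never used
default_ns = f'<Document xmlns="{NS}"></Document>'            # default namespace
prefixed = f'<ns2:Document xmlns:ns2="{NS}"></ns2:Document>'  # prefix applied to the element

for label, text in [("prefix_only", prefix_only),
                    ("default_ns", default_ns),
                    ("prefixed", prefixed)]:
    print(label, root_namespace(text))
```

Only the last two place Document in the namespace that the XSD's targetNamespace governs.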

lxml: How do I search for fields without adding a xmlns (localhost) path to each search term?

I'm trying to locate fields in a SOAP xml file using lxml (3.6.0)
...
<soap:Body>
<Request xmlns="http://localhost/">
<Test>
<field1>hello</field1>
<field2>world</field2>
</Test>
</Request>
</soap:Body>
...
In this example I'm trying to find field1 and field2.
I need to include the namespace URI in the search term to find the field:
print (myroot.find(".//{http://localhost/}field1").tag) # prints '{http://localhost/}field1'
Without it, nothing is found:
print (myroot.find("field1").tag) # find() returns None, so .tag raises AttributeError
Is there any other way to search for the field tag (here field1) without giving the namespace?
Full example below:
from lxml import etree
example = """<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""
myroot = etree.fromstring(example)
# this works
print (myroot.find(".//{http://localhost/}field1").text)
print (myroot.find(".//{http://localhost/}field2").text)
# this fails
print (myroot.find(".//field1").text)
print (myroot.find("field1").text)
Comment: The input of the SOAP request is given; in real life I can't change any of it to make things easier.
There is a way to ignore namespaces when selecting elements with XPath, but that isn't good practice: the namespace is there for a reason. There is a cleaner way to reference an element in a namespace, which is to use a prefix mapped to the namespace URI instead of spelling out the URI every time:
.....
>>> ns = {'d': 'http://localhost/'}
>>> print (myroot.find(".//d:field1", ns).text)
hello
>>> print (myroot.find(".//d:field2", ns).text)
world
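For comparison, here is a sketch of both approaches side by side. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; find accepts the same kind of prefix mapping there. The helpers local_name and find_by_local_name are hypothetical names, and the namespace-agnostic search is shown only to illustrate the approach the answer advises against.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>"""

NS = {"d": "http://localhost/"}  # the prefix 'd' is arbitrary


def local_name(tag):
    """Strip the '{uri}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return the first descendant with this local name, ignoring namespaces."""
    for element in root.iter():
        if local_name(element.tag) == name:
            return element
    return None


root = ET.fromstring(EXAMPLE)
field1 = root.find(".//d:field1", NS)        # namespace-aware lookup
field2 = find_by_local_name(root, "field2")  # namespace-agnostic lookup
print(field1.text, field2.text)
```

The namespace-agnostic version would also match a field2 element from any other namespace, which is the reason the prefix mapping is usually the safer choice.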

Python module xml.etree.ElementTree modifies xml namespace keys automatically

I've noticed that Python's ElementTree module changes the XML data in the following simple example:
import xml.etree.ElementTree as ET
tree = ET.parse("./input.xml")
tree.write("./output.xml")
I wouldn't expect it to change, as I've done a simple read-and-write test without any modification. However, the result tells a different story, especially in the namespace prefixes (default namespace --> ns0, d3p1 --> ns1, i --> ns2):
input.xml:
<?xml version="1.0" encoding="utf-8"?>
<ServerData xmlns:i="http://www.a.org" xmlns="http://schemas.xxx/2004/07/Server.Facades.ImportExport">
<CreationDate>0001-01-01T00:00:00</CreationDate>
<Processes>
<Processes xmlns:d3p1="http://schemas.datacontract.org/2004/07/Management.Interfaces">
<d3p1:ProtectedProcess>
<d3p1:Description>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Description>
<d3p1:DiscoveredMachine i:nil="true" />
<d3p1:Id>0</d3p1:Id>
<d3p1:Name>/applications/safari.app/contents/macos/safari</d3p1:Name>
<d3p1:Path>/Applications/Safari.app/Contents/MacOS/Safari</d3p1:Path>
<d3p1:ProcessHashes xmlns:d5p1="http://schemas.datacontract.org/2004/07/Management.Interfaces.WildFire" />
<d3p1:Status>1</d3p1:Status>
<d3p1:Type>Protected</d3p1:Type>
</d3p1:ProtectedProcess>
</Processes>
</Processes>
and output.xml:
<ns0:ServerData xmlns:ns0="http://schemas.xxx/2004/07/Server.Facades.ImportExport" xmlns:ns1="http://schemas.datacontract.org/2004/07/Management.Interfaces" xmlns:ns2="http://www.a.org">
<ns0:CreationDate>0001-01-01T00:00:00</ns0:CreationDate>
<ns0:Processes>
<ns0:Processes>
<ns1:ProtectedProcess>
<ns1:Description>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Description>
<ns1:DiscoveredMachine ns2:nil="true" />
<ns1:Id>0</ns1:Id>
<ns1:Name>/applications/safari.app/contents/macos/safari</ns1:Name>
<ns1:Path>/Applications/Safari.app/Contents/MacOS/Safari</ns1:Path>
<ns1:ProcessHashes />
<ns1:Status>1</ns1:Status>
<ns1:Type>Protected</ns1:Type>
</ns1:ProtectedProcess>
</ns0:Processes>
</ns0:Processes>
You need to register the namespaces used in your XML, together with their prefixes, using the xml.etree.ElementTree.register_namespace function before reading and writing the XML. Example:
import xml.etree.ElementTree as ET
ET.register_namespace('','http://schemas.xxx/2004/07/Server.Facades.ImportExport')
ET.register_namespace('i','http://www.a.org')
ET.register_namespace('d3p1','http://schemas.datacontract.org/2004/07/Management.Interfaces')
tree = ET.parse("./input.xml")
tree.write("./output.xml")
Without this, ElementTree creates its own prefixes for the namespaces, which is what happens in your case.
This is described in the documentation:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. prefix is a namespace prefix. uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
(Emphasis mine)
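The effect can be seen on a much smaller document. The sketch below uses made-up example.com namespace URIs and a hypothetical roundtrip helper, and serializes to a string instead of writing files. Because the registry is global, the registrations affect every later serialization in the same process.

```python
import xml.etree.ElementTree as ET

SERVER_NS = "http://example.com/server"          # hypothetical default namespace
INTERFACES_NS = "http://example.com/interfaces"  # hypothetical prefixed namespace

SOURCE = (
    f'<ServerData xmlns="{SERVER_NS}" xmlns:d3p1="{INTERFACES_NS}">'
    "<d3p1:Id>0</d3p1:Id>"
    "</ServerData>"
)


def roundtrip(xml_text):
    """Parse and immediately serialize again, without modifying anything."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


before = roundtrip(SOURCE)  # prefixes are generated: ns0, ns1, ...

# An empty prefix registers the URI as the default namespace.
ET.register_namespace("", SERVER_NS)
ET.register_namespace("d3p1", INTERFACES_NS)

after = roundtrip(SOURCE)   # original prefixes are preserved

print(before)
print(after)
```

Namespace declarations that no element or attribute actually uses are still dropped on output, as happened to d5p1 in the question's output.xml.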

How can lxml validate some XML against both an XSD file while also loading an inline schema too?

I'm having problems getting lxml to successfully validate some xml. The XSD schema and XML file are both from Amazon documentation so should be compatible. But the XML itself refers to another schema that's not being loaded.
Here is my code, which is based on the lxml validation tutorial:
from lxml import etree
xsd_doc = etree.parse('ProductImage.xsd')
xsd = etree.XMLSchema(xsd_doc)
xml = etree.parse('ProductImage_sample.xml')
xsd.validate(xml)
print(xsd.error_log)
"ProductImage_sample.xml:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element 'AmazonEnvelope': No matching global declaration available for the validation root."
I get no errors if I validate against amzn-envelope.xsd instead of ProductImage.xsd, but that defeats the point of seeing if a given Image feed is valid. All xsd & xml files mentioned are in my working directory along with my python script by the way.
Here is a snippet of the sample XML, which should definitely be valid:
<?xml version="1.0"?>
<AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">
<Header>
<DocumentVersion>1.01</DocumentVersion>
<MerchantIdentifier>Q_M_STORE_123</MerchantIdentifier>
</Header>
<MessageType>ProductImage</MessageType>
<Message>
<MessageID>1</MessageID>
<OperationType>Update</OperationType>
<ProductImage>
<SKU>1234</SKU>
Here is a snippet of the schema (this file is not public so I can't show all of it):
<?xml version="1.0"?>
<!-- Revision="$Revision: #5 $" -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xsd:include schemaLocation="amzn-base.xsd"/>
<xsd:element name="ProductImage">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="SKU"/>
I can say that following the include to amzn-base.xsd does not end up reaching a definition of the AmazonEnvelope tag. So my question is: can lxml load schemas via a tag like <AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">? And if not, how can I validate my Image feed?
The answer is that I should validate against the parent schema file, amzn-envelope.xsd, which is the one named at the top of the XML file. It contains the line:
<xsd:include schemaLocation="ProductImage.xsd"/>
In general then, lxml won't act on a declaration such as xsi:noNamespaceSchemaLocation="amzn-envelope.xsd" by itself. If you can find the parent schema, validate against that; it should include the specific schema you're interested in.
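Since the hint is not followed automatically, it can be read from the document and used to pick the schema file by hand. This is a minimal sketch with the standard library's xml.etree.ElementTree; schema_hint is a hypothetical helper, and the sample is a shortened version of the XML in the question.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hint(xml_text):
    """Return the root's xsi:noNamespaceSchemaLocation value, or None."""
    root = ET.fromstring(xml_text)
    return root.get(f"{{{XSI}}}noNamespaceSchemaLocation")


SAMPLE = (
    '<AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">'
    "<MessageType>ProductImage</MessageType>"
    "</AmazonEnvelope>"
)

print(schema_hint(SAMPLE))
```

The returned file name could then be passed to etree.parse and etree.XMLSchema in place of the hard-coded 'ProductImage.xsd' from the question, assuming the file is in the working directory.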