Element is not an element of the schema (Python)

I want to validate an XML file from my bank against an ISO 20022 XSD, but validation fails with the claim that the first element (Document) is not an element of the schema. I can see the 'Document' element defined in the XSD, though.
I downloaded the XSD mentioned in the header from here: https://www.iso20022.org/documents/messages/camt/schemas/camt.052.001.06.zip Then I wrote a little script to validate the XML file:
import xmlschema
schema = xmlschema.XMLSchema('camt.052.001.06.xsd')
schema.validate('minimal_example.xml')
(use 'pip install xmlschema' to install the xmlschema package)
minimal_example.xml is just the first element of my bank's XML file without any children.
<?xml version="1.0" ?>
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
The above script fails, claiming document was not an element of the XSD:
xmlschema.validators.exceptions.XMLSchemaValidationError: failed validating <Element 'Document' at 0x7fbda11e4138> with XMLSchema10(basename='camt.052.001.06.xsd', namespace='urn:iso:std:iso:20022:tech:xsd:camt.052.001.06'):
Reason: <Element 'Document' at 0x7fbda11e4138> is not an element of the schema
Instance:
<Document>
</Document>
But the document element is defined right at the top of camt.052.001.06.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Standards Editor (build:R1.6.5.6) on 2016 Feb 12 18:17:13, ISO 20022 version : 2013-->
<xs:schema xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
<xs:element name="Document" type="Document"/>
[...]
Why does the validation fail and how can I correct this?

The XSD has
targetNamespace="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"
on the xs:schema element, indicating that it governs that namespace.
Your XML has a root element,
<Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
which places the Document in no namespace. To place it in the namespace governed by the XSD, change it to
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</Document>
or
<ns2:Document xmlns:ns2="urn:iso:std:iso:20022:tech:xsd:camt.052.001.06">
</ns2:Document>
See also xmlns, xmlns:xsi, xsi:schemaLocation, and targetNamespace?
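The difference between declaring a prefix and actually using the namespace can be checked with a minimal sketch. It assumes only Python's standard xml.etree.ElementTree module (not the xmlschema package from the question), and the variable names are illustrative. ElementTree reports an element's namespace in braces in front of the tag name, so the three spellings can be compared directly:

```python
import xml.etree.ElementTree as ET

NS = "urn:iso:std:iso:20022:tech:xsd:camt.052.001.06"

# Declares the ns2 prefix but never uses it: Document is in no namespace.
unqualified = ET.fromstring('<Document xmlns:ns2="%s"></Document>' % NS)

# Default namespace declaration: Document is in the schema's target namespace.
default_ns = ET.fromstring('<Document xmlns="%s"></Document>' % NS)

# Prefixed element name: same namespace, different spelling.
prefixed = ET.fromstring('<ns2:Document xmlns:ns2="%s"></ns2:Document>' % NS)

for element in (unqualified, default_ns, prefixed):
    print(element.tag)
```

Only the last two tags carry the namespace URI, which is why only those two forms can match the element declared in the XSD.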

Related

Does the XPath collection () function work with lxml and XSLT?

I recently tried to transform an XML file with the lxml package and an XSL stylesheet containing a variable that uses the XPath collection() function. However, I get the following error when I run my code:
lxml.etree.XSLTApplyError: Failed to evaluate the expression of variable 'name'.
Here are the details of my files:
XML source : catalog.xml
<?xml version="1.0" encoding="UTF-8"?>
<collection>
<doc href="./IR_041698.xml"/>
<doc href="./IR_051379.xml"/>
</collection>
XSL file : test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:tei="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="tei">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" method="xml"/>
<xsl:template match="/">
<xsl:variable name="name" select="collection('catalog.xml')/descendant::archdesc/did/origination/persname/text()"/>
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
<fileDesc>
<titleStmt>
<title>
<xsl:value-of select="$name"/>
</title>
</titleStmt>
</fileDesc>
</teiHeader>
</xsl:template>
</xsl:stylesheet>
Python code :
from lxml import etree as ET
source = ET.parse("catalog.xml")
xslt = ET.parse("test.xsl")
transform = ET.XSLT(xslt)
newdom = transform(source)
print(ET.tostring(newdom, pretty_print=True))
I am a little surprised, because the transformation works when I run it in the Oxygen XML editor but not in Python.
Do you have any suggestions? Is the XPath collection() function a problem with lxml?
Thank you in advance.
The collection function is part of XPath 2.0 and XSLT 2.0 and later, and as such is not supported by lxml, which only implements XSLT 1.0 and XPath 1.0. You can, however, use the document function in XSLT 1.0, as in document(document('catalog.xml')/*/doc/@href), to select the "collection" of documents referenced by the href attributes of the doc elements in the catalog.xml document.
Saxon 9.9 is also available as a Python module https://www.saxonica.com/saxon-c/doc/html/saxonc.html as part of Saxon C 1.2.1 (Download http://saxonica.com/download/c.xml, documentation: http://www.saxonica.com/saxon-c/documentation/index.html) so you might consider switching from lxml to Saxon-C if you want to use XSLT 3 in Python.
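If the goal is only to read values from every document listed in the catalog, the lookup can also be done on the Python side instead of in XSLT. The following is a rough sketch of that alternative, not the document() approach itself. It uses only the standard library, and the helper names (load_collection, demo), the file names a.xml and b.xml, and their contents are hypothetical stand-ins for the real IR_*.xml files:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def load_collection(catalog_path):
    """Parse every document listed as <doc href="..."/> in the catalog."""
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    catalog = ET.parse(catalog_path).getroot()
    return [
        ET.parse(os.path.join(base_dir, doc.get("href"))).getroot()
        for doc in catalog.findall("doc")
    ]


def demo():
    """Build a throwaway catalog and collect the persname text of each file."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical sample documents, invented for this sketch.
        for name, person in [("a.xml", "Alice"), ("b.xml", "Bob")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(
                    "<ead><archdesc><did><origination><persname>%s"
                    "</persname></origination></did></archdesc></ead>" % person
                )
        catalog_path = os.path.join(tmp, "catalog.xml")
        with open(catalog_path, "w", encoding="utf-8") as handle:
            handle.write(
                '<collection><doc href="./a.xml"/><doc href="./b.xml"/></collection>'
            )
        return [
            persname.text
            for root in load_collection(catalog_path)
            for persname in root.iter("persname")
        ]


print(demo())
```

The hrefs are resolved relative to the catalog file, which mirrors how the relative paths in catalog.xml are meant to be read.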

I want to remove the curly braces and XML namespace using lxml and just report the tag name

I have the following XML document (the real one is much longer):
<?xml version ="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE fmresultset PUBLIC "-//FMI//DTD fmresultset//EN" "http://localhost:16020/fmi/xml/fmresultset.dtd">
<fmresultset xmlns="http://www.filemaker.com/xml/fmresultset" version="1.0">
<error code="0">
</error>
<product build="11/11/2014" name="FileMaker Web Publishing Engine" version="13.0.5.518">
</product>
I use the following python to extract some of the tag names:
doc = etree.fromstring(resulttxt)
print(doc.attrib)
print(doc.tag)
print(doc[4][0][0].tag)
if doc[4][0][0].tag == 'field':
    print('hi')
What I'm getting though is:
{'version': '1.0'}
{http://www.filemaker.com/xml/fmresultset}fmresultset
{http://www.filemaker.com/xml/fmresultset}field
The xmlns doesn't show up as an attribute of the root tag but it is there.
And it is placed in front of each tag name which makes it difficult to loop through and use conditionals. I want doc.tag just to show the tag and not the namespace and the tag.
This is day 1 for me using this. could anyone help out?
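You need to handle namespaces; in your case it is a default namespace (one declared without a prefix), so map it to a prefix of your own choosing when searching: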
from lxml import etree as ET
# bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration
data = b"""<?xml version ="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE fmresultset PUBLIC "-//FMI//DTD fmresultset//EN" "http://localhost:16020/fmi/xml/fmresultset.dtd">
<fmresultset xmlns="http://www.filemaker.com/xml/fmresultset" version="1.0">
<error code="0">
</error>
<product build="11/11/2014" name="FileMaker Web Publishing Engine" version="13.0.5.518">
</product>
</fmresultset>
"""
namespaces = {
"myns": "http://www.filemaker.com/xml/fmresultset"
}
tree = ET.fromstring(data)
print(tree.find("myns:product", namespaces=namespaces).attrib.get("name"))
Prints:
FileMaker Web Publishing Engine
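To get just the tag name without the braces, as the question asks, one option is to strip the '{namespace}' part from the tag string. This is a minimal sketch with the standard library's xml.etree.ElementTree, which reports tags in the same '{uri}name' form as lxml; local_name is a hypothetical helper, and the sample data is a shortened version of the document above. With lxml itself, etree.QName(element).localname gives the same result.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    if tag.startswith("{"):
        return tag.split("}", 1)[1]
    return tag


data = (
    '<fmresultset xmlns="http://www.filemaker.com/xml/fmresultset" version="1.0">'
    '<error code="0"/>'
    '<product build="11/11/2014" name="FileMaker Web Publishing Engine"/>'
    '</fmresultset>'
)

root = ET.fromstring(data)
names = [local_name(child.tag) for child in root]
print(local_name(root.tag), names)
```

A comparison such as local_name(element.tag) == 'field' then works regardless of which namespace the element is in.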

Remove namespace with xmltodict in Python

xmltodict converts XML to a Python dictionary. It supports namespaces. I can follow the example on the homepage and successfully remove a namespace. However, I cannot remove the namespace from my XML and cannot identify why? Here is my XML:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
mystatus:field1="data1"
mystatus:field2="data2" />
<section2
mystatus:lineA="outputA"
mystatus:lineB="outputB" />
</status>
And using:
xmltodict.parse(xml,process_namespaces=True,namespaces={'http://localhost/mystatus':None})
I get:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@http://localhost/mystatus:field1', u'data1'), (u'@http://localhost/mystatus:field2', u'data2')])), (u'section2', OrderedDict([(u'@http://localhost/mystatus:lineA', u'outputA'), (u'@http://localhost/mystatus:lineB', u'outputB')]))]))])
instead of:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'field1', u'data1'), (u'field2', u'data2')])), (u'section2', OrderedDict([(u'lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Am I making some simple mistake, or is there something about my XML that prevents the process_namespaces modification from working correctly?
xmltodict is based on expat, so namespaces should be applied to the element name, not to attribute names:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<mystatus:section1 field1="data1" field2="data2" />
<mystatus:section2 lineA="outputA" lineB="outputB" />
</status>
When parsed with:
foo = xmltodict.parse(xml,
                      process_namespaces=True,
                      namespaces={'http://localhost/mystatus': None})
outputs:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@field1', u'data1'), (u'@field2', u'data2')])), (u'section2', OrderedDict([(u'@lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Accessing it is easy:
# Get attribute 'lineA' of element 'section2' inside element 'status'
>>> foo.get('status').get('section2').get('@lineA')
u'outputA'
Attribute namespaces are only required when you have multiple attributes of the same name (e.g. multiple id's or multiple prices, etc), in which case, I couldn't get expat or xmltodict to parse it correctly. YMMV though.

Removing all XML elements that belong to specific namespace

I am an XML beginner. I am using lxml python libs to process a SAML document, however my question is not really related to SAML or SSO.
Quite simply, I need to remove all elements in this XML document that belong to the "ds" namespace. I looked at an XPath search and at findall(), but I do not know how to work with namespaces.
The original document looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<ds:Signature>
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
<ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
<ds:Reference URI="#redacted">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
<ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
</ds:Transforms>
<ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
<ds:DigestValue>redacted</ds:DigestValue>
</ds:Reference>
</ds:SignedInfo>
<ds:SignatureValue>redacted==</ds:SignatureValue>
<ds:KeyInfo>
<ds:X509Data>
<ds:X509Certificate>certificateredacted=</ds:X509Certificate>
</ds:X509Data>
<ds:KeyValue>
<ds:RSAKeyValue>
<ds:Modulus>modulusredacted==</ds:Modulus>
<ds:Exponent>AQAB</ds:Exponent>
</ds:RSAKeyValue>
</ds:KeyValue>
</ds:KeyInfo>
</ds:Signature>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
What I want is a document that looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
You can find elements in a namespace using XPath with //namespace:*, as such:
doc_root.xpath('//ds:*', namespaces={'ds': 'http://www.w3.org/2000/09/xmldsig#'})
Thus, to remove all children in this namespace, you could use something like the following:
def strip_dsig(doc_root):
    nsmap = {'ds': 'http://www.w3.org/2000/09/xmldsig#'}
    for element in doc_root.xpath('//ds:*', namespaces=nsmap):
        element.getparent().remove(element)
    return doc_root
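For comparison, here is a rough equivalent that assumes only the standard library's xml.etree.ElementTree rather than lxml. ElementTree has no getparent(), so this sketch walks every element and removes matching children from it directly; strip_namespace_elements and the shortened sample document are illustrative, not part of any library.

```python
import xml.etree.ElementTree as ET

DS_NS = "http://www.w3.org/2000/09/xmldsig#"


def strip_namespace_elements(root, namespace_uri):
    """Remove every descendant element whose tag is in namespace_uri."""
    prefix = "{" + namespace_uri + "}"
    # Snapshot the elements first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if isinstance(child.tag, str) and child.tag.startswith(prefix):
                parent.remove(child)
    return root


sample = (
    '<Response xmlns="urn:oasis:names:tc:SAML:2.0:protocol" '
    'xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" '
    'xmlns:ds="http://www.w3.org/2000/09/xmldsig#">'
    '<saml:Assertion><saml:Issuer>redacted</saml:Issuer>'
    '<ds:Signature><ds:SignatureValue>redacted==</ds:SignatureValue></ds:Signature>'
    '<saml:Subject/></saml:Assertion></Response>'
)

root = strip_namespace_elements(ET.fromstring(sample), DS_NS)
remaining = [element.tag for element in root.iter()]
print(remaining)
```

Removing an element this way also discards everything nested inside it, so the whole ds:Signature subtree goes in one step.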
This is very easy to do with an xsl stylesheet. This is probably your best approach.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
exclude-result-prefixes="ds">
<!-- no_ds.xsl -->
<xsl:template match="node()|@*">
<xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
</xsl:template>
<xsl:template match="ds:*"><xsl:apply-templates select="*"/></xsl:template>
<xsl:template match="@ds:*"/>
</xsl:stylesheet>
You can run this from a command line using xsltproc (for libxml2) or equivalent:
xsltproc -o directoryname/ no_ds.xsl file1.xml file2.xml
This will create directoryname/file1.xml and directoryname/file2.xml without the ds namespace.
You can also do this with lxml, using its libxslt bindings.
from lxml import etree
no_ds_stylesheet = etree.parse('no_ds.xsl')
# Compile the parsed stylesheet into a reusable transform object
no_ds_transform = etree.XSLT(no_ds_stylesheet)
# doc_to_transform is an Element or ElementTree
# from etree.fromstring(), etree.XML(), or etree.parse()
no_ds_doc = no_ds_transform(doc_to_transform)
# no_ds_doc is now another ElementTree doc, the result of the XSLT transform.
# You can reuse the no_ds_transform object multiple times (and should if you can)
no_ds_doc2 = no_ds_transform(doc_to_transform2)
Since XSLT documents are XML documents, you can even create a custom XSLT stylesheet on the fly using lxml and define the namespaces you want to omit dynamically. (Left as an exercise for the reader.)
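One possible take on that exercise is to generate the stylesheet text from a template string, with the namespace URI as a parameter. This sketch uses only the standard library and stops at producing and parsing the stylesheet; build_strip_stylesheet and the "strip" prefix are hypothetical names. Applying the result would still need an XSLT 1.0 processor such as lxml's etree.XSLT.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def build_strip_stylesheet(namespace_uri):
    """Return XSLT 1.0 source that omits elements and attributes in namespace_uri."""
    return (
        '<xsl:stylesheet version="1.0"\n'
        '    xmlns:xsl="' + XSL_NS + '"\n'
        # quoteattr escapes the URI and adds the surrounding quotes.
        '    xmlns:strip=' + quoteattr(namespace_uri) + '\n'
        '    exclude-result-prefixes="strip">\n'
        '  <xsl:template match="node()|@*">\n'
        '    <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>\n'
        '  </xsl:template>\n'
        '  <xsl:template match="strip:*"><xsl:apply-templates select="*"/></xsl:template>\n'
        '  <xsl:template match="@strip:*"/>\n'
        '</xsl:stylesheet>\n'
    )


stylesheet_text = build_strip_stylesheet("http://www.w3.org/2000/09/xmldsig#")

# Parsing the generated text confirms that it is well-formed XML.
stylesheet_root = ET.fromstring(stylesheet_text)
print(stylesheet_text)
```

The three templates mirror the ones in no_ds.xsl, with the fixed ds prefix replaced by one bound to whatever URI is passed in.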

How can lxml validate some XML against both an XSD file while also loading an inline schema too?

I'm having problems getting lxml to successfully validate some xml. The XSD schema and XML file are both from Amazon documentation so should be compatible. But the XML itself refers to another schema that's not being loaded.
Here is my code, which is based on the lxml validation tutorial:
from lxml import etree
xsd_doc = etree.parse('ProductImage.xsd')
xsd = etree.XMLSchema(xsd_doc)
xml = etree.parse('ProductImage_sample.xml')
xsd.validate(xml)
print(xsd.error_log)
"ProductImage_sample.xml:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element 'AmazonEnvelope': No matching global declaration available for the validation root."
I get no errors if I validate against amzn-envelope.xsd instead of ProductImage.xsd, but that defeats the point of seeing if a given Image feed is valid. All xsd & xml files mentioned are in my working directory along with my python script by the way.
Here is a snippet of the sample XML, which should definitely be valid:
<?xml version="1.0"?>
<AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">
<Header>
<DocumentVersion>1.01</DocumentVersion>
<MerchantIdentifier>Q_M_STORE_123</MerchantIdentifier>
</Header>
<MessageType>ProductImage</MessageType>
<Message>
<MessageID>1</MessageID>
<OperationType>Update</OperationType>
<ProductImage>
<SKU>1234</SKU>
Here is a snippet of the schema (this file is not public so I can't show all of it):
<?xml version="1.0"?>
<!-- Revision="$Revision: #5 $" -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xsd:include schemaLocation="amzn-base.xsd"/>
<xsd:element name="ProductImage">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="SKU"/>
I can say that following the include to amzn-base.xsd does not end up reaching a definition of the AmazonEnvelope tag. So my question is: can lxml load schemas via a tag like <AmazonEnvelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="amzn-envelope.xsd">? And if not, how can I validate my Image feed?
The answer is I should validate by the parent schema file, which as mentioned at the top of the XML file is amzn-envelope.xsd as this contains the line:
<xsd:include schemaLocation="ProductImage.xsd"/>
In general then, lxml won't read such a declaration as xsi:noNamespaceSchemaLocation="amzn-envelope.xsd" but if you can find the parent schema to validate against then this should hopefully include the specific schema you're interested in.
