Validate with three xml schemas as one combined schema in lxml? - python

I am generating an XML document for which different XSDs have been provided for different parts (which is to say, definitions for some elements are in certain files, definitions for others are in others).
The XSD files do not refer to each other. The schemas are:
http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd
http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/FormSubmission-v1-1.xsd
http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd
Is there a way to validate the document against all of the schemas using lxml?
The solution here is not simply to validate individually against each schema, because the problem I am having is that validation fails because of elements not specified in the XSD. For example, when validating against http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd, I get the error:
File "lxml.etree.pyx", line 3006, in lxml.etree._Validator.assertValid (src/lxml/lxml.etree.c:125415)
DocumentInvalid: Element '{http://xmlgw.companieshouse.gov.uk}CompanyIncorporation': No matching global element declaration available, but demanded by the strict wildcard., line 9
This happens because the document contains a {http://xmlgw.companieshouse.gov.uk}CompanyIncorporation element, which is declared not in the XSD being validated against but in one of the other XSD files.

I believe you should only be validating against Egov_ch-v2-0.xsd, which appears to define an envelope document. (This is the document you are creating, right? You haven't shown your XML.)
This schema uses <xs:any namespace="##any" minOccurs="0"/> to define body contents of the envelope. However, xsd:any does not mean "ignore all contents." Rather it means "accept anything here." Whether to validate or ignore the contents is controlled by the processContents attribute, which defaults to strict. This means that any elements discovered here must validate against types available to the schema. However, Egov_ch-v2-0.xsd does not import CompanyIncorporation-v1-2.xsd, so it doesn't know about the CompanyIncorporation element, so the document does not validate.
You need to add xsd:import elements to your main schema (Egov_ch-v2-0.xsd) to import all other schemas that may be used in the document. You can either do this in the xsd file itself, or you can add the elements programmatically after parsing:
import lxml.etree

xsd = lxml.etree.parse('http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd')
newimport = lxml.etree.Element(
    '{http://www.w3.org/2001/XMLSchema}import',
    namespace="http://xmlgw.companieshouse.gov.uk",
    schemaLocation="http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd")
# xs:import has to precede the other top-level schema components,
# so put it at the front of the root rather than at the end.
xsd.getroot().insert(0, newimport)
validator = lxml.etree.XMLSchema(xsd)
You can even do this in a generic way with a function that takes a list of schema paths and returns a list of xsd:import statements with namespace and schemaLocation set by parsing targetNamespace.
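A minimal sketch of such a helper, assuming each entry in the list is something etree.parse can open and that the same string is usable as a schemaLocation. The name build_imports is hypothetical. The stdlib fallback is there only so the helper itself runs without lxml; the validation step still needs lxml.

```python
try:
    from lxml import etree  # needed for the actual XMLSchema validation
except ImportError:  # assumption: fall back so the helper alone still runs
    import xml.etree.ElementTree as etree

XS_IMPORT = "{http://www.w3.org/2001/XMLSchema}import"


def build_imports(schema_paths):
    """Return one xs:import element per schema in schema_paths (hypothetical helper)."""
    imports = []
    for path in schema_paths:
        root = etree.parse(path).getroot()
        attrib = {"schemaLocation": path}
        # A schema without targetNamespace is imported without a namespace attribute.
        target_ns = root.get("targetNamespace")
        if target_ns is not None:
            attrib["namespace"] = target_ns
        imports.append(etree.Element(XS_IMPORT, attrib))
    return imports
```

Each returned element can then be placed at the front of the main schema's root element, for example with xsd.getroot().insert(0, element), before constructing lxml.etree.XMLSchema(xsd).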
(As an aside, you should probably download these schema documents and reference them with filesystem paths rather than load them over the network.)

Related

Why doesn't Element.attrib include namespace definitions?

I'd like to create a XML namespace mapping (e.g., to use in findall calls as in the Python documentation of ElementTree). Given the definitions seem to exist as attributes of the xbrl root element, I'd have thought I could just examine the attrib attribute of the root element within my ElementTree. However, the following code
from io import StringIO
import xml.etree.ElementTree as ET
TEST = '''<?xml version="1.0" encoding="utf-8"?>
<xbrl
xml:lang="en-US"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:country="http://xbrl.sec.gov/country/2021"
xmlns:dei="http://xbrl.sec.gov/dei/2021q4"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:nvda="http://www.nvidia.com/20220130"
xmlns:srt="http://fasb.org/srt/2021-01-31"
xmlns:stpr="http://xbrl.sec.gov/stpr/2021"
xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</xbrl>'''
xbrl = ET.parse(StringIO(TEST))
print(xbrl.getroot().attrib)
produces the following output:
{'{http://www.w3.org/XML/1998/namespace}lang': 'en-US'}
Why aren't any of the namespace attributes showing up in root.attrib? I'd at least expect xmlns to be in the dictionary given it has no prefix.
What have I tried?
The following code seems to work to generate the namespace mapping:
print({prefix: uri for key, (prefix, uri) in ET.iterparse(StringIO(TEST), events=['start-ns'])})
output:
{'': 'http://www.xbrl.org/2003/instance',
'country': 'http://xbrl.sec.gov/country/2021',
'dei': 'http://xbrl.sec.gov/dei/2021q4',
'iso4217': 'http://www.xbrl.org/2003/iso4217',
'link': 'http://www.xbrl.org/2003/linkbase',
'nvda': 'http://www.nvidia.com/20220130',
'srt': 'http://fasb.org/srt/2021-01-31',
'stpr': 'http://xbrl.sec.gov/stpr/2021',
'us-gaap': 'http://fasb.org/us-gaap/2021-01-31',
'xbrldi': 'http://xbrl.org/2006/xbrldi',
'xlink': 'http://www.w3.org/1999/xlink',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
But yikes is it gross to have to parse the file twice.
As for the answer to your specific question, why the attrib list doesn't contain the namespace prefix declarations, sorry for the unsatisfying answer: because they're not attributes.
The xml:lang entry does show up because it is an ordinary attribute; the xml prefix is permanently bound to the reserved namespace http://www.w3.org/XML/1998/namespace. By contrast, xmlns="uri" and xmlns:prefix="uri" are namespace declarations. They look like attributes in the markup, but a namespace-aware parser consumes them to resolve element and attribute names instead of reporting them as attributes on the element. I don't have a reference for this, but it has held true in at least half a dozen (correct) implementations of XML parsers that I've used, including those from IBM, Microsoft and Oracle.
As for the ugliness of reparsing the file, I feel your pain, but ElementTree does not keep the declarations on the parsed tree, so they have to be collected from the parser's start-ns events. As tdelaney so well pointed out, you may not assume that all of your namespace declarations or prefixes are on your root element.
Be prepared for the possibility of the same prefix being redefined with a different namespace on every node in your document. This may hold true and the library must work correctly with it, even if it is never the case in your document (or worse, if it has simply never been the case so far).
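One way to soften the double-parse cost is to collect the declarations in the same iterparse pass that builds the tree. This is only a sketch: the function name parse_with_namespaces and the shortened sample document are made up for illustration, and the flat dict keeps only the first binding seen for each prefix, so it does not model a prefix that is rebound deeper in the document.

```python
from io import StringIO
import xml.etree.ElementTree as ET


def parse_with_namespaces(source):
    """Parse source once and return (root, {prefix: uri})."""
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            # Keep the first binding only; later rebindings are ignored here.
            namespaces.setdefault(prefix, uri)
        elif root is None:
            root = payload  # the first "start" event belongs to the root element
    return root, namespaces


# Shortened, hypothetical stand-in for the XBRL document in the question.
SAMPLE = '''<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <xlink:child/>
</xbrl>'''

root, ns = parse_with_namespaces(StringIO(SAMPLE))
```

The root is complete once the loop has finished, so the returned mapping can be passed straight to calls such as root.find("xlink:child", ns).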
Consider whether you are shoehorning text processing into parsing or querying XML when there may be a better solution, like XPath or XQuery. Saxon has seen some good recent changes and has Python wrappers, even though its pricing model has changed.

Is it safe to use the default load in ruamel.yaml

I can load and dump YAML files with tags using ruamel.yaml, and the tags are preserved.
If I let my customers edit the YAML document, will they be able to exploit the YAML vulnerabilities because of arbitrary python module code execution? As I understand it ruamel.yaml is derived from PyYAML that has such vulnerabilities according to its documentation.
From your question I deduce you are using the .load() method on a YAML() instance, as in:
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load(some_path)
and not the old PyYAML-compatible load function (which cannot handle unknown tags, and which can be unsafe). The short answer is yes, that is safe: the tags are not interpreted and no tag-dependent code is called. The tags just get assigned (as strings) to known types (CommentedMap, CommentedSeq, TaggedScalar for mapping, sequence and scalar respectively).
The vulnerability in PyYAML comes from its Constructor, which is part of the unsafe Loader. This was the default for PyYAML for the longest time, and most people used it even when they could have used the SafeLoader because they were not using any tags (or could have registered the tags they needed against the SafeLoader/SafeConstructor). That unsafe Constructor is a subclass of SafeConstructor, and what makes it unsafe are the multi-methods registered for the interpretation of python/{name,module,object,apply,new}:.... tags, essentially dynamically interpreting these tags to load modules and run code (i.e. execute functions/methods/instantiate classes).
The initial idea behind ruamel.yaml was its RoundTripConstructor, which is also a subclass of the SafeConstructor. You used to get this using the now deprecated round_trip_load function, and nowadays you get it via the .load() method after creating YAML(typ='rt'); this is also the default for a YAML() instance without a typ argument. That RoundTripConstructor does not register any tags or have any code that interprets tags above and beyond the normal tags that the SafeConstructor uses.
Only the old PyYAML load and ruamel.yaml using typ='unsafe' can execute arbitrary Python code on doctored YAML input by executing code via the !!python/.. tags.
When you use typ='rt' the tag on nodes is preserved, even when a tag is not registered. This is possible because while processing nodes (in round-trip mode), those tags will just be assigned as strings to an attribute on the constructed type of the tagged node (mapping, sequence, or scalar, with the exception of tagged null). And when dumping these types, if they have a tag, it gets re-inserted into the representation processing code. At no point is the tag evaluated, used to import code, etc.

Access python dict value in yaml with tags

Is it possible to load the value from a python dict in yaml?
I can access a variable by using:
!!python/name:mymodule.myfile.myvar
but this gives the whole dict.
Trying to use the dict's get method like so:
test: &TEST !!python/object/apply:mymod.myfile.mydict.get ['mykey']
gives me the following error:
yaml.constructor.ConstructorError: while constructing a Python object cannot find module 'mymod.myfile.mydict' (No module named 'mymod.myfile.mydict'; 'mymod.myfile' is not a package)
I'm trying to do this because I have a bunch of YAML files that define my project settings. One of them holds directory paths, and I need to load it into some of the other YAML files, but it looks like you can't load a YAML variable from another YAML file.
EDIT:
I have found one solution: creating my own function that returns the value from the dict and calling it like so:
test: &TEST !!python/object/apply:mymod.myfile.get_dict_value ['mykey']
There is no mechanism in YAML to refer to one document from another YAML document.
You'll have to do that by interpreting information in the document in the program that loads the initial YAML document. Whether you do that by explicit logic, or by using some tag doesn't make a practical difference.
Please be aware that it is unsafe to allow interpreting tags of the form !!python/name:..... (via yaml=YAML(typ='unsafe') in ruamel.yaml, or load() in PyYAML), and that doing so is never really necessary.
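As an illustration of doing the lookup in the loading program instead, here is a sketch that registers a custom tag on a PyYAML SafeLoader subclass. The tag name !setting, the PATHS dict (standing in for mymod.myfile.mydict) and the SettingsLoader class are all hypothetical names chosen for this example.

```python
import yaml

# Hypothetical settings dict standing in for mymod.myfile.mydict.
PATHS = {"mykey": "/srv/project/data"}


class SettingsLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the custom tag is not registered globally."""


def construct_setting(loader, node):
    # The scalar after the tag is the key to look up in PATHS.
    key = loader.construct_scalar(node)
    return PATHS[key]


SettingsLoader.add_constructor("!setting", construct_setting)

config = yaml.load("test: !setting mykey\n", Loader=SettingsLoader)
```

Because the loader only knows this one extra tag, a document can look values up in PATHS but cannot name arbitrary Python objects to import or call.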

xs:import schemaLocation redirect not followed/https not supported in lxml?

We've recently moved to an https-everywhere policy on our production infrastructure, and it's causing some problems with XML schema validation, which I wonder if anyone can help me clarify.
We have schema documents available at static paths on our server, and they were once available at http://path.to/schema.xsd. They are no longer available there, instead being at https://path.to/schema.xsd, but any calls to the original URL result in a 301 (Moved Permanently).
The schema documents themselves have references to each other, in particular there's a parent schema which includes the line:
<xs:import namespace="http://example.com/schemas/iso_639-2b/1.0"
schemaLocation="http://example.com/static/iso_639-2b.xsd">
(note that the schemaLocation points to the http version of the URL still, as we weren't anticipating having to change this, what with the redirect in place)
In attempting to validate an incoming XML document against the schema
schema_path = "/path/to/file/on/disk/schema.xsd"
schema_file = open(schema_path)
schema_doc = etree.parse(schema_file)
schema = etree.XMLSchema(schema_doc)
The root schema.xsd loads (since it comes from local disk), but on the final line, when we initialise an XMLSchema (using lxml as the underlying etree implementation) we get an exception:
{XMLSchemaParseError}element decl. 'language', attribute 'type': The QName value '{http://example.com/schemas/iso_639-2b/1.0}LanguageCodeType' does not resolve to a(n) type definition., line 41
My working theory is that lxml (or even the xml schema specification, though I haven't been able to find any documentation) either doesn't follow redirects or doesn't support https (or both!).
Any info, and advice on what the appropriate fix is would be much appreciated.

Documenting tags in between paragraphs in Pythons Epydoc

I'm writing documentation for methods in python that is supposed to be available for end users to read through. I'm using Epydoc field tags to document the argument based on requirements given to me, and am trying to put the parameter description in between the description of the method and the examples of using the method as such:
"""
A description of what this utility does.
@param utilityArguments: Description of arguments
@type utilityArguments: String
A list of examples:
example1
example2
example3
"""
Unfortunately I have not had any success in finding a way to exclude the examples from the type tag; they get added to it instead of staying separate. I'd rather not move the parameters to the end of the docstring, because we feel this layout looks neater. Is there any way to terminate a tag and exclude any following text from it?
Sorry to be the bearer of bad news, but the Epydoc documentation specifically disallows this behavior:
Fields must be placed at the end of the docstring, after the description of the object. Fields may be included in any order.
Since the fields are the @param, @return and similar markup, everything that follows a field is considered part of that field (unless it is another field).
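A sketch of the layout that rule leads to, with the examples moved ahead of the fields and the fields written with Epydoc's @ markers. The function name run_utility and its body are placeholders for illustration.

```python
def run_utility(utilityArguments):
    """
    A description of what this utility does.

    A list of examples:
      - example1
      - example2
      - example3

    @param utilityArguments: Description of arguments
    @type utilityArguments: String
    """
    # Placeholder body; only the docstring layout matters here.
    return utilityArguments
```

With the fields last, the examples stay part of the description instead of being absorbed into the @type field.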
