I can load and dump YAML files with tags using ruamel.yaml, and the tags are preserved.
If I let my customers edit the YAML document, will they be able to exploit YAML vulnerabilities that allow arbitrary Python code execution? As I understand it, ruamel.yaml is derived from PyYAML, which has such vulnerabilities according to its documentation.
From your question I deduce you are using the .load() method on a YAML() instance, as in:
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load(some_path)
and not the old PyYAML-compatible load function (which cannot handle unknown tags, and which can be unsafe). The short answer is: yes, that is safe. No interpretation of tags is done and no tag-dependent code is called; the tags just get assigned (as strings) to known types (CommentedMap, CommentedSeq, and TaggedScalar for mappings, sequences, and scalars respectively).
The vulnerability in PyYAML comes from its Constructor, which is part of the unsafe Loader. This was the default for PyYAML for the longest time, and most people used it even when they could have used the SafeLoader because they were not using any tags (or could have registered the tags they needed against the SafeLoader/SafeConstructor). That unsafe Constructor is a subclass of SafeConstructor, and what makes it unsafe are the multi-methods registered for the interpretation of python/{name,module,object,apply,new}:... tags, essentially dynamically interpreting these tags to load modules and run code (i.e. execute functions/methods/instantiate classes).
The initial idea behind ruamel.yaml was its RoundTripConstructor, which is also a subclass of SafeConstructor. You used to get this via the now-deprecated round_trip_load function and nowadays via the .load() method after using YAML(typ='rt'); this is also the default for a YAML() instance without a typ argument. That RoundTripConstructor does not register any tags, nor does it have any code that interprets tags, above and beyond the normal tags that the SafeConstructor uses.
Only the old PyYAML load and ruamel.yaml with typ='unsafe' can be made to execute arbitrary Python code on doctored YAML input, via the !!python/... tags.
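To make the distinction concrete, here is the classic demonstration (a sketch; the payload just echoes a string) of what the unsafe path will happily execute and the safe path rejects:

import yaml  # PyYAML

# a doctored document: an unsafe loader runs the shell command while loading
doc = "!!python/object/apply:os.system ['echo pwned']"

yaml.load(doc, Loader=yaml.UnsafeLoader)  # prints 'pwned' as a side effect
yaml.safe_load(doc)                       # raises yaml.constructor.ConstructorError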
When you use typ='rt' the tag on a node is preserved, even when the tag is not registered. This is possible because, while processing nodes in round-trip mode, those tags are just assigned as strings to an attribute on the constructed type of the tagged node (mapping, sequence, or scalar, with the exception of tagged null). When dumping these types, if they have a tag, it gets re-inserted into the representation processing code. At no point is the tag evaluated or used to import code.
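A minimal sketch of that behaviour (the tag name and document are made up):

import ruamel.yaml
from io import StringIO

doc = """\
config: !MyCustomTag
  key: value
"""

yaml = ruamel.yaml.YAML()      # typ='rt' is the default
data = yaml.load(doc)          # the unknown tag is kept as a string attribute
buf = StringIO()
yaml.dump(data, buf)
assert buf.getvalue() == doc   # the tagged node round-trips untouched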
Related
I'd like to create an XML namespace mapping (e.g., to use in findall calls as in the Python documentation of ElementTree). Given that the definitions seem to exist as attributes of the xbrl root element, I'd have thought I could just examine the attrib attribute of the root element within my ElementTree. However, the following code
from io import StringIO
import xml.etree.ElementTree as ET
TEST = '''<?xml version="1.0" encoding="utf-8"?>
<xbrl
xml:lang="en-US"
xmlns="http://www.xbrl.org/2003/instance"
xmlns:country="http://xbrl.sec.gov/country/2021"
xmlns:dei="http://xbrl.sec.gov/dei/2021q4"
xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
xmlns:link="http://www.xbrl.org/2003/linkbase"
xmlns:nvda="http://www.nvidia.com/20220130"
xmlns:srt="http://fasb.org/srt/2021-01-31"
xmlns:stpr="http://xbrl.sec.gov/stpr/2021"
xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31"
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</xbrl>'''
xbrl = ET.parse(StringIO(TEST))
print(xbrl.getroot().attrib)
produces the following output:
{'{http://www.w3.org/XML/1998/namespace}lang': 'en-US'}
Why aren't any of the namespace attributes showing up in root.attrib? I'd at least expect xmlns to be in the dictionary, given it has no prefix.
What have I tried?
The following code seems to work to generate the namespace mapping:
print({prefix: uri for key, (prefix, uri) in ET.iterparse(StringIO(TEST), events=['start-ns'])})
output:
{'': 'http://www.xbrl.org/2003/instance',
'country': 'http://xbrl.sec.gov/country/2021',
'dei': 'http://xbrl.sec.gov/dei/2021q4',
'iso4217': 'http://www.xbrl.org/2003/iso4217',
'link': 'http://www.xbrl.org/2003/linkbase',
'nvda': 'http://www.nvidia.com/20220130',
'srt': 'http://fasb.org/srt/2021-01-31',
'stpr': 'http://xbrl.sec.gov/stpr/2021',
'us-gaap': 'http://fasb.org/us-gaap/2021-01-31',
'xbrldi': 'http://xbrl.org/2006/xbrldi',
'xlink': 'http://www.w3.org/1999/xlink',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
But yikes is it gross to have to parse the file twice.
As for your specific question, why the attrib dict doesn't contain the namespace prefix declarations, sorry for the unsatisfying answer: because they're not attributes.
http://www.w3.org/XML/1998/namespace is a special namespace that doesn't act like the other namespaces in your document: attributes in it, such as xml:lang, are genuine attributes. In all other cases, xmlns="uri" and xmlns:prefix="uri" are a special thing, namespace prefix declarations, which are different from attributes on a node or element. I don't have a reference for this, but it holds true in at least a half dozen (correct) implementations of XML parsers that I've used, including those from IBM, Microsoft, and Oracle.
As for the ugliness of reparsing the file, I feel your pain, but it's necessary. As tdelaney so well pointed out, you may not assume that all of your namespace declarations or prefixes are on your root element.
Be prepared for the possibility of the same prefix being redefined with a different namespace on every node in your document. That may hold true, and the library must work correctly with it, even if it is never the case in your document (or worse, merely hasn't been the case so far).
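If the double parse really bothers you, note that a single iterparse pass can do both jobs at once (a sketch using only the standard library; TEST is the document from the question):

from io import StringIO
import xml.etree.ElementTree as ET

ns_map = {}
root = None
for event, payload in ET.iterparse(StringIO(TEST), events=['start-ns', 'start']):
    if event == 'start-ns':
        prefix, uri = payload   # a (prefix, uri) tuple for this declaration
        ns_map[prefix] = uri
    elif root is None:          # the first 'start' event carries the root element
        root = payload

print(ns_map)   # same mapping as above, built in a single pass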
Consider whether you are shoehorning text processing into parsing or querying XML when there may be a better solution, like XPath or XQuery. There have been some good recent changes to, and Python wrappers for, Saxon, even though its pricing model has changed.
Is it possible to load the value from a python dict in yaml?
I can access a variable by using:
!!python/name:mymodule.myfile.myvar
but this gives the whole dict.
Trying to use the dict get method like so:
test: &TEST !!python/object/apply:mymod.myfile.mydict.get ['mykey']
gives me the following error:
yaml.constructor.ConstructorError: while constructing a Python object cannot find module 'mymod.myfile.mydict' (No module named 'mymod.myfile.mydict'; 'mymod.myfile' is not a package)
I'm trying to do that because I have a bunch of YAML files which define my project settings (one of them holds directory paths), and I need to load values from one YAML file into some other YAML files, but it looks like you can't reference a YAML variable from another YAML file.
EDIT:
I have found one solution: creating my own function that returns the value from the dict, and calling it like so:
test: &TEST !!python/object/apply:mymod.myfile.get_dict_value ['mykey']
There is no mechanism in YAML to refer to one document from another YAML document.
You'll have to do that by interpreting information in the document in the program that loads the initial YAML document. Whether you do that with explicit logic or by using some tag makes no practical difference.
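As a minimal sketch of the explicit-logic route (the file names, keys, and the 'lookup' convention here are all made up for illustration):

from ruamel.yaml import YAML

yaml = YAML(typ='safe')

with open('paths.yaml') as f:        # e.g. contains:  mykey: /data/project
    paths = yaml.load(f)

with open('settings.yaml') as f:     # e.g. contains:  test: {lookup: mykey}
    settings = yaml.load(f)

# resolve the cross-file reference in code instead of via a !!python tag
settings['test'] = paths[settings['test']['lookup']]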
Please be aware that it is unsafe to allow interpretation of tags of the form !!python/name:... (via YAML(typ='unsafe') in ruamel.yaml, or load() in PyYAML), and it is never really necessary.
I wrote a python script to retrieve data from a website in json format using the requests library, and then I dump it into a json file. I have written a lot of code utilizing this data and have tested it in Windows only. Recently I shifted to a Linux system, and when the same python script is executed, the order of the keys in the json file is completely different.
This is the code I'm using:
API_request = requests.get('https://www.abcd.com/datarequest')
alertJson_Data = API_request.json() # To convert returned data to json
json.dump(alertJson_Data, jsonDataFile) # for adding the json data for the alert to the file
jsonDataFile.write('\n')
jsonDataFile.close()
A lot of my other scripts depend on the ordering of the keys in this JSON file, so is there any way to get the same ordering on Linux that is used on Windows?
For example, on Windows the order is "id":, "src":, "dest":, whereas on Linux it's completely different. If I go directly to the web link in my browser, it has the same ordering as the one saved on Windows. How do I retain this ordering?
Can you use collections.OrderedDict when loading json?
e.g.
from collections import OrderedDict
alertJson_Data = API_request.json(object_pairs_hook=OrderedDict)
should work, because the json() method on a requests Response takes the same optional keyword arguments as json.loads:
json(**kwargs)
Returns the json-encoded content of a response, if any.
Parameters: **kwargs – Optional arguments that json.loads takes.
Raises: ValueError – If the response body does not contain valid json.
And the doc of json.loads specify:
object_hook, if specified, will be called with the result of every
JSON object decoded and its return value will be used in place of the
given dict. This can be used to provide custom deserializations (e.g.
to support JSON-RPC class hinting).
object_pairs_hook, if specified will be called with the result of
every JSON object decoded with an ordered list of pairs. The return
value of object_pairs_hook will be used instead of the dict. This
feature can be used to implement custom decoders that rely on the
order that the key and value pairs are decoded (for example,
collections.OrderedDict() will remember the order of insertion). If
object_hook is also defined, the object_pairs_hook takes priority.
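Putting it together with the code from the question (a sketch; the URL and the file name are placeholders):

import json
from collections import OrderedDict

import requests

API_request = requests.get('https://www.abcd.com/datarequest')
# the hook is forwarded to json.loads, so keys keep the order the server sent
alertJson_Data = API_request.json(object_pairs_hook=OrderedDict)

with open('alerts.json', 'a') as jsonDataFile:
    json.dump(alertJson_Data, jsonDataFile)  # insertion order is preserved on dump
    jsonDataFile.write('\n')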
We have a REST web service written in Perl Dancer. It returns perl data structures in YAML format and also takes in parameters in YAML format - it is supposed to work with some other teams who query it using Python.
Here's the problem: if I'm passing back just a regular old Perl hash via Dancer's serialization, everything works completely fine. JSON, YAML, XML... they all do the job.
HOWEVER, sometimes we need to pass Perl objects back that the Python side can later pass back in as a parameter, to avoid unnecessary loading, etc. I played around and found that YAML is the only one of those serializers that works with Perl's blessed objects in Dancer.
The problem is that Python's YAML can't parse the YAML for those Perl objects (whereas it can handle regular old Perl hash YAML without an issue).
The perl objects start out like this in YAML:
First one:
--- &1 !!perl/hash:Sequencing_API
Second:
--- !!perl/hash:SDB::DBIO
It errors out like this.
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:perl/hash:SDB::DBIO'
The regular files seem to get passed through like this:
---
fields:
library:
It seems like the extra stuff after --- is causing the issues. What can I do to address this? Or am I trying to do too much by passing around Perl objects?
the short answer is
!! is YAML shorthand for tag:yaml.org,2002:, so !!perl/hash is really tag:yaml.org,2002:perl/hash
now you need to tell Python's yaml how to deal with this tag, so you add a multi-constructor for it as follows
import yaml

# multi-constructors are called with (loader, tag_suffix, node)
def construct_perl_object(loader, tag_suffix, node):
    print("S:", tag_suffix, "N:", node)
    return loader.construct_mapping(node)  # treat the Perl hash as a plain dict

# match every tag starting with this prefix (the suffix is e.g. 'SDB::DBIO')
yaml.add_multi_constructor("tag:yaml.org,2002:perl/hash:",
                           construct_perl_object, Loader=yaml.SafeLoader)
data = yaml.load(yaml_string, Loader=yaml.SafeLoader)
or maybe just parse it out, or return None ... it's hard to test with just that fragment, but that may be what you are looking for
I am generating an XML document for which different XSDs have been provided for different parts (which is to say, definitions for some elements are in certain files, definitions for others are in others).
The XSD files do not refer to each other. The schemas are:
http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd
http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/FormSubmission-v1-1.xsd
http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd
Is there a way to validate the document against all of the schemas using lxml?
The solution here is not simply to validate individually against each schema, because validation then fails on elements that are not specified in that particular XSD. For example, when validating against http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd, I get the error:
File "lxml.etree.pyx", line 3006, in lxml.etree._Validator.assertValid (src/lxml/lxml.etree.c:125415)
DocumentInvalid: Element '{http://xmlgw.companieshouse.gov.uk}CompanyIncorporation': No matching global element declaration available, but demanded by the strict wildcard., line 9
Because the document in question contains a {http://xmlgw.companieshouse.gov.uk}CompanyIncorporation element, which is not specified in the XSD being validated against, but in one of the other XSD files.
I believe you should only be validating against Egov_ch-v2-0.xsd, which appears to define an envelope document. (This is the document you are creating, right? You haven't shown your XML.)
This schema uses <xs:any namespace="##any" minOccurs="0"/> to define body contents of the envelope. However, xsd:any does not mean "ignore all contents." Rather it means "accept anything here." Whether to validate or ignore the contents is controlled by the processContents attribute, which defaults to strict. This means that any elements discovered here must validate against types available to the schema. However, Egov_ch-v2-0.xsd does not import CompanyIncorporation-v1-2.xsd, so it doesn't know about the CompanyIncorporation element, so the document does not validate.
You need to add xsd:import elements to your main schema (Egov_ch-v2-0.xsd) to import all other schemas that may be used in the document. You can either do this in the xsd file itself, or you can add the elements programmatically after parsing:
import lxml.etree

xsd = lxml.etree.parse('http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd')
newimport = lxml.etree.Element('{http://www.w3.org/2001/XMLSchema}import',
                               namespace="http://xmlgw.companieshouse.gov.uk",
                               schemaLocation="http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd")
# xs:import must come before the other top-level components of the schema
xsd.getroot().insert(0, newimport)
validator = lxml.etree.XMLSchema(xsd)
You can even do this in a generic way with a function that takes a list of schema paths and returns a list of xsd:import statements with namespace and schemaLocation set by parsing targetNamespace.
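A sketch of that generic helper (the function name is made up; it reads each schema's targetNamespace and builds a matching xsd:import, reusing the xsd document from the snippet above):

import lxml.etree

XS = '{http://www.w3.org/2001/XMLSchema}'

def build_imports(schema_locations):
    """One xs:import element per schema, namespace read from its targetNamespace."""
    imports = []
    for location in schema_locations:
        target_ns = lxml.etree.parse(location).getroot().get('targetNamespace')
        imports.append(lxml.etree.Element(XS + 'import',
                                          namespace=target_ns,
                                          schemaLocation=location))
    return imports

# usage: prepend each import to the main schema document before compiling it
for imp in build_imports(['CompanyIncorporation-v1-2.xsd']):
    xsd.getroot().insert(0, imp)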
(As an aside, you should probably download these schema documents and reference them with filesystem paths rather than load them over the network.)