The TOML documentation says: "TOML and YAML both emphasize human readability features, like comments that make it easier to understand the purpose of a given line. TOML differs in combining these, allowing comments (unlike JSON) but preserving simplicity (unlike YAML)."
I can see that TOML doesn't rely on significant whitespace, but other than that I am not sure about the simplicity it claims. What is that, exactly?
Then I see StrictYAML: "StrictYAML is a type-safe YAML parser that parses and validates a restricted subset of the YAML specification." Type-safe: what is that, exactly (again)?
What is the problem that TOML didn't fix for YAML, but that StrictYAML thinks it does? I did read through the articles on the StrictYAML site, but I am still not clear.
So both TOML and StrictYAML want to fix the "problem" YAML has. But apart from the indentation, what is the problem?
---- update ----
I found a Reddit thread in which the author of StrictYAML talked about YAML vs TOML. But the answer I got so far says "strictyaml displays a rather poor understanding of YAML", while https://github.com/crdoconnor/strictyaml had 957 stars as of 2021/12/28. So I am a bit lost as to which one I should use, or whether I should just stick with YAML, because most of my YAML is simple.
YAML downsides:
Implicit typing causes surprise type changes (e.g. put 3 where you previously had a string and it will magically turn into an int).
A bunch of nasty "hidden features" like node anchors and references that make it look unclear (although, to be fair, a lot of people don't use these).
TOML downsides:
Noisier syntax (especially with multiline strings).
The way arrays/tables are done is confusing, especially arrays of tables.
I wrote a library that removed most of the nasty stuff I didn't like about YAML, leaving the core which I liked. It's got a pretty detailed comparison between it and a bunch of other config formats, e.g.:
https://hitchdev.com/strictyaml/why-not/toml/
This may be an opinionated answer as I have written multiple YAML implementations.
Common Criticism of YAML addressed by the alternatives
YAML's outstanding semantic feature is that it can represent a possibly cyclic graph. Moreover, YAML mappings can use complex nodes (sequences or mappings) as keys. These features are what you potentially need when you want to represent an arbitrary data structure.
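To make this concrete, here is a minimal sketch (using PyYAML; the document and names are made up) of a node anchored once and aliased elsewhere, which is the mechanism that makes shared and cyclic structures representable; both keys resolve to the very same Python object:
>>> import yaml
>>> data = yaml.safe_load("a: &shared {name: common}\nb: *shared")
>>> data['a'] is data['b']
True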
Another exotic YAML feature is tags. Their goal is to abstract over different types in different programming languages; e.g., a !!map would be a dict in Python but an object in JavaScript. While seldom used explicitly, implicit tag resolution is why false is usually loaded as a boolean value while droggeljug is loaded as a string. The apparent goal here was to reduce noise by not requiring you to write boolean values like !!bool false or to quote every string value.
However, reality has shown that many people are confused by this, and the fact that YAML 1.1 defines that yes may be parsed as a boolean has not helped either. YAML 1.2 tried to remedy this a bit by describing different schemas you can use: the basic "failsafe" schema exclusively loads mappings, sequences, and strings, while the more complex "JSON" and "core" schemas do additional type guessing. However, most YAML implementations, prominently PyYAML, remained on YAML 1.1 for a long time (many implementations were originally rewritten PyYAML code, e.g., libyaml, SnakeYAML). This cemented the view that YAML makes questionable typing decisions that need fixing.
Nowadays, some implementations have improved, and you can use the failsafe schema to avoid unwanted boolean values. In this regard, StrictYAML restricts itself to the failsafe schema; don't believe its argument that this is some novelty PyYAML can't do.
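For example, PyYAML's BaseLoader behaves much like the failsafe schema and loads every scalar as a string (a quick sketch):
>>> import yaml
>>> yaml.load("answer: no\ncount: 3", Loader=yaml.SafeLoader)
{'answer': False, 'count': 3}
>>> yaml.load("answer: no\ncount: 3", Loader=yaml.BaseLoader)
{'answer': 'no', 'count': '3'}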
A common security issue with YAML implementations is that they mapped tags to arbitrary constructor calls (you can read about an exploit in Ruby on Rails based on this here). Mind that this is not a YAML shortcoming; YAML doesn't suggest calling unknown functions during object construction anywhere. The base issue here is that data serialization is the enemy of data encapsulation; if your programming language offers constructors as the sole method for constructing an object, that's what you need to call when deserializing data. The remedy is to only call known constructors, which was implemented broadly after a series of such exploits (another one with SnakeYAML, iirc) surfaced. Nowadays, to call unknown constructors in PyYAML, you need to use a loader aptly named UnsafeLoader.
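To sketch what "only call known constructors" means in practice: PyYAML's safe loader simply refuses tags it has no registered constructor for (the error text is abbreviated here):
>>> import yaml
>>> yaml.safe_load("!!python/object/apply:os.system ['echo owned']")
Traceback (most recent call last):
  ...
yaml.constructor.ConstructorError: could not determine a constructor for the tag ...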
TOML
TOML's main semantic difference is that it doesn't support cycles, complex keys, or tags. This means that while you can load YAML into an arbitrary user-defined class, you always load TOML into tables or arrays containing your data.
For example, while YAML allows you to load {foo: 1, bar: 2} into an object of a class with foo and bar integer fields, TOML will always load this into a table. A prominent example of YAML's capabilities you usually find in documentation is that it can load the scalar 1d6 into an object {number: 1, sides: 6}; TOML will always load it as string "1d6".
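A quick sketch with tomllib (the standard-library TOML parser since Python 3.11) showing that you always get plain tables and strings back:
>>> import tomllib
>>> tomllib.loads('dice = "1d6"\n\n[scores]\nfoo = 1\nbar = 2')
{'dice': '1d6', 'scores': {'foo': 1, 'bar': 2}}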
TOML's perceived simplicity here is that it doesn't do some of the stuff YAML does. For example, if you're using a statically typed language like Java, after loading {foo: 1, bar: 2} into an object myObject, you can access myObject.foo safely (getting the integer 1). If you used TOML, you would need to do myObject["foo"], which could raise an exception if the key doesn't exist. This distinction matters less in scripting languages like Python: there, myObject.foo compiles just as well and fails with a runtime error if foo does not happen to be a property of myObject.
My perspective from answering a lot of YAML questions here is that people don't use YAML's features and often load everything into a structure like Map<String, Object> (using Java as an example) and take it from there. If you do this, you might as well use TOML.
A different kind of simplicity TOML offers is its syntax: since it is vastly simpler than YAML's, it is easier to emit errors users can understand. For example, a common error text for YAML syntax errors is "mapping values are not allowed in this context" (try searching for it on SO to find tons of questions). You get it, for example, here:
foo: 1
 bar: 2
The error message does not help the user fix the error. This is because of YAML's complex syntax: YAML thinks 1 and bar are part of a multi-line scalar (because bar: is indented more than foo:), puts them together, then sees a second : and fails, because multi-line scalars may not be used as implicit keys. However, most likely the user simply mis-indented bar:, or was under the impression that they could give foo both a scalar value (1) and some children. Because of the many possibilities in YAML syntax, it would be tough to write error messages that actually help the user.
Since TOML's syntax is much simpler, the error messages are easier to understand. This is a big plus if the user writing TOML is not expected to be someone with a background in parsing grammars.
TOML has a conceptual advantage over YAML: since its structure allows less freedom, it tends to be easier to read. When reading TOML, you always know: "okay, I'm going to have nested tables with values in them", while with YAML, you have some arbitrary structure. I believe this requires more cognitive load when reading a YAML file.
StrictYAML
StrictYAML argues that it provides type-safety, but since YAML isn't a programming language and specifically doesn't support assignments, this claim doesn't make any sense based on the Wikipedia definition StrictYAML links to (type safety comes and goes with the programming language you use; e.g., any YAML is type-safe after loading it into a proper Java class instance, but you'll never be type-safe in a language like Python). Going over its list of removed features, it displays a rather poor understanding of YAML:
Implicit Typing: Can be deactivated in YAML implementations using the failsafe schema, as discussed above.
Direct representations of objects: It simply links to the Ruby on Rails incident, implying that this cannot be avoided, even though most implementations are safe today without removing the feature.
Duplicate Keys Disallowed: The YAML specification already requires this.
Node anchors and refs: StrictYAML argues that using this for deduplication is unreadable to non-programmers, ignoring that the intention was to be able to serialize cyclic structures, which is not possible without anchors and aliases.
On the deserialization side,
All data is a string, list or OrderedDict
It is basically the same structure TOML supports (I believe StrictYAML doesn't support complex keys in mappings, as neither list nor OrderedDict is hashable in Python).
You are also losing the ability to deserialize to predefined class structures. One could argue that the inability to construct a class object with well-defined fields makes StrictYAML less type-safe than standard YAML: A standard YAML implementation can guarantee that the returned object has a certain structure described by types, while StrictYAML gives you on every level either a string, a list or an OrderedDict and you can't do anything to restrict it.
While quite a few of its arguments are flawed, the resulting language is still usable. For example, with StrictYAML you do not need to care about the billion laughs attack that haunts some YAML implementations. Again, this is not a YAML problem but an implementation problem, as YAML does not require an implementation to duplicate a node that is anchored and referred to from multiple places.
Bottom Line
Quite a few YAML issues stem from poor implementations, not from issues in the language itself. However, YAML as a language is certainly complex, and syntax errors can be hard to understand, which could be a valid reason to use a language like TOML. As for StrictYAML, it does offer some benefit, but I suggest against using it because it has no proper specification and only a single implementation, a setup that is very prone to becoming a maintenance nightmare (the project could be discontinued, and breaking changes are easily possible).
StrictYAML Type Safety -- The "Norway Problem"
Here's an example given by the StrictYAML people:
countries:
  - FR
  - DE
  - NO
PyYAML's implicit typing translates "NO" into a boolean False:
>>> from yaml import safe_load
>>> safe_load(the_configuration)
{'countries': ['FR', 'DE', False]}
https://hitchdev.com/strictyaml/why/implicit-typing-removed
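For contrast, a minimal sketch of the same document parsed with strictyaml (no schema given, assuming the package is installed); every scalar stays a string:
>>> from strictyaml import load
>>> load("countries:\n- FR\n- DE\n- NO").data['countries']
['FR', 'DE', 'NO']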
Related
I am trying to understand what typ and pure=True mean in the ruamel.yaml Python library.
I've read the documentation here.
So far, I've understood that typ='safe' uses a safe loader which omits parsing of YAML tags (they can lead to arbitrary code execution).
I haven't found any explanation of the round-trip parser typ='rt' in the docs.
Also, I think the explanation of pure=True is confusing:
Provide pure=True to enforce using the pure Python implementation (faster C libraries will be used when possible/available)
Are the faster C libraries used with pure=True or not? If they are, why do you need to specify this flag in the first place?
There are four standard typ parameters:
rt (for round-trip): the document is loaded into special types that preserve comments, etc., which are used for dumping. This is what ruamel.yaml was created for, and it is the default (i.e., what you get if you don't specify typ). It is a subclass of the safe loader/dumper.
safe: this only loads/dumps tagged objects when these are explicitly registered with the loader/dumper.
unsafe: tries to load/dump everything. Classes are automatically resolved to a tag of the form !!python/object:<module>.<class>.
base: the loader/dumper from which everything is derived. All scalars are loaded as strings (even types like integer, float, and Boolean that are handled specially, as described in the YAML specification and its type descriptions).
For safe, unsafe, and base, the faster C loader is available. If you install from the .tar.gz file, these only get compiled during installation if an appropriate compiler is available. If they are not available, because they could not be compiled, then they cannot be used.
There is no C version of the rt code, so for round-tripping it is not possible to use the C libraries.
The word pure is for when you use Python-only modules. The opposite would be "tainted": Python tainted with C extension modules. There is no tainted=True parameter; it is the implicit behavior (when possible/available, see the previous paragraph) when pure=True is not specified, as the default for pure is False.
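A small usage sketch of these parameters (standard ruamel.yaml API):
>>> import sys
>>> from ruamel.yaml import YAML
>>> yaml = YAML()                            # typ='rt': round-trip, always pure Python
>>> yaml.dump({'a': 1}, sys.stdout)
a: 1
>>> yaml_fast = YAML(typ='safe')             # uses the C loader/dumper when available
>>> yaml_pure = YAML(typ='safe', pure=True)  # forces the pure Python code path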
To confuse you further: the above are the four basic (built-in) values for typ. If you use plug-ins, you can e.g. do
yaml = YAML(typ='jinja2')
as shown in this answer
Some of the above information is available from the YAML() docstring; little of it, however, made it into the package documentation, primarily as a result of laziness of ruamel.yaml's author.
Before coming to a concrete example, let me mention the problem. As a beginning Python programmer with extensive experience in C++, I'm always missing variable declarations. I could yield to the temptation of documenting the type of every nontrivial identifier, but I have a feeling that that would not be terribly Pythonic. For one thing, it would be silly that neither the interpreter nor any tool parses these informal declarations. And if the interpreter did, that would be an entirely different language.
As an alternative to writing mere comments, I am contemplating switching to a mode of creating datatypes whose only purpose is to enforce types/interfaces. They would streamline the code and would let me detect type errors at earlier stages. For this convenience I would be paying a small loss in efficiency from the indirection.
For example, to avoid writing as a comment "Dictionary of Employee objects indexed by employeeID", I would write a wrapper class called "EmployeeDict", whose interface would limit the operations that can/cannot be performed.
Would such an idea fly in the long term? Does it defeat the spirit of Python in some way? Is it used by experienced Pythonistas?
For those conversant in C++, I would in other words be translating
typedef std::map<EmployeeId, Employee> MyMap;
into a type. (Though I am not actually porting any code across.)
Update
Even if it's unpythonic, as HumphreyTriscuit confirms, I am loath to write comments that get read by humans without also automating the type checking a little. It's nice that this issue is resolved in 3.5, but I'm stuck for the time being with 2.7, so I'll mark jsbueno's answer correct until someone can suggest a way, à la "assert isinstance(param, dict)", but one that also concisely confirms the type of the key/value, somewhat paralleling C++, to solve this problem in 2.7.
Actually, as of Python 3.5, the language comes bundled with tools for parameter type annotations that are introspectable by third-party tools, of which there might be some out there already.
Anyway, take a look at https://www.python.org/dev/peps/pep-0484/
Even if you don't use any other tools, the way described in PEP 484 above is the "Pythonic way" of declaring types, and it won't conflict with other third-party tools. So, if you want to write a tool chain of your own as you describe, you should start by using function annotations as described in that PEP.
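For instance, a minimal sketch of the question's EmployeeDict in PEP 484 terms (the names are placeholders taken from the question, and EmployeeId is assumed to be an int):
from typing import Dict

class Employee:
    def __init__(self, name: str) -> None:
        self.name = name

# the analogue of: typedef std::map<EmployeeId, Employee> MyMap;
EmployeeDict = Dict[int, Employee]

def head_count(employees: EmployeeDict) -> int:
    return len(employees)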
That is good for documenting (and enforcing, if need be) parameters and return values. For class attributes, you can check this answer of mine, based on crafting a special __setattr__ method on a base class of your hierarchy:
Force python class member variable to be specific type
As for local variables, there is no way to enforce/check their type but code comments.
And a last piece of advice to keep you "on the Python way": remember to be permissive and check for interfaces rather than specific classes.
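As a sketch for the 2.7 case raised in the question's update (check_mapping is a made-up helper, not a standard API), you can concisely check a dict's key and value types like this:
def check_mapping(d, key_type, value_type):
    # Check the container type and every key/value type; Python 2.7 compatible.
    assert isinstance(d, dict)
    assert all(isinstance(k, key_type) and isinstance(v, value_type)
               for k, v in d.items())

# e.g.: check_mapping(employees, int, Employee)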
For a while I've been using a package called "gnosis-utils", which provides an XML pickling service for Python. This class works reasonably well; however, it seems to have been neglected by its developer for the last four years.
At the time we originally selected Gnosis, it was the only XML serialization tool for Python. The advantage of Gnosis was that it provided a set of classes whose function was very similar to the built-in Python XML pickler. It produced XML which Python developers found easy to read, but non-Python developers found confusing.
Now that the project has grown, we have a new requirement: we need to be able to exchange XML with our colleagues who prefer Java or .Net. These non-Python developers will not be using Python; they intend to produce XML directly, hence we have a need to simplify the format of the XML.
So, are there any alternatives to Gnosis? Our requirements:
Must work on Python 2.4 / Windows x86 32bit
Output must be XML, as simple as possible
API must resemble Pickle as closely as possible
Performance is not hugely important
Of course we could simply adapt Gnosis, but we'd prefer to use a component which already provides the functions we require (assuming that it exists).
So what you're looking for is a python library that spits out arbitrary XML for your objects? You don't need to control the format, so you can't be bothered to actually write something that iterates over the relevant properties of your data and generates the XML using one of the existing tools?
This seems like a bad idea. Arbitrary XML serialization doesn't sound like a good way to move forward. Any format that includes all of pickle's features is going to be ugly, verbose, and very nasty to use. It will not be simple. It will not translate well into Java.
What does your data look like?
If you tell us precisely what aspects of pickle you need (and why lxml.objectify doesn't fulfill those), we will be better able to help you.
Have you considered using JSON for your serialization? It's easy to parse, natively supports python-like data structures, and has wide-reaching support. As an added bonus, it doesn't open your code to all kinds of evil exploits the way the native pickle module does.
Honestly, you need to bite the bullet and define a format, and build a serializer using the standard XML tools, if you absolutely must use XML. Consider JSON.
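To make the JSON suggestion concrete, a quick sketch with the standard library (json ships with Python 2.6+; under the question's 2.4 constraint, the external simplejson package offers the same API):
>>> import json
>>> record = {'id': 42, 'name': 'Ada', 'roles': ['admin', 'ops']}
>>> text = json.dumps(record)
>>> json.loads(text) == record
True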
There is xml_marshaller which provides a simple way of dumping arbitrary Python objects to XML:
>>> from xml_marshaller import xml_marshaller
>>> class Foo(object): pass
>>> foo = Foo()
>>> foo.bar = 'baz'
>>> dump_str = xml_marshaller.dumps(foo)
Pretty printing the above with lxml (which is a dependency of xml_marshaller anyway):
>>> from lxml.etree import fromstring, tostring
>>> print tostring(fromstring(dump_str), pretty_print=True)
You get output like this:
<marshal>
<object id="i2" module="__main__" class="Foo">
<tuple/>
<dictionary id="i3">
<string>bar</string>
<string>baz</string>
</dictionary>
</object>
</marshal>
I did not check for Python 2.4 compatibility, since this question was asked long ago, but a solution for dumping arbitrary Python objects to XML remains relevant.
I'm a C programmer and I'm getting quite good with Python. But I still have some problems getting my mind around the OO awesomeness of Python.
Here is my current design problem:
The end "product" is a JSON data structure created in Python (and passed to Javascript code) containing different types of data like:
{ type: url, {urlpayloaddict} }
{ type: text, {textpayloaddict} }
...
My Javascript knows how to parse and display each type of JSON response.
I'm happy with this design. My question comes from handling this data in the Python code.
I obtain my data from a variety of sources: MySQL, a table lookup, an API call to a web service...
Basically, should I make a superclass responseElement and specialize it for each type of response, then pass around a list of these objects in the Python code, OR should I simply pass around a list of dictionaries that contain the response data in key-value pairs? The answer seems to result in significantly different implementations.
I'm a bit unsure whether I'm getting too object-happy?
In my mind, it basically goes like this: you should try to keep things the same where they are the same, and separate them where they're different.
If you're performing the exact same operations on and with the data, and it can all be represented in a common format, then there's no reason to have separate objects for it: translate it into a common format ASAP and Don't Repeat Yourself when implementing the parts that don't differ.
If each type/source of data requires specialized operations specific to it, and there isn't much in the way of overlap between such at the layer your Python code is dealing with, then keep things in separate objects so that you maintain a tight association between the specialized code and the specific data on which it is able to operate.
Do the different response sources represent fundamentally different categories or classes of objects? They don't appear to, the way you've described it.
Thus, various encode/decode functions and passing around only one type seems the best solution for you.
That type can be a dict or your own class, if you have special methods to use on the data (but those methods would then not care what input and output encodings were), or you could put the encode/decode pairs into the class. (Decode would be a classmethod, returning a new instance.)
Your receiver objects (which can perfectly well be instances of different classes, perhaps generated by a Factory pattern depending on the source of incoming data) should all have a common method that returns the appropriate dict (or other directly-JSON'able structure, such as a list that will turn into a JSON array).
Differently from what one answer states, this approach clearly doesn't require higher level code to know what exact kind of receiver it's dealing with (polymorphism will handle that for you in any OO language!) -- nor does the higher level code need to know "names of keys" (as, again, that other answer peculiarly states), as it can perfectly well treat the "JSON'able data" as a pretty opaque data token (as long as it's suitable to be the argument for a json.dumps later call!).
Building up and passing around a container of "plain old data" objects (produced and added to the container in various ways) for eventual serialization (or other such uniform treatment, but you can see JSON translation as a specific form of serialization) is a common OO pattern. No need to carry around anything richer or heavier than such POD data, after all, and in Python using dicts as the PODs is often a perfectly natural implementation choice.
I've had success with the OOP approach. Consider a base class with a "ToJson" method and have each subclass implement it appropriately. Then your higher level code doesn't need to know any detail about how the data was obtained...it just knows it has to call "ToJson" on every object in the list you mentioned.
A dictionary would work too, but it requires your calling code to know names of keys, etc and won't scale as well.
OOP I say!
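A minimal sketch of that OOP approach (the class and method names here are illustrative, not from the question):
import json

class ResponseElement(object):
    # Base class: every subclass knows how to render itself as a plain dict.
    def to_jsonable(self):
        raise NotImplementedError

class UrlResponse(ResponseElement):
    def __init__(self, url):
        self.url = url
    def to_jsonable(self):
        return {'type': 'url', 'data': {'url': self.url}}

elements = [UrlResponse('http://example.com')]
print(json.dumps([e.to_jsonable() for e in elements]))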
Personally, I opt for the latter (passing around a list of data) wherever and whenever possible. I think OO is often misused/abused for certain things. I specifically avoid things like wrapping data in an object just for the sake of wrapping it in an object. So this, {'type':'url', 'data':{some_other_dict}} is better to me than:
class DataObject:
    def __init__(self):
        self.type = 'url'
        self.data = {some_other_dict}
But, if you need to add specific functionality to this data, like the ability for it to sort its data.keys() and return them as a set, then creating an object makes more sense.
Are there any security exploits that could occur in this scenario:
eval(repr(unsanitized_user_input), {"__builtins__": None}, {"True":True, "False":False})
where unsanitized_user_input is a str object. The string is user-generated and could be nasty. Assuming our web framework hasn't failed us, it's a real honest-to-god str instance from the Python builtins.
If this is dangerous, can we do anything to the input to make it safe?
We definitely don't want to execute anything contained in the string.
See also:
Funny blog post about eval safety
Previous Question
Blog: Fast deserialization in Python
The larger context which is (I believe) not essential to the question is that we have thousands of these:
repr([unsanitized_user_input_1,
      unsanitized_user_input_2,
      unsanitized_user_input_3,
      unsanitized_user_input_4,
      ...])
in some cases nested:
repr([[unsanitized_user_input_1,
       unsanitized_user_input_2],
      [unsanitized_user_input_3,
       unsanitized_user_input_4],
      ...])
which are themselves converted to strings with repr(), put in persistent storage, and eventually read back into memory with eval.
Eval deserialized the strings from persistent storage much faster than pickle and simplejson. The interpreter is Python 2.5 so json and ast aren't available. No C modules are allowed and cPickle is not allowed.
It is indeed dangerous, and the safest alternative is ast.literal_eval (see the ast module in the standard library). You can of course build and alter an AST to provide, e.g., evaluation of variables and the like before you eval the resulting AST (once it's down to literals).
The possible exploit of eval starts with any object it can get its hands on (say True here), going via .__class__ to its type object, etc., up to object, then getting its subclasses... basically it can get to ANY object type and wreak havoc. I could be more specific, but I'd rather not do it in a public forum (the exploit is well known, but considering how many people still ignore it, revealing it to wannabe script kiddies could make things worse... just avoid eval on unsanitized user input and live happily ever after!-).
If you can prove beyond doubt that unsanitized_user_input is a str instance from the Python built-ins with nothing tampered with, then this is always safe. In fact, it'll be safe even without all those extra arguments, since eval(repr(astr)) == astr for all such string objects. You put in a string, you get back out a string. All you did was escape and unescape it.
This all leads me to think that eval(repr(x)) isn't what you want--no code will ever be executed unless someone gives you an unsanitized_user_input object that looks like a string but isn't, but that's a different question--unless you're trying to copy a string instance in the slowest way possible :D.
With everything as you describe, it is technically safe to eval repr'd strings; however, I'd avoid doing it anyway, as it's asking for trouble:
There could be some weird corner case where your assumption that only repr'd strings are stored breaks down (e.g. a bug or a different pathway into the storage that doesn't repr), which instantly becomes a code injection exploit where it might otherwise be unexploitable.
Even if everything is OK now, assumptions might change at some point, and unsanitised data may get stored in that field by someone unaware of the eval code.
Your code may get reused (or worse, copy+pasted) into a situation you didn't consider.
As Alex Martelli pointed out, in Python 2.6 and higher there is ast.literal_eval, which will safely handle both strings and other simple datatypes like tuples. This is probably the safest and most complete solution.
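A quick sketch of the literal_eval route for contrast with eval (the traceback is abbreviated):
>>> import ast
>>> ast.literal_eval("['a', 'b', ['c', 'd']]")
['a', 'b', ['c', 'd']]
>>> ast.literal_eval("__import__('os')")
Traceback (most recent call last):
  ...
ValueError: malformed node or string: ...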
Another possibility, however, is to use the string-escape codec. This is much faster than eval (about 10 times, according to timeit), is available in earlier versions than literal_eval, and should do what you want:
>>> s = 'he\nllo\' wo"rld\0\x03\r\n\tabc'
>>> repr(s)[1:-1].decode('string-escape') == s
True
(The [1:-1] is to strip the outer quotes repr adds.)
Generally, you should never allow anyone to post code.
So called "paid professional programmers" have a hard-enough time writing code that actually works.
Accepting code from the anonymous public -- without benefit of formal QA -- is the worst of all possible scenarios.
Professional programmers -- without good, solid formal QA -- will make a hash of almost any web site. Indeed, I'm reverse engineering some unbelievably bad code from paid professionals.
The idea of allowing a non-professional -- unencumbered by QA -- to post code is truly terrifying.
repr([unsanitized_user_input_1,
      unsanitized_user_input_2,
      ...])

... unsanitized_user_input is a str object
You shouldn't have to serialise strings to store them in a database.
If these are all strings, as you mentioned, why can't you just store the strings in a db.StringListProperty?
The nested entries might be a bit more complicated, but why is this the case? When you have to resort to eval to get data from the database, you're probably doing something wrong.
Couldn't you store each unsanitized_user_input_x as its own db.StringProperty row, and group them by a reference field?
Either of those may not be applicable, since I've no idea what you're trying to achieve, but my point is: can you not structure the data in a way where you don't have to rely on eval (and don't have to rely on it not being a security issue)?