I am trying to understand what typ and pure=True mean in the ruamel.yaml Python library.
I've read the documentation here.
So far, I've understood that typ='safe' uses a safe loader which omits parsing of YAML tags (tags can lead to arbitrary code execution).
I haven't found any explanation about round-trip parser typ='rt' in the docs.
Also, I think the explanation of pure=True is confusing:
Provide pure=True to enforce using the pure Python implementation (faster C libraries will be used when possible/available)
Are the faster C libraries used with pure=True or not? If they are, why do you need to specify this flag in the first place?
There are four standard typ parameters:
rt: (for round-trip) in this case the document is loaded in special types that preserve comments, etc., used for dumping. This is what ruamel.yaml was created for and this is the default (i.e. what you get if you don't specify typ). This is a subclass of the safe loader/dumper.
safe: this only loads/dumps tagged objects, when these are explicitly registered with the loader/dumper
unsafe: try to load/dump everything. Classes are automatically resolved to a tag of the form !!python/object:<module>/<class>
base: the loader/dumper from which everything is derived. All scalars are loaded as strings (even types like integer, float, and Boolean that are handled specially as mentioned in the YAML specification or the type description).
For safe, unsafe, base there is the faster C Loader available. If you install from the .tar.gz file these will only get compiled during installation when the appropriate compiler is available. If they are not available, because they could not be compiled, then they cannot be used.
There is no C version of the rt code. So it is not possible to use C libraries.
The word pure refers to using Python-only modules. The opposite would be "tainted": Python tainted with C extension modules. There is no tainted=True parameter; using the C extensions is implicit (when possible/available, see the previous paragraph) when pure=True is not specified, as the default for pure is False.
To confuse you further: the above are the four basic (built-in) values for typ. If you use plug-ins you can e.g. do
yaml = YAML(typ='jinja2')
as shown in this answer
Some of the above information is available from the YAML() docstring; little of that, however, made it into the package documentation, primarily as a result of laziness of ruamel.yaml's author.
Related
TOML said "TOML and YAML both emphasize human readability features, like comments that make it easier to understand the purpose of a given line. TOML differs in combining these, allowing comments (unlike JSON) but preserving simplicity (unlike YAML)."
I can see TOML doesn't rely on significant whitespace, but other than that I am not sure about the simplicity it claims. What is that, exactly?
Then I see StrictYAML: "StrictYAML is a type-safe YAML parser that parses and validates a restricted subset of the YAML specification." Type-safe — what is that, exactly (again)?
What is the problem TOML didn't fix for YAML that StrictYAML thinks it does? I did read through the articles on the StrictYAML site, but I am still not clear.
So both TOML and StrictYAML want to fix the "problem" YAML has. But apart from the indentation, what is the problem?
---- update ----
I found here on Reddit that the author of StrictYAML talked about YAML vs TOML. But the answer I got so far said "strictyaml displays a rather poor understanding of YAML", while https://github.com/crdoconnor/strictyaml has 957 stars as of 2021/12/28. So I am a bit lost as to which one I should use; I'll stick with YAML because most of my YAML is simple.
YAML downsides:
Implicit typing causes surprise type changes. (e.g. put 3 where you previously had a string and it will magically turn into an int).
A bunch of nasty "hidden features" like node anchors and references that make it look unclear (although to be fair a lot of people don't use this).
TOML downsides:
Noisier syntax (especially with multiline strings).
The way arrays/tables are done is confusing, especially arrays of tables.
I wrote a library that removed most of the nasty stuff I didn't like about YAML, leaving the core which I liked. It's got a pretty detailed comparison between it and a bunch of other config formats, e.g.:
https://hitchdev.com/strictyaml/why-not/toml/
This may be an opinionated answer as I have written multiple YAML implementations.
Common Criticism of YAML addressed by the alternatives
YAML's outstanding semantic feature is that it can represent a possibly cyclic graph. Moreover, YAML mappings can use complex nodes (sequences or mappings) as keys. These features are what you potentially need when you want to represent an arbitrary data structure.
Another exotic YAML feature is tags. Their goal is to abstract over different types in different programming languages, e.g., a !!map would be a dict in Python but an object in JavaScript. While seldom used explicitly, implicit tag resolution is why false is usually loaded as a boolean value while droggeljug is loaded as a string. The apparent goal here was to reduce noise by not requiring to write boolean values like !!bool false or forcing quotes on every string value.
However, reality has shown that many people are confused by this, and the fact that YAML 1.1 defined that yes may be parsed as a boolean has not helped either. YAML 1.2 tried to remedy this a bit by describing different schemas you can use: the basic „failsafe“ schema exclusively loads to mappings, sequences, and strings, while the more complex „JSON“ and „core“ schemas do additional type guessing. However, most YAML implementations, prominently PyYAML, remained on YAML 1.1 for a long time (many implementations were originally rewritten PyYAML code, e.g., libyaml, SnakeYAML). This cemented the view that YAML makes questionable typing decisions that need fixing.
Nowadays, some implementations improved, and you can use the failsafe schema to avoid unwanted boolean values. In this regard, StrictYAML restricts itself to the failsafe schema; don't believe its argument that this is some novelty PyYAML can't do.
A common security issue with YAML implementations is that they mapped tags to arbitrary constructor calls (you can read up about an exploit in Ruby on Rails based on this here). Mind that this is not a YAML shortcoming; YAML doesn't suggest to call unknown functions during object construction anywhere. The base issue here is that data serialization is the enemy of data encapsulation; if your programming language offers constructors as the sole method for constructing an object, that's what you need to do when deserializing data. The remedy here is only to call known constructors, which was implemented broadly after a series of such exploits (another one with SnakeYAML iirc) surfaced. Nowadays, to call unknown constructors, you need to use a class aptly named DangerLoader in PyYAML.
TOML
TOML's main semantic difference is that it doesn't support cycles, complex keys, or tags. This means that while you can load YAML in an arbitrary user-defined class, you always load TOML into tables or arrays containing your data.
For example, while YAML allows you to load {foo: 1, bar: 2} into an object of a class with foo and bar integer fields, TOML will always load this into a table. A prominent example of YAML's capabilities you usually find in documentation is that it can load the scalar 1d6 into an object {number: 1, sides: 6}; TOML will always load it as string "1d6".
TOML's perceived simplicity here is that it doesn't do some stuff that YAML does. For example, if you're using a statically typed language like Java, after loading {foo: 1, bar: 2} into an object myObject, you can access myObject.foo safely (getting the integer 1). If you used TOML, you would need to do myObject["foo"], which could raise an exception if the key doesn't exist. This is less true in scripting languages like Python: Here, myObject.foo compiles and fails with a runtime error if foo does not happen to be a property of myObject.
My perspective from answering a lot of YAML questions here is that people don't use YAML's features and often load everything into a structure like Map<String, Object> (using Java as an example) and take it from there. If you do this, you could as well use TOML.
A different kind of simplicity TOML offers is its syntax: since it is vastly simpler than YAML's, it is easier to emit errors users can understand. For example, a common error text for YAML syntax errors is „mapping values are not allowed in this context“ (try searching this on SO to find tons of questions). You get it, for example, here:
foo: 1
  bar: 2
The error message does not help the user fix the error. This is because of YAML's complex syntax: YAML thinks 1 and bar are part of a multi-line scalar (because bar: is indented more than foo:), puts them together, then sees a second : and fails because multi-line scalars may not be used as implicit keys. However, most likely the user either mis-indented bar: or was under the impression that they can give foo both a scalar value (1) and some children. Because of all the possibilities in YAML syntax, it would be tough to write error messages that actually help the user.
Since TOML's syntax is much simpler, the error messages are easier to understand. This is a big plus if the user writing TOML is not expected to be someone with a background in parsing grammars.
TOML has a conceptual advantage over YAML: Since its structure allows less freedom, it tends to be easier to read. When reading TOML, you always know, „okay, I'm gonna have nested tables with values in them“ while with YAML, you have some arbitrary structure. I believe this requires more cognitive load when reading a YAML file.
StrictYAML
StrictYAML argues that it provides type-safety, but since YAML isn't a programming language and specifically doesn't support assignments, this claim doesn't make any sense based on the Wikipedia definition StrictYAML links to (type safety comes and goes with the programming language you use; e.g., any YAML is type-safe after loading it into a proper Java class instance, but you will never be type-safe in a language like Python). Going over its list of removed features, it displays a rather poor understanding of YAML:
Implicit Typing: Can be deactivated in YAML implementations using the failsafe schema, as discussed above.
Direct representations of objects: It simply links to the Ruby on Rails incident, implying that this cannot be avoided, even though most implementations are safe today without removing the feature.
Duplicate Keys Disallowed: The YAML specification already requires this.
Node anchors and refs: StrictYAML argues that using this for deduplication is unreadable to non-programmers, ignoring that the intention was to be able to serialize cyclic structures, which is not possible without anchors and aliases.
On the deserialization side,
All data is a string, list or OrderedDict
It is basically the same structure TOML supports (I believe StrictYAML cannot support complex keys in mappings, as neither list nor OrderedDict is hashable in Python).
You are also losing the ability to deserialize to predefined class structures. One could argue that the inability to construct a class object with well-defined fields makes StrictYAML less type-safe than standard YAML: A standard YAML implementation can guarantee that the returned object has a certain structure described by types, while StrictYAML gives you on every level either a string, a list or an OrderedDict and you can't do anything to restrict it.
While quite a few of its arguments are flawed, the resulting language is still usable. For example, with StrictYAML you do not need to care about the billion laughs attack that haunts some YAML implementations. Again, this is not a YAML problem but an implementation problem, as YAML does not require an implementation to duplicate a node that is anchored and referred to from multiple places.
Bottom Line
Quite a few YAML issues stem from poor implementations, not from issues in the language itself. However, YAML as a language certainly is complex, and syntactic errors can be hard to understand, which could be a valid reason to use a language like TOML. As for StrictYAML, it does offer some benefit, but I suggest against using it because it does not have a proper specification and has only a single implementation, a setup that is very prone to becoming a maintenance nightmare (the project could be discontinued, and breaking changes are easily possible).
StrictYAML Type Safety -- The "Norway Problem"
Here's an example given by the StrictYAML people:
countries:
- FR
- DE
- NO
Implicitly translates "NO" to a False value.
>>> from yaml import load
>>> load(the_configuration)
{'countries': ['FR', 'DE', False]}
https://hitchdev.com/strictyaml/why/implicit-typing-removed
In Python, what does "i" represent in .pyi extension?
In PEP-484, it mentions .pyi is "a stub file" but no mnemonic help on the extension. So does the "i" mean "Include"? "Implementation"? "Interface"?
I think the i in .pyi stands for "Interface"
Definition for Interface in Java:
An interface in the Java programming language is an abstract type that
is used to specify a behaviour that classes must implement
From Python typeshed github repository:
Each Python module is represented by a .pyi "stub". This is a normal
Python file (i.e., it can be interpreted by Python 3), except all the
methods are empty.
In 'Mypy' repository, they explicitly mention "stub" files as public interfaces:
A stubs file only contains a description of the public interface of
the module without any implementations.
Because "interfaces" do not exist in Python (see this SO question on abstract classes vs. interfaces), I think the designers intended to dedicate a special extension to them.
A .pyi file implements a "stub" (definition from Martin Fowler):
Stubs: provide canned answers to calls made during the test, usually
not responding at all to anything outside what's programmed in for the
test.
But people are more familiar with interfaces than with "stub" files, so it was easier to choose .pyi rather than .pys to avoid unnecessary confusion.
Apparently PyCharm creates .pyi file for its own purposes:
The *.pyi files are used by PyCharm and other development tools to provide
more information, such as PEP 484 type hints, than it is able to glean from
introspection of extension types and methods. They are not intended to be
imported, executed or used for any other purpose other than providing info
to the tools. If you don't use a tool that makes use of .pyi files then
you can safely ignore this file.
See: https://www.python.org/dev/peps/pep-0484/
https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html
This comment was found in: python27/Lib/site-packages/wx/core.pyi
The i in .pyi stands for ‘interface’.
The .pyi extension was first mentioned in this GitHub issue thread where JukkaL says:
I'd probably prefer an extension with just a single dot. It also needs to be something that is not in use (it should not be used by cython, etc.). .pys seems to be used in Windows (or was). Maybe .pyi, where i stands for an interface definition?
Another way to explain the contents of a module that Wing can't figure out is with a pyi Python Interface file. This file is merely a Python skeleton with the proper structure, call signature, and return values to correspond to the functions, attributes, classes, and methods specified in a module.
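As an illustration, a hypothetical greeting.pyi stub contains only signatures and attribute types, never implementations (all names below are invented):

```python
# greeting.pyi (hypothetical) -- interface information only, bodies are "..."
class Greeter:
    default_name: str
    def __init__(self, default_name: str = ...) -> None: ...
    def greet(self, name: str) -> str: ...

def make_greeter() -> Greeter: ...
```

A type checker such as mypy reads this file instead of (or alongside) the real module; the stub is never executed by your program.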
The norm for Python standard library classes seems to be that class names are lowercase - this appears to hold true for built-ins such as str and int as well as for most classes that are part of standard library modules that must be imported such as datetime.date or datetime.datetime.
But, certain standard library classes such as enum.Enum and decimal.Decimal are capitalized. At first glance, it might seem that classes are capitalized when their name is equal to the module name, but that does not hold true in all cases (such as datetime.datetime).
What's the rationale/logic behind the capitalization conventions for class names in the Python Standard Library?
The Key Resources section of the Developers Guide lists PEP 8 as the style guide.
From PEP 8 Naming Conventions, emphasis mine.
The naming conventions of Python's library are a bit of a mess, so
we'll never get this completely consistent -- nevertheless, here are
the currently recommended naming standards. New modules and packages
(including third party frameworks) should be written to these
standards, but where an existing library has a different style,
internal consistency is preferred.
Also from PEP 8
A style guide is about consistency. Consistency with this style guide
is important. Consistency within a project is more important.
Consistency within one module or function is the most important.
...
Some other good reasons to ignore a particular guideline:
To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean
up someone else's mess (in true XP style).
Because the code in question predates the introduction of the guideline and there is no other reason to be modifying that code.
You probably will never know why Standard Library naming conventions conflict with PEP 8 but it is probably a good idea to follow it for new stuff or even in your own projects.
PEP 8 is considered the standard style guide by many Python devs. It recommends naming classes using CamelCase/CapWords.
The naming convention for functions may be used instead in cases where the interface is documented and used primarily as a callable.
Note that there is a separate convention for builtin names: most builtin names are single words (or two words run together), with the CapWords convention used only for exception names and builtin constants.
Check this link for the PEP 8 naming conventions and standards.
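For illustration, the PEP 8 conventions for new code in one (hypothetical) snippet:

```python
# constants: UPPER_CASE_WITH_UNDERSCORES
MAX_RETRIES = 3

# classes: CapWords
class ConnectionPool:
    # methods and variables: lowercase_with_underscores
    def acquire_connection(self):
        return None
```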
datetime is part of the standard library:
Python’s standard library is very extensive, offering a wide range of facilities as indicated by the long table of contents listed below. The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming.
In some cases, like sklearn, nltk, django, the package names are all lowercase. This link will take you there.
Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
When an extension module written in C or C++ has an accompanying Python module that provides a higher level (e.g. more object oriented) interface, the C/C++ module has a leading underscore (e.g. _socket ).
I hope this covers all the questions.
I rather like Python's syntactic sugar and standard library functions.
However, the one feature I dislike is implicit typing.
Is there a distribution of Python with explicit typing which is still compatible with e.g. packages on PyPI?
[I was looking into RPython]
In Python 3, the ability to use type annotations was introduced into the Python standard with PEP 3107.
Fast-forward to Python 3.5 and PEP 484 builds on this to introduce type hinting, along with the typing module which enables one to specify the type of a variable or the return type of a function.
from typing import Iterator

def fib(n: int) -> Iterator[int]:
    a, b = 0, 1
    while a < n:
        yield a
        a, b = b, a + b
Above example taken from https://pawelmhm.github.io
According to the PEP 484 notes:
While these annotations are available at runtime through the usual
__annotations__ attribute, no type checking happens at runtime. Instead, the proposal assumes the existence of a separate off-line
type checker which users can run over their source code voluntarily.
Essentially, such a type checker acts as a very powerful linter.
(While it would of course be possible for individual users to employ a
similar checker at run time for Design By Contract enforcement or JIT
optimization, those tools are not yet as mature.)
tl;dr
Although Python provides this form of "static typing", it is not enforced at runtime: the Python interpreter simply ignores any type specifications you have provided and will still use duck typing to resolve types. Therefore, it is up to you to run a type checker that will detect any issues with the types.
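A tiny demonstration that annotations are recorded but never enforced at runtime:

```python
def double(x: int) -> int:
    return x * 2

# the annotation is stored on the function object...
print(double.__annotations__)   # {'x': <class 'int'>, 'return': <class 'int'>}
# ...but never checked: passing a str "works" by duck typing
print(double("ab"))             # abab
```

An off-line checker such as mypy would flag the second call as an error; the interpreter itself never complains.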
Furthermore
The motivation for including typing in the python standard was mostly influenced by mypy, so it might be worth checking them out. They also provide examples which may prove useful.
The short answer is no. What you are asking for is deeply built into Python, and it can't be changed without changing the language so drastically that it wouldn't be Python.
I'm assuming you don't like variables that are re-typed when re-assigned to? You might consider other ways to check for this if it is a problem in your code.
No, you cannot have your cake and eat it too.
Python is great because it is dynamically typed! Period. (That's why it has such a nice standard library, too.)
There are only two advantages of statically typed languages: 1) speed, when the algorithms are right to begin with, and 2) compile-time errors.
As for 1)
Use PyPy,
Profile,
Use ctypes libs for great performance.
It's typical to have 10% or less of your code be performance critical. All that other 90%? Enjoy the advantages of dynamic typing.
As for 2)
Use classes (and contracts),
Use unit testing,
Use refactoring,
Use a good code editor.
It's typical to have data NOT FITTING into standard data types, which are too strict or too loose in what they allow to be stored in them. Make sure that you validate your data on your own.
Unit testing is a must-have for algorithm testing, which no compiler can do for you, and it should catch any problems arising from wrong data types (and unlike a compiler, tests are as fine-grained as you need them to be).
Refactoring solves those issues when you are not sure whether given changes will break your code (and again, strongly typed data cannot guarantee that either).
And a good code editor can solve so many problems... Use Sublime Text for a while, and then you will know what I mean.
(To be sure, I am not giving you the answer you want. Rather, I am questioning your needs, especially those you did not include in your question.)
Now in 2021, there's a library called Deal that not only provides a robust static type checker, but also allows you to specify pre- and post-conditions, loop invariants, explicitly state expectations regarding exceptions and IO/side-effects, and even formally prove correctness of code (for an albeit small subset of Python).
Here's an example from their GitHub:
import deal
from typing import List

# the result is always non-negative
@deal.post(lambda result: result >= 0)
# the function has no side-effects
@deal.pure
def count(items: List[str], item: str) -> int:
    return items.count(item)

# generate test function
test_count = deal.cases(count)
Now we can:
Run python3 -m deal lint or flake8 to statically check errors.
Run python3 -m deal test or pytest to generate and run tests.
Just use the function in the project and check for errors at runtime.
Since comments are limited...
Like most interpreted languages, Python is dynamically typed (note that it is still strongly typed: types are not silently coerced). Static typing is not so much a good thing in itself as a control in place for the programmer to preempt potential type bugs, but in truth it won't stop logical bugs from happening any less, and thus makes the point moot.
Even though the paper on RPython makes its point, it is focused on object-oriented programming. You must bear in mind that Python is more an amalgamation of OOP and functional programming, and likely other paradigms too.
I encourage reading of this page, it is very informative.
In PHP I was used to PHPdoc syntax:
/** Do something useful
 * @param first Primary data
 * @return int
 * @throws BadException
 */
function($first){ ...
— a kind of short, useful reference: very handy when all you need is just to recall 'what's that?', especially for 3rd-party libraries. Also, all IDEs can display this in popup hints.
It seems like there are no such conventions in Python: just plain text. It describes things well, but it's too long to be a digest.
Ok, let it be. But in my applications I don't want to use piles of plaintext.
Are there any well-known conventions to follow? And how to document class attributes?! PyCharm IDE recipes are especially welcome :)
In Python 3 there's PEP 3107 for function annotations. That's not useful for 2.x (2.6, specifically).
Also there's PEP 287 for reStructuredText: fancy but still not structured.
I use epydoc. It supports comments in reStructured Text, and it generates HTML documentation from those comments (akin to javadoc).
The numpydoc standard is well-defined, based around reStructuredText (which is standard within the Python ecosystem), and has Sphinx integration. It should be relatively straightforward to write a plugin for PyCharm which can digest numpydoc.
Sphinx also has references on how to document attributes: http://sphinx.pocoo.org/ext/autodoc.html?highlight=autoattribute
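A sketch of Sphinx-friendly reStructuredText docstrings, including the ``#:`` marker autodoc recognizes for attributes (the class and names are invented for illustration):

```python
class Point:
    """A 2-D point.

    :ivar x: horizontal coordinate
    :ivar y: vertical coordinate
    """

    def __init__(self, x, y):
        #: Sphinx autodoc also picks up ``#:`` comments as attribute docs
        self.x = x
        self.y = y

    def distance_to(self, other):
        """Return the Euclidean distance to *other*.

        :param other: the point to measure against
        :type other: Point
        :returns: the distance
        :rtype: float
        """
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** 0.5
```

PyCharm renders these :param:/:returns: fields in its quick-documentation popup, much like PHPDoc hints.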