Preserving anchors and aliases in YAML with Python - python

I'm editing a large YAML document in Python with extensive anchors and aliases. I'd like to be able to determine how the anchor is derived based on data from the node it references.
For instance the node has a 'name' field and I'd like the anchor to be the value of that field rather than a random id number.
Is this possible with PyYAML or ruamel.yaml?

There are a few things to keep in mind:
YAML has no fields. I assume that that is your interpretation of keys in a mapping, so that you want an anchor associated with a mapping to be the same as the value for the key 'name'
During load time the event created when encountering an anchor doesn't know about whether it is an anchor on a scalar, sequence or mapping. Let alone that it could access the value for 'name'.
Changing the anchor during load is tricky, as you have to keep track of aliases referring to the original anchor (and map them to its new value)
In PyYAML the anchor name gets created during dump-ing, so you would have to hook into that when using PyYAML. You can do the same with ruamel.yaml
Only ruamel.yaml has the capability to preserve an anchor on round-trip. I.e. if you can have the anchor to be persistent, even if the value for the key 'name' changes (assuming you test e.g. on the default generated form idNNNN)
When you use ruamel.yaml you can recursively walk the data-structure, keeping track of nodes already visited (in case a child contains an ancestor) and when encountering a ruamel.yaml.comments.CommentedMap, set the anchor (currently the attribute with the value of ruamel.yaml.comments.Anchor.attrib i.e. _yaml_anchor). Untested code:
if isinstance(x, ruamel.yaml.comments.CommentedMap):
if 'name' in x:
x.yaml_set_anchor(x['name'])
If you have a YAML document that you can round-trip you can hook into the representer:
import sys
import ruamel.yaml
from ruamel.yaml.representer import RoundTripRepresenter
yaml_str = """\
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &id001
a: 1
b: 2
name: mydata
- c: 3
- *id001
"""
class MyRTR(RoundTripRepresenter):
def represent_mapping(self, tag, mapping, flow_style=None):
if 'name' in mapping:
# if not isinstance(mapping, ruamel.yaml.comments.CommentedMap):
# mapping = ruamel.yaml.comments.CommentedMap(mapping)
mapping.yaml_set_anchor(mapping['name'])
mapping.yaml_set_anchor(mapping['name'])
return RoundTripRepresenter.represent_mapping(
self, tag, mapping, flow_style=flow_style)
yaml = ruamel.yaml.YAML()
yaml.Representer = MyRTR
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
which gives:
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &mydata a: 1
b: 2
name: mydata
- c: 3
- *mydata
But note that this assumes that you loaded the data and that all dicts are actually CommentedMaps under the hood. If that is not the case (i.e. you added normal dicts, then uncomment the two lines doing the conversion.

Related

Is it possible to remove unecessary nested structure in yaml file?

I need to set a param that is deep inside a yaml object like below:
executors:
hpc01:
context:
cp2k:
charge: 0
Is it possible to make it more clear, for example
executors: hpc01: context: cp2k: charge: 0
I am using ruamel.yaml in Python to parse the file and it fails to parse the example. Is there some yaml dialect can support such style, or is there better way to write such configuration in standard yaml spec?
since all json is valid yaml...
executors: {"hpc01" : {"context": {"cp2k": {"charge": 0}}}}
should be valid...
a little proof:
from ruamel.yaml import YAML
a = YAML().load('executors: {"hpc01" : {"context": {"cp2k": {"charge": 0}}}}')
b = YAML().load('''executors:
hpc01:
context:
cp2k:
charge: 0''')
if a == b:
print ("equal")
will print: equal.
What you are proposing is invalid YAML, since the colon + space is parsed as a value indicator. Since
YAML can have mappings as keys for other mappings, you would get all kinds of interpretation issues, such as
should
a: b: c
be interpreted as a mapping with key a: b and value c or as a mapping with key a and value b: c.
If you want to write everything on one line, and don't want the overhead of YAML's flow-style, I suggest
you use the fact that the value indicator expects a space after the colon and do a little post-processing:
import sys
import ruamel.yaml
yaml_str = """\
before: -1
executors:hpc01:context:cp2k:charge: 0
after: 1
"""
COLON = ':'
def unfold_keys(d):
if isinstance(d, dict):
replace = []
for idx, (k, v) in enumerate(d.items()):
if COLON in k:
for segment in reversed(k.split(COLON)):
v = {segment: v}
replace.append((idx, k, v))
else:
unfold_keys(v)
for idx, oldkey, kv in replace:
del d[oldkey]
v = list(kv.values())[0]
# v.refold = True
d.insert(idx, list(kv.keys())[0], v)
elif isinstance(d, list):
for elem in d:
unfold_keys
return d
yaml = ruamel.yaml.YAML()
data = unfold_keys(yaml.load(yaml_str))
yaml.dump(data, sys.stdout)
which gives:
before: -1
executors:
hpc01:
context:
cp2k:
charge: 0
after: 1
Since ruamel.yaml parses mappings in the default mode to CommentedMap instances which have .insert() method,
you can actually preserve the position of the "unfolded" key in the mapping.
You can of course use another character (e.g. underscore). You can also reverse the process by uncommenting the line # v.refold = True and provide another recursive function that walks over the data and checks on that attribute and does the reverse
of unfold_keys(), just before dumping.

Reading from nested json and getting None type Error -> try/except

I am reading data from nested json with this code:
data = json.loads(json_file.json)
for nodesUni in data["data"]["queryUnits"]['nodes']:
try:
tm = (nodesUni['sql']['busData'][0]['engine']['engType'])
except:
tm = ''
try:
to = (nodesUni['sql']['carData'][0]['engineData']['producer']['engName'])
except:
to = ''
json_output_for_one_GU_owner = {
"EngineType": tm,
"EngineName": to,
}
I am having an issue with None type error (eg. this one doesn't exists at all nodesUni['sql']['busData'][0]['engine']['engType'] cause there are no data, so I am using try/except. But my code is more complex and having a try/except for every value is crazy. Is there any other option how to deal with this?
Error: "TypeError: 'NoneType' object is not subscriptable"
This is non-trivial as your requirement is to traverse the dictionaries without errors, and get an empty string value in the end, all that in a very simple expression like cascading the [] operators.
First method
My approach is to add a hook when loading the json file, so it creates default dictionaries in an infinite way
import collections,json
def superdefaultdict():
return collections.defaultdict(superdefaultdict)
def hook(s):
c = superdefaultdict()
c.update(s)
return(c)
data = json.loads('{"foo":"bar"}',object_hook=hook)
print(data["x"][0]["zzz"]) # doesn't exist
print(data["foo"]) # exists
prints:
defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {})
bar
when accessing some combination of keys that don't exist (at any level), superdefaultdict recursively creates a defaultdict of itself (this is a nice pattern, you can read more about it in Is there a standard class for an infinitely nested defaultdict?), allowing any number of non-existing key levels.
Now the only drawback is that it returns a defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {}) which is ugly. So
print(data["x"][0]["zzz"] or "")
prints empty string if the dictionary is empty. That should suffice for your purpose.
Use like that in your context:
def superdefaultdict():
return collections.defaultdict(superdefaultdict)
def hook(s):
c = superdefaultdict()
c.update(s)
return(c)
data = json.loads(json_file.json,object_hook=hook)
for nodesUni in data["data"]["queryUnits"]['nodes']:
tm = nodesUni['sql']['busData'][0]['engine']['engType'] or ""
to = nodesUni['sql']['carData'][0]['engineData']['producer']['engName'] or ""
Drawbacks:
It creates a lot of empty dictionaries in your data object. Shouldn't be a problem (except if you're very low in memory) as the object isn't dumped to a file afterwards (where the non-existent values would appear)
If a value already exists, trying to access it as a dictionary crashes the program
Also if some value is 0 or an empty list, the or operator will pick "". This can be workarounded with another wrapper that tests if the object is an empty superdefaultdict instead. Less elegant but doable.
Second method
Convert the access of your successive dictionaries as a string (for instance just double quote your expression like "['sql']['busData'][0]['engine']['engType']", parse it, and loop on the keys to get the data. If there's an exception, stop and return an empty string.
import json,re,operator
def get(key,data):
key_parts = [x.strip("'") if x.startswith("'") else int(x) for x in re.findall(r"\[([^\]]*)\]",key)]
try:
for k in key_parts:
data = data[k]
return data
except (KeyError,IndexError,TypeError):
return ""
testing with some simple data:
data = json.loads('{"foo":"bar","hello":{"a":12}}')
print(get("['sql']['busData'][0]['engine']['engType']",data))
print(get("['hello']['a']",data))
print(get("['hello']['a']['e']",data))
we get, empty string (some keys are missing), 12 (the path is valid), empty string (we tried to traverse a non-dict existing value).
The syntax could be simplified (ex: "sql"."busData".O."engine"."engType") but would still have to retain a way to differentiate keys (strings) from indices (integers)
The second approach is probably the most flexible one.

Access elements inside yaml using python

I am using yaml and pyyaml to configure my application.
Is it possible to configure something like this -
config.yml -
root:
repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
data_root: $root.repo_root/data
service:
root: $root.data_root/csv/xyz.csv
yaml loading function -
def load_config(config_path):
config_path = os.path.abspath(config_path)
if not os.path.isfile(config_path):
raise FileNotFoundError("{} does not exist".format(config_path))
else:
with open(config_path) as f:
config = yaml.load(f, Loader=yaml.SafeLoader)
# logging.info(config)
logging.info("Config used for run - \n{}".format(yaml.dump(config, sort_keys=False)))
return DotDict(config)
Current Output-
root:
repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
data_root: ${root.repo_root}/data
service:
root: ${root.data_root}/csv/xyz.csv
Desired Output -
root:
repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
data_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data
service:
root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv
Is this even possible with python? If so any help would be really nice.
Thanks in advance.
A general approach:
read the file as is
search for strings containing $:
determine the "path" of "variables"
replace the "variables" with actual values
An example, using recursive call for dictionaries and replaces strings:
import re, pprint, yaml
def convert(input,top=None):
"""Replaces $key1.key2 with actual values. Modifies input in-place"""
if top is None:
top = input # top should be the original input
if isinstance(input,dict):
ret = {k:convert(v,top) for k,v in input.items()} # recursively convert items
if input != ret: # in case order matters, do it one or several times more until no change happens
ret = convert(ret)
input.update(ret) # update original input
return input # return updated input (for the case of recursion)
if isinstance(input,str):
vars = re.findall(r"\$[\w_\.]+",input) # find $key_1.key_2.keyN sequences
for var in vars:
keys = var[1:].split(".") # remove dollar and split by dots to make "key chain"
val = top # starting from top ...
for k in keys: # ... for each key in the key chain ...
val = val[k] # ... go one level down
input = input.replace(var,val) # replace $key sequence eith actual value
return input # return modified input
# TODO int, float, list, ...
with open("in.yml") as f: config = yaml.load(f) # load as is
convert(config) # convert it (in-place)
pprint.pprint(config)
Output:
{'root': {'data_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data',
'repo_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght'},
'service': {'root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv'}}
Note: YAML is not that important here, would work also with JSON, XML or other formats.
Note2: If you use exclusively YAML and exclusively python, some answers from this post may be useful (using anchors and references and application specific local tags)

How to map over a CommentedMap while preserving the comments/style?

Given a ruamel.yaml CommentedMap, and some transformation function f: CommentedMap → Any, I would like to produce a new CommentedMap with transformed keys and values, but otherwise as similar as possible to the original.
If I don't care about preserving style, I can do this:
result = {
f(key) : f(value)
for key, value in my_commented_map.items()
}
If I didn't need to transform the keys (and I didn't care about mutating the original), I could do this:
for key, value in my_commented_map.items():
my_commented_map[key] = f(value)
The style and comment information are each attached to the
CommentedMap via special attributes. The style you can copy, but
the comments are partly indexed to key on which line they occur, and
if you transform that key, you also need to transform that indexed
comment.
In your first example you apply f() to both key and value, I'll use
seperate functions in my example, all-capsing the keys, and
all-lowercasing the values (this of course only works on string type
keys and value, so this is a restriction of the example, not of
the solution)
import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap as CM
from ruamel.yaml.comments import Format, Comment
yaml_str = """\
# example YAML document
abc: All Strings are Equal # but some Strings are more Equal then others
klm: Flying Blue
xYz: the End # for now
"""
def fkey(s):
return s.upper()
def fval(s):
return s.lower()
def transform(data, fk, fv):
d = CM()
if hasattr(data, Format.attrib):
setattr(d, Format.attrib, getattr(data, Format.attrib))
ca = None
if hasattr(data, Comment.attrib):
setattr(d, Comment.attrib, getattr(data, Comment.attrib))
ca = getattr(d, Comment.attrib)
# as the key mapping could map new keys on old keys, first gather everything
key_com = {}
for k in data:
new_k = fk(k)
d[new_k] = fv(data[k])
if ca is not None and k in ca.items:
key_com[new_k] = ca.items.pop(k)
if ca is not None:
assert len(ca.items) == 0
ca._items = key_com # the attribute, not the read-only property
return d
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
# the following will print any new CommentedMap with curly braces, this just here to check
# if the style attribute copying is working correctly, remove from real code
yaml.default_flow_style = True
data = transform(data, fkey, fval)
yaml.dump(data, sys.stdout)
which gives:
# example YAML document
ABC: all strings are equal # but some Strings are more Equal then others
KLM: flying blue
XYZ: the end # for now
Please note:
the above tries (and succeeds) to start a comment in the original
column, if that is not possible, e.g. when a transformed key or
value takes more space, it is pushed further to the right.
if you have a more complex datastructure, recursively walk the tree, descending into mappings
and sequences. In that case it might be more easy to store (key, value, comment) tuples
then pop() all the keys and reinsert the stored values (instead of rebuilding the tree).

Dumping a dictionary to a YAML file while preserving order

I've been trying to dump a dictionary to a YAML file. The problem is that the program that imports the YAML file needs the keywords in a specific order. This order is not alphabetically.
import yaml
import os
baseFile = 'myfile.dat'
lyml = [{'BaseFile': baseFile}]
lyml.append({'Environment':{'WaterDepth':0.,'WaveDirection':0.,'WaveGamma':0.,'WaveAlpha':0.}})
CaseName = 'OrderedDict.yml'
CaseDir = r'C:\Users\BTO\Documents\Projects\Mooring code testen'
CaseFile = os.path.join(CaseDir, CaseName)
with open(CaseFile, 'w') as f:
yaml.dump(lyml, f, default_flow_style=False)
This produces a *.yml file which is formatted like this:
- BaseFile: myfile.dat
- Environment:
WaterDepth: 0.0
WaveAlpha: 0.0
WaveDirection: 0.0
WaveGamma: 0.0
But what I want is that the order is preserved:
- BaseFile: myfile.dat
- Environment:
WaterDepth: 0.0
WaveDirection: 0.0
WaveGamma: 0.0
WaveAlpha: 0.0
Is this possible?
yaml.dump has a sort_keys keyword argument that is set to True by default. Set it to False to not reorder:
with open(CaseFile, 'w') as f:
yaml.dump(lyml, f, default_flow_style=False, sort_keys=False)
Use an OrderedDict instead of dict. Run the below setup code at the start. Now yaml.dump, should preserve the order. More details here and here
def setup_yaml():
""" https://stackoverflow.com/a/8661021 """
represent_dict_order = lambda self, data: self.represent_mapping('tag:yaml.org,2002:map', data.items())
yaml.add_representer(OrderedDict, represent_dict_order)
setup_yaml()
Example: https://pastebin.com/raw.php?i=NpcT6Yc4
PyYAML supports representer to serialize a class instance to a YAML node.
yaml.YAMLObject uses metaclass magic to register a constructor, which transforms a YAML node to a class instance, and a representer, which serializes a class instance to a YAML node.
Add following lines above your code:
def represent_dictionary_order(self, dict_data):
return self.represent_mapping('tag:yaml.org,2002:map', dict_data.items())
def setup_yaml():
yaml.add_representer(OrderedDict, represent_dictionary_order)
setup_yaml()
Then you can use OrderedDict to preserve the order in yaml.dump():
import yaml
from collections import OrderedDict
def represent_dictionary_order(self, dict_data):
return self.represent_mapping('tag:yaml.org,2002:map', dict_data.items())
def setup_yaml():
yaml.add_representer(OrderedDict, represent_dictionary_order)
setup_yaml()
dic = OrderedDict()
dic['a'] = 1
dic['b'] = 2
dic['c'] = 3
print(yaml.dump(dic))
# {a: 1, b: 2, c: 3}
Your difficulties are a result of assumptions on multiple levels that are incorrect and, depending on your YAML parser, might not be transparently resolvable.
In Python's dict the keys are unordered (at least for Python < 3.6). And even though the keys have some order in the source file, as soon as they are in the dict they aren't:
d = {'WaterDepth':0.,'WaveDirection':0.,'WaveGamma':0.,'WaveAlpha':0.}
for key in d:
print key
gives:
WaterDepth
WaveGamma
WaveAlpha
WaveDirection
If you want your keys ordered you can use the collections.OrderedDict type (or my own ruamel.ordereddict type which is in C and more than an order of magnitude faster), and you have to add the keys ordered, either as a list of tuples:
from ruamel.ordereddict import ordereddict
# from collections import OrderedDict as ordereddict # < this will work as well
d = ordereddict([('WaterDepth', 0.), ('WaveDirection', 0.), ('WaveGamma', 0.), ('WaveAlpha', 0.)])
for key in d:
print key
which will print the keys in the order they were specified in the source.
The second problem is that even if a Python dict has some key ordering that happens to be what you want, the YAML specification does explicitly say that mappings are unordered and that is the way e.g. PyYAML implements the dumping of Python dict to YAML mapping (And the other way around).
Also, if you dump an ordereddict or OrderedDict you normally don't get the plain YAML mapping that you indicate you want, but some tagged YAML entry.
As losing the order is often undesirable, in your case because your reader assumes some order, in my case because that made it difficult to compare versions because key ordering would not be consistent after insertion/deletion, I implemented round-trip consistency in ruamel.yaml so you can do:
import sys
import ruamel.yaml as yaml
yaml_str = """\
- BaseFile: myfile.dat
- Environment:
WaterDepth: 0.0
WaveDirection: 0.0
WaveGamma: 0.0
WaveAlpha: 0.0
"""
data = yaml.load(yaml_str, Loader=yaml.RoundTripLoader)
print(data)
yaml.dump(data, sys.stdout, Dumper=yaml.RoundTripDumper)
which gives you exactly your output result. data works as a dict (and so does `data['Environment'], but underneath they are smarter constructs that preserve order, comments, YAML anchor names etc). You can of course change these (adding/deleting key-value pairs), which is easy, but you can also build these from scratch:
import sys
import ruamel.yaml as yaml
from ruamel.yaml.comments import CommentedMap
baseFile = 'myfile.dat'
lyml = [{'BaseFile': baseFile}]
lyml.append({'Environment': CommentedMap([('WaterDepth', 0.), ('WaveDirection', 0.), ('WaveGamma', 0.), ('WaveAlpha', 0.)])})
yaml.dump(data, sys.stdout, Dumper=yaml.RoundTripDumper)
Which again prints the contents with keys in the order you want them.
I find the later less readable, than when starting from a YAML string, but it does construct the lyml data structure somewhat faster.
oyaml is a python library which preserves dict ordering when dumping.
It is specifically helpful in more complex cases where the dictionary is nested and may contain lists.
Once installed:
import oyaml as yaml
with open(CaseFile, 'w') as f:
f.write(yaml.dump(lyml))

Categories

Resources