Given a ruamel.yaml CommentedMap, and some transformation function f: CommentedMap → Any, I would like to produce a new CommentedMap with transformed keys and values, but otherwise as similar as possible to the original.
If I don't care about preserving style, I can do this:
result = {
    f(key): f(value)
    for key, value in my_commented_map.items()
}
If I didn't need to transform the keys (and I didn't care about mutating the original), I could do this:
for key, value in my_commented_map.items():
    my_commented_map[key] = f(value)
The style and comment information are each attached to the CommentedMap via special attributes. The style you can copy as-is, but the comments are partly indexed by the key of the line on which they occur; if you transform that key, you also need to move the comment to the transformed key.
In your first example you apply f() to both key and value; I'll use separate functions in my example, upper-casing the keys and lower-casing the values (this of course only works on string keys and values, so it is a restriction of the example, not of the solution):
import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap as CM
from ruamel.yaml.comments import Format, Comment

yaml_str = """\
# example YAML document
abc: All Strings are Equal # but some Strings are more Equal then others
klm: Flying Blue
xYz: the End # for now
"""

def fkey(s):
    return s.upper()

def fval(s):
    return s.lower()

def transform(data, fk, fv):
    d = CM()
    if hasattr(data, Format.attrib):
        setattr(d, Format.attrib, getattr(data, Format.attrib))
    ca = None
    if hasattr(data, Comment.attrib):
        setattr(d, Comment.attrib, getattr(data, Comment.attrib))
        ca = getattr(d, Comment.attrib)
    # as the key mapping could map new keys on old keys, first gather everything
    key_com = {}
    for k in data:
        new_k = fk(k)
        d[new_k] = fv(data[k])
        if ca is not None and k in ca.items:
            key_com[new_k] = ca.items.pop(k)
    if ca is not None:
        assert len(ca.items) == 0
        ca._items = key_com  # the attribute, not the read-only property
    return d

yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
# the following will print any new CommentedMap with curly braces, this just here to check
# if the style attribute copying is working correctly, remove from real code
yaml.default_flow_style = True
data = transform(data, fkey, fval)
yaml.dump(data, sys.stdout)
which gives:
# example YAML document
ABC: all strings are equal # but some Strings are more Equal then others
KLM: flying blue
XYZ: the end # for now
Please note:

- the above tries (and succeeds) to start a comment in the original column; if that is not possible, e.g. when a transformed key or value takes more space, it is pushed further to the right.
- if you have a more complex data structure, recursively walk the tree, descending into mappings and sequences. In that case it might be easier to store (key, value, comment) tuples, then pop() all the keys and reinsert the stored values (instead of rebuilding the tree).
I need to set a param that is deep inside a yaml object like below:
executors:
  hpc01:
    context:
      cp2k:
        charge: 0
Is it possible to make it more compact, for example:

executors: hpc01: context: cp2k: charge: 0

I am using ruamel.yaml in Python to parse the file and it fails to parse that example. Is there some YAML dialect that supports such a style, or is there a better way to write such a configuration within the standard YAML spec?
since all json is valid yaml...
executors: {"hpc01" : {"context": {"cp2k": {"charge": 0}}}}
should be valid...
a little proof:
from ruamel.yaml import YAML

a = YAML().load('executors: {"hpc01" : {"context": {"cp2k": {"charge": 0}}}}')
b = YAML().load('''executors:
  hpc01:
    context:
      cp2k:
        charge: 0''')
if a == b:
    print("equal")
will print: equal.
What you are proposing is invalid YAML, since the colon + space is parsed as a value indicator. Since
YAML can have mappings as keys for other mappings, you would get all kinds of interpretation issues, such as
should
a: b: c
be interpreted as a mapping with key a: b and value c or as a mapping with key a and value b: c.
If you want to write everything on one line, and don't want the overhead of YAML's flow-style, I suggest
you use the fact that the value indicator expects a space after the colon and do a little post-processing:
import sys
import ruamel.yaml

yaml_str = """\
before: -1
executors:hpc01:context:cp2k:charge: 0
after: 1
"""

COLON = ':'

def unfold_keys(d):
    if isinstance(d, dict):
        replace = []
        for idx, (k, v) in enumerate(d.items()):
            if COLON in k:
                for segment in reversed(k.split(COLON)):
                    v = {segment: v}
                replace.append((idx, k, v))
            else:
                unfold_keys(v)
        for idx, oldkey, kv in replace:
            del d[oldkey]
            v = list(kv.values())[0]
            # v.refold = True
            d.insert(idx, list(kv.keys())[0], v)
    elif isinstance(d, list):
        for elem in d:
            unfold_keys(elem)
    return d

yaml = ruamel.yaml.YAML()
data = unfold_keys(yaml.load(yaml_str))
yaml.dump(data, sys.stdout)
which gives:
before: -1
executors:
  hpc01:
    context:
      cp2k:
        charge: 0
after: 1
Since ruamel.yaml parses mappings in the default (round-trip) mode to CommentedMap instances, which have an .insert() method, you can actually preserve the position of the "unfolded" key in the mapping.
You can of course use another character (e.g. an underscore). You can also reverse the process: uncomment the line # v.refold = True and provide another recursive function that walks over the data, checks for that attribute, and does the reverse of unfold_keys(), just before dumping.
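As a sketch of what that reverse function could look like (this hypothetical refold_keys collapses every chain of single-key mappings instead of checking the refold marker, and returns plain dicts rather than CommentedMaps):

```python
def refold_keys(d):
    # hypothetical sketch: collapse chains of single-key nested mappings
    # back into one colon-joined key
    if not isinstance(d, dict):
        return d
    result = {}
    for k, v in d.items():
        segments = [k]
        # follow the chain of single-key mappings downwards
        while isinstance(v, dict) and len(v) == 1:
            (inner_key, v), = v.items()
            segments.append(inner_key)
        result[':'.join(segments)] = refold_keys(v)
    return result
```

Applied to the unfolded data above, this turns the nested executors mapping back into the single executors:hpc01:context:cp2k:charge key; with the refold marker you would only collapse mappings that were actually unfolded.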
This is for a script I'm running in Blender, but the question pertains to the Python part of it. It's not specific to Blender.
The script is originally from this answer, and it replaces a given material (the key) with its newer equivalent (the value).
Here's the code:
import bpy

objects = bpy.context.selected_objects

mat_dict = {
    "SOLID-WHITE": "Sld_WHITE",
    "SOLID-BLACK": "Sld_BLACK",
    "SOLID-BLUE": "Sld_BLUE"
}

for obj in objects:
    for slot in obj.material_slots:
        slot.material = bpy.data.materials[mat_dict[slot.material.name]]
The snag is, how to handle duplicates when the scene may have not only objects with the material "SOLID-WHITE", but also "SOLID-WHITE.001", "SOLID-WHITE.002", and so on.
I was looking at this answer to a question about wildcards in Python, and it seems fnmatch might be well-suited for this task.
I've tried working fnmatch into the last line of the code. I've also tried wrapping the dictionary keys with it (very WET, I know). Neither of these approaches has worked.
How can I run a wildcard match on each dictionary key?
So for example, whether an object has "SOLID-WHITE" or "SOLID-WHITE"-dot-some-number, it will still be replaced with "Sld_WHITE"?
I have no clue about Blender so I'm not sure if I'm getting the problem right, but how about the following?
mat_dict = {
    "SOLID-WHITE": "Sld_WHITE",
    "SOLID-BLACK": "Sld_BLACK",
    "SOLID-BLUE": "Sld_BLUE"
}

def get_new_material(old_material):
    for k, v in mat_dict.items():
        # .split(".")[0] extracts the part to the left of the dot (if there is one)
        if old_material.split(".")[0] == k:
            return v
    return old_material

for obj in objects:
    for slot in obj.material_slots:
        new_material = get_new_material(slot.material.name)
        slot.material = bpy.data.materials[new_material]
Instead of the .split(".")[0] you could use re.match, storing regexes as keys in your dictionary. As you noticed in the comment, startswith could match too much, and the same would be the case for fnmatch.
Examples of the above function in action:
In [3]: get_new_material("SOLID-WHITE.001")
Out[3]: 'Sld_WHITE'
In [4]: get_new_material("SOLID-WHITE")
Out[4]: 'Sld_WHITE'
In [5]: get_new_material("SOLID-BLACK")
Out[5]: 'Sld_BLACK'
In [6]: get_new_material("test")
Out[6]: 'test'
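For completeness, a sketch of that regex-keyed variant (the compiled-pattern dictionary is an assumption of this sketch, not part of the Blender API; fullmatch keeps a name like "SOLID-WHITEST" from matching):

```python
import re

# hypothetical layout: compiled patterns as dictionary keys
mat_patterns = {
    re.compile(r"SOLID-WHITE(\.\d+)?"): "Sld_WHITE",
    re.compile(r"SOLID-BLACK(\.\d+)?"): "Sld_BLACK",
    re.compile(r"SOLID-BLUE(\.\d+)?"): "Sld_BLUE",
}

def get_new_material(old_material):
    # fullmatch: the whole name must match, optionally with a .NNN suffix
    for pattern, replacement in mat_patterns.items():
        if pattern.fullmatch(old_material):
            return replacement
    return old_material
```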
There are two ways you can approach this.
You can make a smart dictionary that matches vague names, or you can change the key that is used to look up a color.
Here is an example of the first approach using fnmatch.
This approach changes the lookup time complexity from O(1) to O(n) whenever a color name contains a number. It extends UserDict with a __missing__ method, which gets called if the key is not found in the dictionary; the method then compares the given key against every stored key using fnmatch.
from collections import UserDict
import fnmatch

import bpy

objects = bpy.context.selected_objects

class Colors(UserDict):
    def __missing__(self, key):
        for color in self.keys():
            if fnmatch.fnmatch(key, color + "*"):
                return self[color]
        raise KeyError(f"could not match {key}")

mat_dict = Colors({
    "SOLID-WHITE": "Sld_WHITE",
    "SOLID-BLACK": "Sld_BLACK",
    "SOLID-BLUE": "Sld_BLUE"
})

for obj in objects:
    for slot in obj.material_slots:
        slot.material = bpy.data.materials[mat_dict[slot.material.name]]
Here is an example of the second approach using regex.
import re

import bpy

objects = bpy.context.selected_objects

mat_dict = {
    "SOLID-WHITE": "Sld_WHITE",
    "SOLID-BLACK": "Sld_BLACK",
    "SOLID-BLUE": "Sld_BLUE"
}

pattern = re.compile(r"([A-Z\-]+)(?:\.\d+)?")
# matches any number of capital letters and dashes
# can be followed by a dot followed by any number of digits
# this pattern can match the following strings
# ["AAAAA", "----", "AA-AA.00005"]

for obj in objects:
    for slot in obj.material_slots:
        match = pattern.fullmatch(slot.material.name)
        if match:
            slot.material = bpy.data.materials[mat_dict[match.group(1)]]
        else:
            slot.material = bpy.data.materials[mat_dict[slot.material.name]]
I'm editing a large YAML document in Python with extensive anchors and aliases. I'd like to be able to determine how the anchor is derived based on data from the node it references.
For instance the node has a 'name' field and I'd like the anchor to be the value of that field rather than a random id number.
Is this possible with PyYAML or ruamel.yaml?
There are a few things to keep in mind:

- YAML has no fields. I assume that is your interpretation of keys in a mapping, i.e. you want the anchor associated with a mapping to be the same as the value for the key 'name'.
- During loading, the event created when encountering an anchor doesn't know whether it is an anchor on a scalar, sequence, or mapping, let alone have access to the value for 'name'.
- Changing the anchor during load is tricky, as you have to keep track of aliases referring to the original anchor (and map them to its new value).
- In PyYAML the anchor name gets created during dumping, so you would have to hook into that when using PyYAML. You can do the same with ruamel.yaml.
- Only ruamel.yaml has the capability to preserve an anchor on round-trip, i.e. you can have the anchor be persistent even if the value for the key 'name' changes (assuming you test e.g. for the default generated form idNNNN).
When you use ruamel.yaml you can recursively walk the data structure, keeping track of nodes already visited (in case a child contains an ancestor), and when encountering a ruamel.yaml.comments.CommentedMap, set the anchor (currently the attribute with the value of ruamel.yaml.comments.Anchor.attrib, i.e. _yaml_anchor). Untested code:

if isinstance(x, ruamel.yaml.comments.CommentedMap):
    if 'name' in x:
        x.yaml_set_anchor(x['name'])
If you have a YAML document that you can round-trip you can hook into the representer:
import sys
import ruamel.yaml
from ruamel.yaml.representer import RoundTripRepresenter

yaml_str = """\
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &id001
  a: 1
  b: 2
  name: mydata
- c: 3
- *id001
"""

class MyRTR(RoundTripRepresenter):
    def represent_mapping(self, tag, mapping, flow_style=None):
        if 'name' in mapping:
            # if not isinstance(mapping, ruamel.yaml.comments.CommentedMap):
            #     mapping = ruamel.yaml.comments.CommentedMap(mapping)
            mapping.yaml_set_anchor(mapping['name'])
        return RoundTripRepresenter.represent_mapping(
            self, tag, mapping, flow_style=flow_style)

yaml = ruamel.yaml.YAML()
yaml.Representer = MyRTR
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
which gives:
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &mydata a: 1
  b: 2
  name: mydata
- c: 3
- *mydata
But note that this assumes that you loaded the data and that all dicts are actually CommentedMaps under the hood. If that is not the case (i.e. you added normal dicts), uncomment the two lines doing the conversion.
I am working on getting all text that exists in several .yaml files placed into a new singular YAML file that will contain the English translations that someone can then translate into Spanish.
Each YAML file has a lot of nested text. I want to print the full 'path', aka all the keys, along with the value, for each value in the YAML file. Here's an example input for a .yaml file that lives in the myproject.section.more_information file:
default:
  heading: Here’s A Title
  learn_more:
    title: Title of Thing
    url: www.url.com
    description: description
    opens_new_window: true
and here's the desired output:
myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description: description
myproject.section.more_information.default.learn_more.opens_new_window: true
This seems like a good candidate for recursion, so I've looked at examples such as this answer
However, I want to preserve all of the keys that lead to a given value, not just the last key in a value. I'm currently using PyYAML to read/write YAML.
Any tips on how to save each key as I continue to check if the item is a dictionary and then return all the keys associated with each value?
What you're wanting to do is flatten nested dictionaries. This would be a good place to start: Flatten nested Python dictionaries, compressing keys
In fact, I think the code snippet in the top answer would work for you if you just changed the sep argument to '.'.
Edit: here is a working example based on the linked SO answer: http://ideone.com/Sx625B
import collections.abc

some_dict = {
    'default': {
        'heading': 'Here’s A Title',
        'learn_more': {
            'title': 'Title of Thing',
            'url': 'www.url.com',
            'description': 'description',
            'opens_new_window': 'true'
        }
    }
}

def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results = flatten(some_dict, parent_key='', sep='.')
for item in results:
    print(item + ': ' + results[item])
If you want it in order, you'll need an OrderedDict though.
Walking over nested dictionaries begs for recursion, and handing the "prefix" of the "path" down into each recursive call saves you from having to do any manipulation on the segments of the path (as @Prune suggests).
There are a few things to keep in mind that make this problem interesting:

- using multiple files can result in the same path occurring in more than one file, which you need to handle (at least by throwing an error, as otherwise you might silently lose data). In my example I generate a list of values.
- dealing with special keys (non-string keys (convert?), the empty string, keys containing a .). My example reports these and exits.
Example code using ruamel.yaml ¹:
import sys
import glob
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq
from ruamel.yaml.compat import string_types, ordereddict

class Flatten:
    def __init__(self, base):
        self._result = ordereddict()  # key to list of tuples of (value, comment)
        self._base = base

    def add(self, file_name):
        data = ruamel.yaml.round_trip_load(open(file_name))
        self.walk_tree(data, self._base)

    def walk_tree(self, data, prefix=None):
        """
        this is based on ruamel.yaml.scalarstring.walk_tree
        """
        if prefix is None:
            prefix = ""
        if isinstance(data, dict):
            for key in data:
                full_key = self.full_key(key, prefix)
                value = data[key]
                if isinstance(value, (dict, list)):
                    self.walk_tree(value, full_key)
                    continue
                # value is a scalar
                comment_token = data.ca.items.get(key)
                comment = comment_token[2].value if comment_token else None
                self._result.setdefault(full_key, []).append((value, comment))
        elif isinstance(data, list):
            print("don't know how to handle lists", prefix)
            sys.exit(1)

    def full_key(self, key, prefix):
        """
        check here for valid keys
        """
        if not isinstance(key, string_types):
            print('key has to be string', repr(key), prefix)
            sys.exit(1)
        if '.' in key:
            print('dot in key not allowed', repr(key), prefix)
            sys.exit(1)
        if key == '':
            print('empty key not allowed', repr(key), prefix)
            sys.exit(1)
        return prefix + '.' + key

    def dump(self, out):
        res = CommentedMap()
        for path in self._result:
            values = self._result[path]
            if len(values) == 1:  # single value for path
                res[path] = values[0][0]
                if values[0][1]:
                    res.yaml_add_eol_comment(values[0][1], key=path)
                continue
            res[path] = seq = CommentedSeq()
            for index, value in enumerate(values):
                seq.append(value[0])
                if value[1]:
                    seq.yaml_add_eol_comment(value[1], key=index)
        ruamel.yaml.round_trip_dump(res, out)

flatten = Flatten('myproject.section.more_information')
for file_name in glob.glob('*.yaml'):
    flatten.add(file_name)
flatten.dump(sys.stdout)
If you have an additional input file:
default:
  learn_more:
    commented: value # this value has a comment
    description: another description
then the result is:
myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description:
- description
- another description
myproject.section.more_information.default.learn_more.opens_new_window: true
myproject.section.more_information.default.learn_more.commented: value # this value has a comment
Of course if your input doesn't have double paths, your output won't have any lists.
Using string_types and ordereddict from ruamel.yaml makes this code Python 2 and Python 3 compatible (you don't indicate which version you are using).
The ordereddict preserves the original key ordering, but this is of course dependent on the processing order of the files. If you want the paths sorted, just change dump() to use:
for path in sorted(self._result):
Also note that the comment on the 'commented' dictionary entry is preserved.
¹ ruamel.yaml is a YAML 1.2 parser that preserves comments and other data on round-tripping (PyYAML does most parts of YAML 1.1). Disclaimer: I am the author of ruamel.yaml
Keep a simple list of strings, being the most recent key at each indentation depth. When you progress from one line to the next with no change, simply change the item at the end of the list. When you "out-dent", pop the last item off the list. When you indent, append to the list.
Then, each time you hit a colon, the corresponding key item is the concatenation of the strings in the list, something like:
'.'.join(key_list)
Does that get you moving at an honorable speed?
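A minimal sketch of that bookkeeping, treating the YAML as plain text lines (assumptions: two-space indents, simple key: value lines, no sequences or multi-line scalars; flatten_lines is a hypothetical name):

```python
def flatten_lines(text, prefix='', indent=2):
    # keep the most recent key at each indentation depth in key_list
    base = [prefix] if prefix else []
    key_list = []
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        key, _, value = line.strip().partition(':')
        del key_list[depth:]   # pop on out-dent, overwrite at the same depth
        key_list.append(key)
        if value.strip():      # a line with a value: emit the full dotted path
            result.append('.'.join(base + key_list) + ':' + value.rstrip())
    return result
```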
I am using the following sets of generators to parse XML in to CSV:
import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import csv

def flatten_list(aList, prefix=''):
    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, eprefix)
            # treat like list
            elif element[0].tag == element[1].tag:
                yield from flatten_list(element, eprefix)
        elif element.text:
            text = element.text.strip()
            if text:
                yield eprefix[:].rstrip('.'), element.text

def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = element.tag
        if element:
            # treat like dict - we assume that if the first two tags
            # in a series are different, then they are all different.
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags
            # in a series are the same, then the rest are the same.
            else:
                # here, we put the list in dictionary; the key is the
                # tag name the list elements all share in common, and
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix + k, v
        # this assumes that if you've got an attribute in a tag,
        # you won't be having any text. This may or may not be a
        # good idea -- time will tell. It works for the way we are
        # currently doing XML configuration files...
        elif element.items():
            for k, v in element.items():
                yield eprefix + k, v
        # finally, if there are no child tags and no attributes, extract
        # the text
        else:
            yield eprefix, element.text

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

def main():
    with open('2-Response_duplicate.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string = xml_string.replace('&', '')  # optional, to remove ampersands
    root = ElementTree.XML(xml_string)
    # for key, value in flatten_dict(root):
    #     key = key.rstrip('.').rsplit('.', 1)[-1]
    #     print(key, value)
    writer = csv.writer(open("try5.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
    main()
One column of the CSV, when opened in Excel, looks like this:
ObjectGuid
2adeb916-cc43-4d73-8c90-579dd4aa050a
2e77c588-56e5-4f3f-b990-548b89c09acb
c8743bdd-04a6-4635-aedd-684a153f02f0
1cdc3d86-f9f4-4a22-81e1-2ecc20f5e558
2c19d69b-26d3-4df0-8df4-8e293201656f
6d235c85-6a3e-4cb3-9a28-9c37355c02db
c34e05de-0b0c-44ee-8572-c8efaea4a5ee
9b0fe8f5-8ec4-4f13-b797-961036f92f19
1d43d35f-61ef-4df2-bbd9-30bf014f7e10
9cb132e8-bc69-4e4f-8f29-c1f503b50018
24fd77da-030c-4cb7-94f7-040b165191ce
0a949d4f-4f4c-467e-b0a0-40c16fc95a79
801d3091-c28e-44d2-b9bd-3bad99b32547
7f355633-426d-464b-bab9-6a294e95c5d5
This is due to the fact that there are 14 tags with name ObjectGuid. For example, one of these tags looks like this:
<ObjectGuid>2adeb916-cc43-4d73-8c90-579dd4aa050a</ObjectGuid>
My question: is there an efficient method to enumerate the headers (the keys) such that each duplicated key is numbered, together with its corresponding value (the text in the XML data structure)?
It would be displayed in Excel as follows:
ObjectGuid_1 ObjectGuid_2 ObjectGuid_3 etc.
Please let me know if there is any other information that you need from me (such as sample XML). Thank you for your help.
It is a mistake to add an element, attribute, or annotative descriptor to the data set itself for the purpose of identity. Normalizing the data should only be done if you own that data and know with 100% certainty that doing so will not have any negative effect on additional consumers (ones relying on attribute order to manipulate the DOM). Besides, what is the point of using a dict (or nested dicts) if the efficiency of the hashed-table lookup is taken right back by making O(n) checks for this new attribute? The point of hashing is random lookup. If it is simply structured (key, value) pairs you need, why not use some other contiguous data structure but treat it like a dictionary, say a named tuple?
A second solution, if you want to add additional state, is to wrap your generator in a class.
class Order:
    def __init__(self, lines):
        self.lines = lines
        self.order = []

    def __iter__(self):
        for number, line in enumerate(self.lines, 1):
            self.order.append((number, line))
            yield line

with open('somefile.csv') as f:
    lines = Order(f)
Messing with the data, a harmless conversion? For example, suppose we create a conversion table (see below). That is fine, until one of the values is blank:

import csv

field_types = [('x', float),
               ('y', float)]

with open('some.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)

This works until there is an empty data point, e.g. {'x': '', 'y': '2.2'}: float('') raises a ValueError. Kaboom.
So my suggestion would be not to change or add to the data, but to change the algorithm that deals with it. If the problem is order, why not simply treat the rows as named tuples, which behave much like dictionaries? The caveat is immutability, which however makes sense with uniform data.
I don't understand the nested dictionary; that is for the header values, yes? Or you could just skip the first row:

reader = iter(lines)
next(reader)  # skip the header row; problem solved
for line in reader:
    print(line, end='')
Notables:

To iterate over multiple sequences in parallel:

h = ['a', 'b', 'c']
x = [1, 2, 3]
for i in zip(h, x):
    print(i)

which prints:

('a', 1)
('b', 2)
('c', 3)
To chain sequences:

from itertools import chain

a = [1, 2, 3]
b = ['a', 'b', 'c']
for item in chain(a, b):
    print(item)
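Coming back to the enumeration the question actually asks for: one minimal, hedged sketch is to post-process the (key, value) pairs before handing them to makerows (enumerate_keys is a hypothetical helper; it numbers globally, so a key that occurs only once also gets a _1 suffix):

```python
from collections import Counter

def enumerate_keys(pairs):
    # hypothetical helper: suffix each repeated key with _1, _2, ...
    seen = Counter()
    for key, value in pairs:
        seen[key] += 1
        yield "{}_{}".format(key, seen[key]), value
```

It would then be wired in as writer.writerows(makerows(enumerate_keys(flatten_dict(root)))).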