How to parse string replacement fields in a string in Python?

Python has the concept of string replacement fields, as in mystr1 = "{replaceme} other text..." where {replaceme} (the replacement field) can be filled in via statements such as mystr1.format(replaceme="yay!").
I often work with large strings and sometimes do not know all of the replacement fields in advance. Resolving them manually is not too bad if there are one or two, but sometimes there are dozens, and it would be nice if Python had a function similar to dict.keys() for this.
How does one parse the string replacement fields out of a string in Python?

In lieu of answers from the community, I wrote the helper function below to collect the replacement fields into a dict, whose values I can then simply update before formatting the string.
Is there a better way or built in way to do this?
cool_string = """{a}yo{b}ho{c}ho{d}and{e}a{f}bottle{g}of{h}rum{i}{j}{k}{l}{m}{n}{o}{p}{q}{r}{s}{t}{u}{v}{w}{x}{y}{z}"""

def parse_keys_string(s, keys=None):
    if keys is None:  # avoid a shared mutable default argument
        keys = {}
    try:
        s.format(**keys)  # raises KeyError for the first missing field
        return keys
    except KeyError as e:
        print("Adding Key:", e)
        key = str(e).replace("'", "")
        keys[key] = key
        parse_keys_string(s, keys)
        return keys
cool_string_replacement_fields_dict = parse_keys_string(cool_string)

# set replacement field values
i = 1
for k, v in cool_string_replacement_fields_dict.items():
    cool_string_replacement_fields_dict[k] = i
    i = i + 1

# format the string with desired values...
cool_string_formatted = cool_string.format(**cool_string_replacement_fields_dict)
print(cool_string_formatted)
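Worth noting: the standard library can enumerate the fields directly. string.Formatter().parse() yields (literal_text, field_name, format_spec, conversion) tuples for a format string, so the names can be collected without provoking KeyErrors. A minimal sketch (the helper name replacement_fields is just illustrative):
from string import Formatter

def replacement_fields(s):
    # field_name is None for literal-only chunks and '' for auto-numbered {} fields
    return {name for _, name, _, _ in Formatter().parse(s) if name}

print(replacement_fields(cool_string))  # the set of field names 'a' through 'z'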

I came up with the following:
class NumfillinDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.i = -1

    def __missing__(self, key):  # optionally one could have logic based on key
        self.i += 1
        return f"({self.i})"

cool_string = ("{a}yo{b}ho{c}ho{d}and{e}a{f}bottle{g}of{h}rum{i}\n"
               "{j}{k}{l}{m}{n}{o}{p}{q}{r}{s}{t}{u}{v}{w}{x}{y}{z}")
dt = NumfillinDict(notneeded='something', b=' -=actuallyIknowb<=- ')
filled_string = cool_string.format_map(dt)
print(filled_string)
It works a bit like a defaultdict by filling in missing key-value pairs using the __missing__ method.
Result:
(0)yo -=actuallyIknowb<=- ho(1)ho(2)and(3)a(4)bottle(5)of(6)rum(7)
(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)
Inspired by: Format string unused named arguments

Related

"AttributeError: 'generator' object has no attribute 'replace' "

I'm not sure why I'm seeing this error message: AttributeError: 'generator' object has no attribute 'replace' (on the line modified_file = hex_read_file.replace(batch_to_amend_final, batch_amendment_final)).
import binascii, os, re, time

os.chdir(...)
files_to_amend = os.listdir(...)
joiner = "00"

# Allow user to input the text to be replaced, and with what
while True:
    batch_to_amend3 = input("\n\nWhat number would you like to amend? \n\n >>> ")
    batch_amendment3 = input("\n\nWhat is the new number? \n\n >>> ")
    batch_to_amend2 = batch_to_amend3.encode()
    batch_to_amend = joiner.encode().join(binascii.hexlify(bytes((i,))) for i in batch_to_amend2)
    batch_amendment2 = batch_amendment3.encode()
    batch_amendment = joiner.encode().join(binascii.hexlify(bytes((i,))) for i in batch_amendment2)

    # Function to translate label files
    def lbl_translate(files_to_amend):
        with open(files_to_amend, 'rb') as read_file:
            read_file2 = read_file.read()
        hex_read_file = (binascii.hexlify(bytes((i,))) for i in read_file2)
        print(hex_read_file)
        modified_file = hex_read_file.replace(batch_to_amend, batch_amendment)
        with open(files_to_amend, 'wb') as write_file:
            write_file.write(modified_file)
            write_file.close()
        print("Amended: " + files_to_amend)

    # Calling function to modify labels
    for label in files_to_amend:
        lbl_translate(label)
hex_read_file is a generator expression (note the round brackets around it), defined here:
hex_read_file = (binascii.hexlify(bytes((i,))) for i in read_file2)
As many have already pointed out in the comments, generator expressions don't have a replace method the way strings do, so you have two possibilities, depending on your specific use case:
Turn the generator into a bytestring and call replace on that (considering how you use write_file.write(modified_file) afterwards, this is the option that works with that directly):
hex_read_file = b''.join(binascii.hexlify(bytes((int(i),))) for i in read_file2)  # note: the additional int() call fixes the issue mentioned in the comments; b''.join is needed because the generator yields bytes objects
Filter and replace directly in the comprehension (and modify how you write out the result):
def lbl_translate(files_to_amend, replacement_map):
    with open(files_to_amend, 'rb') as read_file:
        read_file2 = read_file.read()
    hex_read_file = (replacement_map.get(binascii.hexlify(bytes((int(i),))),
                                         binascii.hexlify(bytes((int(i),))))
                     for i in read_file2)  # see Note below
    with open(files_to_amend, 'wb') as write_file:
        for b in hex_read_file:
            write_file.write(b)
    print("Amended: " + files_to_amend)
where replacement_map is a dict that you fill with batch_to_amend as key and batch_amendment as value (you can specify multiple amendments too and it will work just the same). The call would then be:
for label in files_to_amend:
    lbl_translate(label, {batch_to_amend: batch_amendment})
NOTE: Using standard Python dicts, because of how generator expressions work, you need to call binascii.hexlify(bytes((int(i),))) twice here. A better option would be collections.defaultdict, if it were implemented in a more flexible way (see here for more context on why I say that): defaultdicts expect a zero-argument callable producing the value for unknown keys, so the key itself is not available. Instead you need to create your own subclass of dict and implement the __missing__ method to obtain the desired behaviour:
hex_read_file = (replacement_map[binascii.hexlify(bytes((int(i),)))] for i in read_file2)  # replacement_map is the dict subclass defined below
and you define replacement_map as:
class dict_with_key_as_default(dict):  # find a better name for the type
    def __missing__(self, key):
        '''If a value is not in the dictionary, return the key itself instead.'''
        return key

replacement_map = dict_with_key_as_default()
replacement_map[batch_to_amend] = batch_amendment
for label in files_to_amend:
    lbl_translate(label, replacement_map)
(class dict_with_key_as_default taken from this answer and renamed for clarity)
Edit note: As mentioned in the comments, the OP's original generator expression calls hexlify() on binary strings instead of integer values. The solution adds a cast to int for the bytes where relevant, but it's far from the best solution to this problem. Since the OP's intent is not clear, I left it as close to the original as possible, but an alternative solution should be used instead.
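For instance, if the goal is simply to swap one byte sequence for another inside each file, the hex round-trip can be skipped entirely by calling bytes.replace on the raw data. A minimal sketch (the lbl_translate name and in-place rewrite mirror the question; old and new are the raw byte sequences to find and substitute, and whether the hex/'00'-joiner encoding is actually needed depends on the file format):
def lbl_translate(path, old, new):
    # Read the raw bytes, substitute the sequence, and write the result back
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(data.replace(old, new))

lbl_translate('some_label_file.lbl', b'12345', b'67890')  # hypothetical file and values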

Python-based method for finding an object by name

I'd like some advice on how to better extract objects from self by name, such that the object is returned when the string naming its type is provided.
My working code is presented below.
import re

def makeObjectTypeList(self):
    objectList = []
    allObjects = dir(self)
    for att in allObjects:
        test = getattr(self, att)
        test1 = test.__class__.__name__
        test2 = len(re.split(self.objectType, test1))
        if test2 > 1:
            objectList.append(att)
    self.objectList = objectList
The value of self.objectType is the string 'QDial', and I return all instances of QDial objects from the method via re.split() and the check on the number of items in its result (i.e., len(re.split(...)) > 1).
My question is how to make this more compact, using 'enumerate' etc., in idiomatic Python. My code is general, so I can pass 'QLabel', 'QTabWidget', etc. in self.objectType and obtain all such type-matched instances. But it feels clunky, and I don't yet trap the case of a non-existent class type.
You can use a list comprehension that iterates through the attribute/value pairs of the dict returned by the vars function and keeps only those whose value's class name matches self.objectType:
def makeObjectTypeList(self):
    self.objectList = [k for k, v in vars(self).items()
                       if v.__class__.__name__ == self.objectType]
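If the actual class object is available rather than just its name, isinstance is more robust than comparing class-name strings, since it also matches subclasses. A sketch (the makeObjectTypeList2 name and cls parameter are hypothetical):
def makeObjectTypeList2(self, cls):
    # Keep the names of attributes whose values are instances of cls (or a subclass)
    self.objectList = [k for k, v in vars(self).items() if isinstance(v, cls)]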

Performing string transformations on buildbot build properties?

Is there a good way to perform string transformations on a property or source stamp attribute before using it in an Interpolate? We use slashes in our branch names, and I need to transform the slashes into dashes so I can use them in filenames.
That is, say I have the branch "feature/fix-all-the-things", accessible as Interpolate("%(prop:branch)s") or Interpolate("%(src::branch)s"). I would like to be able to transform it to "feature-fix-all-the-things" for some interpolations. Obviously, it needs to remain in its original form for selecting the appropriate branch from source control.
It turned out that I just needed to subclass Interpolate:
import re

from buildbot.process.properties import Interpolate


class InterpolateReplace(Interpolate):
    """Interpolate with regex replacements.

    This takes an additional argument, `patterns`, which is a list of
    dictionaries containing the keys "search" and "replace", corresponding to
    the `pattern` and `repl` arguments to `re.sub()`.
    """

    def __init__(self, fmtstring, patterns, *args, **kwargs):
        Interpolate.__init__(self, fmtstring, *args, **kwargs)
        self._patterns = patterns

    def _sub(self, s):
        for pattern in self._patterns:
            search = pattern['search']
            replace = pattern['replace']
            s = re.sub(search, replace, s)
        return s

    def getRenderingFor(self, props):
        props = props.getProperties()
        if self.args:
            d = props.render(self.args)
            d.addCallback(lambda args:
                          self._sub(self.fmtstring % tuple(args)))
            return d
        else:
            d = props.render(self.interpolations)
            d.addCallback(lambda res:
                          self._sub(self.fmtstring % res))
            return d
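It is used like Interpolate, with the extra patterns argument; e.g., to turn branch slashes into dashes when building a filename (the format string here is a hypothetical example):
filename = InterpolateReplace(
    "%(src::branch)s.tar.gz",
    patterns=[{'search': r'/', 'replace': '-'}],
)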
It looks like there's a newer, easier way to do this since buildbot v0.9.0 with Transform:
filename = util.Transform(
    lambda p: p.replace('/', '-'),
    util.Property('branch')
)

How can I apply a prefix to dictionary access?

I'm imitating the behavior of the ConfigParser module to write a highly specialized parser that exploits some well-defined structure in the configuration files for a particular application I work with. Several sections of the config file contain hundreds of variable and routine mappings prefixed with either Variable_ or Routine_, like this:
[Map.PRD]
Variable_FOO=LOC1
Variable_BAR=LOC2
Routine_FOO=LOC3
Routine_BAR=LOC4
...
[Map.SHD]
Variable_FOO=LOC1
Variable_BAR=LOC2
Routine_FOO=LOC3
Routine_BAR=LOC4
...
I'd like to maintain the basic structure of ConfigParser where each section is stored as a single dictionary, so users would still have access to the classic syntax:
config.content['Mappings']['Variable_FOO'] = 'LOC1'
but also be able to use a simplified API that drills down to this section:
config.vmapping('PRD')['FOO'] = 'LOC1'
config.vmapping('PRD')['BAR'] = 'LOC2'
config.rmapping('PRD')['FOO'] = 'LOC3'
config.rmapping('PRD')['BAR'] = 'LOC4'
Currently I'm implementing this by storing the section in a special subclass of dict to which I've added a prefix attribute. The variable and routine properties of the parser set the prefix attribute of the dict-like object to 'Variable_' or 'Routine_', and the modified __getitem__ and __setitem__ methods of the dict handle gluing the prefix together with the key to access the appropriate item. It's working, but involves a lot of boilerplate to implement all the associated niceties, like supporting iteration.
I suppose my ideal solution would be to dispense with the subclassed dict and have the variable and routine properties somehow present a "view" of the plain dict object underneath, without the prefixes.
Update
Here's the solution I implemented, largely based on @abarnet's answer:
import re

class MappingDict(object):
    def __init__(self, prefix, d):
        self.prefix, self.d = prefix, d

    def prefixify(self, name):
        return '{}_{}'.format(self.prefix, name)

    def __getitem__(self, name):
        name = self.prefixify(name)
        return self.d.__getitem__(name)

    def __setitem__(self, name, value):
        name = self.prefixify(name)
        return self.d.__setitem__(name, value)

    def __delitem__(self, name):
        name = self.prefixify(name)
        return self.d.__delitem__(name)

    def __iter__(self):
        return (key.partition('_')[-1] for key in self.d
                if key.startswith(self.prefix))

    def __repr__(self):
        # this wraps d rather than inheriting from dict, so repr the wrapped dict
        return 'MappingDict({})'.format(self.d)


class MyParser(object):
    SECTCRE = re.compile(r'\[(?P<header>[^]]+)\]')

    def __init__(self, filename):
        self.filename = filename
        self.content = {}
        lines = [x.strip() for x in open(filename).read().splitlines()
                 if x.strip()]
        for line in lines:
            match = re.match(self.SECTCRE, line)
            if match:
                section = match.group('header')
                self.content[section] = {}
            else:
                key, sep, value = line.partition('=')
                self.content[section][key] = value

    def write(self, filename):
        fp = open(filename, 'w')
        for section in sorted(self.content, key=sectionsort):
            fp.write("[%s]\n" % section)
            for key in sorted(self.content[section], key=cpfsort):
                value = str(self.content[section][key])
                fp.write("%s\n" % '='.join([key, value]))
            fp.write("\n")
        fp.close()

    def vmapping(self, nsp):
        section = 'Map.{}'.format(nsp)
        return MappingDict('Variable', self.content[section])

    def rmapping(self, nsp):
        section = 'Map.{}'.format(nsp)
        return MappingDict('Routine', self.content[section])
It's used like this:
config = MyParser('myfile.cfg')
vmap = config.vmapping('PRD')
vmap['FOO'] = 'LOC5'
vmap['BAR'] = 'LOC6'
config.write('newfile.cfg')
The resulting newfile.cfg reflects the LOC5 and LOC6 changes.
I don't think you want inheritance here. You end up with two separate dict objects which you have to create on load and then paste back together on save…
If that's acceptable, you don't even need to bother with the prefixing during normal operations; just do the prefixing while saving, like this:
class Config(object):
    def save(self):
        merged = {'variable_{}'.format(key): value for key, value
                  in self.variable_dict.items()}
        merged.update({'routine_{}'.format(key): value for key, value
                       in self.routine_dict.items()})
        # now save merged
If you want that merged object to be visible at all times, but don't expect it to be accessed very often, make it a @property.
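A sketch of that, reusing the variable_dict/routine_dict names from the save() example above:
class Config(object):
    @property
    def merged(self):
        # Rebuild the merged, prefixed view on each access
        d = {'variable_{}'.format(k): v for k, v in self.variable_dict.items()}
        d.update({'routine_{}'.format(k): v for k, v in self.routine_dict.items()})
        return d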
If you want to access the merged dictionary regularly, at the same time you're accessing the two sub-dictionaries, then yes, you want a view:
I suppose my ideal solution would be to dispense with the subclassed dict and have the global and routine properties somehow present a "view" of the plain dict object underneath without the prefixes.
This is going to be very hard to do with inheritance. Certainly not with inheritance from dict; inheritance from builtins.dict_items might work if you're using Python 3, but it still seems like a stretch.
But with delegation, it's easy. Each sub-dictionary just holds a reference to the parent dict:
class PrefixedDict(object):
    def __init__(self, prefix, d):
        self.prefix, self.d = prefix, d

    def prefixify(self, key):
        return '{}_{}'.format(self.prefix, key)

    def __getitem__(self, key):
        return self.d.__getitem__(self.prefixify(key))

    def __setitem__(self, key, value):
        return self.d.__setitem__(self.prefixify(key), value)

    def __delitem__(self, key):
        return self.d.__delitem__(self.prefixify(key))

    def __iter__(self):
        # strip the prefix plus the '_' separator added by prefixify
        return (key[len(self.prefix) + 1:] for key in self.d
                if key.startswith(self.prefix + '_'))
You don't get any of the dict methods for free that way, but that's a good thing, because they were mostly incorrect anyway, right? Explicitly delegate the ones you want. (If you do have some you want to pass through as-is, use __getattr__ for that.)
Besides being conceptually simpler and harder to screw up by accidentally forgetting to override something, this also means that PrefixedDict can work with any type of mapping, not just a dict.
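A quick usage sketch over a shared dict (hypothetical values):
d = {'Variable_FOO': 'LOC1', 'Routine_FOO': 'LOC3'}
vmap = PrefixedDict('Variable', d)
print(vmap['FOO'])        # LOC1
vmap['BAR'] = 'LOC2'      # stored as d['Variable_BAR']
print(list(vmap))         # ['FOO', 'BAR'] (order follows dict insertion)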
So, no matter which way you go, where and how do these objects get created?
The easy answer is that they're attributes that you create when you construct a Config:
def __init__(self):
    self.d = {}
    self.variable = PrefixedDict('Variable', self.d)
    self.routine = PrefixedDict('Routine', self.d)
If this needs to be dynamic (e.g., there can be an arbitrary set of prefixes), create them at load time:
def load(self):
    # load up self.d
    prefixes = set(key.split('_')[0] for key in self.d)
    for prefix in prefixes:
        setattr(self, prefix, PrefixedDict(prefix, self.d))
If you want to be able to create them on the fly (so config.newprefix['foo'] = 3 adds 'Newprefix_foo'), you can do this instead:
def __getattr__(self, name):
    return PrefixedDict(name.title(), self.d)
But once you're using dynamic attributes, you really have to question whether it isn't cleaner to use dictionary (item) syntax instead, like config['newprefix']['foo']. For one thing, that would actually let you call one of the sub-dictionaries 'global', as in your original question…
Or you can go the other way: build the dictionary syntax first, then use what's usually referred to as an attrdict (search ActiveState recipes and PyPI for 3000 implementations…), which automatically makes config.newprefix mean config['newprefix'], so you can use attribute syntax when you have valid identifiers but fall back to dictionary syntax when you don't.
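A minimal attrdict sketch, as one of those many implementations might look:
class AttrDict(dict):
    def __getattr__(self, name):
        # fall back to item access for attribute lookups
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)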
There are a couple of options for how to proceed.
The simplest might be to use nested dictionaries, so Variable_FOO becomes config["variable"]["FOO"]. You might want to use a defaultdict(dict) for the outer dictionary so you don't need to worry about initializing the inner ones when you add the first value to them.
Another option would be to use tuple keys in a single dictionary. That is, Variable_FOO would become config[("variable", "FOO")]. This is easy to do with code, since you can simply assign to config[tuple(some_string.split("_"))]. Though, I suppose you could also just use the unsplit string as your key in this case.
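Minimal sketches of both layouts (hypothetical values):
from collections import defaultdict

# nested dictionaries: Variable_FOO becomes config["variable"]["FOO"]
config = defaultdict(dict)
config["variable"]["FOO"] = "LOC1"

# tuple keys in a single dictionary: Variable_FOO becomes config2[("Variable", "FOO")]
config2 = {}
name, value = "Variable_FOO", "LOC1"
config2[tuple(name.split("_"))] = value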
A final approach allows you to use the syntax you want (where Variable_FOO is accessed as config.Variable["FOO"]), by using __getattr__ and a defaultdict behind the scenes:
from collections import defaultdict

class Config(object):
    def __init__(self):
        self._attrdicts = defaultdict(dict)

    def __getattr__(self, name):
        return self._attrdicts[name]
You could extend this with behavior for __setattr__ and __delattr__, but it's probably not necessary. The only serious limitation of this approach (given the original version of the question) is that the attribute names (like Variable) must be legal Python identifiers: you can't use strings with leading numbers, Python keywords (like global), or strings containing whitespace characters.
A downside to this approach is that it's a bit more difficult to use programmatically (by, for instance, your config-file parser). To read a value of Variable_FOO and save it to config.Variable["FOO"], you'll probably need to use the built-in getattr function, like this:
name, value = line.split("=")
prefix, suffix = name.split("_")
getattr(config, prefix)[suffix] = value

Adding comments to YAML produced with PyYaml

I'm creating Yaml documents from my own python objects using PyYaml.
for example my object:
class MyObj(object):
    name = "boby"
    age = 34
becomes:
boby:
  age: 34
So far so good.
But I have not found a way to programmatically add comments to the produced yaml so it will look like:
boby: # this is the name
  age: 34 # in years
Looking at PyYaml documentation and also at the code, I found no way of doing so.
Any suggestions?
You probably have some representer for the MyObj class, as by default dumping (print(yaml.dump(MyObj()))) with PyYAML will give you:
!!python/object:__main__.MyObj {}
PyYAML can only do one thing with the comments in your desired output: discard them. If you were to read that desired output back in, you would end up with a dict containing a dict ({'boby': {'age': 34}}); you would not get a MyObj() instance, because there is no tag information.
The enhanced version for PyYAML that I developed (ruamel.yaml) can read in YAML with comments, preserve the comments and write comments when dumping.
If you read your desired output, the resulting data will look (and act) like a dict containing a dict, but in reality there is a more complex data structure that can handle the comments. You can, however, create that structure when ruamel.yaml asks you to dump an instance of MyObj, and if you add the comments at that time, you will get your desired output.
from __future__ import print_function

import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap


class MyObj():
    name = "boby"
    age = 34

    def convert_to_yaml_struct(self):
        x = CommentedMap()
        a = CommentedMap()
        x[self.name] = a
        x.yaml_add_eol_comment('this is the name', 'boby', 11)
        a['age'] = self.age
        a.yaml_add_eol_comment('in years', 'age', 11)
        return x

    @staticmethod
    def yaml_representer(dumper, data, flow_style=False):
        assert isinstance(dumper, ruamel.yaml.RoundTripDumper)
        return dumper.represent_dict(data.convert_to_yaml_struct())


ruamel.yaml.RoundTripDumper.add_representer(MyObj, MyObj.yaml_representer)

ruamel.yaml.round_trip_dump(MyObj(), sys.stdout)
Which prints:
boby: # this is the name
  age: 34 # in years
There is no need to wait with creating the CommentedMap instances until you want to represent the MyObj instance. I would, e.g., make name and age into properties that get/set values from/on the appropriate CommentedMap. That way you could more easily add the comments before the yaml_representer static method is called to represent the MyObj instance.
Here is a solution I came up with; it's a bit complex, but less complex than ruamel, as it works entirely with the plain PyYAML API and does not round-trip comments (so it would not be an appropriate answer to this other question). It's probably not as robust overall, as I have not tested it extensively, but it seems good enough for my use case, which is that I want dicts/mappings to be able to have comments, both for the entire mapping and per item.
I believe that round-tripping comments (in this limited context) would also be possible with a similar approach, but I have not tried it, as it's not currently a use case I have.
Finally, while this solution does not implement adding per-item comments to items in lists/sequences (as this is not something I need at the moment), it could easily be extended to do so.
First, as in ruamel, we need a sort of CommentedMapping class, which associates comments with each key in a Mapping. There are many possible approaches to this; mine is just one:
from collections.abc import Mapping, MutableMapping

class CommentedMapping(MutableMapping):
    def __init__(self, d, comment=None, comments=None):
        self.mapping = d
        self.comment = comment
        # avoid a shared mutable default argument
        self.comments = comments if comments is not None else {}

    def get_comment(self, *path):
        if not path:
            return self.comment

        # Look the key up in self (recursively) and raise a
        # KeyError or other exception if such a key does not
        # exist in the nested structure
        sub = self.mapping
        for p in path:
            if isinstance(sub, CommentedMapping):
                # Subvert comment copying
                sub = sub.mapping[p]
            else:
                sub = sub[p]

        comment = None
        if len(path) == 1:
            comment = self.comments.get(path[0])
        if comment is None:
            comment = self.comments.get(path)
        return comment

    def __getitem__(self, item):
        val = self.mapping[item]
        if (isinstance(val, (dict, Mapping)) and
                not isinstance(val, CommentedMapping)):
            comment = self.get_comment(item)
            comments = {k[1:]: v for k, v in self.comments.items()
                        if isinstance(k, tuple) and len(k) > 1 and k[0] == item}
            val = self.__class__(val, comment=comment, comments=comments)
        return val

    def __setitem__(self, item, value):
        self.mapping[item] = value

    def __delitem__(self, item):
        del self.mapping[item]
        for k in list(self.comments):
            if k == item or (isinstance(k, tuple) and k and k[0] == item):
                del self.comments[k]

    def __iter__(self):
        return iter(self.mapping)

    def __len__(self):
        return len(self.mapping)

    def __repr__(self):
        return f'{type(self).__name__}({self.mapping}, comment={self.comment!r}, comments={self.comments})'
This class has both a .comment attribute, so that it can carry an overall comment for the mapping, and a .comments attribute containing per-key comments. It also allows adding comments for keys in nested dicts, by specifying the key path as a tuple. E.g. comments={('c', 'd'): 'comment'} allows specifying a comment for the key 'd' in the nested dict at 'c'. When getting items from CommentedMapping, if the item's value is a dict/Mapping, it is also wrapped in a CommentedMapping in such a way that preserves its comments. This is useful for recursive calls into the YAML representer for nested structures.
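A quick check of the path-based comment lookup (hypothetical data):
cm = CommentedMapping({'a': 1, 'c': {'d': 3}}, comment='top',
                      comments={'a': 'a comment', ('c', 'd'): 'd comment'})
print(cm.get_comment())          # top
print(cm.get_comment('a'))       # a comment
print(cm.get_comment('c', 'd'))  # d comment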
Next we need to implement a custom YAML Dumper, which takes care of the full process of serializing an object to YAML. A Dumper is a complicated class that's composed from four other classes: an Emitter, a Serializer, a Representer, and a Resolver. Of these we only have to implement the first three; Resolvers are more concerned with, e.g., how implicit scalars like 1 get resolved to the correct type, as well as determining the default tags for various values, and are not really involved here.
First we implement the representer. The representer is responsible for recognizing different Python types and mapping them to their appropriate nodes in the native YAML data structure/representation graph. Namely, these include scalar nodes, sequence nodes, and mapping nodes. For example, the base Representer class includes a representer for Python dicts which converts them to a MappingNode (each item in the dict in turn consists of a pair of ScalarNodes, one for the key and one for the value).
In order to attach comments to entire mappings, as well as to each key in a mapping, we introduce two new Node types which are not formally part of the YAML specification:
from yaml.node import Node, ScalarNode, MappingNode

class CommentedNode(Node):
    """Dummy base class for all nodes with attached comments."""

class CommentedScalarNode(ScalarNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None, style=None,
                 comment=None):
        super().__init__(tag, value, start_mark, end_mark, style)
        self.comment = comment

class CommentedMappingNode(MappingNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None,
                 flow_style=None, comment=None, comments={}):
        super().__init__(tag, value, start_mark, end_mark, flow_style)
        self.comment = comment
        self.comments = comments
We then add a CommentedRepresenter, which includes code for representing a CommentedMapping as a CommentedMappingNode. In fact, it just reuses the base class's code for representing a mapping, but converts the returned MappingNode to a CommentedMappingNode. It also converts each key from a ScalarNode to a CommentedScalarNode. We base it on SafeRepresenter here since I don't need serialization of arbitrary Python objects:
from yaml.representer import SafeRepresenter

class CommentedRepresenter(SafeRepresenter):
    def represent_commented_mapping(self, data):
        node = super().represent_dict(data)
        comments = {k: data.get_comment(k) for k in data}
        value = []
        for k, v in node.value:
            if k.value in comments:
                k = CommentedScalarNode(
                    k.tag, k.value,
                    k.start_mark, k.end_mark, k.style,
                    comment=comments[k.value])
            value.append((k, v))
        node = CommentedMappingNode(
            node.tag,
            value,
            flow_style=False,  # commented dicts must be in block style
            # this could be implemented differently for flow-style
            # maps, but for my case I only want block-style, and
            # it makes things much simpler
            comment=data.get_comment(),
            comments=comments
        )
        return node

    yaml_representers = SafeRepresenter.yaml_representers.copy()
    yaml_representers[CommentedMapping] = represent_commented_mapping
Next we need to implement a subclass of Serializer. The serializer is responsible for walking the representation graph of nodes and, for each node, outputting one or more events to the emitter, which is a complicated (and sometimes difficult to follow) state machine that receives a stream of events and outputs the appropriate YAML markup for each event (e.g. there is a MappingStartEvent which, when received, will output a { if it's a flow-style mapping, and/or add the appropriate level of indentation for subsequent output, up to the corresponding MappingEndEvent).
The point is, the new serializer must output events representing comments, so that the emitter can know when it needs to emit a comment. This is handled simply by adding a CommentEvent and emitting one every time a CommentedMappingNode or CommentedScalarNode is encountered in the representation:
from yaml import Event
from yaml.serializer import Serializer

class CommentEvent(Event):
    """
    Simple stream event representing a comment to be output to the stream.
    """
    def __init__(self, value, start_mark=None, end_mark=None):
        super().__init__(start_mark, end_mark)
        self.value = value

class CommentedSerializer(Serializer):
    def serialize_node(self, node, parent, index):
        if (node not in self.serialized_nodes and
                isinstance(node, CommentedNode) and
                not (isinstance(node, CommentedMappingNode) and
                     isinstance(parent, CommentedMappingNode))):
            # Emit CommentEvents, but only if the current node is not a
            # CommentedMappingNode nested in another CommentedMappingNode (in
            # which case we would have already emitted its comment via the
            # parent mapping)
            self.emit(CommentEvent(node.comment))
        super().serialize_node(node, parent, index)
Next, the Emitter needs to be subclassed to handle CommentEvents. This is perhaps the trickiest part, since, as I wrote above, the emitter is a bit complex and fragile, and written in such a way that it's difficult to modify the state machine (I am tempted to rewrite it more clearly, but don't have time right now). So I experimented with a number of different solutions.
The key method here is Emitter.emit, which processes the event stream and calls "state" methods that perform some action depending on what state the machine is in, which is in turn affected by what events appear in the stream. An important realization is that stream processing is suspended in many cases while waiting for more events to come in; this is what the Emitter.need_more_events method is responsible for. In some cases, before the current event can be handled, more events need to come in first. For example, in the case of MappingStartEvent at least 3 more events need to be buffered on the stream: the first key/value pair, and possibly the next key. The Emitter needs to know, before it can begin formatting a map, whether there are one or more items in the map, and possibly also the length of the first key/value pair. The number of events required before the current event can be handled is hard-coded in the need_more_events method.
The problem is that this does not account for the now-possible presence of CommentEvents on the event stream, which should not impact the processing of other events. Therefore we override the Emitter.need_events method to account for the presence of CommentEvents: e.g. if the current event is MappingStartEvent and there are 3 subsequent events buffered, if one of those is a CommentEvent we can't count it, so we'll need at minimum 4 events (in case the next one is one of the expected events in a mapping).
Finally, every time a CommentEvent is encountered on the stream, we forcibly break out of the current event processing loop to handle writing the comment, then pop the CommentEvent off the stream and continue as if nothing happened. This is the end result:
import textwrap

from yaml.emitter import Emitter

class CommentedEmitter(Emitter):
    def need_more_events(self):
        if self.events and isinstance(self.events[0], CommentEvent):
            # If the next event is a comment, always break out of the event
            # handling loop so that we divert it for comment handling
            return True
        return super().need_more_events()

    def need_events(self, count):
        # Hack-y: the minimal number of queued events needed to start
        # a block-level event is hard-coded, and does not account for
        # possible comment events, so here we increase the necessary
        # count for every comment event
        comments = [e for e in self.events if isinstance(e, CommentEvent)]
        return super().need_events(count + min(count, len(comments)))

    def emit(self, event):
        if self.events and isinstance(self.events[0], CommentEvent):
            # Write the comment, then pop it off the event stream and continue
            # as normal
            self.write_comment(self.events[0].value)
            self.events.pop(0)
        super().emit(event)

    def write_comment(self, comment):
        indent = self.indent or 0
        width = self.best_width - indent - 2  # 2 for the comment prefix '# '
        lines = ['# ' + line for line in textwrap.wrap(comment, width)]
        for line in lines:
            if self.encoding:
                line = line.encode(self.encoding)
            self.write_indent()
            self.stream.write(line)
            self.write_line_break()
I also experimented with different approaches to the implementation of write_comment. The Emitter base class has its own method (write_plain) which can handle writing text to the stream with appropriate indentation and line-wrapping. However, it's not quite flexible enough to handle something like comments, where each line needs to be prefixed with something like '# '. One technique I tried was monkey-patching the write_indent method to handle this case, but in the end it was too ugly. I found that simply using Python's built-in textwrap.wrap was sufficient for my case.
Next, we create the dumper by subclassing the existing SafeDumper but inserting our new classes into the MRO:
from yaml import SafeDumper
class CommentedDumper(CommentedEmitter, CommentedSerializer,
CommentedRepresenter, SafeDumper):
"""
Extension of `yaml.SafeDumper` that supports writing `CommentedMapping`s with
all comments output as YAML comments.
"""
Here's an example usage:
>>> import yaml
>>> d = CommentedMapping({
... 'a': 1,
... 'b': 2,
... 'c': {'d': 3},
... }, comment='my commented dict', comments={
... 'a': 'a comment',
... 'b': 'b comment',
... 'c': 'long string ' * 44,
... ('c', 'd'): 'd comment'
... })
>>> print(yaml.dump(d, Dumper=CommentedDumper))
# my commented dict
# a comment
a: 1
# b comment
b: 2
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string
c:
  # d comment
  d: 3
I still haven't tested this solution very extensively, and it likely still contains bugs. I'll update it as I use it more and find corner-cases, etc.
