Adding comments to YAML produced with PyYaml - python

I'm creating YAML documents from my own Python objects using PyYAML. For example, my object:

class MyObj(object):
    name = "boby"
    age = 34

becomes:

boby:
  age: 34

So far so good.
But I have not found a way to programmatically add comments to the produced YAML so that it looks like:

boby: # this is the name
  age: 34 # in years

Looking at the PyYAML documentation and also at the code, I found no way of doing so. Any suggestions?

You probably have some representer for the MyObj class, as by default dumping (print(yaml.dump(MyObj()))) with PyYAML will give you:

!!python/object:__main__.MyObj {}

PyYAML can only do one thing with the comments in your desired output: discard them. If you were to read that desired output back in, you would end up with a dict containing a dict ({'boby': {'age': 34}}); you would not get a MyObj() instance because there is no tag information.
The enhanced version of PyYAML that I developed (ruamel.yaml) can read in YAML with comments, preserve the comments, and write comments when dumping.
If you read your desired output, the resulting data will look (and act) like a dict containing a dict, but in reality there is a more complex data structure that can handle the comments. You can, however, create that structure when ruamel.yaml asks you to dump an instance of MyObj, and if you add the comments at that time, you will get your desired output.
from __future__ import print_function

import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap


class MyObj():
    name = "boby"
    age = 34

    def convert_to_yaml_struct(self):
        x = CommentedMap()
        a = CommentedMap()
        x[self.name] = a
        x.yaml_add_eol_comment('this is the name', 'boby', 11)
        a['age'] = self.age
        a.yaml_add_eol_comment('in years', 'age', 11)
        return x

    @staticmethod
    def yaml_representer(dumper, data, flow_style=False):
        assert isinstance(dumper, ruamel.yaml.RoundTripDumper)
        return dumper.represent_dict(data.convert_to_yaml_struct())


ruamel.yaml.RoundTripDumper.add_representer(MyObj, MyObj.yaml_representer)

ruamel.yaml.round_trip_dump(MyObj(), sys.stdout)
Which prints:
boby: # this is the name
  age: 34 # in years
There is no need to wait with creating the CommentedMap instances until you want to represent the MyObj instance. I would e.g. make name and age into properties that get/set values from/on the appropriate CommentedMap. That way you could more easily add the comments before the yaml_representer static method is called to represent the MyObj instance.
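To illustrate that last suggestion, here is a rough sketch (the property layout and method names are my own; it reuses the same ruamel.yaml calls as the code above):

import sys
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap

class MyObj(object):
    def __init__(self):
        # keep the data in comment-aware structures from the start
        self._inner = CommentedMap()
        self._inner['age'] = 34
        self._outer = CommentedMap()
        self._outer['boby'] = self._inner

    @property
    def name(self):
        # the single top-level key acts as the name
        return next(iter(self._outer))

    @property
    def age(self):
        return self._inner['age']

    @age.setter
    def age(self, value):
        self._inner['age'] = value

    def add_comments(self):
        # comments can now be attached at any time, not only while dumping
        self._outer.yaml_add_eol_comment('this is the name', self.name, 11)
        self._inner.yaml_add_eol_comment('in years', 'age', 11)

    @staticmethod
    def yaml_representer(dumper, data, flow_style=False):
        return dumper.represent_dict(data._outer)

ruamel.yaml.RoundTripDumper.add_representer(MyObj, MyObj.yaml_representer)

obj = MyObj()
obj.add_comments()
ruamel.yaml.round_trip_dump(obj, sys.stdout)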

Here is a solution I came up with; it's a bit complex, but less complex than ruamel, as it works entirely with the plain PyYAML API and does not round-trip comments (so it would not be an appropriate answer to this other question). It's probably not as robust overall yet, as I have not tested it extensively, but it seems good enough for my use case, which is that I want dicts/mappings to be able to have comments, both for the entire mapping and per-item.
I believe that round-tripping comments--in this limited context--would also be possible with a similar approach, but I have not tried it, as it's not currently a use case I have.
Finally, while this solution does not implement adding per-item comments to items in lists/sequences (as this is not something I need at the moment), it could easily be extended to do so.
First, as in ruamel, we need a sort of CommentedMapping class, which associates comments with each key in a Mapping. There are many possible approaches to this; mine is just one:
from collections.abc import Mapping, MutableMapping


class CommentedMapping(MutableMapping):
    def __init__(self, d, comment=None, comments={}):
        self.mapping = d
        self.comment = comment
        self.comments = comments

    def get_comment(self, *path):
        if not path:
            return self.comment

        # Look the key up in self (recursively) and raise a
        # KeyError or other exception if such a key does not
        # exist in the nested structure
        sub = self.mapping
        for p in path:
            if isinstance(sub, CommentedMapping):
                # Subvert comment copying
                sub = sub.mapping[p]
            else:
                sub = sub[p]

        comment = None
        if len(path) == 1:
            comment = self.comments.get(path[0])
        if comment is None:
            comment = self.comments.get(path)
        return comment

    def __getitem__(self, item):
        val = self.mapping[item]
        if (isinstance(val, (dict, Mapping)) and
                not isinstance(val, CommentedMapping)):
            comment = self.get_comment(item)
            comments = {k[1:]: v for k, v in self.comments.items()
                        if isinstance(k, tuple) and len(k) > 1 and k[0] == item}
            val = self.__class__(val, comment=comment, comments=comments)
        return val

    def __setitem__(self, item, value):
        self.mapping[item] = value

    def __delitem__(self, item):
        del self.mapping[item]
        for k in list(self.comments):
            if k == item or (isinstance(k, tuple) and k and k[0] == item):
                del self.comments[k]

    def __iter__(self):
        return iter(self.mapping)

    def __len__(self):
        return len(self.mapping)

    def __repr__(self):
        return f'{type(self).__name__}({self.mapping}, comment={self.comment!r}, comments={self.comments})'
This class has both a .comment attribute, so that it can carry an overall comment for the mapping, and a .comments attribute containing per-key comments. It also allows adding comments for keys in nested dicts, by specifying the key path as a tuple. E.g. comments={('c', 'd'): 'comment'} allows specifying a comment for the key 'd' in the nested dict at 'c'. When getting items from CommentedMapping, if the item's value is a dict/Mapping, it is also wrapped in a CommentedMapping in such a way that preserves its comments. This is useful for recursive calls into the YAML representer for nested structures.
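For illustration, this is how the class behaves (the keys and comment strings below are made up for the example):

d = CommentedMapping(
    {'a': 1, 'c': {'d': 3}},
    comment='top-level comment',
    comments={'a': 'a comment', ('c', 'd'): 'd comment'})

print(d.get_comment())           # 'top-level comment'
print(d.get_comment('a'))        # 'a comment'
print(d.get_comment('c', 'd'))   # 'd comment'
nested = d['c']                  # comes back wrapped as a CommentedMapping
print(nested.get_comment('d'))   # 'd comment'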
Next we need to implement a custom YAML Dumper, which takes care of the full process of serializing an object to YAML. A Dumper is a complicated class that's composed from four other classes: an Emitter, a Serializer, a Representer, and a Resolver. Of these we only have to implement the first three; Resolvers are more concerned with, e.g., how implicit scalars like 1 get resolved to the correct type, as well as determining the default tags for various values, and are not really involved here.
First we implement the representer side. The representer is responsible for recognizing different Python types and mapping them to their appropriate nodes in the native YAML data structure/representation graph. Namely, these include scalar nodes, sequence nodes, and mapping nodes. For example, the base Representer class includes a representer for Python dicts which converts them to a MappingNode (each item in the dict in turn consists of a pair of ScalarNodes, one for each key and one for each value).
In order to attach comments to entire mappings, as well as to each key in a mapping, we introduce two new Node types which are not formally part of the YAML specification:
from yaml.node import Node, ScalarNode, MappingNode


class CommentedNode(Node):
    """Dummy base class for all nodes with attached comments."""


class CommentedScalarNode(ScalarNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None, style=None,
                 comment=None):
        super().__init__(tag, value, start_mark, end_mark, style)
        self.comment = comment


class CommentedMappingNode(MappingNode, CommentedNode):
    def __init__(self, tag, value, start_mark=None, end_mark=None,
                 flow_style=None, comment=None, comments={}):
        super().__init__(tag, value, start_mark, end_mark, flow_style)
        self.comment = comment
        self.comments = comments
We then add a CommentedRepresenter which includes code for representing a CommentedMapping as a CommentedMappingNode. In fact, it just reuses the base class's code for representing a mapping, but converts the returned MappingNode to a CommentedMappingNode. It also converts each key from a ScalarNode to a CommentedScalarNode. We base it on SafeRepresenter here since I don't need serialization of arbitrary Python objects:
from yaml.representer import SafeRepresenter


class CommentedRepresenter(SafeRepresenter):
    def represent_commented_mapping(self, data):
        node = super().represent_dict(data)
        comments = {k: data.get_comment(k) for k in data}
        value = []
        for k, v in node.value:
            if k.value in comments:
                k = CommentedScalarNode(
                    k.tag, k.value,
                    k.start_mark, k.end_mark, k.style,
                    comment=comments[k.value])
            value.append((k, v))
        node = CommentedMappingNode(
            node.tag,
            value,
            flow_style=False,  # commented dicts must be in block style
            # this could be implemented differently for flow-style
            # maps, but for my case I only want block-style, and
            # it makes things much simpler
            comment=data.get_comment(),
            comments=comments
        )
        return node

    yaml_representers = SafeRepresenter.yaml_representers.copy()
    yaml_representers[CommentedMapping] = represent_commented_mapping
Next we need to implement a subclass of Serializer. The serializer is responsible for walking the representation graph of nodes and, for each node, outputting one or more events to the emitter. The emitter is a complicated (and sometimes difficult to follow) state machine which receives a stream of events and outputs the appropriate YAML markup for each event (e.g. there is a MappingStartEvent which, when received, will output a { if it's a flow-style mapping, and/or add the appropriate level of indentation for subsequent output, up to the corresponding MappingEndEvent).
Point being, the new serializer must output events representing comments, so that the emitter can know when it needs to emit a comment. This is handled simply by adding a CommentEvent and emitting one every time a CommentedMappingNode or CommentedScalarNode is encountered in the representation:
from yaml import Event
from yaml.serializer import Serializer


class CommentEvent(Event):
    """
    Simple stream event representing a comment to be output to the stream.
    """
    def __init__(self, value, start_mark=None, end_mark=None):
        super().__init__(start_mark, end_mark)
        self.value = value


class CommentedSerializer(Serializer):
    def serialize_node(self, node, parent, index):
        if (node not in self.serialized_nodes and
                isinstance(node, CommentedNode) and
                not (isinstance(node, CommentedMappingNode) and
                     isinstance(parent, CommentedMappingNode))):
            # Emit CommentEvents, but only if the current node is not a
            # CommentedMappingNode nested in another CommentedMappingNode (in
            # which case we would have already emitted its comment via the
            # parent mapping)
            self.emit(CommentEvent(node.comment))

        super().serialize_node(node, parent, index)
Next, the Emitter needs to be subclassed to handle CommentEvents. This is perhaps the trickiest part, since, as I wrote above, the emitter is a bit complex and fragile, and written in such a way that it's difficult to modify the state machine (I am tempted to rewrite it more clearly, but don't have time right now). So I experimented with a number of different solutions.
The key method here is Emitter.emit, which processes the event stream and calls "state" methods that perform some action depending on what state the machine is in, which is in turn affected by what events appear in the stream. An important realization is that the stream processing is suspended in many cases while waiting for more events to come in--this is what the Emitter.need_more_events method is responsible for. In some cases, before the current event can be handled, more events need to come in first. For example, in the case of MappingStartEvent at least 3 more events need to be buffered on the stream: the first key/value pair, and possibly the next key. The Emitter needs to know, before it can begin formatting a map, if there are one or more items in the map, and possibly also the length of the first key/value pair. The number of events required before the current event can be handled is hard-coded in the need_more_events method.
The problem is that this does not account for the now possible presence of CommentEvents on the event stream, which should not impact processing of other events. Therefore we override the Emitter.need_events method to account for the presence of CommentEvents. E.g. if the current event is MappingStartEvent, and there are 3 subsequent events buffered, and one of those is a CommentEvent, we can't count it, so we'll need at a minimum 4 events (in case the next one is one of the expected events in a mapping).
Finally, every time a CommentEvent is encountered on the stream, we forcibly break out of the current event processing loop to handle writing the comment, then pop the CommentEvent off the stream and continue as if nothing happened. This is the end result:
import textwrap
from yaml.emitter import Emitter


class CommentedEmitter(Emitter):
    def need_more_events(self):
        if self.events and isinstance(self.events[0], CommentEvent):
            # If the next event is a comment, always break out of the event
            # handling loop so that we divert it for comment handling
            return True
        return super().need_more_events()

    def need_events(self, count):
        # Hack-y: the minimal number of queued events needed to start
        # a block-level event is hard-coded, and does not account for
        # possible comment events, so here we increase the necessary
        # count for every comment event
        comments = [e for e in self.events if isinstance(e, CommentEvent)]
        return super().need_events(count + min(count, len(comments)))

    def emit(self, event):
        if self.events and isinstance(self.events[0], CommentEvent):
            # Write the comment, then pop it off the event stream and continue
            # as normal
            self.write_comment(self.events[0].value)
            self.events.pop(0)
        super().emit(event)

    def write_comment(self, comment):
        indent = self.indent or 0
        width = self.best_width - indent - 2  # 2 for the comment prefix '# '
        lines = ['# ' + line for line in textwrap.wrap(comment, width)]
        for line in lines:
            if self.encoding:
                line = line.encode(self.encoding)
            self.write_indent()
            self.stream.write(line)
            self.write_line_break()
I also experimented with different approaches to the implementation of write_comment. The Emitter base class has its own method (write_plain) which can handle writing text to the stream with appropriate indentation and line-wrapping. However, it's not quite flexible enough to handle something like comments, where each line needs to be prefixed with something like '# '. One technique I tried was monkey-patching the write_indent method to handle this case, but in the end it was too ugly. I found that simply using Python's built-in textwrap.wrap was sufficient for my case.
Next, we create the dumper by subclassing the existing SafeDumper but inserting our new classes into the MRO:
from yaml import SafeDumper
class CommentedDumper(CommentedEmitter, CommentedSerializer,
CommentedRepresenter, SafeDumper):
"""
Extension of `yaml.SafeDumper` that supports writing `CommentedMapping`s with
all comments output as YAML comments.
"""
Here's an example usage:
>>> import yaml
>>> d = CommentedMapping({
...     'a': 1,
...     'b': 2,
...     'c': {'d': 3},
... }, comment='my commented dict', comments={
...     'a': 'a comment',
...     'b': 'b comment',
...     'c': 'long string ' * 44,
...     ('c', 'd'): 'd comment'
... })
>>> print(yaml.dump(d, Dumper=CommentedDumper))
# my commented dict
# a comment
a: 1
# b comment
b: 2
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string long string long
# string long string long string long string long string long string long string
# long string long string long string long string long string
c:
  # d comment
  d: 3
I still haven't tested this solution very extensively, and it likely still contains bugs. I'll update it as I use it more and find corner-cases, etc.

Related

Is there a pythonic way to log a function with extra values without bloating its parameter list?

I am currently struggling to log a function in Python in a clean way.
Assume I want to log a function which has an obscure list_of_numbers as an argument.

def function_to_log(list_of_numbers):
    # manipulate the values in list_of_numbers ...
    return some_result_computed_from_list_of_numbers
When the values in list_of_numbers are manipulated in the above function, I would like to log that change, but not with the value in the obscure list_of_numbers; instead, with the value at the same index in a second list, list_with_names_for_log.
The thing that annoys me: now I also have to pass in list_with_names_for_log, which bloats the argument list of my function,
def function_to_log(list_of_numbers, list_with_names_for_log):
    # do some stuff like, change value on 3rd index:
    list_of_numbers[3] = 17.4
    log.info('You have changed {} to 17.4'.format(list_with_names_for_log[3]))
    # and so on ...
    return some_result_computed_from_list_of_numbers
I use multiple of these lists exclusively for logging in this function.
Does anybody have an idea how to make this a little cleaner?
Provided it makes sense for the data to be grouped, I'd group the name/data pairs in a structure. What you currently have is essentially "parallel lists", which are typically a smell unless you're in a language where they're your only option.
A simple dataclass can be introduced:
from dataclasses import dataclass

@dataclass
class NamedData:
    name: str
    data: int
Then:
def function_to_log(pairs):
    pairs[3].data = 17.4
    log.info('You have changed {} to 17.4'.format(pairs[3].name))
    # and so on ...
    return some_result_computed_from_list_of_numbers
As a sample of data:
pairs = [NamedData("Some Name", 1), NamedData("Some other name", 2)]
And if you have two separate lists, it's simple to adapt:
pairs = [NamedData(name, data) for name, data in zip(names, data_list)]
Only do this, though, if the name and data are both needed in most places where either is used; that's when grouping actually cleans up the code. Otherwise, you're just introducing overhead and bloat elsewhere to clean up a few calls.

Delay creating python dict until set/update

I have a Python dict-of-dict structure with a large number of outer-dict keys (millions to billions). The inner dicts are mostly empty, but can store key-value pairs. Currently I create a separate dict as each of the inner dicts. But it uses a lot of memory that I don't end up using. Each empty dict is small, but I have a lot of them. I'd like to delay creating the inner dict until needed.
Ideally, I'd like to even delay creating the inner dict until a key-value pair is set in the inner dict. I envision using a single DelayDict object for ALL outer-dict values. This object would act like an empty dict for get and getitem calls, but as soon as a setitem or update call comes in it would create an empty dict to take its place. I run into trouble having the delaydict object know how to connect the new empty dict with the dict-of-dict structure.
class DelayDict(object):  # can do much more - only showing get/set
    def __init__(self, dod):
        self.dictofdict = dod  # the outer dict

    def __getitem__(self, key):
        raise KeyError(key)

    def __setitem__(self, key, value):
        replacement = {key: value}
        # replace myself in the outer dict!!
        self.dictofdict[?????] = replacement
I can't think of how to store the new replacement dict in the dict-of-dict structure so that it replaces the DelayDict class as the inner dict. I know properties can do similar things, but I believe the same fundamental trouble arises when I try to replace myself inside the outer dict.
Old question, but I came across a similar problem. I'm not sure that it's a good idea to try to spare some memory this way, but if you really need to do that, you should try to build your own data structure.
If you are stuck with the dict of dicts, here's a solution.
First, you need a way to create keys in the OuterDict without a value ({} by default). If OuterDict is a wrapper around a dict __d:

def create(self, key):
    self.__d[key] = None
How much memory will you spare?
>>> import sys
>>> a = {}
>>> sys.getsizeof(a)
136
As you pointed out, None is created only once, but you have to keep a reference to it. In CPython (64 bits), that's 8 bytes. For 1 billion elements, you spare (136 - 8) * 10**9 bytes = 128 GB (and not Mb, thanks!). You need to hand out a placeholder when someone asks for the value. The placeholder keeps track of the outer dict and the key in the outer dict. It wraps a dict and assigns this dict to outer[key] when you assign a value.
No more talking, code:
class OuterDict():
    def __init__(self):
        self.__d = {}

    def __getitem__(self, key):
        v = self.__d[key]
        if v is None:  # an orphan key
            v = PlaceHolder(self.__d, key)
        return v

    def create(self, key):
        self.__d[key] = None


class PlaceHolder():
    def __init__(self, parent, key):
        self.__parent = parent
        self.__key = key
        self.__d = {}

    def __getitem__(self, key):
        return self.__d[key]

    def __setitem__(self, key, value):
        if not self.__d:
            self.__parent[self.__key] = self.__d  # copy me in the outer dict
        self.__d[key] = value

    def __repr__(self):
        return repr("PlaceHolder for " + str(self.__d))

    # __len__, ...
A test:
o = OuterDict()
o.create("a")  # a is empty
print(o["a"])
try:
    o["a"]["b"]  # KeyError
except KeyError as e:
    print("KeyError", e)
o["a"]["b"] = 2
print(o["a"])

# output:
# 'PlaceHolder for {}'
# KeyError 'b'
# {'b': 2}
Why doesn't it use much memory? Because you are not building billions of placeholders. You release them when you don't need them anymore. Maybe you will need just one at a time.
Possible improvements: you can create a pool of PlaceHolders. A stack may be a good data structure: recently created placeholders are likely to be released soon. When you need a new PlaceHolder, you look into the stack, and if a placeholder has only one reference left (sys.getrefcount(ph) == 1), you can reuse it. To speed up the process, when you are looking for a free placeholder, you can remember the placeholder with the maximum refcount. You then switch the free placeholder with this "max refcount" placeholder. Hence, the placeholders with the maximum refcount are sent to the bottom of the stack.
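A rough sketch of that pooling idea (the class name, pool size, and refcount threshold below are my own guesses, building on the PlaceHolder class above, and untested):

import sys

class PlaceHolderPool:
    def __init__(self, size=8):
        # a small stack of reusable placeholders
        self._stack = [PlaceHolder(None, None) for _ in range(size)]

    def acquire(self, parent, key):
        for ph in self._stack:
            # references at this point: the list slot, the loop variable,
            # and getrefcount's own argument, so 3 means nobody else holds it
            if sys.getrefcount(ph) <= 3:
                ph.__init__(parent, key)  # recycle the free placeholder
                return ph
        # pool exhausted: fall back to a fresh instance
        return PlaceHolder(parent, key)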

Roundtrip leading 0s of hexadecimal numbers

I want to load a YAML file containing 32-bit hexadecimal numbers and keep the leading 0s, so that each number is always in the form 0xXXXXXXXX.
I have created a custom class and representer so that dumping hexadecimal numbers in this form is possible:
class HexWInt(int):
    pass

def represent_HexWInt(self, data):
    # type: (Any) -> Any
    return self.represent_scalar(u'tag:yaml.org,2002:int',
                                 '0x' + format(data, 'x').upper().zfill(8))

yaml.RoundTripRepresenter.add_representer(HexWInt, represent_HexWInt)
However, I cannot find a proper way to apply this format to roundtripped hexadecimal numbers.
Indeed, the following:
yamltext = "hexa: 0x0123ABCD"
code = yaml.round_trip_load(yamltext)
yaml.dump(code, sys.stdout, Dumper=yaml.RoundTripDumper)
Displays
hexa: 0x123ABCD
Whereas I would like it to be displayed as:
hexa: 0x0123ABCD
How can I proceed to force hexadecimal numbers to fit the 0xXXXXXXXX format?
There are multiple ways to do what you want. If you don't want to influence the normal behaviour of the parser, you should subclass the RoundTripLoader and RoundTripDumper, giving them an alternative RoundTripConstructor and RoundTripRepresenter. But that requires registering all constructors and representers, and is quite verbose.
If you don't care about being able to load other YAML documents with hex scalar integers that have leading zeros using the original functionality later on in your program, you can just add a new constructor and representer to the RoundTripConstructor and RoundTripRepresenter.
The easiest part is to get your format, based on a value and a width. You don't need zfill() nor upper() for that if you are using format anyway:
'0x{:0{}X}'.format(value, width)
does the job.
The main reason that your code doesn't work is because your code never constructs a HexWInt, as the RoundTripLoader doesn't know that it should do so. I would also not hard code the width to eight, but derive it from the input (using len()), and preserve that.
import sys
import ruamel.yaml


class HexWInt(ruamel.yaml.scalarint.ScalarInt):
    def __new__(cls, value, width):
        x = ruamel.yaml.scalarint.ScalarInt.__new__(cls, value)
        x._width = width  # keep the original width
        return x


def alt_construct_yaml_int(constructor, node):
    # check for hex integers starting with 0x0
    value_s = ruamel.yaml.compat.to_str(constructor.construct_scalar(node))
    if not value_s.startswith('0x0'):
        return constructor.construct_yaml_int(node)
    return HexWInt(int(value_s[2:], 16), len(value_s[2:]))

ruamel.yaml.constructor.RoundTripConstructor.add_constructor(
    u'tag:yaml.org,2002:int', alt_construct_yaml_int)


def represent_hexw_int(representer, data):
    return representer.represent_scalar(u'tag:yaml.org,2002:int',
                                        '0x{:0{}X}'.format(data, data._width))

ruamel.yaml.representer.RoundTripRepresenter.add_representer(HexWInt, represent_hexw_int)

yaml_text = """\
hexa: 0x0123ABCD
hexb: 0x02AD
"""

yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_text)
data['hexc'] = HexWInt(0xa1, 8)
yaml.dump(data, sys.stdout)
HexWInt stores both value and width. alt_construct_yaml_int passes everything to the original construct_yaml_int except for the case where the scalar starts with 0x0. It is registered with add_constructor() based on the normal regex based matching done by the Resolver. The representer combines the value and width back into a string. The output of the above is:
hexa: 0x0123ABCD
hexb: 0x02AD
hexc: 0x000000A1
Please note that you cannot do something like:
data['hexb'] -= 3
as ScalarInt (which does have the method __isub__) doesn't know about the width attribute. For the above to work, you'll have to implement the appropriate methods, like ScalarInt does, as methods on HexWInt. E.g.:
def __isub__(self, a):
    return HexWInt(self - a, self._width)
An enhanced version of the above (which also preserves _ in integers and supports octal and binary integers) is incorporated in ruamel.yaml>=0.14.7
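If that remark applies to the ruamel.yaml version you have installed, a plain round trip should then preserve the width without any of the code above (illustrative sketch based on that claim, not verified against a specific release):

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load("hexa: 0x0123ABCD\nhexb: 0x02AD\n")
yaml.dump(data, sys.stdout)
# expected:
# hexa: 0x0123ABCD
# hexb: 0x02AD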

How can I apply a prefix to dictionary access?

I'm imitating the behavior of the ConfigParser module to write a highly specialized parser that exploits some well-defined structure in the configuration files for a particular application I work with. Several sections of the config file contain hundreds of variable and routine mappings prefixed with either Variable_ or Routine_, like this:
[Map.PRD]
Variable_FOO=LOC1
Variable_BAR=LOC2
Routine_FOO=LOC3
Routine_BAR=LOC4
...
[Map.SHD]
Variable_FOO=LOC1
Variable_BAR=LOC2
Routine_FOO=LOC3
Routine_BAR=LOC4
...
I'd like to maintain the basic structure of ConfigParser where each section is stored as a single dictionary, so users would still have access to the classic syntax:
config.content['Mappings']['Variable_FOO'] = 'LOC1'
but also be able to use a simplified API that drills down to this section:
config.vmapping('PRD')['FOO'] = 'LOC1'
config.vmapping('PRD')['BAR'] = 'LOC2'
config.rmapping('PRD')['FOO'] = 'LOC3'
config.rmapping('PRD')['BAR'] = 'LOC4'
Currently I'm implementing this by storing each section in a special subclass of dict to which I've added a prefix attribute. The variable and routine properties of the parser set the prefix attribute of the dict-like object to 'Variable_' or 'Routine_', and then the modified __getitem__ and __setitem__ methods of the dict handle gluing the prefix together with the key to access the appropriate item. It's working, but involves a lot of boilerplate to implement all the associated niceties like supporting iteration.
I suppose my ideal solution would be to dispense with the subclassed dict and have the variable and routine properties somehow present a "view" of the plain dict object underneath without the prefixes.
Update
Here's the solution I implemented, largely based on @abarnet's answer:
import re

class MappingDict(object):
    def __init__(self, prefix, d):
        self.prefix, self.d = prefix, d

    def prefixify(self, name):
        return '{}_{}'.format(self.prefix, name)

    def __getitem__(self, name):
        name = self.prefixify(name)
        return self.d.__getitem__(name)

    def __setitem__(self, name, value):
        name = self.prefixify(name)
        return self.d.__setitem__(name, value)

    def __delitem__(self, name):
        name = self.prefixify(name)
        return self.d.__delitem__(name)

    def __iter__(self):
        return (key.partition('_')[-1] for key in self.d
                if key.startswith(self.prefix))

    def __repr__(self):
        return 'MappingDict({})'.format(repr(self.d))


class MyParser(object):
    SECTCRE = re.compile(r'\[(?P<header>[^]]+)\]')

    def __init__(self, filename):
        self.filename = filename
        self.content = {}
        lines = [x.strip() for x in open(filename).read().splitlines()
                 if x.strip()]
        for line in lines:
            match = re.match(self.SECTCRE, line)
            if match:
                section = match.group('header')
                self.content[section] = {}
            else:
                key, sep, value = line.partition('=')
                self.content[section][key] = value

    def write(self, filename):
        fp = open(filename, 'w')
        for section in sorted(self.content, key=sectionsort):
            fp.write("[%s]\n" % section)
            for key in sorted(self.content[section], key=cpfsort):
                value = str(self.content[section][key])
                fp.write("%s\n" % '='.join([key, value]))
            fp.write("\n")
        fp.close()

    def vmapping(self, nsp):
        section = 'Map.{}'.format(nsp)
        return MappingDict('Variable', self.content[section])

    def rmapping(self, nsp):
        section = 'Map.{}'.format(nsp)
        return MappingDict('Routine', self.content[section])
It's used like this:
config = MyParser('myfile.cfg')
vmap = config.vmapping('PRD')
vmap['FOO'] = 'LOC5'
vmap['BAR'] = 'LOC6'
config.write('newfile.cfg')
The resulting newfile.cfg reflects the LOC5 and LOC6 changes.
I don't think you want inheritance here. You end up with two separate dict objects which you have to create on load and then paste back together on save…
If that's acceptable, you don't even need to bother with the prefixing during normal operations; just do the prefixing while saving, like this:
class Config(object):
    def save(self):
        merged = {'variable_{}'.format(key): value for key, value
                  in self.variable_dict.items()}
        merged.update({'routine_{}'.format(key): value for key, value
                       in self.routine_dict.items()})
        # now save merged
If you want that merged object to be visible at all times, but don't expect it to be accessed very often, make it a @property.
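A minimal sketch of that @property variant (the variable_dict/routine_dict attribute names are assumptions, matching the save() example above):

class Config(object):
    def __init__(self):
        self.variable_dict = {}
        self.routine_dict = {}

    @property
    def merged(self):
        # rebuilt on every access; fine if it isn't read often
        merged = {'variable_{}'.format(k): v
                  for k, v in self.variable_dict.items()}
        merged.update({'routine_{}'.format(k): v
                       for k, v in self.routine_dict.items()})
        return merged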
If you want to access the merged dictionary regularly, at the same time you're accessing the two sub-dictionaries, then yes, you want a view:
I suppose my ideal solution would be to dispense with the subclassed dict and have the global and routine properties somehow present a "view" of the plain dict object underneath without the prefixes.
This is going to be very hard to do with inheritance. Certainly not with inheritance from dict; inheritance from builtins.dict_items might work if you're using Python 3, but it still seems like a stretch.
But with delegation, it's easy. Each sub-dictionary just holds a reference to the parent dict:
class PrefixedDict(object):
    def __init__(self, prefix, d):
        self.prefix, self.d = prefix, d

    def prefixify(self, key):
        return '{}_{}'.format(self.prefix, key)

    def __getitem__(self, key):
        return self.d.__getitem__(self.prefixify(key))

    def __setitem__(self, key, value):
        return self.d.__setitem__(self.prefixify(key), value)

    def __delitem__(self, key):
        return self.d.__delitem__(self.prefixify(key))

    def __iter__(self):
        return (key[len(self.prefix):] for key in self.d
                if key.startswith(self.prefix))
You don't get any of the dict methods for free that way—but that's a good thing, because they were mostly incorrect anyway, right? Explicitly delegate the ones you want. (If you do have some you want to pass through as-is, use __getattr__ for that.)
Besides being conceptually simpler and harder to screw up through accidentally forgetting to override something, this also means that PrefixedDict can work with any type of mapping, not just a dict.
So, no matter which way you go, where and how do these objects get created?
The easy answer is that they're attributes that you create when you construct a Config:
def __init__(self):
    self.d = {}
    self.variable = PrefixedDict('Variable', self.d)
    self.routine = PrefixedDict('Routine', self.d)
If this needs to be dynamic (e.g., there can be an arbitrary set of prefixes), create them at load time:
def load(self):
    # load up self.d
    prefixes = set(key.split('_')[0] for key in self.d)
    for prefix in prefixes:
        setattr(self, prefix, PrefixedDict(prefix, self.d))
If you want to be able to create them on the fly (so config.newprefix['foo'] = 3 adds 'Newprefix_foo'), you can do this instead:
def __getattr__(self, name):
    return PrefixedDict(name.title(), self.d)
But once you're using dynamic attributes, you really have to question whether it isn't cleaner to use dictionary (item) syntax instead, like config['newprefix']['foo']. For one thing, that would actually let you call one of the sub-dictionaries 'global', as in your original question…
Or you can first build the dictionary syntax, then use what's usually referred to as an attrdict (search ActiveState recipes and PyPI for 3000 implementations…), which lets you automatically make config.newprefix mean config['newprefix'], so you can use attribute syntax when you have valid identifiers, but fall back to dictionary syntax when you don't.
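As one minimal sketch of such an attrdict (there are many variants; this one is my own, not from a particular recipe):

class AttrDict(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

config = AttrDict()
config.newprefix = {'foo': 3}    # attribute syntax for valid identifiers
config['global'] = {'foo': 1}    # dictionary syntax for keywords and the rest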
There are a couple of options for how to proceed.
The simplest might be to use nested dictionaries, so Variable_FOO becomes config["variable"]["FOO"]. You might want to use a defaultdict(dict) for the outer dictionary so you don't need to worry about initializing the inner ones when you add the first value to them.
Another option would be to use tuple keys in a single dictionary. That is, Variable_FOO would become config[("variable", "FOO")]. This is easy to do with code, since you can simply assign to config[tuple(some_string.split("_"))]. Though, I suppose you could also just use the unsplit string as your key in this case.
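A quick illustration of both options (the example values are made up):

from collections import defaultdict

# nested dictionaries
config = defaultdict(dict)
config["variable"]["FOO"] = "LOC1"

# tuple keys in a single dictionary
flat = {}
flat[tuple("Variable_FOO".split("_"))] = "LOC1"   # key is ("Variable", "FOO")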
A final approach allows you to use the syntax you want (where Variable_FOO is accessed as config.Variable["FOO"]), by using __getattr__ and a defaultdict behind the scenes:
from collections import defaultdict

class Config(object):
    def __init__(self):
        self._attrdicts = defaultdict(dict)

    def __getattr__(self, name):
        return self._attrdicts[name]
You could extend this with behavior for __setattr__ and __delattr__ but it's probably not necessary. The only serious limitation of this approach (given the original version of the question) is that the attribute names (like Variable) must be legal Python identifiers. You can't use strings with leading numbers, Python keywords (like global) or strings containing whitespace characters.
A downside to this approach is that it's a bit more difficult to use programmatically (by, for instance, your config-file parser). To read a value of Variable_FOO and save it to config.Variable["FOO"] you'll probably need to use the global getattr function, like this:
name, value = line.split("=")
prefix, suffix = name.split("_")
getattr(config, prefix)[suffix] = value

What do I do when I need a self referential dictionary?

I'm new to Python, and am sort of surprised I cannot do this.
dictionary = {
    'a': '123',
    'b': dictionary['a'] + '456'
}
I'm wondering what the Pythonic way to correctly do this in my script, because I feel like I'm not the only one that has tried to do this.
EDIT: Enough people were wondering what I'm doing with this, so here are more details for my use cases. Let's say I want to keep dictionary objects to hold file system paths. The paths are relative to other values in the dictionary. For example, this is what one of my dictionaries may look like.
dictionary = {
    'user': 'sholsapp',
    'home': '/home/' + dictionary['user']
}
It is important that at any point in time I may change dictionary['user'] and have all of the dictionary's values reflect the change. Again, this is an example of what I'm using it for, so I hope that it conveys my goal.
From my own research I think I will need to implement a class to do this.
No fear of creating new classes - you can take advantage of Python's string formatting capabilities and simply do:

class MyDict(dict):
    def __getitem__(self, item):
        return dict.__getitem__(self, item) % self

dictionary = MyDict({
    'user': 'gnucom',
    'home': '/home/%(user)s',
    'bin': '%(home)s/bin'
})

print dictionary["home"]
print dictionary["bin"]
The nearest I came up with without writing a class:

dictionary = {
    'user': 'gnucom',
    'home': lambda: '/home/' + dictionary['user']
}

print dictionary['home']()
dictionary['user'] = 'tony'
print dictionary['home']()
>>> dictionary = {
... 'a':'123'
... }
>>> dictionary['b'] = dictionary['a'] + '456'
>>> dictionary
{'a': '123', 'b': '123456'}
It works fine, but when you try to use dictionary inside its own literal it hasn't been defined yet (because Python has to evaluate the whole literal first).
But be careful because this assigns to the key of 'b' the value referenced by the key of 'a' at the time of assignment and is not going to do the lookup every time. If that is what you are looking for, it's possible but with more work.
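To make the caveat concrete:

>>> dictionary = {'a': '123'}
>>> dictionary['b'] = dictionary['a'] + '456'
>>> dictionary['a'] = '999'
>>> dictionary['b']   # still built from the old value of 'a'
'123456'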
What you're describing in your edit is how an INI config file works. Python does have a built in library called ConfigParser which should work for what you're describing.
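For example, the Python 3 configparser module's default interpolation expands %(name)s references within a section (the section and values here are made up for illustration):

import configparser

cfg = configparser.ConfigParser()   # BasicInterpolation is the default
cfg.read_string("""
[paths]
user = sholsapp
home = /home/%(user)s
""")
print(cfg["paths"]["home"])   # /home/sholsapp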
This is an interesting problem. It seems like Greg has a good solution. But that's no fun ;)
jsbueno has a very elegant solution, but it only applies to strings (as you requested).
The trick to a 'general' self referential dictionary is to use a surrogate object. It takes a few (understatement) lines of code to pull off, but the usage is about what you want:
S = SurrogateDict(AdditionSurrogateDictEntry)
d = S.resolve({'user': 'gnucom',
               'home': '/home/' + S['user'],
               'config': [S['home'] + '/.emacs', S['home'] + '/.bashrc']})
The code to make that happen is not nearly so short. It lives in three classes:
import abc

class SurrogateDictEntry(object):
    __metaclass__ = abc.ABCMeta

    def __init__(self, key):
        """record the key on the real dictionary that this will resolve to a
        value for
        """
        self.key = key

    def resolve(self, d):
        """return the actual value"""
        if hasattr(self, 'op'):
            # any operation done on self will store its name in self.op.
            # if this is set, resolve it by calling the appropriate method
            # now that we can get self.value out of d
            self.value = d[self.key]
            return getattr(self, self.op + 'resolve__')()
        else:
            return d[self.key]

    @staticmethod
    def make_op(opname):
        """A convenience method. This will be the form of all op hooks for subclasses.
        The actual logic for the op is in __op__resolve__ (e.g. __add__resolve__)
        """
        def op(self, other):
            self.stored_value = other
            self.op = opname
            return self
        op.__name__ = opname
        return op
Next comes the concrete class. Simple enough.
class AdditionSurrogateDictEntry(SurrogateDictEntry):
    __add__ = SurrogateDictEntry.make_op('__add__')
    __radd__ = SurrogateDictEntry.make_op('__radd__')

    def __add__resolve__(self):
        return self.value + self.stored_value

    def __radd__resolve__(self):
        return self.stored_value + self.value
Here's the final class
class SurrogateDict(object):
    def __init__(self, EntryClass):
        self.EntryClass = EntryClass

    def __getitem__(self, key):
        """record the key and return"""
        return self.EntryClass(key)

    @staticmethod
    def resolve(d):
        """I eat generators resolve self references"""
        stack = [d]
        while stack:
            cur = stack.pop()
            # This just tries to set it to an appropriate iterable
            it = xrange(len(cur)) if not hasattr(cur, 'keys') else cur.keys()
            for key in it:
                # sorry for being a douche. Just register your class with
                # SurrogateDictEntry and you can pass whatever.
                while isinstance(cur[key], SurrogateDictEntry):
                    cur[key] = cur[key].resolve(d)
                # I'm just going to check for iter but you can add other
                # checks here for items that we should loop over.
                if hasattr(cur[key], '__iter__'):
                    stack.append(cur[key])
        return d
In response to gnucom's question about why I named the classes the way that I did.
The word surrogate is generally associated with standing in for something else so it seemed appropriate because that's what the SurrogateDict class does: an instance replaces the 'self' references in a dictionary literal. That being said, (other than just being straight up stupid sometimes) naming is probably one of the hardest things for me about coding. If you (or anyone else) can suggest a better name, I'm all ears.
I'll provide a brief explanation. Throughout S refers to an instance of SurrogateDict and d is the real dictionary.
1. A reference S[key] triggers S.__getitem__, which returns a SurrogateDictEntry(key) to be placed in d.
2. When SurrogateDictEntry(key) is constructed, it stores key. This will be the key into d for the value that this SurrogateDictEntry is acting as a surrogate for.
3. After S[key] is returned, it is either entered into d, or has some operation(s) performed on it. If an operation is performed on it, it triggers the relevant __op__ method, which simply stores the value that the operation is performed with and the name of the operation, and then returns itself. We can't actually resolve the operation because d hasn't been constructed yet.
4. After d is constructed, it is passed to S.resolve. This method loops through d, finding any instances of SurrogateDictEntry and replacing them with the result of calling the resolve method on the instance.
5. The SurrogateDictEntry.resolve method receives the now constructed d as an argument and can use the value of key that it stored at construction time to get the value that it is acting as a surrogate for. If an operation was performed on it after creation, the op attribute will have been set with the name of the operation that was performed. If the class has a __op__ method, then it has a __op__resolve__ method with the actual logic that would normally be in the __op__ method. So now we have the logic (self.__op__resolve__) and all necessary values (self.value, self.stored_value) to finally get the real value of d[key]. So we return that, which step 4 places in the dictionary.
6. Finally, the SurrogateDict.resolve method returns d with all references resolved.
That'a a rough sketch. If you have any more questions, feel free to ask.
If you, just like me, were wondering how to make @jsbueno's snippet work with {}-style substitutions, here is example code (which is probably not very efficient, though):
import string

class MyDict(dict):
    def __init__(self, *args, **kw):
        super(MyDict, self).__init__(*args, **kw)
        self.itemlist = super(MyDict, self).keys()
        self.fmt = string.Formatter()

    def __getitem__(self, item):
        return self.fmt.vformat(dict.__getitem__(self, item), {}, self)

xs = MyDict({
    'user': 'gnucom',
    'home': '/home/{user}',
    'bin': '{home}/bin'
})
>>> xs["home"]
'/home/gnucom'
>>> xs["bin"]
'/home/gnucom/bin'
I tried to make it work with the simple replacement of % self with .format(**self), but it turns out that doesn't work for nested expressions (like 'bin' in the above listing, which references 'home', which has its own reference to 'user') because of the evaluation order (** expansion is done before the actual format call and is not delayed like in the original % version).
Write a class, maybe something with properties:

class PathInfo(object):
    def __init__(self, user):
        self.user = user

    @property
    def home(self):
        return '/home/' + self.user

p = PathInfo('thc')
print p.home  # /home/thc
As sort of an extended version of @Tony's answer, you could build a dictionary subclass that calls its values if they are callables:
class CallingDict(dict):
    """Returns the result rather than the value of referenced callables.

    >>> cd = CallingDict({1: "One", 2: "Two", 'fsh': "Fish",
    ...                   "rhyme": lambda d: ' '.join((d[1], d['fsh'],
    ...                                                d[2], d['fsh']))})
    >>> cd["rhyme"]
    'One Fish Two Fish'
    >>> cd[1] = 'Red'
    >>> cd[2] = 'Blue'
    >>> cd["rhyme"]
    'Red Fish Blue Fish'
    """
    def __getitem__(self, item):
        it = super(CallingDict, self).__getitem__(item)
        if callable(it):
            return it(self)
        else:
            return it
Of course this would only be usable if you're not actually going to store callables as values. If you need to be able to do that, you could wrap the lambda declaration in a function that adds some attribute to the resulting lambda, and check for it in CallingDict.__getitem__, but at that point it's getting complex, and long-winded, enough that it might just be easier to use a class for your data in the first place.
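A hedged sketch of that marker idea (the helper name and attribute are my own, and the class is renamed here to avoid clashing with the one above):

def computed(fn):
    fn.is_computed = True             # marker checked by __getitem__ below
    return fn

class MarkedCallingDict(dict):
    def __getitem__(self, item):
        it = super(MarkedCallingDict, self).__getitem__(item)
        if getattr(it, 'is_computed', False):
            return it(self)
        return it

cd = MarkedCallingDict({
    'user': 'gnucom',
    'home': computed(lambda d: '/home/' + d['user']),
    'raw_fn': len,                    # a plain callable, stored as-is
})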
This is very easy in a lazily evaluated language (haskell).
Since Python is strictly evaluated, we can do a little trick to turn things lazy:
Y = lambda f: (lambda x: x(x))(lambda y: f(lambda *args: y(y)(*args)))

d1 = lambda self: lambda: {
    'a': lambda: 3,
    'b': lambda: self()['a']()
}

# fix the d1, and evaluate it
d2 = Y(d1)()

# to get a
d2['a']()  # 3

# to get b
d2['b']()  # 3
Syntax-wise this is not very nice. That's because we need to explicitly construct lazy expressions with lambda: ... and explicitly evaluate them with ...(). It's the opposite of the problem in lazy languages, which need strictness annotations; here in Python we end up needing laziness annotations.
I think with some more meta-programming and some more tricks, the above could be made easier to use.
Note that this is basically how let-rec works in some functional languages.
The jsbueno answer in Python 3:

class MyDict(dict):
    def __getitem__(self, item):
        return dict.__getitem__(self, item).format(self)

dictionary = MyDict({
    'user': 'gnucom',
    'home': '/home/{0[user]}',
    'bin': '{0[home]}/bin'
})

print(dictionary["home"])
print(dictionary["bin"])
Here we use the Python 3 string formatting with curly braces {} and the .format() method.
Documentation : https://docs.python.org/3/library/string.html
