YAML file update and delete using Python?

I have a YAML file and it looks like below
test:
- exam.com
- exam1.com
- exam2.com
test2:
- examp.com
- examp1.com
- examp2.com
I would like to manage this file using Python.
The task is to add an entry under "test2" and delete an entry from "test".

You first have to load the data, which gives you a top-level dict (in a variable called data in the following example) whose values are lists. On those lists you can use del to remove an entry and insert() (or append()) to add one.
import sys
import ruamel.yaml
yaml_str = """\
test:
- exam.com
- exam1.com
- exam2.com
test2:
- examp.com
- examp1.com # want to insert after this
- examp2.com
"""
data = ruamel.yaml.round_trip_load(yaml_str)
del data['test'][1]
data['test2'].insert(2, 'examp1.5')
ruamel.yaml.round_trip_dump(data, sys.stdout, block_seq_indent=1)
gives:
test:
- exam.com
- exam2.com
test2:
- examp.com
- examp1.com # want to insert after this
- examp1.5
- examp2.com
The block_seq_indent=1 is necessary as by default ruamel.yaml will left align a sequence value with the key.¹
If you want to get rid of the comment in the output you can do:
data['test2']._yaml_comment = None
¹ This was done using ruamel.yaml, a YAML 1.2 parser, of which I am the author.
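If you would rather address entries by value than by position, the same list operations can be wrapped in small helpers. The following is a minimal sketch on a plain dict standing in for the loaded data. The helper names remove_entry and insert_after are made up for this illustration and are not part of ruamel.yaml, and I have only shown them on plain lists here.

```python
def remove_entry(data, key, value):
    """Remove the first occurrence of value from the list stored under key."""
    data[key].remove(value)  # raises ValueError if value is not present


def insert_after(data, key, anchor, new_value):
    """Insert new_value directly after anchor in the list stored under key."""
    seq = data[key]
    seq.insert(seq.index(anchor) + 1, new_value)  # index() raises ValueError if anchor is missing


# Plain dict standing in for the result of loading the YAML above.
data = {
    "test": ["exam.com", "exam1.com", "exam2.com"],
    "test2": ["examp.com", "examp1.com", "examp2.com"],
}

remove_entry(data, "test", "exam1.com")
insert_after(data, "test2", "examp1.com", "examp1.5")
print(data)
```

Looking entries up by value avoids hard-coding indexes such as 1 and 2, which silently hit the wrong entry once the file changes.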

Related

How can I include a specific data from YAML file inside another?

I have a very similar structure and scenario to this question, and it helped me a lot, but I'm looking for a more specific situation: I want to include in my YAML file just one piece of data from another YAML file, not the complete file. Something like:
UPDATED: I corrected the structure of the files below to describe my scenario properly, sorry.
foo.yaml
a: 1
b:
- 1.43
- 543.55
- item : !include {I wanna just the [1, 2, 3] from} bar.yaml
bar.yaml
- 3.6
- [1, 2, 3]
Right now I'm importing the whole second file, but I don't need all of it, and I haven't been able to figure out a proper solution since yesterday. Below is my actual structure:
foo.yaml
variables: !include bar.yaml # I'm importing the entire file for now and have to navigate it to get what I need.
a: 1
b:
- 1.43
- 543.55
You can write your own custom include constructor:
import yaml

bar_yaml = """
- 3.6
- [1, 2, 3]
"""
foo_yaml = """
variables: !include [bar.yaml, 1]
a: 1
b:
- 1.43
- 543.55
"""

def include_constructor(loader, node):
    selector = loader.construct_sequence(node)
    name = selector.pop(0)
    # in actual code, load the file named by name.
    # for this example, we'll ignore the name and load the predefined string
    content = yaml.safe_load(bar_yaml)
    # walk over the selector items and descend into the loaded structure each time.
    for item in selector:
        content = content[item]
    return content

yaml.add_constructor('!include', include_constructor, Loader=yaml.SafeLoader)
print(yaml.safe_load(foo_yaml))
This !include will treat the first item in the given sequence as file name and the following items as sequence indexes (or dictionary keys). E.g. you can do !include [bar.yaml, 1, 2] to only load the 3.
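The descent through the selector can be tried without any YAML library. This is a small sketch in which a plain Python list stands in for the parsed contents of bar.yaml; the function name descend is only illustrative.

```python
def descend(content, selector):
    """Follow each selector item one level down into nested lists/dicts."""
    for item in selector:
        content = content[item]
    return content


# Stand-in for the parsed contents of bar.yaml.
bar = [3.6, [1, 2, 3]]

print(descend(bar, [1]))     # the whole inner list
print(descend(bar, [1, 2]))  # only the last element of the inner list
print(descend({"a": {"b": [10, 20]}}, ["a", "b", 0]))  # dictionary keys and indexes mixed
```

An empty selector returns the whole structure, so !include [bar.yaml] would behave like a plain include of the complete file.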

Access elements inside yaml using python

I am using yaml and pyyaml to configure my application.
Is it possible to configure something like this -
config.yml -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: $root.repo_root/data
service:
  root: $root.data_root/csv/xyz.csv
yaml loading function -
def load_config(config_path):
    config_path = os.path.abspath(config_path)
    if not os.path.isfile(config_path):
        raise FileNotFoundError("{} does not exist".format(config_path))
    else:
        with open(config_path) as f:
            config = yaml.load(f, Loader=yaml.SafeLoader)
        # logging.info(config)
        logging.info("Config used for run - \n{}".format(yaml.dump(config, sort_keys=False)))
        return DotDict(config)
Current Output-
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: ${root.repo_root}/data
service:
  root: ${root.data_root}/csv/xyz.csv
Desired Output -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data
service:
  root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv
Is this even possible with python? If so any help would be really nice.
Thanks in advance.
A general approach:
- read the file as is
- search for strings containing $:
  - determine the "path" of the "variables"
  - replace the "variables" with actual values
An example that uses recursive calls for dictionaries and replaces the variables in strings:
import re, pprint, yaml

def convert(input, top=None):
    """Replaces $key1.key2 with actual values. Modifies input in-place"""
    if top is None:
        top = input  # top should be the original input
    if isinstance(input, dict):
        ret = {k: convert(v, top) for k, v in input.items()}  # recursively convert items
        if input != ret:  # in case order matters, do it one or several times more until no change happens
            ret = convert(ret)
        input.update(ret)  # update original input
        return input  # return updated input (for the case of recursion)
    if isinstance(input, str):
        vars = re.findall(r"\$[\w_\.]+", input)  # find $key_1.key_2.keyN sequences
        for var in vars:
            keys = var[1:].split(".")  # remove dollar and split by dots to make "key chain"
            val = top  # starting from top ...
            for k in keys:  # ... for each key in the key chain ...
                val = val[k]  # ... go one level down
            input = input.replace(var, str(val))  # replace $key sequence with actual value (str() so non-string values work)
        return input  # return modified input
    # TODO int, float, list, ...
    return input  # other types are returned unchanged instead of becoming None

with open("in.yml") as f:
    config = yaml.safe_load(f)  # load as is (PyYAML 6+ rejects yaml.load() without a Loader)
convert(config)  # convert it (in-place)
pprint.pprint(config)
Output:
{'root': {'data_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data',
'repo_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght'},
'service': {'root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv'}}
Note: YAML is not that important here, would work also with JSON, XML or other formats.
Note2: If you use exclusively YAML and exclusively python, some answers from this post may be useful (using anchors and references and application specific local tags)
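The TODO above (lists and non-string values) can be covered with a variant that returns a new structure instead of modifying the input. This is only a sketch under a few assumptions: the names lookup and resolve are made up, both the $a.b and the ${a.b} spellings from the question are accepted, and circular references are not detected (they would end in a RecursionError).

```python
import re

# Matches ${a.b.c} as well as $a.b.c; each key is a run of word characters.
_VAR = re.compile(r"\$\{([\w.]+)\}|\$(\w+(?:\.\w+)*)")


def lookup(top, dotted):
    """Resolve 'a.b.c' against nested dicts, starting at top."""
    value = top
    for key in dotted.split("."):
        value = value[key]
    return value


def resolve(node, top=None):
    """Return a copy of node with every variable reference expanded."""
    if top is None:
        top = node
    if isinstance(node, dict):
        return {key: resolve(value, top) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, top) for item in node]
    if isinstance(node, str):
        def substitute(match):
            dotted = match.group(1) or match.group(2)
            # The referenced value may itself contain references, so resolve it too.
            return str(resolve(lookup(top, dotted), top))
        return _VAR.sub(substitute, node)
    return node  # int, float, bool, None, ... are left unchanged


config = {
    "root": {"repo_root": "/home/user/project", "data_root": "$root.repo_root/data"},
    "service": {"root": "${root.data_root}/csv/xyz.csv",
                "ports": [8080, "$root.repo_root/sock"]},
}
resolved = resolve(config)
print(resolved)
```

Because every referenced value is resolved on demand, the result does not depend on the order of the keys, and the original config stays untouched.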

Editing pyparsing parse results

This is similar to a question I've asked before.
I have written a pyparsing grammar logparser for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and after that function A finishes and gives us its return value. Due to this, the log file is very difficult to read by hand because the call information and return value information of one function can be thousands of lines apart.
My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:
[0]: # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]: # second log
  - input_blocks:
    ...
  - output_blocks:
    ...
... # n-th log
I want to solve the problem that input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:
[0]: # first log
  - function_blocks:
    [0]:
      - input_block:
        - func_name: 'Foo'
        - parameters: ...
        - thread: '123'
        - timestamp_in: '12:01'
      - output_block:
        - func_name: 'Foo'
        - func_time: '3'
        - parameters: ...
        - thread: '123'
        - timestamp_out: '12:04'
    [1]:
      - input_block:
        - func_name: 'Bar'
        - parameters: ...
        - thread: '456'
        - timestamp_in: '12:02'
      - output_block:
        - func_name: 'Bar'
        - func_time: '1'
        - parameters: ...
        - thread: '456'
        - timestamp_out: '12:03'
[1]: # second log
  - function_blocks:
    [0]: ...
    [1]: ...
... # n-th log
To achieve this, I define a function rearrange which iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as parse action for the log grammar: logparser.setParseAction(rearrange)
def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                pass  # placeholder: this is the part I am missing
    return log_token
My question is: How do I put the matching output_block and input_block in a function_block in a way that I still enjoy the easy access methods of pyparsing.ParseResults?
My idea looks like this:
def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')
    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token
This doesn't work though. The addition causes a maximum recursion error and the .pop() doesn't work as expected. It doesn't pop the whole block, it just pops the last entry in that block. Also, it doesn't actually remove that entry either; it just removes it from the list, but it's still accessible by its results name.
It's also possible that some of the input_blocks don't have a corresponding output_block (for example if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks, output_blocks (for the spare blocks), and function_blocks (for the matching blocks).
Thanks for your help!
EDIT:
I made a simpler example to show my problem. Also, I experimented around and have a solution which kind of works but is a bit messy. I must admit there was a lot of trial-and-error included because I neither found documentation on nor can make sense of the inner workings of ParseResults and how to properly create my own nested ParseResults-structure.
from pyparsing import *

def main():
    log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''
    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') + '_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)
    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())
    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data
                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token

if __name__ == '__main__':
    main()
Output:
***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []] # why is this duplicated? I just want the inner function_blocks!
  - function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
    [0]:
      [['Func1'], ['Func1']]
      - input: ['Func1']
        - func_name: 'Func1'
      - output: ['Func1']
        - func_name: 'Func1'
    [1]:
      [['Func2'], ['Func2']]
      - input: ['Func2']
        - func_name: 'Func2'
      - output: ['Func2']
        - func_name: 'Func2'
    [2]: # where does this come from?
      [[], []]
      - input: []
      - output: []
- input_blocks: [[], [], ['Func3']]
  [0]: # how do I delete these indexes?
    [] # I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []
This version of rearrange addresses most of the issues I see in your example:
def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them in their original positions in log_token
                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])
                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]
                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate name references
                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()
                # match found, no need to keep going looking for a matching output block
                break
    # find all input blocks that weren't cleared (had no matching output block) and append as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input']  # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()
    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)
    return log_token
And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:
# trailing '*' on the results name is equivalent to listAllMatches=True
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
log = OneOrMore(input_block | output_block)
log.addParseAction(rearrange)
Since rearrange updated log_token in place, if you make it a parse action, the ending return statement would be unnecessary.
It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.
Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs and I don't really see a good home for this topic.
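Independent of how ParseResults are assembled, the pairing logic itself can be checked on plain data. The following is a rough sketch in which dicts stand in for the parsed blocks; the function name pair_blocks and the "input"/"output" keys are chosen for this illustration only. Unlike the rearrange above, it keeps input-only blocks in their original position instead of appending them at the end.

```python
def pair_blocks(input_blocks, output_blocks):
    """Pair each input block with the first unused output block of the same name."""
    remaining = list(output_blocks)  # outputs not yet claimed by an input
    function_blocks = []
    for block_in in input_blocks:
        entry = {"input": block_in}
        for i, block_out in enumerate(remaining):
            if block_out["func_name"] == block_in["func_name"]:
                entry["output"] = remaining.pop(i)  # claim it so it cannot match twice
                break
        function_blocks.append(entry)
    return function_blocks, remaining


inputs = [
    {"func_name": "Func1", "timestamp_in": "12:01"},
    {"func_name": "Func2", "timestamp_in": "12:02"},
    {"func_name": "Func3", "timestamp_in": "12:05"},
]
outputs = [
    {"func_name": "Func2", "timestamp_out": "12:03"},
    {"func_name": "Func1", "timestamp_out": "12:04"},
]

blocks, spare_outputs = pair_blocks(inputs, outputs)
for block in blocks:
    print(block)
print("spare outputs:", spare_outputs)
```

Removing a claimed output from the candidate list plays the same role as clearing the matched blocks in the parse action: a block can never be paired twice.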
Another demo for pyparsing setParseAction:
remove whitespace before the first value, preserve whitespace between values.
I tried to solve this with pp.Optional(pp.White(' \t')).suppress(),
but then I got a = ["b=1"] (the parser did not stop at end-of-line).
import pyparsing as pp

def lstrip_first_value(src, loc, token):
    "remove whitespace before first value"
    # based on https://stackoverflow.com/a/51335710/10440128
    if len(token) == 0:
        return token
    # update the values
    copy = token[:]
    copy[0] = copy[0].lstrip()
    if copy[0] == "" and len(copy) > 1:
        copy = copy[1:]
    # update the token
    token.clear()
    token.extend(copy)
    token["value"] = copy
    return token

# Value has to be defined before Values, which refers to it
Value = pp.Combine(
    pp.QuotedString(quoteChar='"', escChar="\\")
    | pp.White(' \t')  # parse whitespace to separate token
)
Values = (
    pp.OneOrMore(Value.leaveWhitespace())
    | pp.Empty().setParseAction(pp.replaceWith(""))
)("value").setParseAction(lstrip_first_value)
inputs
a=
b=2
a =
b=2
the values of a should always be [""]
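The stripping rule can be tried on plain lists, without pyparsing. This is just a sketch of the list handling in the parse action; the function name lstrip_first is hypothetical.

```python
def lstrip_first(values):
    """Strip leading whitespace from the first item; drop it if nothing is left."""
    if not values:
        return values
    result = list(values)  # work on a copy, as the parse action does
    result[0] = result[0].lstrip()
    if result[0] == "" and len(result) > 1:
        result = result[1:]  # the first item was whitespace only
    return result


print(lstrip_first([" ", "x", " ", "y"]))  # ['x', ' ', 'y']
print(lstrip_first(["  "]))                # ['']
print(lstrip_first([""]))                  # ['']
```

The last two calls correspond to the two inputs above: with or without a blank after the equals sign, the value ends up as [""].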

Preserving anchors and aliases in YAML with Python

I'm editing a large YAML document in Python with extensive anchors and aliases. I'd like to be able to determine how the anchor is derived based on data from the node it references.
For instance the node has a 'name' field and I'd like the anchor to be the value of that field rather than a random id number.
Is this possible with PyYAML or ruamel.yaml?
There are a few things to keep in mind:
YAML has no fields. I assume that is your interpretation of keys in a mapping, so that you want the anchor associated with a mapping to be the same as the value for the key 'name'.
During load time, the event created when encountering an anchor doesn't know whether it is an anchor on a scalar, sequence or mapping, let alone have access to the value for 'name'.
Changing the anchor during load is tricky, as you have to keep track of aliases referring to the original anchor (and map them to its new value).
In PyYAML the anchor name gets created during dumping, so you would have to hook into that when using PyYAML. You can do the same with ruamel.yaml.
Only ruamel.yaml has the capability to preserve an anchor on round-trip, i.e. you can make the anchor persistent, even if the value for the key 'name' changes (assuming you test e.g. for the default generated form idNNNN).
When you use ruamel.yaml you can recursively walk the data-structure, keeping track of nodes already visited (in case a child contains an ancestor) and when encountering a ruamel.yaml.comments.CommentedMap, set the anchor (currently the attribute with the value of ruamel.yaml.comments.Anchor.attrib i.e. _yaml_anchor). Untested code:
if isinstance(x, ruamel.yaml.comments.CommentedMap):
    if 'name' in x:
        x.yaml_set_anchor(x['name'])
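The recursive walk with tracking of already visited nodes could look roughly like the sketch below. It uses plain dicts and lists so that it runs without ruamel.yaml; the name walk_named_mappings and the callback parameter are made up for this illustration. With round-trip loaded data the callback would be the place to call yaml_set_anchor, as in the snippet above.

```python
def walk_named_mappings(node, on_named, _seen=None):
    """Call on_named(mapping) once for every dict that has a 'name' key."""
    if _seen is None:
        _seen = set()
    if id(node) in _seen:  # already visited: an aliased node or a cycle
        return
    if isinstance(node, dict):
        _seen.add(id(node))
        if "name" in node:
            on_named(node)
        for value in node.values():
            walk_named_mappings(value, on_named, _seen)
    elif isinstance(node, list):
        _seen.add(id(node))
        for item in node:
            walk_named_mappings(item, on_named, _seen)


shared = {"a": 1, "b": 2, "name": "mydata"}
data = [shared, {"c": 3}, shared]  # the same mapping is referenced twice
data.append(data)                  # a child that contains its own ancestor

names = []
walk_named_mappings(data, lambda mapping: names.append(mapping["name"]))
print(names)
```

Tracking id() of each container rather than its contents is what keeps the walk from looping forever on self-referencing data and from handling an aliased mapping twice.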
If you have a YAML document that you can round-trip you can hook into the representer:
import sys
import ruamel.yaml
from ruamel.yaml.representer import RoundTripRepresenter

yaml_str = """\
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &id001
  a: 1
  b: 2
  name: mydata
- c: 3
- *id001
"""

class MyRTR(RoundTripRepresenter):
    def represent_mapping(self, tag, mapping, flow_style=None):
        if 'name' in mapping:
            # if not isinstance(mapping, ruamel.yaml.comments.CommentedMap):
            #     mapping = ruamel.yaml.comments.CommentedMap(mapping)
            mapping.yaml_set_anchor(mapping['name'])
        return RoundTripRepresenter.represent_mapping(
            self, tag, mapping, flow_style=flow_style)

yaml = ruamel.yaml.YAML()
yaml.Representer = MyRTR
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
which gives:
# data = [dict(a=1, b=2, name='mydata'), dict(c=3)]
# data.append(data[0])
- &mydata a: 1
  b: 2
  name: mydata
- c: 3
- *mydata
But note that this assumes that you loaded the data and that all dicts are actually CommentedMaps under the hood. If that is not the case (i.e. you added normal dicts), then uncomment the two lines doing the conversion.

Prettify a yml dump in python

I have this example:
import yaml
from collections import OrderedDict
data = [OrderedDict({"one": u"Hello\u2122", "two":["something", u"something2", u"something3"]})]
print yaml.dump(data, default_flow_style=False, default_style='"', allow_unicode=True, encoding="utf-8")
This prints out:
- !!python/object/apply:collections.OrderedDict
  - - - "two"
      - - "something"
        - !!python/unicode "something2"
        - !!python/unicode "something3"
    - - "one"
      - "Hello\u2122"
I use OrderedDict because I want to preserve the key order when dumping into YML. However, I don't care about the order when reading the YML back into python.
How can I prettify the dump to be something like:
- two:
  - "something"
  - "something2"
  - "something3"
  one:
  - "Hello\xe2\x84\xa2"
And then read it back into python using yaml.load()?
One option is to use representers to change the serialization of some objects. But this has to be done on a case-by-case basis and I don't know if it will scale well for your particular use case.
Preserving the order in your OrderedDict will get a little more tricky, since the represent_mapping will always sort the items if your map has an items attribute, but passing the items as a tuple should work.
import yaml
from yaml.representer import SafeRepresenter
from collections import OrderedDict
data = [OrderedDict({"one": u"Hello\u2122",
                     "two": ["something", u"something2", u"something3"]})]
# Represent an OrderedDict preserving order
def _represent_dict_in_order(dumper, odict):
    return dumper.represent_mapping(u'tag:yaml.org,2002:map', odict.items())
# Use a safe dictionary representer for OrderedDict
yaml.add_representer(OrderedDict, _represent_dict_in_order)
# Use a safe string representer for unicode data
yaml.add_representer(unicode, SafeRepresenter.represent_unicode)
print yaml.dump(data, default_flow_style=False,
                default_style='"', allow_unicode=True, encoding="utf-8")
