Preserving order of dictionary while using ruamel.yaml - python

I am using ruamel.yaml to dump a dict to a YAML file, and I want to keep the order of the dictionary. That is how I came across the question "Keep YAML file order with ruamel". But that solution is not working in my case: the order is not preserved, and tags like !!python/object/apply:ruamel.yaml.comments.CommentedMap or dictitems are added to the output.
import os
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap as ordereddict
generated_file = os.path.join('data_TEST.yaml')
data_dict = {'Sources': {'coil': None}, 'Magnet': 'ABC', 'Current': ordereddict({'heat': {'i': [[]], 'h': None, }})}
data_dict = ordereddict(data_dict)
with open(generated_file, 'w') as yaml_file:
    ruamel.yaml.dump(data_dict, yaml_file, default_flow_style=False)
The dictionary used here is just an arbitrary example; in the end an automatically generated one, which may look different, is going to be used. So we cannot hard-code the mapping of the nested dictionaries as in my example.
Result:
!!python/object/apply:ruamel.yaml.comments.CommentedMap
dictitems:
  Current: !!python/object/apply:ruamel.yaml.comments.CommentedMap
    dictitems:
      heat:
        h: null
        i:
        - []
  Magnet: ABC
  Sources:
    coil: null
Desired result:
Sources:
  coil: null
Magnet: ABC
Current:
  heat:
    h: null
    i:
    - []

You should really not be using the old PyYAML API that sorts keys when dumping.
Instantiate a YAML instance and use its dump method:
yaml = ruamel.yaml.YAML()
yaml.dump(data, stream)

Related

remove double quotes around dictionary object - python

I have a dictionary that I am using to populate a YAML config file for each key.
{'id': ['HP:000111'], 'id1': ['HP:000111'], 'id2': ['HP:0001111', 'HP:0001123']}
Code to insert the key:value pair into the YAML template using ruamel.yaml:
import ruamel.yaml
import sys
yaml = ruamel.yaml.YAML()
with open('yaml.yml') as fp:
    data = yaml.load(fp)
for k in start.keys():
    data['analysis']['hpoIds'] = start.get(k)
    with open(f"path/yaml-{k}.yml","w+") as f:
        yaml.dump(data, sys.stdout)
This is the output I am getting:
analysis:
  # hg19 or hg38 - ensure that the application has been configured to run the specified assembly otherwise it will halt.
  genomeAssembly: hg38
  vcf:
  ped:
  proband:
  hpoIds: "['HP:000111','HP:000112','HP:000113']"
But this is what I need:
hpoIds: ['HP:000111','HP:000112','HP:000113']
I've tried using string tools, i.e. strip and replace, but they didn't work.
Output from ast.literal_eval:
hpoIds:
- HP:000111
- HP:000112
- HP:000113
Output from repr:
hpoIds: "\"['HP:000111','HP: 000112','HP:000113']\""
Any help would be greatly appreciated.
It is not entirely clear to me what you are trying to do, and why you e.g. open files with 'w+' for dumping.
However, if you have something that comes out block style and unquoted, that can easily be remedied by using a small function:
import sys
from pathlib import Path
import ruamel.yaml
SQ = ruamel.yaml.scalarstring.SingleQuotedScalarString
def flow_seq_single_quoted(lst):
    res = ruamel.yaml.CommentedSeq([SQ(x) if isinstance(x, str) else x for x in lst])
    res.fa.set_flow_style()
    return res
in_file = Path('yaml.yaml')
in_file.write_text("""\
hpoIds:
- HP:000111
- HP:000112
- HP:000113
""")
yaml = ruamel.yaml.YAML()
data = yaml.load(in_file)
data['hpoIds'] = flow_seq_single_quoted(data['hpoIds'])
yaml.dump(data, sys.stdout)
which gives:
hpoIds: ['HP:000111', 'HP:000112', 'HP:000113']
The recommended extension for YAML files has been .yaml since at least September 2006.

Get wrong values by parsing YAML

I'm somewhat confused by yaml parsing results. I made a test.yaml and the results are the same.
val_1: 05334000
val_2: 2345784
val_3: 0537380
str_1: foobar
val_4: 05798
val_5: 051342123
Parsing that with:
import yaml
with open('test.yaml', 'r', encoding='utf8') as f:
    a = yaml.load(f, Loader=yaml.FullLoader)
returns:
{'val_1': 1423360,
'val_2': 2345784,
'val_3': '0537380',
'str_1': 'foobar',
'val_4': '05798',
'val_5': 10863699}
Why these values for val_1 and val_5? Is there something special?
In my real data, spread over many YAML files, there are values like val_1. Some of them are parsed correctly and some are not. All of them start with 05, followed by more digits. Because of the leading 0, I expected the results to be strings, but YAML parses them as something completely different.
If I read the yaml as textfile f.readlines(), all is fine:
['val_1: 05334000\n',
'val_2: 2345784\n',
'val_3: 0537380\n',
'str_1: foobar\n',
'val_4: 05798\n',
'val_5: 051342123\n']
Integers with a leading 0 are parsed as octal (PyYAML follows YAML 1.1 here); in Python you'd need to write them with a leading 0o:
0o5334000 == 1423360
As for '0537380': since the digit 8 is present, it cannot be parsed as an octal number, and therefore it remains a string. The same holds for '05798'.
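The octal reading can be checked with plain Python, without any YAML library. The sketch below is only an approximation of the YAML 1.1 integer rule that PyYAML applies: it ignores signs, underscores and the other number formats, and the helper name resolve_plain_int is made up for illustration; it is not part of PyYAML.

```python
def resolve_plain_int(text):
    """Simplified YAML 1.1 integer rule (hypothetical helper, not PyYAML code)."""
    # A leading 0 followed only by the digits 0-7 is read as base 8.
    if len(text) > 1 and text[0] == '0' and all(c in '01234567' for c in text[1:]):
        return int(text, 8)
    # Ordinary decimal integers must not have a leading 0.
    if text.isdigit() and (text == '0' or text[0] != '0'):
        return int(text)
    # Anything else (e.g. a leading 0 plus an 8 or 9) stays a string.
    return text

values = ['05334000', '2345784', '0537380', '05798', '051342123']
parsed = {text: resolve_plain_int(text) for text in values}
for text, value in parsed.items():
    print(text, '->', repr(value))
```

Only val_1 and val_5 consist entirely of octal digits after the leading 0, which is why just those two come back as unexpected integers.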
If you want to get strings for all your entries, you can use the BaseLoader:
from io import StringIO
import yaml
file = StringIO("""
val_1: 05334000
val_2: 2345784
val_3: 0537380
str_1: foobar
val_4: 05798
val_5: 051342123
""")
dct = yaml.load(file, Loader=yaml.BaseLoader)
With that I get:
{'val_1': '05334000', 'val_2': '2345784', 'val_3': '0537380',
'str_1': 'foobar', 'val_4': '05798', 'val_5': '051342123'}

Preserve quotes and also add data with quotes in Ruamel

I am using Ruamel to preserve quote styles in human-edited YAML files.
I have example input data as:
---
a: '1'
b: "2"
c: 3
I read in data using:
def read_file(f):
    with open(f, 'r') as _f:
        return ruamel.yaml.round_trip_load(_f.read(), preserve_quotes=True)
I then edit that data:
data = read_file('in.yaml')
data['foo'] = 'bar'
I write back to disk using:
def write_file(f, data):
    with open(f, 'w') as _f:
        _f.write(ruamel.yaml.dump(data, Dumper=ruamel.yaml.RoundTripDumper, width=1024))
write_file('out.yaml', data)
And the output file is:
a: '1'
b: "2"
c: 3
foo: bar
Is there a way I can enforce hard quoting of the string 'bar' without also enforcing that quoting style throughout the rest of the file?
(Also, can I stop it from deleting the three dashes --- ?)
In order to preserve quotes (and literal block style) for string scalars, ruamel.yaml¹—in round-trip-mode—represents these scalars as SingleQuotedScalarString, DoubleQuotedScalarString and PreservedScalarString. The class definitions for these very thin wrappers can be found in scalarstring.py.
When serializing, such instances are written "as they were read", although sometimes the representer falls back to double quotes when things get difficult, as that can represent any string.
To get this behaviour when adding new key-value pairs (or when updating an existing pair), you just have to create these instances yourself:
import sys
from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import SingleQuotedScalarString, DoubleQuotedScalarString
yaml_str = """\
---
a: '1'
b: "2"
c: 3
"""
yaml = YAML()
yaml.preserve_quotes = True
yaml.explicit_start = True
data = yaml.load(yaml_str)
data['foo'] = SingleQuotedScalarString('bar')
data.yaml_add_eol_comment('# <- single quotes added', 'foo', column=20)
yaml.dump(data, sys.stdout)
gives:
---
a: '1'
b: "2"
c: 3
foo: 'bar' # <- single quotes added
Setting yaml.explicit_start = True recreates the (superfluous) document start marker. Whether such a marker was in the original file or not is not "known" by the top-level dictionary object, so you have to re-add it by hand.
Please note that without preserve_quotes, there would be (single) quotes around the values 1 and 2 anyway to make sure they are seen as string scalars and not as integers.
¹ Of which I am the author.
Since Ruamel 0.15, set the preserve_quotes flag like this:
from ruamel.yaml import YAML
from pathlib import Path
yaml = YAML(typ='rt') # Round trip loading and dumping
yaml.preserve_quotes = True
data = yaml.load(Path("in.yaml"))
yaml.dump(data, Path("out.yaml"))

How to separate yaml.dump key:value pair by a new line?

I am trying to make yaml dump each key:value pair on a separate line. Is there a native option to do that? I have tried line_break but couldn't get it to work.
Here is a code example:
import yaml
def test_yaml_dump():
    obj = {'key0': 1, 'key1': 2}
    with open('test.yaml', 'w') as tmpf:
        yaml.dump(obj, tmpf, line_break=0)
The output is:
{key0: 1, key1: 2}
I want it to be:
{key0: 1,
key1: 2}
If you add the argument default_flow_style=False to dump, then the output will be:
key0: 1
key1: 2
(the so-called block style). That is the much more readable way of dumping Python dicts to YAML mappings. In ruamel.yaml this is the default when using ruamel.yaml.round_trip_dump().
import sys
import ruamel.yaml as yaml
obj = dict(key0=1, key1=2)
yaml.round_trip_dump(obj, sys.stdout)

pythonic way of iterating over a collection of json objects stored in a text file

I have a text file that has several thousand JSON objects (meaning the textual representation of JSON) one after the other. They're not separated, and I would prefer not to modify the source file. How can I load/parse each JSON object in Python? (I have seen this question, but if I'm not mistaken, it only works for a list of JSON objects already separated by commas.) My file looks like this:
{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}...
I don't see a clean way to do this without using the real JSON parser. The other options, modifying the text or using a non-JSON parser, are risky. So the best way to go is to find a way to iterate using the real JSON parser, so that you're sure to comply with the JSON spec.
The core idea is to let the real JSON parser do all the work in identifying the groups:
import json, re
combined = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
start = 0
while start != len(combined):
    try:
        # Succeeds only when a single object is left in the remainder
        result = json.loads(combined[start:])
        end = len(combined)
    except ValueError as e:
        # Find the location where the parsing failed ("Extra data" error)
        end = start + int(re.search(r'column (\d+)', e.args[0]).group(1)) - 1
        result = json.loads(combined[start:end])
    start = end
    print(result)
This outputs (under Python 2; Python 3 prints the same dicts without the u prefix):
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
I think the following would work as long as there are no non-comma-delimited json arrays of json sub-objects inside any of the outermost json objects. It's somewhat brute-force in that it reads the whole file into memory and attempts to fix it.
import json
def get_json_array(filename):
    with open(filename, 'rt') as jsonfile:
        json_array = '[{}]'.format(jsonfile.read().replace('}{', '},{'))
    return json.loads(json_array)
for obj in get_json_array('multiobj.json'):
    print(obj)
Output:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
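One more limit of the textual replace is worth keeping in mind: a }{ sequence inside a string value is rewritten as well, and the result can still be valid JSON, so the change goes unnoticed. A minimal sketch, using a made-up input string rather than the asker's file:

```python
import json

# Hypothetical input: the first object's string value happens to contain "}{".
combined = '{"note": "a}{b"}{"json": 2}'

# Same textual fix as above: insert commas and wrap everything in a list.
patched = '[{}]'.format(combined.replace('}{', '},{'))
objects = json.loads(patched)

# Parsing succeeds, but the string value has been silently altered.
print(objects[0]['note'])
```

The parse raises no error here, so the corruption would only show up later when the data is used.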
Instead of modifying the source file, just make a copy. Use a regex to replace }{ with },{ and then hopefully a pre-built json reader will take care of it nicely.
EDIT: quick solution:
from re import sub
with open(inputfile, 'r') as fin:
    text = sub(r'}{', r'},{', fin.read())
with open(outfile, 'w') as fout:
    fout.write('[')
    fout.write(text)
    fout.write(']')
>>> import ast
>>> s = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
>>> [ast.literal_eval(ele + '}') for ele in s.split('}')[:-1]]
[{'json': 1}, {'json': 2}, {'json': 3}, {'json': 4}, {'json': 5}]
Provided you have no nested objects and splitting on '}' is feasible, this can be accomplished pretty simply.
Here is one pythonic way to do it:
from json.scanner import make_scanner
from json import JSONDecoder
def load_jsons(multi_json_str):
    s = multi_json_str.strip()
    scanner = make_scanner(JSONDecoder())
    idx = 0
    objects = []
    while idx < len(s):
        obj, idx = scanner(s, idx)
        objects.append(obj)
    return objects
I think json was never supposed to be used this way, but it solves your problem.
I agree with @Raymond Hettinger: you need to use json itself to do the work, since text manipulation doesn't work for complex JSON objects. His answer parses the exception message to find the split position. It works, but it looks like a hack, hence, not pythonic :)
EDIT:
Just found out this is actually supported by json module, just use raw_decode like this:
decoder = JSONDecoder()
# raw_decode returns the decoded object and the index where it ended
first_obj, end_idx = decoder.raw_decode(multi_json_str)
Read http://pymotw.com/2/json/index.html#mixed-data-streams
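Building on raw_decode, a generator can walk through the whole string one object at a time. This is a minimal sketch: the function name iter_json_objects is made up here, and whitespace between objects is skipped by hand because raw_decode does not accept leading whitespace.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    idx = 0
    end = len(text)
    while idx < end:
        # Skip whitespace (e.g. newlines) between consecutive objects.
        while idx < end and text[idx].isspace():
            idx += 1
        if idx == end:
            break
        # raw_decode returns the value and the index just past it.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

combined = '{"json":1}{"json":2}\n{"nested": {"a": [1, 2]}}'
objects = list(iter_json_objects(combined))
for obj in objects:
    print(obj)
```

Because the real parser finds the boundaries, nested objects and braces inside strings are handled without any text manipulation. For a file, the contents would still have to be read into a string first.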
