I'm trying to dump a Python dict to a YAML file using ruamel.yaml. I'm familiar with the json module's interface, where pretty-printing a dict is as simple as
import json

with open('outfile.json', 'w') as f:
    json.dump(mydict, f, indent=4, sort_keys=True)
With ruamel.yaml, I've gotten as far as
import ruamel.yaml

with open('outfile.yaml', 'w') as f:
    ruamel.yaml.round_trip_dump(mydict, f, indent=2)
but it doesn't seem to support the sort_keys option. ruamel.yaml also doesn't seem to have any exhaustive docs, and searching Google for "ruamel.yaml sort" or "ruamel.yaml alphabetize" didn't turn up anything at the level of simplicity I'd expect.
Is there a one-or-two-liner for pretty-printing a YAML file with sorted keys?
(Note that I need the keys to be alphabetized down through the whole container, recursively; just alphabetizing the top level is not good enough.)
Notice that if I use round_trip_dump, the keys are not sorted; and if I use safe_dump, the output is not "YAML-style" (or more importantly "Kubernetes-style") YAML. I don't want [] or {} in my output.
$ pip freeze | grep yaml
ruamel.yaml==0.12.5
$ python
>>> import ruamel.yaml
>>> mydict = {'a':1, 'b':[2,3,4], 'c':{'a':1,'b':2}}
>>> print ruamel.yaml.round_trip_dump(mydict) # right format, wrong sorting
a: 1
c:
  a: 1
  b: 2
b:
- 2
- 3
- 4
>>> print ruamel.yaml.safe_dump(mydict) # wrong format, right sorting
a: 1
b: [2, 3, 4]
c: {a: 1, b: 2}
You need a recursive function that handles mappings/dicts and sequences/lists:
import sys
import ruamel.yaml
CM = ruamel.yaml.comments.CommentedMap
yaml = ruamel.yaml.YAML()
data = dict(a=1, c=dict(b=2, a=1), b=[2, dict(e=6, d=5), 4])
yaml.dump(data, sys.stdout)
def rec_sort(d):
    try:
        if isinstance(d, CM):
            return d.sort()
    except AttributeError:
        pass
    if isinstance(d, dict):
        # could use dict in newer python versions
        res = CM()
        for k in sorted(d.keys()):
            res[k] = rec_sort(d[k])
        return res
    if isinstance(d, list):
        for idx, elem in enumerate(d):
            d[idx] = rec_sort(elem)
    return d
print('---')
yaml.dump(rec_sort(data), sys.stdout)
which gives:
a: 1
c:
  b: 2
  a: 1
b:
- 2
- e: 6
  d: 5
- 4
---
a: 1
b:
- 2
- d: 5
  e: 6
- 4
c:
  a: 1
  b: 2
The commented map is the structure ruamel.yaml uses when doing a round trip (load + dump), and round-tripping is designed to keep the keys in the order they had during loading.
The above should do a reasonable job of preserving comments on mappings/sequences when you load the data from a commented YAML file.
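For illustration, a minimal sketch of that round trip, reusing the yaml instance and rec_sort defined above (in.yaml is a hypothetical file name, not from the question):

with open('in.yaml') as f:        # hypothetical commented YAML file
    data = yaml.load(f)           # loads a CommentedMap/CommentedSeq tree, comments attached
yaml.dump(data, sys.stdout)       # keys come back out in the order they were loaded
print('---')
yaml.dump(rec_sort(data), sys.stdout)  # same data, keys sorted recursively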
There is an undocumented sort() in ruamel.yaml that will work on a variation of this problem:
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML()
test = """- name: a11
value: 11
- name: a2
value: 2
- name: a21
value: 21
- name: a3
value: 3
- name: a1
value: 1"""
test_yml = yaml.load(test)
yaml.dump(test_yml, sys.stdout)
Unsorted output:
- name: a11
  value: 11
- name: a2
  value: 2
- name: a21
  value: 21
- name: a3
  value: 3
- name: a1
  value: 1
Sort by name:
test_yml.sort(lambda x: x['name'])
yaml.dump(test_yml, sys.stdout)
Sorted output:
- name: a1
  value: 1
- name: a11
  value: 11
- name: a2
  value: 2
- name: a21
  value: 21
- name: a3
  value: 3
As pointed out in @Anthon's example, if you are using Python 3.7 or newer (and do not need to support older versions), you just need:
import sys
from ruamel.yaml import YAML
yaml = YAML()
data = dict(a=1, c=dict(b=2, a=1), b=[2, dict(e=6, d=5), 4])
def rec_sort(d):
    if isinstance(d, dict):
        res = dict()
        for k in sorted(d.keys()):
            res[k] = rec_sort(d[k])
        return res
    if isinstance(d, list):
        for idx, elem in enumerate(d):
            d[idx] = rec_sort(elem)
    return d
yaml.dump(rec_sort(data), sys.stdout)
Since dict is ordered as of that version.
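If you prefer, the same idea fits in a couple of comprehensions; this compact variant is only an illustration of the approach above, not a separate technique:

def rec_sort(d):
    if isinstance(d, dict):
        # plain dict keeps insertion order on Python 3.7+
        return {k: rec_sort(v) for k, v in sorted(d.items())}
    if isinstance(d, list):
        return [rec_sort(e) for e in d]
    return d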
Related
I have a dataframe with two columns that are json.
So for example,
df =
   A   B   C                          D
   1.  2.  {b:1,c:2,d:{r:1,t:{y:0}}}  {v:9}
I want to flatten it entirely, so that every value in the JSON ends up in a separate column, and the column name is the full path. So here the value 0 would be in the column:
C_d_t_y
What is the best way to do it, without having to predefine the depth of the JSON or the field names?
If your dataframe contains only nested dictionaries (no lists), you can try:
import pandas as pd

def get_values(df):
    def _parse(val, current_path):
        if isinstance(val, dict):
            for k, v in val.items():
                yield from _parse(v, current_path + [k])
        else:
            yield "_".join(map(str, current_path)), val

    rows = []
    for idx, row in df.iterrows():
        tmp = {}
        for i in row.index:
            tmp.update(dict(_parse(row[i], [i])))
        rows.append(tmp)
    return pd.DataFrame(rows, index=df.index)
print(get_values(df))
Prints:
   A  B  C_b  C_c  C_d_r  C_d_t_y  D_v
0  1  2    1    2      1        0    9
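For completeness, here is one hypothetical way to build the example frame from the question, so the snippet above can be run end to end (values are taken from the question; the construction itself is illustrative):

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1],
        "B": [2],
        "C": [{"b": 1, "c": 2, "d": {"r": 1, "t": {"y": 0}}}],
        "D": [{"v": 9}],
    }
)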
I have a CSV file that appears to have the structure of a tree (the file has 3000 lines):
A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a
I would like to parse the file to obtain a table that looks like this:
Parent  Child
A       B
B       B1
B       B2
B2      B2a
A       C
C       C1
C1      C1a
C       C2
C2      C2a1a
Note that the leaves with values B2a and C2a1a have extra commas but still belong to the closest parent.
You can try yaml:
import yaml
import re
import io
s = """A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a"""
s_ = re.sub(r'(,*[\w\d]+)', r'\1:', s)
parsed = yaml.safe_load(io.StringIO(s_.replace(',', ' ')))  # safe_load avoids the Loader argument newer PyYAML requires for yaml.load
def flatten_and_print(d):
    for k, v in d.items():
        if isinstance(v, dict):
            for k2 in v:
                print(k, k2)
            flatten_and_print(v)

flatten_and_print(parsed)
# A B
# A C
# B B1
# B B2
# B2 B2a
# C C1
# C C2
# C1 C1a
# C2 C2a1a
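For reference, printing the transformed string (print(s_.replace(',', ' '))) should show roughly the following nested YAML, where every key has an empty (null) value and the leading spaces come from the original commas:

A:
 B:
  B1:
  B2:
     B2a:
 C:
  C1:
   C1a:
  C2:
     C2a1a: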
You can use a stack for this, which maintains the path from the root to the current node of the input.
As the number of commas can apparently increase by more than one at a time, the stack should store the depth of each element alongside it.
Here is an implementation:
def pairs(csv):
    stack = []
    for line in csv.splitlines():
        name = line.lstrip(",")
        depth = len(line) - len(name)
        name = name.rstrip(",")
        while stack and depth <= stack[-1][0]:
            stack.pop()
        if stack:
            yield stack[-1][1], name
        stack.append((depth, name))
Here is how you could call it:
csv = """A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a"""
for pair in pairs(csv):
    print(*pair)
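For the sample input above, this should print the parent/child pairs from the question:

A B
B B1
B B2
B2 B2a
A C
C C1
C1 C1a
C C2
C2 C2a1a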
In my code, I loop over files and count patterns in each file. My code is as follows:
from collections import defaultdict
import csv, os, re
from itertools import groupby
import glob
def count_kmers(read, k):
    counts = defaultdict(list)
    num_kmers = len(read) - k + 1
    for i in range(num_kmers):
        kmer = read[i:i+k]
        if kmer not in counts:
            counts[kmer] = 0
        counts[kmer] += 1
    for item in counts:
        return(basename, sequence, item, counts[item])

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]
    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d1 = list(''.join(line.strip() for line in g))
                d2 = ''.join(d1)
                complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
                reverse_complement = "".join(complement.get(base, base) for base in reversed(d1))
                d3 = list(''.join(line.strip() for line in reverse_complement))
                d4 = ''.join(d3)
                d5 = (d2+d4)
                counting = count_kmers(d5, 5)
                with open('kmer.out', 'a') as text_file:
                    text_file.write(counting)
And my output looks like this
1035 1 GAGGA 2
1035 1 CGCAT 1
1035 1 TCCCG 1
1035 1 CTCAT 2
1035 1 CCTGG 2
1035 1 GTCCA 1
1035 1 CATGG 1
1035 1 TAGCC 2
1035 1 GCTGC 7
1035 1 TGCAT 1
The code works fine, but I cannot write my output to a file. I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-190-89e3487da562> in <module>()
37 counting = count_kmers(d5, 5)
38 with open('kmer.out', 'w') as text_file:
---> 39 text_file.write(counting)
TypeError: write() argument must be str, not tuple
What am I doing wrong, and how can I solve this problem so that my code writes the output to a txt file?
The original version of count_kmers() did not contain a return statement, which means it had an implicit return None.
As you assigned this to counting, all of your errors become self-explanatory.
After your edit, the end of the function looked like this:
for item in counts:
    return(basename, sequence, item, counts[item])
which returns a tuple with four values. It also exits the function on the first pass through the loop.
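One hedged way to fix both issues is to yield one tuple per k-mer instead of returning on the first pass, and to format each tuple as text before writing. This is only a sketch reusing the question's variable names, not a tested drop-in fix:

def count_kmers(read, k, basename, sequence):
    # count all k-mers, then yield one (basename, sequence, kmer, count) per k-mer
    counts = {}
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    for kmer, n in counts.items():
        yield basename, sequence, kmer, n

# ...inside the loop over FASTA files:
with open('kmer.out', 'a') as text_file:
    for row in count_kmers(d5, 5, basename, sequence):
        text_file.write(' '.join(map(str, row)) + '\n')  # write() needs a str, not a tuple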
I have a text file that contains current between multiple ports like:
current from A:
B - 10
C - 6
Current from B:
A - 11
C - 4
Current from C:
A - 5
B - 5
I need to find the average current between the same pair of ports; my output should look like:
current A-B is 10.5
current A-C is 5.5
current B-C is 4.5
I was thinking of using nested key-value pairs. Is there any other way I could solve this in Python? The code I had in mind was:
import re

pat = re.compile("current from")
current = {}
with open(fileName) as f:
    for line in f:
        if pat.search(line):
            key1 = (line.split()[2])
        elif line != "\n":
            current[key1][line.split()[0]].append(line.split()[2])

for key1 in current:
    for key2 in current[key1]:
        avg = ((current[key1][key2] + current[key2][key1]) / 2)
        print("current " + key1 + "-" + key2 + " is " + str(avg))
How about this:
import re, collections

def extraer_data(fname):
    with open(fname) as file:
        for raw in re.split(r'current from', file.read(), flags=re.IGNORECASE):
            raw = raw.strip()
            if raw:
                key, rest = raw.split(":")
                data = [(c, int(n)) for c, n in re.findall(r"(\w+) - (\d+)", rest)]
                yield (key, data)

def process(fname):
    data = collections.defaultdict(list)
    for p1, ports in extraer_data(fname):
        for p2, val in ports:
            data[frozenset((p1, p2))].append(val)
    for key, val in data.items():
        print("current {} is {}".format("-".join(sorted(key)), sum(val) / len(val)))
Since we are using re, let's try using it to its fullest, or at the very least to the best I can :)
First I take the whole file and split it at current from, which gives us this:
A:
B - 10
C - 6
------------------------------------------
B:
A - 11
C - 4
------------------------------------------
C:
A - 5
B - 5
From there the extraction is easier: split at : to get the first letter, and finally findall to get the pairs and process them accordingly.
>>> list(extraer_data("test.txt"))
[('A', [('B', 10), ('C', 6)]), ('B', [('A', 11), ('C', 4)]), ('C', [('A', 5), ('B', 5)])]
>>>
Once we get the data from the file in the format shown above, it is time to group it into pairs. As the order is irrelevant, I pack each pair into a frozenset so it can be used as a dictionary key, and for that dictionary I use a defaultdict of lists. Once everything is tied up in a nice little package, the rest is a piece of cake:
>>> process("test.txt")
current A-B is 10.5
current B-C is 4.5
current A-C is 5.5
>>>
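For completeness, here is a hedged sketch along the lines of the asker's original nested-dict idea (fileName is a placeholder for the input file; it assumes every pair appears in both directions, as in the sample):

from collections import defaultdict

current = defaultdict(dict)
with open(fileName) as f:
    for line in f:
        line = line.strip()
        if line.lower().startswith("current from"):
            key1 = line.split()[2].rstrip(":")   # "current from A:" -> "A"
        elif line:
            port, value = line.split(" - ")      # "B - 10" -> ("B", "10")
            current[key1][port.strip()] = int(value)

seen = set()
for key1 in current:
    for key2 in current[key1]:
        pair = frozenset((key1, key2))
        if pair in seen:
            continue
        seen.add(pair)
        avg = (current[key1][key2] + current[key2][key1]) / 2
        print("current " + "-".join(sorted(pair)) + " is " + str(avg))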
I want to group some rows in an RDD by key so I can perform more advanced operations on the rows within one group. Please note, I do not want to merely calculate some aggregate values. The rows are key-value pairs, where the key is a GUID and the value is a complex object.
As per the pyspark documentation, I first tried to implement this with combineByKey, as it is supposed to be more performant than groupByKey. The list at the beginning is just for illustration, not my real data:
l = list(range(1000))
numbers = sc.parallelize(l)
rdd = numbers.map(lambda x: (x % 5, x))

def f_init(initial_value):
    return [initial_value]

def f_merge(current_merged, new_value):
    if current_merged is None:
        current_merged = []
    return current_merged.append(new_value)

def f_combine(merged1, merged2):
    if merged1 is None:
        merged1 = []
    if merged2 is None:
        merged2 = []
    return merged1 + merged2

combined_by_key = rdd.combineByKey(f_init, f_merge, f_combine)
c = combined_by_key.collectAsMap()
i = 0
for k, v in c.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
The output of this is:
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
Which is not what I expected. The same logic but implemented with groupByKey returns a correct output:
grouped_by_key = rdd.groupByKey()
d = grouped_by_key.collectAsMap()
i = 0
for k, v in d.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
Returns:
0 0 200
1 1 200
2 2 200
3 3 200
4 4 200
So unless I'm missing something, this is a case where groupByKey is preferred over reduceByKey or combineByKey (the topic of a related discussion: Is groupByKey ever preferred over reduceByKey).
It is a case where understanding the basic APIs is preferred. In particular, if you check the list.append docstring:
?list.append
## Docstring: L.append(object) -> None -- append object to end
## Type: method_descriptor
you'll see that, like the other mutating methods in the Python API, it by convention doesn't return the modified object. This means that f_merge always returns None and there is no accumulation whatsoever.
That being said, for most problems there are much more efficient solutions than groupByKey, but rewriting it with combineByKey (or aggregateByKey) like this is never one of them.
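For reference, a minimal sketch of what a working merge function would have to look like: mutate the accumulator, then return it explicitly, since list.append itself returns None.

def f_merge(current_merged, new_value):
    current_merged.append(new_value)  # append mutates the list in place and returns None
    return current_merged             # so the accumulator must be returned explicitly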