How can I include a specific data from YAML file inside another?

I have a very similar structure and scenario to this question, and it helped me a lot, but I'm looking for a more specific situation: I want to include in my YAML file just one piece of data from another YAML file, not the complete file. Something like:
UPDATED: I corrected the structure of the files below to properly describe my scenario, sorry.
foo.yaml
a: 1
b:
- 1.43
- 543.55
- item : !include {I wanna just the [1, 2, 3] from} bar.yaml
bar.yaml
- 3.6
- [1, 2, 3]
Right now I'm importing the whole second file, but I don't need all of it, and I haven't been able to figure out a proper solution since yesterday. Below is my actual structure:
foo.yaml
variables: !include bar.yaml # I'm importing the entire file for now and have to navigate it to get what I need.
a: 1
b:
- 1.43
- 543.55

You can write your own custom include constructor:
import yaml
bar_yaml = """
- 3.6
- [1, 2, 3]
"""
foo_yaml = """
variables: !include [bar.yaml, 1]
a: 1
b:
- 1.43
- 543.55
"""
def include_constructor(loader, node):
    selector = loader.construct_sequence(node)
    name = selector.pop(0)
    # in actual code, load the file named by name.
    # for this example, we'll ignore the name and load the predefined string
    content = yaml.safe_load(bar_yaml)
    # walk over the selector items and descend into the loaded structure each time.
    for item in selector:
        content = content[item]
    return content
yaml.add_constructor('!include', include_constructor, Loader=yaml.SafeLoader)
print(yaml.safe_load(foo_yaml))
This !include will treat the first item in the given sequence as file name and the following items as sequence indexes (or dictionary keys). E.g. you can do !include [bar.yaml, 1, 2] to only load the 3.
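The descent step can be tried on its own, without any YAML involved. The following is only a sketch: `select` is a hypothetical helper name (not part of PyYAML), and `bar` and `nested` are plain Python stand-ins for already-loaded documents.

```python
from functools import reduce
from operator import getitem

def select(content, selector):
    """Descend into nested lists/dicts, one selector item per level."""
    # reduce applies content[item] for each selector item in turn
    return reduce(getitem, selector, content)

# What bar.yaml from the question loads to
bar = [3.6, [1, 2, 3]]
# Hypothetical document mixing dictionary keys and sequence indexes
nested = {"servers": [{"host": "a.example"}, {"host": "b.example"}]}

print(select(bar, [1]))                          # [1, 2, 3]
print(select(bar, [1, 2]))                       # 3
print(select(nested, ["servers", 1, "host"]))    # b.example
```

An empty selector returns the whole document, which matches including the complete file.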

Related

dictionary comprehension for counting items in the list doesn't work. but normal version works

Edit 2:
Thanks to the guidance of @john-gordon, I found the problem.
I was searching through the wrong list, which I fixed by changing the first case to:
splitted_subs = [] #> a list that will be used for splitting the values before searching in them.
for sub in self.subList: # Splitting the subtitles
    splitted_subs.append(sub.rsplit(".",1)[-1])
subtitle_counter = {sub.rsplit(".",1)[-1]: splitted_subs.count(sub.rsplit(".",1)[-1]) for sub in self.subList}
I'm trying to count the "type of subtitle" I have in a folder.
How it works:
It takes subtitle_list and stores it in self.subList.
Then a for loop iterates over each sub in self.subList.
In the loop, sub is split with rsplit(".",1) and its extension becomes the dictionary key, while self.subList.count(sub) counts the occurrences of that subtitle extension, like {srt : 2}.
There is a strange problem during the splitting phase, when the file extension is extracted from the subtitle's file name.
The code below only works if countType = Multi-line_2; the two other cases don't work.
subtitle_list = ['10_Dutch.srt', '11_Spanish.idx', '12_Finnish.sub', '13_French.srt','4_English.idx','5_French.sub']
countType = "Multi-line_2"
class MovieSub:
    def __init__(self):
        self.subList = subtitle_list
    def subtitle_counter(self):
        match countType:
            case "oneLiner":
                #> One-liner way of counting the different types of subtitles in subList which somehow doesn't work !!! (no it's not because of the self reference of subList., it's because of the split but which part i don't know.)
                subtitle_counter = {sub.rsplit(".",1)[-1]: self.subList.count(sub.rsplit(".",1)[-1]) for sub in self.subList}
            case "Multi-line_1":
                subtitle_counter = {}
                splitted_subs = []
                for sub in self.subList:
                    splitted_subs.append(sub)
                for sub in splitted_subs:
                    splitted_sub = sub.rsplit(".",1)[-1]
                    subtitle_counter[splitted_sub] = splitted_subs.count(splitted_sub)
            case "Multi-line_2":
                subtitle_counter = {}
                splitted_subs = []
                for sub in self.subList: # Splitting the subtitles
                    splitted_subs.append(sub.rsplit(".",1)[-1])
                    # print(sub.rsplit(".",1)[-1])
                for sub in splitted_subs: # Counting the Splitted Subtitles
                    subtitle_counter[sub] = splitted_subs.count(sub)
        print(subtitle_counter)
        #> the multi-line version of the code (lame version)
movie = MovieSub()
movie.subtitle_counter()
the result of "Multi-line_1" and "oneLiner" cases:
>> {'srt': 0, 'idx': 0, 'sub': 0}
result of "Multi-line_2" Case:
>> {'srt': 2, 'idx': 2, 'sub': 2}
I tried to understand how it's possible and only found that it's a problem when I split the file name in the same scope in which I count them (Multi-line_2 case); I don't know if it's relevant to the problem or not.
I will be thankful if anyone could help me find what I'm missing.
edit 1 :
I think there is a need for an explanation.
First of all, my variable names are a bit misleading: splitted_subs and splitted_sub are different variables.
Second, the story of this match case system: my first case, a dictionary comprehension, didn't work, so I tried to debug it by expanding the code, which is the Multi-line_1 case. It didn't work either, so I moved the split to before appending to the list, which is the Multi-line_2 case. From that I understood the problem was with the placement of my split method. But why? That's my question.
So if you add a print statement before the final line of Multi-line_1 like below:
print(splitted_sub)
subtitle_counter[splitted_sub] = splitted_subs.count(splitted_sub)
and another before final line of Multi-line_2 like:
print(sub)
subtitle_counter[sub] = splitted_subs.count(sub)
they will print the same input but not the same results.
Multi-line_1 results:
>> srt
>> idx
>> sub
>> srt
>> idx
>> sub
>> {'srt': 0, 'idx': 0, 'sub': 0}
Multi-line_2 results:
>> srt
>> idx
>> sub
>> srt
>> idx
>> sub
>> {'srt': 2, 'idx': 2, 'sub': 2}

case "Multi-line_1":
    subtitle_counter = {}
    splitted_subs = []
    for sub in self.subList:
        splitted_subs.append(sub)
    for sub in splitted_subs:
        splitted_sub = sub.rsplit(".",1)[-1]
        subtitle_counter[splitted_sub] = splitted_subs.count(splitted_sub)
splitted_sub is just the file extension, i.e. "srt".
But the items in splitted_subs are the full filenames, i.e. "10_Dutch.srt". (The variable name is misleading -- those values are not split.)
So of course .count() returns zero: no element of the list is equal to the bare extension.
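Once the extensions are split out first, the counting itself can be left to collections.Counter from the standard library. This is a sketch of an alternative, not the asker's code; it reuses subtitle_list from the question, and the name extensions is introduced here only for illustration.

```python
from collections import Counter

subtitle_list = ['10_Dutch.srt', '11_Spanish.idx', '12_Finnish.sub',
                 '13_French.srt', '4_English.idx', '5_French.sub']

# Split every file name once, keeping only the part after the last dot
extensions = [name.rsplit(".", 1)[-1] for name in subtitle_list]

# Counter tallies equal items, so the list being counted already holds extensions
subtitle_counter = Counter(extensions)
print(dict(subtitle_counter))  # {'srt': 2, 'idx': 2, 'sub': 2}
```

This also avoids calling .count() once per element, which rescans the whole list every time.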

Run and evaluate imported text file in place as Python code

I have some Python code that is generated dynamically and stored in a text file. It basically consists of various variables like lists and strings that store data. This information is fed to a class to instantiate different objects. How can I feed the data from the text files into the class?
Here is my class:
class SomethingA(Else):
    def construct(self):
        # feed_data_a_here
        self.call_method()
class SomethingB(Else):
    def construct(self):
        # feed_data_b_here
        self.call_method()
Here is some sample content from the text_a file. As you can see, this is valid Python code that I need to feed directly into the object. The call to call_method() depends on this data for its output.
self.height = 12
self.id = 463934
self.name = 'object_a'
Is there any way to load this data into the class without manually copying and pasting all of it from the text file, one line at a time?
Thanks.

I would probably write a parser for your files which would delete 'self.' at the beginning and add the variable to the dictionary:
import re
# You could use a more appropriate regex depending on the expected variable names
regex = r'self\.(?P<var_name>\D+\d*) = (?P<var_value>.*)'
attributes = dict()
with open(path) as file: # path is the location of the generated text file
    for line in file:
        search = re.search(regex, line)
        if search is None: # skip lines that are not of the form self.<name> = <value>
            continue
        var_name = search.group('var_name')
        var_value = search.group('var_value').strip() # remove accidental white spaces
        attributes[var_name] = var_value
foo = classA(**attributes)
example of the regex in work
Edit
If you use the code I've proposed, all items in the dictionary will be of the string type. You can probably try one of the following:
eval(), as proposed by @Welgriv, but with a small modification (eval() cannot run an assignment, so evaluate only the value and assign the result yourself):
attributes[var_name] = eval(var_value)
If your data consists of standard Python data and is properly formatted, you can try using json:
import json
x = '12'
y = '[1, 2, 3]'
z = '{"A": 50.0, "B": 60.0}'
attributes = {}
for i, v in enumerate([x, y, z]):
attributes[f'var{i+1}'] = json.loads(v)
print(attributes)
# Prints
# {'var1': 12, 'var2': [1, 2, 3], 'var3': {'A': 50.0, 'B': 60.0}}

You are probably looking for the eval() and exec() functions. eval() evaluates a single Python expression given as text and returns its value; exec() executes statements, such as assignments. For example:
exec('a = 3')
will create a variable named a equal to 3 (eval('a = 3') would raise a SyntaxError, because an assignment is not an expression). In your case you should open the text file and then execute its content.
Remarks
eval() and exec() present security issues, because whoever controls the text can potentially execute any code.
I'm not sure about the overall context of what you are trying to implement, but you might prefer to store your data (name, id, height...) in another way than Python code, such as key-value pairs, because otherwise your application becomes extremely dependent on the environment. For example, if Python is updated and some of the generated code is deprecated, your application will stop working.
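If the generated lines only ever assign literals (numbers, strings, lists, dicts), a middle ground is ast.literal_eval, which parses a literal without running arbitrary code. The sketch below assumes every relevant line has the form self.<name> = <literal>; SAMPLE, parse_attributes and Something are illustrative names, not part of the question's code.

```python
import ast
import re

# Stand-in for the content of the text_a file shown in the question
SAMPLE = """\
self.height = 12
self.id = 463934
self.name = 'object_a'
"""

LINE_RE = re.compile(r"^self\.(?P<name>\w+)\s*=\s*(?P<value>.+)$")

def parse_attributes(text):
    """Return {name: value} for every 'self.<name> = <literal>' line."""
    attributes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not look like an assignment
        # literal_eval accepts only Python literals, never function calls
        attributes[match["name"]] = ast.literal_eval(match["value"])
    return attributes

class Something:
    def construct(self, text):
        # Copy each parsed attribute onto the instance
        for name, value in parse_attributes(text).items():
            setattr(self, name, value)

obj = Something()
obj.construct(SAMPLE)
print(obj.height, obj.id, obj.name)  # 12 463934 object_a
```

Unlike the regex-only version, the values keep their Python types (12 is an int, 'object_a' is a str).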

Access elements inside yaml using python

I am using yaml and pyyaml to configure my application.
Is it possible to configure something like this -
config.yml -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: $root.repo_root/data
service:
  root: $root.data_root/csv/xyz.csv
yaml loading function -
def load_config(config_path):
    config_path = os.path.abspath(config_path)
    if not os.path.isfile(config_path):
        raise FileNotFoundError("{} does not exist".format(config_path))
    else:
        with open(config_path) as f:
            config = yaml.load(f, Loader=yaml.SafeLoader)
        # logging.info(config)
        logging.info("Config used for run - \n{}".format(yaml.dump(config, sort_keys=False)))
        return DotDict(config)
Current Output-
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: ${root.repo_root}/data
service:
  root: ${root.data_root}/csv/xyz.csv
Desired Output -
root:
  repo_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght
  data_root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data
service:
  root: /home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv
Is this even possible with python? If so any help would be really nice.
Thanks in advance.

A general approach:
read the file as is
search for strings containing $:
determine the "path" of "variables"
replace the "variables" with actual values
An example that recurses into dictionaries and replaces variables inside strings:
import re, pprint, yaml
def convert(input,top=None):
    """Replaces $key1.key2 with actual values. Modifies input in-place"""
    if top is None:
        top = input # top should be the original input
    if isinstance(input,dict):
        ret = {k:convert(v,top) for k,v in input.items()} # recursively convert items
        if input != ret: # in case order matters, do it one or several times more until no change happens
            ret = convert(ret)
        input.update(ret) # update original input
        return input # return updated input (for the case of recursion)
    if isinstance(input,str):
        vars = re.findall(r"\$[\w_\.]+",input) # find $key_1.key_2.keyN sequences
        for var in vars:
            keys = var[1:].split(".") # remove dollar and split by dots to make "key chain"
            val = top # starting from top ...
            for k in keys: # ... for each key in the key chain ...
                val = val[k] # ... go one level down
            input = input.replace(var,val) # replace $key sequence with actual value
        return input # return modified input
    return input # other types (int, float, ...) are returned unchanged; TODO: descend into lists
with open("in.yml") as f:
    config = yaml.safe_load(f) # load as is (yaml.load without a Loader fails on recent PyYAML)
convert(config) # convert it (in-place)
pprint.pprint(config)
Output:
{'root': {'data_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data',
'repo_root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght'},
'service': {'root': '/home/raghhuveer/code/data_science/papers/cv/AlexNet_lght/data/csv/xyz.csv'}}
Note: YAML is not that important here, would work also with JSON, XML or other formats.
Note2: If you use exclusively YAML and exclusively python, some answers from this post may be useful (using anchors and references and application specific local tags)
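Since the substitution does not depend on YAML, it can be sketched on a plain dict. The variant below is an assumption-laden sketch rather than the code above: it returns a new structure instead of modifying in place, also descends into lists, and leaves non-string values untouched. The names VAR_RE, lookup and resolve are made up for illustration, the paths are shortened placeholders, and every $a.b reference is assumed to point at an existing key (a missing one raises KeyError).

```python
import re

# $key or $key1.key2...; a trailing dot or slash is not part of the reference
VAR_RE = re.compile(r"\$(\w+(?:\.\w+)*)")

def lookup(top, dotted):
    """Follow a dotted key chain such as 'root.repo_root' from the top-level dict."""
    value = top
    for key in dotted.split("."):
        value = value[key]
    return value

def resolve(node, top=None):
    """Return a copy of node with every $a.b reference replaced by its value."""
    if top is None:
        top = node
    if isinstance(node, dict):
        return {key: resolve(value, top) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, top) for item in node]
    if isinstance(node, str):
        # The referenced value is resolved recursively, so chains of references work
        return VAR_RE.sub(lambda m: str(resolve(lookup(top, m.group(1)), top)), node)
    return node  # int, float, None, ... are returned unchanged

config = {
    "root": {
        "repo_root": "/home/user/project",
        "data_root": "$root.repo_root/data",
    },
    "service": {
        "root": "$root.data_root/csv/xyz.csv",
        "inputs": ["$root.data_root/a.csv", 3],
    },
}

resolved = resolve(config)
print(resolved["service"]["root"])    # /home/user/project/data/csv/xyz.csv
print(resolved["service"]["inputs"])  # ['/home/user/project/data/a.csv', 3]
```

Because each reference is looked up in the original structure and resolved on demand, the result does not depend on the order of the keys. A circular reference would recurse until Python raises RecursionError.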

Editing pyparsing parse results

This is similar to a question I've asked before.
I have written a pyparsing grammar logparser for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and after that function A finishes and gives us its return value. Due to this, the log file is very difficult to read by hand because the call information and return value information of one function can be thousands of lines apart.
My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:
[0]: # first log
- input_blocks:
[0]:
- func_name: 'Foo'
- parameters: ...
- thread: '123'
- timestamp_in: '12:01'
[1]:
- func_name: 'Bar'
- parameters: ...
- thread: '456'
- timestamp_in: '12:02'
- output_blocks:
[0]:
- func_name: 'Bar'
- func_time: '1'
- parameters: ...
- thread: '456'
- timestamp_out: '12:03'
[1]:
- func_name: 'Foo'
- func_time: '3'
- parameters: ...
- thread: '123'
- timestamp_out: '12:04'
[1]: # second log
- input_blocks:
...
- output_blocks:
...
... # n-th log
I want to solve the problem that input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:
[0]: # first log
- function_blocks:
[0]:
- input_block:
- func_name: 'Foo'
- parameters: ...
- thread: '123'
- timestamp_in: '12:01'
- output_block:
- func_name: 'Foo'
- func_time: '3'
- parameters: ...
- thread: '123'
- timestamp_out: '12:04'
[1]:
- input_block:
- func_name: 'Bar'
- parameters: ...
- thread: '456'
- timestamp_in: '12:02'
- output_block:
- func_name: 'Bar'
- func_time: '1'
- parameters: ...
- thread: '456'
- timestamp_out: '12:03'
[1]: # second log
- function_blocks:
[0]: ...
[1]: ...
... # n-th log
To achieve this, I define a function rearrange which iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as parse action for the log grammar: logparser.setParseAction(rearrange)
def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                pass # placeholder: this is the part I am missing
    return log_token
My question is: How do I put the matching output_block and input_block in a function_block in a way that I still enjoy the easy access methods of pyparsing.ParseResults?
My idea looks like this:
def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')
    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop()) # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token
This doesn't work though. The addition causes a maximum recursion error and the .pop() doesn't work as expected. It doesn't pop the whole block, it just pops the last entry in that block. Also, it doesn't actually remove that entry either: it just removes it from the list, but it's still accessible by its results name.
It's also possible that some of the input_blocks don't have a corresponding output_block (for example if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks, output_blocks (for the spare blocks), and function_blocks (for the matching blocks).
Thanks for your help!
EDIT:
I made a simpler example to show my problem. Also, I experimented around and have a solution which kind of works but is a bit messy. I must admit there was a lot of trial-and-error included because I neither found documentation on nor can make sense of the inner workings of ParseResults and how to properly create my own nested ParseResults-structure.
from pyparsing import *
def main():
    log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''
    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') +'_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)
    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())
    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())
def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False) # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output'] # remove duplicate data
                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token
if __name__ == '__main__':
    main()
Output:
***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
[0]:
['Func1']
- func_name: 'Func1'
[1]:
['Func2']
- func_name: 'Func2'
[2]:
['Func3']
- func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
[0]:
['Func2']
- func_name: 'Func2'
[1]:
['Func1']
- func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []] # why is this duplicated? I just want the inner function_blocks!
- function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
[0]:
[['Func1'], ['Func1']]
- input: ['Func1']
- func_name: 'Func1'
- output: ['Func1']
- func_name: 'Func1'
[1]:
[['Func2'], ['Func2']]
- input: ['Func2']
- func_name: 'Func2'
- output: ['Func2']
- func_name: 'Func2'
[2]: # where does this come from?
[[], []]
- input: []
- output: []
- input_blocks: [[], [], ['Func3']]
[0]: # how do I delete these indexes?
[] # I think I only cleared their contents
[1]:
[]
[2]:
['Func3']
- func_name: 'Func3'
- output_blocks: [[], []]
[0]:
[]
[1]:
[]

This version of rearrange addresses most of the issues I see in your example:
def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them from their original positions in log_token
                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])
                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]
                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False) # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output'] # remove duplicate name references
                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()
                # match found, no need to keep going looking for a matching output block
                break
    # find all input blocks that weren't cleared (had no matching output block) and append as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False) # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input'] # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()
    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)
    return log_token
And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:
# trailing '*' on the results name is equivalent to listAllMatches=True
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
log = OneOrMore(input_block | output_block)
log.addParseAction(rearrange)
Since rearrange updated log_token in place, if you make it a parse action, the ending return statement would be unnecessary.
It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.
Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs and I don't really see a good home for this topic.
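Independently of how ParseResults is assembled, the pairing logic itself can be tried on plain dicts. The sketch below is not pyparsing code: pair_blocks is a hypothetical helper, the dicts stand in for parsed blocks, matching uses only func_name and thread (the timestamp check from the question is left out), and the 'Baz' entry is invented to show an input without an output.

```python
def pair_blocks(input_blocks, output_blocks):
    """Pair each input block with the first unused output block of the same name and thread."""
    unused_outputs = list(output_blocks)
    function_blocks, spare_inputs = [], []
    for inp in input_blocks:
        match = next(
            (out for out in unused_outputs
             if out["func_name"] == inp["func_name"] and out["thread"] == inp["thread"]),
            None,
        )
        if match is None:
            spare_inputs.append(inp)      # e.g. the process crashed before returning
        else:
            unused_outputs.remove(match)  # an output block is used at most once
            function_blocks.append({"input_block": inp, "output_block": match})
    return function_blocks, spare_inputs, unused_outputs

inputs = [
    {"func_name": "Foo", "thread": "123", "timestamp_in": "12:01"},
    {"func_name": "Bar", "thread": "456", "timestamp_in": "12:02"},
    {"func_name": "Baz", "thread": "789", "timestamp_in": "12:05"},  # never finishes
]
outputs = [
    {"func_name": "Bar", "thread": "456", "func_time": "1", "timestamp_out": "12:03"},
    {"func_name": "Foo", "thread": "123", "func_time": "3", "timestamp_out": "12:04"},
]

function_blocks, spare_inputs, spare_outputs = pair_blocks(inputs, outputs)
print([fb["input_block"]["func_name"] for fb in function_blocks])  # ['Foo', 'Bar']
print([blk["func_name"] for blk in spare_inputs])                  # ['Baz']
```

Keeping a separate list of unused outputs plays the same role as clearing matched blocks in rearrange: a block that has been paired cannot be matched a second time.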

another demo for pyparsing setParseAction:
remove whitespace before the first value, preserve whitespace between values
i tried to solve this with pp.Optional(pp.White(' \t')).suppress()
but then i got a = ["b=1"] (parser did not stop at end-of-line)
import pyparsing as pp
def lstrip_first_value(src, loc, token):
    "remove whitespace before first value"
    # based on https://stackoverflow.com/a/51335710/10440128
    if token == []:
        return token
    # update the values
    copy = token[:]
    copy[0] = copy[0].lstrip()
    if copy[0] == "" and len(copy) > 1:
        copy = copy[1:]
    # update the token
    token.clear()
    token.extend(copy)
    token["value"] = copy
    return token
# Value has to be defined before Values, which refers to it
Value = pp.Combine(
    pp.QuotedString(quoteChar='"', escChar="\\")
    | pp.White(' \t') # parse whitespace to separate token
)
Values = (
    pp.OneOrMore(Value.leaveWhitespace())
    | pp.Empty().setParseAction(pp.replaceWith(""))
)("value").setParseAction(lstrip_first_value)
inputs
a=
b=2
a =
b=2
the values of a should always be [""]

YAML file update and delete using python?

I have a YAML file and it looks like below
test:
- exam.com
- exam1.com
- exam2.com
test2:
- examp.com
- examp1.com
- examp2.com
I like to manage this file using python.
Task is, I like to add an entry under "test2" and delete entry from "test".

You first have to load the data, which will give you a top-level dict (in a variable called data in the following example) whose values are lists. On those lists you can use del to remove an entry and insert() (or append()) to add one.
import sys
import ruamel.yaml
yaml_str = """\
test:
- exam.com
- exam1.com
- exam2.com
test2:
- examp.com
- examp1.com # want to insert after this
- examp2.com
"""
data = ruamel.yaml.round_trip_load(yaml_str)
del data['test'][1]
data['test2'].insert(2, 'examp1.5')
ruamel.yaml.round_trip_dump(data, sys.stdout, block_seq_indent=1)
gives:
test:
- exam.com
- exam2.com
test2:
- examp.com
- examp1.com # want to insert after this
- examp1.5
- examp2.com
The block_seq_indent=1 is necessary as by default ruamel.yaml will left align a sequence value with the key.¹
If you want to get rid of the comment in the output you can do:
data['test2']._yaml_comment = None
¹ This was done using ruamel.yaml a YAML 1.2 parser, of which I am the author.
