Parsing Yaml in Python: Detect duplicated keys

The yaml library in Python does not detect duplicated keys. This bug was reported years ago and has still not been fixed.
I would like to find a decent workaround for this problem. How feasible would it be to write a regex that returns all the keys? It would then be quite easy to detect duplicates.
Could any regex master suggest a regex that extracts all the keys so that duplicates can be found?
File example:
mykey1:
    subkey1: value1
    subkey2: value2
    subkey3:
      - value 3.1
      - value 3.2
mykey2:
    subkey1: this is not duplicated
    subkey5: value5
    subkey5: duplicated!
    subkey6:
        subkey6.1: value6.1
        subkey6.2: valye6.2

The yamllint command-line tool does what you want:
sudo pip install yamllint
Specifically, it has a rule key-duplicates that detects repetitions and keys overwriting one another:
$ yamllint test.yaml
test.yaml
1:1 warning missing document start "---" (document-start)
10:5 error duplication of key "subkey5" in mapping (key-duplicates)
(It has many other rules that you can enable/disable or tweak.)

Overriding one of the built-in loaders is a more lightweight approach:
import yaml

# Special loader with duplicate key checking
class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = []
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            # Fail as soon as a key appears twice in the same mapping
            assert key not in mapping, "duplicate key: {!r}".format(key)
            mapping.append(key)
        return super().construct_mapping(node, deep)
then:
with open(filename, 'r') as f:
    data = yaml.load(f, Loader=UniqueKeyLoader)

Related

Extract text from a config file [duplicate]

I'm using a config file to inform my Python script of a few key-values, for use in authenticating the user against a website.
I have three variables: the URL, the user name, and the API token.
I've created a config file with each key on a different line, so:
url:<url string>
auth_user:<user name>
auth_token:<API token>
I want to be able to extract the text after the key words into variables, also stripping any "\n" that exist at the end of the line. Currently I'm doing this, and it works but seems clumsy:
with open(argv[1], mode='r') as config_file:
    lines = config_file.readlines()
    for line in lines:
        url_match = match('jira_url:', line)
        if url_match:
            jira_url = line[9:].split("\n")[0]
        user_match = match('auth_user:', line)
        if user_match:
            auth_user = line[10:].split("\n")[0]
        token_match = match('auth_token', line)
        if token_match:
            auth_token = line[11:].split("\n")[0]
Can anybody suggest a more elegant solution? Specifically it's the ... = line[10:].split("\n")[0] lines that seem clunky to me.
I'm also slightly confused why I can't reuse my match object within the for loop, and have to create new match objects for each config item.

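You could use a .yml file and read the values with the yaml.load() function. Note that YAML needs a space after the colon (url: <url string>), otherwise the line is not read as a key-value pair.
import yaml
with open('settings.yml') as file:
    settings = yaml.load(file, Loader=yaml.FullLoader)
Now you can access elements like settings["url"] and so on.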

If the format is always <tag>:<value> you can easily parse it by splitting the line at the colon and filling up a custom dictionary:
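with open(filename, "r") as config_file:
    lines = config_file.readlines()
settings = dict()
for l in lines:
    # rstrip removes the line ending without eating a character
    # when the last line has no trailing newline
    elements = l.rstrip('\n').split(':')
    # re-join so that values containing colons (e.g. URLs) stay intact
    settings[elements[0]] = ':'.join(elements[1:])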
So you get a dictionary that has the tags as keys and the values as values. You can then just refer to these dictionary entries in your program (e.g. if you need the auth_token, just use settings["auth_token"]).

If you can add one line to the config file, configparser is a good choice:
https://docs.python.org/3/library/configparser.html
[1] config file : 1.cfg
# configparser requires a section header
[DEFAULT]
url:<url string>
auth_user:<user name>
auth_token:<API token>
[2] python scripts
import configparser
config = configparser.ConfigParser()
config.read('1.cfg')
print(config.get('DEFAULT','url'))
print(config.get('DEFAULT','auth_user'))
print(config.get('DEFAULT','auth_token'))
[3] output
<url string>
<user name>
<API token>
configparser's methods are also useful when you can't guarantee that the config file is always complete.
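The point about incomplete config files can be illustrated with the fallback argument of config.get. This is only a minimal sketch: the sample values are made up, and the auth_token key is left out on purpose to show what happens when an entry is missing.

```python
import configparser

# Hypothetical config text; "auth_token" is deliberately missing.
SAMPLE = """\
[DEFAULT]
url: https://example.invalid
auth_user: alice
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

url = config.get("DEFAULT", "url")
# With fallback given, a missing option returns the fallback
# instead of raising configparser.NoOptionError.
auth_token = config.get("DEFAULT", "auth_token", fallback=None)

print(url)
print(auth_token)
```

The script can then check for None and report a clear error, rather than failing somewhere later with a less obvious message.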

You have a couple of great answers already, but I wanted to step back and provide some guidance on how you might approach these problems in the future. Getting quick answers sometimes prevents you from understanding how those people knew about the answers in the first place.
When you zoom out, the first thing that strikes me is that your task is to provide config, using a file, to your program. Software has the remarkable property of solve-once, use-anywhere. Config files have been a problem worth solving for at least 40 years, so you can bet your bottom dollar you don't need to solve this yourself. And already-solved means someone has already figured out all the little off-by-one and edge-case dramas like stripping line endings and dealing with expected input. The challenge of course, is knowing what solution already exists. If you haven't spent 40 years peeling back the covers of computers to see how they tick, it's difficult to "just know". So you might have a poke around on Google for "config file format" or something.
That would lead you to one of the most prevalent config file systems on the planet - the INI file. Just as useful now as it was 30 years ago, and as a bonus, looks not too dissimilar to your example config file. Then you might search for "read INI file in Python" or something, and come across configparser and you're basically done.
Or you might see that sometime in the last 30 years, YAML became the more trendy option, and wouldn't you know it, PyYAML will do most of the work for you.
But none of this gets you any better at using Python to extract from text files in general. So zooming in a bit, you want to know how to extract parts of lines in a text file. Again, this problem is an age-old problem, and if you were to learn about this problem (rather than just be handed the solution), you would learn that this is called parsing and often involves tokenisation. If you do some research on, say "parsing a text file in python" for example, you would learn about the general techniques that work regardless of the language, such as looping over lines and splitting each one in turn.
Zooming in one more step closer, you're looking to strip the new line off the end of the string so it doesn't get included in your value. Once again, this ain't a new problem, and with the right keywords you could dig up the well-trodden solutions. This is often called "chomping" or "stripping", and with some careful search terms, you'd find rstrip() and friends, and not have to do awkward things like splitting on the '\n' character.
Your final question is about re-using the match object. This is much harder to research. But again, the "solution" won't necessarily show you where you went wrong. What you need to keep in mind is that the statements in the for loop are sequential. To think them through you should literally execute them in your mind, one after another, and imagine what's happening. Each time you call match, it returns either None or a Match object. You never use the object, except to check for truthiness in the if statement. And the next time you call match, you do so with different arguments, so you get a new Match object (or None). Therefore, you don't need to keep the object around at all. You can simply do:
if match('jira_url:', line):
    jira_url = line[9:].split("\n")[0]
if match('auth_user:', line):
    auth_user = line[10:].split("\n")[0]
and so on. Not only that, if the first if triggered then you don't need to bother calling match again - it will certainly not trigger any of the other matches for the same line. So you could do:
if match('jira_url:', line):
    jira_url = line[9:].rstrip()
elif match('auth_user:', line):
    auth_user = line[10:].rstrip()
and so on.
But then you can start to think - why bother doing all these matches on the colon, only to then manually split the string at the colon afterwards? You could just do:
# split at the first colon only, so a value such as a URL stays in one piece
tokens = line.rstrip().split(':', 1)
if tokens[0] == 'jira_url':
    jira_url = tokens[1]
elif tokens[0] == 'auth_user':
    auth_user = tokens[1]
If you keep making these improvements (and there are lots more to make!), eventually you'll end up re-writing configparser, but at least you'll have learned why it's often a good idea to use an existing library where practical!
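To make the end point of these refinements concrete, here is one possible sketch that collects every key into a dictionary instead of testing for each key by name. The function name parse_config and the sample lines are illustrative only, not part of the original question.

```python
def parse_config(lines):
    """Return a dict mapping each key to the text after the first colon."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # partition splits at the first colon only, so URLs survive intact
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line, so it is not a key:value pair
        settings[key.strip()] = value.strip()
    return settings


# Made-up sample input, shaped like the config file in the question.
sample = [
    "jira_url:https://example.invalid/jira\n",
    "auth_user:alice\n",
    "\n",
    "auth_token:s3cr3t\n",
]
settings = parse_config(sample)
print(settings["jira_url"])
```

Silently skipping lines without a colon is a design choice; a stricter version could raise an error there instead.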

Dictionary lookup of loaded yaml via a string representation in Python

So I am using ruamel to read a yaml file located in github and all goes well. I can see it loaded correctly. Now my scenario is that this load is in a function in a class that I am accessing. Now this function has a string variable "entry" which is what I want to search for. The goal is to search at varying depths and I know the locations so I am not hunting for it.
Sample Yaml file:
description: temp_file
image: beta1
group: dev
depends:
  - image: base
    basetag: 1.0.0
Entry string variable passing in
I want to keep the variable "entry" a string, because a top-level lookup using .get returns my value just fine. It's just that if I want to fetch something like the value of ["depends"][0]["image"], I cannot figure out how to construct the string so that the get works.
entry = "image" # works fine
entry = '["depends"][0]["image"]' #never works
gho.get_entry_from_ivpbuild_yml(repo, commit, entry)
Code in Class
# imports
from ruamel.yaml import YAML
yaml = YAML()
yaml.preserve_quotes = True
def get_entry_from_loaded_yml(self, repo, commit, entry, failflag=True):
    """
    :param repo: str (github repo)
    :param commit: str (sha of commit to view)
    :param entry: str (entry within yml you want to search for)
    :param failflag: bool (Determines if script fails or not if entry is not found within yaml)
    :return: str: (value of entry in yml you want to search for)
    """
    yml_file = "sample.yaml"
    try:
        logger.debug("opening yaml for commit:{}".format(commit))
        yml = self.gho.get_repo(repo).get_file_contents(yml_file, commit)
    except Exception as e:
        logger.error("Could not open yaml file:{} for repo:{} commit:{}:{}".format(yml_file, repo, commit, e))
        sys.exit(1)
    loaded = yaml.load(yml.decoded_content)
    if not loaded.get(entry, default=None):
        logger.error("Could not find value for {} in {}".format(entry, yml_file))
        if failflag:
            sys.exit(1)
        return None

As you already know, you can't pass something like the string '["depends"][0]["image"]' to dict.get and expect that to work. But there are a couple of options if you really need to specify a "path" to the object within a nested data structure like this.
The first is to do it explicitly, just passing a sequence of keys instead of a single key:
def get_entry_from_loaded_yml(self, repo, commit, entry_keys, failflag=True):
    # ...
    try:
        entry = loaded
        for key in entry_keys:
            entry = entry[key]
    except LookupError:
        # report the requested keys; entry has been overwritten by the walk above
        logger.error("Could not find value for {} in {}".format(entry_keys, yml_file))
        if failflag:
            sys.exit(1)
        return None
    else:
        return entry
And now, you can do this:
gho.get_entry_from_ivpbuild_yml(repo, commit, ('depends', 0, 'image'))
Alternatively, you can use a library that handles "key paths", in a format like dpath (which is essentially a simplified version of XPath) or ObjectiveC's KVC. There are multiple libraries on PyPI for doing this (although some work on undecoded JSON strings rather than decoded nested objects, to allow searching huge JSON texts efficiently; those obviously won't work for you… and I don't know of any that work on YAML instead of JSON, but they might exist). Then your code would look something like this:
def get_entry_from_loaded_yml(self, repo, commit, entry, failflag=True):
    # ...
    result = dpath_search(loaded, entry, default=None)
    if result is None:
        logger.error("Could not find value for {} in {}".format(entry, yml_file))
        if failflag:
            sys.exit(1)
        return None
    else:
        return result
# ...
gho.get_entry_from_ivpbuild_yml(repo, commit, 'depends/0/image')
This has the advantage that if you ever need to look up a (possibly nested) sequence of multiple values, it can be as simple as this:
for result in dpath_search(loaded, 'depends/*/image'):
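If pulling in a key-path library is more than you need, the slash-separated form can also be resolved by hand. The following is only a sketch: lookup_path is a hypothetical helper, plain dicts and lists stand in for the loaded YAML, and wildcards such as * are not supported.

```python
def lookup_path(data, path, sep="/"):
    """Walk nested dicts/lists following a separator-delimited path."""
    current = data
    for part in path.split(sep):
        if isinstance(current, list):
            # a path segment that addresses a list must be an integer index
            current = current[int(part)]
        else:
            current = current[part]
    return current


# Stand-in for the data loaded from the sample YAML file.
loaded = {
    "description": "temp_file",
    "image": "beta1",
    "group": "dev",
    "depends": [{"image": "base", "basetag": "1.0.0"}],
}

print(lookup_path(loaded, "image"))
print(lookup_path(loaded, "depends/0/image"))
```

A missing key or index raises KeyError or IndexError (both subclasses of LookupError), so the same except LookupError handling as in the explicit-keys version applies; a non-numeric segment used on a list raises ValueError.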

YAML response from Flask MySQL query doesn't seem formatted correctly [duplicate]

I'm writing a file type converter using Python and PyYAML for a project where I am translating to and from YAML files multiple times. These files are then used by a separate service that I have no control over, so I need to translate the YAML back into the same form in which I originally received it. My original file has sections like the following:
key:
- value1
- value2
- value3
Which evaluates to {key: [value1,value2,value3]} using yaml.load(). When I translate this back to YAML my new file reads like this:
key: [value1,value2,value3]
My question is whether these two forms are equivalent as far as the various language parsers of YAML files are concerned. Obviously using PyYaml, these are equivalent, but does this hold true for Ruby or other languages, which the application is using? If not, then the application will not be able to display the data properly.

As Jordan already pointed out the node style is a serialization detail. And the output is equivalent to your input.
With PyYAML you can get the same block style output by using the default_flow_style keyword when dumping:
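import sys
import yaml

# safe_load is used because current PyYAML requires an explicit Loader for yaml.load
yaml.dump(yaml.safe_load("""\
key:
- value1
- value2
- value3
"""), sys.stdout, default_flow_style=False)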
gives you:
key:
- value1
- value2
- value3
If you use the round-trip capabilities of ruamel.yaml (disclaimer: I am the author of that package) you could do:
import sys
from ruamel.yaml import YAML

yaml = YAML()  # round-trip mode is the default
yaml_str = """\
key:
- value1
- value2 # this is the second value
- value3
"""
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
to get:
key:
- value1
- value2 # this is the second value
- value3
Not only does it preserve the flow/block style, it also transparently preserves the comment, the key ordering, and more. This makes comparison (e.g. when using a revision control system to check in the YAML file) much easier.
For the service reading the YAML file this all makes no difference, but for the ease of checking whether you are transforming things correctly, it does.

Yes, to any YAML parser that follows the spec, they are equivalent. You can read the spec here: http://www.yaml.org/spec/1.2/spec.html
Section 3.2.3.1 is particularly relevant (emphasis mine):
3.2.3.1. Node Styles
Each node is presented in some style, depending on its kind. The node style is a presentation detail and is not reflected in the serialization tree or representation graph. There are two groups of styles. Block styles use indentation to denote structure. In contrast, flow styles rely on explicit indicators.
To clarify, a node is any structure in YAML, including arrays (called sequences in the spec). The single-line style is called a flow sequence (see section 7.4.1) and the multi-line style is called a block sequence (section 8.2.1). A compliant parser will deserialize both into identical objects.

Working with Parameters containing Escaped Characters in Python Config file

I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT=identify_MT(msgtext)
schema_file = config.get(MT,'kbfile')
fold_text = config.get(MT,'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text contained in a dictionary that matches the 'fold' parameter. I look for that text using the following function:
def test(find_text):
    return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied here when reading/handling the \n character pattern in from the config file, but don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do to get the behavior I want? Or perhaps there's an alternate solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even self-built module to lift data out of a parameter file.
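One possible explanation, offered as a sketch rather than a confirmed diagnosis: configparser does not interpret backslash escapes, so the value read from the file ends in the two characters backslash and n, while the Python literal ':16S:TRANSDET\n' ends in a single newline character. The example below reproduces the section with read_string and decodes the escapes afterwards; the decoding step shown is only safe for ASCII values.

```python
import configparser

# Raw string so that \n below is two characters, as it would be on disk.
INI_TEXT = r"""
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

raw = config.get("536", "fold")
# Interpret backslash escapes; this simple form assumes an ASCII-only value.
decoded = raw.encode("ascii").decode("unicode_escape")

print(repr(raw))
print(repr(decoded))
```

Passing decoded rather than raw to the search function should then behave like the hard-coded literal, assuming the dictionary values really contain a newline character.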

PyYAML : Control ordering of items called by yaml.load()

I have a yaml setting file which creates some records in db:
setting1:
  name: [item,item]
  name1: text
anothersetting2:
  name: [item,item]
  sub_setting:
    name :[item,item]
When I update this file with setting3 and regenerate the records in the db by:
import yaml
fh = open('setting.txt', 'r')
setting_list = yaml.load(fh)
for i in setting_list:
    add_to_db[i]
it's vital that the order of the settings (id numbers in the db) stays the same each time, as I'm adding them to the db, and that setting3 just gets appended at the end of what yaml.load() returns, so that its id doesn't confuse any records which are already in the db.
At the moment, each time I add another setting and call yaml.load(), the records get loaded in a different order, which results in different ids. I would welcome any ideas ;)
EDIT:
I've followed abarnert tips and took this gist https://gist.github.com/844388
Works as expected thanks !

My project oyaml is a drop-in replacement for PyYAML, which will load maps into collections.OrderedDict instead of regular dicts. Just pip install it and use as normal - works on both Python 3 and Python 2.
Demo with your example:
>>> import oyaml as yaml # pip install oyaml
>>> yaml.load('''setting1:
...   name: [item,item]
...   name1: text
... anothersetting2:
...   name: [item,item]
...   sub_setting:
...     name :[item,item]''')
OrderedDict([('setting1',
OrderedDict([('name', ['item', 'item']), ('name1', 'text')])),
('anothersetting2',
OrderedDict([('name', ['item', 'item']),
('sub_setting', 'name :[item,item]')]))])
Note that if the stdlib dict is order preserving (Python >= 3.7, CPython >= 3.6) then oyaml will use an ordinary dict.

You can now use ruamel.yaml for this.
From https://pypi.python.org/pypi/ruamel.yaml:
ruamel.yaml is a YAML parser/emitter that supports roundtrip
preservation of comments, seq/map flow style, and map key order

The YAML spec clearly says that the key order within a mapping is a "representation detail" that cannot be relied on. So your settings file is already invalid if it's relying on the mapping order, and you'd be much better off using valid YAML, if at all possible.
Of course YAML is extensible, and there's nothing stopping you from adding an "ordered mapping" type to your settings files. For example:
!omap setting1:
  name: [item,item]
  name1: text
!omap anothersetting2:
  name: [item,item]
  !omap sub_setting:
    name :[item,item]
You didn't mention which yaml module you're using. There is no such module in the standard library, and there are at least two packages just on PyPI that provide modules with that name. However, I'm going to guess it's PyYAML, because as far as I know that's the most popular.
The extension described above is easy to parse with PyYAML. See http://pyyaml.org/ticket/29:
def omap_constructor(loader, node):
    return loader.construct_pairs(node)
yaml.add_constructor(u'!omap', omap_constructor)
Now, instead of:
{'anothersetting2': {'name': ['item', 'item'],
'sub_setting': 'name :[item,item]'},
'setting1': {'name': ['item', 'item'], 'name1': 'text'}}
You'll get this:
(('anothersetting2', (('name', ['item', 'item']),
('sub_setting', ('name, [item,item]'),))),
('setting1', (('name', ['item', 'item']), ('name1', 'text'))))
Of course this gives you a tuple of key-value tuples, but you can easily write a construct_ordereddict and get an OrderedDict instead. You can also write a representer that stores OrderedDict objects as !omaps, if you need to output as well as input.
If you really want to hook PyYAML to make it use an OrderedDict instead of a dict for default mappings, it's pretty easy to do if you're already working directly on parser objects, but more difficult if you want to stick with the high-level convenience methods. Fortunately, the above-linked ticket has an implementation you can use. Just remember that you're not using real YAML anymore, but a variant, so any other software that deals with your files can, and likely will, break.

For a single item that is known to be an ordered dictionary, just make its entries a list and use collections.OrderedDict:
setting1:
  - name: [item,item]
  - name1: text
anothersetting2:
  - name: [item,item]
  - sub_setting:
      name :[item,item]
import collections
import yaml
with open('setting.txt', 'r') as fh:
    # safe_load is used because current PyYAML requires an explicit Loader for yaml.load
    setting_list = yaml.safe_load(fh)
setting1 = collections.OrderedDict(list(x.items())[0] for x in setting_list['setting1'])
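The conversion step can be tried without a YAML file at all. In this sketch a hand-written structure stands in for what the loader returns for the list-style layout above, and to_ordered is a hypothetical helper name.

```python
import collections

# Stand-in for the loaded data: each list entry is a single-key mapping,
# and the list itself carries the order.
setting_list = {
    "setting1": [{"name": ["item", "item"]}, {"name1": "text"}],
}


def to_ordered(entries):
    """Merge a list of single-key mappings into one OrderedDict, in list order."""
    ordered = collections.OrderedDict()
    for entry in entries:
        for key, value in entry.items():
            ordered[key] = value
    return ordered


setting1 = to_ordered(setting_list["setting1"])
print(list(setting1))
```

Because the order comes from the YAML sequence rather than from the mapping, this layout stays valid YAML and does not depend on how a particular parser orders mapping keys.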

Last I heard, PyYAML did not support this, though it would probably be easy to modify it to accept a dictionary or dictionary-like object as a starting point.
