Is there someplace that fully describes use of config data in snakemake rules?
There is an example of this in the user guide, in a yaml file:
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} with "data/samples/A.fastq" rather than with "A" (and "B", etc.), as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the Snakemake rules? When do I use Python syntax and when do I use Snakemake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: and include a Python file to define parameters?
A useful thing would be a reference manual that describes the details of Snakemake thoroughly. The current website is kind of scattered; it takes a while to find things you remember seeing somewhere in it previously.
How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
    params: config=config
    output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
    params: config=config
    output: config["OutputDir"] + "/myfile.txt"
    shell: 'echo "OutputDir is {params.config[OutputDir]}" > {output}'
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
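As a plain-Python illustration (outside Snakemake), this expansion behaves like Python's str.format, which is why the keys take no quotes. This sketch simplifies params down to a plain dict:

```python
# Snakemake's {params...} expansion uses Python's str.format under the hood;
# in format syntax the text between [] is taken as a literal key, so quoting
# it would actually look up the wrong key.
params = {"config": {"OutputDir": "DATA_DIR"}}
cmd = 'echo "OutputDir is {params[config][OutputDir]}" > out.txt'.format(params=params)
print(cmd)  # echo "OutputDir is DATA_DIR" > out.txt
```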
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
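To illustrate why the Python route helps, here is a hypothetical Python-side configuration (all names made up) in which later values are derived from earlier ones, something plain YAML cannot express:

```python
# Hypothetical Python configuration: unlike plain YAML, later entries can be
# built from earlier ones.
config = {}
config['ProjectRoot'] = "/data/project"
config['OutputDir'] = config['ProjectRoot'] + "/results"
config['LogDir'] = config['OutputDir'] + "/logs"
print(config['LogDir'])  # /data/project/results/logs
```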
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
    OutputDir: DATA_DIR

snakefile:
    configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
    config['OutputDir'] = "DATA_DIR"

snakefile:
    include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
    OutputDir = "DATA_DIR"

snakefile:
    include: 'xxx.py'
    rule:
        output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
    A: 10
    B: [1, 2, 99]
    C:
        nst1: "hello"
        nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
    'X': -2,
    'Y': range(4,7),
    'Z': {
        'nest1': "bye",
        'nest2': ["A", "list"]
    }
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
    output: 'cfgtest.txt'
    params: YAML=config["cfgtestYAML"], PY=cfgtestPY
    shell:
        """
        echo "params.YAML[A]: {params.YAML[A]}" >{output}
        echo "params.YAML[B]: {params.YAML[B]}" >>{output}
        echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
        echo "params.YAML[C]: {params.YAML[C]}" >>{output}
        echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
        echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
        echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
        echo "" >>{output}
        echo "params.PY[X]: {params.PY[X]}" >>{output}
        echo "params.PY[Y]: {params.PY[Y]}" >>{output}
        echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
        echo "params.PY[Z]: {params.PY[Z]}" >>{output}
        echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
        echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
        echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
        """
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       1
    1
rule 1:
    output: cfgtest.txt
    jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list
YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. In my head I think of it as returning a list, but I am not positive about the variable type.
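A quick check outside Snakemake (assuming PyYAML is available) settles the type question: a YAML mapping parses to a Python dict, and iterating a dict yields its keys, which is why expand(..., sample=config["samples"]) substitutes "A" and "B" rather than the file paths:

```python
import yaml

doc = """
samples:
  A: data/samples/A.fastq
  B: data/samples/B.fastq
"""
config = yaml.safe_load(doc)
# A YAML mapping becomes a Python dict; iterating a dict yields its keys.
print(type(config["samples"]).__name__)  # dict
print(list(config["samples"]))           # ['A', 'B']
```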
By using configfile: as described here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
you can link in YAML configuration files like the following.
settings/config.yaml:
samples:
    - A
    - B
OR
settings/config.yaml:
sampleID:
    - 123
    - 124
    - 125
baseDIR: data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"

rule all:
    input:
        expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),

rule fastq2bam:
    input:
        expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
    output:
        "{baseDIR}/{ID}.bam"
    # Note the different number of {}: one pair for wildcards not in expand.
    # An equivalent line with a 'useless' expand call would be:
    # expand("{{baseDIR}}/{{ID}}.bam")
    shell:
        """
        bwa mem {input[0]} {input[1]} > {output}
        """
These are dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all'; when possible, this is best practice. I cannot say whether the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is set up, use config to reference anything in it. It can reach arbitrarily deep into the YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
This equates to (I'm not positive about the typing; I think quoted values stay strings):
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
    yamlVarLvl2:
        yamlVarLvl3:
            - '124'
            - '125'
            - '126'
            - '127'
            - '128'
            - '129'
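A small PyYAML check (outside Snakemake) of this nested access; note that quoted scalars come back as strings, matching the hypothetical_var shown:

```python
import yaml

doc = """
yamlVarLvl1:
    yamlVarLvl2:
        yamlVarLvl3:
            - '124'
            - '125'
            - '126'
            - '127'
            - '128'
            - '129'
"""
config = yaml.safe_load(doc)
# Quoted YAML scalars stay strings, so this is a list of strings:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
print(hypothetical_var)  # ['124', '125', '126', '127', '128', '129']
```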
Code Organization
Python and Snakemake code can, for the most part, be interleaved in certain places. I would advise against this, as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g., using the run or the shell directive changes how you access the variables.
YAML and JSON files are preferred configuration variable files, as I believe they provide some support for editing and command-line overriding of variables. This would not be as clean if it were implemented using externally imported Python variables. Also, it helps my brain: Python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake: I have over 40 pages of personally written documentation, I've given three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?
Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:

data_folder = config["dataFolder"]  # config is a plain dict, so use [] access

rule name_of_the_rule:
    output:
        os.path.join(data_folder, "{ID}", "{ID}.yyy")

I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards and things coming from elsewhere. But maybe something like the following works in Python 3.6, using formatted string literals (the wildcard braces would need to be doubled so the f-string leaves them alone): output: f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy". I haven't checked.
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure Python, except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard Python, e.g. config = json.load(open("config.json")) or config = yaml.safe_load(open("config.yaml")).
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.
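As a sketch of the "functions that generate a rule's inputs" idea (all names here are hypothetical), an input function simply receives a wildcards object and returns file paths; outside Snakemake, the wildcards object can be mimicked:

```python
from types import SimpleNamespace

# Hypothetical input function: Snakemake would call it with the rule's
# `wildcards` object and use the returned path as the rule's input.
def bam_for_sample(wildcards):
    return "sorted_reads/{}.bam".format(wildcards.sample)

# Outside Snakemake, a SimpleNamespace stands in for the wildcards object:
print(bam_for_sample(SimpleNamespace(sample="A")))  # sorted_reads/A.bam
```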
Related
I use the following minimal example to explain my problem:
test.py
#! /usr/bin/python3
import jinja2
import yaml
from yaml import CSafeLoader as SafeLoader

devices = [
    "usb_otg_path: 1:8",
    "usb_otg_path: m1:8",
    "usb_otg_path: 18",
]

for device in devices:
    template = jinja2.Template(device)
    device_template = template.render()
    print(device_template)
    obj = yaml.load(device_template, Loader=SafeLoader)
    print(obj)
The run result is:
root#pie:~# python3 test.py
usb_otg_path: 1:8
{'usb_otg_path': 68}
usb_otg_path: m1:8
{'usb_otg_path': 'm1:8'}
usb_otg_path: 18
{'usb_otg_path': 18}
You can see that when the value of device_template is usb_otg_path: 1:8, the 1:8 becomes 68 after yaml.load, apparently because of the : in it. The other two inputs are fine.
The above is a simplification of a complex system, in which "usb_otg_path: 1:8" is an input value I cannot change, and yaml.load is the basic mechanism that system uses to turn a string into a Python object.
Is it possible to get {'usb_otg_path': '1:8'} with some small change (we need to upstream to that project, so we can't make big changes that affect others)? Something like changing a parameter of yaml.load, or something else?
YAML allows numerical literals (scalars) formatted as x:y:z and interprets them as "sexagesimal," that is to say: base 60.
1:8 is thus interpreted by YAML as 1*60**1 + 8*60**0, obviously giving you 68.
Notably, you also have m1:8 as a string and 18 as a number. It sounds like you want all strings? This might be useful:
obj = yaml.load(device_template, Loader=yaml.BaseLoader)
This disables automatic value conversion, as BaseLoader "does not resolve or support any tags and constructs only basic Python objects: lists, dictionaries, and Unicode strings."
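A quick demonstration of both behaviours with PyYAML:

```python
import yaml

doc = "usb_otg_path: 1:8"
# YAML 1.1 (which PyYAML implements) resolves colon-separated digit groups
# as a base-60 integer: 1*60 + 8 == 68.
print(yaml.load(doc, Loader=yaml.SafeLoader))  # {'usb_otg_path': 68}
# BaseLoader skips all tag resolution and keeps every scalar a string:
print(yaml.load(doc, Loader=yaml.BaseLoader))  # {'usb_otg_path': '1:8'}
```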
Essentially, I want to know what the recommended way of handling equivalent file extensions is in snakemake. For example, let's say I have a rule that counts the number of entries in a fasta file. The rule might look something like...
rule count_entries:
    input:
        ["{some}.fasta"]
    output:
        ["{some}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
This works great. But what if I want this rule to also permit "{some}.fa" as input?
Is there any clean way to do this?
EDIT:
Here is my best guess at the first proposed solution. This can probably be turned into a higher-order function to be more general-purpose, but this is the basic idea as I understand it. I don't think this idea really fits the general use case though, as it doesn't cooperate with other rules at the "building DAG" stage.
import os

def handle_ext(wcs):
    base = wcs["base"]
    for file_ext in [".fasta", ".fa"]:
        if os.path.exists(base + file_ext):
            return [base + file_ext]

rule count_entries:
    input:
        handle_ext
    output:
        ["{base}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
EDIT2: Here is the best current solution as I see it...
count_entries_cmd = 'grep -c ">" {input} > {output}'
count_entries_output = "{some}.entry_count"

rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        count_entries_output
    shell:
        count_entries_cmd

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        count_entries_output
    shell:
        count_entries_cmd
One thing I noticed is that you are trying to specify lists of files in both input and output sections but actually your rule takes a single file and produces another file.
I propose you a straightforward solution of specifying two separate rules for different extensions:
rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'
These rules are not ambiguous unless you keep files with the same {some} name and different extension in the same folder (which I hope you don't do).
One possible solution is to only allow the original rule to take .fasta files as input, but enable .fa files to be renamed to that. For example,
rule fa_to_fasta:
    input:
        "{some}.fa"
    output:
        temp("{some}.fasta")
    shell:
        """
        cp {input} {output}
        """
Clearly this has the disadvantage of making a temporary copy of the file. Also, if foo.fa and foo.fasta are both provided (not through the copying), then foo.fasta will silently overshadow foo.fa, even if they are different.
Even though OP has edited his entry and included the possible workaround via input functions, I think it is best to also list it here as an answer, to highlight it as a possible solution. At least for me, this was the case :)
So, for example if you have an annotation table for your samples, which includes the respective extensions for each sample-file (e.g. via PEP), then you can create a function that returns these entries and pass this function as input to a rule. My example:
# Function indicates needed input files, based on given wildcards (here: sample)
# and sample annotations. In my case the sample annotations were provided via PEP.
def get_files_dynamically(wildcards):
    sample_file1 = pep.sample_table["file1"][wildcards.sample]
    sample_file2 = pep.sample_table["file2"][wildcards.sample]
    return {"file1": sample_file1, "file2": sample_file2}

# 1. Perform trimming on fastq-files
rule run_rule1:
    input:
        unpack(get_files_dynamically)  # Unpacking allows naming the inputs
    output:
        output1="output/somewhere/{sample}_1.xyz.gz",
        output2="output/somewhere/{sample}_2.xyz.gz"
    shell:
        "do something..."
The following youtube video shows that it is possible to jump to definition using vim for python.
However when I try the same shortcut (Ctrl-G) it doesn't work...
How is it possible to perform the same "jump to definition"?
I installed the plugin Ctrl-P but not rope.
This does not directly answer your question but provides a better alternative. I use jedi with vim as a static code analyser; it offers far better options than ctags. I use the spacemacs key-bindings in vim, with localleader set to ','.
" jedi
let g:jedi#use_tabs_not_buffers = 0 " use buffers instead of tabs
let g:jedi#show_call_signatures = "1"
let g:jedi#goto_command = "<localleader>gt"
let g:jedi#goto_assignments_command = "<localleader>ga"
let g:jedi#goto_definitions_command = "<localleader>gg"
let g:jedi#documentation_command = "K"
let g:jedi#usages_command = "<localleader>u"
let g:jedi#completions_command = "<C-Space>"
let g:jedi#rename_command = "<leader>r"
Vim's code navigation is based on a universal database called tags file. It needs to be generated (and updated) manually. :help ctags lists some applications that can do that. Exuberant ctags is a common one that supports many programming languages, but there are also specialized ones, like ptags.py (found in your Python source directory at Tools/scripts/ptags.py).
Plugins like easytags.vim provide more convenience by e.g. automatically updating the tags file on each save.
The default command for jumping to the definition is CTRL-] (not CTRL-G; that prints the current filename; see :help CTRL-G), or the Ex command :tag {identifier}; see all at :help tag-commands.
Some suggestions for people reading other answers to this question in the future:
The tags file has one limitation: if multiple objects in your code have the same name, you will have problems using ctrl-], as it will jump to the first one and not necessarily the correct one. In this situation you can use g ctrl-] (or the :tjump or :tselect commands) to get a selection list. You may even want to map ctrl-] to g ctrl-].
It is possible that you want to be able to jump to the correct object. In that case you might want to use jedi-vim, and if you are used to ctrl-] you might want to use that mapping for jedi's goto: let g:jedi#goto_command = "<C-]>"
Lastly, you may want to use universal ctags instead of exuberant ctags because of better support for newer file types (not necessarily Python).
If you're using YouCompleteMe there is a command for that
:YcmCompleter GoToDefinition
if you want to add a shortcut for doing that in a new tab
nnoremap <leader>d :tab split \| YcmCompleter GoToDefinition<CR>
I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT=identify_MT(msgtext)
schema_file = config.get(MT,'kbfile')
fold_text = config.get(MT,'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text contained in a dictionary that matches the 'fold' parameter, using the following function:
def test(find_text):
    return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied here when reading/handling the \n character pattern in from the config file, but don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do to get the behavior I want? Or perhaps there's an alternate solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even self-built module to lift data out of a parameter file.
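A minimal sketch of what is likely happening, plus one possible workaround: configparser hands back the two literal characters backslash and 'n' (it never interprets escape sequences), and decoding the escapes afterwards, e.g. with codecs' unicode_escape (assuming the values are plain ASCII), restores the real newline:

```python
import codecs
import configparser

# Minimal reproduction: the ini value contains a literal backslash + 'n'.
cfg = configparser.ConfigParser()
cfg.read_string("[536]\nkbfile=MT536.kb\nfold=:16S:TRANSDET\\n\n")

raw = cfg.get("536", "fold")
assert raw == ":16S:TRANSDET\\n"        # two characters: '\' and 'n'
# Decode the escape sequences to get a real newline character:
decoded = codecs.decode(raw, "unicode_escape")
assert decoded == ":16S:TRANSDET\n"     # ends with an actual newline
```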
We have a REST web service written in Perl Dancer. It returns perl data structures in YAML format and also takes in parameters in YAML format - it is supposed to work with some other teams who query it using Python.
Here's the problem -- if I'm passing back just a regular old perl hash by Dancer's serialization everything works completely fine. JSON, YAML, XML... they all do the job.
HOWEVER, sometimes we need to pass Perl objects back that the Python side can later pass back in as a parameter, to avoid unnecessary loading, etc. I played around and found that YAML is the only one that works with Perl's blessed objects in Dancer.
The problem is that Python's YAML can't parse through the YAMLs of the Perl objects (whereas it can handle regular old perl hash YAMLs without an issue).
The perl objects start out like this in YAML:
First one:
--- &1 !!perl/hash:Sequencing_API
Second:
--- !!perl/hash:SDB::DBIO
It errors out like this.
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:perl/hash:SDB::DBIO'
The regular files seem to get passed through like this:
---
fields:
  library:
It seems like the extra stuff after --- is causing the issues. What can I do to address this? Or am I trying to do too much by passing around Perl objects?
The short answer is: !! is YAML shorthand for tag:yaml.org,2002:, so !!perl/hash is really tag:yaml.org,2002:perl/hash.
Now you need to tell Python's yaml how to deal with this type, so you add a constructor for it as follows:
import yaml

# Register a multi-constructor for the whole perl/hash: tag family; the part
# of the tag after the prefix (the Perl class name) arrives as `suffix`.
def construct_perl_object(loader, suffix, node):
    print("S:", suffix, "N:", node)
    return loader.construct_mapping(node, deep=True)

yaml.add_multi_constructor(u"tag:yaml.org,2002:perl/hash:", construct_perl_object)
obj = yaml.load(yaml_string, Loader=yaml.Loader)
Or maybe just parse it out, or return None; it's hard to test with just that line, but that may be what you are looking for.