Handling equivalent file extensions in Snakemake

Handling equivalent file extensions in Snakemake - python

Essentially, I want to know what the recomended way of handling equivalent file extensions is in snakemake. For example, lets say I have a rule that counts the number of entries in a fasta file. The rule might look something like....
rule count_entries:
input:
["{some}.fasta"]
output:
["{some}.entry_count"]
shell:
'grep -c ">" {input[0]} > {output[0]}'
This works great. But what if I want this rule to also permit "{some}.fa" as input?
Is there any clean way to do this?
EDIT:
Here is my best guess at the first proposed sollution. This can probably be turned into a higher order function to be more general purpose but this is the basic idea as I understand it. I don't think this idea really fits any general use case though as it doesn't cooperate with other rules at the "building DAG" stage.
import os
def handle_ext(wcs):
base = wcs["base"]
for file_ext in [".fasta", ".fa"]:
if(os.path.exists(base + file_ext)):
return [base + file_ext]
rule count_entries:
input:
handle_ext
output:
["{base}.entry_count"]
shell:
'grep -c ">" {input[0]} > {output[0]}'
EDIT2: Here is the best current sollution as I see it...
count_entries_cmd = 'grep -c ">" {input} > {output}'
count_entries_output = "{some}.entry_count"
rule count_entries_fasta:
input:
"{some}.fasta"
output:
count_entries_output
shell:
count_entries_cmd
rule count_entries_fa:
input:
"{some}.fa"
output:
count_entries_output
shell:
count_entries_cmd

One thing I noticed is that you are trying to specify lists of files in both input and output sections but actually your rule takes a single file and produces another file.
I propose you a straightforward solution of specifying two separate rules for different extensions:
rule count_entries_fasta:
input:
"{some}.fasta"
output:
"{some}.entry_count"
shell:
'grep -c ">" {input} > {output}'
rule count_entries_fa:
input:
"{some}.fa"
output:
"{some}.entry_count"
shell:
'grep -c ">" {input} > {output}'
These rules are not ambiguous unless you keep files with the same {some} name and different extension in the same folder (which I hope you don't do).

One possible solution is to only allow the original rule to take .fasta files as input, but enable .fa files to be renamed to that. For example,
rule fa_to_fasta:
input:
"{some}.fa"
output:
temp("{some}.fasta")
shell:
"""
cp {input} {output}
"""
Clearly this has the disadvantage of making a temporary copy of the file. Also, if foo.fa and foo.fasta are both provided (not through the copying), then foo.fasta will silently overshadow foo.fa, even if they are different.

Even though OP has edited his entry and included the possible workaround via the input functions, I think it is best to list it also here as an answer to highlight this as possible solution. At least for me, this was the case :)
So, for example if you have an annotation table for your samples, which includes the respective extensions for each sample-file (e.g. via PEP), then you can create a function that returns these entries and pass this function as input to a rule. My example:
# Function indicates needed input files, based on given wildcards (here: sample) and sample annotations
# In my case the sample annotations were provided via PEP
def get_files_dynamically(wildcards):
sample_file1 = pep.sample_table["file1"][wildcards.sample]
sample_read2 = pep.sample_table["file"][wildcards.sample]
return {"file1": sample_file1, "file2": sample_file2}
# 1. Perform trimming on fastq-files
rule run_rule1:
input:
unpack(get_files_dynamically) # Unpacking allows naming the inputs
output:
output1="output/somewhere/{sample}_1.xyz.gz",
output2="output/somewhere/{sample}_2.xyz.gz"
shell:
"do something..."

Related

How to get the number of total scattergather items in Snakemake scatter/gather?

I'm trying out Snakemake's scatter/gather inbuilts but am stumbling over how to get the number of total splits configured.
The documentation doesn't mention how I can access that variable as defined in the workflow or passed through CLI.
Docs say I should define a scattergather directive:
scattergather:
split=8
But how do I get the value of split which is 8 in this case inside my split rule where I would assign it to params.split_total?
rule split:
input: "input.txt"
output: scatter.split("splitted/{scatteritem}.txt")
params: split_total = config["scattergather"]["split"]
shell: "split -l {params.split_total} input"
This fails with: KeyError 'scattergather'
Am I missing something obvious? This is the docs I'm looking at: KeyError in line 48 of /Users/corneliusromer/code/ncov-ingest/workflow/snakemake_rules/curate.smk:
2 'scattergather'

There is a possibility of accessing specific setting via workflow internal property ._scatter:
scattergather:
split=8
# downstream rule can refer to the python variable
rule split:
input: "input.txt"
output: scatter.split("splitted/{scatteritem}.txt")
params: split_total = workflow._scatter["split"]
shell: "split -l {params.split_total} input"
This will dynamically change when CLI param set-scatter is provided.
For other cases, one could leverage python. In the snippet below this is done via setting a specific value, however any valid way to set/obtain value in python will work:
# python variable/label
split_total = 8
scattergather:
split=split_total
# downstream rule can refer to the python variable
rule split:
input: "input.txt"
output: scatter.split("splitted/{scatteritem}.txt")
params: split_total = split_total
shell: "split -l {params.split_total} input"

Snakemake with one input but multiple parameters permutations

I have been trying to wrap my head around this problem which probably has a very easy solution.
I am running a bioinformatics workflow where I have one file as input and I want to run a program on it. However I want that program to be run with multiple parameters. Let me explain.
I have file.fastq and I want to run cutadapt (in the shell) with two flags: --trim and -e. I want to run trim with values --trim 0 and --trim 5. Also I want -e with values -e 0.1 and -e 0.5
Thererfore I want to run the following:
cutadapt file.fastq --trim0 -e0.5 --output ./outputs/trim0_error0.5/trimmed_file.fastq
cutadapt file.fastq --trim5 -e0.5 --output ./outputs/trim5_error0.5/trimmed_file.fastq
cutadapt file.fastq --trim0 -e0.1 --output ./outputs/trim0_error0.1/trimmed_file.fastq
cutadapt file.fastq --trim5 -e0.1 --output ./outputs/trim5_error0.1/trimmed_file.fastq
I thought snakemake would be perfect for this. So far I tried:
E = [0.1, 0.5]
TRIM = [5, 0]
rule cutadapt:
input:
"file.fastq"
output:
expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
params:
trim = TRIM,
e = E
shell:
"cutadapt {input} -e{params.e} --trim{params.trim} --output {output}"
However I get an error like this:
shell:
cutadapt file.fastq -e0.1 0.5 --trim0 5 --output {output}
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
So, as you can see, snakemake is not taking each argument of the TRIM and E variables, but putting them together like a string. How could I solve this problem? Thank you in advance

When specifying params, right now you are providing full lists rather than specific values. Contrast the following parameter values:
E = [0.1, 0.5]
TRIM = [5, 0]
rule all
input: expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
rule cutadapt:
input:
"file.fastq"
output: "../outputs/trim{TRIM}_error{E}/trimmed_file.fastq"
params:
trim_list = TRIM,
trim_value = lambda wildcards: wildcards.TRIM,
shell:
"cutadapt {input} -e{wildcards.E} --trim{wildcards.TRIM} --output {output}"
Note that in the shell directive there was no need to reference params, since this directive is aware of wildcards.

Expanding on #SultanOrzbayev 's answer, The key was that I needed rule all with the parameters in order to use wildcards in my rule cutadapt.
The parameters section can actually be erased, so in the end:
E = [0.1, 0.5]
TRIM = [5, 0]
rule all
input: expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
rule cutadapt:
input:
"file.fastq"
output: "../outputs/trim{TRIM}_error{E}/trimmed_file.fastq"
shell:
"cutadapt {input} -e{wildcards.E} --trim{wildcards.TRIM} --output {output}"

snakemake: correct syntax for accessing dictionary values

Here's an example of what I am trying to do:
mydictionary={
'apple': 'crunchy fruit',
'banana': 'mushy and yellow'
}
rule all:
input:
expand('{key}.txt', key=mydictionary.keys())
rule test:
output: temp('{f}.txt')
shell:
"""
echo {mydictionary[wildcards.f]} > {output}
cat {output}
"""
For some reason, I am not able to access the dictionary contents. I tried using double-curly brackets, but the content of the text files becomes literal {mydictionary[wildcards.f]} (while I want the content of the corresponding entry in the dictionary).

I'm pretty sure the bracket markup can only replace variables with string representations of their values, but does not support any code evaluation within the brackets. That is, {mydictionary[wildcards.f]} will try to look up a variable literally named "mydictionary[wildcards.f]". Likewise, {mydictionary}[{wildcards.f}] will just paste the string values together. So, I don't think you can do what you want within the shell section alone. Instead, you can accomplish what you want in the params section:
rule test:
output: temp('{f}.txt')
params:
value=lambda wcs: mydictionary[wcs.f]
shell:
"""
echo '{params.value}' > {output}
cat {output}
"""

Syntax for using config data in rules

Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} to "data/samples/A.fastq" rather than by "A" (and "B" etc.) as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: an include a python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.

How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: echo "OutputDir is {params.config[OutputDir]}" > {output}
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
A: 10
B: [1, 2, 99]
C:
nst1: "hello"
nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
'X': -2,
'Y': range(4,7),
'Z': {
'nest1': "bye",
'nest2': ["A", "list"]
}
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
output: 'cfgtest.txt'
params: YAML=config["cfgtestYAML"], PY=cfgtestPY
shell:
"""
echo "params.YAML[A]: {params.YAML[A]}" >{output}
echo "params.YAML[B]: {params.YAML[B]}" >>{output}
echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
echo "params.YAML[C]: {params.YAML[C]}" >>{output}
echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
echo "" >>{output}
echo "params.PY[X]: {params.PY[X]}" >>{output}
echo "params.PY[Y]: {params.PY[Y]}" >>{output}
echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
echo "params.PY[Z]: {params.PY[Z]}" >>{output}
echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
"""
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list

YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. I'm my head I think of it returning a list, but I am not positive on the variable type.
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can link in the following YAML configuration files, in YAML format.
settings/config.yaml:
samples:
A
B
OR
settings/config.yaml:
sampleID:
123
124
125
baseDIR:
data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
input:
expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
input:
expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
output:
"{baseDIR}/{ID}.bam"
#Note different number of {}, 1 for wildcards not in expand.
#Equivalent line with 'useless' expand call would be:
#expand("{{baseDIR}}/{{ID}}.bam")
shell:
"""
bwa mem {input[0]} {input[1]} > {output}
"""
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
yamlVarLvl2:
yamlVarLvl3:
'124'
'125'
'126'
'127'
'128'
'129'
Code Organization
Python and Snakemake code, for the most part can be interleaved in certian places. I would advise against this as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g, using the run or the shell directive changes how to access the variables.
YAML and JSON files are preferred configuration variable files as I believe the provide some support for editting and Command-Line Interface over-ridding of variables. This would not be as clean if it was implemented using externally imported python variables. Also it helps my brain, knowing python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?

Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
data_folder = config.dataFolder
rule name_of_the_rule:
output:unction
os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards, and others. But maybe the following works in python 3.6, using formatted string litterals: output: f"{config.dataFolder}/{ID}/{ID}.yyy". I haven't checked.
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure python except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard python using config = json.load("config.json") or config = yaml.load("config.yaml").
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.

Incremental Saves

I am trying to write up a script on incremental saves but there are a few hiccups that I am running into.
If the file name is "aaa.ma", I will get the following error - ValueError: invalid literal for int() with base 10: 'aaa' # and it does not happens if my file is named "aaa_0001"
And this happens if I wrote my code in this format: Link
As such, to rectify the above problem, I input in an if..else.. statement - Link, it seems to have resolved the issue on hand, but I was wondering if there is a better approach to this?
Any advice will be greatly appreciated!

Use regexes for better flexibility especially for file rename scripts like these.
In your case, since you know that the expected filename format is "some_file_name_<increment_number>", you can use regexes to do the searching and matching for you. The reason we should do this is because people/users may are not machines, and may not stick to the exact naming conventions that our scripts expect. For example, the user may name the file aaa_01.ma or even aaa001.ma instead of aaa_0001 that your script currently expects. To build this flexibility into your script, you can use regexes. For your use case, you could do:
# name = lastIncFile.partition(".")[0] # Use os.path.split instead
name, ext = os.path.splitext(lastIncFile)
import re
match_object = re.search("([a-zA-Z]*)_*([0-9]*)$", name)
# Here ([a-zA-Z]*) would be group(1) and would have "aaa" for ex.
# and ([0-9]*) would be group(2) and would have "0001" for ex.
# _* indicates that there may be an _, or not.
# The $ indicates that ([0-9]*) would be the LAST part of the name.
padding = 4 # Try and parameterize as many components as possible for easy maintenance
default_starting = 1
verName = str(default_starting).zfill(padding) # Default verName
if match_object: # True if the version string was found
name = match_object.group(1)
version_component = match_object.group(2)
if version_component:
verName = str(int(version_component) + 1).zfill(padding)
newFileName = "%s_%s.%s" % (name, verName, ext)
incSaveFilePath = os.path.join(curFileDir, newFileName)
Check out this nice tutorial on Python regexes to get an idea what is going on in the above block. Feel free to tweak, evolve and build the regex based on your use cases, tests and needs.
Extra tips:
Call cmds.file(renameToSave=True) at the beginning of the script. This will ensure that the file does not get saved over itself accidentally, and forces the script/user to rename the current file. Just a safety measure.
If you want to go a little fancy with your regex expression and make them more readable, you could try doing this:
match_object = re.search("(?P<name>[a-zA-Z]*)_*(?P<version>[0-9]*)$", name)
name = match_object.group('name')
version_component = match_object('version')
Here we use the ?P<var_name>... syntax to assign a dict key name to the matching group. Makes for better readability when you access it - mo.group('version') is much more clearer than mo.group(2).
Make sure to go through the official docs too.
Save using Maya's commands. This will ensure Maya does all it's checks while and before saving:
cmds.file(rename=incSaveFilePath)
cmds.file(save=True)
Update-2:
If you want space to be checked here's an updated regex:
match_object = re.search("(?P<name>[a-zA-Z]*)[_ ]*(?P<version>[0-9]*)$", name)
Here [_ ]* will check for 0 - many occurrences of _ or (space). For more regex stuff, trying and learn on your own is the best way. Check out the links on this post.
Hope this helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.