I'm trying out Snakemake's built-in scatter/gather support but am stumbling over how to read the configured total number of splits.
The documentation doesn't mention how I can access that value, whether it is defined in the workflow or passed through the CLI.
Docs say I should define a scattergather directive:
scattergather:
    split=8
But how do I get the value of split which is 8 in this case inside my split rule where I would assign it to params.split_total?
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = config["scattergather"]["split"]
    shell: "split -l {params.split_total} input"
This fails with:
KeyError in line 48 of /Users/corneliusromer/code/ncov-ingest/workflow/snakemake_rules/curate.smk:
'scattergather'
Am I missing something obvious?
One option is to read the setting from the workflow's internal _scatter property (it is internal, so it may change between Snakemake versions):
scattergather:
    split=8

# downstream rule can refer to the workflow property
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = workflow._scatter["split"]
    shell: "split -l {params.split_total} input"
This value changes dynamically when the CLI parameter --set-scatter is provided.
For other cases, plain Python works. The snippet below sets a fixed value, but any valid way of setting or obtaining a value in Python will do:
# python variable/label
split_total = 8

scattergather:
    split=split_total

# downstream rule can refer to the python variable
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = split_total
    shell: "split -l {params.split_total} input"
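For orientation, here is a rough plain-Python sketch of what the scatter helper expands to for a given split count. It assumes Snakemake names the scatter items 1-of-8, 2-of-8 and so on, which is my understanding of current versions rather than something stated above; scatter_paths is a hypothetical helper, not part of Snakemake.

```python
def scatter_paths(pattern, n):
    """Expand a pattern over n scatter items (assumed naming: '<i>-of-<n>')."""
    return [pattern.format(scatteritem=f"{i}-of-{n}") for i in range(1, n + 1)]


split_total = 8
paths = scatter_paths("splitted/{scatteritem}.txt", split_total)
print(paths[0], paths[-1])
```

Whichever way split_total is obtained, the number of output files and the value handed to the shell command should come from the same variable.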
Essentially, I want to know what the recommended way of handling equivalent file extensions is in Snakemake. For example, let's say I have a rule that counts the number of entries in a fasta file. The rule might look something like this:
rule count_entries:
    input:
        ["{some}.fasta"]
    output:
        ["{some}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
This works great. But what if I want this rule to also permit "{some}.fa" as input?
Is there any clean way to do this?
EDIT:
Here is my best guess at the first proposed solution. This could probably be turned into a higher-order function to be more general purpose, but this is the basic idea as I understand it. I don't think this idea really fits any general use case, though, as it doesn't cooperate with other rules at the "building DAG" stage.
import os

def handle_ext(wcs):
    # Return the first existing file among the accepted extensions
    base = wcs["base"]
    for file_ext in [".fasta", ".fa"]:
        if os.path.exists(base + file_ext):
            return [base + file_ext]

rule count_entries:
    input:
        handle_ext
    output:
        ["{base}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
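The higher-order version mentioned above could look roughly like the sketch below. The name make_ext_resolver and its key argument are hypothetical, and the demo passes a plain dict in place of Snakemake's wildcards object, assuming only that the wildcards can be indexed by name as in handle_ext.

```python
import os
import tempfile


def make_ext_resolver(extensions, key="base"):
    """Build an input function that returns the first existing file
    among base + ext for the given extensions."""
    def resolve(wildcards):
        base = wildcards[key]
        for ext in extensions:
            candidate = base + ext
            if os.path.exists(candidate):
                return [candidate]
        # Fail loudly instead of silently returning None
        raise FileNotFoundError(f"no file for {base} with extensions {extensions}")
    return resolve


# Demo with a temporary directory that only contains genome.fa
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "genome.fa"), "w").close()
    handle_ext = make_ext_resolver([".fasta", ".fa"])
    found = handle_ext({"base": os.path.join(tmp, "genome")})
    print(os.path.basename(found[0]))
```

The same limitation applies as before: the function only sees files that already exist on disk, so it does not help when the .fa or .fasta file is itself produced by another rule.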
EDIT2: Here is the best current solution as I see it...
count_entries_cmd = 'grep -c ">" {input} > {output}'
count_entries_output = "{some}.entry_count"

rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        count_entries_output
    shell:
        count_entries_cmd

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        count_entries_output
    shell:
        count_entries_cmd
One thing I noticed is that you specify lists of files in both the input and output sections, but your rule actually takes a single file and produces a single file.
I propose a straightforward solution: specify two separate rules for the different extensions:
rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'

rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'
These rules are not ambiguous unless you keep files with the same {some} name and different extension in the same folder (which I hope you don't do).
One possible solution is to only allow the original rule to take .fasta files as input, but enable .fa files to be renamed to that. For example,
rule fa_to_fasta:
    input:
        "{some}.fa"
    output:
        temp("{some}.fasta")
    shell:
        """
        cp {input} {output}
        """
Clearly this has the disadvantage of making a temporary copy of the file. Also, if foo.fa and foo.fasta are both provided (not through the copying), then foo.fasta will silently overshadow foo.fa, even if they are different.
Even though OP has edited his entry and included the possible workaround via the input functions, I think it is best to list it also here as an answer to highlight this as possible solution. At least for me, this was the case :)
So, for example if you have an annotation table for your samples, which includes the respective extensions for each sample-file (e.g. via PEP), then you can create a function that returns these entries and pass this function as input to a rule. My example:
# Function indicates needed input files, based on given wildcards (here: sample) and sample annotations
# In my case the sample annotations were provided via PEP
def get_files_dynamically(wildcards):
    sample_file1 = pep.sample_table["file1"][wildcards.sample]
    sample_file2 = pep.sample_table["file"][wildcards.sample]
    return {"file1": sample_file1, "file2": sample_file2}

# 1. Perform trimming on fastq-files
rule run_rule1:
    input:
        unpack(get_files_dynamically)  # Unpacking allows naming the inputs
    output:
        output1="output/somewhere/{sample}_1.xyz.gz",
        output2="output/somewhere/{sample}_2.xyz.gz"
    shell:
        "do something..."
for key in dictionary:
    file = file.replace(str(key), dictionary[key])
With this simple Python snippet I am able to replace each occurrence of a dictionary key with its value in a file.
Is there a similar way to go about in bash?
Example:
file:
"addMesh:"0x234544"
addMesh="0x12353514"
dictionary:
${!dictionary[i]}: 0x234544
${dictionary[i]}: 0x234544x0
${!dictionary[i]}: 0x12353514
${dictionary[i]}: 0x12353514x0
Wanted output (new content of file):
"addMesh:"0x234544x0"
addMesh="0x12353514x0"
:
for i in "${!dictionary[@]}"
do
    echo "key : $i"
    echo "value: ${dictionary[$i]}"
    echo
done
While there certainly are more sophisticated methods to do this, I find the following much easier to understand, and maybe it's just fast enough for your use case:
#!/bin/bash
# Create copy of source file: can be omitted
cat addMesh.txt > newAddMesh.txt
file_to_modify=newAddMesh.txt
# Declare the dictionary
declare -A dictionary
dictionary["0x234544"]=0x234544x0
dictionary["0x12353514"]=0x12353514x0
# use sed to perform all substitutions
for i in "${!dictionary[@]}"
do
    sed -i "s/$i/${dictionary[$i]}/g" "$file_to_modify"
done
# Display the result: can be omitted
echo "Content of $file_to_modify :"
cat "$file_to_modify"
Assuming that the input file addMesh.txt contains
"addMesh:"0x234544"
addMesh="0x12353514"
the resulting file will contain:
"addMesh:"0x234544x0"
addMesh="0x12353514x0"
This method is not very fast, because it invokes sed multiple times. But it does not require sed to generate other sed scripts or anything like that. Therefore, it is closer to the original Python script. If you need better performance, refer to the answers in the linked question.
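For comparison with the Python snippet from the question, the multiple passes can be avoided on the Python side by compiling all keys into one regular expression. This is only a sketch under the assumption that the keys are literal strings; replace_all is a hypothetical helper name.

```python
import re


def replace_all(text, mapping):
    """Replace every key of mapping found in text with its value, in one pass."""
    # Longest keys first, so a key that is a prefix of another key does not win
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


mapping = {"0x234544": "0x234544x0", "0x12353514": "0x12353514x0"}
text = '"addMesh:"0x234544"\naddMesh="0x12353514"\n'
result = replace_all(text, mapping)
print(result)
```

Because the text is scanned once, a value that was just inserted is never matched again by another key, which can happen with repeated sed calls when keys and values overlap.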
There is no perfect equivalent in Bash. You could do it in a roundabout way, given that dict is the associative array:
# traverse the dictionary and build command file for sed
for key in "${!dict[@]}"; do
    printf "s/%s/%s/g;\n" "$key" "${dict[$key]}"
done > sed.commands
# run sed
sed -f sed.commands file > file.modified
# clean up
rm -f sed.commands
Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} with "data/samples/A.fastq" rather than with "A" (and "B" etc.), as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: and include a python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.
How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: 'echo "OutputDir is {params.config[OutputDir]}" > {output}'
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
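The unquoted-key behaviour is the same as in Python's own str.format, which is, as far as I understand, the mini-language Snakemake's shell formatting follows. A small plain-Python sketch, with cfg as an arbitrary placeholder name:

```python
config = {"OutputDir": "DATA_DIR", "samples": ["A", "B"]}

# Keys inside [] are written without quotes; digits are treated as list indices
line = "OutputDir is {cfg[OutputDir]}, second sample is {cfg[samples][1]}".format(cfg=config)
print(line)

# With quotes, the quote characters become part of the key and the lookup fails
try:
    "{cfg['OutputDir']}".format(cfg=config)
    quoted_ok = True
except KeyError:
    quoted_ok = False
print("quoted key works:", quoted_ok)
```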
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
    output: OutputDir + "/myfile.txt"
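To illustrate the point about definitions referencing earlier ones, here is a minimal sketch of what an included Python file could do. The keys LogDir and ResultFile are made up for the example, and the config dict is a stand-in for the one Snakemake provides.

```python
import os

# Stand-in for Snakemake's config dict (filled by configfile or by hand)
config = {"OutputDir": "DATA_DIR"}

# Derived settings that reuse a value defined earlier
config["LogDir"] = os.path.join(config["OutputDir"], "logs")
config["ResultFile"] = os.path.join(config["OutputDir"], "myfile.txt")

print(config["LogDir"], config["ResultFile"])
```

The two approaches can also be combined: keep the plain values in the .yaml file and compute the derived ones in Python after the configfile line.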
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
A: 10
B: [1, 2, 99]
C:
nst1: "hello"
nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
    'X': -2,
    'Y': range(4,7),
    'Z': {
        'nest1': "bye",
        'nest2': ["A", "list"]
    }
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
    output: 'cfgtest.txt'
    params: YAML=config["cfgtestYAML"], PY=cfgtestPY
    shell:
        """
        echo "params.YAML[A]: {params.YAML[A]}" >{output}
        echo "params.YAML[B]: {params.YAML[B]}" >>{output}
        echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
        echo "params.YAML[C]: {params.YAML[C]}" >>{output}
        echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
        echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
        echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
        echo "" >>{output}
        echo "params.PY[X]: {params.PY[X]}" >>{output}
        echo "params.PY[Y]: {params.PY[Y]}" >>{output}
        echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
        echo "params.PY[Z]: {params.PY[Z]}" >>{output}
        echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
        echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
        echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
        """
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list
YAML Configuration
This has to do with how the YAML file is nested.
The config["samples"] request will return both 'A' and 'B'. In my head I think of it as returning a list, but I am not positive about the variable type.
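With the YAML from the question, config["samples"] is a mapping, and iterating over a Python dict yields its keys, which is why {sample} becomes A and B rather than the file paths. The sketch below mimics this with a hypothetical expand_samples helper that only approximates what expand() does.

```python
# What the question's YAML parses to: a mapping from sample name to path
config = {"samples": {"A": "data/samples/A.fastq", "B": "data/samples/B.fastq"}}


def expand_samples(pattern, samples):
    """Rough stand-in for expand(): format the pattern once per sample."""
    return [pattern.format(sample=s) for s in samples]


bams = expand_samples("sorted_reads/{sample}.bam", config["samples"])
print(bams)

# The paths are still available as the dict values
fastq_for_a = config["samples"]["A"]
print(fastq_for_a)
```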
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can link in YAML configuration files such as the following.
settings/config.yaml:
samples:
  - A
  - B
OR
settings/config.yaml:
sampleID:
  - 123
  - 124
  - 125
baseDIR: data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"

rule all:
    input:
        expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),

rule fastq2bam:
    input:
        expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
    output:
        "{baseDIR}/{ID}.bam"
    #Note different number of {}, 1 for wildcards not in expand.
    #Equivalent line with 'useless' expand call would be:
    #expand("{{baseDIR}}/{{ID}}.bam")
    shell:
        """
        bwa mem {input[0]} {input[1]} > {output}
        """
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing; quoted values like these should load as strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
  yamlVarLvl2:
    yamlVarLvl3:
      - '124'
      - '125'
      - '126'
      - '127'
      - '128'
      - '129'
Code Organization
Python and Snakemake code can, for the most part, be interleaved in certain places. I would advise against this, as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g., using the run or the shell directive changes how the variables are accessed.
YAML and JSON files are the preferred configuration variable files, as I believe they provide some support for editing and for command-line overriding of variables. This would not be as clean if it were implemented using externally imported python variables. Also it helps my brain, knowing python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?
Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
import os
data_folder = config["dataFolder"]
rule name_of_the_rule:
    output:
        os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards and others. But maybe the following works in python 3.6, using formatted string literals: output: f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy" (the wildcard braces are doubled so that they survive the f-string formatting). I haven't checked.
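On the plain Python side, the brace handling of such an f-string can be checked without Snakemake. In this sketch the config dict and the value "data" are placeholders; whether Snakemake then treats the result as intended is not verified here.

```python
config = {"dataFolder": "data"}  # placeholder for Snakemake's config dict

# Single braces are evaluated immediately; doubled braces come out as literal
# braces, leaving {ID} in place for Snakemake to treat as a wildcard
output_pattern = f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy"
print(output_pattern)
```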
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure python except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard python using config = json.load(open("config.json")) or config = yaml.safe_load(open("config.yaml")).
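A minimal sketch of the plain Python route, using only the standard library. The file name config.json and its contents are invented for the example, and the file is written to a temporary directory so the snippet is self-contained.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")

    # Write an example config file
    with open(path, "w") as handle:
        json.dump({"dataFolder": "data", "samples": ["A", "B"]}, handle)

    # json.load expects an open file object, not a file name
    with open(path) as handle:
        loaded = json.load(handle)

print(loaded["dataFolder"], loaded["samples"])
```

For YAML the pattern is the same with yaml.safe_load from the third-party PyYAML package, assuming it is installed.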
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.
I have an xml file which looks like below
<name>abcdefg</name>
<value>123456</value>
I am trying to write a script using sed to search for the tag "abcdefg" and then replace the corresponding value "123456" but unfortunately I am not able to find a logic to achieve above.
Need help!
Sample data used:
cat key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>123456</value>
<name>abcdaaefg</name>
<value>123456</value>
sed solution:
sed '/abcdefg/!b;n;c<value>OLA</value>' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>OLA</value>
<name>abcdaaefg</name>
<value>123456</value>
To make the change in the file itself (keeping a .bak backup):
sed -i.bak '/abcdefg/!b;n;c<value>OLA</value>' key
awk Solution:
awk '/abcdefg/ {print $0;getline;sub(/>.*</,">ola<")} {print $0}' key
<name>abcdaaefg</name>
<value>123456</value>
<name>abcdefg</name>
<value>ola</value>
<name>abcdaaefg</name>
<value>123456</value>
Search for a line containing abcdefg and then do the following:
1. Print that line.
2. Move to the next line and replace the value inside the XML tag with something else. Here, I have replaced 123456 with ola.
Whenever you have tag->value pairs in your data it's a good idea to create a tag->value array in your code:
$ awk -F'[<>]' '{tag=$2; v[tag]=$3} tag=="value" && v["name"]=="abcdefg" {sub(/>.*</,">blahblah<")} 1' file
<name>abcdefg</name>
<value>blahblah</value>
Use an XML-aware tool. This will make your approach far more robust: It means that tiny changes in the textual description (like added or removed newlines, or extra attributes added to a preexisting element) won't break your script.
Assuming that your input's structure looks like this (with name and value being under a single parent element, here called item, which defines the relationship between a name and a value):
<config>
<item><name>abcdef</name><value>123456</value></item>
<item><name>fedcba</name><value>654321</value></item>
</config>
...you can edit it like so:
# edit the value under an item having name "abcdef"
xmlstarlet ed -u '//item[name="abcdef"]/value' -v "new-value"
If instead it's like this (with ordering between name/value pairs describing their relationship):
<config>
<name>abcdef</name><value>123456</value>
<name>fedcba</name><value>654321</value>
</config>
...then you can edit it like so:
# update the value immediately following a name of "abcdef"
xmlstarlet ed -u '//name[. = "abcdef"]/following-sibling::value[1]' -v new-value
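If xmlstarlet is not available, the same XML-aware edit for the first layout can be sketched with Python's standard library. The sample document is the one shown above; the replacement text new-value is just a placeholder.

```python
import xml.etree.ElementTree as ET

xml_text = """<config>
  <item><name>abcdef</name><value>123456</value></item>
  <item><name>fedcba</name><value>654321</value></item>
</config>"""

root = ET.fromstring(xml_text)

# Update the value under every item whose name is "abcdef"
for item in root.iter("item"):
    if item.findtext("name") == "abcdef":
        item.find("value").text = "new-value"

updated = ET.tostring(root, encoding="unicode")
print(updated)
```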