I have been trying to wrap my head around this problem which probably has a very easy solution.
I am running a bioinformatics workflow where I have one file as input and I want to run a program on it. However I want that program to be run with multiple parameters. Let me explain.
I have file.fastq and I want to run cutadapt (in the shell) with two flags: --trim and -e. I want to run trim with values --trim 0 and --trim 5. Also I want -e with values -e 0.1 and -e 0.5
Therefore I want to run the following:
cutadapt file.fastq --trim0 -e0.5 --output ./outputs/trim0_error0.5/trimmed_file.fastq
cutadapt file.fastq --trim5 -e0.5 --output ./outputs/trim5_error0.5/trimmed_file.fastq
cutadapt file.fastq --trim0 -e0.1 --output ./outputs/trim0_error0.1/trimmed_file.fastq
cutadapt file.fastq --trim5 -e0.1 --output ./outputs/trim5_error0.1/trimmed_file.fastq
I thought snakemake would be perfect for this. So far I tried:
E = [0.1, 0.5]
TRIM = [5, 0]
rule cutadapt:
    input:
        "file.fastq"
    output:
        expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
    params:
        trim = TRIM,
        e = E
    shell:
        "cutadapt {input} -e{params.e} --trim{params.trim} --output {output}"
However I get an error like this:
shell:
cutadapt file.fastq -e0.1 0.5 --trim0 5 --output {output}
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
So, as you can see, snakemake is not taking each argument of the TRIM and E variables, but putting them together like a string. How could I solve this problem? Thank you in advance
When specifying params, right now you are providing full lists rather than specific values. Contrast the following parameter values:
E = [0.1, 0.5]
TRIM = [5, 0]
rule all:
    input:
        expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
rule cutadapt:
    input:
        "file.fastq"
    output:
        "../outputs/trim{TRIM}_error{E}/trimmed_file.fastq"
    params:
        trim_list = TRIM,  # the full list, identical for every job
        trim_value = lambda wildcards: wildcards.TRIM,  # the single value for this job
    shell:
        "cutadapt {input} -e{wildcards.E} --trim{wildcards.TRIM} --output {output}"
Note that in the shell directive there was no need to reference params, since this directive is aware of wildcards.
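The difference can be reproduced in plain Python. The sketch below assumes, as the error message in the question suggests, that a list-valued parameter is rendered by joining its items with spaces; render is a hypothetical helper written for illustration, not a Snakemake function, and itertools.product stands in for what expand does with two lists.

```python
from itertools import product

E = [0.1, 0.5]
TRIM = [5, 0]

def render(value):
    """Hypothetical stand-in: join list items with spaces, str() everything else."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(item) for item in value)
    return str(value)

# Whole lists in params: one malformed command containing every value.
broken = "cutadapt file.fastq -e{} --trim{}".format(render(E), render(TRIM))
print(broken)

# One target per (TRIM, E) pair, as requested by rule all; each pair is one job.
pattern = "../outputs/trim{TRIM}_error{E}/trimmed_file.fastq"
targets = [pattern.format(TRIM=trim, E=e) for trim, e in product(TRIM, E)]
for target in targets:
    print(target)
```

Each printed target corresponds to one run of rule cutadapt, with TRIM and E available as wildcards.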
Expanding on @SultanOrzbayev's answer: the key was that I needed a rule all that expands the parameters, so that rule cutadapt can use them as wildcards.
The params section can actually be removed, so in the end:
E = [0.1, 0.5]
TRIM = [5, 0]
rule all:
    input:
        expand("../outputs/trim{TRIM}_error{E}/trimmed_file.fastq", E=E, TRIM=TRIM)
rule cutadapt:
    input:
        "file.fastq"
    output:
        "../outputs/trim{TRIM}_error{E}/trimmed_file.fastq"
    shell:
        "cutadapt {input} -e{wildcards.E} --trim{wildcards.TRIM} --output {output}"
I'm trying out Snakemake's scatter/gather inbuilts but am stumbling over how to get the number of total splits configured.
The documentation doesn't mention how I can access that variable as defined in the workflow or passed through CLI.
Docs say I should define a scattergather directive:
scattergather:
    split=8
But how do I get the value of split which is 8 in this case inside my split rule where I would assign it to params.split_total?
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = config["scattergather"]["split"]
    shell: "split -l {params.split_total} input"
This fails with KeyError: 'scattergather'. Am I missing something obvious? The full error is:
KeyError in line 48 of /Users/corneliusromer/code/ncov-ingest/workflow/snakemake_rules/curate.smk:
2 'scattergather'
It is possible to access this setting via the workflow's internal property ._scatter:
scattergather:
    split=8
# downstream rule can refer to the workflow property
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = workflow._scatter["split"]
    shell: "split -l {params.split_total} input"
This will change dynamically when the CLI option --set-scatter is provided.
For other cases, one could leverage Python. The snippet below hard-codes a specific value, but any valid way of setting or obtaining a value in Python will work:
# python variable/label
split_total = 8
scattergather:
    split=split_total
# downstream rule can refer to the python variable
rule split:
    input: "input.txt"
    output: scatter.split("splitted/{scatteritem}.txt")
    params: split_total = split_total
    shell: "split -l {params.split_total} input"
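One caveat about the shell line used throughout: split -l N writes N lines per output file, so passing the number of scatter items to it does not produce that many files. A minimal sketch of dividing a list of lines into exactly split_total pieces is given below; chunk_evenly is a hypothetical helper, and the input is assumed to fit in memory.

```python
def chunk_evenly(lines, n_chunks):
    """Split lines into exactly n_chunks consecutive pieces of near-equal size."""
    base, extra = divmod(len(lines), n_chunks)
    pieces, start = [], 0
    for index in range(n_chunks):
        # The first `extra` pieces take one additional line each.
        size = base + (1 if index < extra else 0)
        pieces.append(lines[start:start + size])
        start += size
    return pieces

split_total = 8
lines = ["line {}".format(i) for i in range(20)]
pieces = chunk_evenly(lines, split_total)
print([len(piece) for piece in pieces])
```

Producing exactly split_total pieces, even if some are short, keeps the number of files in line with the number of scatter items.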
Essentially, I want to know what the recommended way of handling equivalent file extensions is in Snakemake. For example, let's say I have a rule that counts the number of entries in a FASTA file. The rule might look something like this:
rule count_entries:
    input:
        ["{some}.fasta"]
    output:
        ["{some}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
This works great. But what if I want this rule to also permit "{some}.fa" as input?
Is there any clean way to do this?
EDIT:
Here is my best guess at the first proposed solution. This can probably be turned into a higher-order function to be more general purpose, but this is the basic idea as I understand it. I don't think this idea really fits any general use case though, as it doesn't cooperate with other rules at the "building DAG" stage.
import os
def handle_ext(wcs):
    base = wcs["base"]
    # Return the first candidate that already exists on disk
    for file_ext in [".fasta", ".fa"]:
        if os.path.exists(base + file_ext):
            return [base + file_ext]
rule count_entries:
    input:
        handle_ext
    output:
        ["{base}.entry_count"]
    shell:
        'grep -c ">" {input[0]} > {output[0]}'
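The higher-order version mentioned above could look roughly like the sketch below. first_existing is a hypothetical name, the wildcards object is read by attribute access, and SimpleNamespace only stands in for it so the example runs outside Snakemake. Unlike handle_ext, it raises an error instead of returning None when no candidate exists.

```python
import os
import tempfile
from types import SimpleNamespace

def first_existing(wildcard_name, extensions):
    """Build an input function that returns the first existing <base><ext> path."""
    def input_function(wildcards):
        base = getattr(wildcards, wildcard_name)
        for ext in extensions:
            candidate = base + ext
            if os.path.exists(candidate):
                return [candidate]
        # Fail loudly instead of returning None when nothing matches.
        raise FileNotFoundError(
            "no file for {!r} with extensions {}".format(base, extensions))
    return input_function

# Demo with a temporary file and a stand-in for the wildcards object.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "sample.fa"), "w").close()
    handle_ext = first_existing("base", [".fasta", ".fa"])
    resolved = handle_ext(SimpleNamespace(base=os.path.join(tmp, "sample")))
    print(resolved)
```

The same limitation applies as before: the check runs against files already on disk, so it does not help when the .fa or .fasta file is itself produced by another rule.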
EDIT2: Here is the best current solution as I see it...
count_entries_cmd = 'grep -c ">" {input} > {output}'
count_entries_output = "{some}.entry_count"
rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        count_entries_output
    shell:
        count_entries_cmd
rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        count_entries_output
    shell:
        count_entries_cmd
One thing I noticed is that you are trying to specify lists of files in both input and output sections but actually your rule takes a single file and produces another file.
I propose a straightforward solution: specify two separate rules for the different extensions:
rule count_entries_fasta:
    input:
        "{some}.fasta"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'
rule count_entries_fa:
    input:
        "{some}.fa"
    output:
        "{some}.entry_count"
    shell:
        'grep -c ">" {input} > {output}'
These rules are not ambiguous unless you keep files with the same {some} name and different extension in the same folder (which I hope you don't do).
One possible solution is to only allow the original rule to take .fasta files as input, but enable .fa files to be renamed to that. For example,
rule fa_to_fasta:
    input:
        "{some}.fa"
    output:
        temp("{some}.fasta")
    shell:
        """
        cp {input} {output}
        """
Clearly this has the disadvantage of making a temporary copy of the file. Also, if foo.fa and foo.fasta are both provided (not through the copying), then foo.fasta will silently overshadow foo.fa, even if they are different.
Even though OP has edited his entry and included the possible workaround via the input functions, I think it is best to list it also here as an answer to highlight this as possible solution. At least for me, this was the case :)
So, for example if you have an annotation table for your samples, which includes the respective extensions for each sample-file (e.g. via PEP), then you can create a function that returns these entries and pass this function as input to a rule. My example:
# Input function: returns the needed input files, based on the given wildcards (here: sample) and the sample annotations
# In my case the sample annotations were provided via PEP
def get_files_dynamically(wildcards):
    sample_file1 = pep.sample_table["file1"][wildcards.sample]
    sample_file2 = pep.sample_table["file"][wildcards.sample]
    return {"file1": sample_file1, "file2": sample_file2}
# 1. Perform trimming on fastq-files
rule run_rule1:
    input:
        unpack(get_files_dynamically)  # Unpacking allows naming the inputs
    output:
        output1="output/somewhere/{sample}_1.xyz.gz",
        output2="output/somewhere/{sample}_2.xyz.gz"
    shell:
        "do something..."
Here's an example of what I am trying to do:
mydictionary = {
    'apple': 'crunchy fruit',
    'banana': 'mushy and yellow'
}
rule all:
    input:
        expand('{key}.txt', key=mydictionary.keys())
rule test:
    output: temp('{f}.txt')
    shell:
        """
        echo {mydictionary[wildcards.f]} > {output}
        cat {output}
        """
For some reason, I am not able to access the dictionary contents. I tried using double-curly brackets, but the content of the text files becomes literal {mydictionary[wildcards.f]} (while I want the content of the corresponding entry in the dictionary).
I'm pretty sure the bracket markup can only replace variables with string representations of their values, but does not support any code evaluation within the brackets. That is, {mydictionary[wildcards.f]} will try to look up a variable literally named "mydictionary[wildcards.f]". Likewise, {mydictionary}[{wildcards.f}] will just paste the string values together. So, I don't think you can do what you want within the shell section alone. Instead, you can accomplish what you want in the params section:
rule test:
    output: temp('{f}.txt')
    params:
        value=lambda wcs: mydictionary[wcs.f]
    shell:
        """
        echo '{params.value}' > {output}
        cat {output}
        """
Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} with "data/samples/A.fastq" rather than with "A" (and "B" etc.), as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: and include a Python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.
How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: 'echo "OutputDir is {params.config[OutputDir]}" > {output}'
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
    output: OutputDir + "/myfile.txt"
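The point about definitions referencing earlier ones can be illustrated with a small sketch of what such an included Python file might contain. All names other than OutputDir and the "DATA_DIR" value are hypothetical, and the config dictionary is created here only so the snippet runs on its own (inside a Snakefile it already exists).

```python
# Stand-in for Snakemake's config object, so the example is self-contained.
config = {}

# Later definitions build on earlier ones, which plain YAML values cannot do.
DataDir = "DATA_DIR"
OutputDir = DataDir + "/results"   # derived from DataDir
LogDir = OutputDir + "/logs"       # derived from OutputDir

config["OutputDir"] = OutputDir
config["LogDir"] = LogDir

output_path = config["OutputDir"] + "/myfile.txt"
print(output_path)
```

Changing DataDir in one place then updates every derived path.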
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
    A: 10
    B: [1, 2, 99]
    C:
        nst1: "hello"
        nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
    'X': -2,
    'Y': range(4,7),
    'Z': {
        'nest1': "bye",
        'nest2': ["A", "list"]
    }
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
    output: 'cfgtest.txt'
    params: YAML=config["cfgtestYAML"], PY=cfgtestPY
    shell:
        """
        echo "params.YAML[A]: {params.YAML[A]}" >{output}
        echo "params.YAML[B]: {params.YAML[B]}" >>{output}
        echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
        echo "params.YAML[C]: {params.YAML[C]}" >>{output}
        echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
        echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
        echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
        echo "" >>{output}
        echo "params.PY[X]: {params.PY[X]}" >>{output}
        echo "params.PY[Y]: {params.PY[Y]}" >>{output}
        echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
        echo "params.PY[Z]: {params.PY[Z]}" >>{output}
        echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
        echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
        echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
        """
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list
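The unquoted-key behaviour seen above matches Python's own str.format, which can be tried outside Snakemake. The sketch below assumes Snakemake's brace markup follows the same indexing rules; it only covers scalar lookups, since whole lists are rendered differently (the run above prints "1 2 99", whereas plain Python would print "[1, 2, 99]").

```python
params_yaml = {"A": 10, "B": [1, 2, 99], "C": {"nst1": "hello", "nst2": ["big", "world"]}}

# Keys are written without quotes; all-digit indices are converted to integers.
a = "{p[A]}".format(p=params_yaml)
b2 = "{p[B][2]}".format(p=params_yaml)
nst1 = "{p[C][nst1]}".format(p=params_yaml)
nst2_1 = "{p[C][nst2][1]}".format(p=params_yaml)
print(a, b2, nst1, nst2_1)

# Quoting the key makes the quotes part of the key, so the lookup fails.
try:
    "{p['A']}".format(p=params_yaml)
    quoted_ok = True
except KeyError:
    quoted_ok = False
print(quoted_ok)
```

The text inside the brackets is always taken literally, so an expression such as a variable name or an attribute lookup placed there is treated as a key string rather than evaluated.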
YAML Configuration
This has to do with the nesting of YAML files, see an example here.
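The config["samples"] lookup gives access to both 'A' and 'B'. In my head I think of it as returning a list, but with the nested mapping shown in the question it is a dictionary; iterating over a dictionary yields its keys, which is why {sample} becomes "A" and "B".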
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
You can load YAML configuration files such as the following.
settings/config.yaml:
samples:
    - A
    - B
OR
settings/config.yaml:
sampleID:
    - 123
    - 124
    - 125
baseDIR: data
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
    input:
        expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
    input:
        expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
    output:
        "{baseDIR}/{ID}.bam"
    #Note different number of {}, 1 for wildcards not in expand.
    #Equivalent line with 'useless' expand call would be:
    #expand("{{baseDIR}}/{{ID}}.bam")
    shell:
        """
        bwa mem {input[0]} {input[1]} > {output}
        """
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
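The doubled braces in the fastq2bam input can be understood with plain str.format, which is used here as a stand-in for expand (an assumption for illustration): one round of formatting turns {{...}} into {...} and leaves it for the later wildcard substitution.

```python
pattern = "{{baseDIR}}/{{ID}}.{readDirection}.fastq"

# Doubled braces survive one round of formatting as single braces.
inputs = [pattern.format(readDirection=direction) for direction in ["1", "2"]]
print(inputs)

# A second round, standing in for wildcard substitution, fills the remaining fields.
resolved = [path.format(baseDIR="data", ID="123") for path in inputs]
print(resolved)
```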
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
    yamlVarLvl2:
        yamlVarLvl3:
            - '124'
            - '125'
            - '126'
            - '127'
            - '128'
            - '129'
Code Organization
Python and Snakemake code can, for the most part, be interleaved in certain places. I would advise against this as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g., using the run or the shell directive changes how to access the variables.
YAML and JSON files are preferred for configuration variables, as I believe they provide some support for editing and for command-line overriding of variables. This would not be as clean if it was implemented using externally imported Python variables. Also it helps my brain, knowing Python files do things, and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?
Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
import os
data_folder = config["dataFolder"]
rule name_of_the_rule:
    output:
        os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, snakemake has problems formatting the string when there is a mix of things coming from the wildcards and from elsewhere. But maybe the following works in Python 3.6, using formatted string literals: output: f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy" (the wildcard braces are doubled so that they survive the f-string formatting). I haven't checked.
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure Python, except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard Python, with config = json.load(open("config.json")) or config = yaml.safe_load(open("config.yaml")).
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.
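As a minimal sketch of the "plain Python before the rules" approach, the snippet below loads a JSON config by hand and derives values from it. The file content and the dataFolder and samples keys are made up for the example, and a temporary file is used only so the snippet is self-contained.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as handle:
        json.dump({"dataFolder": "data", "samples": {"A": "data/samples/A.fastq"}}, handle)

    # json.load expects an open file object, not a file name.
    with open(config_path) as handle:
        config = json.load(handle)

# Everything below is ordinary Python that could sit above the rules.
data_folder = config["dataFolder"]
sample_names = list(config["samples"])  # iterating a dict yields its keys
output_pattern = data_folder + "/{ID}/{ID}.yyy"
print(sample_names, output_pattern)
```

The resulting output_pattern still contains {ID}, which is left for Snakemake to treat as a wildcard.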
IPython's %hist command outputs recent commands entered by the user. Is it possible to search within these commands? Something like this:
[c for c in %history if c.startswith('plot')]
EDIT I am not looking for a way to rerun a command, but to locate it in the history list. Of course, sometimes I will want to rerun a command after locating it, either verbatim or with modifications.
EDIT: searching with Ctrl-r and then typing plot gives the most recent command that starts with "plot". It won't list all the commands that start with it. Nor can you search within the middle or the end of the commands.
Solution
Expanding on PreludeAndFugue's solution, here is what I was looking for:
[l for l in _ih if l.startswith('plot')]
Here, the if condition can be replaced by a regex match.
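A sketch of the regex variant follows. The _ih list is built by hand here because the real one only exists inside an IPython session; its contents are invented for the example.

```python
import re

# Stand-in for IPython's _ih list of input lines.
_ih = ["", "import numpy as np", "plot(x, y)", "fig = subplot(2, 1, 1)", "plot_all(data)"]

# re.match anchors at the start of the line, like startswith.
starts_with_plot = [line for line in _ih if re.match(r"plot", line)]
# re.search finds the pattern anywhere in the line.
contains_plot = [line for line in _ih if re.search(r"plot", line)]
print(starts_with_plot)
print(contains_plot)
```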
Even better: %hist -g pattern greps your past history for pattern. You can additionally restrict your search to the current session, or to a particular range of lines. See %hist?
So for @BorisGorelik's question you would have to do
%hist -g plot
Unfortunately you cannot do
%hist -g ^plot
nor
%hist -g "^plot"
If you want to re-run a command in your history, try Ctrl-r and then your search string.
I usually find myself wanting to search the entire ipython history across all previous and current sessions. For this I use:
from IPython.core.history import HistoryAccessor
hista = HistoryAccessor()
z1 = hista.search('*numpy*corr*')
z1.fetchall()
OR (don't run both or you will corrupt/erase your history)
ip = get_ipython()
sqlite_cursor = ip.history_manager.search('*numpy*corr*')
sqlite_cursor.fetchall()
The search string is not a regular expression. The IPython history_manager uses sqlite's glob * search syntax instead.
Similar to the first answer you can do the following:
''.join(_ih).split('\n')
To iterate through the command history items, you can do the following, and build your list comprehension from it:
for item in _ih:
    print(item)
This is documented in the following section of the documentation:
http://ipython.org/ipython-doc/dev/interactive/reference.html#input-caching-system
Here is a way you can do it:
''.join(_ip.IP.shell.input_hist).split('\n')
or
''.join(_ip.IP.shell.input_hist_raw).split('\n')
to prevent magic expansion.
from IPython.core.history import HistoryAccessor
def search_hist(pattern,
                print_matches=True,
                return_matches=True,
                wildcard=True):
    # Wrap the pattern in * so it matches anywhere in a command
    if wildcard:
        pattern = '*' + pattern + '*'
    matches = HistoryAccessor().search(pattern).fetchall()
    if not print_matches:
        return matches
    for i in matches:
        print('#' * 60)
        print(i[-1])
    if return_matches:
        return matches
%history [-n] [-o] [-p] [-t] [-f FILENAME] [-g [PATTERN [PATTERN ...]]]
[-l [LIMIT]] [-u]
[range [range ...]]
....
-g <[PATTERN [PATTERN …]]>
treat the arg as a glob pattern to search for in (full) history. This includes the saved history (almost all commands ever written). The pattern may contain ‘?’ to match one unknown character and ‘*’ to match any number of unknown characters. Use ‘%hist -g’ to show full saved history (may be very long).
Example (in my history):
In [23]: hist -g cliente*aza
655/58: cliente.test.alguna.update({"orden" : 1, "nuevo" : "azafran"})
655/59: cliente.test.alguna.update({"orden" : 1} , {$set : "nuevo" : "azafran"})
655/60: cliente.test.alguna.update({"orden" : 1} , {$set : {"nuevo" : "azafran"}})
Example (in my history):
In [24]: hist -g ?lie*aza
655/58: cliente.test.alguna.update({"orden" : 1, "nuevo" : "azafran"})
655/59: cliente.test.alguna.update({"orden" : 1} , {$set : "nuevo" : "azafran"})
655/60: cliente.test.alguna.update({"orden" : 1} , {$set : {"nuevo" : "azafran"}})
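To preview which commands a glob pattern would match without touching the history database, Python's fnmatch module can serve as a rough approximation of the glob matching described above. This is an assumption: it is not the same implementation as sqlite's GLOB, and the leading and trailing * are added here because the examples above match text in the middle of a line. glob_search and the second history entry are made up for the example.

```python
from fnmatch import fnmatchcase

history = [
    'cliente.test.alguna.update({"orden" : 1, "nuevo" : "azafran"})',
    'import numpy as np',
]

def glob_search(pattern, lines):
    """Return lines matching the pattern anywhere, using case-sensitive globbing."""
    return [line for line in lines if fnmatchcase(line, "*" + pattern + "*")]

print(glob_search("cliente*aza", history))
print(glob_search("?lie*aza", history))
```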