Adding wildcards to a workflow -- best practices

I have a rather complex Snakemake bioinformatics workflow consisting of >200 rules. It basically starts from a set of FASTQ files, from which variables are inferred, like so:
(WC1, WC2, WC3, WC4) = glob_wildcards(FASTQPATH + "{wc1}_{wc2}_{wc3}_{wc4}.fastq.gz")
Those are then expanded to generate the target files, for example (I am skipping intermediate rules for brevity):
rule all:
    input:
        expand("mappings/{wc1}_{wc2}_{wc3}_{wc4}.bam", wc1=WC1, wc2=WC2, wc3=WC3, wc4=WC4)
Over the course of a project, metadata can evolve and wildcards need to be added, e.g. wc5:
(WC1, WC2, WC3, WC4, WC5) = glob_wildcards(FASTQPATH + "{wc1}_{wc2}_{wc3}_{wc4}_{wc5}.fastq.gz")
This results in manually editing ~200 workflow rules to comply with the new input. I wonder if anyone in the community has come up with a more elegant, less cumbersome solution (using input functions perhaps?), or is it just a Snakemake limitation we all have to live with?
Thanks in advance

I have a workflow for ChIP-seq data, and my fastq files are named in the format MARK_TISSUE_REPLICATE.fastq.gz, so for example H3K4me3_Liver_B.fastq.gz. For many of my rules, I don't need to have separate wildcards for the mark, tissue, and replicate. I can just write my rules this way:
rule example:
    input: "{library}.fastq.gz"
    output: "{library}.bam"
Then for the rules where I need to have multiple inputs, maybe to combine replicates together or to do something across all tissues, I have a function I called "libraries" that returns a list of libraries matching certain criteria. For example, libraries(mark="H3K4me3") would return all libraries for that mark, and libraries(tissue="Liver", replicate="A") would return the libraries for all marks from that specific tissue sample. I can use this to write rules that need to combine multiple libraries, such as:
rule example2:
    input: lambda wildcards: expand("{library}.bam", library=libraries(mark=wildcards.mark))
    output: "{mark}_Heatmap_Clustering.png"
To fix some weird or ambiguous rule problems, I found it helpful to set some wildcard constraints like this:
wildcard_constraints:
    mark="[^_/]+",
    tissue="[^_/]+",
    replicate="[^_/]+",
    library="[^_/]+_[^_/]+_[^_/]+"
Hopefully you can apply some of these ideas to your own workflow.

I think @Colin is on the right (most Snakemake-ish) path here. However, if you want to make use of the wildcards, e.g. in the log, or if they dictate certain parameters, then you could try to replace the wildcard string with a variable and inject it into the input and output of rules:
metadata = "{wc1}_{wc2}_{wc3}_{wc4}"
WC1, WC2, WC3, WC4 = glob_wildcards(FASTQPATH + metadata + ".fastq.gz")
rule map:
input:
expand(f"unmapped/{metadata}.fq")
input:
expand(f"mappings/{metadata}.fq")
shell:
"""
echo {wildcards.wc1};
mv {input} {output}
"""
rule all:
expand("mappings/{wc1}_{wc2}_{wc3}_{wc4}.bam", wc1=WC1, wc2=WC2, wc3=WC3, wc4=WC4)
This way, changing to more or fewer wildcards is relatively easy.
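For example, adding a fifth wildcard then mostly amounts to changing the pattern in one place (rule all's expand call, and any {wildcards.wc5} references, would still need updating):
metadata = "{wc1}_{wc2}_{wc3}_{wc4}_{wc5}"
WC1, WC2, WC3, WC4, WC5 = glob_wildcards(FASTQPATH + metadata + ".fastq.gz")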
Disclaimer: I haven't tested whether any of this actually works :)

Related

Get input from specific wildcard configuration as input to another file in Snakemake

The wildcards configuration in Snakemake is well defined, but I have one question: is it possible to use the output from one specific wildcard configuration as the input for another?
For example, the rules are
rule preprocessing:
    input:
        "raw_data{dataset}.txt"
    output:
        "{dataset}/inputfile"

rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
With the wildcard sets

dataset:
    "textA"
    "textB"
    "textC"
group:
    "someGroupA"
    "someGroupB"
    "someGroupC"
What I am looking for is that textA is the base scenario and should update values in textC. However, when defining this with wildcards, it results in a cyclic dependency. If I add this as "raw_dataA.txt", it will not be executed or used. Is there any way to define that only the first configuration should be used as an input?
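A hedged sketch of one way this could be expressed, hard-coding the base dataset's output so the dependency is no longer cyclic (the rule name, the cat command, and the constraint are all illustrative, not from the question):
rule update_from_base:
    input:
        own="{dataset}/inputfile",
        base="textA/file.{group}.txt"  # concrete dataset name, not a wildcard
    output:
        "{dataset}/file.{group}.txt"
    wildcard_constraints:
        dataset="textB|textC"  # the plain complex_conversion rule, constrained to
                               # dataset="textA", would then produce the base file
    shell:
        "cat {input.own} {input.base} > {output}"  # placeholder command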

Is there a way for Snakemake to evaluate dynamic Snakefile constructs like `eval` does in GNU Make?

I would like to have various dynamic "shortcuts" (rule names) in my Snakemake workflow without needing marker files. The method I have in mind is similar to eval in GNU Make, but it doesn't seem like Snakemake can evaluate variable-expanded code in the Snakefile syntax. Is there a way to accomplish this?
Here's a simplified example Snakefile. I want to have a rule name corresponding to each output "stage", and right now I have to define them manually. Imagine if I had more "stages" and "steps" and wanted to have a rule that could make all "b", "d", or "z" files if I add those stages. It would be far more concise to dynamically define the rule name than to define every single combination, updated each time I add a new stage.
stages = ['a', 'b']
steps = [1, 2]

rule all:
    input:
        expand('{stage}{step}_file', stage=stages, step=steps)

rule:
    output:
        touch('{stage}{step}_file')

# Can these two be combined so that I don't have to add more
# rules for each new "stage" above while retaining the shorthand
# rule name corresponding to the stage?
rule a:
    input: expand('a{step}_file', step=steps)

rule b:
    input: expand('b{step}_file', step=steps)
Since a Snakefile is a Python file, the following might help to achieve what you are after:
for stage in stages:
    rule:
        name: f'{stage}'
        input: expand(f'{stage}{{step}}_file', step=steps)
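With this (the name directive requires a reasonably recent Snakemake release), running e.g. snakemake a or snakemake b gives the per-stage shorthand target without defining each rule by hand.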

How to make Snakemake input optional but not empty?

I'm building an SQL script out of text data. The (part of the) script shall consist of a CREATE TABLE statement and an optional INSERT INTO statement. The values for the INSERT INTO statement are taken from a list of files, each of which may or may not exist; all values from existing files are merged. The crucial part is that the INSERT INTO statement shall be skipped whenever no data file exists.
I've created a script in Snakemake that does that. There are two ambiguous rules that create the script: one that creates the script for empty data, and one that creates the table and inserts the data (the ambiguity is resolved with a ruleorder statement).
The interesting part is the rule that merges values from data files. It shall create the output whenever at least one input is present, and shall not be considered otherwise. There are two difficulties: making each input optional, and preventing Snakemake from using this rule when no files exist. I've done that with a trick:
import os

def require_at_least_one(filelist):
    existing = [file for file in filelist if os.path.isfile(file)]
    return existing if len(existing) else "non_existing_file"

rule merge_values:
    input: require_at_least_one(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"]))
    output: ...
    shell: ...
The require_at_least_one function takes a list of filenames and filters out those that don't correspond to an existing file. This makes each input optional. For the corner case when no file exists at all, the function returns a special value that represents a non-existing file. This allows Snakemake to prune this branch and prefer the rule that creates the script without the INSERT statement.
I feel like I'm reinventing the wheel, and moreover the "non_existing_file" trick looks a little dirty. Are there better, more idiomatic ways to do this in Snakemake?
My suggestion would be along these lines: don't force Snakemake to use or not use a rule from inside the rule itself; instead, specify which outputs you need and let Snakemake decide whether it needs the rule. So for your example, I would do something like:
def required_files(filelist):
    return [file for file in filelist if os.path.isfile(file)]

rule what_to_gen:
    input:
        merged='merged_files.txt' if required_files(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"])) else []

rule merge_values:
    input: required_files(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"]))
    output: 'merged_files.txt'
    shell: ...
This will make Snakemake run the rule merge_values only if required_files returns a non-empty list.

snakemake: optional input for rules

I was wondering if there is a way to have optional inputs in rules.
An example case is excluding unpaired reads for alignment (or having only unpaired reads). A pseudo rule example:
rule hisat2_align:
    input:
        rU=lambda wildcards: ('-U ' + read_files[wildcards.reads]['unpaired']) if wildcards.read_type == 'trimmed' else '',
        r1=lambda wildcards: '-1 ' + read_files[wildcards.reads]['R1'],
        r2=lambda wildcards: '-2 ' + read_files[wildcards.reads]['R2']
    output:
        'aligned.sam'
    params:
        idx='index_prefix',
        extra=''
    shell:
        'hisat2 {params.extra} -x {params.idx} {input.rU} {input.r1} {input.r2}'
Here, not having trimmed reads (rU='') would result in a missing input file error.
I can get around this with a duplicate rule that has an adjusted input/shell statement, or by handling the input through params (I'm sure there are other ways). I'm trying to handle this neatly so that this step can be run through a snakemake wrapper (currently a custom one).
The closest example I've seen is https://groups.google.com/d/msg/snakemake/qX7RfXDTDe4/XTMOoJpMAAAJ and Johannes' answer there. But that is a conditional assignment (e.g. input: 'a' if condition else 'b'), not an optional one.
Any help/guidance will be appreciated.
PS: optional input could also help with the varying number of hisat2 index files (as noted here: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/hisat2.html).
EDIT
To clarify the potential inputs:
1) Use single-end reads alone and declare them in rU. Read files for the sample might be
sample1_single_1.fastq.gz
sample1_single_2.fastq.gz
In this case r1 and r2 may be empty lists or not declared at all in the rule.
2) Use paired-end reads and declare them in r1 and r2. Read files for the sample might be
sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz
In this case rU may be an empty list or not declared at all in the rule.
3) Use paired and single-end reads together (e.g. output from Trimmomatic, where some pairs are broken). Read files for the sample might be
sample1_paired_1_R1.fastq.gz
sample1_paired_1_R2.fastq.gz
sample1_paired_2_R1.fastq.gz
sample1_paired_2_R2.fastq.gz
sample1_unpaired_1_R1.fastq.gz
sample1_unpaired_1_R2.fastq.gz
sample1_unpaired_2_R1.fastq.gz
sample1_unpaired_2_R2.fastq.gz
As a solution, I ended up using @timofeyprodanov's approach. I didn't realize an empty list could be used for this. Thanks for all the answers and comments!
I usually do it using expand with either an empty or a non-empty list:
rule a:
    input:
        expand('filename', proxy=[] if no_input else [None])
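To illustrate why this works (my example, not part of the original answer): expand() emits one copy of the pattern for every combination of the supplied values, even when the pattern doesn't reference them, so an empty list produces an empty input list:
expand('filename', proxy=[None])  # -> ['filename']
expand('filename', proxy=[])      # -> []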
One solution would be to pass the endedness info through the output filename or path. Something like the following should work with the existing wrapper:
def get_fastq_reads(wcs):
    if wcs.endedness == 'PE':  # Paired-end
        return [f"reads/{wcs.sample}.1.fastq.gz", f"reads/{wcs.sample}.2.fastq.gz"]
    if wcs.endedness == 'SE':  # Single-end
        return [f"reads/{wcs.sample}.fastq.gz"]
    raise ValueError("Unrecognized wildcard value for 'endedness': %s" % wcs.endedness)

rule hisat2:
    input:
        reads=get_fastq_reads
    output:
        "mapped/{sample}.{endedness}.bam"
    log:  # optional
        "logs/hisat2/{sample}.{endedness}.log"
    params:  # idx is required, extra is optional
        idx="genome",
        extra="--min-intronlen 1000"
    wildcard_constraints:
        endedness="(SE|PE)"
    threads: 8  # optional, defaults to 1
    wrapper:
        "0.27.1/bio/hisat2"
With this single rule, one could then map reads/tardigrade.fastq.gz with
> snakemake mapped/tardigrade.SE.bam
or reads/tardigrade.{1,2}.fastq.gz with
> snakemake mapped/tardigrade.PE.bam
Note on the Index Note
I'm confused by the note on the index files and think it may be wrong. HISAT2 doesn't accept files for that argument, but rather a single prefix that all index files have in common, so there should only ever be one argument there. The example in the documentation, idx="genome.fa", is misleading. The index that results from building the toy reference (22_20-21M.fa) that comes with HISAT2 is 22_20-21M_snp.{1..8}.ht2, in which case one would use idx="22_20-21M_snp".
In order to create a conditional input in Snakemake and avoid a missing input file error, you can use [] instead of "".
rule example:
    input:
        in1="this_input_is_always_there.json",
        in2="this_input_is_conditional.json" if condition else [],
In your example, by replacing
rU=lambda wildcards: ('-U ' + read_files[wildcards.reads]['unpaired']) if wildcards.read_type == 'trimmed' else '',
with
rU=lambda wildcards: ('-U ' + read_files[wildcards.reads]['unpaired']) if wildcards.read_type == 'trimmed' else [],
you achieve the conditionality of the input without having to use expand or an additional function.
I think mere's approach, i.e. letting the output file name carry the information about the endedness, is the most natural one in Snakemake. An alternative that requires duplicate rules but no conditional statements is the ruleorder directive, e.g. ruleorder: align_pe > align_se. Then the higher-priority rule will be used if the optional input exists.
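A minimal sketch of that ruleorder alternative, with illustrative file names and a plain hisat2 call rather than the wrapper (untested):
ruleorder: align_pe > align_se

rule align_pe:
    input:
        # paired files may or may not exist for a given sample
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz"
    output: "aligned/{sample}.sam"
    shell: "hisat2 -x index_prefix -1 {input.r1} -2 {input.r2} -S {output}"

rule align_se:
    input:
        rU="trimmed/{sample}_single.fastq.gz"
    output: "aligned/{sample}.sam"
    shell: "hisat2 -x index_prefix -U {input.rU} -S {output}"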

Snakemake: merging inputs with different suffixes to same-suffix output

Okay, I've been trying all day to solve this, to no avail... I am downloading and analysing RNA-sequencing data, and my analysis incorporates public datasets that come in two flavours: single-end reads and paired-end reads. In essence, every raw file that my workflow starts from can either be a single file named {sample}.fastq.gz or two files, named {sample}_1.fastq.gz and {sample}_2.fastq.gz, respectively.
I have all the samples and their read layouts (and some other info) in a metadata file which I parse with pandas into a dataframe. I need to be able to give parameters to my scripts (here simply abstracted to touch {output}) in order for them to perform their function depending on the read layout (they are all bash scripts using command line software like sratools and STAR). What I want to achieve is something along the following snakemake pseudocode:
# Metadata in a pandas dataframe
metadata = pd.DataFrame(SAMPLES, LAYOUTS, ...)  # pseudocode

# Function for retrieving metadata
def get_metadata(sample, column):
    result = metadata.loc[metadata['sample'] == sample][column].values[0]
    return result

# Rules
rule all:
    input:
        expand('{sample}.bam', sample=SAMPLES)

rule download:
    output:
        '{sample}.fastq.gz' for 'SINGLE' in metadata[LAYOUT],   # pseudocode
        '{sample}_1.fastq.gz' for 'PAIRED' in metadata[LAYOUT]  # pseudocode
    params:
        layout=lambda wildcards: get_metadata(wildcards.sample, layout_col)
    shell:
        'touch {output}'

rule align:
    input:
        '{sample}.fastq.gz' for 'SINGLE' in metadata[LAYOUT],   # pseudocode
        '{sample}_1.fastq.gz' for 'PAIRED' in metadata[LAYOUT]  # pseudocode
    params:
        layout=lambda wildcards: get_metadata(wildcards.sample, layout_col)
    output:
        '{sample}.bam'
    shell:
        'touch {output}'
In all code variations I have tried so far, I either create ambiguous rules, create single-end reads for paired-end IDs (and vice versa), or it all just fails. I have come up with two very unsatisfactory solutions:
1) Have two entirely separate workflows, one working on the single-end inputs and the other on the paired-end inputs, requiring the user to manually start both
2) Have a single workflow that separates the read layouts by adding a 'single'/'paired' prefix to every file in the workflow (i.e. single/{sample}.bam, etc.)
The first is unsatisfactory because the user has to start two different workflows, and the second because it adds a level of input data abstraction that is not present in the output data (since the output .bam files are created regardless of the input read layout, through options in my sub-scripts).
Does somebody have a better idea as to how to achieve this? If it's unclear as to what I'm after I'd be happy to elaborate.
You could use a function as input:
def align_input(wildcards):
    # Check if wildcards.sample is paired-end or single-end.
    # If single-end, return '{sample}.fastq.gz'.format(sample=wildcards.sample).
    # Else, return '{sample}_1.fastq.gz' and '{sample}_2.fastq.gz'
    # (formatted the same way) as a list.

rule align:
    input: align_input
    output: '{sample}.bam'
    shell: ...
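A minimal concrete version of that function, assuming the get_metadata helper from the question and a 'layout' column holding 'SINGLE' or 'PAIRED' (both of which are assumptions about the metadata):
def align_input(wildcards):
    # 'layout' and its PAIRED/SINGLE values are assumed column/value names;
    # adjust to the actual metadata.
    if get_metadata(wildcards.sample, 'layout') == 'PAIRED':
        return [f'{wildcards.sample}_1.fastq.gz',
                f'{wildcards.sample}_2.fastq.gz']
    return [f'{wildcards.sample}.fastq.gz']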
One thing to note: you have written your align rule so that its input lists the fastq files for all of your samples. You want to write your rules so that the input has the fastq file(s) for just a single sample, and the command to align that single sample. The wildcard {sample} means that the rule will be applied to each of your samples, one at a time. You should probably do something similar with your download rule.
Another solution would be to pre-download all the files outside of the workflow; then you can use two separate align rules:
rule align_se:
    input: '{sample}.fastq.gz'
    output: '{sample}.bam'

rule align_pe:
    input: '{sample}_1.fastq.gz', '{sample}_2.fastq.gz'
    output: '{sample}.bam'
Since the fastq files already exist, Snakemake will see for each sample that only one of the rules can be applied: the input files for the other rule are missing, and it has no rule to create them.
