Snakemake: wildcards from a Python dictionary

I am writing a Snakefile that has input files in a specific location, with specific folder names (for this example, barcode9[456]). I need to change the naming conventions in these directories, so I now want to add a first rule to my Snakefile that links the folders in the original location (FASTQ_PATH) to an output folder in the Snakemake working directory. The names of the linked folders come from a Python dictionary d defined in the Snakefile. I would then use these directories as input for the downstream rules.
So the first rule of my Snakefile is actually a Python script (scripts/ln.py) that maps the naming convention in the original directory to the desired naming convention and links the fastqs.
The Snakefile looks like this:
FASTQ_PATH = '/path/to/original_location'

# dictionary that maps the subdirectories in FASTQ_PATH (keys) to the
# directory names that I want in the Snakemake working directory (values)
d = {'barcode94': 'IBC_UZL-CV5-04',
     'barcode95': 'IBC_UZL-CV5-42',
     'barcode96': 'IBC_UZL-CV5-100'}

rule all:
    input:
        # for each element in list(d.values()) I want to create a subdirectory
        # pointing to the corresponding path in the original location (FASTQ_PATH)
        expand('symLinkFq/{barcode}', barcode=list(d.values()))

rule symlink:
    input:
        FASTQ_PATH,
        d
    output:
        'symLinkFq/{barcode}'
    script:
        "scripts/ln.py"
The Python script that creates the links is shown below:
import subprocess as sp
import os

# parse variables passed in from the Snakefile
d_sampleNames = snakemake.input[1]
fastq_path = snakemake.input[0]

os.makedirs('symLinkFq')

for barcode, sampleName in d_sampleNames.items():
    # the relative path with respect to the working directory should
    # suffice for the DEST argument of the ln -s command
    sp.run(f"ln -s {fastq_path}/{barcode} symLinkFq/{sampleName}", shell=True)
But when I call snakemake -np -s Snakefile I get:

Building DAG of jobs...
MissingInputException in line 15 of /nexusb/SC2/ONT/scripts/SnakeMake/minimalExample/renameFq/Snakefile:
Missing input files for rule symlink:
    barcode95
    barcode94
    barcode96
The error kind of makes sense to me: the only 'inputs' I have are Python variables rather than files that actually exist on my system.
I guess the issue comes down to the fact that the wildcards I want to use across all rules are not present in any file that can be used as input, so the best I could think of was the dictionary with the correspondence, but it is not working as written.
Does anyone know how to get around this? Any different approach is welcome.

If I understand correctly, I think it could be easier to reverse the key/value mapping (here with dict(zip(...))) and then use a lambda input function to get the source directory for each output key:
import os

FASTQ_PATH = '/path/to/files'

d = {'barcode94': 'IBC_UZL-CV5-04',
     'barcode95': 'IBC_UZL-CV5-42',
     'barcode96': 'IBC_UZL-CV5-100'}

d = dict(zip(d.values(), d.keys()))  # values become keys and vice versa

rule all:
    input:
        expand('symLinkFq/{barcode}', barcode=d.keys())

rule symlink:
    input:
        indir=lambda wc: os.path.join(FASTQ_PATH, d[wc.barcode]),
    output:
        outdir=directory('symLinkFq/{barcode}'),
    shell:
        r"""
        ln -s {input.indir} {output.outdir}
        """
As an aside, in a Python script I would use os.symlink() instead of spawning a subprocess to call ln -s; I think it's easier to debug if something goes wrong.
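For example, a minimal sketch of scripts/ln.py using os.symlink(), reusing the dictionary and source path from the question:

import os

# values taken from the question; adjust as needed
fastq_path = '/path/to/original_location'
d_sampleNames = {'barcode94': 'IBC_UZL-CV5-04',
                 'barcode95': 'IBC_UZL-CV5-42',
                 'barcode96': 'IBC_UZL-CV5-100'}

os.makedirs('symLinkFq', exist_ok=True)
for barcode, sampleName in d_sampleNames.items():
    # os.symlink(src, dst) raises a clear FileExistsError if dst already
    # exists, instead of failing silently inside a subshell
    os.symlink(os.path.join(fastq_path, barcode),
               os.path.join('symLinkFq', sampleName))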

Related

Snakemake pipeline using directories and files

I am building a Snakemake pipeline with Python scripts.
Some of the Python scripts take a directory as input, while others take files inside those directories.
I would like to have some rules which take the directory as input and some that take the files. Is this possible?
Example of what I am doing, showing only two rules:
FILES = glob.glob("data/*/*raw.csv")
FOLDERS = glob.glob("data/*/")

rule targets:
    input:
        processed_csv = expand("{files}raw_processed.csv", files=FILES),
        normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)

rule process_raw_csv:
    input:
        script = "process.py",
        csv = "{sample}raw.csv"
    output:
        processed_csv = "{sample}raw_processed.csv"
    shell:
        "python {input.script} -i {input.csv} -o {output.processed_csv}"

rule normalise_processed_csv:
    input:
        script = "normalise.py",
        processed_csv = "{sample}raw_processed.csv" # input to the script but not parsed; it is fetched within normalise.py
    params:
        folder = "{folders}"
    output:
        normalised_csv = "{folders}/normalised.csv" # the output
    shell:
        "python {input.script} -i {params.folder}"
Some Python scripts (e.g. process.py) take all the files they need or produce as arguments, and these need to be parsed. Other Python scripts only take the main directory as input; their inputs are fetched inside it and their outputs written to it.
I am considering rewriting all the Python scripts so that they take the main directory as input, but I think there could be a smarter solution for running these two types in the same Snakemake pipeline.
Thank you very much in advance.
P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake
"I would like to have some rules which take as input the directory and some that take as input the files. Is this possible?"
I don't see anything special about this requirement... What about this?
rule one:
    output:
        d=directory('{sample}'),
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    shell:
        r"""
        mkdir -p {output.d}
        touch {output.a}
        touch {output.b}
        """

rule use_dir:
    input:
        d='{sample}',
    output:
        out='outdir/{sample}.out',
    shell:
        r"""
        cat {input.d}/* > {output.out}
        """

rule use_files:
    input:
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    output:
        out='outfiles/{sample}.out',
    shell:
        r"""
        cat {input.a} {input.b} > {output.out}
        """
Rule use_dir will use the content of directory {sample}, whatever it contains. Rule use_files will specifically use files a.txt and b.txt from directory {sample}.
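If it helps, here is a rule all that would request both kinds of targets (sample names here are made up):

SAMPLES = ['s1', 's2']  # hypothetical sample names

rule all:
    input:
        expand('outdir/{sample}.out', sample=SAMPLES),
        expand('outfiles/{sample}.out', sample=SAMPLES),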

Handle environment variables in config options

I have a Snakemake command line with configuration options like this:

snakemake --config \
    f1=$PWD/file1.txt \
    f2=$PWD/file2.txt \
    f3=/path/to/file3.txt \
    ...more key-value pairs \
    --directory /path/to/output/dir
file1.txt and file2.txt are expected to be in the same directory as the Snakefile; file3.txt is somewhere else. I need the paths to the files to be absolute, hence the $PWD variable, so Snakemake can find the files after moving to /path/to/output/dir.
Because I am accumulating configuration options, I would like to move all the --config items to a separate YAML configuration file. The problem is: how do I transfer the variable $PWD to a configuration file?
I could put a dummy string in the YAML file indicating that it should be replaced by the directory where the Snakefile lives (e.g. f1: <replace me>/file1.txt), but that feels awkward. Any better ideas? It may be that I should rethink how the fileX.txt files are passed to Snakemake...
You can access the directory the Snakefile lives in with workflow.basedir. You might be able to get away with specifying the relative paths in the config file and then building the absolute paths in your Snakefile, e.g. as:

import pathlib

file1 = pathlib.Path(workflow.basedir) / config["f1"]
file2 = pathlib.Path(workflow.basedir) / config["f2"]
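For instance, the config file could then hold plain paths (file names taken from the question; a minimal sketch):

# config.yaml
f1: file1.txt
f2: file2.txt
f3: /path/to/file3.txt

A convenient property of pathlib is that joining with an absolute path simply returns that absolute path, so pathlib.Path(workflow.basedir) / config["f3"] still resolves to /path/to/file3.txt and the same code handles both cases.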
One option is to use an external module, intake, to handle the environment-variable integration. There is a similar answer elsewhere, but here is a more specific example for this question.
Below is a YAML file following the syntax expected by intake: a field called sources contains nested entries, each specifying at the very least a (possibly local) url at which the file can be accessed:
# config.yml
sources:
  file1:
    args:
      url: "{{env(PWD)}}/file1.txt"
  file2:
    args:
      url: "{{env(PWD)}}/file2.txt"
Inside the Snakefile, the relevant code would be:
import intake

cat = intake.open_catalog('config.yml')
f1 = cat['file1'].urlpath
f2 = cat['file2'].urlpath
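The resolved paths can then be used like any other value in the Snakefile, e.g. (a trivial sketch):

rule all:
    input:
        f1,
        f2,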
Note that for less verbose YAML files, intake provides syntax for parameterization; see the docs or this example.

Running multiple config files with Bppancestor

I need to run Bppancestor with multiple config files. I have tried different approaches, but none of them worked. I have around 150 files, so running them one by one is not an efficient solution.
The syntax to run bppancestor is the following:

bppancestor params=config_file

I tried:

bppancestor params=directory_of_config_files/*

and using a Snakefile to try to automate the workflow:
ARCHIVE_FILE = 'bpp_output.tar.gz'

# a single config file
CONFIG_FILE = 'config_files/{sims}.conf'

# build the list of input files
CONF = glob_wildcards(CONFIG_FILE).sims

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# run bppancestor
rule bpp:
    input:
        CONF,
    shell:
        'bppancestor params={input}'

# create an archive with all results
rule create_archive:
    input: CONF,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
Could someone give me advice on this?
You're very close. Rule bpp should take a specific config file as input and specify a concrete output (I am not sure whether the output is a file or a folder). If I understand the syntax correctly, this link suggests that output files can be specified using output.sites.file and output.nodes.file:
rule bpp:
    input:
        CONFIG_FILE,
    output:
        sites='sites.{sims}',
        nodes='nodes.{sims}',
    shell:
        'bppancestor params={input} output.sites.file={output.sites} output.nodes.file={output.nodes}'
Rule create_archive will collect all the outputs and archive them:

rule create_archive:
    input: expand('sites.{sims}', sims=CONF), expand('nodes.{sims}', sims=CONF)
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'

Using expand() in Snakemake to input any file from a list of directories

I have a rule that takes any and every TSV file (multiple TSVs) from a list of directories defined as tasks. For example:
tasks
    foo
        example1.tsv
        circle.tsv
    bar
        rectangle.tsv
    square
        triangle.tsv
        triangle_1.tsv
I then have a rule in a Snakemake workflow that runs a script on the list of files as such:
task_list = ["bar", "square"]

rule gather_files:
    input:
        tsv=expand("results/stats/{tasks}/*.tsv", tasks=task_list)
    output:
        "results/plots/visualizations.pdf"
    script:
        "Rscript plot_script.R"
The *.tsv results in errors when I try to run the rule, and I know it's not the correct way either. What is the best way to do this? Should I use a regex to match any string in {tasks}/*.tsv? I want to limit the combinations of directories to expand over (tasks) but have no constraints on the filenames inside them.
This is not very elegant, but it should work:
from glob import glob

task_stats = ["foo", "bar", "square"]

rule combine_files:
    input:
        tsv=[j for i in expand("results/stats/{tasks}/*.tsv", tasks=task_stats) for j in glob(i)]
    output:
        "results/plots/stats_visualizations.html"
    script:
        "../scripts/plot_all_stats.Rmd"
I've had the same question, and a hacky way that I used to solve this problem was by passing a directory as an input, and a matching pattern as a parameter of the rule:
rule analysis:
    params:
        csvglob = "*.csv"
    input:
        folder="results/stats/{tasks}"
    output:
        "results/plots/stats_visualizations.html"
    script:
        "../scripts/plot_all_stats.Rmd"
And in my script I read from that parameter:

from glob import glob

rootdir = snakemake.input["folder"]
csvglob = snakemake.params["csvglob"]
files = glob(f"{rootdir}/{csvglob}")
A similar approach should also work for R.
Downside: it feels hacky. Upside: it is quite easy to change the pattern, or to manipulate and filter it from within the script.

Missing input files for rule all in snakemake

I am trying to construct a Snakemake pipeline for biosynthetic gene cluster detection but am struggling with this error:
Missing input files for rule all:
    antismash-output/Unmap_09/Unmap_09.txt
    antismash-output/Unmap_12/Unmap_12.txt
    antismash-output/Unmap_18/Unmap_18.txt
And so on with more files. As far as I can see, the file generation in the Snakefile should be working:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = config["samples"]

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired-end reads (either fasta or fastq) as prodigal only takes single-end reads
rule pear:
    input:
        forward = "{sample}{separator}1.{extension}",
        reverse = "{sample}{separator}2.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    conda:
        "~/miniconda3/envs/antismash"
    shell:
        "pear -f {input.forward} -r {input.reverse} -o {output} -t 21"

# if single-end, move the reads to the merged_reads directory instead
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# set the rule order on the two rules above, which should be treated equally and only one run
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but I prefer to do it outside
rule prodigal:
    input:
        "merged_reads/{sample}.{extension}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    conda:
        "~/miniconda3/envs/antismash"
    shell:
        "prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta"

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    conda:
        "~/miniconda3/envs/antismash"
    shell:
        "antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}"
This is an example of what my config.yaml file looks like:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
  - Unmap_14
  - Unmap_55
  - Unmap_37
separator: _
I cannot see where I am going wrong within the Snakefile to produce such an error. Apologies for the simple question; I am new to Snakemake.
The problem is that you set up your global wildcard constraints incorrectly:
wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"]) # <-- this should fix the problem
Immediately after that, another problem follows with the extension and separator wildcards: Snakemake can only infer what these should be from other filenames; you cannot actually set them through wildcard constraints. We can make use of f-string syntax to fill in what the values should be:
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    ...

and:

rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    ...
Take a look at Snakemake's wildcard regex documentation if the wildcard constraints confuse you, and find a blog post about f-strings if you are unsure about the f"" syntax and when to use a single { versus a double {{ to escape it.
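To illustrate the brace escaping with a tiny standalone example (plain Python; the config value here is made up):

ext = "fastq"  # stand-in for config['file_extension']
# doubled braces survive as literal braces, single braces interpolate
print(f"{{sample}}.{ext}")  # prints: {sample}.fastq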
Hope that helps!
(Since I can't comment yet...)
You might have a problem with your relative paths, since we cannot see where your files are actually located. A way to debug this is to use config["path_to_files"] to build absolute paths in input:
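For example, something along these lines (an untested sketch based on a rule from the question):

import os

rule prodigal:
    input:
        os.path.join(config["path_to_files"], "merged_reads/{sample}." + config["file_extension"])
    ...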
That would give you a better error message about where Snakemake expects the files; input and output paths are interpreted relative to the working directory.
