Snakemake - Rule that downloads data using sftp - python

I want to download files from a password-protected SFTP server in a Snakemake rule. I have seen Maarten-vd-Sande's answer on specifying this with wildcards. Is it also possible using inputs, without running into a MissingInputException?
FILES = ['file1.txt',
         'file2.txt']

# remote file retrieval
rule download_file:
    # replacing input by output would download all files in one job?
    input:
        file = expand("{file}", file=FILES)
    shell:
        # this assumes your runtime has the SSHPASS env variable set
        "sshpass -e sftp -B 258048 server<< get {input.file} data/{input.file}; exit"
I have seen the hint on the SFTP class in snakemake, but I am unsure how to use it in this context.
Thanks in advance!

I haven't tested this, but I am guessing something like this should work! We declare all the output we want in rule all, and the download rule then downloads those files. I have no experience with snakemake.remote, so I might be completely wrong here.
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()

FILES = ['file1.txt',
         'file2.txt']

rule all:
    input:
        FILES

rule download_file:
    input:
        SFTP.remote("{filename}.txt")
    output:
        "{filename}.txt"
    # shell:   # I am not sure if the shell keyword is required; if not, you can remove these two lines
    #     ":"  # the : does nothing, it is just there for the sake of having something there

So I ended up using the following. The trick was how to pass the command to sftp, using <<< "command". The envvars directive lets snakemake check that SSHPASS is set, so that sshpass can pick it up.
envvars:
    "SSHPASS"

# remote file retrieval
# Idea: replace this using the SFTP class
rule download_file:
    output:
        raw = temp(os.path.join(config['DATADIR'], "{file}", "{file}.txt"))
    params:
        file = "{file}.txt"
    resources:
        walltime="300", nodes=1, mem_mb=2048
    threads:
        1
    shell:
        "sshpass -e sftp -B 258048 server <<< \"get {params.file} {output.raw} \""

Related

Snakemake pipeline using directories and files

I am building a snakemake pipeline with python scripts.
Some of the python scripts take as input a directory, while others take as input files inside those directories.
I would like to be able to have some rules that take the directory as input and others that take the files as input. Is this possible?
Example of what I am doing, showing only two rules:
import glob

FILES = glob.glob("data/*/*raw.csv")
FOLDERS = glob.glob("data/*/")

rule targets:
    input:
        processed_csv = expand("{files}raw_processed.csv", files=FILES),
        normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)

rule process_raw_csv:
    input:
        script = "process.py",
        csv = "{sample}raw.csv"
    output:
        processed_csv = "{sample}raw_processed.csv"
    shell:
        "python {input.script} -i {input.csv} -o {output.processed_csv}"

rule normalise_processed_csv:
    input:
        script = "normalise.py",
        processed_csv = "{sample}raw_processed.csv"  # input to the script but not parsed; it is fetched within normalise.py
    params:
        folder = "{folders}"
    output:
        normalised_csv = "{folders}/normalised.csv"  # the output
    shell:
        "python {input.script} -i {params.folder}"
Some python scripts (such as process.py) take all the files they need or produce as arguments, so these have to be parsed. Other python scripts only take the main directory as input; their inputs are fetched inside it and their outputs are written to it.
I am considering rewriting all the python scripts so that they take the main directory as input, but I suspect there is a smarter solution that lets both types run in the same snakemake pipeline.
Thank you very much in advance.
P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake
I would like to be able to have some rules that take the directory as input and others that take the files as input. Is this possible?
I don't see anything special about this requirement... What about this?
rule one:
    output:
        d=directory('{sample}'),
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    shell:
        r"""
        mkdir -p {output.d}
        touch {output.a}
        touch {output.b}
        """

rule use_dir:
    input:
        d='{sample}',
    output:
        out='outdir/{sample}.out',
    shell:
        r"""
        cat {input.d}/* > {output.out}
        """

rule use_files:
    input:
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    output:
        out='outfiles/{sample}.out',
    shell:
        r"""
        cat {input.a} {input.b} > {output.out}
        """
Rule use_dir will use the content of directory {sample}, whatever it contains, while rule use_files specifically uses files a.txt and b.txt from that directory.

Running multiple config files with Bppancestor

I need to run Bppancestor with multiple config files, but none of the approaches I have tried worked. I have around 150 files, so running them one by one is not an efficient solution.
The syntax to run bppancestor is:
bppancestor params=config_file
I tried:
bppancestor params=directory_of_config_files/*
and using a Snakefile to try to automate the workflow:
ARCHIVE_FILE = 'bpp_output.tar.gz'

# a single config file
CONFIG_FILE = 'config_files/{sims}.conf'

# Build the list of input files.
CONF = glob_wildcards(CONFIG_FILE).sims

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# run bppancestor
rule bpp:
    input:
        CONF,
    shell:
        'bppancestor params={input}'

# create an archive with all results
rule create_archive:
    input: CONF,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
Could someone give me advice on this?
You're very close. Rule bpp should take a specific config file as input and declare concrete output (I am not sure whether the output is a file or a folder). If I understand the syntax correctly, this link suggests that output files can be specified using output.sites.file and output.nodes.file:
rule bpp:
    input:
        CONFIG_FILE,
    output:
        sites='sites.{sims}',
        nodes='nodes.{sims}',
    shell:
        'bppancestor params={input} output.sites.file={output.sites} output.nodes.file={output.nodes}'
Rule create_archive will collect all the outputs and archive them:
rule create_archive:
    input: expand('sites.{sims}', sims=CONF), expand('nodes.{sims}', sims=CONF)
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'

snakemake: pass input that does not exist (or pass multiple params)

I am trying and struggling mightily to write a snakemake pipeline to download files from an aws s3 instance.
Because the organization and naming of my files on s3 is inconsistent, I do not want to use snakemake's remote options. Instead, I use a mix of grep and python to enumerate the paths I want on s3, and put them in a text file:
#s3paths.txt
s3://path/to/sample1.bam
s3://path/to/sample2.bam
In my config file I specify the samples I want to work with:
#config.yaml
samplesToDownload: [sample1, sample3, sample18]
I want to make a pipeline where the first rule downloads files from s3 who contain a string present in config['samplesToDownload']. A runtime code snippet does this for me:
pathsToDownload: [path for path in s3paths.txt if path contains string in samplesToDownload]
All this works fine, and I am left with a global variable pathsToDownload that looks something like this:
pathsToDownload: ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
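For reference, a simplified version of that snippet could look roughly like this inside the Snakefile (the file name s3paths.txt and the samplesToDownload config key are from above; the parsing details are an assumption, and the config is assumed to already be loaded via configfile):

configfile: "config.yaml"

# keep only the s3 paths that mention one of the requested samples
with open("s3paths.txt") as fh:
    allPaths = [line.strip() for line in fh if line.strip() and not line.startswith("#")]

pathsToDownload = [p for p in allPaths
                   if any(sample in p for sample in config["samplesToDownload"])]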
Now I try to get snakemake involved and struggle. If I try to put the python variable in inputs, snakemake refuses because the file does not exist locally:
rule download_bams_from_s3:
    input:
        s3Path = pathsToDownload
    output:
        expand("where/I/want/file/{sample}.bam", sample=config['samplesToDownload'])
    shell:
        "aws s3 cp {input.s3Path} where/I/want/file/{sample}.bam"
This fails because input.s3Path cannot be found as it is a path on s3, not a local path. I then try to do the same but with the pathsToDownload as a param:
rule download_bams_from_s3:
    params:
        s3Path = pathsToDownload
    output:
        expand("where/I/want/file/{sample}.bam", sample=config['samplesToDownload'])
    shell:
        "aws s3 cp {params.s3Path} where/I/want/file/{sample}.bam"
This doesn't produce an error, but it produces the wrong type of shell command. Instead of producing what I want, which is 3 total shell commands:
shell: aws s3 cp path/to/sample1 where/I/want/file/sample1.bam
shell: aws s3 cp path/to/sample3 where/I/want/file/sample3.bam
shell: aws s3 cp path/to/sample18 where/I/want/file/sample18.bam
it produces one shell command with all three paths:
shell: aws s3 cp path/to/sample1 path/to/sample3 path/to/sample18 where/I/want/file/sample1.bam where/I/want/file/sample3.bam where/I/want/file/sample18.bam
Even if I were able to properly construct one massive shell command, it is not what I want: I want separate shell commands, to take advantage of snakemake's parallelization and its ability not to re-download a file that already exists.
I feel like this use case for snakemake is not a big ask but I have spent hours trying to construct something workable to no avail. A clean solution is much appreciated!
You could create a dictionary that maps samples to aws paths and use that dictionary to download files one by one. Like:
samplesToDownload = ['sample1', 'sample3', 'sample18']
pathsToDownload = ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
samplesToPaths = dict(zip(samplesToDownload, pathsToDownload))

rule all:
    input:
        expand('where/I/want/file/{sample}.bam', sample=samplesToDownload),

rule download_bams_from_s3:
    params:
        s3Path = lambda wc: samplesToPaths[wc.sample],
    output:
        bam = 'where/I/want/file/{sample}.bam',
    shell:
        r"""
        aws s3 cp {params.s3Path} {output.bam}
        """

How to handle ftp links provided in config file in snakemake?

I am attempting to build a snakemake workflow that will provide a symlink to a local file if it exists or if the file does not exist will download the file and integrate it into the workflow. To do this I am using two rules with the same output with preference given to the linking rule (ln_fastq_pe below) using ruleorder.
Whether the file exists or not is known before execution of the workflow. The file paths or FTP links are provided in a tab-delimited config file that the workflow uses to read in samples.
e.g. the contents of samples.txt:
id sample_name fq1 fq2
b test_paired resources/SRR1945436_1.fastq.gz resources/SRR1945436_2.fastq.gz
c test_paired2 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR194/005/SRR1945435/SRR1945435_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR194/005/SRR1945435/SRR1945435_2.fastq.gz
relevant code from the workflow here:
import pandas as pd
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider

FTP = FTPRemoteProvider()

configfile: "config/config.yaml"

samples = pd.read_table("config/samples.tsv").set_index("id", drop=False)
all_ids = list(samples["id"])

ruleorder: ln_fastq_pe > dl_fastq_pe

rule dl_fastq_pe:
    """
    download file from ftp link
    """
    input:
        fq1=lambda wildcards: FTP.remote(samples.loc[wildcards.id, "fq1"], keep_local=True),
        fq2=lambda wildcards: FTP.remote(samples.loc[wildcards.id, "fq2"], keep_local=True)
    output:
        "resources/fq/{id}_1.fq.gz",
        "resources/fq/{id}_2.fq.gz"
    shell:
        """
        mv {input.fq1} {output[0]}
        mv {input.fq2} {output[1]}
        """

rule ln_fastq_pe:
    """
    link file
    """
    input:
        fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"],
        fq2=lambda wildcards: samples.loc[wildcards.id, "fq2"]
    output:
        "resources/fq/{id}_1.fq.gz",
        "resources/fq/{id}_2.fq.gz"
    shell:
        """
        ln -sr {input.fq1} {output[0]}
        ln -sr {input.fq2} {output[1]}
        """
When I run this workflow, I receive the following error pointing to the line describing the ln_fastq_pe rule.
WorkflowError in line 58 of /path/to/Snakefile:
Function did not return str or list of str.
I think the error is in how I am describing the FTP links in the samples.txt config file in the dl_fastq_pe rule. What is the proper way to describe FTP links given in a tabular config file so that snakemake will understand them and can download and use the files in a workflow?
Also, is it possible to do what I am trying to do and will this method get me there? I have tried other solutions (e.g. using python code to check if file exists and executing one set of shell commands if it does and the other if it doesn't) to no avail.
You are trying to pass pandas objects to Snakemake. The latter expects values of type str or list[str] in the input section of a rule, but the values you provide (samples.loc[wildcards.id, "fq1"]) are of type pandas.core.frame.DataFrame or pandas.core.series.Series. You need to convert them to what Snakemake expects; for example, samples.loc[wildcards.id, "fq1"].tolist() may help.
I figured out how to do this by omitting input, reading the fields from samples.tsv through params instead, and merging the two rules into one. Snakemake is not picky about what is read in through params, unlike input. I then use the test command to check whether the file exists: if it does, the rule symlinks it; if not, it downloads the file with wget.
The solution is as follows:
import os
import pandas as pd

samples = pd.read_table("config/samples.tsv").set_index("id", drop=False)
all_ids = list(samples["id"])

rule all:
    input:
        expand("resources/fq/{id}_1.fq.gz", id=all_ids),
        expand("resources/fq/{id}_2.fq.gz", id=all_ids)

rule dl_fastq_pe:
    """
    if file exists, symlink. If file doesn't exist, download to resources
    """
    params:
        fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"],
        fq2=lambda wildcards: samples.loc[wildcards.id, "fq2"]
    output:
        "resources/fq/{id}_1.fq.gz",
        "resources/fq/{id}_2.fq.gz"
    shell:
        """
        if test -f {params.fq1}
        then
            ln -sr {params.fq1} {output[0]}
            ln -sr {params.fq2} {output[1]}
        else
            wget --no-check-certificate -O {output[0]} {params.fq1}
            wget --no-check-certificate -O {output[1]} {params.fq2}
        fi
        """

MissingOutputException with Snakemake

I'm having the following problem:
My Snakemake program does not recognise the output my python script generated. I tried both writing the output to stdout and redirecting it into the correct output file, and writing it directly from the python script (which is the version below).
Setting --latency-wait to 600 did not help either.
Other users reported that running ls helped, which I tried while waiting for the latency, but that didn't help either.
Additionally, when running again, snakemake wants to process all input files again, despite some output files already existing.
Does anyone have a suggestion what else I could try?
This is the snakemake command I'm using:
snakemake -j 2 --use-conda
Below is my snakefile:
import os

dir = "my/data/dir"
cell_lines = os.listdir(dir)
files = os.listdir(dir + "GM12878/25kb_resolution_intrachromosomal")
chromosomes = [i.split("_")[0] for i in files]

rule all:
    input:
        expand("~/TADs/{cell_lines}_{chromosomes}_TADs.tsv", cell_lines=cell_lines, chromosomes=chromosomes)

rule tad_calling:
    input:
        "my/data/dir/{cell_lines}/25kb_resolution_intrachromosomal/{chromosomes}_25kb.RAWobserved"
    output:
        "~/TADs/{cell_lines}_{chromosomes}_TADs.tsv"
    benchmark:
        "benchmarks/{cell_lines}_{chromosomes}.txt"
    conda:
        "my_env.yaml"
    shell:
        """
        python ~/script.py {input} {output}
        """
I think the problem is the tilde (~): snakemake does not expand it (or e.g. $HOME) and takes it as a literal path. You could do something like:
from pathlib import Path

home = str(Path.home())

rule tad_calling:
    ...
    output:
        # double braces so the snakemake wildcards survive the f-string
        f"{home}/TADs/{{cell_lines}}_{{chromosomes}}_TADs.tsv"
    ...
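The same substitution is needed in rule all, which also refers to ~/TADs; a minimal sketch reusing the home variable from above (os.path.expanduser("~") would work just as well):

rule all:
    input:
        expand(home + "/TADs/{cell_lines}_{chromosomes}_TADs.tsv",
               cell_lines=cell_lines, chromosomes=chromosomes)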
