My Snakemake code is deleting the source directory after moving files

I have a Snakemake script that processes samples from a source folder containing raw data. In my last rule I check whether the final file exists; if it does, the rule should move the raw files from the source folder to another folder named after the sample.
However, after the rule runs, the source folder itself is deleted automatically. How can I prevent this?
I first wrote it as Python code:
rule movingData:
    input:
        file = config["ResultDir"]+"{sample}/Annotation/{sample}_FINAL.csv",
        first = config["SampleDir"]+"{sample}_R1.fastq.gz",
        second = config["SampleDir"]+"{sample}_R2.fastq.gz"
    output:
        temporary(touch(config["ResultDir"]+"{sample}/RawData/file.txt"))
    run:
        if os.path.exists(input.file):
            for sample in rawSAMPLES:
                os.rename(input.first,f'{config["ResultDir"]}{sample}/RawData/{sample}_R1.fastq.gz')
                os.rename(input.second,f'{config["ResultDir"]}{sample}/RawData/{sample}_R2.fastq.gz')
The same thing happens when I do it in shell:
rule movingData:
    input:
        file = config["ResultDir"]+"{sample}/Annotation/{sample}_FINAL.csv",
        first = config["SampleDir"]+"{sample}_R1.fastq.gz",
        second = config["SampleDir"]+"{sample}_R2.fastq.gz"
    output:
        temporary(touch(config["ResultDir"]+"{sample}/RawData/file.txt"))
    params: directory(config["ResultDir"]+"{sample}/RawData/")
    shell:
        '''
        [[ -f {input.file} ]] && mv {input.first} {input.second} {params}
        '''
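For reference, the check-then-move step described in the question can be sketched in plain Python. This is a minimal sketch and not a diagnosis of why the source folder disappears: the function name move_pair and its arguments are hypothetical, and it moves only the two files of a single sample instead of looping over every sample.

```python
import shutil
from pathlib import Path


def move_pair(final_file, first, second, dest_dir):
    """Move one sample's paired FASTQ files into dest_dir if final_file exists.

    Hypothetical helper for illustration. Returns the list of new paths,
    or an empty list if final_file is missing and nothing was moved.
    """
    if not Path(final_file).is_file():
        return []
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the RawData folder if needed
    moved = []
    for src in (first, second):
        target = dest / Path(src).name
        shutil.move(str(src), str(target))  # moves the file only, not its parent folder
        moved.append(target)
    return moved
```

Inside a run block this could be called once per job with input.file, input.first, input.second and the sample's RawData folder, so each job touches only its own files.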

Related

Snakemake pipeline using directories and files

I am building a Snakemake pipeline with Python scripts.
Some of the Python scripts take a directory as input, while others take files inside those directories.
I would like to have some rules that take the directory as input and some that take the files. Is this possible?
Example of what I am doing showing only two rules:
FILES = glob.glob("data/*/*raw.csv")
FOLDERS = glob.glob("data/*/")
rule targets:
    input:
        processed_csv = expand("{files}raw_processed.csv", files =FILES),
        normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)
rule process_raw_csv:
    input:
        script = "process.py",
        csv = "{sample}raw.csv"
    output:
        processed_csv = "{sample}raw_processed.csv"
    shell:
        "python {input.script} -i {input.csv} -o {output.processed_csv}"
rule normalise_processed_csv:
    input:
        script = "normalise.py",
        processed_csv = "{sample}raw_processed.csv" #This is input to the script but is not parsed, instead it is fetched within the code normalise.py
    params:
        folder = "{folders}"
    output:
        normalised_csv = "{folders}/normalised.csv" # The output
    shell:
        "python {input.script} -i {params.folder}"
Some Python scripts (such as process.py) take every file they read or write as a command-line argument. Others take only the main directory: they look up their inputs inside it and write their outputs to it.
I am considering rewriting all the Python scripts so that they take the main directory as input, but I think there could be a smarter way to run both types in the same Snakemake pipeline.
Thank you very much in advance.
P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake

"I would like to have some rules that take the directory as input and some that take the files. Is this possible?"
I don't see anything special with this requirement... What about this?
rule one:
    output:
        d=directory('{sample}'),
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    shell:
        r"""
        mkdir -p {output.d}
        touch {output.a}
        touch {output.b}
        """
rule use_dir:
    input:
        d='{sample}',
    output:
        out='outdir/{sample}.out',
    shell:
        r"""
        cat {input.d}/* > {output.out}
        """
rule use_files:
    input:
        a='{sample}/a.txt',
        b='{sample}/b.txt',
    output:
        out='outfiles/{sample}.out',
    shell:
        r"""
        cat {input.a} {input.b} > {output.out}
        """
rule use_dir will use the content of directory {sample}, whatever it contains. Rule use_files will use specifically files a.txt and b.txt from directory {sample}.

Running multiple config files with Bppancestor

I need to run Bppancestor with multiple config files. I have tried different approaches, but none of them worked. I have around 150 files, so running them one by one is not practical.
The syntax to run bppancestor is the following one:
bppancestor params=config_file
I tried doing:
bppancestor params=directory_of_config_files/*
and using a Snakefile to try to automate the workflow:
ARCHIVE_FILE = 'bpp_output.tar.gz'
# a single config file
CONFIG_FILE = 'config_files/{sims}.conf'
# Build the list of input files.
CONF = glob_wildcards(CONFIG_FILE).sims
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE
# run bppancestor
rule bpp:
    input:
        CONF,
    shell:
        'bppancestor params={input}'
# create an archive with all results
rule create_archive:
    input: CONF,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
Could someone give me advice on this?

You're very close. Rule bpp should use as input a specific config file and specify concrete output (not sure if the output is a file or a folder). If I understand the syntax correctly, this link suggests that output files can be specified using output.sites.file and output.nodes.file:
rule bpp:
    input:
        CONFIG_FILE,
    output:
        sites='sites.{sims}',
        nodes='nodes.{sims}',
    shell:
        'bppancestor params={input} output.sites.file={output.sites} output.nodes.file={output.nodes}'
Rule create_archive will collect all the outputs and archive them:
rule create_archive:
    # expand() takes the wildcard values as keyword arguments
    input: expand('sites.{sims}', sims=CONF), expand('nodes.{sims}', sims=CONF)
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
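To see what the create_archive inputs resolve to, here is a plain-Python sketch of the substitution that expand performs for a single wildcard. The helper expand_one and the config names are hypothetical stand-ins; in a Snakefile you would call Snakemake's own expand, passing the values by keyword (sims=CONF).

```python
def expand_one(pattern, **wildcards):
    """Fill a single-wildcard pattern once per value (simplified stand-in for expand)."""
    (name, values), = wildcards.items()  # exactly one keyword argument is expected
    return [pattern.replace("{" + name + "}", str(v)) for v in values]


# Hypothetical config names, i.e. config_files/sim1.conf and config_files/sim2.conf
CONF = ["sim1", "sim2"]

archive_inputs = expand_one("sites.{sims}", sims=CONF) + expand_one("nodes.{sims}", sims=CONF)
print(archive_inputs)  # ['sites.sim1', 'sites.sim2', 'nodes.sim1', 'nodes.sim2']
```

Because every file in this list is an output of rule bpp for some value of sims, requesting the archive makes Snakemake schedule one bpp job per config file.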

snakemake - wildcards from python dictionary

I am writing a Snakemake file whose input files live in a specific location, in folders with specific names (in this example, barcode9[456]). I need to change the naming convention of these directories, so I want to add a first rule that links the folders in the original location (FASTQ_PATH) to an output folder in the Snakemake working directory. The names of the link folders come from a Python dictionary d, defined in the Snakefile. I would then use these directories as input for the downstream rules.
The first rule runs a Python script (scripts/ln.py) that maps the naming convention in the original directory to the desired one and links the fastqs.
The Snakefile looks like this:
FASTQ_PATH = '/path/to/original_location'
# dictionary that maps the subdirectories in FASTQ_PATH (keys) with the directory names that I want in the snakemake working directory (values)
d = {'barcode94': 'IBC_UZL-CV5-04',
     'barcode95': 'IBC_UZL-CV5-42',
     'barcode96': 'IBC_UZL-CV5-100'}
rule all:
    input:
        expand('symLinkFq/{barcode}', barcode = list(d.values())) # for each element in list(d.values()) I want to create a subdirectory that would point to the corresponding path in the original location (FASTQ_PATH)
rule symlink:
    input:
        FASTQ_PATH,
        d
    output:
        'symLinkFq/{barcode}'
    script:
        "scripts/ln.py"
The python script that I am calling to make the links is shown below
import pandas as pd
import subprocess as sp
import os
# parsing variables from Snakefile
d_sampleNames = snakemake.input[1]
fastq_path = snakemake.input[0]
os.makedirs('symLinkFq')
for barcode in list(d_sampleNames.keys()):
    idx = list(d_sampleNames.keys()).index(barcode)
    sampleName = list(d_sampleNames.values())[idx]
    sp.run(f"ln -s {fastq_path}/{barcode} symLinkFq/{sampleName}", shell=True) # the relative path with respect to the working directory should suffice for the DEST in the ln -s command
But when I call snakemake -np -s Snakefile I get
Building DAG of jobs...
MissingInputException in line 15 of /nexusb/SC2/ONT/scripts/SnakeMake/minimalExample/renameFq/Snakefile:
Missing input files for rule symlink:
barcode95
barcode94
barcode96
The error kind of makes sense to me. The only 'input' files that I have are python variables instead of being files that do exist in my system.
I guess the issue that I am having comes down to the fact that the wildcards that I want to use for all rules are not present in any file that can be used as input, so what I can think of using is the dictionary with the correspondence, though it is not working as I tried.
Does anyone know how to get around this? Any other approach is welcome.

If I understand correctly, I think it could be easier...
I would reverse the key/value mapping (here with dict(zip(...))), then use a lambda input function to get the source directory for each output key:
FASTQ_PATH = '/path/to/files'
d = {'barcode94': 'IBC_UZL-CV5-04',
     'barcode95': 'IBC_UZL-CV5-42',
     'barcode96': 'IBC_UZL-CV5-100'}
d = dict(zip(d.values(), d.keys())) # Values become keys and vice versa
rule all:
    input:
        expand('symLinkFq/{barcode}', barcode = d.keys())
rule symlink:
    input:
        indir= lambda wc: os.path.join(FASTQ_PATH, d[wc.barcode]),
    output:
        outdir= directory('symLinkFq/{barcode}'),
    shell:
        r"""
        ln -s {input.indir} {output.outdir}
        """
As an aside, in a Python script I would use os.symlink() instead of spawning a subprocess and calling ln -s. I think it's easier to debug if something goes wrong.
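The aside about os.symlink() could look like the following in a standalone script. This is a sketch under assumptions: the function name link_samples and the out_dir default are illustrative, and the mapping is passed in directly rather than read from the snakemake object.

```python
import os


def link_samples(fastq_path, mapping, out_dir="symLinkFq"):
    """Create out_dir/<sample name> -> fastq_path/<barcode> for each mapping entry.

    mapping maps barcode directory names to the desired sample names.
    Returns the list of link paths.
    """
    os.makedirs(out_dir, exist_ok=True)  # unlike os.makedirs(out_dir), safe to re-run
    links = []
    for barcode, sample_name in mapping.items():
        # Use an absolute source so the link resolves from any working directory.
        src = os.path.abspath(os.path.join(fastq_path, barcode))
        dst = os.path.join(out_dir, sample_name)
        if not os.path.islink(dst):
            os.symlink(src, dst, target_is_directory=True)
        links.append(dst)
    return links
```

Iterating over mapping.items() yields each barcode together with its sample name, which avoids the index lookup through list(d.keys()) and list(d.values()). A failure in os.symlink surfaces as a Python exception (for example FileExistsError) instead of a non-zero exit code from a shell command.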

MissingOutputException with Snakemake

I'm having the following problem:
My Snakemake workflow does not recognise the output that my Python script generates. I tried both writing the output to stdout and redirecting it into the correct output file, and writing the file directly from the Python script (which is the version below).
Setting --latency-wait to 600 did not help either.
Other users reported that running ls helped, so I tried that while waiting for the latency period, but it made no difference.
Additionally, when I run it again, Snakemake wants to process all input files again, even though some output files already exist.
Does anyone have a suggestion what else I could try?
This is the snakemake command I'm using:
snakemake -j 2 --use-conda
Below is my snakefile:
import os
dir = "my/data/dir"
cell_lines = os.listdir(dir)
files = os.listdir(dir+"GM12878/25kb_resolution_intrachromosomal")
chromosomes = [i.split("_")[0] for i in files]
rule all:
    input:
        expand("~/TADs/{cell_lines}_{chromosomes}_TADs.tsv", cell_lines = cell_lines, chromosomes = chromosomes)
rule tad_calling:
    input:
        "my/data/dir/{cell_lines}/25kb_resolution_intrachromosomal/{chromosomes}_25kb.RAWobserved"
    output:
        "~/TADs/{cell_lines}_{chromosomes}_TADs.tsv"
    benchmark:
        "benchmarks/{cell_lines}_{chromosomes}.txt"
    conda:
        "my_env.yaml"
    shell:
        """
        python ~/script.py {input} {output}
        """

I think the problem is with the tilde (~). Snakemake does not expand it (or, for example, $HOME) and treats it as a literal part of the path. You could do something like:
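from pathlib import Path
home = str(Path.home())
rule tad_calling:
    ...
    output:
        # Doubled braces keep {cell_lines} and {chromosomes} as Snakemake wildcards inside the f-string
        f"{home}/TADs/{{cell_lines}}_{{chromosomes}}_TADs.tsv"
    ...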
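One detail to watch when combining an f-string with Snakemake wildcards: single braces are interpolated by Python immediately, so wildcard names need doubled braces to survive. The sketch below shows this in plain Python, along with os.path.expanduser as an alternative that leaves the braces untouched. The path is the one from the question; nothing here calls Snakemake.

```python
import os
from pathlib import Path

home = str(Path.home())

# "~" is just a character in a plain string; nothing expands it automatically.
literal = "~/TADs/{cell_lines}_{chromosomes}_TADs.tsv"

# os.path.expanduser replaces a leading "~" and leaves the wildcard braces alone.
expanded = os.path.expanduser(literal)

# In an f-string, doubled braces produce literal braces, so the wildcards survive.
pattern = f"{home}/TADs/{{cell_lines}}_{{chromosomes}}_TADs.tsv"

print(expanded)
print(pattern)
```

Either string can then be used as the output pattern, as long as rule all requests the same expanded path.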

Input Wildcard Constraints Snakemake

I am new to Snakemake, and I am using it to merge steps from two pipelines into a single larger pipeline. The issue is that several steps create similarly named files, and I cannot find a way to limit the wildcards, so I am getting a "Missing input files for rule" error on one step that I just cannot resolve.
The full snakefile is long, and is available here: https://mdd.li/snakefile
The relevant sections are (sections of the file are missing below):
wildcard_constraints:
    sdir="[^/]+",
    name="[^/]+",
    prefix="[^/]+"
# Several mapping rules here
rule find_intersecting_snps:
    input:
        bam="hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.bam"
    params:
        hornet=os.path.expanduser(config['hornet']),
        snps=config['snps']
    output:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq1.gz",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq2.gz",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.bam",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.num.gz"
    shell:
        dedent(
            """\
            python2 {params.hornet}/find_intersecting_snps.py \
                -p {input.bam} {params.snps}
            """
        )
# Several remapping steps, similar to the first mapping steps, but in a different directory
rule wasp_merge:
    input:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.remap.kept.bam"
    output:
        "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.bwt2pairs.filt.bam"
    params:
        threads=config['job_cores']
    shell:
        dedent(
            """\
            {module}
            module load samtools
            samtools merge --threads {params.threads} {output} {input}
            """
        )
# All future steps use the name style wildcard, like below
rule move_results:
    input:
        "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.filt.bam"
    output:
        "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.bam"
    shell:
        dedent(
            """\
            mv {input} {output}
            """
        )
This pipeline is essentially doing some mapping steps in one directory structure that looks like hic_mapping/bowtie_results/bwt2/<subdir>/<file>, (where subdir is three different directories) then filtering the results, and doing another almost identical mapping step in hic_remap/bowtie_results/bwt2/<subdir>/<file>, before merging the results into an entirely new directory and collapsing the subdirectories into the file name: hic_mapping/wasp_results/<subdir>_<file>.
The problem I have is that the wasp_merge step breaks the find_intersecting_snps step if I collapse the subdirectory name into the filename. If I just maintain the subdirectory structure, everything works fine. Doing this would break future steps of the pipeline though.
The error I get is:
MissingInputException in line 243 of /godot/quertermous/PooledHiChip/pipeline/Snakefile:
Missing input files for rule find_intersecting_snps:
hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam
The correct file is:
hic_mapping/bowtie_results/bwt2/HCASMC5-8/HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_hg19.bwt2pairs.bam
But it is looking for:
hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam
Which is not created anywhere, nor defined by any rule. I think it is somehow getting confused by the existence of the file created by the wasp_merge step:
hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.filt.bam
Or possibly a downstream file (after the target that creates this error):
hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.bam
However, I have no idea why either of those files would confuse the find_intersecting_snps rule, because the directory structures are totally different.
I feel like I must be missing something obvious, because this error is so absurd, but I cannot figure out what it is.

The problem is that both the directory name and the file name contain underscores, and in the final file name I separate the two components by underscores.
By either changing that separator character, or replacing the rule's input with a Python function that gets the names from elsewhere, I can solve the issue.
This works:
rule wasp_merge:
    input:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.remap.kept.bam"
    output:
        "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.bwt2pairs.filt.bam"
    params:
        threads=config['job_cores']
    shell:
        dedent(
            """\
            {module}
            module load samtools
            samtools merge --threads {params.threads} {output} {input}
            """
        )
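The underscore ambiguity can be reproduced with Python's re module. The sketch below approximates how the wasp_merge output pattern from the question, with both wildcards constrained to "[^/]+", is matched against the merged file name; the exact regular expression Snakemake builds internally may differ. The second pattern is a further option that is not part of the answer above and is untested against the full pipeline: it assumes directory names never contain an underscore, so sdir can exclude that character.

```python
import re

# Approximation of "{sdir}_{prefix}_filt_hg19.bwt2pairs.filt.bam" with both
# wildcards constrained to "[^/]+" (assumed form, for illustration only).
ambiguous = re.compile(
    r"(?P<sdir>[^/]+)_(?P<prefix>[^/]+)_filt_hg19\.bwt2pairs\.filt\.bam$"
)

name = "HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.filt.bam"

m = ambiguous.match(name)
# The greedy sdir group swallows every underscore it can, so the split is wrong.
print(m.group("sdir"))    # HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006
print(m.group("prefix"))  # 001

# Assumption: directory names never contain "_", so sdir may exclude it.
strict = re.compile(
    r"(?P<sdir>[^/_]+)_(?P<prefix>[^/]+)_filt_hg19\.bwt2pairs\.filt\.bam$"
)

s = strict.match(name)
print(s.group("sdir"))    # HCASMC5-8
print(s.group("prefix"))  # HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001
```

The first split is the one seen in the error message, where Snakemake looks for a directory ending in _L006 and a file starting with 001. In a Snakefile, the stricter variant would correspond to sdir="[^/_]+" under wildcard_constraints.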
