snakemake workflow on split files

Perhaps this question has already been answered but I could not come up with the correct query to find it...
I have a big file that needs to be analyzed. In order to do this quickly, I first split the big file into multiple small files and do my analysis on each of them separately in parallel. For this, I have something like this:
    rule all:
        input:
            'bigfile.{wildcards.partnum}.out',

    rule split_big_file:
        input: 'bigfile'
        output: touch('splitting_file.done')
        shell: 'split {input}'

    rule process_small_files:
        input:
            small_file = 'bigfile.{wildcards.partnum}',
            done = 'splitting_file.done'
        output: 'bigfile.{wildcards.partnum}.out'
        shell:
            'some_command {input.small_file} > {output}'
The rule split_big_file uses split command and generates files that have filenames like bigfile.001, bigfile.002, etc. I use touch('splitting_file.done') in the rule split_big_file to make sure that the next rule process_small_files does not start before it finishes. When I try to run this, I get a Missing input files for rule process_small_files error. How can I get around this?

The rule process_small_files needs a file such as bigfile.001, but as far as Snakemake knows, no rule in the workflow can produce it. split_big_file does create that file, but its output section declares only splitting_file.done, so Snakemake concludes that the workflow cannot generate bigfile.001 and assumes it should already exist.
Because split produces a different number of files depending on the size of the input file, you will need to use the dynamic files feature of Snakemake: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files (in more recent Snakemake releases, dynamic() has been superseded by checkpoints, documented under data-dependent conditional execution).
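To illustrate why the targets cannot be listed up front, here is a minimal sketch that discovers the chunks only after the split step has run and derives the matching .out targets from them. The helper name expected_outputs and the naming scheme (bigfile.<suffix> and bigfile.<suffix>.out) are assumptions taken from the question, not part of Snakemake. In a real workflow this kind of logic would typically live in an input function that is evaluated after the splitting step, for example behind a checkpoint.

```python
from glob import glob
import os


def expected_outputs(prefix="bigfile", directory="."):
    """Return the '.out' target for every chunk that split has produced.

    Hypothetical helper: assumes chunks are named '<prefix>.<suffix>'
    and results are named '<prefix>.<suffix>.out'.
    """
    pattern = os.path.join(directory, prefix + ".*")
    # Skip result files from an earlier run so they are not treated as chunks.
    chunks = sorted(p for p in glob(pattern) if not p.endswith(".out"))
    return [chunk + ".out" for chunk in chunks]
```

The list is only meaningful once split has finished, which is exactly the information Snakemake lacks when it builds the job graph from the rules in the question.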

Related

Snakemake pipeline using directories and files

I am building a snakemake pipeline with python scripts.
Some of the python scripts take as input a directory, while others take as input files inside those directories.
I would like to have some rules that take the directory as input and others that take the files as input. Is this possible?
Example of what I am doing showing only two rules:
    import glob

    FILES = glob.glob("data/*/*raw.csv")
    FOLDERS = glob.glob("data/*/")

    rule targets:
        input:
            processed_csv = expand("{files}raw_processed.csv", files=FILES),
            normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)

    rule process_raw_csv:
        input:
            script = "process.py",
            csv = "{sample}raw.csv"
        output:
            processed_csv = "{sample}raw_processed.csv"
        shell:
            "python {input.script} -i {input.csv} -o {output.processed_csv}"

    rule normalise_processed_csv:
        input:
            script = "normalise.py",
            processed_csv = "{sample}raw_processed.csv"  # Input to the script, but not passed on the command line; normalise.py fetches it itself
        params:
            folder = "{folders}"
        output:
            normalised_csv = "{folders}/normalised.csv"  # The output
        shell:
            "python {input.script} -i {params.folder}"
Some Python scripts (process.py) need every file they read or produce to be passed explicitly as an argument. Other scripts take only the main directory as input: they fetch their inputs from it and write their outputs into it.
I am considering rewriting all the python scripts so that they take the main directory as input, but I think there could be a smart solution to be able to run these two types on the same snakemake pipeline.
Thank you very much in advance.
P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake

"I would like to have some rules that take the directory as input and others that take the files as input. Is this possible?"
I don't see anything special with this requirement... What about this?
    rule one:
        output:
            d=directory('{sample}'),
            a='{sample}/a.txt',
            b='{sample}/b.txt',
        shell:
            r"""
            mkdir -p {output.d}
            touch {output.a}
            touch {output.b}
            """

    rule use_dir:
        input:
            d='{sample}',
        output:
            out='outdir/{sample}.out',
        shell:
            r"""
            cat {input.d}/* > {output.out}
            """

    rule use_files:
        input:
            a='{sample}/a.txt',
            b='{sample}/b.txt',
        output:
            out='outfiles/{sample}.out',
        shell:
            r"""
            cat {input.a} {input.b} > {output.out}
            """
rule use_dir will use the content of directory {sample}, whatever it contains. Rule use_files will use specifically files a.txt and b.txt from directory {sample}.

Using expand() in snakemake to input any file from list of directories

I have a rule that takes any and every TSV file (multiple TSVs) from a list of directories defined as tasks. For example:
    tasks
        foo
            example1.tsv
            circle.tsv
        bar
            rectangle.tsv
        square
            triangle.tsv
            triangle_1.tsv
I then have a rule in a Snakemake workflow that runs a script on the list of files as such:
    task_list = ["bar", "square"]

    rule gather_files:
        input:
            tsv=expand("results/stats/{tasks}/*.tsv", tasks=task_list)
        output:
            "results/plots/visualizations.pdf"
        script:
            "Rscript plot_script.R"
The *.tsv results in errors when I try to run the rule and I know it's not the correct way either. What is the best way to do this? Should I use regex to match any string in {task}/*.tsv? I want to limit the combinations of directories to expand on (tasks) but have no constraints on the filenames in them.

This is not very elegant but should work
    from glob import glob

    task_stats = ["foo", "bar", "square"]

    rule combine_files:
        input:
            tsv=[j for i in expand("results/stats/{tasks}/*.tsv", tasks=task_stats) for j in glob(i)]
        output:
            "results/plots/stats_visualizations.html"
        script:
            "../scripts/plot_all_stats.Rmd"
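The list comprehension above can also be written as a small helper, which is easier to read and to test outside the workflow. This is only a sketch: the function name tsv_inputs and the default root directory are assumptions for illustration. As with the one-liner, the glob runs when the Snakefile is parsed, so only files that already exist at that point are picked up.

```python
from glob import glob
import os


def tsv_inputs(tasks, root="results/stats"):
    """Collect every .tsv file under <root>/<task>/ for the given tasks."""
    files = []
    for task in tasks:
        # sorted() keeps the order of inputs stable between runs.
        files.extend(sorted(glob(os.path.join(root, task, "*.tsv"))))
    return files
```

In the rule, the input would then read tsv=tsv_inputs(task_stats).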

I've had the same question, and a hacky way that I used to solve this problem was by passing a directory as an input, and a matching pattern as a parameter of the rule:
    rule analysis:
        params:
            csvglob = "*.csv"
        input:
            folder="results/stats/{tasks}"
        output:
            "results/plots/stats_visualizations.html"
        script:
            "../scripts/plot_all_stats.Rmd"
And in my script I read that parameter as follows:
    from glob import glob

    rootdir = snakemake.input["folder"]
    csvglob = snakemake.params["csvglob"]
    files = glob(f"{rootdir}/{csvglob}")
I assume a similar approach would also work for R.
Downside: it feels hacky. Upside: it is quite easy to change the pattern, or to manipulate and filter the file list from within the script.

How to compare two binary files or sets of files and display the differences between them in Python?

I have two text files and want to compare them and write the comparison report to a separate file, like what we get in a Batch script with the command
fc /B file1.txt file2.txt > result.txt
I tried using filecmp.cmp('file1.txt', 'file2.txt'), but it only returns a Boolean value.
What is the correct method to do this?

Take a look at difflib.
https://docs.python.org/3/library/difflib.html
It's meant for this and difflib.context_diff should be what you're looking for.
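A minimal sketch of how difflib.context_diff could be used to write such a report. The function name diff_report and the UTF-8 encoding are assumptions for illustration. Note that this compares the files line by line as text, which is not the same as the byte-wise comparison that fc /B performs.

```python
import difflib


def diff_report(path_a, path_b, report_path):
    """Write a context diff of two text files to report_path.

    Returns True when the files differ, False when they are identical.
    """
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()

    # context_diff yields nothing at all when both inputs are identical.
    diff = list(difflib.context_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b))
    with open(report_path, "w", encoding="utf-8") as out:
        out.writelines(diff)
    return bool(diff)
```

Calling diff_report('file1.txt', 'file2.txt', 'result.txt') would then play the role of the redirect in the batch command.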

diff output of two python programs in windows cmd

I am trying to compare the output of two Python programs, which I will call tracing1.py and tracing2.py. Currently I am using process substitution with diff to compare their outputs, but the files cannot be found, even though they are in separate sub-directories of my current directory:
diff <(python /subdir1/tracing1.py) <(python /subdir2/tracing2.py)
When I run this, I get
The system cannot find the file specified.
I think I'm messing up some sort of path formatting, or else I'm using the process substitution incorrectly.
EDIT: In the end I decided that I didn't need to use process substitution, and instead could just diff program output after each program is run. However thanks to Fallenreaper in the comments, I was able to find a single command that does what I initially wanted:
python subdir1/tracing1.py > outfile1.txt & python subdir2/tracing2.py > outfile2.txt & diff outfile1.txt outfile2.txt

Sorry, not enough rep to comment yet :(
Your line works perfectly when you remove that leading slash. I would suggest using absolute path names or a path relative to the current directory, because the leading slash makes the path resolve from the root directory.
Cheers.
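If the shell does not support process substitution, the same comparison can be done from Python. The following is only a sketch: the function name diff_outputs is hypothetical, and it assumes both programs write their results to standard output and exit successfully.

```python
import difflib
import subprocess


def diff_outputs(cmd_a, cmd_b):
    """Run two commands and return a unified diff of their standard output."""
    # check=True raises CalledProcessError if a command exits with an error.
    out_a = subprocess.run(cmd_a, capture_output=True, text=True, check=True).stdout
    out_b = subprocess.run(cmd_b, capture_output=True, text=True, check=True).stdout
    return list(
        difflib.unified_diff(
            out_a.splitlines(keepends=True),
            out_b.splitlines(keepends=True),
            fromfile=" ".join(cmd_a),
            tofile=" ".join(cmd_b),
        )
    )
```

For the scripts in the question, the call would look like diff_outputs(["python", "subdir1/tracing1.py"], ["python", "subdir2/tracing2.py"]), with relative paths and no leading slash. An empty list means the outputs are identical.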

Input Wildcard Constraints Snakemake

I am new to snakemake, and I am using it to merge steps from two pipelines into a single larger pipeline. The issue is that several steps create similarly named files, and I cannot find a way to limit the wildcards, so I am getting a Missing input files for rule error on one step that I just cannot resolve.
The full snakefile is long, and is available here: https://mdd.li/snakefile
The relevant sections are (sections of the file are missing below):
    wildcard_constraints:
        sdir="[^/]+",
        name="[^/]+",
        prefix="[^/]+"

    # Several mapping rules here

    rule find_intersecting_snps:
        input:
            bam="hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.bam"
        params:
            hornet=os.path.expanduser(config['hornet']),
            snps=config['snps']
        output:
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq1.gz",
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq2.gz",
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.bam",
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.num.gz"
        shell:
            dedent(
                """\
                python2 {params.hornet}/find_intersecting_snps.py \
                    -p {input.bam} {params.snps}
                """
            )

    # Several remapping steps, similar to the first mapping steps, but in a different directory

    rule wasp_merge:
        input:
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
            "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.remap.kept.bam"
        output:
            "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.bwt2pairs.filt.bam"
        params:
            threads=config['job_cores']
        shell:
            dedent(
                """\
                {module}
                module load samtools
                samtools merge --threads {params.threads} {output} {input}
                """
            )

    # All future steps use the name style wildcard, like below

    rule move_results:
        input:
            "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.filt.bam"
        output:
            "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.bam"
        shell:
            dedent(
                """\
                mv {input} {output}
                """
            )
This pipeline is essentially doing some mapping steps in one directory structure that looks like hic_mapping/bowtie_results/bwt2/<subdir>/<file>, (where subdir is three different directories) then filtering the results, and doing another almost identical mapping step in hic_remap/bowtie_results/bwt2/<subdir>/<file>, before merging the results into an entirely new directory and collapsing the subdirectories into the file name: hic_mapping/wasp_results/<subdir>_<file>.
The problem I have is that the wasp_merge step breaks the find_intersecting_snps step if I collapse the subdirectory name into the filename. If I just maintain the subdirectory structure, everything works fine. Doing this would break future steps of the pipeline though.
The error I get is:
MissingInputException in line 243 of /godot/quertermous/PooledHiChip/pipeline/Snakefile:
Missing input files for rule find_intersecting_snps:
hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam
The correct file is:
hic_mapping/bowtie_results/bwt2/HCASMC5-8/HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_hg19.bwt2pairs.bam
But it is looking for:
hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam
Which is not created anywhere, nor defined by any rule. I think it is somehow getting confused by the existence of the file created by the wasp_merge step:
hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.filt.bam
Or possibly a downstream file (after the target that creates this error):
hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.bam
However, I have no idea why either of those files would confuse the find_intersecting_snps rule, because the directory structures are totally different.
I feel like I must be missing something obvious, because this error is so absurd, but I cannot figure out what it is.

The problem is that both the directory name and the file name contain underscores, and in the final file name I also separate the two components with an underscore, so the split point between {sdir} and {prefix} is ambiguous.
I can solve the issue either by changing that separator or by replacing the rule's input with a Python function that gets the names from elsewhere.
This works:
    rule wasp_merge:
        input:
            "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
            "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.remap.kept.bam"
        output:
            "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.bwt2pairs.filt.bam"
        params:
            threads=config['job_cores']
        shell:
            dedent(
                """\
                {module}
                module load samtools
                samtools merge --threads {params.threads} {output} {input}
                """
            )
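The ambiguity described in this answer can be reproduced outside Snakemake with plain regular expressions. This sketch assumes that each wildcard behaves like a greedy named group built from its constraint; the helper split_name and the tighter constraint [^/_]+ are hypothetical, and the latter only helps if directory names never contain underscores.

```python
import re

TARGET = "HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.filt.bam"
SUFFIX = r"_filt_hg19\.bwt2pairs\.filt\.bam"


def split_name(sdir_pattern, prefix_pattern="[^/]+", name=TARGET):
    """Match '<sdir>_<prefix>' plus the fixed suffix; return (sdir, prefix) or None."""
    regex = rf"(?P<sdir>{sdir_pattern})_(?P<prefix>{prefix_pattern}){SUFFIX}"
    match = re.fullmatch(regex, name)
    return (match["sdir"], match["prefix"]) if match else None


# Constraint from the question: sdir may contain underscores, so the greedy
# match pushes the split to the last underscore that still works.
print(split_name("[^/]+"))
# ('HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006', '001')

# Hypothetical tighter constraint: no underscores allowed in sdir.
print(split_name("[^/_]+"))
# ('HCASMC5-8', 'HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001')
```

The first result corresponds to the path reported in the MissingInputException, which is why find_intersecting_snps is asked for a file that no rule creates.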
