snakemake: pass input that does not exist (or pass multiple params) - python

I am trying, and struggling mightily, to write a snakemake pipeline to download files from an AWS S3 bucket.
Because the organization and naming of my files on S3 is inconsistent, I do not want to use snakemake's remote options. Instead, I use a mix of grep and python to enumerate the paths I want on S3 and put them in a text file:
#s3paths.txt
s3://path/to/sample1.bam
s3://path/to/sample2.bam
In my config file I specify the samples I want to work with:
#config.yaml
samplesToDownload: [sample1, sample3, sample18]
I want to make a pipeline where the first rule downloads the files from S3 whose names contain a string present in config['samplesToDownload']. A bit of python run at the top of the Snakefile does this for me, roughly:
pathsToDownload = [path.strip() for path in open('s3paths.txt')
                   if any(sample in path for sample in config['samplesToDownload'])]
All this works fine, and I am left with a global variable pathsToDownload that looks something like this:
pathsToDownload: ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
Now I try to get snakemake involved and struggle. If I try to put the python variable in inputs, snakemake refuses because the file does not exist locally:
rule download_bams_from_s3:
    input:
        s3Path = pathsToDownload
    output:
        expand("where/I/want/file/{sample}.bam", sample=config['samplesToDownload'])
    shell:
        "aws s3 cp {input.s3Path} where/I/want/file/{sample}.bam"
This fails because input.s3Path cannot be found, as it is a path on S3, not a local path. I then try to do the same but with pathsToDownload as a param:
rule download_bams_from_s3:
    params:
        s3Path = pathsToDownload
    output:
        expand("where/I/want/file/{sample}.bam", sample=config['samplesToDownload'])
    shell:
        "aws s3 cp {params.s3Path} where/I/want/file/{sample}.bam"
This doesn't produce an error, but it produces the wrong type of shell command. Instead of producing what I want, which is 3 total shell commands:
shell: aws s3 cp path/to/sample1 where/I/want/file/sample1.bam
shell: aws s3 cp path/to/sample3 where/I/want/file/sample3.bam
shell: aws s3 cp path/to/sample18 where/I/want/file/sample18.bam
it produces one shell command with all three paths:
shell: aws s3 cp path/to/sample1 path/to/sample3 path/to/sample18 where/I/want/file/sample1.bam where/I/want/file/sample3.bam where/I/want/file/sample18.bam
Even if I were able to properly construct one massive shell command, it is not what I want, because I want separate shell commands to take advantage of snakemake's parallelization and its ability to not redownload a file that already exists.
I feel like this use case for snakemake is not a big ask but I have spent hours trying to construct something workable to no avail. A clean solution is much appreciated!

You could create a dictionary that maps samples to S3 paths and use that dictionary to download the files one by one. Like:
samplesToDownload = ['sample1', 'sample3', 'sample18']
pathsToDownload = ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
samplesToPaths = dict(zip(samplesToDownload, pathsToDownload))
rule all:
    input:
        expand('where/I/want/file/{sample}.bam', sample=samplesToDownload),

rule download_bams_from_s3:
    params:
        s3Path = lambda wc: samplesToPaths[wc.sample],
    output:
        bam = 'where/I/want/file/{sample}.bam',
    shell:
        r"""
        aws s3 cp {params.s3Path} {output.bam}
        """

Related

Snakemake pipeline using directories and files

I am building a snakemake pipeline with python scripts.
Some of the python scripts take a directory as input, while others take files inside those directories.
I would like to have some rules that take the directory as input and some that take the files as input. Is this possible?
Example of what I am doing showing only two rules:
import glob

FILES = glob.glob("data/*/*raw.csv")
FOLDERS = glob.glob("data/*/")

rule targets:
    input:
        processed_csv = expand("{files}raw_processed.csv", files=FILES),
        normalised_csv = expand("{folders}/normalised.csv", folders=FOLDERS)
rule process_raw_csv:
    input:
        script = "process.py",
        csv = "{sample}raw.csv"
    output:
        processed_csv = "{sample}raw_processed.csv"
    shell:
        "python {input.script} -i {input.csv} -o {output.processed_csv}"
rule normalise_processed_csv:
    input:
        script = "normalise.py",
        processed_csv = "{sample}raw_processed.csv"  # input to the script, but not passed on the command line; it is fetched inside normalise.py
    params:
        folder = "{folders}"
    output:
        normalised_csv = "{folders}/normalised.csv"  # the output
    shell:
        "python {input.script} -i {params.folder}"
Some python scripts (process.py) take all the files they need or produce as explicit arguments, so those paths have to be passed on the command line. Other python scripts only take the main directory as input; their inputs are fetched inside it and their outputs are written into it.
I am considering rewriting all the python scripts so that they take the main directory as input, but I think there could be a smarter solution that lets both types run in the same snakemake pipeline.
Thank you very much in advance.
P.S. I have checked and this question is similar but not the same: Process multiple directories and all files within using snakemake
I would like to have some rules that take the directory as input and some that take the files as input. Is this possible?
I don't see anything special with this requirement... What about this?
rule one:
    output:
        d = directory('{sample}'),
        a = '{sample}/a.txt',
        b = '{sample}/b.txt',
    shell:
        r"""
        mkdir -p {output.d}
        touch {output.a}
        touch {output.b}
        """

rule use_dir:
    input:
        d = '{sample}',
    output:
        out = 'outdir/{sample}.out',
    shell:
        r"""
        cat {input.d}/* > {output.out}
        """

rule use_files:
    input:
        a = '{sample}/a.txt',
        b = '{sample}/b.txt',
    output:
        out = 'outfiles/{sample}.out',
    shell:
        r"""
        cat {input.a} {input.b} > {output.out}
        """
Rule use_dir will use the content of directory {sample}, whatever it contains. Rule use_files will specifically use files a.txt and b.txt from directory {sample}.
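To actually request these outputs you still list concrete targets somewhere, for example in a target rule. A small sketch with made-up sample names:

SAMPLES = ['sampleA', 'sampleB']  # placeholder sample names

rule all:
    input:
        expand('outdir/{sample}.out', sample=SAMPLES),
        expand('outfiles/{sample}.out', sample=SAMPLES),

Alternatively, you can ask for individual files on the command line, e.g. snakemake outdir/sampleA.out.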

How to change filepath names in Amazon S3 bucket?

Hi all,
I have around 900 objects in the following structure:
s3/bucketname/INGESTIONDATE=2022-02-20/file.csv
s3/bucketname/INGESTIONDATE=2022-02-21/file.csv
s3/bucketname/INGESTIONDATE=2022-02-22/file.csv
etc..
I need to change the file path to:
s3/bucketname/ingest_date=2022-02-20/file.json
s3/bucketname/ingest_date=2022-02-21/file.json
s3/bucketname/ingest_date=2022-02-22/file.json
This is around 900 objects so I do not plan on doing it by hand through the console.
Also, I am not too bothered about the JSON conversion; I can do that. It is mainly about changing the file paths and copying to a new bucket.
Any ideas?
You can do something like this using the aws cli -
aws s3 ls s3://bucketname/ --recursive | awk '{print $4}' | while read f; do newf=${f//INGESTIONDATE/ingest_date} && aws s3 mv s3://bucketname/${f} s3://bucketname/${newf%???}json ; done
aws s3 ls s3://bucketname/ --recursive - lists the files in the bucket
newf=${f//INGESTIONDATE/ingest_date} - replaces INGESTIONDATE with ingest_date and stores the result in newf
We iterate through the file names as f with the corresponding newf
And move each file, changing the suffix from csv to json (${newf%???} removes the last 3 chars of newf), using the aws s3 mv command
For testing, you can run aws s3 mv with the --dryrun flag to see the mv commands that would be run.
As pointed out by @Konrad, this assumes that there are no whitespace/newline chars in the file paths (which would make the awk return a truncated file path).
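If you would rather do the rename from Python than with a shell loop, here is a rough boto3 sketch (untested; the bucket name is a placeholder, and it assumes all keys look exactly like the examples above):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')  # placeholder bucket name

for obj in bucket.objects.all():
    if 'INGESTIONDATE=' not in obj.key:
        continue
    # build the new key: rename the partition prefix and swap the extension
    new_key = obj.key.replace('INGESTIONDATE=', 'ingest_date=')
    if new_key.endswith('.csv'):
        new_key = new_key[:-len('.csv')] + '.json'
    # S3 has no rename: copy to the new key, then delete the original object
    bucket.Object(new_key).copy_from(CopySource={'Bucket': bucket.name, 'Key': obj.key})
    obj.delete()

Note this only renames the keys; the contents are still CSV until you do the actual conversion.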

Snakemake - Rule that downloads data using sftp

I want to download files from a password-protected SFTP server in a Snakemake rule. I have seen the answer of Maarten-vd-Sande on specifying it using wildcards. Is it also possible using inputs, without running into the MissingInputException?
FILES = ['file1.txt',
         'file2.txt']

# remote file retrieval
rule download_file:
    # replacing input by output would download all files in one job?
    input:
        file = expand("{file}", file=FILES)
    shell:
        # this assumes your runtime has the SSHPASS env variable set
        "sshpass -e sftp -B 258048 server<< get {input.file} data/{input.file}; exit"
I have seen the hint on the SFTP class in snakemake, but I am unsure how to use it in this context.
Thanks in advance!
I haven't tested this, but I am guessing something like this should work! We say that all the output we want is in rule all. Then we have the download rule to download those. I have no experience with using snakemake.remote, so I might be completely wrong in this though.
from snakemake.remote.SFTP import RemoteProvider
SFTP = RemoteProvider()

FILES = ['file1.txt',
         'file2.txt']

rule all:
    input:
        FILES

rule download_file:
    input:
        SFTP.remote("{filename}.txt")
    output:
        "{filename}.txt"
    # shell:  # I am not sure if the shell keyword is required; if not, you can remove these two lines.
    #     ":"  # The : does nothing, just for the sake of having something there
So I ended up using the following. The trick was how to pass the command to sftp using <<< "command". The envvars directive lets snakemake check that SSHPASS is set, so that sshpass can pick it up.
envvars:
    "SSHPASS"

# remote file retrieval
# Idea: replace using the SFTP class
rule download_file:
    output:
        raw = temp(os.path.join(config['DATADIR'], "{file}", "{file}.txt"))
    params:
        file = "{file}.txt"
    resources:
        walltime="300", nodes=1, mem_mb=2048
    threads:
        1
    shell:
        "sshpass -e sftp -B 258048 server <<< \"get {params.file} {output.raw}\""

Pig Script: STORE command not working

This is my first time posting to StackOverflow and I'm hoping someone can assist. I'm fairly new to Pig scripts and have encountered a problem I can't solve.
Below is a pig script that fails when I attempt to write results to a file:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage('$delimiter') AS ($fields);
B = FILTER A by ($field_nm) IS NOT NULL;
C = FOREACH B GENERATE ($field_nm) as fld;
D = GROUP C ALL;
E = FOREACH D GENERATE myfuncs.theResult(C.fld);
--DUMP E;
STORE E INTO 'myoutput/theResult';
EXEC;
I see the results of E when I Dump to the screen. However, I need to store the results temporarily in a file. After the Store command, the error I receive is: Output Location Validation Failed.
I've tried numerous workarounds, like removing the theResult folder and removing the earlier contents of theResult, but none of the commands I use work. These have been along the lines of:
hdfs dfs -rm myoutput/theResult
and
hadoop fs -rm myoutput/theResult
...using both the shell (hs) and file system (fs) commands. I've tried to call another function (shell script, python function, etc.) to clear the earlier results stored in the myoutput/theResult folder. I've read every website I can find and nothing is working. Any ideas??
The output location of a MapReduce job is a directory. So you should try it this way:
hadoop fs -rmr myoutput/theResult
and then run the pig script. It will work.
"rmr" - remove recursive, which deletes both folders and files
"rm" - just remove, which removes only files
Every time, you need to either change the output path or delete and reuse the same one, since HDFS is WORM (write once, read many) storage.
Couple of things you can try -
make sure the output directory is a valid path;
remove the entire directory and not just the content within it. Remove the directory with 'rmr' and check that the path doesn't exist before running the pig script.
Thanks for both of your replies. I now have a solution that is working:
fs -mkdir -p myoutput/theResult
fs -rm -r myoutput/theResult
The first line attempts to create a directory, but the "-p" prevents an error if it already exists. Then the second line removes it. Either way, there will be a directory to remove, so no error!
The output of STORE is confusing when you are using Pig for the first time.
store grp into '/output1';
This will create a folder named 'output1' under the HDFS root. The folder must not already exist.
You can give your own hdfs path here like /user/thewhitetulip.
hdfs dfs -ls /output1
output:
/output1/_SUCCESS
/output1/part-r-00000
The part-r-00000 file is the output of the store program.

How To Delete S3 Files Starting With

Let's say I have images of different sizes on S3:
137ff24f-02c9-4656-9d77-5e761d76a273.webp
137ff24f-02c9-4656-9d77-5e761d76a273_500_300.webp
137ff24f-02c9-4656-9d77-5e761d76a273_400_280.webp
I am using boto to delete a single file:
bucket = get_s3_bucket()
s3_key = Key(bucket)
s3_key.key = '137ff24f-02c9-4656-9d77-5e761d76a273.webp'
bucket.delete_key(s3_key)
But I would like to delete all keys starting with 137ff24f-02c9-4656-9d77-5e761d76a273.
Keep in mind there might be hundreds of files in the bucket, so I don't want to iterate over all files. Is there a way to delete only files starting with a certain string?
Maybe some regex delete function.
The S3 service does support a multi-delete operation allowing you to delete up to 1000 objects in a single API call. However, this API call doesn't provide support for server-side filtering of the keys. You have to provide the list of keys you want to delete.
You could roll your own. First, you would want to get a list of all the keys you want to delete.
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
to_delete = list(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))
The list call returns a generator, but I'm converting it to a list using list(), so the to_delete variable now points to a list of all of the objects in the bucket that match the prefix I have provided.
Now, we need to create chunks of up to 1000 objects from the big list and use the chunk to call the delete_keys method of the bucket object.
for chunk in [to_delete[i:i+1000] for i in range(0, len(to_delete), 1000)]:
    result = bucket.delete_keys(chunk)
    if result.errors:
        print('The following errors occurred')
        for error in result.errors:
            print(error)
There are more efficient ways to do this (e.g. without converting the bucket generator into a list), and you probably want to do something different when handling the errors, but this should give you a start.
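For example, one way to avoid materialising the whole listing is to slice the iterator directly. An untested sketch, using the same bucket and prefix as above:

import boto
from itertools import islice

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
keys = iter(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))

while True:
    chunk = list(islice(keys, 1000))  # the multi-delete API accepts at most 1000 keys per call
    if not chunk:
        break
    result = bucket.delete_keys(chunk)
    for error in result.errors:
        print(error)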
You can do it using the aws cli (https://aws.amazon.com/cli/) and some unix commands.
This aws cli command should work:
aws s3 rm s3://<your_bucket_name> --exclude "*" --include "*137ff24f-02c9-4656-9d77-5e761d76a273*"
If you want to include sub-folders you should add the flag --recursive
or with unix commands:
aws s3 ls s3://<your_bucket_name>/ | awk '{print $4}' | xargs -I% <your_os_shell> -c 'aws s3 rm s3://<your_bucket_name>/%'
explanation:
list all files in the bucket --pipe-->
get the 4th parameter (it's the file name) --pipe-->
run the aws cli delete command for each file
Yes, try using s3cmd, a command line tool for S3. First get the list of all files in the bucket:
import shlex
import subprocess

cmd = 's3cmd ls s3://bucket_name'
args = shlex.split(cmd)
ls_lines = subprocess.check_output(args).splitlines()
Then find all lines that start with your desired string (using a regex, which should be simple). Then delete all of them using the command:
s3cmd del s3://bucket_name/file_name(s)
Or if you just wanna use a single command:
s3cmd del s3://bucket_name/string*
I mentioned the first method so that you can test the names of the files you are deleting and don't accidentally delete anything else.
For boto3 the following snippet removes all files starting with a particular prefix:
import boto3

botoSession = boto3.Session(
    aws_access_key_id = <your access key>,
    aws_secret_access_key = <your secret key>,
    region_name = <your region>,
)
s3 = botoSession.resource('s3')
bucket = s3.Bucket(bucketname)
objects = bucket.objects.filter(Prefix=<your prefix>)
objects.delete()
While there's no direct boto method to do what you want, you should be able to do it efficiently by using get_all_keys, filtering them with the said regex, and then calling delete_keys.
Doing it this way will use only two requests, and doing the regex filtering client-side should be pretty fast.
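A rough sketch of that approach (untested; note that get_all_keys returns at most 1000 keys per request, so a larger bucket would need pagination via the marker parameter):

import re
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')  # placeholder bucket name

pattern = re.compile(r'^137ff24f-02c9-4656-9d77-5e761d76a273')
matching = [key for key in bucket.get_all_keys() if pattern.match(key.name)]
if matching:
    bucket.delete_keys(matching)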
