How to change filepath names in Amazon S3 bucket? - python

Hi all.
I have around 900 objects in the following structure:
s3/bucketname/INGESTIONDATE=2022-02-20/file.csv
s3/bucketname/INGESTIONDATE=2022-02-21/file.csv
s3/bucketname/INGESTIONDATE=2022-02-22/file.csv
etc..
I need to change the file path to:
s3/bucketname/ingest_date=2022-02-20/file.json
s3/bucketname/ingest_date=2022-02-21/file.json
s3/bucketname/ingest_date=2022-02-22/file.json
This is around 900 objects so I do not plan on doing it by hand through the console.
Also, I am not too bothered about the JSON conversion, I can do that. It is mainly changing the filepaths and copying to a new bucket.
Any ideas?

You can do something like this using the AWS CLI:
aws s3 ls s3://bucketname/ --recursive | awk '{print $4}' | while read f; do newf=${f//INGESTIONDATE/ingest_date} && aws s3 mv s3://bucketname/${f} s3://bucketname/${newf%???}json ; done
aws s3 ls s3://bucketname/ --recursive - lists all object keys in the bucket
newf=${f//INGESTIONDATE/ingest_date} - replaces INGESTIONDATE with ingest_date and stores the result in newf
We iterate through the object keys as f with their corresponding newf
and move each object, changing the suffix from csv to json (${newf%???} strips the last 3 characters of newf), using the aws s3 mv command
For testing, you can run aws s3 mv with the --dryrun flag to see the mv commands that would be run.
As pointed out by @Konrad, this assumes that there are no whitespace/newline characters in the key names (which would make awk return a truncated path).
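If you would rather do it from Python, here is a minimal boto3 sketch of the same idea (bucket name taken from the question; a different destination bucket could be passed to copy_from). S3 has no rename, so each object is copied to the new key and the original is deleted:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')  # bucket name from the question

for obj in bucket.objects.filter(Prefix='INGESTIONDATE='):
    new_key = obj.key.replace('INGESTIONDATE=', 'ingest_date=', 1)
    if new_key.endswith('.csv'):
        new_key = new_key[:-4] + '.json'  # the CSV-to-JSON content conversion itself is separate
    # copy to the new key, then delete the original (S3 has no rename)
    bucket.Object(new_key).copy_from(CopySource={'Bucket': bucket.name, 'Key': obj.key})
    obj.delete()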

snakemake: pass input that does not exist (or pass multiple params)

I am trying and struggling mightily to write a snakemake pipeline to download files from an aws s3 instance.
Because the organization and naming of my files on s3 is inconsistent, I do not want to use snakemake's remote options. Instead, I use a mix of grep and python to enumerate the paths I want on s3, and put them in a text file:
#s3paths.txt
s3://path/to/sample1.bam
s3://path/to/sample2.bam
In my config file I specify the samples I want to work with:
#config.yaml
samplesToDownload: [sample1, sample3, sample18]
I want to make a pipeline where the first rule downloads the files from s3 whose paths contain a string present in config['samplesToDownload']. A runtime code snippet does this for me:
pathsToDownload = [path for path in open('s3paths.txt').read().splitlines()
                   if any(sample in path for sample in config['samplesToDownload'])]
All this works fine, and I am left with a global variable pathsToDownload that looks something like this:
pathsToDownload: ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
Now I try to get snakemake involved and struggle. If I try to put the python variable in inputs, snakemake refuses because the file does not exist locally:
rule download_bams_from_s3:
    input:
        s3Path = pathsToDownload
    output:
        expand('where/I/want/file/{sample}.bam', sample=config['samplesToDownload'])
    shell:
        'aws s3 cp {input.s3Path} where/I/want/file/{sample}.bam'
This fails because input.s3Path cannot be found as it is a path on s3, not a local path. I then try to do the same but with the pathsToDownload as a param:
rule download_bams_from_s3:
    params:
        s3Path = pathsToDownload
    output:
        expand('where/I/want/file/{sample}.bam', sample=config['samplesToDownload'])
    shell:
        'aws s3 cp {params.s3Path} where/I/want/file/{sample}.bam'
This doesn't produce an error, but it produces the wrong type of shell command. Instead of producing what I want, which is 3 total shell commands:
shell: aws s3 cp path/to/sample1 where/I/want/file/sample1.bam
shell: aws s3 cp path/to/sample3 where/I/want/file/sample3.bam
shell: aws s3 cp path/to/sample18 where/I/want/file/sample18.bam
it produces one shell command with all three paths:
shell: aws s3 cp path/to/sample1 path/to/sample3 path/to/sample18 where/I/want/file/sample1.bam where/I/want/file/sample3.bam where/I/want/file/sample18.bam
Even if I were able to properly construct one massive shell command, it is not what I want: I want separate shell commands to take advantage of snakemake's parallelization and its ability to not redownload the same file if it already exists.
I feel like this use case for snakemake is not a big ask but I have spent hours trying to construct something workable to no avail. A clean solution is much appreciated!
You could create a dictionary that maps samples to aws paths and use that dictionary to download files one by one. Like:
samplesToDownload = ['sample1', 'sample3', 'sample18']
pathsToDownload = ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
samplesToPaths = dict(zip(samplesToDownload, pathsToDownload))

rule all:
    input:
        expand('where/I/want/file/{sample}.bam', sample=samplesToDownload),

rule download_bams_from_s3:
    params:
        s3Path= lambda wc: samplesToPaths[wc.sample],
    output:
        bam='where/I/want/file/{sample}.bam',
    shell:
        r"""
        aws s3 cp {params.s3Path} {output.bam}
        """

How to move Header and Trailer from files to another file?

I have around 100 text files with close to a thousand records each in a folder. I want to copy the header and trailer of each of these files into a new file, together with the name of the respective file.
So the output I want is:
File_Name,Header,Trailer
Is this possible using Unix or Python?
One way to do it is with the bash shell, run in the folder containing the files:
for file in *; do echo "$file,$(head -1 "$file"),$(tail -1 "$file")"; done
A PowerShell Core one-liner with aliases:
gci *.txt |%{"{0},{1},{2}" -f $_.FullName,(gc $_ -Head 1),(gc $_ -Tail 1)}|set-content .\newfile.txt
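And since the question also mentions Python, a minimal sketch doing the same thing (the output file name newfile.txt is just an example):
import glob

with open('newfile.txt', 'w') as out:
    for name in sorted(glob.glob('*.txt')):
        with open(name) as fh:
            lines = fh.read().splitlines()
        if lines:
            # File_Name,Header,Trailer
            out.write('{},{},{}\n'.format(name, lines[0], lines[-1]))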

Pig Script: STORE command not working

This is my first time posting to Stack Overflow and I'm hoping someone can assist. I'm fairly new to Pig scripts and have encountered a problem I can't solve.
Below is a pig script that fails when I attempt to write results to a file:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage('$delimiter') AS ($fields);
B = FILTER A by ($field_nm) IS NOT NULL;
C = FOREACH B GENERATE ($field_nm) as fld;
D = GROUP C ALL;
E = FOREACH D GENERATE myfuncs.theResult(C.fld);
--DUMP E;
STORE E INTO 'myoutput/theResult';
EXEC;
I see the results of E when I Dump to the screen. However, I need to store the results temporarily in a file. After the Store command, the error I receive is: Output Location Validation Failed.
I've tried numerous workarounds, like removing the theResult folder and removing the earlier contents of theResult, but none of the commands I use work. These have been along the lines of:
hdfs dfs -rm myoutput/theResult
and
hadoop fs -rm myoutput/theResult
...using both the shell (hs) and file system (fs) commands. I've tried to call another function (shell script, python function, etc.) to clear the earlier results stored in the myoutput/theResult folder. I've read every website I can find and nothing is working. Any ideas??
The output location of a MapReduce job is a directory, so you need to remove it recursively:
hadoop fs -rmr myoutput/theResult
and then run the pig script. It will work.
"rmr" - remove recursive, which deletes both folders and files
"rm" - just remove, which removes only files
Every time, you need to either change the output path, or delete it and reuse the same one, since HDFS is WORM (write once, read many) model storage.
A couple of things you can try:
Make sure the output directory is a valid path.
Remove the entire directory and not just the content within it. Remove the directory with rmr and check that the path doesn't exist before running the pig script.
Thanks for both of your replies. I now have a solution that is working:
fs -mkdir -p myoutput/theResult
fs -rm -r myoutput/theResult
The first line attempts to create a directory, but the "-p" prevents an error if it already exists. Then the second line removes it. Either way, there will be a directory to remove, so no error!
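If you would rather clear the old output from outside the Pig script, a hypothetical Python wrapper along these lines could do it before invoking Pig (the script name myscript.pig is made up):
import subprocess

output_dir = 'myoutput/theResult'

# -r removes the directory recursively, -f suppresses the error if it does not exist
subprocess.call(['hadoop', 'fs', '-rm', '-r', '-f', output_dir])
subprocess.check_call(['pig', '-f', 'myscript.pig'])  # hypothetical script name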
The output of STORE is confusing when you are using Pig for the first time.
store grp into '/output1';
This will create a folder named 'output1' under the root directory. The folder must not already exist.
You can give your own hdfs path here like /user/thewhitetulip.
hdfs dfs -ls /output1
output:
/output1/_SUCCESS
/output1/part-r-00000
The part-r-00000 file is the output of the store program.

How To Delete S3 Files Starting With

Let's say I have images of different sizes on S3:
137ff24f-02c9-4656-9d77-5e761d76a273.webp
137ff24f-02c9-4656-9d77-5e761d76a273_500_300.webp
137ff24f-02c9-4656-9d77-5e761d76a273_400_280.webp
I am using boto to delete a single file:
from boto.s3.key import Key

bucket = get_s3_bucket()
s3_key = Key(bucket)
s3_key.key = '137ff24f-02c9-4656-9d77-5e761d76a273.webp'
bucket.delete_key(s3_key)
But I would like to delete all keys starting with 137ff24f-02c9-4656-9d77-5e761d76a273.
Keep in mind there might be hundreds of files in the bucket, so I don't want to iterate over all files. Is there a way to delete only the files starting with a certain string?
Maybe some regex delete function.
The S3 service does support a multi-delete operation allowing you to delete up to 1000 objects in a single API call. However, this API call doesn't provide support for server-side filtering of the keys. You have to provide the list of keys you want to delete.
You could roll your own. First, you would want to get a list of all the keys you want to delete.
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
to_delete = list(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))
The list call returns a generator, but I'm converting it to a list with list, so the to_delete variable now points to a list of all of the objects in the bucket that match the prefix I have provided.
Now, we need to create chunks of up to 1000 objects from the big list and use the chunk to call the delete_keys method of the bucket object.
for chunk in [to_delete[i:i+1000] for i in range(0, len(to_delete), 1000)]:
result = bucket.delete_keys(chunk)
if result.errors:
print('The following errors occurred')
for error in result.errors:
print(error)
There are more efficient ways to do this (e.g. without converting the bucket generator into a list) and you probably want to do something different when handling the errors but this should give you a start.
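For example, a rough sketch of the same loop that consumes the listing lazily instead of materializing it as one big list (still boto 2, same bucket object and prefix as above):
from itertools import islice

keys = iter(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))
while True:
    chunk = list(islice(keys, 1000))  # take up to 1000 keys at a time
    if not chunk:
        break
    result = bucket.delete_keys(chunk)
    for error in result.errors:
        print(error)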
You can do it using the aws cli (https://aws.amazon.com/cli/) and some unix commands.
This aws cli command should work:
aws s3 rm s3://<your_bucket_name> --recursive --exclude "*" --include "*137ff24f-02c9-4656-9d77-5e761d76a273*"
The --recursive flag makes the --exclude/--include filters apply across the whole bucket, including sub-folders.
Or with unix commands:
aws s3 ls s3://<your_bucket_name>/ | awk '{print $4}' | xargs -I% <your_os_shell> -c 'aws s3 rm s3://<your_bucket_name>/%'
Explanation:
list all files in the bucket --pipe-->
get the 4th field (it's the file name) --pipe-->
run the delete command with the aws cli
Yes. Try using s3cmd, a command line tool for S3. First get the list of all files in the bucket.
import shlex
import subprocess

cmd = 's3cmd ls s3://bucket_name'
args = shlex.split(cmd)
ls_lines = subprocess.check_output(args).splitlines()
Then find all lines that start with your desired string (using a regex, which should be simple), and delete all of them using the command:
s3cmd del s3://bucket_name/file_name(s)
Or if you just want to use a single command:
s3cmd del s3://bucket_name/string*
I mentioned the first method so that you can check the names of the files you are deleting and don't accidentally delete anything else.
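A rough sketch of that filter-and-delete step in Python (bucket_name and the prefix are just examples; each matching key is removed with its own s3cmd del call):
import re
import shlex
import subprocess

prefix = '137ff24f-02c9-4656-9d77-5e761d76a273'  # example prefix
pattern = re.compile(r'^s3://bucket_name/' + re.escape(prefix))

ls_output = subprocess.check_output(shlex.split('s3cmd ls s3://bucket_name')).decode()
for line in ls_output.splitlines():
    if not line.strip():
        continue
    url = line.split()[-1]  # s3cmd ls prints the object URL as the last field
    if pattern.match(url):
        subprocess.check_call(['s3cmd', 'del', url])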
For boto3 the following snippet removes all files starting with a particular prefix:
import boto3
botoSession = boto3.Session(
aws_access_key_id = <your access key>,
aws_secret_access_key = <your secret key>,
region_name = <your region>,
)
s3 = botoSession.resource('s3')
bucket = s3.Bucket(bucketname)
objects = bucket.objects.filter(Prefix=<your prefix>)
objects.delete()
While there's no direct boto method to do what you want, you should be able to do it efficiently by using get_all_keys, filtering them with the said regex, and then calling delete_keys.
Doing it this way will use only two requests, and doing the regex client-side should be pretty fast.
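A rough sketch of that two-request approach (boto 2; the bucket name and pattern are examples, and it assumes the matching keys fit in one get_all_keys page of up to 1000 keys):
import re
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')  # example bucket name

pattern = re.compile(r'^137ff24f-02c9-4656-9d77-5e761d76a273')

keys = bucket.get_all_keys()                          # request 1: list the keys
matches = [k for k in keys if pattern.match(k.name)]  # client-side regex filter
if matches:
    bucket.delete_keys(matches)                       # request 2: multi-delete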

How to create folders using file names and then move files into folders?

I have hundreds of text files in a folder named using this kind of naming convention:
Bandname1 - song1.txt
Bandname1 - song2.txt
Bandname2 - song1.txt
Bandname2 - song2.txt
Bandname2 - song3.txt
Bandname3 - song1.txt
..etc.
I would like to create folders for different bands and move according text files into these folders. How could I achieve this using bash, perl or python script?
It's not necessary to use trim or xargs:
for f in *.txt; do
band=${f% - *}
mkdir -p "$band"
mv "$f" "$band"
done
with Perl:
use File::Copy qw(move);

while (my $file = <*.txt>) {
    my ($band, $others) = split /\s+-\s+/, $file;
    mkdir $band;
    move($file, $band);
}
gregseth's answer will work, just replace trim with xargs. You could also eliminate the if test by just using mkdir -p, for example:
for f in *.txt; do
band=$(echo "$f" | cut -d'-' -f1 | xargs)
mkdir -p "$band"
mv "$f" "$band"
done
Strictly speaking the trim or xargs shouldn't even be necessary, but xargs will at least remove any extra formatting, so it doesn't hurt.
You asked for a specific script, but if this is for organizing your music, you might want to check out EasyTAG. It has extremely specific and powerful rules that you can customize to organize your music however you want:
(screenshot of an EasyTAG scanner rule; source: sourceforge.net)
This rule says: assume my file names follow the structure "[artist] - [album title]/[track number] - [title]". Then you can tag them as such, or move the files around to any new pattern, or do pretty much anything else.
How about this:
for f in *.txt
do
  band=$(echo "$f" | cut -d'-' -f1 | trim)
  if [ ! -d "$band" ]
  then
    mkdir "$band"
  fi
  mv "$f" "$band"
done
This Python program assumes that the source files are in data and that the new directory structure should be in target (and that it already exists).
The key point is that os.path.walk will traverse the data directory structure and call myVisitor for each file.
import os
import os.path

sourceDir = "data"
targetDir = "target"

# Visitor called by os.path.walk for each directory it visits
# (os.path.walk is Python 2 only; use os.walk in Python 3)
def myVisitor(arg, dirname, names):
    for file in names:
        bandDir = file.split("-")[0].strip()  # strip the space left over from " - "
        newDir = os.path.join(targetDir, bandDir)
        if not os.path.exists(newDir):
            os.mkdir(newDir)
        newName = os.path.join(newDir, file)
        oldName = os.path.join(dirname, file)
        os.rename(oldName, newName)

os.path.walk(sourceDir, myVisitor, None)
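For Python 3, where os.path.walk no longer exists, a minimal sketch of the same idea with pathlib, run in the folder containing the files:
from pathlib import Path

for txt in Path('.').glob('*.txt'):
    band = txt.name.split(' - ')[0]   # 'Bandname1 - song1.txt' -> 'Bandname1'
    band_dir = Path(band)
    band_dir.mkdir(exist_ok=True)     # create the band folder if it is missing
    txt.rename(band_dir / txt.name)   # move the file into it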
ls |perl -lne'$f=$_; s/(.+?) - [^-]*\.txt/$1/; mkdir unless -d; rename $f, "$_/$f"'
