How to delete S3 files starting with a given string - Python

Let's say I have images of different sizes on S3:
137ff24f-02c9-4656-9d77-5e761d76a273.webp
137ff24f-02c9-4656-9d77-5e761d76a273_500_300.webp
137ff24f-02c9-4656-9d77-5e761d76a273_400_280.webp
I am using boto to delete a single file:
bucket = get_s3_bucket()
s3_key = Key(bucket)
s3_key.key = '137ff24f-02c9-4656-9d77-5e761d76a273.webp'
bucket.delete_key(s3_key)
But I would like to delete all keys starting with 137ff24f-02c9-4656-9d77-5e761d76a273.
Keep in mind there might be hundreds of files in the bucket, so I don't want to iterate over all of them. Is there a way to delete only the files starting with a certain string?
Maybe some regex delete function.

The S3 service does support a multi-delete operation allowing you to delete up to 1000 objects in a single API call. However, this API call doesn't provide support for server-side filtering of the keys. You have to provide the list of keys you want to delete.
You could roll your own. First, you would want to get a list of all the keys you want to delete.
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
to_delete = list(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))
The list call returns a generator, but I'm converting it to a list with list(), so the to_delete variable now points to a list of all of the objects in the bucket that match the prefix I have provided.
Now we need to create chunks of up to 1000 objects from that big list and pass each chunk to the delete_keys method of the bucket object.
for chunk in [to_delete[i:i+1000] for i in range(0, len(to_delete), 1000)]:
    result = bucket.delete_keys(chunk)
    if result.errors:
        print('The following errors occurred')
        for error in result.errors:
            print(error)
There are more efficient ways to do this (e.g. without converting the bucket generator into a list), and you probably want to do something different when handling the errors, but this should give you a start.
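For reference, here is a rough sketch of the same chunked delete that avoids converting the listing to a list first (this assumes the same boto bucket object as above):
from itertools import islice

# Consume the bucket listing lazily and delete in batches of up to 1000 keys,
# without building the full list in memory first.
keys_iter = iter(bucket.list(prefix='137ff24f-02c9-4656-9d77-5e761d76a273'))
while True:
    chunk = list(islice(keys_iter, 1000))
    if not chunk:
        break
    result = bucket.delete_keys(chunk)
    for error in result.errors:
        print(error)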

You can do it using the AWS CLI (https://aws.amazon.com/cli/) and some Unix commands.
This AWS CLI command should work:
aws s3 rm s3://<your_bucket_name> --exclude "*" --include "*137ff24f-02c9-4656-9d77-5e761d76a273*"
If you want to include sub-folders, add the --recursive flag.
Or with Unix commands:
aws s3 ls s3://<your_bucket_name>/ | awk '{print $4}' | xargs -I% <your_os_shell> -c 'aws s3 rm s3://<your_bucket_name>/%'
Explanation:
list all files in the bucket --pipe-->
get the 4th field (it's the file name) --pipe-->
run the delete command with the AWS CLI

Yes. Try using s3cmd, a command line tool for S3. First, get the list of all files in the bucket.
import shlex
import subprocess

cmd = 's3cmd ls s3://bucket_name'
args = shlex.split(cmd)
ls_lines = subprocess.check_output(args).splitlines()
Then find all the lines that start with your desired string (using a regex; it should be simple). Then delete all of them using the command:
s3cmd del s3://bucket_name/file_name(s)
Or, if you just want to use a single command:
s3cmd del s3://bucket_name/string*
I mentioned the first method so that you can test the names of the files you are deleting and don't accidentally delete anything else.
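A rough sketch of that first, safer method in Python (this assumes s3cmd is installed and configured; the bucket name and prefix below are placeholders):
import re
import shlex
import subprocess

prefix = '137ff24f-02c9-4656-9d77-5e761d76a273'  # placeholder prefix to match

# s3cmd ls prints one "date time size s3://bucket/key" line per object;
# the URI is the last whitespace-separated field.
ls_lines = subprocess.check_output(shlex.split('s3cmd ls s3://bucket_name')).decode().splitlines()
uris = [line.split()[-1] for line in ls_lines if line.strip()]

# Keep only the URIs whose file name starts with the prefix.
pattern = re.compile(re.escape(prefix))
matches = [uri for uri in uris if pattern.match(uri.rsplit('/', 1)[-1])]

# Review matches first, then delete each object.
for uri in matches:
    subprocess.check_call(['s3cmd', 'del', uri])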

For boto3 the following snippet removes all files starting with a particular prefix:
import boto3
botoSession = boto3.Session(
    aws_access_key_id = <your access key>,
    aws_secret_access_key = <your secret key>,
    region_name = <your region>,
)
s3 = botoSession.resource('s3')
bucket = s3.Bucket(bucketname)
objects = bucket.objects.filter(Prefix=<your prefix>)
objects.delete()

While there's no direct boto method to do what you want, you should be able to do it efficiently by using get_all_keys, filtering them with the regex you mentioned, and then calling delete_keys.
Doing it this way will use only two requests, and doing the regex client-side should be pretty fast.
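A minimal sketch of that approach with boto (the bucket name is a placeholder; note that get_all_keys returns at most 1000 keys per call):
import re
import boto

bucket = boto.connect_s3().get_bucket('mybucket')

# One request to list the keys, then filter client-side with a regex.
pattern = re.compile(r'^137ff24f-02c9-4656-9d77-5e761d76a273')
matching = [key for key in bucket.get_all_keys() if pattern.match(key.name)]

# One more request to delete everything that matched.
result = bucket.delete_keys(matching)
for error in result.errors:
    print(error)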

Related

Get a list of files in S3 using PySpark in Databricks

I'm trying to generate a list of all S3 files in a bucket/folder. There are usually on the order of millions of files in the folder. I use boto right now, and it's able to retrieve around 33k files per minute, which even for a million files takes half an hour. I also load these files into a dataframe, but I generate and use this list as a way to track which files are being processed.
What I've noticed is that when I ask Spark to read all files in the folder, it does a listing of its own and is able to list them out much faster than the boto call can, and then process those files. I looked up a way to do this in PySpark, but found no good examples. The closest I got was some Java and Scala code to list out the files using the HDFS library.
Is there a way we can do this in Python and Spark? For reference, I'm trying to replicate the following code snippet:
import boto3
from datetime import datetime

def get_s3_files(source_directory, file_type="json"):
    s3_resource = boto3.resource("s3")
    file_prepend_path = f"/{'/'.join(source_directory.parts[1:4])}"
    bucket_name = str(source_directory.parts[3])
    prefix = "/".join(source_directory.parts[4:])
    bucket = s3_resource.Bucket(bucket_name)
    s3_source_files = []
    for object in bucket.objects.filter(Prefix=prefix):
        if object.key.endswith(f".{file_type}"):
            s3_source_files.append(
                (
                    f"{file_prepend_path}/{object.key}",
                    object.size,
                    str(source_directory),
                    str(datetime.now()),
                )
            )
    return s3_source_files
This can be achieved very simply with dbutils.
def get_dir_content(ls_path):
    dir_paths = dbutils.fs.ls(ls_path)
    subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
    flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
    return list(map(lambda p: p.path, dir_paths)) + flat_subdir_paths

paths = get_dir_content('s3 location')
[print(p) for p in paths]
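If you also need the extension filter from the original snippet, you can apply it to the returned paths afterwards, for example (assuming file_type is "json"):
json_paths = [p for p in paths if p.endswith('.json')]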
For some reason, using the AWS CLI command was roughly 15 times(!) faster than using boto. Not sure exactly why this is the case, but here's the code I am currently using, in case someone might find it handy. Basically, use s3api to list the objects, and then use jq to manipulate the output and get it into a form of my liking.
import subprocess

def get_s3_files(source_directory, schema, file_type="json"):
    file_prepend_path = f"/{'/'.join(source_directory.parts[1:4])}"
    bucket = str(source_directory.parts[3])
    prefix = "/".join(source_directory.parts[4:])
    s3_list_cmd = f"aws s3api list-objects-v2 --bucket {bucket} --prefix {prefix} | jq -r '.Contents[] | select(.Key | endswith(\".{file_type}\")) | [\"{file_prepend_path}/\"+.Key, .Size, \"{source_directory}\", (now | strftime(\"%Y-%m-%d %H:%M:%S.%s\"))] | @csv'"
    s3_list = subprocess.check_output(s3_list_cmd, shell=True, universal_newlines=True)
    with open("s3_file_paths.csv", "w") as f:
        f.truncate()
        f.write(s3_list)
    s3_source_files_df = spark.read.option("header", False).schema(schema).csv("s3_file_paths.csv")
    return s3_source_files_df

snakemake: pass input that does not exist (or pass multiple params)

I am trying, and struggling mightily, to write a snakemake pipeline to download files from an AWS S3 bucket.
Because the organization and naming of my files on S3 is inconsistent, I do not want to use snakemake's remote options. Instead, I use a mix of grep and python to enumerate the paths I want on S3 and put them in a text file:
#s3paths.txt
s3://path/to/sample1.bam
s3://path/to/sample2.bam
In my config file I specify the samples I want to work with:
#config.yaml
samplesToDownload: [sample1, sample3, sample18]
I want to make a pipeline where the first rule downloads the files from S3 whose paths contain a string present in config['samplesToDownload']. A runtime code snippet does this for me:
pathsToDownload = [path for path in open('s3paths.txt').read().splitlines()
                   if any(sample in path for sample in config['samplesToDownload'])]
All this works fine, and I am left with a global variable pathsToDownload that looks something like this:
pathsToDownload: ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
Now I try to get snakemake involved and struggle. If I try to put the python variable in inputs, snakemake refuses because the file does not exist locally:
rule download_bams_from_s3:
    input:
        s3Path = pathsToDownload
    output:
        expand('where/I/want/file/{sample}.bam', sample=config['samplesToDownload'])
    shell:
        'aws s3 cp {input.s3Path} where/I/want/file/{sample}.bam'
This fails because input.s3Path cannot be found as it is a path on s3, not a local path. I then try to do the same but with the pathsToDownload as a param:
rule download_bams_from_s3:
    params:
        s3Path = pathsToDownload
    output:
        expand('where/I/want/file/{sample}.bam', sample=config['samplesToDownload'])
    shell:
        'aws s3 cp {params.s3Path} where/I/want/file/{sample}.bam'
This doesn't produce an error, but it produces the wrong type of shell command. Instead of producing what I want, which is 3 total shell commands:
shell: aws s3 cp path/to/sample1 where/I/want/file/sample1.bam
shell: aws s3 cp path/to/sample3 where/I/want/file/sample3.bam
shell: aws s3 cp path/to/sample18 where/I/want/file/sample18.bam
it produces one shell command with all three paths:
shell: aws s3 cp path/to/sample1 path/to/sample3 path/to/sample18 where/I/want/file/sample1.bam where/I/want/file/sample3.bam where/I/want/file/sample18.bam
Even if I were able to properly construct one massive shell command it is not what I want because I want separate shell commands to take advantage of snakemakes parallelization and ability to not redownload the same file if it already exists.
I feel like this use case for snakemake is not a big ask but I have spent hours trying to construct something workable to no avail. A clean solution is much appreciated!
You could create a dictionary that maps samples to aws paths and use that dictionary to download files one by one. Like:
samplesToDownload = ['sample1', 'sample3', 'sample18']
pathsToDownload = ['s3://path/to/sample1.bam', 's3://path/to/sample3.bam', 's3://path/to/sample18.bam']
samplesToPaths = dict(zip(samplesToDownload, pathsToDownload))
rule all:
    input:
        expand('where/I/want/file/{sample}.bam', sample= samplesToDownload),

rule download_bams_from_s3:
    params:
        s3Path= lambda wc: samplesToPaths[wc.sample],
    output:
        bam='where/I/want/file/{sample}.bam',
    shell:
        r"""
        aws s3 cp {params.s3Path} {output.bam}
        """

How to save files from s3 into current jupyter directory

I am working with Python and Jupyter Notebook, and would like to open files from an S3 bucket into my current Jupyter directory.
I have tried:
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use AWS Command Line Interface (CLI), specifically the aws s3 cp command to copy files to your local directory.
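If you'd rather stay in Python instead, a minimal boto3 sketch (the bucket name and local folder below are placeholders) could look like this:
import os
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')  # placeholder bucket name

os.makedirs('data', exist_ok=True)  # local folder next to the notebook

for obj in bucket.objects.all():
    if obj.key.endswith('/'):  # skip "folder" placeholder keys
        continue
    # Note: objects in different "folders" with the same file name will overwrite each other here.
    bucket.download_file(obj.key, os.path.join('data', os.path.basename(obj.key)))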
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter Notebooks on SageMaker.
I used a workaround by downloading the files to my repo, which works a lot faster than uploading them and makes my code reproducible for anyone with access to S3.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in Step 2.
import awswrangler as wr
objects = wr.s3.list_objects({"s3 URI"})
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the list of objects to download them into that folder in your Jupyter environment.
import boto3
import os
os.makedirs("./data")
s3_client = boto3.client('s3')
for obj in objects_list:
    s3_client.download_file({'bucket'}, #can also use obj[2]
                            {"object_path"}+obj[-1], #object_path is everything that comes after the / after the bucket in your S3 URI
                            './data/'+obj[-1])
That's it! First time answering anything on here, so I hope it's useful to someone.

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating all of the keys in the bucket and extracting whatever structure that you want to see (depending on your needs, a dict-of-dicts may be a good approach for you).
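For example, a small sketch that collects the top-level "folder" names from the key names themselves (the bucket name is a placeholder):
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('XXX')  # placeholder bucket name

# Everything before the first '/' in a key is its top-level "folder".
top_level = set()
for obj in bucket.objects.all():
    if '/' in obj.key:
        top_level.add(obj.key.split('/', 1)[0])

print(sorted(top_level))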
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

How to get objects from a folder in an S3 bucket

I am trying to traverse all objects inside a specific folder in my S3 bucket. The code I already have is like follows:
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
for obj in bucket.objects.filter(Prefix='folder/'):
    do_stuff(obj)
I need to use boto3.resource and not client. This code is not getting any objects at all although I have a bunch of text files in the folder. Can someone advise?
Try adding the Delimiter attribute: Delimiter = '/', since you are filtering objects. The rest of the code looks fine.
I had to make sure to skip the first file. For some reason it thinks the folder name is the first file and that may not be what you want.
for video_item in source_bucket.objects.filter(Prefix="my-folder-name/", Delimiter='/'):
    if video_item.key == 'my-folder-name/':
        continue
    do_something(video_item.key)
