What is the fastest way to empty an S3 bucket using boto3? - python

I was thinking about deleting and then re-creating the bucket (a bad option, as I realised later).
How can I delete all objects from the bucket instead?
I tried this: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.delete_objects
But it deletes multiple objects, not all of them.
Can you suggest the best way to empty a bucket?

Just use the AWS CLI:
aws s3 rm s3://mybucket --recursive
For a longer answer, if you insist on using boto3: the following deletes every object, with no folder handling required (on a versioned bucket it places delete markers rather than permanently removing the data). bucket.objects.all() creates an iterator that is not limited to 1,000 keys.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
# suggested by Jordon Philips
bucket.objects.all().delete()

If versioning is enabled, there's a call similar to the one in the other answer that deletes all object versions:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
bucket.object_versions.delete()

Based on the previous answers, and adding a check for whether versioning is enabled, you can empty a bucket whether it is versioned or not:
import boto3

s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket_name)
bucket_versioning = s3.BucketVersioning(bucket_name)
if bucket_versioning.status == 'Enabled':
    s3_bucket.object_versions.delete()
else:
    s3_bucket.objects.all().delete()

Related

Read content of a file located under subfolders of S3 in Python

I'm trying to read a file's content (not download it) from an S3 bucket. The problem is that the file is located under a multi-level folder. For instance, the full path could be s3://s3-bucket/folder-1/folder-2/my_file.json. How can I get that specific file instead of using my iterative approach that lists all objects?
Here is the code that I want to change:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucket')
for obj in bucket.objects.all():
    key = obj.key
    if key == 'folder-1/folder-2/my_file.json':
        return obj.get()['Body'].read()
Can it be done in a simpler, more direct way?
Yes - there is no need to enumerate the bucket.
Read the file directly using s3.Object, providing the bucket name as the first parameter and the object key as the second parameter.
"Folders" don't really exist in S3 - Amazon S3 doesn't use hierarchy to organize its objects and files. For the sake of organizational simplicity, the Amazon S3 console shows "folders" as a means of grouping objects but they are ultimately baked into your object key.
This should work:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object("s3-bucket", "folder-1/folder-2/my_file.json")
body = obj.get()['Body'].read()
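Since the object in the question is a JSON file, you can then parse the returned bytes directly. A minimal follow-up, assuming the body is valid UTF-8 JSON:
import json
data = json.loads(body)  # body is the bytes returned by obj.get()['Body'].read()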

How to save files from s3 into current jupyter directory

I am working with python and jupyter notebook, and would like to open files from an s3 bucket into my current jupyter directory.
I have tried:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
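For example (my-bucket and data/ are placeholders; substitute your own bucket and prefix):
aws s3 cp s3://my-bucket/data/ . --recursive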
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter Notebooks on SageMaker.
I used a workaround: downloading the files into my repo, which works a lot faster than uploading them and makes my code reproducible for anyone with access to the S3 bucket.
Step 1
Create a list of all the objects to be downloaded, then split each element on '/' so that the file name can be extracted for iteration in step 2:
import awswrangler as wr

objects = wr.s3.list_objects("s3://bucket-name/prefix/")  # replace with your S3 URI
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the list of objects and download each one into that folder:
import boto3
import os

os.makedirs("./data", exist_ok=True)
s3_client = boto3.client('s3')
for obj in objects_list:
    s3_client.download_file('bucket-name',             # replace with your bucket name; can also use obj[2]
                            'object_path/' + obj[-1],  # object_path is everything after the bucket in your S3 URI
                            './data/' + obj[-1])
That's it! This is the first time I've answered anything on here, so I hope it's useful to someone.

List AWS S3 folders with boto3

I have boto code that collects S3 sub-folders in the levelOne folder:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("MyBucket")
for level2 in bucket.list(prefix="levelOne/", delimiter="/"):
    print(level2.name)
How can I achieve the same in boto3? The code should not iterate through all S3 objects because the bucket has a very large number of objects.
If you are simply seeking a list of folders, then use CommonPrefixes returned when listing objects. Note that a Delimiter must be specified to obtain the CommonPrefixes:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='BUCKET-NAME', Delimiter='/')
for prefix in response['CommonPrefixes']:
    print(prefix['Prefix'][:-1])
If your bucket has a HUGE number of folders and objects, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
I think the following should be equivalent:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('MyBucket')
for obj in bucket.objects.filter(Prefix="levelOne/", Delimiter="/"):
    print(obj.key)

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
import boto3

s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating over all of the keys in the bucket and extracting whatever structure you want to see (depending on your needs, a dict of dicts may be a good approach).
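A minimal sketch of that dict-of-dicts idea, assuming a placeholder bucket name my-bucket: it walks every key and nests each '/'-separated component into a dictionary, so the top-level keys of the result are the top-level "folders".
import boto3

def build_tree(bucket_name):
    # Build a nested dict mirroring the "folder" structure implied by the object keys.
    s3 = boto3.resource('s3')
    tree = {}
    for obj in s3.Bucket(bucket_name).objects.all():
        node = tree
        for part in obj.key.split('/'):
            node = node.setdefault(part, {})
    return tree

tree = build_tree('my-bucket')
print(list(tree.keys()))  # top-level "folders" (and any top-level files)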
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

get s3 files with prefix using python

My S3 key is 'folder/filename.xml'. I want to get the files that end with 'name.xml'.
import boto3
s3 = boto3.resource('s3')
try:
    fileobj = s3.Object('lcu-matillion', 'folder/.*name.xml').get()['Body']
    data = fileobj.read()
except Exception:
    print('not found')
Can anyone help with the correct code?
Thanks
Don't forget that there could be multiple files that match that wildcard.
You would use something like:
import boto3
s3 = boto3.resource('s3', region_name='ap-southeast-2')
bucket = s3.Bucket('my-bucket')
objects = bucket.objects.filter(Prefix='folder-name/')
for obj in objects:
    if obj.key.endswith('name.xml'):
        # download via the bucket, using just the file name for the local path
        bucket.download_file(obj.key, '/tmp/' + obj.key.split('/')[-1])
This is a pretty old question, and I am at a loss that the accepted answer is a poor and potentially dangerous one.
It essentially lists ALL objects and brings the searching to the client side. On a bucket with thousands of objects (which I'd guess is most buckets), that is terrible.
What you need to do is to use .filter() instead of .all():
import boto3

s3 = boto3.resource('s3')
buc = s3.Bucket("twtalyser")
for s in buc.objects.filter(Prefix='my/desired/prefix'):
    print(s)
UPDATE
The main answer was updated to reflect the point I was making.
