List AWS S3 folders with boto3 - python

I have boto code that collects the S3 sub-folders of the levelOne folder:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("MyBucket")
for level2 in bucket.list(prefix="levelOne/", delimiter="/"):
    print(level2.name)
How can I get the same functionality in boto3? The code should not iterate through all S3 objects, because the bucket holds a very large number of them.

If you are simply seeking a list of folders, then use CommonPrefixes returned when listing objects. Note that a Delimiter must be specified to obtain the CommonPrefixes:
import boto3
s3_client = boto3.client('s3')
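response = s3_client.list_objects_v2(Bucket='BUCKET-NAME', Delimiter='/')
# CommonPrefixes is absent when there are no folders, so default to an empty list
for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'][:-1])  # strip the trailing '/'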
If your bucket has a HUGE number of folders and objects, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
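A single list_objects_v2 call returns at most 1,000 entries, so for a large bucket the same idea is usually paired with a paginator. The following is a minimal sketch rather than a definitive implementation: the helper name list_subfolders is hypothetical, and it takes the client as a parameter so it can be tried against a stub. Passing Prefix='levelOne/' together with Delimiter='/' restricts the listing to that one level, which is what the question asks for.

```python
def list_subfolders(s3_client, bucket, prefix=""):
    """Return the immediate "sub-folder" prefixes under `prefix`.

    `list_subfolders` is a hypothetical helper name; `s3_client` is expected
    to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    folders = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # A page without sub-folders has no CommonPrefixes key at all
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"])
    return folders


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(list_subfolders(boto3.client("s3"), "MyBucket", "levelOne/"))
```

Only the common prefixes at the requested level are returned to the caller, so the code never walks over the individual objects below them.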

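I think the following is close, but note that with a Delimiter, bucket.objects.filter yields only the objects directly under levelOne/; the sub-folder prefixes (CommonPrefixes) are not returned:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('MyBucket')
for obj in bucket.objects.filter(Prefix="levelOne/", Delimiter="/"):
    print(obj.key)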

Related

Read content of a file located under subfolders of S3 in Python

I'm trying to read a file's contents (without downloading it) from an S3 bucket. The problem is that the file is located under several levels of folders. For instance, the full path could be s3://s3-bucket/folder-1/folder-2/my_file.json. How can I get that specific file instead of using my iterative approach that lists all objects?
Here is the code that I want to change:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucket')
for obj in bucket.objects.all():
    key = obj.key
    if key == 'folder-1/folder-2/my_file.json':
        body = obj.get()['Body'].read()
        break
Can it be done in a simpler, more direct way?
Yes - there is no need to enumerate the bucket.
Read the file directly using s3.Object, providing the bucket name as the first parameter and the object key as the second.
"Folders" don't really exist in S3 - Amazon S3 doesn't use hierarchy to organize its objects and files. For the sake of organizational simplicity, the Amazon S3 console shows "folders" as a means of grouping objects but they are ultimately baked into your object key.
This should work:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object("s3-bucket", "folder-1/folder-2/my_file.json")
body = obj.get()['Body'].read()
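Since the file in the question is JSON, here is a minimal sketch of the same direct read through the low-level client, followed by parsing. The helper name read_json_object is hypothetical, and the UTF-8 decoding is an assumption about how the file was written.

```python
import json


def read_json_object(s3_client, bucket, key):
    """Fetch one object by its full key and parse the body as JSON.

    `read_json_object` is a hypothetical helper; `s3_client` is expected to
    behave like boto3.client("s3"). Assumes the object is UTF-8 encoded JSON.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Body is a stream; read() pulls the whole object into memory
    raw = response["Body"].read()
    return json.loads(raw.decode("utf-8"))


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   data = read_json_object(boto3.client("s3"), "s3-bucket",
#                           "folder-1/folder-2/my_file.json")
```

The key is passed whole, "folders" included, so no listing request is made at all.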

How to list and read each of the files in specific folder of an S3 bucket using Python Boto3

I have some files in a specific folder of an s3 bucket. All file names are in the same pattern like below:
s3://example_bucket/products/product1/rawmat-2343274absdfj7827d.csv
s3://example_bucket/products/product1/rawmat-7997werewr666ee.csv
s3://example_bucket/products/product1/rawmat-8qwer897hhw776w3.csv
s3://example_bucket/products/product1/rawmat-2364875349873uy68848732.csv
....
....
Here, I think we can say:
bucket_name = 'example_bucket'
prefix = 'products/product1/'
key = 'rawmat-*.csv'
I need to read each of them. I would strongly prefer not to list all of the objects in the bucket.
What could be the most efficient way to do this?
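Iterate over the objects in the folder using a prefix:
import boto3
bucket_name = 'example_bucket'
prefix = 'products/product1/rawmat'
# objects.filter is a method of a Bucket resource, not of the bucket name string
bucket = boto3.resource('s3').Bucket(bucket_name)
for my_object in bucket.objects.filter(Prefix=prefix):
    print(my_object.key)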
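To actually read each file rather than just print it, something along these lines might work. It is a sketch under assumptions: read_objects is a hypothetical helper, the bucket argument is expected to behave like a boto3 Bucket resource, and each file is assumed small enough to read into memory in one go.

```python
def read_objects(bucket, prefix, suffix=".csv"):
    """Yield (key, bytes) for each object under `prefix` ending in `suffix`.

    `read_objects` is a hypothetical helper; `bucket` is expected to behave
    like boto3.resource("s3").Bucket(name).
    """
    # The prefix is applied by S3 itself, so only matching keys are listed
    for summary in bucket.objects.filter(Prefix=prefix):
        # The suffix check has to happen on the client side
        if summary.key.endswith(suffix):
            yield summary.key, summary.get()["Body"].read()


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   bucket = boto3.resource("s3").Bucket("example_bucket")
#   for key, data in read_objects(bucket, "products/product1/rawmat-"):
#       print(key, len(data))
```

Because the prefix narrows the listing on the server, the rest of the bucket is never enumerated.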

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    # search() yields None for a page that has no CommonPrefixes
    if prefix is not None:
        print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating all of the keys in the bucket and extracting whatever structure that you want to see (depending on your needs, a dict-of-dicts may be a good approach for you).
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

get s3 files with prefix using python

My S3 key is 'folder/filename.xml'. I want to fetch the files whose names end with 'name.xml'.
import boto3
s3 = boto3.resource('s3')
try:
    fileobj = s3.Object('lcu-matillion', 'folder/.*name.xml').get()['Body']
    data = fileobj.read()
except Exception:
    print('not found')
Can anyone please help with working code?
Thanks
Don't forget that there could be multiple files that match that wildcard.
You would use something like:
import boto3
s3 = boto3.resource('s3', region_name='ap-southeast-2')
bucket = s3.Bucket('my-bucket')
objects = bucket.objects.filter(Prefix='folder-name/')
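for obj in objects:
    if obj.key.endswith('.txt'):
        # Download through the Bucket; keep only the file name so that
        # no local sub-directories are needed under /tmp
        bucket.download_file(obj.key, '/tmp/' + obj.key.split('/')[-1])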
This is a pretty old question, but I am at a loss as to why the accepted answer is such a poor and potentially dangerous one.
It essentially lists ALL objects and moves the searching to the client side. On a bucket with thousands of objects (most buckets, I guess) that is terrible.
What you need to do is use .filter() instead of .all():
s3 = boto3.resource('s3')
buc = s3.Bucket("twtalyser")
for s in buc.objects.filter(Prefix='my/desired/prefix'):
    print(s)
UPDATE
The main answer was updated to reflect the point I was making.
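S3 cannot evaluate a wildcard or regular expression in a key, so a pattern like the one in the question has to be split in two: a literal prefix that S3 filters on, and a pattern applied to the returned keys on the client. A minimal sketch of that split, assuming a hypothetical helper named matching_keys and a client passed in as a parameter:

```python
import re


def matching_keys(s3_client, bucket, prefix, pattern):
    """Yield keys under `prefix` whose full key matches the regex `pattern`.

    `matching_keys` is a hypothetical helper; `s3_client` is expected to
    behave like boto3.client("s3").
    """
    regex = re.compile(pattern)
    paginator = s3_client.get_paginator("list_objects_v2")
    # Server side: only keys starting with `prefix` are listed
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no Contents key
        for obj in page.get("Contents", []):
            # Client side: apply the pattern to each listed key
            if regex.search(obj["Key"]):
                yield obj["Key"]


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   for key in matching_keys(boto3.client("s3"), "lcu-matillion",
#                            "folder/", r"name\.xml$"):
#       print(key)
```

The longer the literal prefix, the fewer keys S3 has to return before the client-side pattern is applied.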

What is the fastest way to empty s3 bucket using boto3?

I was thinking about deleting and then re-creating the bucket (a bad option, as I realised later).
So how can I delete all objects from the bucket?
I tried this: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.delete_objects
But it deletes multiple specified objects, not all of them.
Can you suggest the best way to empty a bucket?
Just use the AWS CLI:
aws s3 rm s3://mybucket --recursive
For a longer answer, if you insist on using boto3: the code below sends delete requests to S3 (on a versioned bucket this only adds delete markers). No folder handling is required. bucket.objects.all() creates an iterator that is not limited to 1,000 keys.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
# suggested by Jordon Philips
bucket.objects.all().delete()
If versioning is enabled, there's a similar call to the other answer to delete all object versions:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
bucket.object_versions.delete()
Building on the previous responses and adding a check for whether versioning is enabled, you can empty a bucket, versioned or not, with:
import boto3
bucket_name = 'bucket-name'  # placeholder: replace with your bucket
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket_name)
bucket_versioning = s3.BucketVersioning(bucket_name)
if bucket_versioning.status == 'Enabled':
    s3_bucket.object_versions.delete()
else:
    s3_bucket.objects.all().delete()
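For reference, the delete_objects call mentioned in the question can also be driven by hand with the low-level client. This is only a sketch under assumptions: empty_bucket is a hypothetical helper, it handles current objects only (not old versions on a versioned bucket), and it does not inspect the per-key errors that delete_objects can report. It relies on list_objects_v2 returning at most 1,000 keys per page, which is also the most that one delete_objects request accepts.

```python
def empty_bucket(s3_client, bucket):
    """Delete every current object in `bucket`; return how many keys were sent.

    `empty_bucket` is a hypothetical helper; `s3_client` is expected to behave
    like boto3.client("s3"). Object versions are not handled here.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    sent = 0
    for page in paginator.paginate(Bucket=bucket):
        # One page holds at most 1,000 keys, which fits one delete request
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            sent += len(keys)
    return sent


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(empty_bucket(boto3.client("s3"), "my-bucket"))
```

This is roughly one request per 1,000 objects, which is the same batching that the resource-level delete() calls above are doing for you.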
