Read the contents of a file located under subfolders of S3 in Python

I'm trying to read a file's contents (not download it) from an S3 bucket. The problem is that the file is located several folder levels deep. For instance, the full path could be s3://s3-bucket/folder-1/folder-2/my_file.json. How can I get that specific file directly instead of using my iterative approach that lists all objects?
Here is the code that I want to change:
import boto3

def read_my_file():
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('s3-bucket')
    for obj in bucket.objects.all():
        key = obj.key
        if key == 'folder-1/folder-2/my_file.json':
            return obj.get()['Body'].read()
Can it be done in a simpler, more direct way?

Yes - there is no need to enumerate the bucket.
Read the object directly with s3.Object, passing the bucket name as the first argument and the object key as the second.
"Folders" don't really exist in S3 - Amazon S3 doesn't use a hierarchy to organize its objects. For the sake of organizational simplicity, the Amazon S3 console shows "folders" as a way of grouping objects, but the folder path is ultimately just part of the object key.
This should work:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object("s3-bucket", "folder-1/folder-2/my_file.json")
body = obj.get()['Body'].read()
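If you prefer the lower-level client API, the same read can be done with get_object; a minimal sketch using the bucket and key from the question:
import boto3

s3_client = boto3.client('s3')
# get_object returns the object's metadata plus a streaming body
response = s3_client.get_object(Bucket='s3-bucket', Key='folder-1/folder-2/my_file.json')
body = response['Body'].read()      # raw bytes
text = body.decode('utf-8')         # decode if the object is text/JSON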

Related

How to list and read each of the files in specific folder of an S3 bucket using Python Boto3

I have some files in a specific folder of an s3 bucket. All file names are in the same pattern like below:
s3://example_bucket/products/product1/rawmat-2343274absdfj7827d.csv
s3://example_bucket/products/product1/rawmat-7997werewr666ee.csv
s3://example_bucket/products/product1/rawmat-8qwer897hhw776w3.csv
s3://example_bucket/products/product1/rawmat-2364875349873uy68848732.csv
....
....
Here, I think we can say:
bucket_name = 'example_bucket'
prefix = 'products/product1/'
key = 'rawmat-*.csv'
I need to read each of them. I would highly prefer not to list all of the objects in the bucket.
What could be the most efficient way to do this?
Iterate over the objects under that folder using a prefix:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example_bucket')
prefix = 'products/product1/rawmat'
for my_object in bucket.objects.filter(Prefix=prefix):
    print(my_object)
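Since the goal is to read each matching CSV rather than just list it, the body of every object returned by the filter can be read in the same loop; a minimal sketch using the bucket and prefix from the question:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example_bucket')
for obj in bucket.objects.filter(Prefix='products/product1/rawmat'):
    if obj.key.endswith('.csv'):
        # read the CSV contents in memory, without downloading to disk
        data = obj.get()['Body'].read().decode('utf-8')
        print(obj.key, len(data))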

List AWS S3 folders with boto3

I have boto code that collects S3 sub-folders in levelOne folder:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("MyBucket")
for level2 in bucket.list(prefix="levelOne/", delimiter="/"):
    print(level2.name)
Please help me find similar functionality in boto3. The code should not iterate through all S3 objects because the bucket contains a very large number of objects.
If you are simply seeking a list of folders, then use CommonPrefixes returned when listing objects. Note that a Delimiter must be specified to obtain the CommonPrefixes:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='BUCKET-NAME', Delimiter='/')
for prefix in response['CommonPrefixes']:
    print(prefix['Prefix'][:-1])
If your bucket has a HUGE number of folders and objects, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
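For the original question's levelOne/ case specifically, Prefix and Delimiter can be combined so that only the "sub-folders" directly under levelOne/ come back as CommonPrefixes; a minimal sketch using the bucket name from the question:
import boto3

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='MyBucket', Prefix='levelOne/', Delimiter='/')
# CommonPrefixes holds the "sub-folders" directly under levelOne/
for cp in response.get('CommonPrefixes', []):
    print(cp['Prefix'])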
I think the following is close, but note that the resource-level filter only yields objects (not CommonPrefixes), so it prints the keys directly under levelOne/ rather than the sub-folder names:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('MyBucket')
for obj in bucket.objects.filter(Prefix="levelOne/", Delimiter="/"):
    print(obj.key)

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
"The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does." (source)
In other words, there's no way around iterating over all of the keys in the bucket and extracting whatever structure you want to see (depending on your needs, a dict-of-dicts may be a good approach for you).
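As an illustration of that idea (not code from the answer above), a minimal sketch that folds the flat key listing into a dict-of-dicts folder tree; the bucket name is a placeholder:
import boto3

def build_folder_tree(bucket_name):
    """Fold flat S3 keys into a nested dict-of-dicts 'folder' tree."""
    s3 = boto3.resource('s3')
    tree = {}
    for obj in s3.Bucket(bucket_name).objects.all():
        node = tree
        # every path segment except the last is treated as a "folder"
        for part in obj.key.split('/')[:-1]:
            node = node.setdefault(part, {})
    return tree

# the top-level "folders" are simply the first-level keys of the tree
print(list(build_folder_tree('my-bucket')))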
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

Python: recursive glob in s3

I am trying to get a list of parquet files paths from s3 that are inside of subdirectories and subdirectories of subdirectories (and so on and so forth).
If it was my local file system I would do this:
import glob
glob.glob('C:/Users/user/info/**/*.parquet', recursive=True)
I have tried using the glob method of s3fs; however, it doesn't have a recursive kwarg.
Is there a function I can use, or do I need to implement it myself?
You can use s3fs with glob; ** matches across nested sub-directories:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
# ** descends into sub-directories of sub-directories, like glob(..., recursive=True)
s3.glob('your/s3/path/here/**/*.parquet')
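Once the matching paths are found, the files can be read without downloading them to disk; a minimal sketch assuming pandas with a parquet engine (e.g. pyarrow) is installed, with the same placeholder path as above:
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
for path in s3.glob('your/s3/path/here/**/*.parquet'):
    # s3.open returns a file-like object that pandas can read directly
    with s3.open(path, 'rb') as f:
        df = pd.read_parquet(f)
        print(path, df.shape)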
I also wanted to download the latest file from an S3 bucket, but located in a specific folder. Initially I tried using glob but couldn't find a solution, so I built the following function. You can modify it to work with subfolders.
This function returns a dictionary of all file names and timestamps in key-value pairs
(key: file name, value: timestamp).
Just pass the bucket name and the prefix (which is the folder name).
import boto3

def get_file_names(bucket_name, prefix):
    """
    Return a dict of {file name: LastModified timestamp} for every key in
    the bucket that starts with the given prefix (folder name).
    :param bucket_name: Name of the S3 bucket.
    :param prefix: Only keep keys that start with this prefix (folder name).
    """
    s3_client = boto3.client('s3')
    objs = s3_client.list_objects_v2(Bucket=bucket_name)['Contents']
    shortlisted_files = dict()
    for obj in objs:
        key = obj['Key']
        timestamp = obj['LastModified']
        # if the key starts with the folder name, keep that key
        if key.startswith(prefix):
            # add a new key-value pair
            shortlisted_files[key] = timestamp
    return shortlisted_files

file_names = get_file_names(bucket_name='use_your_bucket_name', prefix='folder_name/')
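To pick the latest file from the dictionary returned above, take the key with the most recent LastModified value; a short sketch reusing the file_names result:
# the key whose timestamp is the most recent
latest_filename = max(file_names, key=file_names.get)
print(latest_filename)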
S3 doesn't actually have subdirectories, per se.
boto3's S3.Client.list_objects() supports a Prefix argument, which should get you all the objects under a given "directory" in a bucket, no matter how "deep" they appear to be.
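For example, a minimal sketch (the bucket name and prefix are placeholders) using a paginator with Prefix to collect every .parquet key under a deep "directory":
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

parquet_keys = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='info/'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.parquet'):
            parquet_keys.append(obj['Key'])
print(parquet_keys)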

Inserting items in sub-buckets on S3 using boto

Say I have the following bucket set up on S3
MyBucket/MySubBucket
And I am planning on serving static media for a website out of MySubBucket (like images users have uploaded, etc.). The way the S3 configuration currently appears to be set up, there is no way to make the top-level bucket "MyBucket" public, so to mimic this you need to make every individual item in that bucket public after it has been inserted.
You can make a sub-bucket like "MySubBucket" public, but the problem I am currently having is figuring out how to call boto to insert a test image into that bucket. Can anyone provide an example?
Thanks
You actually just need to put the sub-bucket name and a / before the file name. The following code snippet will put a file called 'test.jpg' into MySubBucket. The call to k.make_public() makes that file public, but not the bucket itself.
from boto.s3.connection import S3Connection
from boto.s3.key import Key
connection = S3Connection(AWSKEY, AWS_SECRET_KEY)
bucket = connection.get_bucket('MyBucket')
file = 'test.jpg'
k = Key(bucket)
k.key = 'MySubBucket/' + file
k.set_contents_from_filename(file)
k.make_public()
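For reference, the same upload in boto3 (boto v2 is long deprecated) looks roughly like this; a sketch using the bucket and key names from the question, and assuming the bucket still permits object ACLs:
import boto3

s3 = boto3.resource('s3')
# "MySubBucket/" is just a key prefix; there is no real sub-bucket
s3.Object('MyBucket', 'MySubBucket/test.jpg').upload_file(
    'test.jpg',
    ExtraArgs={'ACL': 'public-read'},  # equivalent of k.make_public()
)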
