AWS Rekognition and s3 calling subfolders in Python Lambda - python

I'm having trouble figuring out how to access a certain folder within a bucket in s3 using Python
Let's say I'm trying to access this folder in the bucket which contains a bunch of images that I want to run rekognition on:
"myBucket/subfolder/images/"
In /images/ folder there are:
one.jpg
two.jpg
three.jpg
four.jpg
I want to run rekognition's detect_labels on this folder. However, I can't seem to access this folder but if I change the bucket_name to just the root folder ("myBucket"/), then I can access just that folder.
bucket_name = "myBucket/subfolder/images/"
rekognition = boto3.client('rekognition')
s3 = boto3.resource('s3')
bucket = s3.Bucket(name=bucket_name)

That is operating as expected. The bucket name should be just the name of the bucket.
You can then run operation on the bucket, eg:
import boto3
s3 = boto3.resource('s3', region_name='ap-southeast-2')
bucket = s3.Bucket('my-bucket')
for object in bucket.objects.all():
if object.key.startswith('images'):
print object.key
Or, using the client instead of the resource:
import boto3
client = boto3.client('s3', region_name='ap-southeast-2')
response = client.list_objects_v2(Bucket='my-bucket', Prefix='images/')
for object in response['Contents']:
print object['Key']

You could index elsewhere what you have in S3, in such a way, you directly access what you need. Have in mind that to loop against files stored in buckets, may offer very low performance and it gets really slow if the number of keys you have is big.
Following your example, another way to do it:
bucket_name = "myBucket"
folder_name = "subfolder/images/"
rekognition = boto3.client('rekognition')
keys= ['one.jpg','two.jpg','three.jpg','four.jpg']
s3 = boto3.resource('s3')
for k in keys:
obj = s3.Object(bucket_name, folder_name+k )
print(obj.key)
Get the list of items(keys) from any db table in your system.

In the case of AWS Rekognition (as asked), an image file stored in a folder in a S3 bucket will have the key in the form of folder_name/subfolder_name/image_name.jpg. So, since the boto3 Rekognition detect_labels() method has this syntax (per https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rekognition.html#Rekognition.Client.detect_labels):
response = client.detect_labels(
Image={
'Bytes': b'bytes',
'S3Object': {
'Bucket': 'string',
'Name': 'string',
'Version': 'string'
}
},
MaxLabels=123,
MinConfidence=...
)
where the value of Name should be a S3 object key name, you can just feed the entire folder-image path to that dictionary as a string. To loop over multiple images, generate a list of image file names, as recommended in Evhz's answer, and loop over that list while calling the detect_labels() method above (or use a generator).

Related

Unzip file content hosted in s3 to multiple cloudfront url through a single lambda function

Is there any specific way to unzip single file contents from s3 to multiple cloudfront urls by triggering lambda once.
Lets say in there is a zip file contains multiple jpg/ png files already uploaded to s3. Intention is to run lambda function only once to unzip all its file content and make them available in multiple cloudfront urls.
in s3 bucket
archive.zip
a.jpg
b.jpg
c.jpg
through cloudfront
https://1232.cloudfront.net/a.jpg
https://1232.cloudfront.net/b.jpg
https://1232.cloudfront.net/c.jpg
I am looking for a solution such that lambda function trigger function calls whenever a s3 upload happens and make all files available in the zip through cloudfront multiple urls.
Hello Prathap Parameswar,
I think you can resolve your problem like this:
First you need to exact your zip file
Seconds you upload them again to S3.
This is lambda python function:
import json
import boto3
from io import BytesIO
import zipfile
def lambda_handler(event, context):
# TODO implement
s3_resource = boto3.resource('s3')
source_bucket = 'upload-zip-folder'
target_bucket = 'upload-extracted-folder'
my_bucket = s3_resource.Bucket(source_bucket)
for file in my_bucket.objects.all():
if(str(file.key).endswith('.zip')):
zip_obj = s3_resource.Object(bucket_name=source_bucket, key=file.key)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
file_info = z.getinfo(filename)
try:
response = s3_resource.meta.client.upload_fileobj(
z.open(filename),
Bucket=target_bucket,
Key=f'{filename}'
)
except Exception as e:
print(e)
else:
print(file.key+ ' is not a zip file.')
Hope this can help you

Get count of objects in a specific S3 folder using Boto3

Trying to get count of objects in S3 folder
Current code
bucket='some-bucket'
File='someLocation/File/'
objs = boto3.client('s3').list_objects_v2(Bucket=bucket,Prefix=File)
fileCount = objs['KeyCount']
This gives me the count as 1+actual number of objects in S3.
Maybe it is counting "File" as a key too?
Assuming you want to count the keys in a bucket and don't want to hit the limit of 1000 using list_objects_v2. The below code worked for me but I'm wondering if there is a better faster way to do it! Tried looking if there's a packaged function in boto3 s3 connector but there isn't!
# connect to s3 - assuming your creds are all set up and you have boto3 installed
s3 = boto3.resource('s3')
# identify the bucket - you can use prefix if you know what your bucket name starts with
for bucket in s3.buckets.all():
print(bucket.name)
# get the bucket
bucket = s3.Bucket('my-s3-bucket')
# use loop and count increment
count_obj = 0
for i in bucket.objects.all():
count_obj = count_obj + 1
print(count_obj)
If there are more than 1000 entries, you need to use paginators, like this:
count = 0
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/', Delimiter='/'):
count += len(result.get('CommonPrefixes'))
"Folders" do not actually exist in Amazon S3. Instead, all objects have their full path as their filename ('Key'). I think you already know this.
However, it is possible to 'create' a folder by creating a zero-length object that has the same name as the folder. This causes the folder to appear in listings and is what happens if folders are created via the management console.
Thus, you could exclude zero-length objects from your count.
For an example, see: Determine if folder or file key - Boto
The following code worked perfectly
def getNumberOfObjectsInBucket(bucketName,prefix):
count = 0
response = boto3.client('s3').list_objects_v2(Bucket=bucketName,Prefix=prefix)
for object in response['Contents']:
if object['Size'] != 0:
#print(object['Key'])
count+=1
return count
object['Size'] == 0 will take you to folder names, if want to check them, object['Size'] != 0 will lead you to all non-folder keys.
Sample function call below:
getNumberOfObjectsInBucket('foo-test-bucket','foo-output/')
If you have credentials to access that bucket, then you can use this simple code. Below code will give you a list. List comprehension is used for more readability.
Filter is used to filter objects because in bucket to identify the files ,folder names are used. As explained by John Rotenstein concisely.
import boto3
bucket = "Sample_Bucket"
folder = "Sample_Folder"
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)
files_in_s3 = [f.key.split(folder + "/")[1] for f in s3_bucket.objects.filter(Prefix=folder).all()]

how can I make a s3 bucket sub images iteration generic

my bucket on s3 is named as 'python' and its subfolder is 'boss' . So I want to get all images of folder boss in lambda function. currently I am hard-coding values but putting image in root not in subfolder.
bucket="python"
key="20180530105812.jpeg"
then I want to call this function one by one for all images
def lambda_handler(event, context):
# Get the object from the event
bucket="ais-django"
key="20180530105812.jpeg"
try:
# Calls Amazon Rekognition IndexFaces API to detect faces in S3 object
# to index faces into specified collection
response = index_faces(bucket, key)
# Commit faceId and full name object metadata to DynamoDB
Here is the best thing what you can do , add s3 event notification as a trigger to lambda function and configure it for your prefix of object that is "boss/" in your case
here prefix will be "boss/"
Then change your bucket and key to this in your code :
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
With help of this , whenever object will be uploaded to your bucket/boss/ path, your code will automatically fetch it and will run it doing your processing in code
With this your lambda will not require any hard-coded strings of bucket and key and will run automatically against image upload in your subfolder path.Further if you want only images to be processed , add filter pattern to be .jpg , .jpeg , .png , etc
You can filter that folder within your bucket. As an example:
#import boto3
s3 = boto3.resource('s3')
python_bucket = s3.Bucket('python')
for images in python_bucket.objects.filter(Prefix="boss/"):
print images.key
UPDATE:
According to your recent edit, you can iterate through the bucket/folder and run your script. This is a more complete snippet that should works well for your Lambda function:
import boto3
def lambda_handler(event, context):
s3 = boto3.resource('s3')
images = ""
python_bucket = s3.Bucket('python')
#Here, you're going through each image in your bucket/folder.
for image in python_bucket.objects.filter(Prefix="boss/"):
images += image.key
return images
Use list_object operation on s3 Client.
bucket="python"
client=boto3.client('s3')
response = client.list_objects(
Bucket=bucket,
Prefix='boss'
)
numberofobjects=len(response['Contents'])
for x in range(1, numberofobjects):
try:
response2=index_faces(bucket, response['Contents'][x]['Key'])

Listing contents of a bucket with boto3

How can I see what's inside a bucket in S3 with boto3? (i.e. do an "ls")?
Doing the following:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('some/path/')
returns:
s3.Bucket(name='some/path/')
How do I see its contents?
One way to see the contents would be:
for my_bucket_object in my_bucket.objects.all():
print(my_bucket_object)
This is similar to an 'ls' but it does not take into account the prefix folder convention and will list the objects in the bucket. It's left up to the reader to filter out prefixes which are part of the Key name.
In Python 2:
from boto.s3.connection import S3Connection
conn = S3Connection() # assumes boto.cfg setup
bucket = conn.get_bucket('bucket_name')
for obj in bucket.get_all_keys():
print(obj.key)
In Python 3:
from boto3 import client
conn = client('s3') # again assumes boto.cfg setup, assume AWS S3
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
print(key['Key'])
I'm assuming you have configured authentication separately.
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucket_name')
for file in my_bucket.objects.all():
print(file.key)
My s3 keys utility function is essentially an optimized version of #Hephaestus's answer:
import boto3
s3_paginator = boto3.client('s3').get_paginator('list_objects_v2')
def keys(bucket_name, prefix='/', delimiter='/', start_after=''):
prefix = prefix.lstrip(delimiter)
start_after = (start_after or prefix) if prefix.endswith(delimiter) else start_after
for page in s3_paginator.paginate(Bucket=bucket_name, Prefix=prefix, StartAfter=start_after):
for content in page.get('Contents', ()):
yield content['Key']
In my tests (boto3 1.9.84), it's significantly faster than the equivalent (but simpler) code:
import boto3
def keys(bucket_name, prefix='/', delimiter='/'):
prefix = prefix.lstrip(delimiter)
bucket = boto3.resource('s3').Bucket(bucket_name)
return (_.key for _ in bucket.objects.filter(Prefix=prefix))
As S3 guarantees UTF-8 binary sorted results, a start_after optimization has been added to the first function.
In order to handle large key listings (i.e. when the directory list is greater than 1000 items), I used the following code to accumulate key values (i.e. filenames) with multiple listings (thanks to Amelio above for the first lines). Code is for python3:
from boto3 import client
bucket_name = "my_bucket"
prefix = "my_key/sub_key/lots_o_files"
s3_conn = client('s3') # type: BaseClient ## again assumes boto.cfg setup, assume AWS S3
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter = "/")
if 'Contents' not in s3_result:
#print(s3_result)
return []
file_list = []
for key in s3_result['Contents']:
file_list.append(key['Key'])
print(f"List count = {len(file_list)}")
while s3_result['IsTruncated']:
continuation_key = s3_result['NextContinuationToken']
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter="/", ContinuationToken=continuation_key)
for key in s3_result['Contents']:
file_list.append(key['Key'])
print(f"List count = {len(file_list)}")
return file_list
If you want to pass the ACCESS and SECRET keys (which you should not do, because it is not secure):
from boto3.session import Session
ACCESS_KEY='your_access_key'
SECRET_KEY='your_secret_key'
session = Session(aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
your_bucket = s3.Bucket('your_bucket')
for s3_file in your_bucket.objects.all():
print(s3_file.key)
#To print all filenames in a bucket
import boto3
s3 = boto3.client('s3')
def get_s3_keys(bucket):
"""Get a list of keys in an S3 bucket."""
resp = s3.list_objects_v2(Bucket=bucket)
for obj in resp['Contents']:
files = obj['Key']
return files
filename = get_s3_keys('your_bucket_name')
print(filename)
#To print all filenames in a certain directory in a bucket
import boto3
s3 = boto3.client('s3')
def get_s3_keys(bucket, prefix):
"""Get a list of keys in an S3 bucket."""
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp['Contents']:
files = obj['Key']
print(files)
return files
filename = get_s3_keys('your_bucket_name', 'folder_name/sub_folder_name/')
print(filename)
Update:
The most easiest way is to use awswrangler
import awswrangler as wr
wr.s3.list_objects('s3://bucket_name')
import boto3
s3 = boto3.resource('s3')
## Bucket to use
my_bucket = s3.Bucket('city-bucket')
## List objects within a given prefix
for obj in my_bucket.objects.filter(Delimiter='/', Prefix='city/'):
print obj.key
Output:
city/pune.csv
city/goa.csv
A more parsimonious way, rather than iterating through via a for loop you could also just print the original object containing all files inside your S3 bucket:
session = Session(aws_access_key_id=aws_access_key_id,aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
bucket = s3.Bucket('bucket_name')
files_in_s3 = bucket.objects.all()
#you can print this iterable with print(list(files_in_s3))
So you're asking for the equivalent of aws s3 ls in boto3. This would be listing all the top level folders and files. This is the closest I could get; it only lists all the top level folders. Surprising how difficult such a simple operation is.
import boto3
def s3_ls():
s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')
result = bucket.meta.client.list_objects(Bucket=bucket.name,
Delimiter='/')
for o in result.get('CommonPrefixes'):
print(o.get('Prefix'))
ObjectSummary:
There are two identifiers that are attached to the ObjectSummary:
bucket_name
key
boto3 S3: ObjectSummary
More on Object Keys from AWS S3 Documentation:
Object Keys:
When you create an object, you specify the key name, which uniquely identifies the object in the bucket. For example, in the Amazon S3 console (see AWS Management Console), when you highlight a bucket, a list of objects in your bucket appears. These names are the object keys. The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders. Suppose that your bucket (admin-created) has four objects with the following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
Reference:
AWS S3: Object Keys
Here is some example code that demonstrates how to get the bucket name and the object key.
Example:
import boto3
from pprint import pprint
def main():
def enumerate_s3():
s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
print("Name: {}".format(bucket.name))
print("Creation Date: {}".format(bucket.creation_date))
for object in bucket.objects.all():
print("Object: {}".format(object))
print("Object bucket_name: {}".format(object.bucket_name))
print("Object key: {}".format(object.key))
enumerate_s3()
if __name__ == '__main__':
main()
Here is a simple function that returns you the filenames of all files or files with certain types such as 'json', 'jpg'.
def get_file_list_s3(bucket, prefix="", file_extension=None):
"""Return the list of all file paths (prefix + file name) with certain type or all
Parameters
----------
bucket: str
The name of the bucket. For example, if your bucket is "s3://my_bucket" then it should be "my_bucket"
prefix: str
The full path to the the 'folder' of the files (objects). For example, if your files are in
s3://my_bucket/recipes/deserts then it should be "recipes/deserts". Default : ""
file_extension: str
The type of the files. If you want all, just leave it None. If you only want "json" files then it
should be "json". Default: None
Return
------
file_names: list
The list of file names including the prefix
"""
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket)
file_objs = my_bucket.objects.filter(Prefix=prefix).all()
file_names = [file_obj.key for file_obj in file_objs if file_extension is not None and file_obj.key.split(".")[-1] == file_extension]
return file_names
One way that I used to do this:
import boto3
s3 = boto3.resource('s3')
bucket=s3.Bucket("bucket_name")
contents = [_.key for _ in bucket.objects.all() if "subfolders/ifany/" in _.key]
Here is the solution
import boto3
s3=boto3.resource('s3')
BUCKET_NAME = 'Your S3 Bucket Name'
allFiles = s3.Bucket(BUCKET_NAME).objects.all()
for file in allFiles:
print(file.key)
I just did it like this, including the authentication method:
s3_client = boto3.client(
's3',
aws_access_key_id='access_key',
aws_secret_access_key='access_key_secret',
config=boto3.session.Config(signature_version='s3v4'),
region_name='region'
)
response = s3_client.list_objects(Bucket='bucket_name', Prefix=key)
if ('Contents' in response):
# Object / key exists!
return True
else:
# Object / key DOES NOT exist!
return False
With little modification to #Hephaeastus 's code in one of the above comments, wrote the below method to list down folders and objects (files) in a given path. Works similar to s3 ls command.
from boto3 import session
def s3_ls(profile=None, bucket_name=None, folder_path=None):
folders=[]
files=[]
result=dict()
bucket_name = bucket_name
prefix= folder_path
session = boto3.Session(profile_name=profile)
s3_conn = session.client('s3')
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter = "/", Prefix=prefix)
if 'Contents' not in s3_result and 'CommonPrefixes' not in s3_result:
return []
if s3_result.get('CommonPrefixes'):
for folder in s3_result['CommonPrefixes']:
folders.append(folder.get('Prefix'))
if s3_result.get('Contents'):
for key in s3_result['Contents']:
files.append(key['Key'])
while s3_result['IsTruncated']:
continuation_key = s3_result['NextContinuationToken']
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter="/", ContinuationToken=continuation_key, Prefix=prefix)
if s3_result.get('CommonPrefixes'):
for folder in s3_result['CommonPrefixes']:
folders.append(folder.get('Prefix'))
if s3_result.get('Contents'):
for key in s3_result['Contents']:
files.append(key['Key'])
if folders:
result['folders']=sorted(folders)
if files:
result['files']=sorted(files)
return result
This lists down all objects / folders in a given path. Folder_path can be left as None by default and method will list the immediate contents of the root of the bucket.
It can also be done as follows:
csv_files = s3.list_objects_v2(s3_bucket_path)
for obj in csv_files['Contents']:
key = obj['Key']
Using cloudpathlib
cloudpathlib provides a convenience wrapper so that you can use the simple pathlib API to interact with AWS S3 (and Azure blob storage, GCS, etc.). You can install with pip install "cloudpathlib[s3]".
Like with pathlib you can use glob or iterdir to list the contents of a directory.
Here's an example with a public AWS S3 bucket that you can copy and past to run.
from cloudpathlib import CloudPath
s3_path = CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
# list items with glob
list(
s3_path.glob("*")
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
# list items with iterdir
list(
s3_path.iterdir()
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
Created at 2021-05-21 20:38:47 PDT by reprexlite v0.4.2
A good option may also be to run aws cli command from lambda functions
import subprocess
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def run_command(command):
command_list = command.split(' ')
try:
logger.info("Running shell command: \"{}\"".format(command))
result = subprocess.run(command_list, stdout=subprocess.PIPE);
logger.info("Command output:\n---\n{}\n---".format(result.stdout.decode('UTF-8')))
except Exception as e:
logger.error("Exception: {}".format(e))
return False
return True
def lambda_handler(event, context):
run_command('/opt/aws s3 ls s3://bucket-name')
I was stuck on this for an entire night because I just wanted to get the number of files under a subfolder but it was also returning one extra file in the content that was the subfolder itself,
After researching about it I found that this is how s3 works but I had
a scenario where I unloaded the data from redshift in the following directory
s3://bucket_name/subfolder/<10 number of files>
and when I used
paginator.paginate(Bucket=price_signal_bucket_name,Prefix=new_files_folder_path+"/")
it would only return the 10 files, but when I created the folder on the s3 bucket itself then it would also return the subfolder
Conclusion
If the whole folder is uploaded to s3 then listing the only returns the files under prefix
But if the fodler was created on the s3 bucket itself then listing it using boto3 client will also return the subfolder and the files
First, create an s3 client object:
s3_client = boto3.client('s3')
Next, create a variable to hold the bucket name and folder. Pay attention to the slash "/" ending the folder name:
bucket_name = 'my-bucket'
folder = 'some-folder/'
Next, call s3_client.list_objects_v2 to get the folder's content object's metadata:
response = s3_client.list_objects_v2(
Bucket=bucket_name,
Prefix=folder
)
Finally, with the object's metadata, you can obtain the S3 object by calling the s3_client.get_object function:
for object_metadata in response['Contents']:
object_key = object_metadata['Key']
response = s3_client.get_object(
Bucket=bucket_name,
Key=object_key
)
object_body = response['Body'].read()
print(object_body)
As you can see, the object content in the string format is available by calling response['Body'].read()

Referencing an existing S3 bucket to save file using Boto

I am following this help guide, which explains how to save a file to an S3 bucket after creating it. I cannot find, however, an explanation of how to save to an existing bucket. More specifically, I am unsure how to reference a preexisting bucket. I believe replacing create_bucket with get_bucket does the trick. This allows me to save the file but because the documentation says that get_bucket "retrieves a bucket by name" I wanted to check here and make sure that it only retreives the bucket's metadata and does not does not download all of the contents of the bucket to my computer. Am I doing this right or is there a more pythonic way?
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
from boto.s3.key import Key
k = Key(bucket)
k.key = 'foobar'
k.set_contents_from_string('This is a test of S3')
Your code looks reasonable. The get_bucket method will either return a Bucket object or, if the specified bucket name does not exist, it will raise an S3ResponseError.
You could simplify your code a little:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
k = bucket.new_key('foobar')
k.set_contents_from_string('This is a test of S3')
But it accomplishes the same thing.

Categories

Resources