Download S3 Objects by List of Keys Using Boto3 - python

I've got a list of keys that I'm retrieving from a cache, and I want to download the associated objects (files) from S3 without having to make a request per key.
Assuming I have the following array of keys:
key_array = [
'20160901_0750_7c05da39_INCIDENT_MANIFEST.json',
'20161207_230312_ZX1G222ZS3_INCIDENT_MANIFEST.json',
'20161211_131407_ZX1G222ZS3_INCIDENT_MANIFEST.json',
'20161211_145342_ZX1G222ZS3_INCIDENT_MANIFEST.json',
'20161211_170600_FA68T0303607_INCIDENT_MANIFEST.json'
]
I'm trying to do something similar to this answer on another SO question, but modified like so:
import boto3
s3 = boto3.resource('s3')
incidents = s3.Bucket(my_incident_bucket).objects(key_array)
for incident in incidents:
    # Do fun stuff with the incident body
    incident_body = incident['Body'].read().decode('utf-8')
My ultimate goal is to avoid hitting the AWS API separately for every key in the list. I'd also like to avoid having to pull down the whole bucket and filter/iterate over the full results.

I think the best you are going to get is n API calls, where n is the number of keys in your key_array. The Amazon S3 API doesn't offer much in the way of server-side filtering based on keys, other than prefixes. Here is the code to get it in n API calls:
import boto3
s3 = boto3.client('s3')
for key in key_array:
    incident_body = s3.get_object(Bucket="my_incident_bucket", Key=key)['Body']
    # Do fun stuff with the incident body
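If the concern is wall-clock time rather than the number of requests, those n calls can at least be issued concurrently. Here's a minimal sketch using a thread pool (this is an addition, not part of the original answer; it assumes the same bucket name and key_array as above):
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')  # boto3 clients are safe to share across threads

def fetch(key):
    # Still one GET per key; the pool just overlaps the network waits
    obj = s3.get_object(Bucket="my_incident_bucket", Key=key)
    return key, obj['Body'].read().decode('utf-8')

with ThreadPoolExecutor(max_workers=8) as pool:
    for key, incident_body in pool.map(fetch, key_array):
        # Do fun stuff with the incident body
        print(key, len(incident_body))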

Related

List objects in S3 bucket with timestamp filter

Background
Is there a way to get a list of all the files in an S3 bucket that are newer than a specific timestamp? For example, I am trying to get a list of all the files that were modified yesterday afternoon.
In particular, I have a bucket called foo-bar, and inside it a folder called prod where the files I am trying to parse through live.
What I have tried so far
I referred to the boto3 documentation and came up with the following:
from boto3 import client
conn = client('s3')
conn.list_objects(Bucket='foo-bar', Prefix='prod/')['Contents']
Issues
There are two issues with this solution: the first is that it only lists 1000 files even though I have over 10,000, and the other is that I am not sure how to filter by time.
You can filter based on a timestamp like this:
import boto3
from datetime import datetime, timedelta
from dateutil.tz import tzutc
condition_timestamp = datetime.now(tz=tzutc()) - timedelta(days=2, hours=12) #dynamic condition
#condition_timestamp = datetime(2023, 2, 17, tzinfo=tzutc()) #Fixed condition
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
s3_filtered_list = [obj for page in paginator.paginate(Bucket="foo-bar", Prefix="prod/")
                    for obj in page.get("Contents", [])  # .get avoids a KeyError on empty pages
                    if obj['LastModified'] > condition_timestamp]
print(s3_filtered_list)
Note that I give you two options for creating your condition based on a timestamp: dynamic (x time from now) or fixed (a specific datetime).
You can try to use S3.Paginator.ListObjects, which will return 'LastModified': datetime(2015, 1, 1) as part of object metadata in the Contents array. You can then save the object Keys into a local list based on the LastModified condition.
Since the AWS S3 API doesn't support server-side filtering beyond key prefixes, you'll need to filter the returned objects yourself.
Further, the list_objects and list_objects_v2 APIs only support returning 1000 objects at a time, so you'll need to paginate the results, calling them again and again to get all of the objects in a bucket. There is a helper method, get_paginator, that handles this for you.
So, you can put these two together, and get the list of all objects in a bucket, and filter them based on whatever criteria you see fit:
import boto3
from datetime import datetime, UTC

# Pick a target timestamp to filter objects on or after
# Note, it must be in UTC
target_timestamp = datetime(2023, 2, 1, tzinfo=UTC)
found_objects = []

# Create and use a paginator to list more than 1000 objects in the bucket
BUCKET = "foo-bar"  # the bucket from the question
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET):
    # Pull out each list of objects from each page
    for cur in page.get('Contents', []):
        # Check each object to see if it matches the target criteria
        if cur['LastModified'] >= target_timestamp:
            # If so, add it to the final list
            found_objects.append(cur)

# Just show the number of found objects in this example
print(f"Found {len(found_objects)} objects")

Is there a quick way to get the lexically maximal key in an AWS bucket with a matching prefix

Is there a better way to get the last key (lexically ordered) in an S3 bucket with a matching prefix, using boto3 in Python?
Currently doing the following:
import boto3

s3 = boto3.client('s3')
bucket = 'hello'
prefix = 'is/it/me/your/looking/for'
paginator = s3.get_paginator('list_objects_v2')
last_key = None
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page['Contents']:
        last_key = obj['Key']
Obviously this suffers as the number of objects matching the prefix grows.
No. The on-disk data structure could readily support this, but Amazon doesn't expose such a function.
I have a "find the most recently stored timestamp" use case, that is, compute max(stamp). Here's the countdown kludge to make it work: define END_OF_TIME arbitrarily, perhaps 2060-01-01. The remaining seconds is just that minus the current time. Format it with leading zeros so there's a fixed number of columns, and use it as an S3 object name prefix when writing records.
Using this scheme it is hard to compute min(stamp), but trivial to find the max: it's the first result returned by list_objects_v2.
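A small sketch of that countdown scheme, assuming the 'hello' bucket from the question; the 'events/' key layout and the END_OF_TIME value are illustrative choices, not part of the original answer:
import boto3
from datetime import datetime, timezone

END_OF_TIME = datetime(2060, 1, 1, tzinfo=timezone.utc)  # arbitrary far-future anchor

def countdown_key(stamp, record_id):
    # Zero-padded seconds remaining until END_OF_TIME, so lexicographic
    # order of keys matches reverse chronological order of stamps
    remaining = int((END_OF_TIME - stamp).total_seconds())
    return f"events/{remaining:012d}_{record_id}"

s3 = boto3.client('s3')

# Writing a record
s3.put_object(Bucket="hello",
              Key=countdown_key(datetime.now(timezone.utc), "abc123"),
              Body=b"...")

# The newest record now sorts first, so max(stamp) is one cheap request away
resp = s3.list_objects_v2(Bucket="hello", Prefix="events/", MaxKeys=1)
newest_key = resp["Contents"][0]["Key"]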
You clearly mentioned that you need a boto3 version, but here is a working CLI version. It may give you some clue.
aws s3api list-objects --bucket hello --prefix "is/it/me/your/looking/for" --query "reverse(sort_by(Contents,&Key))"
In your code it may be something like this (apologies, I'm not that familiar with Python):
query = 'reverse(sort_by(Contents, &Key))'
# .search() applies the JMESPath expression to each page of results separately
for obj in paginator.paginate(Bucket=bucket, Prefix=prefix).search(query):
    ...

How can I bypass the boto3 function not allowing uppercase letters in bucket name?

I need to get a list of S3 keys from a bucket. I have a script that works great with boto3. The issue is that the bucket name I'm using has uppercase letters, and this throws an error with boto3.
Looking at "connect to bucket having uppercase letter", that approach uses boto, but I'd like to use boto3, and OrdinaryCallingFormat() doesn't seem to be an option for boto3.
Or, I could adapt the script to work for boto, but I'm not sure how to do that. I tried:
s3 = boto.connect_s3(aws_access_key_id, aws_secret_access_key,
                     calling_format=OrdinaryCallingFormat())
Bucketname = s3.get_bucket('b-datasci/x-DI-S')
but that gave the error boto.exception.S3ResponseError: S3ResponseError: 404 Not Found.
With this attempt:
conn = boto.s3.connect_to_region(
    'us-east-1',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.get_bucket(Bucketname)
I got the error: xml.sax._exceptions.SAXParseException: <unknown>:1:0: no element found
How can I get this to work with boto, or else integrate OrdinaryCallingFormat() into boto3 to list the keys?
The problem is not the uppercase characters; it's the slash.
The bucket name must match this regex:
^[a-zA-Z0-9.\-_]{1,255}$
To list objects in bucket "mybucket" with prefix "b-datasci/x-DI-S", you're going too low-level. Use this instead:
import boto3

client = boto3.client('s3')
response = client.list_objects_v2(Bucket="mybucket", Prefix="b-datasci/x-DI-S")
for obj in response.get('Contents', []):
    ...
As pointed out by @John Rotenstein, all bucket names must be DNS-compliant, so a slash is not allowed in a bucket name.
In fact the 2nd piece of code above does work, when I added this: for key in bucket.get_all_keys(prefix='s-DI-S/', delimiter='/') and took what was really the prefix off the Bucketname.
From the above comment, it seems the OP may have confused the S3 object store with a file directory system. S3 objects are never stored in a hierarchy; the so-called "prefix" is simply a text filter that returns object names beginning with that prefix.
In the OP's case, there is a bucket called b-datasci, and many objects with the prefix x-DI-S stored under that bucket.
In addition, the delimiter argument to get_all_keys does not do what the OP thinks. If you check the AWS documentation on Listing Keys Hierarchically Using a Prefix and Delimiter, you will see that it is just a secondary filter, i.e.:
The prefix and delimiter parameters limit the kind of results returned by a list operation. Prefix limits results to only those keys that begin with the specified prefix, and delimiter causes list to roll up all keys that share a common prefix into a single summary list result.
P.S.: the OP should use boto3, because boto has been deprecated by AWS for more than two years.

How to retrieve bucket prefixes in a filesystem style using boto3

Doing something like the following:
s3 = boto3.resource('s3')
bucket = s3.Bucket('a_dummy_bucket')
bucket.objects.all()
This will return all the objects in the 'a_dummy_bucket' bucket, like:
test1/blah/blah/afile45645.zip
test1/blah/blah/afile23411.zip
test1/blah/blah/afile23411.zip
[...] 2500 files
test2/blah/blah/afile.zip
[...] 2500 files
test3/blah/blah/afile.zip
[...] 2500 files
Is there any way of getting, in this case, 'test1','test2', 'test3', etc... without paginating over all results?
To reach 'test2' I need 3 paginated calls, each with 1000 keys, just to know that there is a 'test2', then another 3 calls with 1000 keys to reach 'test3', and so on.
How can I get all these prefixes without paginating over all results?
Thanks
I believe getting the Common Prefixes is what you are looking for, which can be done using this example:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The AWS documentation for Bucket GET says the following regarding CommonPrefixes:
A response can contain CommonPrefixes only if you specify a delimiter. When you do, CommonPrefixes contains all (if there are any) keys between Prefix and the next occurrence of the string specified by delimiter. In effect, CommonPrefixes lists keys that act like subdirectories in the directory specified by Prefix. For example, if prefix is notes/ and delimiter is a slash (/), in notes/summer/july, the common prefix is notes/summer/. All of the keys rolled up in a common prefix count as a single return when calculating the number of returns. See MaxKeys.
Type: String
Ancestor: ListBucketResult
There is no way to escape pagination. You can specify fewer than the default page size of 1000 keys per call, but not more than 1000. If you think paginating and finding the prefixes is messy, try the AWS CLI, which internally does all the pagination for you and gives you only the results you want:
aws s3 ls s3://a_dummy_bucket/
PRE test1/
PRE test2/
PRE test3/

How can I get a list of all file names having same prefix on Amazon S3?

I am using boto and Python to store and retrieve files to and from Amazon S3.
I need to get the list of files present in a directory. I know there is no concept of directories in S3, so I am phrasing my question as: how can I get a list of all file names having the same prefix?
For example, let's say I have the following files:
Brad/files/pdf/abc.pdf
Brad/files/pdf/abc2.pdf
Brad/files/pdf/abc3.pdf
Brad/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
When I call foo("Brad"), it should return a list like this:
files/pdf/abc.pdf
files/pdf/abc2.pdf
files/pdf/abc3.pdf
files/pdf/abc4.pdf
What is the best way to do it?
user3's approach is a pure client-side solution. I think it works well at small scale, but if you have millions of objects in one bucket, you may pay for many requests and a lot of bandwidth.
Alternatively, you can use the delimiter and prefix parameters provided by the GET Bucket API to achieve your requirement. There are many examples in the documentation; see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
Needless to say, you can use boto to achieve this.
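For what it's worth, here is a minimal boto3 sketch of that server-side prefix approach (the bucket name is a placeholder, and the key trimming is only there to match the expected output above):
import boto3

s3 = boto3.client('s3')

def foo(prefix, bucket="mybucket"):
    # Let S3 filter server-side: only keys starting with "<prefix>/" come back
    paginator = s3.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix + '/'):
        for obj in page.get('Contents', []):
            # Drop the leading "Brad/" so the result matches the expected output
            keys.append(obj['Key'][len(prefix) + 1:])
    return keys

print(foo('Brad'))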
You can use startswith and a list comprehension for this purpose, as below:
paths = ['Brad/files/pdf/abc.pdf', 'Brad/files/pdf/abc2.pdf', 'Brad/files/pdf/abc3.pdf', 'Brad/files/pdf/abc4.pdf',
         'mybucket/files/pdf/new/', 'mybucket/files/pdf/new/abc.pdf', 'mybucket/files/pdf/2011/']

def foo(m):
    return [p for p in paths if p.startswith(m + '/')]

print(foo('Brad'))
output:
['Brad/files/pdf/abc.pdf', 'Brad/files/pdf/abc2.pdf', 'Brad/files/pdf/abc3.pdf', 'Brad/files/pdf/abc4.pdf']
Using split and filter:
def foo(m):
    return list(filter(lambda x: x.split('/')[0] == m, paths))
