I am using a versioned S3 bucket with boto3. How can I retrieve all versions for a given key (or even all versions for all keys)? I can do this:
for obj in b.objects.filter(Prefix=pref):
    print(obj.key)
but that gives me only the most recent version for each key.
import boto3

bucket = 'bucket name'
key = 'key'

s3 = boto3.resource('s3')
versions = s3.Bucket(bucket).object_versions.filter(Prefix=key)

for version in versions:
    obj = version.get()
    print(obj.get('VersionId'), obj.get('ContentLength'), obj.get('LastModified'))
I can't take credit, as I had the same question, but I found this here.
The boto3 S3 client has a list_object_versions method.
resp = client.list_object_versions(Prefix=prefix, Bucket=bucket)

for obj in [*resp['Versions'], *resp.get('DeleteMarkers', [])]:
    print(f"Key: {obj['Key']}")
    print(f"VersionId: {obj['VersionId']}")
    print(f"LastModified: {obj['LastModified']}")
    print(f"IsLatest: {obj['IsLatest']}")
    print(f"Size: {obj.get('Size', 0)/1e6}")
Supposing you wanted to delete all but the current version, you could do so by adding the objects where IsLatest is false to a to_delete list (a sketch of building that list follows the loop below), then running the following:
for obj in to_delete:
    print(client.delete_object(Bucket=bucket, Key=obj['Key'], VersionId=obj['VersionId']))
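For completeness, here is one way the to_delete list could be built from the same list_object_versions response; this is a minimal sketch reusing the client, bucket and prefix variables from above (note the response is capped at 1000 entries per call, so very large buckets would need pagination):

resp = client.list_object_versions(Prefix=prefix, Bucket=bucket)

# keep every version (and delete marker) that is not the current one
to_delete = [
    obj
    for obj in [*resp['Versions'], *resp.get('DeleteMarkers', [])]
    if not obj['IsLatest']
]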
The existing answers describe how to get the version ids of the objects. If we're interested in actually using versions of the object(s), this usually involves accessing part, or all, of the object's contents.
First get the version ids. I'll use al76's answer here for getting version ids to illustrate both boto3.resource and boto3.client examples.
import boto3
from collections import deque

bucket = 'bucket name'
key = 'key'

s3 = boto3.resource('s3')
versions = s3.Bucket(bucket).object_versions.filter(Prefix=key)

s3_object_versions = deque()
for version in versions:
    obj = version.get()
    s3_object_versions.appendleft(obj.get('VersionId'))
Next we need to get the actual versioned objects. There are a couple of ways to do this, depending on whether you have an s3 resource or the s3 client. Given that we got the version ids with the resource, we'll start there.
# using the first version of the object
obj_version = s3.ObjectVersion(bucket, key, s3_object_versions[0]).get()

# if this object is really large this loop might execute for a while
for chunk in obj_version['Body']:
    print(chunk)
Note that obj_version['Body'] is a botocore.response.StreamingBody; iterating over it yields raw byte chunks, and a chunk may not correspond to a row if your object is a csv file (etc).
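If you do want rows rather than raw chunks, a minimal sketch (assuming the object is a reasonably small utf-8 csv) is to read the whole body and hand it to the csv module:

import csv
import io

# re-fetch the version and read the full body into memory
body = s3.ObjectVersion(bucket, key, s3_object_versions[0]).get()['Body'].read()
for row in csv.reader(io.StringIO(body.decode('utf-8'))):
    print(row)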
Now, as an alternative approach, you might have an s3 client rather than an s3 resource object. From the s3 resource you can access the client via .meta.client. Here is the code:
response = s3.meta.client.get_object(Bucket=bucket, Key=key, VersionId=s3_object_versions[-1])

# same caveat as above: the loop may run for a while if the object is large
for chunk in response['Body']:
    print(chunk)
A third option, which is very similar to the first one, is to use the resource, bucket and object directly.
obj = s3.Bucket(bucket).Object(key)
object_version = obj.Version(s3_object_versions[0]).get()

# again, be careful if the object is large; this will run for a while and fill stdout
for chunk in object_version['Body']:
    print(chunk)
A fourth option, also using the s3 client, is to download the file directly.
# download_file returns None; the requested version is written to the local path
s3.meta.client.download_file(bucket, key, '/tmp/' + key, ExtraArgs={'VersionId': s3_object_versions[0]})
Then you can read the file from /tmp via an open() call, or something more specialized for the file format.
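For example, a minimal sketch for a plain text file (assuming key contains no path separators):

with open('/tmp/' + key) as f:
    for line in f:
        print(line.rstrip())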
In your code you'll likely want to wrap these with try/except etc. to handle errors unless this is prototyping/experimental work.
I have a bucket called:
bucket_1
On bucket_1 I have S3 events set up for all new objects created within that bucket. What I would like to do is set up a rule so that if the file that is dropped is prefixed with "data", it triggers a lambda function that will process the data within the file.
What I'm struggling with is how to filter for a particular file. So far the code I have in Python is this:
import json
import boto3

def handler(event, context):
    s3 = boto3.resource('s3')
    message = json.loads(event['Records'][0]['Sns']['Message'])
    print("JSON: " + json.dumps(message))
    return message
This lambda is triggered when an event is added to my SNS topic, but I just want to filter specifically for an object creation with a prefix of "data".
Has anyone done something similar to this before? For clarity, this is the workflow of the job that I would like to happen:
1. File added to bucket_1
2. Notification sent to SNS topic [triggers the Python code below]
3. Python filters for the object-created notification and a file with prefix "data*" [triggers the Python below]
4. Python fetches the data from the S3 location, cleans it up and places it into a table.
So specifically I am looking for how exactly to set up step 3.
After looking around I've found that this isn't actually done in Python, but rather when configuring events on the S3 bucket itself.
You go to this area:
s3 > choose your bucket > properties > event notifications > create notification
Once you are there, you just select all object create events, and you can even specify the prefix and suffix of the file name. Then you will only receive notifications on that topic for what you've specified.
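If you would rather set this up in code than in the console, the same filter can be applied with the boto3 client's put_bucket_notification_configuration. This is a minimal sketch; the topic ARN below is a placeholder, and note the call replaces the bucket's existing notification configuration:

import boto3

s3_client = boto3.client('s3')
s3_client.put_bucket_notification_configuration(
    Bucket='bucket_1',
    NotificationConfiguration={
        'TopicConfigurations': [
            {
                'TopicArn': 'arn:aws:sns:eu-west-1:123456789012:my-topic',  # placeholder ARN
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'data'},
                        ]
                    }
                },
            }
        ]
    },
)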
Hopefully this helps someone at some point!
I have a cloud function configured to be triggered on google.storage.object.finalize in a storage bucket. This was running well for a while. However, recently I started getting FileNotFoundError errors when trying to read the file. But if I download the file through gsutil or the console, it works fine.
Code sample:
import pandas as pd

def main(data, context):
    full_filename = data['name']
    bucket = data['bucket']
    df = pd.read_csv(f'gs://{bucket}/{full_filename}')  # intermittently raises FileNotFoundError
The errors occur most often when the file was overwritten. The bucket has object versioning enabled.
Is there something I can do?
As clarified in this other similar case here, caching can sometimes be an issue between Cloud Functions and Cloud Storage: when files get overwritten, a stale cache entry can make them impossible to find, causing the FileNotFoundError to show up.
Using invalidate_cache before reading the file can help in these situations, since it will bypass the cache for the read and avoid the error. The code for using invalidate_cache is like this:
import gcsfs
fs = gcsfs.GCSFileSystem()
fs.invalidate_cache()
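Tying that into the original function, a minimal sketch (assuming pandas and gcsfs are available in the function's environment) could look like this:

import gcsfs
import pandas as pd

def main(data, context):
    full_filename = data['name']
    bucket = data['bucket']

    # drop any stale listing cache before reading the object
    fs = gcsfs.GCSFileSystem()
    fs.invalidate_cache()

    with fs.open(f'{bucket}/{full_filename}') as f:
        df = pd.read_csv(f)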
Check in the function logs whether your function is being triggered twice on a single object finalize:
first triggered execution with event attribute 'size': '0'
second triggered execution with the actual object size in the size attribute
If your function fails on the first one, you can simply filter it out by checking the attribute value and continuing only if it is non-zero.
import pandas as pd

def main(data, context):
    object_size = data['size']
    if object_size != '0':
        full_filename = data['name']
        bucket = data['bucket']
        df = pd.read_csv(f'gs://{bucket}/{full_filename}')
I don't know what exactly causes the double triggering, but I had a similar problem once when using Cloud Storage FUSE, and this was a quick solution to the problem.
I am developing a script in Python for deleting old AMIs and their snapshots. For testing purposes, I have been trying to create and then immediately delete an AMI. My code for creating the image is the following (including the addition of tags at the end):
import boto3
from datetime import datetime, timedelta
import time

today = datetime.utcnow().strftime('%Y%m%d')
remove_on = (datetime.utcnow() + timedelta(days=3)).strftime('%Y%m%d')

session = boto3.session.Session(region_name='eu-west-1')
client = session.client('ec2')
ec2 = session.resource('ec2')

instance_info = client.describe_instances(Filters=[{'Name': 'tag:Name',
                                                     'Values': ['Airflow']}])  # This filter DOES work
instance_id = instance_info['Reservations'][0]['Instances'][0]['InstanceId']
instance = ec2.Instance(instance_id)

image = instance.create_image(InstanceId=instance_id, Name=f"Airflow_{today}")
time.sleep(2)
image.create_tags(Tags=[{'Key': 'RemoveOn', 'Value': remove_on},
                        {'Key': 'BackupOf', 'Value': 'Airflow'}])
However, when I try to get the info of the recently created AMI, I get no data:
instances_to_delete = client.describe_instances(Filters=[{'Name': 'tag:RemoveOn',
'Values':[remove_on]}])
I have tried to explicitly put a string in Values, but it does not work either. Also, even though it didn't make much sense (since I already had one filter working previously), I specified the region in the client as well (because of these answers: Boto3 ec2 describe_instances always returns empty), and it doesn't work.
The tag is there, as we can see in the following screenshot.
Your code seems to be creating an image (AMI) and then putting a tag on the AMI.
Then, you are saying that it is unable to find the instance with that tag. That makes sense, because only the image was tagged, not the instance.
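If the goal is to find the AMIs that carry the RemoveOn tag (rather than instances), a minimal sketch using describe_images with the same filter, reusing the client and remove_on variables from the question, would be:

images_to_delete = client.describe_images(
    Owners=['self'],
    Filters=[{'Name': 'tag:RemoveOn', 'Values': [remove_on]}],
)

for image in images_to_delete['Images']:
    print(image['ImageId'], image['Name'])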
I'm trying to find a boto3.resource('s3') equivalent of
s3_connection = boto3.client('s3')
s3_response = s3_connection.get_object_tagging()
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html?highlight=get_object_tagging#S3.Client.get_object_tagging
I have yet to find anything from resource that returns the same information for an object.
Unfortunately, this is still an open feature request. See this.
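In the meantime, one workaround is to reach the low-level client through the resource's .meta.client attribute; a minimal sketch (the bucket and key below are placeholders):

import boto3

s3 = boto3.resource('s3')

# the resource exposes the underlying client, which does have get_object_tagging
response = s3.meta.client.get_object_tagging(Bucket='my-bucket', Key='my-key')
print(response['TagSet'])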
I'm having difficulty figuring out a way (if possible) to create a new AWS keypair with the Python Boto library and then download that keypair.
The KeyPair object returned by the create_key_pair method in boto has a "save" method. So, basically, you can do something like this:
>>> import boto
>>> ec2 = boto.connect_ec2()
>>> key = ec2.create_key_pair('mynewkey')
>>> key.save('/path/to/keypair/dir')
If you want a more detailed example, check out https://github.com/garnaat/paws/blob/master/ec2_launch_instance.py.
Does that help? If not, provide some specifics about the problems you are encountering.
Same for Boto3:
import boto3

ec2 = boto3.resource('ec2')
keypair_name = 'my_key'
new_keypair = ec2.create_key_pair(KeyName=keypair_name)

# key_material (the private key) is only available at creation time
with open('./my_key.pem', 'w') as file:
    file.write(new_keypair.key_material)

print(new_keypair.key_fingerprint)
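For reference, a minimal sketch of the same thing with the low-level client (the key name is a placeholder):

import boto3

client = boto3.client('ec2')
response = client.create_key_pair(KeyName='my_key')  # placeholder key name

# the private key material is only returned once, at creation time
with open('./my_key.pem', 'w') as f:
    f.write(response['KeyMaterial'])

print(response['KeyFingerprint'])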