How can I automatically delete AWS S3 files using Python? - python

I want to delete some files from S3 after a certain time. I need to set a time limit for each object, not for the bucket. Is that possible?
I am using boto3 to upload the file into S3.
import os
import boto3
from boto3.s3.transfer import S3Transfer

region = "us-east-2"
bucket = os.environ["S3_BUCKET_NAME"]
credentials = {
    'aws_access_key_id': os.environ["AWS_ACCESS_KEY"],
    'aws_secret_access_key': os.environ["AWS_ACCESS_SECRET_KEY"]
}
client = boto3.client('s3', **credentials)
transfer = S3Transfer(client)
transfer.upload_file(file_name, bucket, folder + file_name,
                     extra_args={'ACL': 'public-read'})
Above is the code I used to upload the object.

You have many options here. Some ideas:
You can automatically delete files after a given time period by using Amazon S3 Object Lifecycle Management. See: How Do I Create a Lifecycle Policy for an S3 Bucket?
If your requirements are more detailed (eg different files after different time periods), you could add a Tag to each object specifying when you'd like the object deleted, or after how many days it should be deleted. Then you could define an Amazon CloudWatch Events rule to trigger an AWS Lambda function at regular intervals (eg once a day or once an hour). The Lambda function would look at the tags on the objects, determine whether they should be deleted, and delete the desired objects (see the sketch after this list). You will find examples of this on the Internet, often called a Stopinator.
If you have an Amazon EC2 instance that is running all the time for other work, then you could simply create a cron job or Scheduled Task to run a similar program (without using AWS Lambda).
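As a rough illustration of the tag-based approach, here is a minimal sketch of the kind of script you could run from Lambda or a cron job. It assumes each object carries a hypothetical delete-after tag holding a timezone-aware ISO-8601 timestamp set at upload time; the bucket name is a placeholder.
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')
BUCKET = 'my-bucket-name'  # placeholder

def delete_expired_objects():
    # Walk every object and delete the ones whose 'delete-after' tag
    # (eg '2024-01-01T00:00:00+00:00') is in the past.
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj['Key'])['TagSet']
            expiry = {t['Key']: t['Value'] for t in tags}.get('delete-after')
            if expiry and datetime.fromisoformat(expiry) <= datetime.now(timezone.utc):
                s3.delete_object(Bucket=BUCKET, Key=obj['Key'])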

Related

How can I transfer objects in buckets between two AWS accounts with Python?

I need to transfer all objects of all buckets, preserving the same folder and bucket structure, from one AWS account to another.
I've been doing it with this code through the AWS CLI, one bucket at a time:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --no-verify-ssl
Can I do it with python for all objects of all buckets?
The AWS CLI actually is a Python program. It uses multi-threading to copy multiple objects simultaneously, so it will likely be much more efficient than an equivalent Python program you write yourself.
You can tweak some settings that might help: AWS CLI S3 Configuration — AWS CLI Command Reference
There is no option to copy "all buckets" -- you would still need to sync/copy one bucket at a time.
Another approach would be to use S3 Bucket Replication, where AWS will replicate the buckets for you. This now works on existing objects. See: Replicating existing objects between S3 buckets | AWS Storage Blog
Or, you could use S3 Batch Operations, which can take a manifest (a listing of objects) as input and then copy those objects to a desired destination. See: Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service
aws s3 sync is high-level functionality that is not available in AWS SDKs such as boto3. You have to implement it yourself on top of boto3, or search through the many available code snippets that already implement it, such as python - Sync two buckets through boto3 - Stack Overflow.
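To give an idea of the boto3 side, here is a minimal sketch that copies every object from one bucket to another while preserving keys (no comparison or deletion, so it is not a full sync). The bucket names are placeholders, and for a cross-account copy the credentials in use need read access on the source and write access on the destination.
import boto3

s3 = boto3.resource('s3')
source = s3.Bucket('SOURCE-BUCKET-NAME')      # placeholder
destination_name = 'DESTINATION-BUCKET-NAME'  # placeholder

for obj in source.objects.all():
    # copy() performs a managed, server-side copy of each object
    s3.meta.client.copy({'Bucket': source.name, 'Key': obj.key},
                        destination_name, obj.key)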

How to delete thousands of objects from an S3 bucket within a specific object folder?

I have thousands of objects in all the folders gocc1, gocc2, etc.
s3://awss3runner/gocc1/gocc2/goccf/
I just want to delete the objects (50,000+) from goccf and their versions.
import boto3

session = boto3.Session()
s3 = session.resource(service_name='s3')
# bucket = s3.Bucket('awss3runner', 'goccf')  # using this raises an error
bucket = s3.Bucket('awss3runner')  # works, but deletes everything in the bucket
bucket.object_versions.delete()
Is there any way to delete the goccf objects and their versions?
You can use the DeleteObjects API in S3 (https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html)
I would first perform a list operation to enumerate all the objects you wish to delete, then pass that into DeleteObjects. Be very careful as you could accidentally delete other objects in your bucket.
Another option is to use an S3 lifecycle policy, if this is going to be a one-off operation. With a lifecycle policy you can specify a path in your S3 bucket and set the objects to expire. They will be asynchronously removed from your S3 bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html
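For the prefix in the question, a minimal sketch using boto3's version-aware collection might look like the following. It lists and deletes only the keys under the goccf/ prefix, including every version and delete marker, batching the DeleteObjects calls for you. Double-check the prefix before running, since deleting versions is not recoverable.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('awss3runner')

# Only keys under the gocc1/gocc2/goccf/ prefix are listed and deleted,
# including all of their versions and delete markers.
bucket.object_versions.filter(Prefix='gocc1/gocc2/goccf/').delete()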

Checking my S3 bucket permissions in bulk via a text word list or CSV

So I'm auditing S3 buckets for work. I can do manual checks for bucket permissions via the AWS CLI:
aws s3 ls s3://bucketname
or via Python using boto3:
# list 1st 1000 objects (which I like if it is readable so I can
# assess risk quickly on larger buckets)
from boto3 import client

bucket_name = input("enter the bucket name")
connection = client('s3')
for key in connection.list_objects_v2(Bucket=bucket_name)['Contents']:
    print(key['Key'])
My question is: how can I use Python to read a list of my buckets (100+) and check whether they are publicly readable or not? I should mention that when I audit, we use a dummy AWS account as one of the ways we check permissions.
I was thinking something like
bucketlist = open('buckets.txt', 'r')
Now I'm comfortable getting that list to print out, but I have nothing for running the function against each line. How can I use either the AWS CLI method or the Python boto3 method to check bucket permissions?
You can open the text file with a context manager, load the lines into a list, and then loop over all bucket names. Then define a custom function to check permissions - this will be situation-dependent, but I have included some suggestions and boto3 documentation links below.
You can use the code you have above with just a slight modification. (Note, this assumes you have one bucket name per line of your file.)
from boto3 import client

def check_bucket_permissions(client, bucket):
    """Function to check bucket permissions.
    What you do here depends on what you want to check.
    Examples given below."""
    pass

# get a list of bucket names (strings), one per line in the file
with open('buckets.txt', 'r') as f:
    bucket_names = [line.strip() for line in f if line.strip()]

# check each bucket
connection = client('s3')
for bucket_name in bucket_names:
    check_bucket_permissions(connection, bucket_name)
For the check_bucket_permissions() function, you have a few options, depending on what info you are looking for. Two ways of checking who can access buckets are (1) checking policies and (2) checking access permissions/access control lists.
This AWS blog post gives some background on IAM policies, bucket policies, and access control lists for buckets, but the short version is:
A bucket policy is an IAM policy document that is attached to the bucket itself, and sets rules for what type of principals (users/groups/roles) can access the bucket's resources, and what type of actions they are allowed to take (read, write, etc.)
An IAM policy document can also be attached to a principal (user/group/role), and that policy document can set rules for what types of resources (like buckets) can be accessed and what type of actions they are allowed to take
Bucket ACLs (access control lists) are an older way of specifying bucket permissions (AWS recommends using the IAM approaches above). An ACL is attached to a bucket and defines principals (users/groups/roles) that can access the bucket and what type of access they have.
If you're using the classic IAM policy approach, you can check what IAM policy documents apply to a bucket using the get_bucket_policy() method:
def check_bucket_permissions(client, bucket):
    policy_doc = client.get_bucket_policy(Bucket=bucket)
    # parse policy_doc, print or record the result
    return
If there are no bucket policies, this will error out with botocore.exceptions.ClientError: An error occurred (NoSuchBucketPolicy) when calling the GetBucketPolicy operation: The bucket policy does not exist.
If you want to check bucket ACLs, you can use the get_bucket_acl() method instead of the get_bucket_policy() method:
def check_bucket_permissions(client, bucket):
    acl = client.get_bucket_acl(Bucket=bucket)
    # parse ACL, print or record the result
    return
Refer to the documentation for sample JSON responses from each of these methods to figure out how you want to parse the results that are returned.
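As one possible (ACL-only) way to fill in check_bucket_permissions(), the sketch below flags buckets whose ACL grants READ to the AllUsers group. It does not look at bucket policies or account-level public access blocks, so treat it as a starting point rather than a complete audit.
def check_bucket_permissions(client, bucket):
    """Flag buckets whose ACL grants READ (or FULL_CONTROL) to AllUsers."""
    acl = client.get_bucket_acl(Bucket=bucket)
    all_users = 'http://acs.amazonaws.com/groups/global/AllUsers'
    public = any(
        grant.get('Grantee', {}).get('URI') == all_users
        and grant.get('Permission') in ('READ', 'FULL_CONTROL')
        for grant in acl['Grants']
    )
    print(f"{bucket}: {'publicly readable' if public else 'not public'} (by ACL)")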

S3 buckets to Glacier on demand: is it possible from the boto3 API?

I'm testing a script to recover data stored in an S3 bucket, with a lifecycle rule that moves the data to Glacier each day. So, in theory, when I upload a file to the S3 bucket, after a day the Amazon infrastructure should move it to Glacier.
But I want to test a script that I'm developing in Python to exercise the restore process. If I understand the boto3 API correctly, there is no method to force a file stored in the S3 bucket to move immediately to the Glacier storage class. Is it possible to do that, or is it necessary to wait until the Amazon infrastructure fires the lifecycle rule?
I would like to use some code like this:
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Bucket=TARGET_BUCKET, Prefix=TARGET_KEYS + KEY_SEPARATOR):
    obj.move_to_glacier()
But I can't find any API call that makes this move to Glacier on demand. Also, I don't know if I can force this on demand using a bucket lifecycle rule.
Update:
S3 has changed the PUT Object API, effective 2018-11-26. This was not previously possible, but you can now write objects directly to the S3 Glacier storage class.
One of the things we hear from customers about using S3 Glacier is that they prefer to use the most common S3 APIs to operate directly on S3 Glacier objects. Today we’re announcing the availability of S3 PUT to Glacier, which enables you to use the standard S3 “PUT” API and select any storage class, including S3 Glacier, to store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition to S3 Glacier with a zero-day lifecycle policy.
https://aws.amazon.com/blogs/architecture/amazon-s3-amazon-s3-glacier-launch-announcements-for-archival-workloads/
The service now accepts the following values for x-amz-storage-class:
STANDARD
STANDARD_IA
ONEZONE_IA
INTELLIGENT_TIERING
GLACIER
REDUCED_REDUNDANCY
PUT+Copy (which is always used, typically followed by DELETE, for operations that change metadata or rename objects) also supports the new functionality.
Note that to whatever extent your SDK "screens" these values locally, taking advantage of this functionality may require you to upgrade to a more current version of the SDK.
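For illustration, a short boto3 sketch of the capability described above (bucket and key names are placeholders): a new object can be written straight to the GLACIER storage class, and an existing object can be transitioned by copying it over itself with a new storage class.
import boto3

s3 = boto3.client('s3')

# Write a new object directly to the Glacier storage class
s3.put_object(Bucket='my-bucket', Key='archive/data.bin',
              Body=b'example payload', StorageClass='GLACIER')

# Transition an existing object by copying it over itself
# with a different storage class
s3.copy_object(Bucket='my-bucket', Key='existing-key',
               CopySource={'Bucket': 'my-bucket', 'Key': 'existing-key'},
               StorageClass='GLACIER')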
This isn't possible. The only way to migrate an S3 object to the GLACIER storage class is through lifecycle policies.
x-amz-storage-class
Constraints: You cannot specify GLACIER as the storage class. To transition objects to the GLACIER storage class, you can use lifecycle configuration.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
The REST API is the interface used by all of the SDKs, the console, and aws-cli.
Note... test with small objects, but don't archive small objects to Glacier in production. S3 will bill you for 90 days minimum of Glacier storage even if you delete the object before 90 days. (This charge is documented.)
It is possible to upload files from S3 to Glacier using the upload_archive() method of the Glacier client.
Update: this is not the same as S3 object lifecycle management, but a direct upload to a Glacier vault.
import boto3

glacier_client = boto3.client('glacier')
s3 = boto3.resource('s3')
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Prefix=TARGET_KEYS + KEY_SEPARATOR):
    response = glacier_client.upload_archive(vaultName='TARGET_VAULT', body=obj.get()['Body'].read())
    print(obj.key, response['archiveId'])
.filter() does not accept the Bucket keyword argument.

Local access to Amazon S3 Bucket from EC2 instance

I have an EC2 instance and an S3 bucket in the same region. The bucket contains reasonably large (5-20 MB) files that are used regularly by my EC2 instance.
I want to programmatically open the file on my EC2 instance (using Python). Like so:
file_from_s3 = open('http://s3.amazonaws.com/my-bucket-name/my-file-name')
But using a "http" URL to access the file remotely seems grossly inefficient, surely this would mean downloading the file to the server every time I want to use it.
What I want to know is, is there a way I can access S3 files locally from my EC2 instance, for example:
file_from_s3 = open('s3://my-bucket-name/my-file-name')
I can't find a solution myself, any help would be appreciated, thank you.
Whatever you do, the object will be downloaded behind the scenes from S3 into your EC2 instance. That cannot be avoided.
If you want to treat files in the bucket as local files, you need to install one of the several S3 filesystem plugins for FUSE (for example, s3fs-fuse). Alternatively, you can use boto3 for easy access to S3 objects from Python code.
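As a minimal boto3 sketch, assuming the instance has an IAM role (or other credentials) with read access to the bucket, and using the placeholder bucket and key names from the question:
import boto3

s3 = boto3.client('s3')

# Read the object straight into memory...
body = s3.get_object(Bucket='my-bucket-name', Key='my-file-name')['Body'].read()

# ...or download it to disk and open it like a local file
s3.download_file('my-bucket-name', 'my-file-name', '/tmp/my-file-name')
with open('/tmp/my-file-name', 'rb') as f:
    data = f.read()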
