Amazon S3 boto - how to delete folder? - python

I created a folder in s3 named "test" and I pushed "test_1.jpg", "test_2.jpg" into "test".
How can I use boto to delete folder "test"?

Here is the 2018 (almost 2019) version:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.objects.filter(Prefix="myprefix/").delete()

There are no folders in S3. Instead, the keys form a flat namespace. However, a key whose name contains slashes is displayed like a folder in some programs, including the AWS console (see for example Amazon S3 boto - how to create a folder?).
Instead of deleting "a directory", you can (and have to) list the objects by prefix and delete them. In essence:
for key in bucket.list(prefix='your/directory/'):
    key.delete()
However, the other answers on this page feature more efficient approaches.
Notice that the prefix is matched with a plain string comparison. If the prefix were your/directory, that is, without the trailing slash appended, the program would also happily delete your/directory-that-you-wanted-to-remove-is-definitely-not-this-one.
For more information, see S3 boto list keys sometimes returns directory key.
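If you want to guard against that pitfall, one option is to normalize the prefix before deleting. A minimal boto3 sketch (the bucket and prefix names are placeholders):
import boto3
def delete_prefix(bucket_name, prefix):
    # ensure the prefix ends with '/' so 'your/directory' cannot also
    # match 'your/directory-with-a-longer-name'
    if not prefix.endswith('/'):
        prefix += '/'
    boto3.resource('s3').Bucket(bucket_name).objects.filter(Prefix=prefix).delete()
delete_prefix('mybucket', 'your/directory')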

It has been a while and boto3 now has a few different ways of accomplishing this goal. This assumes you want to delete the test "folder" and all of its objects. Here is one way:
s3 = boto3.resource('s3')
objects_to_delete = s3.meta.client.list_objects(Bucket="MyBucket", Prefix="myfolder/test/")
delete_keys = {'Objects': [{'Key': obj['Key']} for obj in objects_to_delete.get('Contents', [])]}
s3.meta.client.delete_objects(Bucket="MyBucket", Delete=delete_keys)
This should make two requests: one to fetch the objects under the prefix, and a second to delete all of those objects.
https://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.delete_objects
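If you want to surface per-key failures, note that delete_objects reports them in its response instead of raising. A small addition to the snippet above (same bucket and delete_keys):
response = s3.meta.client.delete_objects(Bucket="MyBucket", Delete=delete_keys)
# keys that could not be deleted show up under 'Errors' with a code and message
for error in response.get('Errors', []):
    print(f"Failed to delete {error['Key']}: {error['Code']} - {error['Message']}")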

A slight improvement on Patrick's solution. As you might know, both list_objects() and delete_objects() have an object limit of 1000, which is why you have to paginate the listing and delete in chunks. This is pretty universal, and you can pass Prefix to paginator.paginate() to delete subdirectories/paths:
client = boto3.client('s3', **credentials)
paginator = client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)
delete_us = dict(Objects=[])
for item in pages.search('Contents'):
    if item is None:
        continue  # a page with no 'Contents' yields None
    delete_us['Objects'].append(dict(Key=item['Key']))
    # flush once aws limit reached
    if len(delete_us['Objects']) >= 1000:
        client.delete_objects(Bucket=bucket_name, Delete=delete_us)
        delete_us = dict(Objects=[])
# flush rest
if len(delete_us['Objects']):
    client.delete_objects(Bucket=bucket_name, Delete=delete_us)

You can use bucket.delete_keys() with a list of keys (with a large number of keys I found this to be an order of magnitude faster than using key.delete).
Something like this:
delete_key_list = []
for key in bucket.list(prefix='/your/directory/'):
    delete_key_list.append(key)
    if len(delete_key_list) > 100:
        bucket.delete_keys(delete_key_list)
        delete_key_list = []
if len(delete_key_list) > 0:
    bucket.delete_keys(delete_key_list)

If versioning is enabled on the S3 bucket:
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.object_versions.filter(Prefix="myprefix/").delete()
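The same cleanup can be done with the client API if you prefer explicit pagination. A sketch (bucket and prefix names are placeholders) that removes every version and delete marker under a prefix:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_object_versions')
for page in paginator.paginate(Bucket='mybucket', Prefix='myprefix/'):
    # both object versions and delete markers have to be removed
    targets = page.get('Versions', []) + page.get('DeleteMarkers', [])
    if not targets:
        continue
    client.delete_objects(
        Bucket='mybucket',
        Delete={'Objects': [{'Key': t['Key'], 'VersionId': t['VersionId']} for t in targets]},
    )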

If one needs to filter by object contents like I did, the following is a blueprint for your logic:
def get_s3_objects_batches(s3: S3Client, **base_kwargs):
    kwargs = dict(MaxKeys=1000, **base_kwargs)
    while True:
        response = s3.list_objects_v2(**kwargs)
        # to yield each and every file: yield from response.get('Contents', [])
        yield response.get('Contents', [])
        if not response.get('IsTruncated'):  # At the end of the list?
            break
        continuation_token = response.get('NextContinuationToken')
        kwargs['ContinuationToken'] = continuation_token
def your_filter(b):
    raise NotImplementedError()
session = boto3.session.Session(profile_name=profile_name)
s3client = session.client('s3')
for batch in get_s3_objects_batches(s3client, Bucket=bucket_name, Prefix=prefix):
    to_delete = [{'Key': obj['Key']} for obj in batch if your_filter(obj)]
    if to_delete:
        s3client.delete_objects(Bucket=bucket_name, Delete={'Objects': to_delete})
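For example, a hypothetical your_filter that only deletes objects older than a cutoff date could look like this (the cutoff value is purely illustrative):
from datetime import datetime, timezone
CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc)  # hypothetical cutoff
def your_filter(obj):
    # list_objects_v2 returns LastModified as a timezone-aware datetime
    return obj['LastModified'] < CUTOFF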

Deleting files inside a folder in S3 using boto3:
import boto3
import logging
logger = logging.getLogger(__name__)
def delete_from_minio():
    """
    This function deletes files or folders nested inside another folder.
    """
    try:
        logger.info("Deleting from minio")
        aws_access_key_id = 'Your_aws_access_key'
        aws_secret_access_key = 'Your_aws_Secret_key'
        host = 'your_aws_endpoint'
        s3 = boto3.resource('s3',
                            aws_access_key_id=aws_access_key_id,
                            aws_secret_access_key=aws_secret_access_key,
                            config=boto3.session.Config(signature_version='your_version'),
                            region_name="your_region",
                            endpoint_url=host,
                            verify=False)
        bucket = s3.Bucket('Your_bucket_name')
        for obj in bucket.objects.filter(Prefix='Directory/Sub_Directory'):
            s3.Object(bucket.name, obj.key).delete()
    except Exception as e:
        print(f"Error occurred while deleting from S3: {str(e)}")
Hope this Helps :)

Related

wandb: artifact.add_reference() option to add specific (not current) versionId or ETag to stop the need for re-upload to s3?

I feel like this should be possible, but I looked through the wandb SDK code and I can't find an easy/logical way to do it. It might be possible to hack it by modifying the manifest entries at some point later (but maybe it has to happen before the artifact is logged to wandb, since then the manifest and the entries might be locked)? I saw things like this in the SDK code:
version = manifest_entry.extra.get("versionID")
etag = manifest_entry.extra.get("etag")
So, I figure we can probably edit those?
UPDATE
So, I tried to hack it together with something like this and it works but it feels wrong:
import os
import wandb
import boto3
from wandb.util import md5_file
ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")
API_KEY = os.environ.get("WANDB_API_KEY")
api = wandb.Api(overrides={"entity": ENTITY, "project": PROJECT})
run = wandb.init(entity=ENTITY, project=PROJECT, job_type="test upload")
file = "admin2Codes.txt"  # "admin1CodesASCII.txt" # (both already on s3 with a couple versions)
artifact = wandb.Artifact("test_data", type="dataset")
# modify one of the local files so it has a new md5hash etc.
with open(file, "a") as f:
    f.write("new_line_1\n")
# upload local file to s3
local_file_path = file
s3_url = f"s3://bucket/prefix/{file}"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_dir = s3_url.rsplit("/", 1)[0] + "/"  # the s3 "directory" that contains the file
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)
head_response = s3_client.head_object(Bucket=s3_bucket, Key=key)
version_id: str = head_response["VersionId"]
print(version_id)
# upload a link/ref to this s3 object in wandb:
artifact.add_reference(s3_dir)
# at this point we might be able to modify the artifact._manifest.entries and each entry.extra.get("etag") etc.?
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
# set these to an older version on s3 that we know we want (rather than latest) - do this via wandb public API:
dataset_v2 = api.artifact(f"{ENTITY}/{PROJECT}/test_data:v2", type="dataset")
# artifact._manifest.add_entry(dataset_v2.manifest.entries["admin1CodesASCII.txt"])
artifact._manifest.entries["admin1CodesASCII.txt"] = dataset_v2.manifest.entries[
    "admin1CodesASCII.txt"
]
# verify that it did change:
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
run.log_artifact(artifact)  # at this point the manifest is locked I believe?
artifact.wait()  # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
print(artifact.name)
run_id = run.id
run.finish()
curr_run = api.run(f"{ENTITY}/{PROJECT}/{run_id}")
used_artifacts = curr_run.used_artifacts()
logged_artifacts = curr_run.logged_artifacts()
Am I on the right track here? I guess the other workaround is to make a copy on s3 (so that the older version is the latest again), but I wanted to avoid this because the one file I want to use an old version of is a large NLP model, and the only files I want to change are small config.json files etc. (so it seems very wasteful to upload all the files again).
I was also wondering whether, when I copy an old version of an object back into the same key in the bucket, that creates a real copy or just a pointer to the same underlying object. Neither the boto3 nor the AWS documentation makes that clear, although it seems like it is a proper copy.
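For reference, restoring an older version as the latest can be done server-side with copy_object, so nothing has to be re-uploaded from your machine (bucket, key and version id below are placeholders):
import boto3
s3_client = boto3.client("s3")
# copy a specific old version of the object on top of itself,
# which makes that content the new "latest" version
s3_client.copy_object(
    Bucket="bucket",
    Key="prefix/model.bin",
    CopySource={"Bucket": "bucket", "Key": "prefix/model.bin", "VersionId": "old-version-id"},
)
As far as the S3 API is concerned this creates a new, independent version rather than a pointer to the old one.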
I think I found the correct way to do it now:
import os
import re
import wandb
import boto3
from wandb.util import md5_file
ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")
def wandb_update_only_some_files_in_artifact(
    existing_artifact_name: str,
    new_s3_file_urls: list[str],
    entity: str = ENTITY,
    project: str = PROJECT,
) -> wandb.Artifact:
    """If you want to just update a config.json file for example,
    but the rest of the artifact can remain the same, then you can
    use this function like so:
    wandb_update_only_some_files_in_artifact(
        "old_artifact:v3",
        ["s3://bucket/prefix/config.json"],
    )
    and then all the other files like model.bin will be the same as in v3,
    even if there was a v4 or v5 in between (as the v3 VersionIds are used)
    Args:
        existing_artifact_name (str): name with version like "old_artifact:v3"
        new_s3_file_urls (list[str]): files that should be updated
        entity (str, optional): wandb entity. Defaults to ENTITY.
        project (str, optional): wandb project. Defaults to PROJECT.
    Returns:
        Artifact: the new artifact object
    """
    api = wandb.Api(overrides={"entity": entity, "project": project})
    old_artifact = api.artifact(existing_artifact_name)
    old_artifact_name = re.sub(r":v\d+$", "", old_artifact.name)
    with wandb.init(entity=entity, project=project) as run:
        new_artifact = wandb.Artifact(old_artifact_name, type=old_artifact.type)
        s3_file_names = [s3_url.split("/")[-1] for s3_url in new_s3_file_urls]
        # add the new ones:
        for s3_url, filename in zip(new_s3_file_urls, s3_file_names):
            new_artifact.add_reference(s3_url, filename)
        # add the old ones:
        for filename, entry in old_artifact.manifest.entries.items():
            if filename in s3_file_names:
                continue
            new_artifact.add_reference(entry, filename)
            # this also works but feels hackier:
            # new_artifact._manifest.entries[filename] = entry
        run.log_artifact(new_artifact)
        new_artifact.wait()  # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
        print(new_artifact.name)
        print(run.id)
        return new_artifact
# usage:
local_file_path = "config.json"  # modified file
s3_url = "s3://bucket/prefix/config.json"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)
wandb_update_only_some_files_in_artifact(
    "old_artifact:v3",
    ["s3://bucket/prefix/config.json"],
)

InvalidS3ObjectException when calling the AnalyzeDocument operation:

InvalidS3ObjectException when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
I keep getting this error. Over. And. Over. This program worked with my test cases of what I'm bringing in, a JSON body like {"body":"imagename.jpg"}. But the moment I try to use the actual data my JS sends, I get this error. The thing that confuses me is that I've checked the regions and they are fine. I went into my account and created users with full access to all AWS and S3 features and used those logins, I've used my root account, everything. All I'm trying to do is access an image from my S3 bucket. Why won't it work? Below is my code. It works if I use the test case I provided above, but the moment I try to use the website it's connected to, it doesn't work.
def main(event, context):
    key_map, value_map, block_map = get_kv_map(event)  # Take map variables in to get the key and value map we need.
It goes to this function...
def get_kv_map(event):
    filePath = event
    fileExt = filePath.get('body')
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket('react-images-ex')
    obj = bucket.Object(bucket)
    client = boto3.client('textract')  # We utilize boto3's textract lib
    response = client.analyze_document(Document={'S3Object': {'Bucket': 'react-images-ex', 'Name': fileExt}}, FeatureTypes=['FORMS'])
    # Get the text blocks
    blocks = response['Blocks']  # The blocks we find in the document
    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:  # Traverse the blocks found in the document
        block_id = block['Id']  # The Id found at that block location
        block_map[block_id] = block  # Make the block map at that ID be the block variable
        if block['BlockType'] == "KEY_VALUE_SET":  # For key/value set blocks, check whether it's a key; otherwise it's a value. Send it to the respective map.
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map  # Return the maps we need after they're filled.
I have been told before this code is fine, and it should work. So why exactly is it that I get this error?
Based on the comments: the issue with body was that it was a JSON string, not an actual JSON object.
The solution was to parse the string into JSON (remember to import json):
fileExt = json.loads(filePath.get('body'))
Try awscli to see if you can access the image in s3:
aws s3 ls s3://react-images-ex/<some-fileExt>
Either you are parsing the fileExt wrongly, or you don't have S3 permission to access the file. The awscli command will help to verify this.
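The same check can be done from Python rather than the CLI; a minimal sketch using head_object (the bucket name and the fileExt variable follow the question):
import boto3
from botocore.exceptions import ClientError
s3_client = boto3.client('s3', region_name='us-east-1')
try:
    # head_object fails with 404/403 when the key is wrong or access is denied,
    # the same conditions Textract reports as InvalidS3ObjectException
    s3_client.head_object(Bucket='react-images-ex', Key=fileExt)
    print("Object is reachable")
except ClientError as e:
    print(f"Cannot access object: {e}")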

boto3 put_object with new key python

I want to save a csv file ("test.csv") in S3 using boto3.
my bucket is "outputS3Bucket" and the key is "folder/newFolder".
I want to check if "newFolder" exists and if not to create it.
import boto3
client = boto3.client('s3')
s3 = boto3.resource('s3')
bucket = s3.Bucket("outputS3Bucket")
result = client.list_objects(Bucket='outputS3Bucket', Prefix="folder/newFolder")
if len(result) == 0:
    key = bucket.new_key("folder/newFolder")
    newKey = key + "/" + "test.csv"
client.put_object(Bucket="outputS3Bucket", Key=newKey, Body=content)
# put_object path: 's3://outputS3Bucket/folder/newFolder/test.csv'
I have a few problems:
If I don't write the full key name (such as "folder/ne") and there is a "neaFo" folder instead, it still says it exists.
key = bucket.new_key("folder/newFolder")
AttributeError: 's3.Bucket' object has no attribute 'new_key'
Firstly, according to the boto3 documentation, it's preferred to use the newer list_objects_v2() API method to list a bucket's objects.
I suggest using a simple boolean function to check whether a folder exists (it makes your code cleaner and more readable).
For question 1, you can check whether the prefix ends with a '/' character and append it if not - this makes sure you are looking for an exact match and not a starts-with match.
Sample Function:
def bucket_folder_exists(client, bucket, path_prefix):
    # make path_prefix an exact match and not path/to/folder*
    if not path_prefix.endswith('/'):
        path_prefix += '/'
    # if the 'Contents' key exists in the response dict the folder exists; otherwise get() returns None
    response = client.list_objects_v2(Bucket=bucket, Prefix=path_prefix).get('Contents')
    if response:
        return True
    return False
Sample Implementation:
if bucket_folder_exists(client, 'outputS3Bucket', 'folder/newFolder'):
    pass  # Do something if folder already exists
else:
    pass  # Do something if folder does not exist
Regarding your second question: the error tells you that bucket is a boto3 s3.Bucket object, which has no new_key attribute - new_key is a method from the older boto 2 library and does not exist in boto3.
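In boto3 there is also no need to create the "folder" at all, since folders are just key prefixes. A minimal sketch (bucket and key names follow the question; content is assumed to hold the CSV data):
import boto3
client = boto3.client('s3')
# "Folders" are just key prefixes; writing this key makes
# folder/newFolder/ appear in the console automatically.
client.put_object(
    Bucket="outputS3Bucket",
    Key="folder/newFolder/test.csv",
    Body=content,  # content is assumed to hold the CSV data
)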

How do I get the size of a boto3 Collection?

The way I have been doing it is to transform the Collection into a list and query the length:
s3 = boto3.resource('s3')
bucket = s3.Bucket('my_bucket')
size = len(list(bucket.objects.all()))
However, this forces resolution of the whole collection and obviates the benefits of using a Collection in the first place. Is there a better way to do this?
There is no way to get the count of keys in a bucket without listing all the objects; this is a limitation of AWS S3 (see https://forums.aws.amazon.com/thread.jspa?messageID=164220).
Getting the Object Summaries doesn't fetch the actual data, so it should be a relatively inexpensive operation, and if you are just discarding the list you could do:
size = sum(1 for _ in bucket.objects.all())
Which will give you the number of objects without constructing a list.
Borrowing from a similar question, one option to retrieve the complete list of object keys from a bucket + prefix is to use recursion with the list_objects_v2 method.
This method will recursively retrieve the list of object keys, 1000 keys at a time.
Each request to list_objects_v2 uses the StartAfter argument to continue listing keys after the last key from the previous request.
import boto3
if __name__ == '__main__':
    client = boto3.client('s3',
                          aws_access_key_id='access_key',
                          aws_secret_access_key='secret_key')
    def get_all_object_keys(bucket, prefix, start_after='', keys=None):
        # avoid a mutable default argument so repeated calls start with a fresh list
        if keys is None:
            keys = []
        response = client.list_objects_v2(
            Bucket=bucket,
            Prefix=prefix,
            StartAfter=start_after
        )
        if 'Contents' not in response:
            return keys
        key_list = response['Contents']
        last_key = key_list[-1]['Key']
        keys.extend(key_list)
        return get_all_object_keys(bucket, prefix, last_key, keys)
    object_keys = get_all_object_keys('your_bucket', 'prefix/to/files')
    print(len(object_keys))
For my use case, I just needed to know whether the folder is empty or not.
s3 = boto3.client('s3')
response = s3.list_objects(
    Bucket='your-bucket',
    Prefix='path/to/your/folder/',
)
print(len(response.get('Contents', [])))
This was enough to know whether the folder is empty. Note that a folder, if created manually in the S3 console, can count as a resource itself; in that case, if the length shown above is greater than 1, the S3 "folder" is not empty.
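If the bucket or prefix holds more than 1000 objects, a paginator avoids both truncation and building the whole list in memory. A minimal counting sketch (bucket and prefix names are placeholders):
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
count = 0
for page in paginator.paginate(Bucket='my_bucket', Prefix='prefix/to/files'):
    # KeyCount is the number of keys returned in this page
    count += page.get('KeyCount', 0)
print(count)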

Unable to set file content type in S3

How do you set content type on a file in a webhosting-enabled S3 account via the Python boto module?
I'm doing:
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from boto.cloudfront import CloudFrontConnection
conn = S3Connection(access_key_id, secret_access_key)
bucket = conn.create_bucket('mybucket')
b = conn.get_bucket(bucket)
b.set_acl('public-read')
fn = 'index.html'
template = '<html>blah</html>'
k = Key(b)
k.key = fn
k.set_contents_from_string(template)
k.set_acl('public-read')
k.set_metadata('Content-Type', 'text/html')
However, when I access it from http://mybucket.s3-website-us-east-1.amazonaws.com/index.html my browser prompts me to download the file instead of simply serving it as a webpage.
Looking at the metadata in the S3 Management console shows the Content-Type has actually been set to "application/octet-stream". If I manually change it in the console, I can access the page normally, but if I run my script again, it resets it back to the wrong content type.
What am I doing wrong?
The set_metadata method is really for setting user metadata on S3 objects. Many of the standard HTTP metadata fields have first class attributes to represent them, e.g. content_type. Also, you want to set the metadata before you actually send the object to S3. Something like this should work:
import boto
conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket') # Assumes bucket already exists
key = bucket.new_key('mykey')
key.content_type = 'text/html'
key.set_contents_from_string(mystring, policy='public-read')
Note that you can set canned ACL policies at the time you write the object to S3 which saves having to make another API call.
For people who need a one-liner for this:
import boto3
s3 = boto3.resource('s3')
s3.Bucket('bucketName').put_object(Key='keyName', Body='content or fileData', ContentType='contentType', ACL='check below')
Supported ACL values:
'private'|'public-read'|'public-read-write'|'authenticated-read'|'aws-exec-read'|'bucket-owner-read'|'bucket-owner-full-control'
Arguments supported by put_object can be found here, https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
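To confirm what was actually stored, you can read the content type back without downloading the object. A quick check, assuming the same bucket and key names as above:
import boto3
client = boto3.client('s3')
head = client.head_object(Bucket='bucketName', Key='keyName')
# head_object returns the stored metadata, including the Content-Type
print(head['ContentType'])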
I wasn't able to get the above solution to actually persist my metadata changes.
Perhaps because I was using a file and it was resetting the content type using mimetype? Also I am uploading m3u8 and ts files for HLS encoding so that could interfere as well.
Anyway, here's what worked for me.
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')
key_m3u8 = Key(bucket)
key_m3u8.key = s3folder + "/" + s3keyname
key_m3u8.metadata = {"Content-Type": "application/x-mpegURL", "Cache-Control": "public,max-age=8"}
key_m3u8.set_contents_from_filename("path_to_my_file", policy="public-read")
If you use the AWS S3 Bitbucket Pipelines Python script, add a content_type parameter:
s3_upload.py
def upload_to_s3(bucket, artefact, bucket_key, content_type):
    ...
def main():
    ...
    parser.add_argument("content_type", help="Content Type File")
    ...
    if not upload_to_s3(args.bucket, args.artefact, args.bucket_key, args.content_type):
then modify bitbucket-pipelines.yml as follows:
...
- python s3_upload.py bucket_name file key content_type
...
where the content_type parameter can be any of the MIME types (IANA media types).
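With boto3, the equivalent for file uploads is the ExtraArgs parameter of upload_file. A minimal sketch (file, bucket and key names are placeholders):
import boto3
s3_client = boto3.client('s3')
s3_client.upload_file(
    'index.html',
    'mybucket',
    'index.html',
    # ContentType here becomes the object's Content-Type; ACL sets a canned policy
    ExtraArgs={'ContentType': 'text/html', 'ACL': 'public-read'},
)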
