S3 boto3 list directories only - python

I have the hierarchy below in S3 and would like to retrieve only the subfolder information, excluding the files that end in .txt (basically, exclude filenames and retrieve only prefixes/folders).
--folder1/subfolder1/item1.txt
--folder1/subfolder1/item11.txt
--folder1/subfolder2/item2.txt
--folder1/subfolder2/item21.txt
--folder1/subfolder3/item3.txt
--folder1/subfolder3/subfolder31/item311.txt
Desired Output:
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31
I understand that there are no folders/subfolders in S3, only keys.
I tried the code below, but it displays everything, including filenames like item1.txt:
s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('s3-bucketname')
paginator = client.get_paginator('list_objects')
objs = list(bucket.objects.filter(Prefix='folder1/'))
for i in range(0, len(objs)):
    print(objs[i].key)
Any recommendation on how to get the output below?
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31

As you say, S3 doesn't really have a concept of folders, so to get what you want, in a sense, you need to recreate it.
One option is to list all of the objects under the prefix, construct the folder (prefix) of each object, and print each new prefix as you run across it:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucketname')
shown = set()
for obj in bucket.objects.filter(Prefix='folder1/'):
    prefix = "/".join(obj.key.split("/")[:-1])
    if len(prefix) and prefix not in shown:
        shown.add(prefix)
        print(prefix + "/")
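Another option, since your code already creates a paginator, is to let S3 group the keys for you with the Delimiter parameter: with Delimiter='/', each response lists the immediate sub-prefixes under CommonPrefixes instead of every key. A minimal sketch under that assumption (bucket and prefix names taken from the question); note it returns one level per call, so you would have to recurse into each prefix to reach folder1/subfolder3/subfolder31/:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# Delimiter='/' groups keys at the next '/' so only immediate sub-prefixes come back
for page in paginator.paginate(Bucket='s3-bucketname', Prefix='folder1/', Delimiter='/'):
    for common_prefix in page.get('CommonPrefixes', []):
        print(common_prefix['Prefix'])  # e.g. folder1/subfolder1/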

Related

Recursively copy a child directory to the parent in Google Cloud Storage

I need to recursively move the contents of a sub-folder to its parent folder in Google Cloud Storage. This code works for moving a single file from the sub-folder to the parent:
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
source_path = Path(parent_dir, sub_folder, filename).as_posix()
source_blob = bucket.blob(source_path)
dest_path = Path(parent_dir, filename).as_posix()
bucket.copy_blob(source_blob, bucket, dest_path)
But I don't know how to format the destination properly, because if my dest_path is "parent_dir", I get the following error:
google.api_core.exceptions.NotFound: 404 POST https://storage.googleapis.com/storage/v1/b/bucket/o/parent_dir%2Fsubfolder/copyTo/b/geo-storage/o/parent_dir?prettyPrint=false: No such object: geo-storage/parent_dir/subfolder
Note: This works for a recursive copy with gsutil, but I would prefer to use the blob object:
os.system(f"gsutil cp -r gs://bucket/parent_dir/subfolder/* gs://bucket/parent_dir")
GCS does not have the concept of a "directory" - just a flat namespace of objects. You can have objects named "foo/a.txt" and "foo/b.txt", but there is no actual thing representing "foo/" - it's just a prefix on the object names.
gsutil uses prefixes on the object names to pretend directories exist, but under the hood it is really just acting on all the individual objects with that prefix.
You need to do the same, with a copy for each object:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
for source_blob in client.list_blobs(BUCKET_NAME, prefix="old/prefix/"):
    dest_name = source_blob.name.replace("old/prefix/", "new/prefix/")
    # do these in parallel for more speed
    bucket.copy_blob(source_blob, bucket, new_name=dest_name)
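Since the question is about moving rather than just copying, here is a hedged extension of the same loop (the prefixes are taken from the question, and BUCKET_NAME is assumed to be defined) that deletes each source object once its copy has succeeded:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
for source_blob in client.list_blobs(BUCKET_NAME, prefix="parent_dir/subfolder/"):
    # drop the "subfolder/" component so the object lands directly under parent_dir/
    dest_name = source_blob.name.replace("parent_dir/subfolder/", "parent_dir/", 1)
    bucket.copy_blob(source_blob, bucket, new_name=dest_name)
    source_blob.delete()  # remove the original to complete the move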

get_object from S3 with full path

I used to have environment variables BUCKET_NAME and FILENAME.
This is my current code to get the file:
obj = self.s3_client.get_object(Bucket=self.bucket_name, Key=filename)
(where self.bucket_name came from BUCKET_NAME and filename came from FILENAME environment variables)
Earlier today, the "higher powers" changed the environment, so now instead of the bucket name I get BUCKET_FILE, whose value is s3://bucket_name/filename.
This breaks my code, and I need to fix it.
Can I somehow use this string to get to the object? Or do I have to parse the bucket_name and filename out of it and keep the above code?
I searched the S3 documentation, but I can't find anything other than get_object, which has Bucket (a string containing the bucket name) as a required parameter.
Yep, you need to parse this string and get the bucket name and the key. Here is the function that the AWS CLI uses to achieve this:
def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    block_unsupported_resources(s3_path)
    match = _S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    match = _S3_OUTPOST_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    s3_components = s3_path.split('/', 1)
    bucket = s3_components[0]
    s3_key = ''
    if len(s3_components) > 1:
        s3_key = s3_components[1]
    return bucket, s3_key
Reference: https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/utils.py#L217-L235
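Note that the helper expects "bucket/key" without the s3:// scheme (the AWS CLI strips it before calling this function), and the access point / outpost branches rely on AWS CLI module internals. A minimal usage sketch under those assumptions, with BUCKET_FILE taken from the question:
import os
import boto3

bucket_file = os.environ["BUCKET_FILE"]  # e.g. "s3://bucket_name/filename"
s3_path = bucket_file[len("s3://"):] if bucket_file.startswith("s3://") else bucket_file
bucket_name, key = find_bucket_key(s3_path)

s3_client = boto3.client("s3")
obj = s3_client.get_object(Bucket=bucket_name, Key=key)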

how to copy files and folders from one S3 bucket to another S3 using python boto3

I want to copy files and folders from one S3 bucket to another.
I am unable to find a solution by reading the docs; I am only able to copy files, not folders, from the S3 bucket.
Here is my code:
import boto3
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
S3 does not have any concept of folder/directories. It follows a flat structure.
For example, in the UI you might see two files named file1.txt and file2.txt inside test_folder, but the two objects actually have the keys
"test_folder/file1.txt" and "test_folder/file2.txt".
Each file is stored with this naming convention.
You can use the code snippet given below to copy each key to another bucket.
import boto3

s3_client = boto3.client('s3')
resp = s3_client.list_objects_v2(Bucket='mybucket')
keys = []
for obj in resp['Contents']:
    keys.append(obj['Key'])

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('otherbucket')
for key in keys:
    copy_source = {
        'Bucket': 'mybucket',
        'Key': key
    }
    # copy each object to the destination bucket under the same key
    bucket.copy(copy_source, key)
If your source bucket contains many keys and this is a one-time activity, then I suggest you check out this link.
If this needs to be done for every insert event on your bucket and you need to copy the new objects to another bucket, you can check out this approach.
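One caveat with the snippet above: list_objects_v2 returns at most 1,000 keys per call, so larger buckets need pagination. A hedged sketch using a paginator (bucket names reused from the question):
import boto3

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = s3_resource.Bucket('otherbucket')

paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket'):
    for obj in page.get('Contents', []):
        copy_source = {'Bucket': 'mybucket', 'Key': obj['Key']}
        # copy under the same key in the destination bucket
        dest_bucket.copy(copy_source, obj['Key'])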
S3 is flat object storage; there are no "folders" there. Keys are grouped by prefix for display purposes in the S3 browser. To copy a "folder" you have to copy all objects with a common key prefix.

Can we copy the files and folders recursively between aws s3 buckets using boto3 Python?

Is it possible to copy all the files from one source bucket to another target bucket using boto3? The source bucket doesn't have a regular folder structure.
Source bucket: SRC
Source path: A/B/C/D/E/F.., where folder D has some files and folder E has some files
Target bucket: TGT
Target path: L/M/N/
I need to copy all the files and folders from folder C in the SRC bucket above to the TGT bucket under folder N, using boto3.
Is anyone aware of an API for this, or do we need to write a new Python script to complete this task?
S3 stores objects, not folders; even '/' or '\' is just part of the object key name. You just need to manipulate the key names and copy the data over.
import boto3

old_bucket_name = 'SRC'
old_prefix = 'A/B/C/'
new_bucket_name = 'TGT'
new_prefix = 'L/M/N/'

s3 = boto3.resource('s3')
old_bucket = s3.Bucket(old_bucket_name)
new_bucket = s3.Bucket(new_bucket_name)

for obj in old_bucket.objects.filter(Prefix=old_prefix):
    old_source = {'Bucket': old_bucket_name,
                  'Key': obj.key}
    # replace the prefix
    new_key = obj.key.replace(old_prefix, new_prefix, 1)
    new_obj = new_bucket.Object(new_key)
    new_obj.copy(old_source)
An optimized way of defining new_key, suggested by zvikico:
new_key = new_prefix + obj.key[len(old_prefix):]

How to move files in Google Cloud Storage from one bucket to another bucket by Python

Is there any API function that allows us to move files in Google Cloud Storage from one bucket to another bucket?
The scenario is that we want Python to move files it has read from bucket A to bucket B. I know that gsutil can do that, but I am not sure whether Python supports it.
Thanks.
Here's a function I use when moving blobs between directories within the same bucket or to a different bucket.
from google.cloud import storage
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_your_creds.json"

def mv_blob(bucket_name, blob_name, new_bucket_name, new_blob_name):
    """
    Function for moving files between directories or buckets. It will use GCP's copy
    function, then delete the blob from the old location.

    inputs
    -----
    bucket_name: name of bucket
    blob_name: str, name of file
        ex. 'data/some_location/file_name'
    new_bucket_name: name of bucket (can be same as original if we're just moving around directories)
    new_blob_name: str, name of file in new directory in target bucket
        ex. 'data/destination/file_name'
    """
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)

    # copy to new destination
    new_blob = source_bucket.copy_blob(
        source_blob, destination_bucket, new_blob_name)
    # delete in old destination
    source_blob.delete()

    print(f'File moved from {source_blob} to {new_blob_name}')
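A usage sketch (the bucket and blob names here are placeholders, following the docstring's examples):
mv_blob(
    bucket_name="source-bucket",
    blob_name="data/some_location/file_name",
    new_bucket_name="target-bucket",
    new_blob_name="data/destination/file_name",
)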
Using the google-api-python-client, there is an example on the storage.objects.copy page. After you copy, you can delete the source with storage.objects.delete.
destination_object_resource = {}
req = client.objects().copy(
    sourceBucket=bucket1,
    sourceObject=old_object,
    destinationBucket=bucket2,
    destinationObject=new_object,
    body=destination_object_resource)
resp = req.execute()
print(json.dumps(resp, indent=2))

client.objects().delete(
    bucket=bucket1,
    object=old_object).execute()
You can use the GCS client library functions documented at [1] to read from one bucket, write to the other, and then delete the source file.
You can also use the GCS REST API documented at [2].
Links:
[1] - https://developers.google.com/appengine/docs/python/googlecloudstorageclient/functions
[2] - https://developers.google.com/storage/docs/concepts-techniques#overview
from google.cloud import storage

def GCP_BUCKET_A_TO_B():
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket("Bucket_A_Name")
    destination_bucket = storage_client.get_bucket("Bucket_B_Name")
    # copy every object in the source bucket to the destination bucket
    # (add source_blob.delete() here if you need a true move, as in mv_blob above)
    for source_blob in source_bucket.list_blobs(prefix=""):
        source_bucket.copy_blob(source_blob, destination_bucket, source_blob.name)
I just wanted to point out that there is another possible approach: using gsutil through the subprocess module.
The advantages of using gsutil like that:
You don't have to deal with individual blobs
gsutil's implementation of move, and especially rsync, will probably be much better and more resilient than what we write ourselves.
The disadvantages:
You can't deal with individual blobs easily
It's hacky and generally a library is preferable to executing shell commands
Example:
import subprocess

def move(source_uri: str,
         destination_uri: str) -> None:
    """
    Move file from source_uri to destination_uri.

    :param source_uri: gs://-like URI of the source file/directory
    :param destination_uri: gs://-like URI of the destination file/directory
    :return: None
    """
    # pass the arguments as a list so no shell is involved
    subprocess.run(["gsutil", "-m", "mv", source_uri, destination_uri], check=True)
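For example (the URIs here are placeholders):
move("gs://my-bucket/old_dir", "gs://my-bucket/new_dir")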
