get_object from S3 with full path - python

I used to have environment variables BUCKET_NAME and FILENAME.
This is my current code to get the file:
obj = self.s3_client.get_object(Bucket=self.bucket_name, Key=filename)
(where self.bucket_name came from BUCKET_NAME and filename came from FILENAME environment variables)
Earlier today, the "higher powers" changed the environment, so now instead of the bucket name I get the BUCKET_FILE, with the value s3://bucket_name/filename
This breaks my code, and I need to fix it.
Can I somehow use this string to get to the object? Or do I have to parse the bucket_name and filename out of it and keep the above code?
I searched the S3 documentation, but I can't find anything other than get_object, which requires Bucket (a string containing the bucket name) as a parameter.

Yep, you need to parse this string to get the bucket name and the key. Here is the function that the AWS CLI uses to achieve this:
def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    block_unsupported_resources(s3_path)
    match = _S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    match = _S3_OUTPOST_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    s3_components = s3_path.split('/', 1)
    bucket = s3_components[0]
    s3_key = ''
    if len(s3_components) > 1:
        s3_key = s3_components[1]
    return bucket, s3_key
Reference: https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/utils.py#L217-L235
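Note that your BUCKET_FILE value starts with the s3:// scheme, while find_bucket_key expects a plain bucket/key path, so you would strip the scheme first. A minimal sketch using urllib.parse from the standard library (bucket_file and s3_client here stand in for your environment variable and self.s3_client):

from urllib.parse import urlparse

bucket_file = "s3://bucket_name/filename"   # value of the BUCKET_FILE environment variable
parsed = urlparse(bucket_file)              # scheme='s3', netloc='bucket_name', path='/filename'
bucket_name = parsed.netloc
key = parsed.path.lstrip("/")

obj = s3_client.get_object(Bucket=bucket_name, Key=key)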

Related

How to upload a file into s3 (using Python) where the s3 bucket name has a slash '/' in the name

I have an s3 bucket (waterbucket/sample) that I am trying to write a .json file to. The s3 bucket's name, however, has a slash (/) in it, which causes an error to be returned when I try to move my file to that location (waterbucket/sample):
Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
I was successfully able to get my .json file to write to the 'waterbucket' s3 bucket. I've googled around and have tried using a prefix (among some other things), but no luck. Is there a way I can have my python script write to the 'waterbucket/sample' bucket instead of to 'waterbucket'?
Below is my python script:
import boto3
import os
import pathlib
import logging

s3 = boto3.resource("s3")
s3_client = boto3.client(service_name='s3')

bucket_name = "waterbucket/sample"
object_name = "file.json"
file_name = os.path.join(pathlib.Path(__file__).parent.resolve(), object_name)

s3.meta.client.upload_file(file_name, bucket_name, object_name)
Thanks in advance!
Bucket names can't contain slashes. Thus, in your case, sample must be part of the object's key, where it will be treated as an S3 prefix:
bucket_name = "waterbucket"
object_name = "sample/file.json"

S3 boto3 list directories only

I have the below hierarchy in S3 and would like to retrieve only subfolder-type information, excluding files that end in .txt (basically, exclude filenames and retrieve only prefixes/folders).
--folder1/subfolder1/item1.txt
--folder1/subfolder1/item11.txt
--folder1/subfolder2/item2.txt
--folder1/subfolder2/item21.txt
--folder1/subfolder3/item3.txt
--folder1/subfolder3/subfolder31/item311.txt
Desired Output:
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31
I understand that there are no folders/subfolders in S3 and that these are all just keys.
I tried the below code, but it displays everything, including filenames like item1.txt:
import boto3

s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('s3-bucketname')
paginator = client.get_paginator('list_objects')

objs = list(bucket.objects.filter(Prefix='folder1/'))
for i in range(0, len(objs)):
    print(objs[i].key)
Any recommendation on how to get the below output?
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31
As you say, S3 doesn't really have a concept of folders, so to get what you want, in a sense, you need to recreate it.
One option is to list all of the objects in the bucket, construct the folder (prefix) of each object, and act on each new prefix as you run across it:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucketname')

shown = set()
for obj in bucket.objects.filter(Prefix='folder1/'):
    prefix = "/".join(obj.key.split("/")[:-1])
    if len(prefix) and prefix not in shown:
        shown.add(prefix)
        print(prefix + "/")

Upload files in a subdirectory in S3 with boto3 in python

I want to upload files into a subdirectory in a bucket. When I upload directly into the bucket it works well, but I don't know how to add the subdirectory (a Prefix?).
import boto3
import requests

def dlImgs():
    s3 = boto3.resource("s3")
    if gmNew is not None:
        reqImg = requests.get(gmNew, stream=True)
        fileObj = reqImg.raw
        reqData = fileObj.read()
        # upload to S3
        s3.Bucket(_BUCKET_NAME_IMG).put_object(Key=ccvName, Body=reqData)

dlImgs()
But how do I add the Prefix?
EDIT: I found the solution by creating the path directly in the ccvName variable.
I had written this long ago.
import os
import boto3

def upload_file(file_name, in_sub_folder, bucket_name):
    client = boto3.client('s3')
    fname = os.path.basename(file_name)
    # the "subdirectory" is just the leading part of the object key
    key = f'{in_sub_folder}/{fname}'
    try:
        client.upload_file(file_name, Bucket=bucket_name, Key=key)
    except Exception:
        print(f'{file_name} not uploaded')
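A call might look like this (the file, folder, and bucket names are hypothetical):

upload_file('/tmp/report.json', in_sub_folder='backups/2020', bucket_name='my-bucket')
# uploads the local file to s3://my-bucket/backups/2020/report.json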

Can we copy the files and folders recursively between aws s3 buckets using boto3 Python?

Is it possible to copy all the files from one source bucket to another target bucket using boto3? The source bucket doesn't have a regular folder structure.
Source bucket: SRC
Source Path: A/B/C/D/E/F..
where folder D has some files,
and folder E has some files
Target bucket: TGT
Target path: L/M/N/
I need to copy all the files and folders under folder C in the SRC bucket to under folder N in the TGT bucket using boto3.
Is anyone aware of an API for this, or do we need to write a new python script to complete this task?
S3 stores objects; it doesn't store folders, and even '/' or '\' is just part of the object key name. You just need to manipulate the key names and copy the data over.
import boto3

old_bucket_name = 'SRC'
old_prefix = 'A/B/C/'
new_bucket_name = 'TGT'
new_prefix = 'L/M/N/'

s3 = boto3.resource('s3')
old_bucket = s3.Bucket(old_bucket_name)
new_bucket = s3.Bucket(new_bucket_name)

for obj in old_bucket.objects.filter(Prefix=old_prefix):
    old_source = {'Bucket': old_bucket_name,
                  'Key': obj.key}
    # replace the prefix
    new_key = obj.key.replace(old_prefix, new_prefix, 1)
    new_obj = new_bucket.Object(new_key)
    new_obj.copy(old_source)
An optimized way of defining new_key, suggested by zvikico:
new_key = new_prefix + obj.key[len(old_prefix):]

How to move files in Google Cloud Storage from one bucket to another bucket by Python

Is there any API function that allows us to move files in Google Cloud Storage from one bucket to another bucket?
The scenario is that we want Python to move files that have been read in bucket A to bucket B. I know that gsutil can do that, but I'm not sure whether Python supports it.
Thanks.
Here's a function I use when moving blobs between directories within the same bucket or to a different bucket.
from google.cloud import storage
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_your_creds.json"

def mv_blob(bucket_name, blob_name, new_bucket_name, new_blob_name):
    """
    Function for moving files between directories or buckets. it will use GCP's copy
    function then delete the blob from the old location.

    inputs
    -----
    bucket_name: name of bucket
    blob_name: str, name of file
        ex. 'data/some_location/file_name'
    new_bucket_name: name of bucket (can be same as original if we're just moving around directories)
    new_blob_name: str, name of file in new directory in target bucket
        ex. 'data/destination/file_name'
    """
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)

    # copy to new destination
    new_blob = source_bucket.copy_blob(
        source_blob, destination_bucket, new_blob_name)
    # delete in old destination
    source_blob.delete()

    print(f'File moved from {source_blob} to {new_blob_name}')
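A call might look like this (bucket and blob names are hypothetical):

mv_blob('bucket-a', 'data/some_location/file_name',
        'bucket-b', 'data/destination/file_name')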
Using the google-api-python-client, there is an example on the storage.objects.copy page. After you copy, you can delete the source with storage.objects.delete.
destination_object_resource = {}
req = client.objects().copy(
    sourceBucket=bucket1,
    sourceObject=old_object,
    destinationBucket=bucket2,
    destinationObject=new_object,
    body=destination_object_resource)
resp = req.execute()
print(json.dumps(resp, indent=2))

client.objects().delete(
    bucket=bucket1,
    object=old_object).execute()
You can use the GCS Client Library functions documented at [1] to read from one bucket, write to the other, and then delete the source file.
You can even use the GCS REST API documented at [2].
Link:
[1] - https://developers.google.com/appengine/docs/python/googlecloudstorageclient/functions
[2] - https://developers.google.com/storage/docs/concepts-techniques#overview
from google.cloud import storage

storage_client = storage.Client()

def GCP_BUCKET_A_TO_B():
    source_bucket = storage_client.get_bucket("Bucket_A_Name")
    filenames = [blob.name for blob in source_bucket.list_blobs(prefix="")]
    destination_bucket = storage_client.get_bucket("Bucket_B_Name")
    for i in range(0, len(filenames)):
        source_blob = source_bucket.blob(filenames[i])
        new_blob = source_bucket.copy_blob(
            source_blob, destination_bucket, filenames[i])
I just wanted to point out that there's another possible approach: using gsutil through the subprocess module.
The advantages of using gsutil like that:
You don't have to deal with individual blobs
gsutil's implementation of the move, and especially rsync, will probably be much better and more resilient than what we do ourselves.
The disadvantages:
You can't deal with individual blobs easily
It's hacky, and generally a library is preferable to executing shell commands.
Example:
import subprocess

def move(source_uri: str,
         destination_uri: str) -> None:
    """
    Move file from source_uri to destination_uri.

    :param source_uri: gs:// - like uri of the source file/directory
    :param destination_uri: gs:// - like uri of the destination file/directory
    :return: None
    """
    # pass the command as a list so subprocess.run works without shell=True
    cmd = ["gsutil", "-m", "mv", source_uri, destination_uri]
    subprocess.run(cmd, check=True)
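A usage example might look like this (the URIs are hypothetical):

move("gs://source-bucket/data/", "gs://destination-bucket/archive/")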
