I have a current process that reads a data source directory from a YAML file:
import yaml

with open(r'<yaml file>') as file:
    directory = yaml.load(file, Loader=yaml.FullLoader)

source_directory = directory['source_directory']
The yaml file reads as follows:
source_directory : '<directory>'
However, the data source has now shifted from a local directory to an S3 Bucket. I am able to view the files in my S3 bucket by using the code below:
import boto3

client = boto3.client('s3')  # S3 client (setup not shown in the original snippet)

def ListFiles(client):
    response = client.list_objects(Bucket='<bucket name>')
    for content in response.get('Contents', []):
        yield content.get('Key')

file_list = ListFiles(client)
for file in file_list:
    print(file)
The boto3 code correctly lists my files, so I know my connection to the bucket is successful. How do I reference the directory path within the S3 bucket in the source_directory variable in the YAML file?
Update based on a comment I got:
Someone suggested using s3://<bucket_name>/object_path in place of the directory called out in the YAML file. However, this produces a "No such file or directory" error.
If you want to obtain a list of objects in a given directory of an Amazon S3 bucket, you can use:
response = client.list_objects_v2(Bucket='<bucket name>', Prefix=source_directory)
The Prefix should end with a slash, so the YAML file should look like:
source_directory : 'my-directory/'
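For reference, here is a minimal sketch that ties the YAML file and the listing call together; the bucket_name key is a hypothetical addition for illustration and is not part of the original YAML:

import boto3
import yaml

# Assumed YAML layout (bucket_name is a hypothetical extra key):
#   bucket_name : 'my-bucket'
#   source_directory : 'my-directory/'
with open(r'<yaml file>') as file:
    config = yaml.load(file, Loader=yaml.FullLoader)

client = boto3.client('s3')
response = client.list_objects_v2(
    Bucket=config['bucket_name'],
    Prefix=config['source_directory'],  # e.g. 'my-directory/'
)
for obj in response.get('Contents', []):
    print(obj['Key'])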
I have an S3 bucket (waterbucket/sample) that I am trying to write a .json file to. The S3 bucket's name, however, has a slash (/) in it, which causes an error to be returned when I try to move my file to that location (waterbucket/sample):
Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
I was able to get my .json file to write to the 'waterbucket' S3 bucket successfully. I've googled around and have tried using a prefix (among some other things), but no luck. Is there a way I can have my Python script write to the 'waterbucket/sample' bucket instead of to 'waterbucket'?
Below is my python script:
import boto3
import os
import pathlib
import logging

s3 = boto3.resource("s3")
s3_client = boto3.client(service_name='s3')

bucket_name = "waterbucket/sample"
object_name = "file.json"
file_name = os.path.join(pathlib.Path(__file__).parent.resolve(), object_name)

s3.meta.client.upload_file(file_name, bucket_name, object_name)
Thanks in advance!
Bucket names can't contain slashes. Thus, in your case, sample must be part of the object's name, since it will be treated as an S3 prefix:
bucket_name = "waterbucket"
object_name = "sample/file.json"
I have been having some issues with my AWS Lambda that unzips a file inside our S3 bucket. I created a script that is triggered by a JSON payload whose values get passed through to it. The problem is that it seems to be losing the parent folder of the zip file and uploading only the child folders underneath it. This is an issue for me because we also have another script I made that parses a log4j file inside a folder to review it for errors, and that script is failing because the lost name is what identifies the farm the folder comes from.
To give an example of the issue ---
There's an S3 bucket in us-east, and inside that bucket is a key for "OriginalFolder.zip". When this Lambda is activated, it unzips the archive and places the child files into the exact same bucket and location as the original zip file, but names the result "Log.folder". I want it to keep the original name of the zip file so that when multiple farms activate this Lambda it doesn't overwrite the folder that's created, or confuse the second Lambda about which one to read from.
I tried appending something to the end of the created file name so that params could be passed through for each farm that runs it, but I can't seem to make it work. I also considered adding a separate step in the script that copies and renames the file using boto3, but I would rather not make that my first choice. I feel there has to be an easier method, but I might be overlooking it.
Any thoughts would be helpful.
Edit: Here's a picture of the example. The green arrow is what I want the folder to stay named as. The red arrow is what it ends up being named inside our S3 environment. "on1" is the next folder inside "update-dc-logs-test".
import os
import tempfile
import zipfile
from concurrent import futures
from io import BytesIO

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Parse and prepare required items from event
    global bucket, path, zipdata, rn_file
    action = event.get("action", None)
    if action == "create" or action == "update":
        bucket = event['payload']['BucketName']
        key = event['payload']['Key']
        #rn_file = event['payload']['RenameFile']
        path = os.path.dirname(key)

        # Create temporary file
        temp_file = tempfile.mktemp()

        # Fetch and load target file
        s3.download_file(bucket, key, temp_file)
        zipdata = zipfile.ZipFile(temp_file)

        # Call the extract method for each entry using a ThreadPool
        with futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_list = [
                executor.submit(extract, filename)
                for filename in zipdata.namelist()
            ]

        result = {'success': [], 'fail': []}
        for future in future_list:
            filename, status = future.result()
            result[status].append(filename)

        return result

def extract(filename):
    # Extract the zip entry and place it back in the bucket
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            BytesIO(zipdata.read(filename)),
            bucket,
            os.path.join(path, filename)
        )
    except Exception:
        upload_status = 'fail'
    finally:
        return filename, upload_status
You are prefixing all uploaded files with path which is the path at which the ZIP file is found. If you want the uploaded files to be stored below a prefix which is the path and name of the ZIP file (minus the .zip extension), then change the value of path to this:
path = os.path.splitext(key)[0]
Now, instead of path holding the ZIP file's folder prefix it will contain the folder prefix plus the first part of the ZIP filename. For example, if an object is uploaded to folder1/myarchive.zip then path would previously contain folder1, but with this change it will now contain folder1/myarchive.
When that new path is combined in the extract function via os.path.join(path, filename), the object will now be uploaded to folder1/myarchive/on1/file.txt.
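For illustration, a quick check of what that one-line change does, using the same example key as above:

import os

key = 'folder1/myarchive.zip'

old_path = os.path.dirname(key)        # 'folder1'
new_path = os.path.splitext(key)[0]    # 'folder1/myarchive'

# In extract(), each entry then lands under the ZIP's own name:
print(os.path.join(new_path, 'on1/file.txt'))
# folder1/myarchive/on1/file.txt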
I have an FPDF object like this:
from fpdf import FPDF

pdf = FPDF()

# Cover page
pdf.add_page()
pdf.set_font('Arial', 'B', 16)
pdf.cell(175, 10, 'TEST - report', 0, 1, 'C')
I'm trying to use a .png file from an AWS S3 bucket directly to generate a PDF, like so:
pdf.image(bucket_folder_name + '/' + file_name + '.png')
Unfortunately, I get the error:
[Errno 2] No such file or directory
What is the issue?
To use files from an S3 bucket, you need to download them first; an S3 bucket is not like a local folder where you can simply provide a path and use the file.
Download the file first using download_file and then pass the filename to pdf.image(...).
The download_file method has the following parameters:
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
Something like this:
import boto3

s3 = boto3.client('s3')

...

filename_with_extension = file_name + '.png'
outputPath = '/tmp/' + filename_with_extension

s3.download_file(bucket_folder_name, filename_with_extension, outputPath)
pdf.image(outputPath)
Important note: you must write to the /tmp directory, as that is the only part of the Lambda file system you are permitted to write to.
Writing to any other path will result in an [Errno 13] Permission denied error.
Per docs:
You can configure each Lambda function with its own ephemeral storage between 512MB and 10,240MB, in 1MB increments. The ephemeral storage is available in each function’s /tmp directory.
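As a side note, here is a hedged sketch of raising that limit with boto3; the function name is a placeholder, and this assumes a boto3 version recent enough to expose the EphemeralStorage setting on update_function_configuration:

import boto3

lambda_client = boto3.client('lambda')

# Raise /tmp from the default 512 MB (configurable in 1 MB increments up to 10,240 MB)
lambda_client.update_function_configuration(
    FunctionName='my-pdf-function',   # placeholder function name
    EphemeralStorage={'Size': 2048},
)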
The following code works fine, except that if a subfolder has no files inside it, that subfolder will not appear in S3. For example,
if /home/temp/subfolder has no files, then subfolder will not show up in S3. How can I change the code so that empty folders are also uploaded to S3?
I tried to write something (see the note below), but I do not know how to call put_object() for the empty subfolder.
#!/usr/bin/env python
import os
from boto3.session import Session

path = "/home/temp"

session = Session(aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3 = session.resource('s3')

for subdir, dirs, files in os.walk(path):
    # note: if not files ......
    for file in files:
        full_path = os.path.join(subdir, file)
        with open(full_path, 'rb') as data:
            s3.Bucket('my_bucket').put_object(Key=full_path[len(path)+1:],
                                              Body=data)
Besides that, I tried calling the function below to check whether a subfolder or file exists. It works for files, but not for subfolders. How do I check whether a subfolder exists? (If a subfolder already exists, I will not upload it.)
import botocore.exceptions

def check_exist(s3, bucket, key):
    try:
        s3.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        return False
    return True
By the way, I adapted the above code from
check if a key exists in a bucket in s3 using boto3
and
http://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
Thanks to the authors for sharing the code.
Directories (folders, subfolders, etc.) do not exist in S3.
When you copy a file such as /mydir/myfile.txt to an empty S3 bucket, only the object mydir/myfile.txt is created. The directory mydir is not created as a separate entity; that string is simply part of the object key mydir/myfile.txt. The actual object name is the full path, and no subdirectories exist or are created.
S3 simulates directories by using a prefix when listing files in the bucket. If you specify the prefix mydir/, then all of the S3 objects whose keys start with mydir/ will be returned, including objects such as mydir/anotherfolder/myotherfile.txt. S3 also supports a delimiter such as / so that the appearance of subdirectories can be created.
Note: There is no / at the beginning of a file name for S3 objects.
Listing Keys Hierarchically Using a Prefix and Delimiter
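A small sketch of what this looks like with boto3's list_objects_v2, using placeholder bucket and prefix names:

import boto3

client = boto3.client('s3')

# List the immediate "subdirectories" and files under mydir/
response = client.list_objects_v2(
    Bucket='my_bucket',
    Prefix='mydir/',
    Delimiter='/',
)

# Simulated subdirectories come back as CommonPrefixes...
for prefix in response.get('CommonPrefixes', []):
    print('folder:', prefix['Prefix'])

# ...and objects directly under the prefix come back as Contents
for obj in response.get('Contents', []):
    print('object:', obj['Key'])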
I've got a python script that gets a list of files that have been uploaded to a google cloud storage bucket, and attempts to retrieve the data as a string.
The code is simply:
file = open(base_dir + "/" + path, 'wb')
data = Blob(path, bucket).download_as_string()
file.write(data)
My issue is that the data I've uploaded is stored inside folders in the bucket, so the path would be something like:
folder/innerfolder/file.jpg
When the google library attempts to download the file, it gets it in the form of a GET request, which turns the above path into:
https://www.googleapis.com/storage/v1/b/bucket/o/folder%2Finnerfolder%2Ffile.jpg
Is there any way to stop this from happening, or to download the file anyway? Cheers.
Yes - you can do this with the python storage client library.
Just install it with pip install --upgrade google-cloud-storage and then use the following code:
from google.cloud import storage
# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a destination
blob.download_to_filename(destination_file_name)
You can also use .download_as_string(), but since you're writing the data to a file anyway, downloading straight to the file may be easier.
The only slightly awkward thing to be aware of is that the filepath is the path after the bucket name, so it doesn't line up exactly with the path shown in the web interface.
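Applied to the path from the question, a minimal sketch (the project and bucket names are placeholders) might look like this:

from google.cloud import storage

storage_client = storage.Client("[Your project name here]")
bucket = storage_client.get_bucket("your-bucket-name")

# The blob name is the full path after the bucket, slashes included
blob = bucket.blob("folder/innerfolder/file.jpg")

# Either download straight to a local file...
blob.download_to_filename("file.jpg")

# ...or keep the bytes in memory, as in the original code
data = blob.download_as_string()
with open("file.jpg", "wb") as f:
    f.write(data)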