How to upload an empty sub-folder to S3 using Python

The following code works fine, except that a subfolder containing no files does not appear in S3. For example, if /home/temp/subfolder has no files, then subfolder will not show up in S3. How can I change the code so that empty folders are also uploaded to S3?
I tried to write something (see the note in the code below), but I do not know how to call put_object() for the empty subfolder.
#!/usr/bin/env python
import os
from boto3.session import Session
path = "/home/temp"
session = Session(aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3 = session.resource('s3')
for subdir, dirs, files in os.walk(path):
    # note: if not files ......
    for file in files:
        full_path = os.path.join(subdir, file)
        with open(full_path, 'rb') as data:
            s3.Bucket('my_bucket').put_object(Key=full_path[len(path)+1:],
                                              Body=data)
Besides, I tried to use the function below to check whether a subfolder or file exists. It works for files, but not for subfolders. How can I check whether a subfolder exists? (If a subfolder already exists, I will not upload it.)
import botocore.exceptions
def check_exist(s3, bucket, key):
    try:
        # load() requests the metadata of exactly this key
        s3.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        return False
    return True
BTW, I adapted the above code from
check if a key exists in a bucket in s3 using boto3
and
http://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
Thanks to the authors for sharing the code.

Directories (folders, subfolders, etc.) do not exist in S3.
When you copy the file /mydir/myfile.txt to an empty S3 bucket, only a single object is created. The directory mydir is not created, because that string is simply part of the object's name (its key), mydir/myfile.txt. The key is the full path; no subdirectories exist or are created.
S3 simulates directories by using a prefix when listing objects in the bucket. If you specify the prefix mydir/, then all S3 objects whose keys start with mydir/ are returned, including objects such as mydir/anotherfolder/myotherfile.txt. S3 also supports a delimiter such as / so that the appearance of subdirectories can be created.
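Because a "subfolder" is only a key prefix, one way to check whether it exists is to ask S3 for at most one object under that prefix. The following is a minimal sketch, assuming client is a boto3 S3 client; prefix_exists is a hypothetical helper name, not part of boto3:

```python
def prefix_exists(client, bucket, prefix):
    """Return True if at least one object key starts with prefix.

    Assumption: client is a boto3 S3 client, e.g. boto3.client('s3').
    """
    # Normalise the prefix so that "mydir" does not also match "mydir2/...".
    if not prefix.endswith("/"):
        prefix += "/"
    # MaxKeys=1 keeps the response small: one match is enough.
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0
```

The check_exist() function in the question looks up one exact key, so it only reports a subfolder if an object with exactly that key (for example a placeholder named subfolder/) was created.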
Note: There is no / at the beginning of a file name for S3 objects.
Reference: Listing Keys Hierarchically Using a Prefix and Delimiter
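If an empty local directory should still show up as a "folder" in the S3 console, a common workaround is to upload a zero-byte object whose key ends with /. The sketch below is one possible way to do that, assuming bucket is a boto3 s3.Bucket resource such as the one in the question; iter_upload_keys and upload_tree are hypothetical helper names:

```python
import os

def iter_upload_keys(root):
    """Yield (key, local_path) pairs for everything under root.

    Files yield their relative path as the key. Empty directories yield
    a key ending in '/' with local_path set to None.
    """
    for subdir, dirs, files in os.walk(root):
        rel = os.path.relpath(subdir, root)
        # S3 keys use forward slashes and have no leading '/'.
        prefix = "" if rel == "." else rel.replace(os.sep, "/") + "/"
        for name in files:
            yield prefix + name, os.path.join(subdir, name)
        # A directory with no files and no subdirectories is empty.
        if not files and not dirs and prefix:
            yield prefix, None

def upload_tree(bucket, root):
    """Assumption: bucket is a boto3 s3.Bucket resource."""
    for key, local_path in iter_upload_keys(root):
        if local_path is None:
            # Zero-byte placeholder object that the console shows as a folder.
            bucket.put_object(Key=key, Body=b"")
        else:
            with open(local_path, "rb") as data:
                bucket.put_object(Key=key, Body=data)
```

With the question's setup this would be called as upload_tree(s3.Bucket('my_bucket'), "/home/temp"). The placeholder is still an ordinary object, not a container.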

Related

copy files from bitbucket repository to S3 bucket using python script

I am running a Python script (through TeamCity) which deploys AWS CloudFormation templates. As a prerequisite, all the templates and script files in the Bitbucket repository should be copied to an S3 bucket before the deployment. The Bitbucket directory structure should be maintained in S3 as well. How can a Python script copy files and directories from Bitbucket to S3?
I was using the os.walk method to accomplish this, but it is not copying the subdirectories or the files in subdirectories.
pwd=os.getcwd()
print("Current Dir is: "+pwd)
print("Files in the Directory: ")
for subdir, dirs, files in os.walk(pwd):
    for file in files:
        print (file)
        try:
            object_name = file
            print("object name is"+object_name)
            response = client_bucket.upload_file(file, bucket_name, object_name)
        except:
            print("Error processing account " + account_id)
            for line in str(traceback.format_exc()).splitlines():
                print(line)
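A likely cause is that os.walk yields bare file names: upload_file(file, ...) then looks for every file in the current directory, and the object name drops the subdirectory. The sketch below shows one way to keep the structure, by joining each name with its subdir and using the path relative to the root as the key. It assumes client is a boto3 S3 client; plan_uploads and upload_repository are hypothetical helper names:

```python
import os

def plan_uploads(root):
    """Return (local_path, key) pairs for every file under root."""
    plan = []
    for subdir, dirs, files in os.walk(root):
        for name in files:
            # Full local path, so files in subdirectories can be opened.
            local_path = os.path.join(subdir, name)
            # Key = path relative to root, always with forward slashes.
            key = os.path.relpath(local_path, root).replace(os.sep, "/")
            plan.append((local_path, key))
    return plan

def upload_repository(client, bucket_name, root):
    """Assumption: client is a boto3 S3 client, e.g. boto3.client('s3')."""
    for local_path, key in plan_uploads(root):
        client.upload_file(local_path, bucket_name, key)
```

Separating the planning step from the upload makes it easy to print or inspect the keys before anything is sent to S3.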

S3 Paths | boto3

I have an existing process that reads the data source directory from a YAML file:
with open(r'<yaml file>') as file:
    directory = yaml.load(file, Loader=yaml.FullLoader)
source_directory = directory['source_directory']
The yaml file reads as follows:
source_directory : '<directory>'
However, the data source has now shifted from a local directory to an S3 Bucket. I am able to view the files in my S3 bucket by using the code below:
import boto3
client = boto3.client('s3')  # assumes credentials are configured in the environment
def ListFiles(client):
    response = client.list_objects(Bucket = '<bucket name>')
    for content in response.get('Contents', []):
        yield content.get('Key')
file_list = ListFiles(client)
for file in file_list:
    print(file)
The boto3 code correctly lists my files, so I know my connection to the bucket is successful. How do I reference the directory path within the S3 bucket in the source_directory variable in the YAML file?
Update based on a comment I got:
Someone suggested using s3://<bucket_name>/object_path in place of the directory in the YAML file. However, this produces a "No such file or directory" error.
If you want to obtain a list of objects in a given directory of an Amazon S3 bucket, you can use:
response = client.list_objects_v2(Bucket='<bucket name>',Prefix=source_directory)
The Prefix should end with a slash, so the YAML file should look like:
source_directory : 'my-directory/'
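The s3:// form fails because local file functions such as open() only understand filesystem paths; the value from the YAML file has to be passed to S3 as a Prefix instead. A single list_objects_v2 call returns at most 1,000 keys, so the sketch below uses a paginator. It assumes client is a boto3 S3 client; list_keys is a hypothetical helper name:

```python
def list_keys(client, bucket_name, source_directory):
    """Yield every object key under source_directory.

    Assumption: client is a boto3 S3 client, e.g. boto3.client('s3').
    """
    # Make sure the prefix ends with a slash, as recommended above.
    prefix = source_directory
    if not prefix.endswith("/"):
        prefix += "/"
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # Pages without matches have no 'Contents' entry.
        for content in page.get("Contents", []):
            yield content["Key"]
```

Each yielded key can then be passed to a download or read call in place of the local file path the process used before.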

AWS Lambda unzips and returns file to s3 bucket -- issue with dropped zip file folder name

I have been having some issues with my AWS Lambda function that unzips a file inside our S3 bucket. I created a script that is triggered by a JSON payload passed to it. The problem is that it seems to be losing the parent folder of the zip file and uploading only the child folders underneath it. This is an issue for me because we also have another script that parses a log4j file inside a folder to check for errors. That script is having problems because the lost name is what identifies the farm the folder comes from.
To give an example of the issue:
There's an S3 bucket in us-east, and inside that bucket is a key for "OriginalFolder.zip". When this Lambda is triggered, it unzips the archive and places the child folder in the same bucket and location as the original zip file, but names it "Log.folder". I want it to keep the original name of the zip file so that when multiple farms trigger this Lambda, it doesn't overwrite the folder that was created, and the second Lambda doesn't get confused about which one to read from.
I tried to append something to the end of the created file name so that parameters could be passed through for each farm that runs it, but I can't seem to make it work. I also considered adding a separate step in the script to copy and rename it using boto3, but I would rather not make that my first choice. I feel there has to be an easier method that I am overlooking.
Any thoughts would be helpful.
Edit: Here's a picture of the example. The green arrow is what I want it to stay named as. The red arrow is what the file is becoming named inside of our s3 environment. "on1" is the next folder inside "update-dc-logs-test".
import os
import tempfile
import zipfile
from concurrent import futures
from io import BytesIO
import boto3
s3 = boto3.client('s3')
def handler(event, context):
    # Parse and prepare required items from event
    global bucket, path, zipdata, rn_file
    action = event.get("action", None)
    if action == "create" or action == "update":
        bucket = event['payload']['BucketName']
        key = event['payload']['Key']
        #rn_file = event['payload']['RenameFile']
        path = os.path.dirname(key)
    # Create temporary file
    temp_file = tempfile.mktemp()
    # Fetch and load target file
    s3.download_file(bucket, key, temp_file)
    zipdata = zipfile.ZipFile(temp_file)
    # Call action method with using ThreadPool
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        future_list = [
            executor.submit(extract, filename)
            for filename in zipdata.namelist()
        ]
    result = {'success': [], 'fail': []}
    for future in future_list:
        filename, status = future.result()
        result[status].append(filename)
    return result
def extract(filename):
    # Extract zip and place it back in bucket
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            BytesIO(zipdata.read(filename)),
            bucket,
            os.path.join(path, filename)
        )
    except Exception:
        upload_status = 'fail'
    finally:
        return filename, upload_status
You are prefixing all uploaded files with path which is the path at which the ZIP file is found. If you want the uploaded files to be stored below a prefix which is the path and name of the ZIP file (minus the .zip extension), then change the value of path to this:
path = os.path.splitext(key)[0]
Now, instead of path holding the ZIP file's folder prefix it will contain the folder prefix plus the first part of the ZIP filename. For example, if an object is uploaded to folder1/myarchive.zip then path would previously contain folder1, but with this change it will now contain folder1/myarchive.
When that new path is combined in the extract function via os.path.join(path, filename), the object will now be uploaded to folder1/myarchive/on1/file.txt.
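The key arithmetic described above can be checked without touching S3. This is a small sketch, not the asker's Lambda code: extraction_prefix and destination_key are hypothetical helper names, and posixpath is used in place of os.path so that keys always use forward slashes regardless of the platform:

```python
import posixpath

def extraction_prefix(zip_key):
    """Return the ZIP object's key without its extension."""
    return posixpath.splitext(zip_key)[0]

def destination_key(zip_key, member_name):
    """Return the key under which one ZIP member would be uploaded."""
    return posixpath.join(extraction_prefix(zip_key), member_name)

print(destination_key("folder1/myarchive.zip", "on1/file.txt"))
# prints: folder1/myarchive/on1/file.txt
```

Because the ZIP file's own name becomes part of every uploaded key, two farms uploading differently named archives to the same folder no longer overwrite each other's extracted files.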

Upload files to GCS, skip if existed using python

I have a GCS bucket called my-gcs with inconsistent subfolders, such as:
parent-path/path1/path2/*
parent-path/path3/path4/path5/*
parent-path/path6/*
The files can be Parquet, CSV, or other formats.
This is my function to copy the entire folder from local to GCS:
def upload_local_directory_to_gcs(src_path, dest_path, data_backup, file_name):
    """
    Upload the whole directory to GCS
    """
    logger.debug("Uploading directory...")
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.get_bucket(GCS_BUCKET)
    if os.path.isfile(src_path):
        blob = bucket.blob(os.path.join(dest_path, os.path.basename(src_path)))
        blob.upload_from_filename(src_path)
        return
    for item in glob.glob(src_path + '/*'):
        file_exist = check_file_exist(data_backup, file_name)
        if os.path.isfile(item):
            print(item)
            if file_exist is False:
                blob = bucket.blob(os.path.join(dest_path, os.path.basename(item)),
                                   chunk_size=10485760)
                blob.upload_from_filename(item)
            else:
                logger.warning("Skipping upload. File already existed")
        else:
            if file_exist is False:
                upload_local_directory_to_gcs(item, os.path.join(dest_path, os.path.basename(item)),
                                              data_backup, file_name)
            else:
                logger.warning("Skipping upload. File already existed")
This is the function that checks whether a specific file exists in the directory and its sub-directories:
def check_file_exist(dataset, file_name):
    """
    Check if files existed
    """
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(GCS_BUCKET)
    logger.debug("Checking if file already existed in GCS to skip upload...")
    blobs = bucket.list_blobs(prefix=f'parent-path{dataset}/')
    check_files = [blob.name for blob in blobs if file_name in blob.name] # if '.' in blob.name
    return bool(len(check_files))
However, the code is not running correctly. Say the path parent-path/path1/path2/* already has a file called first_file.csv. The code skips uploading the existing file in this path. But once it encounters a file that does not exist yet, it uploads that file and overwrites the other files in all directories as well.
I was expecting it to upload only the specific files that do not exist yet, without overwriting the other files.
I tried my best to explain... please help.
If you look at the documentation, you can see this description of the blob's name property:
The name of the blob. This corresponds to the unique path of the object in the bucket.
That means the value is not only the file name, but the fully qualified path plus the name, e.g. path/to/file.csv.
In your loop, you check whether a file name (file.csv, for example) is contained in the blob path. Consider this case:
path/to/file.csv
path/to/to/file.csv
If you test whether file.csv exists, both blobs will match.
To fix your issue, you need to
Either compare target_path + file_name with blob.name for strict equality
Or add a condition to your "if" that checks the bucket path in addition to the file name.
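The first option can be sketched on plain strings, which makes the difference from the substring test easy to see. Here blob_names stands in for the blob.name values returned by bucket.list_blobs(), and file_already_uploaded is a hypothetical helper name:

```python
import posixpath

def file_already_uploaded(blob_names, target_path, file_name):
    """Return True only if target_path/file_name is an exact blob name."""
    expected = posixpath.join(target_path, file_name)
    return any(name == expected for name in blob_names)

blob_names = ["path/to/file.csv", "path/to/to/file.csv"]

# Substring test from the question: matches regardless of the directory.
print(any("file.csv" in name for name in blob_names))
# Exact comparison: only true for the directory that really holds the file.
print(file_already_uploaded(blob_names, "path/to", "file.csv"))
print(file_already_uploaded(blob_names, "path/other", "file.csv"))
```

With the google-cloud-storage client, calling bucket.blob(expected).exists() for the full destination name should give the same answer without listing the prefix, at the cost of one request per file.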

Boto3 folder sync under new S3 'folder'

So, before anyone tells me about the flat structure of S3, I already know, but the fact is you can create 'folders' in S3. My objective with this Python code is to create a new folder named using the date of running and appending the user's input to this (which is the createS3Folder function) - I then want to sync a folder in a local directory to this folder.
The problem is that my upload_files function creates a new folder in S3 that exactly emulates the folder structure of my local set up.
Can anyone suggest how I would just sync the folder into the newly created one without changing names?
import sys
import boto3
import datetime
import os
teamName = raw_input("Please enter the name of your project: ")
bucketFolderName = ""
def createS3Folder():
    date = (datetime.date.today().strftime("%Y") + "." +
            datetime.date.today().strftime("%B") + "." +
            datetime.date.today().strftime("%d"))
    date1 = datetime.date.today()
    date = str(date1) + "/" #In order to generate a file, you must put "/" at the end of key
    bucketFolderName = date + teamName + "/"
    client = boto3.client('s3')
    client.put_object(Bucket='MY_BUCKET',Key=bucketFolderName)
    upload_files('/Users/local/directory/to/sync')
def upload_files(path):
    session = boto3.Session()
    s3 = session.resource('s3')
    bucket = s3.Bucket('MY_BUCKET')
    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                bucket.put_object(Key=bucketFolderName, Body=data)
def main():
    createS3Folder()
if __name__ == "__main__":
    main()
Your upload_files() function is uploading to:
bucket.put_object(Key=bucketFolderName, Body=data)
This means that the filename ("Key") on S3 will be the name of the 'folder'. It should be:
bucket.put_object(Key=bucketFolderName + file, Body=data)
The Key is the full path of the destination object, including the filename (not just a 'directory'). Since bucketFolderName already ends with a /, no extra separator is needed.
In fact, there is no need to create the 'folder' beforehand -- just upload to the desired Key.
If you are feeling lazy, use the AWS Command-Line Interface (CLI) aws s3 sync command to do it for you!
"the fact is you can create 'folders' in S3"
No, you can't.
You can create an empty object that looks like a folder in the console, but it is still not a folder, it still has no meaning, it is still unnecessary, and if you delete it via the API, all the files you thought were "in" the folder will still be in the bucket. (If you delete it from the console, all the contents are deleted from the bucket, because the console explicitly deletes every object starting with that key prefix.)
The folder you are creating is not a container and cannot have anything inside it, because S3 does not have folders that are containers.
If you want to store a file cat.png and make it look like it's in the hat/ folder, you simply set the object key to hat/cat.png. This has exactly the same effect, as observed in the console, whether or not the hat/ folder was explicitly created.
To do what you want, you simply build the desired object key for each object with string manipulation, including your common prefix ("folder name") and / delimiters. Any folder structure the / delimiters imply will be displayed in the console as a result.
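The string manipulation can be as small as the sketch below, which builds the dated prefix from the question and appends each file's path relative to the synced directory. make_prefix and object_key are hypothetical helper names, and the date format follows str(datetime.date.today()) as used in the question:

```python
import datetime
import os

def make_prefix(team_name, today=None):
    """Return a 'folder' prefix such as '2020-01-02/myteam/'."""
    today = today or datetime.date.today()
    return today.isoformat() + "/" + team_name + "/"

def object_key(prefix, root, full_path):
    """Return prefix + the path of full_path relative to root."""
    relative = os.path.relpath(full_path, root).replace(os.sep, "/")
    return prefix + relative

prefix = make_prefix("myteam", datetime.date(2020, 1, 2))
root = os.path.join("Users", "local", "sync")
print(object_key(prefix, root, os.path.join(root, "a", "b.txt")))
# prints: 2020-01-02/myteam/a/b.txt
```

Passing the resulting key to put_object for each file is enough; no separate call is needed to create the 'folder' first.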
