So, before anyone tells me about the flat structure of S3: I already know, but the fact is you can create 'folders' in S3. My objective with this Python code is to create a new folder named with the date of the run plus the user's input (the createS3Folder function), and then sync a local directory into that folder.
The problem is that my upload_files function creates a new folder in S3 that exactly mirrors the folder structure of my local setup.
Can anyone suggest how I would just sync the folder into the newly created one without changing names?
import sys
import boto3
import datetime
import os
teamName = raw_input("Please enter the name of your project: ")
bucketFolderName = ""
def createS3Folder():
    date = (datetime.date.today().strftime("%Y") + "." +
            datetime.date.today().strftime("%B") + "." +
            datetime.date.today().strftime("%d"))
    date1 = datetime.date.today()
    date = str(date1) + "/"  # In order to generate a file, you must put "/" at the end of key
    bucketFolderName = date + teamName + "/"
    client = boto3.client('s3')
    client.put_object(Bucket='MY_BUCKET', Key=bucketFolderName)
    upload_files('/Users/local/directory/to/sync')

def upload_files(path):
    session = boto3.Session()
    s3 = session.resource('s3')
    bucket = s3.Bucket('MY_BUCKET')
    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                bucket.put_object(Key=bucketFolderName, Body=data)

def main():
    createS3Folder()

if __name__ == "__main__":
    main()
Your upload_files() function is uploading to:
bucket.put_object(Key=bucketFolderName, Body=data)
This means that the filename ("Key") on S3 will be the name of the 'folder'. It should be:
bucket.put_object(Key=bucketFolderName + file, Body=data)
The Key is the full path of the destination object, including the filename (not just a 'directory').
In fact, there is no need to create the 'folder' beforehand -- just upload to the desired Key.
If you are feeling lazy, use the AWS Command-Line Interface (CLI) aws s3 sync command to do it for you!
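For example (a sketch only; the local path is the one from the question, and the 2017-10-30/myteam/ destination prefix is a hypothetical value standing in for whatever your code generates):
aws s3 sync /Users/local/directory/to/sync s3://MY_BUCKET/2017-10-30/myteam/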
"the fact is you can create 'folders' in S3"
No, you can't.
You can create an empty object that looks like a folder in the console, but it is still not a folder, it still has no meaning, it is still unnecessary, and if you delete it via the API, all the files you thought were "in" the folder will still be in the bucket. (If you delete it from the console, all the contents are deleted from the bucket, because the console explicitly deletes every object starting with that key prefix.)
The folder you are creating is not a container and cannot have anything inside it, because S3 does not have folders that are containers.
If you want to store a file cat.png and make it look like it's in the hat/ folder, you simply set the object key to hat/cat.png. This has exactly the same effect in the console, whether or not the hat/ folder was explicitly created.
To do what you want, you simply build the desired object key for each object with string manipulation, including your common prefix ("folder name") and / delimiters. Any folder structure the / delimiters imply will be displayed in the console as a result.
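As a minimal sketch of that approach, assuming the same MY_BUCKET bucket and local path as in the question (the sync_folder_to_prefix helper and the 'myteam' value are just illustrative), the upload loop can build each key from the date/team prefix plus the file's path relative to the directory being synced, so only that structure appears under the new 'folder':
import datetime
import os
import boto3

def sync_folder_to_prefix(local_path, team_name):
    prefix = str(datetime.date.today()) + "/" + team_name + "/"
    bucket = boto3.Session().resource('s3').Bucket('MY_BUCKET')
    for subdir, dirs, files in os.walk(local_path):
        for file in files:
            full_path = os.path.join(subdir, file)
            # key = common prefix + path relative to the synced directory
            key = prefix + os.path.relpath(full_path, local_path)
            bucket.upload_file(full_path, key)

sync_folder_to_prefix('/Users/local/directory/to/sync', 'myteam')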
I have been having some issues with my AWS Lambda that unzips a file inside of our S3 bucket. I created a script that is triggered by a JSON payload with the parameters passed through to it. The problem is that it seems to be losing the parent folder of the zip file and uploading the child folders underneath it. This is an issue for me because we also have another script I made to parse a log4j file inside of a folder and review it for errors. That script is having problems because the name that identifies which farm the folder comes from is lost.
To give an example of the issue ---
There's an s3 bucket on us-east, and inside that bucket is a key for "OriginalFolder.zip". When this lambda is activated it unzips and places the child file into the exact same bucket and place where the original zip file is but names it "Log.folder". I want it to keep the original name of the zip file so that when multiple farms are activating this lambda it doesn't overwrite that folder that's created or get confused on which one to read from with the second lambda.
I tried to append something at the end of the created file name to allow params to be passed through for each farm that runs it, but can't seem to make it work. I also contemplated having a separate action called in the script to copy and rename it using boto3, but I would rather not use that as my first choice. I feel there has to be an easier method, but I might be overlooking it.
Any thoughts would be helpful.
Edit: Here's a picture of the example. The green arrow is what I want it to stay named as. The red arrow is what the file is becoming named inside of our s3 environment. "on1" is the next folder inside "update-dc-logs-test".
import os
import tempfile
import zipfile
from concurrent import futures
from io import BytesIO
import boto3
s3 = boto3.client('s3')
def handler(event, context):
    # Parse and prepare required items from event
    global bucket, path, zipdata, rn_file
    action = event.get("action", None)
    if action == "create" or action == "update":
        bucket = event['payload']['BucketName']
        key = event['payload']['Key']
        #rn_file = event['payload']['RenameFile']
        path = os.path.dirname(key)

        # Create temporary file
        temp_file = tempfile.mktemp()

        # Fetch and load target file
        s3.download_file(bucket, key, temp_file)
        zipdata = zipfile.ZipFile(temp_file)

        # Call action method with using ThreadPool
        with futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_list = [
                executor.submit(extract, filename)
                for filename in zipdata.namelist()
            ]
            result = {'success': [], 'fail': []}
            for future in future_list:
                filename, status = future.result()
                result[status].append(filename)
        return result

def extract(filename):
    # Extract zip and place it back in bucket
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            BytesIO(zipdata.read(filename)),
            bucket,
            os.path.join(path, filename)
        )
    except Exception:
        upload_status = 'fail'
    finally:
        return filename, upload_status
You are prefixing all uploaded files with path which is the path at which the ZIP file is found. If you want the uploaded files to be stored below a prefix which is the path and name of the ZIP file (minus the .zip extension), then change the value of path to this:
path = os.path.splitext(key)[0]
Now, instead of path holding the ZIP file's folder prefix it will contain the folder prefix plus the first part of the ZIP filename. For example, if an object is uploaded to folder1/myarchive.zip then path would previously contain folder1, but with this change it will now contain folder1/myarchive.
When that new path is combined in the extract function via os.path.join(path, filename), the object will now be uploaded to folder1/myarchive/on1/file.txt.
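As a quick illustration of that path manipulation (the folder1/myarchive.zip key is just the example used above):
import os

key = "folder1/myarchive.zip"
path = os.path.splitext(key)[0]            # "folder1/myarchive"
print(os.path.join(path, "on1/file.txt"))  # "folder1/myarchive/on1/file.txt"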
I have this script written in Python which looks through the folder 'CSVtoGD', lists every CSV there, and sends those CSVs as independent sheets to my Google Drive. I am trying to write a line which will delete the old files when I run the program again. What am I missing here? I am trying to achieve that by using:
sh = gc.del_spreadsheet(filename.split(".")[0]+" TTF")
Unfortunately the script is doing the same thing after adding this line. It is uploading new files but not deleting old ones.
The whole script looks like this:
import gspread
import os
gc = gspread.oauth(credentials_filename='/users/user/credentials.json')
os.chdir('/users/user/CSVtoGD')
files = os.listdir()
for filename in files:
    if filename.split(".")[1] == "csv":
        folder_id = '19vrbvaeDqWcxFGwPV82APWYTmB'
        sh = gc.del_spreadsheet(filename.split(".")[0]+" TTF")
        sh = gc.create(filename.split(".")[0]+" TTF", folder_id)
        content = open(filename, 'r').read().encode('utf-8')
        gc.import_csv(sh.id, content)
Everything is working fine: CSVs from the folder are uploaded to Google Drive. My problem is with deleting the old CSVs (with the same names as the new ones).
Looking at the gspread documentation, it seems that the argument of the del_spreadsheet method is the file ID. (Ref) In your script, you are using the filename as the argument. I thought that this might be the reason for your issue. When this is reflected in your script, it becomes as follows.
From:
sh = gc.del_spreadsheet(filename.split(".")[0]+" TTF")
To:
sh = gc.del_spreadsheet(gc.open(filename.split(".")[0] + " TTF").id)
Note:
When a Spreadsheet named filename.split(".")[0] + " TTF" does not exist, an error occurs. Please be careful about this.
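For example, a small guard for that case might look like this (a sketch, assuming gspread's SpreadsheetNotFound exception, which open() raises when no file matches):
try:
    old = gc.open(filename.split(".")[0] + " TTF")
    gc.del_spreadsheet(old.id)
except gspread.SpreadsheetNotFound:
    pass  # nothing to delete on the first run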
Reference:
del_spreadsheet(file_id)
Added:
From your reply of "When I try to delete another file using this method from My Drive it is working well.", it was found that my proposed modification can be used for "My Drive", but it seems that it cannot be used for a shared drive.
When I checked the gspread code again, I noticed that the current request cannot search files in a shared drive by filename. I also confirmed that, in the current gspread, the Spreadsheet ID cannot be retrieved in that situation, because files cannot be searched across all shared drives. So I would like to propose the following modified script.
Modified script:
import gspread
import os
from googleapiclient.discovery import build
gc = gspread.oauth(credentials_filename='/users/user/credentials.json')
service = build("drive", "v3", credentials=gc.auth)
def getSpreadsheetId(filename):
    q = f"name='{filename}' and mimeType='application/vnd.google-apps.spreadsheet' and trashed=false"  # or q = "name='" + filename + "' and mimeType='application/vnd.google-apps.spreadsheet' and trashed=false"
    res = service.files().list(q=q, fields="files(id)", corpora="allDrives", includeItemsFromAllDrives=True, supportsAllDrives=True).execute()
    items = res.get("files", [])
    if not items:
        print("No files found.")
        exit()
    return items[0]["id"]

os.chdir('/users/user/CSVtoGD')
files = os.listdir()
for filename in files:
    fname = filename.split(".")
    if fname[1] == "csv":
        folder_id = '19vrbvaeDqWcxFGwPV82APWYTmB'
        oldSpreadsheetId = getSpreadsheetId(fname[0] + " TTF")
        sh = gc.del_spreadsheet(oldSpreadsheetId)
        sh = gc.create(fname[0] + " TTF", folder_id)
        content = open(filename, "r").read().encode("utf-8")
        gc.import_csv(sh.id, content)
In this modification, googleapis for Python is used in order to retrieve the Spreadsheet ID from the filename in the shared drive. (Ref)
But in this case, it assumes that you have permission to write to the shared drive. Please be careful about this.
I have a GCS bucket called my-gcs with inconsistent subfolders such as:
parent-path/path1/path2/*
parent-path/path3/path4/path5/*
parent-path/path6/*
The files can be parquet/csv or other formats.
This is my function to copy the entire folder from local to GCS:
def upload_local_directory_to_gcs(src_path, dest_path, data_backup, file_name):
    """
    Upload the whole directory to GCS
    """
    logger.debug("Uploading directory...")
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.get_bucket(GCS_BUCKET)

    if os.path.isfile(src_path):
        blob = bucket.blob(os.path.join(dest_path, os.path.basename(src_path)))
        blob.upload_from_filename(src_path)
        return

    for item in glob.glob(src_path + '/*'):
        file_exist = check_file_exist(data_backup, file_name)
        if os.path.isfile(item):
            print(item)
            if file_exist is False:
                blob = bucket.blob(os.path.join(dest_path, os.path.basename(item)),
                                   chunk_size=10485760)
                blob.upload_from_filename(item)
            else:
                logger.warning("Skipping upload. File already existed")
        else:
            if file_exist is False:
                upload_local_directory_to_gcs(item, os.path.join(dest_path, os.path.basename(item)),
                                              data_backup, file_name)
            else:
                logger.warning("Skipping upload. File already existed")
This is the function to check if a specific file exists in the directory & sub-directories:
def check_file_exist(dataset, file_name):
    """
    Check if files existed
    """
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(GCS_BUCKET)
    logger.debug("Checking if file already existed in GCS to skip upload...")
    blobs = bucket.list_blobs(prefix=f'parent-path{dataset}/')
    check_files = [blob.name for blob in blobs if file_name in blob.name]  # if '.' in blob.name
    return bool(len(check_files))
However, the code is not running correctly. Say the path parent-path/path1/path2/* already has a file called first_file.csv. It will skip uploading the existing file in this path. But once it encounters a file that does not yet exist, it will upload that file and overwrite the other files in all directories as well.
I was expecting it to only upload the specific files that do not exist yet, without overwriting the other files.
I tried my best to explain... please help.
If you have a look at the documentation, you can see this on the Name property of the blob:
The name of the blob. This corresponds to the unique path of the object in the bucket.
That means the value is not only the file name, but the fully qualified path + the name, e.g. path/to/file.csv.
In your loop, you check whether a file name (file.csv for example) is included in the blob path. Consider this case:
path/to/file.csv
path/to/to/file.csv
If you test whether file.csv exists, both blobs will return true.
To fix your issue, you need to
Either compare the strict equality of the target_path + file_name and the blob.name
Or include an additional condition in your "if" to include the bucket path to check in addition to the file name.
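A minimal sketch of the first option, written as a drop-in replacement for the question's check_file_exist (the extra dest_path parameter is hypothetical: it is assumed to be the destination prefix your upload function already computes, so the full blob name can be compared; storage, key_path and GCS_BUCKET are reused from your script):
def check_file_exist(dataset, dest_path, file_name):
    """Return True only when the exact object (full destination path + file name) exists."""
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(GCS_BUCKET)
    # blob.name is the fully qualified path, so compare against the full target name
    expected_name = f'{dest_path}/{file_name}'
    blobs = bucket.list_blobs(prefix=f'parent-path{dataset}/')
    return any(blob.name == expected_name for blob in blobs)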
I have these paths in S3:
s3://mykey/mytest/file1.txt
s3://mykey/mytest/file2.txt
s3://mykey/mytest/file3.txt
and
s3://mykey/mytest_temp/file4.txt
s3://mykey/mytest_temp/file5.txt
s3://mykey/mytest_temp/file6.txt
Want to drop s3://mykey/mytest/ (and all files in it) and THEN rename s3://mykey/mytest_temp/ to s3://mykey/mytest/ while keeping all files in there (file4, file5, file6).
Final result should be - only 1 folder:
s3://mykey/mytest/file4.txt
s3://mykey/mytest/file5.txt
s3://mykey/mytest/file6.txt
How to achieve this using Python Boto3?
Thanks.
The AWS API only allows an operation on one object at a time. Also, there is no "move" command, so you would need to do a Copy and a Delete.
The easiest way to do what you ask is to use the AWS Command-Line Interface (CLI) because it has some higher-level commands that can do this easily:
aws s3 rm s3://mykey/mytest/ --recursive
aws s3 mv s3://mykey/mytest_temp/ s3://mykey/mytest/ --recursive
If you didn't want to use the AWS CLI, you could code this operation using boto but you would need to loop through each object and process it individually.
To do this purely from Python using boto3, you would need to do the following:
Deleting existing 'folder'
Call list_objects_v2(), passing in a Prefix to obtain a listing of the directory
Take the results and pass the object names into a delete_objects() call
Please note that each of these API calls handles up to 1000 objects. If you have more than 1000 objects, you would need to paginate the results by calling them again.
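A minimal sketch of that deletion step (reading the bucket name as mykey and the old prefix as mytest/ from the paths in the question, and using a paginator so more than 1000 keys are handled):
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# delete everything under the old prefix, up to 1000 keys per delete_objects() call
for page in paginator.paginate(Bucket='mykey', Prefix='mytest/'):
    to_delete = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
    if to_delete:
        s3.delete_objects(Bucket='mykey', Delete={'Objects': to_delete})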
'Renaming' objects
Amazon S3 does not have a 'rename' command. Instead, it will be necessary to copy each object to a new key, then delete the original object.
Call list_objects_v2(), passing in a Prefix to obtain a listing of the directory
Loop through each object and:
Call copy_object(), specifying a full path in the destination Key
Call delete_object() after the object has been copied
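And a sketch of the 'rename' step under the same assumptions (bucket mykey, prefixes taken from the question):
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket='mykey', Prefix='mytest_temp/'):
    for obj in page.get('Contents', []):
        old_key = obj['Key']
        new_key = 'mytest/' + old_key[len('mytest_temp/'):]
        # copy the object to its new key, then remove the original
        s3.copy_object(Bucket='mykey',
                       CopySource={'Bucket': 'mykey', 'Key': old_key},
                       Key=new_key)
        s3.delete_object(Bucket='mykey', Key=old_key)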
I have a Django project where I needed the ability to rename a folder but still keep the directory structure intact, meaning empty folders would need to be copied and stored in the renamed directory as well.
The aws cli is great, but neither cp nor sync nor mv copied empty folders (i.e. files ending in '/') over to the new folder location, so I used a mixture of boto3 and the aws cli to accomplish the task.
More or less, I find all folders in the renamed directory and use boto3 to put them in the new location, then I cp the data with the aws cli and finally remove it.
import threading
import os
from django.conf import settings
from django.contrib import messages
from django.core.files.storage import default_storage
from django.shortcuts import redirect
from django.urls import reverse
def rename_folder(request, client_url):
    """
    :param request:
    :param client_url:
    :return:
    """
    current_property = request.session.get('property')
    if request.POST:
        # name the change
        new_name = request.POST['name']
        # old full path with www.[].com?
        old_path = request.POST['old_path']
        # remove the query string
        old_path = ''.join(old_path.split('?')[0])
        # remove the .com prefix item so we have the path in the storage
        old_path = ''.join(old_path.split('.com/')[-1])
        # remove empty values, this will happen at end due to these being folders
        old_path_list = [x for x in old_path.split('/') if x != '']
        # remove the last folder element with split()
        base_path = '/'.join(old_path_list[:-1])
        # # now build the new path
        new_path = base_path + f'/{new_name}/'
        # remove empty variables
        # print(old_path_list[:-1], old_path.split('/'), old_path, base_path, new_path)
        endpoint = settings.AWS_S3_ENDPOINT_URL

        # # recursively add the files
        copy_command = f"aws s3 --endpoint={endpoint} cp s3://{old_path} s3://{new_path} --recursive"
        remove_command = f"aws s3 --endpoint={endpoint} rm s3://{old_path} --recursive"

        # get_creds() is nothing special it simply returns the elements needed via boto3
        client, resource, bucket, resource_bucket = get_creds()

        path_viewing = f'{"/".join(old_path.split("/")[1:])}'
        directory_content = default_storage.listdir(path_viewing)

        # loop over folders and add them by default, aws cli does not copy empty ones
        # so this is used to accommodate
        folders, files = directory_content
        for folder in folders:
            new_key = new_path + folder + '/'
            # we must remove bucket name for this to work
            new_key = new_key.split(f"{bucket}/")[-1]
            # push this to new thread
            threading.Thread(target=put_object, args=(client, bucket, new_key,)).start()
            print(f'{new_key} added')

        # # run command, which will copy all data
        os.system(copy_command)
        print('Copy Done...')
        os.system(remove_command)
        print('Remove Done...')
        # print(bucket)
        print(f'Folder renamed.')
        messages.success(request, f'Folder Renamed to: {new_name}')

    return redirect(request.META.get('HTTP_REFERER', f"{reverse('home', args=[client_url])}"))
The following code works fine except when a subfolder does not have any files inside it; then the subfolder will not appear in S3. e.g.
If /home/temp/subfolder has no files, then the subfolder will not show up in S3. How can I change the code so that the empty folder is also uploaded to S3?
I tried to write something (see the note below), but do not know how to call put_object() for the empty subfolder.
#!/usr/bin/env python
import os
from boto3.session import Session
path = "/home/temp"
session = Session(aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3 = session.resource('s3')
for subdir, dirs, files in os.walk(path):
    # note: if not files ......
    for file in files:
        full_path = os.path.join(subdir, file)
        with open(full_path, 'rb') as data:
            s3.Bucket('my_bucket').put_object(Key=full_path[len(path)+1:],
                                              Body=data)
Besides, I tried to call this function to check whether a subfolder or file exists. It works for files, but not for subfolders. How can I check if a subfolder exists? (If a subfolder exists, I will not upload it.)
def check_exist(s3, bucket, key):
    try:
        s3.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        return False
    return True
BTW, I referred to the above code from
check if a key exists in a bucket in s3 using boto3
and
http://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
Thanks to them for sharing the code.
Directories (folders, subfolders, etc.) do not exist in S3.
When you copy a file to an empty S3 bucket as /mydir/myfile.txt, only the file myfile.txt is copied to S3. The directory mydir is not created; that string is part of the object's key, mydir/myfile.txt. The actual object name is the full path; no subdirectories exist or are created.
S3 simulates directories by using a prefix when listing files in the bucket. If you specify mydir/, then all of the S3 objects that start with mydir/ will be returned, including objects such as mydir/anotherfolder/myotherfile.txt. S3 supports a delimiter such as / so that the appearance of subdirectories can be created.
Note: There is no / at the beginning of a file name for S3 objects.
Listing Keys Hierarchically Using a Prefix and Delimiter
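As a minimal sketch of that idea, assuming boto3 and the bucket name my_bucket from the question: checking whether a 'subfolder' exists amounts to asking whether any key starts with that prefix, and an 'empty folder' can only be represented by a zero-byte placeholder object whose key ends in /:
import boto3

s3 = boto3.client('s3')

def prefix_exists(bucket, prefix):
    # a "subfolder" exists only in the sense that at least one key starts with the prefix
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp['KeyCount'] > 0

def create_placeholder(bucket, prefix):
    # zero-byte object whose key ends in '/' -- this is what the console shows as an empty folder
    s3.put_object(Bucket=bucket, Key=prefix if prefix.endswith('/') else prefix + '/')

if not prefix_exists('my_bucket', 'temp/subfolder/'):
    create_placeholder('my_bucket', 'temp/subfolder/')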