Python Boto3 - how to drop a 'folder' and rename another one? - python

I have these paths in S3:
s3://mykey/mytest/file1.txt
s3://mykey/mytest/file2.txt
s3://mykey/mytest/file3.txt
and
s3://mykey/mytest_temp/file4.txt
s3://mykey/mytest_temp/file5.txt
s3://mykey/mytest_temp/file6.txt
I want to drop s3://mykey/mytest/ (and all files in it) and THEN rename s3://mykey/mytest_temp/ to s3://mykey/mytest/, keeping all the files in it (file4, file5, file6).
The final result should be a single folder:
s3://mykey/mytest/file4.txt
s3://mykey/mytest/file5.txt
s3://mykey/mytest/file6.txt
How to achieve this using Python Boto3?
Thanks.

The AWS API generally operates on one object at a time. Also, there is no "move" command, so you would need to do a Copy and then a Delete.
The easiest way to do what you ask is to use the AWS Command-Line Interface (CLI) because it has some higher-level commands that can do this easily:
aws s3 rm s3://mykey/mytest/ --recursive
aws s3 mv s3://mykey/mytest_temp/ s3://mykey/mytest/ --recursive
If you didn't want to use the AWS CLI, you could code this operation using boto but you would need to loop through each object and process it individually.

To do this purely from Python using boto3, you would need to do the following:
Deleting existing 'folder'
Call list_objects_v2(), passing in a Prefix to obtain a listing of the directory
Take the results and pass the object names into a delete_objects() call
Please note that each of these API calls handles up to 1,000 objects at a time. If you have more than 1,000 objects, you will need to paginate and call them again.
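For example, here is a minimal sketch of the delete step, assuming the bucket is mykey and the prefix is mytest/ as in the question (a paginator handles the 1,000-object page size):
import boto3

s3 = boto3.client('s3')
bucket = 'mykey'      # assumed bucket name from the question
prefix = 'mytest/'    # 'folder' to delete

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    objects = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
    if objects:
        # delete_objects accepts up to 1000 keys per call, matching the page size
        s3.delete_objects(Bucket=bucket, Delete={'Objects': objects})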
'Renaming' objects
Amazon S3 does not have a 'rename' command. Instead, it will be necessary to copy each object to a new key, then delete the original object.
Call list_objects_v2(), passing in a Prefix to obtain a listing of the directory
Loop through each object and:
Call copy_object(), specifying a full path in the destination Key
Call delete_object() after the object has been copied
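A minimal sketch of the rename step, again assuming the bucket and prefixes from the question:
import boto3

s3 = boto3.client('s3')
bucket = 'mykey'              # assumed bucket name from the question
old_prefix = 'mytest_temp/'
new_prefix = 'mytest/'

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
    for obj in page.get('Contents', []):
        old_key = obj['Key']
        new_key = new_prefix + old_key[len(old_prefix):]
        # copy each object to its new key, then delete the original
        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': old_key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=old_key)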

I have a Django project where I needed the ability to rename a folder but still keep the directory structure intact, meaning empty folders would need to be copied and stored in the renamed directory as well.
The aws cli is great, but none of cp, sync, or mv copied empty 'folders' (i.e. keys ending in '/') over to the new folder location, so I used a mixture of boto3 and the aws cli to accomplish the task.
More or less, I find all folders in the renamed directory, use boto3 to put them in the new location, then cp the data with the aws cli and finally remove the old location.
import threading
import os
from django.conf import settings
from django.contrib import messages
from django.core.files.storage import default_storage
from django.shortcuts import redirect
from django.urls import reverse
def rename_folder(request, client_url):
    """
    :param request:
    :param client_url:
    :return:
    """
    current_property = request.session.get('property')
    if request.POST:
        # name the change
        new_name = request.POST['name']
        # old full path with www.[].com?
        old_path = request.POST['old_path']
        # remove the query string
        old_path = ''.join(old_path.split('?')[0])
        # remove the .com prefix item so we have the path in the storage
        old_path = ''.join(old_path.split('.com/')[-1])
        # remove empty values, this will happen at end due to these being folders
        old_path_list = [x for x in old_path.split('/') if x != '']
        # remove the last folder element with split()
        base_path = '/'.join(old_path_list[:-1])
        # now build the new path
        new_path = base_path + f'/{new_name}/'
        # print(old_path_list[:-1], old_path.split('/'), old_path, base_path, new_path)
        endpoint = settings.AWS_S3_ENDPOINT_URL
        # recursively add the files
        copy_command = f"aws s3 --endpoint={endpoint} cp s3://{old_path} s3://{new_path} --recursive"
        remove_command = f"aws s3 --endpoint={endpoint} rm s3://{old_path} --recursive"
        # get_creds() is nothing special, it simply returns the elements needed via boto3
        client, resource, bucket, resource_bucket = get_creds()
        path_viewing = f'{"/".join(old_path.split("/")[1:])}'
        directory_content = default_storage.listdir(path_viewing)
        # loop over folders and add them by default, aws cli does not copy empty ones
        # so this is used to accommodate
        folders, files = directory_content
        for folder in folders:
            new_key = new_path + folder + '/'
            # we must remove bucket name for this to work
            new_key = new_key.split(f"{bucket}/")[-1]
            # push this to new thread
            threading.Thread(target=put_object, args=(client, bucket, new_key,)).start()
            print(f'{new_key} added')
        # run command, which will copy all data
        os.system(copy_command)
        print('Copy Done...')
        os.system(remove_command)
        print('Remove Done...')
        print('Folder renamed.')
        messages.success(request, f'Folder Renamed to: {new_name}')
    return redirect(request.META.get('HTTP_REFERER', f"{reverse('home', args=[client_url])}"))
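The put_object helper passed to the threads is not shown in the answer; a hypothetical sketch of what it could look like is simply creating a zero-byte object so the empty 'folder' key exists:
def put_object(client, bucket, key):
    # hypothetical helper: create an empty object so the 'folder' key exists,
    # since the aws cli skips keys ending in '/' that carry no data
    client.put_object(Bucket=bucket, Key=key)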

Related

Recursively copy a child directory to the parent in Google Cloud Storage

I need to recursively move the contents of a sub-folder to a parent folder in google cloud storage. This code works for moving a single file from sub-folder to the parent.
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
source_path = Path(parent_dir, sub_folder, filename).as_posix()
source_blob = bucket.blob(source_path)
dest_path = Path(parent_dir, filename).as_posix()
bucket.copy_blob(source_blob, bucket, dest_path)
but I don't know how to properly format the command because if my dest_path is "parent_dir", I get the following error:
google.api_core.exceptions.NotFound: 404 POST https://storage.googleapis.com/storage/v1/b/bucket/o/parent_dir%2Fsubfolder/copyTo/b/geo-storage/o/parent_dir?prettyPrint=false: No such object: geo-storage/parent_dir/subfolder
Note: This works for recursive copy with gsutils but I would prefer to use the blob object:
os.system(f"gsutil cp -r gs://bucket/parent_dir/subfolder/* gs://bucket/parent_dir")
GCS does not have the concept of a "directory" - just a flat namespace of objects. You can have objects named "foo/a.txt" and "foo/b.txt", but there is no actual thing representing "foo/"; it's just a prefix on the object names.
gsutil uses prefixes on the object names to pretend directories exist, but under the hood it is really just acting on all the individual objects with that prefix.
You need to do the same, with a copy for each object:
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for source_blob in client.list_blobs(BUCKET_NAME, prefix="old/prefix/"):
    dest_name = source_blob.name.replace("old/prefix/", "new/prefix/")
    # do these in parallel for more speed
    bucket.copy_blob(source_blob, bucket, new_name=dest_name)
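A hedged sketch of doing those copies in parallel and deleting the originals so the sub-folder contents are actually moved (reusing the BUCKET_NAME placeholder and prefixes from the snippet above):
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

def move_one(source_blob):
    # copy to the new prefix, then delete the original to complete the "move"
    dest_name = source_blob.name.replace("old/prefix/", "new/prefix/")
    bucket.copy_blob(source_blob, bucket, new_name=dest_name)
    source_blob.delete()

with ThreadPoolExecutor(max_workers=8) as executor:
    # list() forces the iterator so any per-blob errors surface here
    list(executor.map(move_one, client.list_blobs(BUCKET_NAME, prefix="old/prefix/")))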

AWS Lambda unzips and returns file to s3 bucket -- issue with dropped zip file folder name

I have been having some issues with my AWS Lambda that unzips a file inside our S3 bucket. I created a script that is triggered by a JSON payload passed through to it. The problem is that it seems to be losing the parent folder of the zip file and uploading the child folders underneath it. This is an issue for me because we also have another script I made to parse a log4j file inside a folder to review for errors. That script is having problems because of the lost name that identifies which farm the folder comes from.
To give an example of the issue ---
There's an s3 bucket on us-east, and inside that bucket is a key for "OriginalFolder.zip". When this lambda is activated it unzips and places the child file into the exact same bucket and place where the original zip file is but names it "Log.folder". I want it to keep the original name of the zip file so that when multiple farms are activating this lambda it doesn't overwrite that folder that's created or get confused on which one to read from with the second lambda.
I tried to append something at the end of the created file name to allow for params to be passed through for each farm that runs it but can't seem to make it work. I also contemplated having a separate action called in the script to copy and rename it using boto3 but I would rather not use that as my first choice. I feel there has to be an easier method but might be overlooking it.
Any thoughts would be helpful.
Edit: Here's a picture of the example. The green arrow is what I want it to stay named as. The red arrow is what the file is becoming named inside of our s3 environment. "on1" is the next folder inside "update-dc-logs-test".
import os
import tempfile
import zipfile
from concurrent import futures
from io import BytesIO

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Parse and prepare required items from event
    global bucket, path, zipdata, rn_file
    action = event.get("action", None)
    if action == "create" or action == "update":
        bucket = event['payload']['BucketName']
        key = event['payload']['Key']
        #rn_file = event['payload']['RenameFile']
        path = os.path.dirname(key)

        # Create temporary file
        temp_file = tempfile.mktemp()

        # Fetch and load target file
        s3.download_file(bucket, key, temp_file)
        zipdata = zipfile.ZipFile(temp_file)

        # Call action method with using ThreadPool
        with futures.ThreadPoolExecutor(max_workers=4) as executor:
            future_list = [
                executor.submit(extract, filename)
                for filename in zipdata.namelist()
            ]
            result = {'success': [], 'fail': []}
            for future in future_list:
                filename, status = future.result()
                result[status].append(filename)
        return result

def extract(filename):
    # Extract zip and place it back in bucket
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            BytesIO(zipdata.read(filename)),
            bucket,
            os.path.join(path, filename)
        )
    except Exception:
        upload_status = 'fail'
    finally:
        return filename, upload_status
You are prefixing all uploaded files with path which is the path at which the ZIP file is found. If you want the uploaded files to be stored below a prefix which is the path and name of the ZIP file (minus the .zip extension), then change the value of path to this:
path = os.path.splitext(key)[0]
Now, instead of path holding the ZIP file's folder prefix it will contain the folder prefix plus the first part of the ZIP filename. For example, if an object is uploaded to folder1/myarchive.zip then path would previously contain folder1, but with this change it will now contain folder1/myarchive.
When that new path is combined in the extract function via os.path.join(path, filename), the object will now be uploaded to folder1/myarchive/on1/file.txt.
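As a quick check of that behaviour, using the example key from above (the last printed value is the key the object would be uploaded under):
import os

key = 'folder1/myarchive.zip'
print(os.path.dirname(key))                                     # folder1           (old value of path)
print(os.path.splitext(key)[0])                                 # folder1/myarchive (new value of path)
print(os.path.join(os.path.splitext(key)[0], 'on1/file.txt'))   # folder1/myarchive/on1/file.txt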

Boto3 folder sync under new S3 'folder'

So, before anyone tells me about the flat structure of S3, I already know, but the fact is you can create 'folders' in S3. My objective with this Python code is to create a new folder named using the date of running and appending the user's input to this (which is the createS3Folder function) - I then want to sync a folder in a local directory to this folder.
The problem is that my upload_files function creates a new folder in S3 that exactly emulates the folder structure of my local setup.
Can anyone suggest how I would just sync the folder into the newly created one without changing names?
import sys
import boto3
import datetime
import os

teamName = raw_input("Please enter the name of your project: ")
bucketFolderName = ""

def createS3Folder():
    date = datetime.date.today().strftime("%Y") + "." + \
           datetime.date.today().strftime("%B") + "." + \
           datetime.date.today().strftime("%d")

    date1 = datetime.date.today()
    # In order to generate a 'folder', you must put "/" at the end of the Key
    date = str(date1) + "/"

    bucketFolderName = date + teamName + "/"

    client = boto3.client('s3')
    client.put_object(Bucket='MY_BUCKET', Key=bucketFolderName)

    upload_files('/Users/local/directory/to/sync')

def upload_files(path):
    session = boto3.Session()
    s3 = session.resource('s3')
    bucket = s3.Bucket('MY_BUCKET')

    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                bucket.put_object(Key=bucketFolderName, Body=data)

def main():
    createS3Folder()

if __name__ == "__main__":
    main()
Your upload_files() function is uploading to:
bucket.put_object(Key=bucketFolderName, Body=data)
This means that the object name ("Key") on S3 will be just the name of the 'folder'. It should also include the filename, for example:
bucket.put_object(Key=bucketFolderName + file, Body=data)
(bucketFolderName already ends with a /, so the filename can be appended directly.)
The Key is the full path of the destination object, including the filename (not just a 'directory').
In fact, there is no need to create the 'folder' beforehand -- just upload to the desired Key.
If you are feeling lazy, use the AWS Command-Line Interface (CLI) aws s3 sync command to do it for you!
"the fact is you can create 'folders' in S3"
No, you can't.
You can create an empty object that looks like a folder in the console, but it is still not a folder; it has no meaning and is unnecessary. If you delete it via the API, all the files you thought were "in" the folder will still be in the bucket. (If you delete it from the console, all the contents are deleted from the bucket, because the console explicitly deletes every object starting with that key prefix.)
The folder you are creating is not a container and cannot have anything inside it, because S3 does not have folders that are containers.
If you want to store a file cat.png and make it look like it's in the hat/ folder, you simply set the object key to hat/cat.png. This has exactly the same effect as what you observe in the console, whether or not the hat/ folder was explicitly created.
To do what you want, you simply build the desired object key for each object with string manipulation, including your common prefix ("folder name") and / delimiters. Any folder structure the / delimiters imply will be displayed in the console as a result.
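For instance, a minimal sketch of an upload_files() that builds such keys, assuming a prefix that already ends with / (like the bucketFolderName in the question):
import os
import boto3

def upload_files(path, bucket_name, prefix):
    # prefix is the common 'folder name' ending in '/', e.g. '2018.May.01/myteam/'
    s3 = boto3.Session().resource('s3')
    bucket = s3.Bucket(bucket_name)
    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            # keep the path relative to the synced folder, under the common prefix
            relative_path = os.path.relpath(full_path, path).replace(os.sep, '/')
            with open(full_path, 'rb') as data:
                bucket.put_object(Key=prefix + relative_path, Body=data)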

How to move files in Google Cloud Storage from one bucket to another bucket by Python

Is there an API function that allows us to move files in Google Cloud Storage from one bucket to another bucket?
The scenario is that we want to use Python to move files that have been read in bucket A to bucket B. I know that gsutil can do this, but I am not sure whether Python supports it.
Thanks.
Here's a function I use when moving blobs between directories within the same bucket or to a different bucket.
from google.cloud import storage
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_your_creds.json"

def mv_blob(bucket_name, blob_name, new_bucket_name, new_blob_name):
    """
    Function for moving files between directories or buckets. It will use GCP's copy
    function, then delete the blob from the old location.

    inputs
    -----
    bucket_name: name of bucket
    blob_name: str, name of file
        ex. 'data/some_location/file_name'
    new_bucket_name: name of bucket (can be same as original if we're just moving around directories)
    new_blob_name: str, name of file in new directory in target bucket
        ex. 'data/destination/file_name'
    """
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)

    # copy to new destination
    new_blob = source_bucket.copy_blob(source_blob, destination_bucket, new_blob_name)
    # delete in old destination
    source_blob.delete()

    print(f'File moved from {source_blob} to {new_blob_name}')
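Usage would then look something like this (hypothetical bucket and blob names):
mv_blob("bucket_a", "data/some_location/file_name.csv",
        "bucket_b", "data/destination/file_name.csv")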
Using the google-api-python-client, there is an example on the storage.objects.copy page. After you copy, you can delete the source with storage.objects.delete.
import json

destination_object_resource = {}
req = client.objects().copy(
    sourceBucket=bucket1,
    sourceObject=old_object,
    destinationBucket=bucket2,
    destinationObject=new_object,
    body=destination_object_resource)
resp = req.execute()
print(json.dumps(resp, indent=2))

client.objects().delete(
    bucket=bucket1,
    object=old_object).execute()
you can use GCS Client Library Functions documented at [1] to read to one bucket and write to the other and then delete source file.
You can even use the GCS REST API documented at [2].
Link:
[1] - https://developers.google.com/appengine/docs/python/googlecloudstorageclient/functions
[2] - https://developers.google.com/storage/docs/concepts-techniques#overview
from google.cloud import storage

storage_client = storage.Client()

def GCP_BUCKET_A_TO_B():
    source_bucket = storage_client.get_bucket("Bucket_A_Name")
    destination_bucket = storage_client.get_bucket("Bucket_B_Name")
    filenames = [blob.name for blob in list(source_bucket.list_blobs(prefix=""))]
    for i in range(0, len(filenames)):
        source_blob = source_bucket.blob(filenames[i])
        new_blob = source_bucket.copy_blob(source_blob, destination_bucket, filenames[i])
I just wanted to point out another possible approach: calling gsutil through the subprocess module.
The advantages of using gsutil like that:
You don't have to deal with individual blobs
gsutil's implementation of move, and especially rsync, will probably be much better and more resilient than what we write ourselves.
The disadvantages:
You can't deal with individual blobs easily
It's hacky and generally a library is preferable to executing shell commands
Example:
import subprocess

def move(source_uri: str, destination_uri: str) -> None:
    """
    Move file from source_uri to destination_uri.

    :param source_uri: gs:// - like uri of the source file/directory
    :param destination_uri: gs:// - like uri of the destination file/directory
    :return: None
    """
    cmd = f"gsutil -m mv {source_uri} {destination_uri}"
    # shell=True runs the command string as-is; check=True raises on failure
    subprocess.run(cmd, shell=True, check=True)
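Usage is then a one-liner (hypothetical URIs):
move("gs://my-bucket/old_dir", "gs://my-bucket/new_dir")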

How to determine the Dropbox folder location programmatically?

I have a script that is intended to be run by multiple users on multiple computers, and they don't all have their Dropbox folders in their respective home directories. I'd hate to have to hard-code paths in the script. I'd much rather figure out the path programmatically.
Any suggestions welcome.
EDIT:
I am not using the Dropbox API in the script, the script simply reads files in a specific Dropbox folder shared between the users. The only thing I need is the path to the Dropbox folder, as I of course already know the relative path within the Dropbox file structure.
EDIT:
If it matters, I am using Windows 7.
I found the answer here. Taking the 2nd line of ~\AppData\Roaming\Dropbox\host.db and decoding it with base64 gives the path.
def _get_appdata_path():
    import ctypes
    from ctypes import wintypes, windll
    CSIDL_APPDATA = 26
    _SHGetFolderPath = windll.shell32.SHGetFolderPathW
    _SHGetFolderPath.argtypes = [wintypes.HWND,
                                 ctypes.c_int,
                                 wintypes.HANDLE,
                                 wintypes.DWORD,
                                 wintypes.LPCWSTR]
    path_buf = ctypes.create_unicode_buffer(wintypes.MAX_PATH)
    result = _SHGetFolderPath(0, CSIDL_APPDATA, 0, 0, path_buf)
    return path_buf.value

def dropbox_home():
    from platform import system
    import base64
    import os.path
    _system = system()
    if _system in ('Windows', 'cli'):
        host_db_path = os.path.join(_get_appdata_path(),
                                    'Dropbox',
                                    'host.db')
    elif _system in ('Linux', 'Darwin'):
        host_db_path = os.path.expanduser('~'
                                          '/.dropbox'
                                          '/host.db')
    else:
        raise RuntimeError('Unknown system={}'.format(_system))
    if not os.path.exists(host_db_path):
        raise RuntimeError("Config path={} doesn't exist".format(host_db_path))
    with open(host_db_path, 'r') as f:
        data = f.read().split()
    return base64.b64decode(data[1])
There is an answer to this on Dropbox Help Center - How can I programmatically find the Dropbox folder paths?
Short version:
Use ~/.dropbox/info.json or %APPDATA%\Dropbox\info.json
Long version:
Access the valid %APPDATA% or %LOCALAPPDATA% location this way:
import os
from pathlib import Path
import json

try:
    # strict=True makes resolve() raise FileNotFoundError if the file is missing (Python 3.6+)
    json_path = (Path(os.getenv('LOCALAPPDATA')) / 'Dropbox' / 'info.json').resolve(strict=True)
except FileNotFoundError:
    json_path = (Path(os.getenv('APPDATA')) / 'Dropbox' / 'info.json').resolve(strict=True)

with open(str(json_path)) as f:
    j = json.load(f)

personal_dbox_path = Path(j['personal']['path'])
business_dbox_path = Path(j['business']['path'])
You could search the file system using os.walk. The Dropbox folder is probably within the home directory of the user, so to save some time you could limit your search to that. Example:
import os

dropbox_folder = None

for dirname, dirnames, filenames in os.walk(os.path.expanduser('~')):
    for subdirname in dirnames:
        if subdirname == 'Dropbox':
            dropbox_folder = os.path.join(dirname, subdirname)
            break
    if dropbox_folder:
        break

# dropbox_folder now contains the full path to the Dropbox folder, or
# None if the folder wasn't found
Alternatively you could prompt the user for the Dropbox folder location, or make it configurable via a config file.
This adaptation based on J.F. Sebastian's suggestion works for me on Ubuntu:
os.path.expanduser('~/Dropbox')
And to actually set the working directory to be there:
os.chdir(os.path.expanduser('~/Dropbox'))
Note: answer is valid for Dropbox v2.8 and higher
Windows
jq -r ".personal.path" < %APPDATA%\Dropbox\info.json
This requires the jq JSON parser utility to be installed. If you are a happy user of the Chocolatey package manager, just run choco install jq first.
Linux
jq -r ".personal.path" < ~/.dropbox/info.json
Similarly to Windows, install jq using your distro's package manager.
Note: requires Dropbox >= 2.8
Dropbox now stores the paths in json format in a file called info.json. It is located in one of the two following locations:
%APPDATA%\Dropbox\info.json
%LOCALAPPDATA%\Dropbox\info.json
I can access the %APPDATA% environment variable in Python via os.environ['APPDATA']; however, I check both that and os.environ['LOCALAPPDATA']. Then I convert the JSON into a dictionary and read the 'path' value under the appropriate Dropbox account (business or personal).
Calling get_dropbox_location() from the code below will return the filepath of the business Dropbox, while get_dropbox_location('personal') will return the file path of the personal Dropbox.
import os
import json

def get_dropbox_location(account_type='business'):
    """
    Returns a string of the filepath of the Dropbox for this user

    :param account_type: str, 'business' or 'personal'
    """
    info_path = _get_dropbox_info_path()
    info_dict = _get_dictionary_from_path_to_json(info_path)
    return _get_dropbox_path_from_dictionary(info_dict, account_type)

def _get_dropbox_info_path():
    """
    Returns filepath of Dropbox file info.json
    """
    path = _create_dropox_info_path('APPDATA')
    if path:
        return path
    return _create_dropox_info_path('LOCALAPPDATA')

def _create_dropox_info_path(appdata_str):
    r"""
    Looks up the environment variable given by appdata_str and combines with \Dropbox\info.json

    Then checks if the info.json exists at that path, and if so returns the filepath, otherwise
    returns False
    """
    path = os.path.join(os.environ[appdata_str], r'Dropbox\info.json')
    if os.path.exists(path):
        return path
    return False

def _get_dictionary_from_path_to_json(info_path):
    """
    Loads a json file and returns as a dictionary
    """
    with open(info_path, 'r') as f:
        text = f.read()
    return json.loads(text)

def _get_dropbox_path_from_dictionary(info_dict, account_type):
    """
    Returns the 'path' value under the account_type dictionary within the main dictionary
    """
    return info_dict[account_type]['path']
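Usage then looks like this (the returned paths are illustrative):
print(get_dropbox_location())             # e.g. C:\Users\me\Dropbox (CompanyName)
print(get_dropbox_location('personal'))   # e.g. C:\Users\me\Dropbox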
This is a pure Python solution, unlike the other info.json answers that rely on external tools such as jq.
One option is to search for the .dropbox.cache directory, which (at least on Mac and Linux) is a hidden folder inside the Dropbox directory.
I am fairly certain that Dropbox stores its preferences in an encrypted .dbx container, so extracting it using the same method that Dropbox uses is not trivial.
This should work on Win7. The use of getEnvironmentVariable("APPDATA") instead of os.getenv('APPDATA') supports Unicode filepaths -- see question titled Problems with umlauts in python appdata environvent variable.
import base64
import ctypes
import os

def getEnvironmentVariable(name):
    """ read windows native unicode environment variables """
    # (could just use the os.environ dict in Python 3)
    name = unicode(name)  # make sure string argument is unicode
    n = ctypes.windll.kernel32.GetEnvironmentVariableW(name, None, 0)
    if not n:
        return None
    else:
        buf = ctypes.create_unicode_buffer(u'\0' * n)
        ctypes.windll.kernel32.GetEnvironmentVariableW(name, buf, n)
        return buf.value

def getDropboxRoot():
    # find the path for Dropbox's root watch folder from its sqlite host.db database.
    # Dropbox stores its databases under the currently logged in user's %APPDATA% path,
    # usually "C:\Documents and Settings\<login_account>\Application Data".
    # If you have installed multiple instances of dropbox under the same login this only finds the 1st one.
    sConfigFile = os.path.join(getEnvironmentVariable("APPDATA"),
                               'Dropbox', 'host.db')

    # return None if we can't find or read the database file.
    if not os.path.exists(sConfigFile):
        return None

    # the Dropbox watch folder location is base64 encoded as the last line of the host.db file.
    with open(sConfigFile) as dbxfile:
        for sLine in dbxfile:
            pass

    # decode the last line, the path to the dropbox watch folder with no trailing slash.
    return base64.b64decode(sLine)

if __name__ == '__main__':
    print getDropboxRoot()
