Django-storages not detecting changed static files - python

I'm using django-storages and amazon s3 for my static files. Following the documentation, I put these settings in my settings.py
STATIC_URL = 'https://mybucket.s3.amazonaws.com/'
ADMIN_MEDIA_PREFIX = 'https://mybucket.s3.amazonaws.com/admin/'
INSTALLED_APPS += (
'storages',
)
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
AWS_ACCESS_KEY_ID = 'mybucket_key_id'
AWS_SECRET_ACCESS_KEY = 'mybucket_access_key'
AWS_STORAGE_BUCKET_NAME = 'mybucket'
STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
And the first time I ran collect static everything worked correctly and my static files were uploaded to my s3 bucket.
However, after making changes to my static files and running python manage.py collectstatic this is outputted despite the fact that static files were modified
-----> Collecting static files
0 static files copied, 81 unmodified.
However, if I rename the changed static file, the changed static file is correctly copied to my s3 bucket.
Why isn't django-storages uploading my changed static files? Is there a configuration problem or is the problem deeper?

collectstatic skips files if "target" file is "younger" than source file. Seems like amazon S3 storage returns wrong date for you file.
you could investigate [code][1] and debug server responses. Maybe there is a problem with timezone.
Or you could just pass --clear argument to collectstatic so that all files are deleted on S3 before collecting

https://github.com/antonagestam/collectfast
From readme.txt : Custom management command that compares the MD5 sum and etag from S3 and if the two are the same skips file copy. This makes running collect static MUCH faster
if you are using git as a source control system which updates timestamps.

Create a settings file just for collectstatic sync, with this config:
TIME_ZONE = 'UTC'
Run collectstatic with a specific settings with this line:
python manage.py collectstatic --settings=settings.collectstatic

This question is a little old but in case it helps someone in the future, I figured I'd share my experience. Following advice found in other threads I confirmed that, for me, this was indeed caused by a difference in time zone. My django time wasn't incorrect but was set to EST and S3 was set to GMT. In testing, I reverted to django-storages 1.1.5 which did seem to get collectstatic working. Partially due to personal preference, I was unwilling to a) roll back three versions of django-storages and lose any potential bug fixes or b) alter time zones for components of my project for what essentially boils down to a convenience function (albeit an important one).
I wrote a short script to do the same job as collectstatic without the aforementioned alterations. It will need a little modifying for your app but should work for standard cases if it is placed at the app level and 'static_dirs' is replaced with the names of your project's apps. It is run via terminal with 'python whatever_you_call_it.py -e environment_name (set this to your aws bucket).
import sys, os, subprocess
import boto3
import botocore
from boto3.session import Session
import argparse
import os.path, time
from datetime import datetime, timedelta
import pytz
utc = pytz.UTC
DEV_BUCKET_NAME = 'dev-homfield-media-root'
PROD_BUCKET_NAME = 'homfield-media-root'
static_dirs = ['accounts', 'messaging', 'payments', 'search', 'sitewide']
def main():
try:
parser = argparse.ArgumentParser(description='Homfield Collectstatic. Our version of collectstatic to fix django-storages bug.\n')
parser.add_argument('-e', '--environment', type=str, required=True, help='Name of environment (dev/prod)')
args = parser.parse_args()
vargs = vars(args)
if vargs['environment'] == 'dev':
selected_bucket = DEV_BUCKET_NAME
print "\nAre you sure? You're about to push to the DEV bucket. (Y/n)"
elif vargs['environment'] == 'prod':
selected_bucket = PROD_BUCKET_NAME
print "Are you sure? You're about to push to the PROD bucket. (Y/n)"
else:
raise ValueError
acceptable = ['Y', 'y', 'N', 'n']
confirmation = raw_input().strip()
while confirmation not in acceptable:
print "That's an invalid response. (Y/n)"
confirmation = raw_input().strip()
if confirmation == 'Y' or confirmation == 'y':
run(selected_bucket)
else:
print "Collectstatic aborted."
except Exception as e:
print type(e)
print "An error occured. S3 staticfiles may not have been updated."
def run(bucket_name):
#open session with S3
session = Session(aws_access_key_id='{aws_access_key_id}',
aws_secret_access_key='{aws_secret_access_key}',
region_name='us-east-1')
s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)
# loop through static directories
for directory in static_dirs:
rootDir = './' + directory + "/static"
print('Checking directory: %s' % rootDir)
#loop through subdirectories
for dirName, subdirList, fileList in os.walk(rootDir):
#loop through all files in subdirectory
for fname in fileList:
try:
if fname == '.DS_Store':
continue
# find and qualify file last modified time
full_path = dirName + "/" + fname
last_mod_string = time.ctime(os.path.getmtime(full_path))
file_last_mod = datetime.strptime(last_mod_string, "%a %b %d %H:%M:%S %Y") + timedelta(hours=5)
file_last_mod = utc.localize(file_last_mod)
# truncate path for S3 loop and find object, delete and update if it has been updates
s3_path = full_path[full_path.find('static'):]
found = False
for key in bucket.objects.all():
if key.key == s3_path:
found = True
last_mode_date = key.last_modified
if last_mode_date < file_last_mod:
key.delete()
s3.Object(bucket_name, s3_path).put(Body=open(full_path, 'r'), ContentType=get_mime_type(full_path))
print "\tUpdated : " + full_path
if not found:
# if file not found in S3 it is new, send it up
print "\tFound a new file. Uploading : " + full_path
s3.Object(bucket_name, s3_path).put(Body=open(full_path, 'r'), ContentType=get_mime_type(full_path))
except:
print "ALERT: Big time problems with: " + full_path + ". I'm bowin' out dawg, this shitz on u."
def get_mime_type(full_path):
try:
last_index = full_path.rfind('.')
if last_index < 0:
return 'application/octet-stream'
extension = full_path[last_index:]
return {
'.js' : 'application/javascript',
'.css' : 'text/css',
'.txt' : 'text/plain',
'.png' : 'image/png',
'.jpg' : 'image/jpeg',
'.jpeg' : 'image/jpeg',
'.eot' : 'application/vnd.ms-fontobject',
'.svg' : 'image/svg+xml',
'.ttf' : 'application/octet-stream',
'.woff' : 'application/x-font-woff',
'.woff2' : 'application/octet-stream'
}[extension]
except:
'ALERT: Couldn\'t match mime type for '+ full_path + '. Sending to S3 as application/octet-stream.'
if __name__ == '__main__':
main()

Related

Serving CHANGING static django files in production

I need to serve a static file that has a changing name. This works fine with DEBUG = True, but immediately breaks with DEBUG = False. I'm currently using Whitenoise to serve files in production, and tried executing collectstatic after the filename changes, but that didn't help. Any advice is appreciated
Edit: This is how the filename is changed
newname = 'myfile' + str(random.randint(0, 999999)) + '.css'
newpath = os.path.join(current_dir, ('../staticfiles/stuff/css/' + newname))
os.rename(r'{}'.format(filepath), r'{}'.format(newpath))
I can't quite give full context for why I need this file's name to change

Can i have two different sets of app configurations?

I have made a flask server REST API that does an upload to a specific folder.
The upload must comply to a specific file extension, and must be in a specific folder.
So i made these two app.configs:
app.config['UPLOAD_EXTENSIONS'] = ['.stp', '.step']
app.config['UPLOAD_PATH'] = 'uploads'
However, i want to make a new route, for a new upload of a specific file extension to another folder.
So can i have two sets of app.config['UPLOAD_EXTENSIONS'] and app.config['UPLOAD_PATH']?
One set will be for extension1 in folder1 and the other set for extension2 in folder2.
Try using the extension Flask-Uploads .
Or, proceeding from the file format, form a subdirectory in your UPLOAD_PATH.
import os
def select_directory(filename: str) -> str:
file_format = filename.split('.')[1]
your_path1 = 'path1'
your_path2 = 'path2'
if file_format in ('your format1', 'your format2'):
full_path = os.path.join(app.config['UPLOAD_FOLDER'], your_path1, filename)
else:
full_path = os.path.join(app.config['UPLOAD_FOLDER'], your_path2, filename)
return full_path

Boto3 to download all files from a S3 Bucket

I'm using boto3 to get files from s3 bucket. I need a similar functionality like aws s3 sync
My current code is
#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
s3.download_file('my_bucket_name', key['Key'], key['Key'])
This is working fine, as long as the bucket has only files.
If a folder is present inside the bucket, its throwing an error
Traceback (most recent call last):
File "./test", line 6, in <module>
s3.download_file('my_bucket_name', key['Key'], key['Key'])
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
extra_args, callback)
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
self._get_object(bucket, key, filename, extra_args, callback)
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
extra_args, callback)
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
with self._osutil.open(filename, 'wb') as f:
File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'
Is this a proper way to download a complete s3 bucket using boto3. How to download folders.
I have the same needs and created the following function that download recursively the files.
The directories are created locally only if they contain files.
import boto3
import os
def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
if result.get('CommonPrefixes') is not None:
for subdir in result.get('CommonPrefixes'):
download_dir(client, resource, subdir.get('Prefix'), local, bucket)
for file in result.get('Contents', []):
dest_pathname = os.path.join(local, file.get('Key'))
if not os.path.exists(os.path.dirname(dest_pathname)):
os.makedirs(os.path.dirname(dest_pathname))
if not file.get('Key').endswith('/'):
resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)
The function is called that way:
def _start():
client = boto3.client('s3')
resource = boto3.resource('s3')
download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')
When working with buckets that have 1000+ objects its necessary to implement a solution that uses the NextContinuationToken on sequential sets of, at most, 1000 keys. This solution first compiles a list of objects then iteratively creates the specified directories and downloads the existing objects.
import boto3
import os
s3_client = boto3.client('s3')
def download_dir(prefix, local, bucket, client=s3_client):
"""
params:
- prefix: pattern to match in s3
- local: local path to folder in which to place files
- bucket: s3 bucket with target contents
- client: initialized s3 client object
"""
keys = []
dirs = []
next_token = ''
base_kwargs = {
'Bucket':bucket,
'Prefix':prefix,
}
while next_token is not None:
kwargs = base_kwargs.copy()
if next_token != '':
kwargs.update({'ContinuationToken': next_token})
results = client.list_objects_v2(**kwargs)
contents = results.get('Contents')
for i in contents:
k = i.get('Key')
if k[-1] != '/':
keys.append(k)
else:
dirs.append(k)
next_token = results.get('NextContinuationToken')
for d in dirs:
dest_pathname = os.path.join(local, d)
if not os.path.exists(os.path.dirname(dest_pathname)):
os.makedirs(os.path.dirname(dest_pathname))
for k in keys:
dest_pathname = os.path.join(local, k)
if not os.path.exists(os.path.dirname(dest_pathname)):
os.makedirs(os.path.dirname(dest_pathname))
client.download_file(bucket, k, dest_pathname)
import os
import boto3
#initiate s3 resource
s3 = boto3.resource('s3')
# select bucket
my_bucket = s3.Bucket('my_bucket_name')
# download file into current directory
for s3_object in my_bucket.objects.all():
# Need to split s3_object.key into path and file name, else it will give error file not found.
path, filename = os.path.split(s3_object.key)
my_bucket.download_file(s3_object.key, filename)
Amazon S3 does not have folders/directories. It is a flat file structure.
To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:
images/foo.jpg
In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.
I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.
You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that it could be multi-level nested directories.
An easier way would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, eg:
aws s3 cp --recursive s3://my_bucket_name local_folder
There's also a sync option that will only copy new and modified files.
I'm currently achieving the task, by using the following
#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='bucket')['Contents']
for s3_key in list:
s3_object = s3_key['Key']
if not s3_object.endswith("/"):
s3.download_file('bucket', s3_object, s3_object)
else:
import os
if not os.path.exists(s3_object):
os.makedirs(s3_object)
Although, it does the job, I'm not sure its good to do this way.
I'm leaving it here to help other users and further answers, with better manner of achieving this
Better late than never:) The previous answer with paginator is really good. However it is recursive, and you might end up hitting Python's recursion limits. Here's an alternate approach, with a couple of extra checks.
import os
import errno
import boto3
def assert_dir_exists(path):
"""
Checks if directory tree in path exists. If not it created them.
:param path: the path to check if it exists
"""
try:
os.makedirs(path)
except OSError as e:
if e.errno != errno.EEXIST:
raise
def download_dir(client, bucket, path, target):
"""
Downloads recursively the given S3 path to the target directory.
:param client: S3 client to use.
:param bucket: the name of the bucket to download from
:param path: The S3 directory to download.
:param target: the local directory to download the files to.
"""
# Handle missing / at end of prefix
if not path.endswith('/'):
path += '/'
paginator = client.get_paginator('list_objects_v2')
for result in paginator.paginate(Bucket=bucket, Prefix=path):
# Download each file individually
for key in result['Contents']:
# Calculate relative path
rel_path = key['Key'][len(path):]
# Skip paths ending in /
if not key['Key'].endswith('/'):
local_file_path = os.path.join(target, rel_path)
# Make sure directories exist
local_file_dir = os.path.dirname(local_file_path)
assert_dir_exists(local_file_dir)
client.download_file(bucket, key['Key'], local_file_path)
client = boto3.client('s3')
download_dir(client, 'bucket-name', 'path/to/data', 'downloads')
A lot of the solutions here get pretty complicated. If you're looking for something simpler, cloudpathlib wraps things in a nice way for this use case that will download directories or files.
from cloudpathlib import CloudPath
cp = CloudPath("s3://bucket/product/myproject/2021-02-15/")
cp.download_to("local_folder")
Note: for large folders with lots of files, awscli at the command line is likely faster.
I have a workaround for this that runs the AWS CLI in the same process.
Install awscli as python lib:
pip install awscli
Then define this function:
from awscli.clidriver import create_clidriver
def aws_cli(*cmd):
old_env = dict(os.environ)
try:
# Environment
env = os.environ.copy()
env['LC_CTYPE'] = u'en_US.UTF'
os.environ.update(env)
# Run awscli in the same process
exit_code = create_clidriver().main(*cmd)
# Deal with problems
if exit_code > 0:
raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
finally:
os.environ.clear()
os.environ.update(old_env)
To execute:
aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')
import boto3, os
s3 = boto3.client('s3')
def download_bucket(bucket):
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket)
for page in pages:
if 'Contents' in page:
for obj in page['Contents']:
os.path.dirname(obj['Key']) and os.makedirs(os.path.dirname(obj['Key']), exist_ok=True)
try:
s3.download_file(bucket, obj['Key'], obj['Key'])
except NotADirectoryError:
pass
# Change bucket_name to name of bucket that you want to download
download_bucket(bucket_name)
This should work for all number of objects (also when there are more than 1000). Each paginator page can contain up to 1000 objects.Notice extra param in os.makedirs function - exist_ok=True which cause that it's not throwing error when path exist)
I've updated Grant's answer to run in parallel, it's much faster if anyone is interested:
from concurrent import futures
import os
import boto3
def download_dir(prefix, local, bucket):
client = boto3.client('s3')
def create_folder_and_download_file(k):
dest_pathname = os.path.join(local, k)
if not os.path.exists(os.path.dirname(dest_pathname)):
os.makedirs(os.path.dirname(dest_pathname))
print(f'downloading {k} to {dest_pathname}')
client.download_file(bucket, k, dest_pathname)
keys = []
dirs = []
next_token = ''
base_kwargs = {
'Bucket': bucket,
'Prefix': prefix,
}
while next_token is not None:
kwargs = base_kwargs.copy()
if next_token != '':
kwargs.update({'ContinuationToken': next_token})
results = client.list_objects_v2(**kwargs)
contents = results.get('Contents')
for i in contents:
k = i.get('Key')
if k[-1] != '/':
keys.append(k)
else:
dirs.append(k)
next_token = results.get('NextContinuationToken')
for d in dirs:
dest_pathname = os.path.join(local, d)
if not os.path.exists(os.path.dirname(dest_pathname)):
os.makedirs(os.path.dirname(dest_pathname))
with futures.ThreadPoolExecutor() as executor:
futures.wait(
[executor.submit(create_folder_and_download_file, k) for k in keys],
return_when=futures.FIRST_EXCEPTION,
)
Yet another parallel downloader using asyncio/aioboto
import os, time
import asyncio
from itertools import chain
import json
from typing import List
from json.decoder import WHITESPACE
import logging
from functools import partial
from pprint import pprint as pp
# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config
_NUM_WORKERS = 50
bucket_name= 'test-data'
bucket_prefix= 'etl2/test/20210330/f_api'
async def save_to_file(s3_client, bucket: str, key: str):
response = await s3_client.get_object(Bucket=bucket, Key=key)
async with response['Body'] as stream:
content = await stream.read()
if 1:
fn =f'out/downloaded/{bucket_name}/{key}'
dn= os.path.dirname(fn)
if not isdir(dn):
os.makedirs(dn,exist_ok=True)
if 1:
with open(fn, 'wb') as fh:
fh.write(content)
print(f'Downloaded to: {fn}')
return [0]
async def go(bucket: str, prefix: str) -> List[dict]:
"""
Returns list of dicts of object contents
:param bucket: s3 bucket
:param prefix: s3 bucket prefix
:return: list of download statuses
"""
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
session = aiobotocore.session.AioSession()
config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
contents = []
async with session.create_client('s3', config=config) as client:
worker_co = partial(save_to_file, client, bucket)
async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
# list s3 objects using paginator
paginator = client.get_paginator('list_objects')
async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
for c in result.get('Contents', []):
contents.append(await work_pool.push(c['Key'], client))
# retrieve results from futures
contents = [c.result() for c in contents]
return list(chain.from_iterable(contents))
def S3_download_bucket_files():
s = time.perf_counter()
_loop = asyncio.get_event_loop()
_result = _loop.run_until_complete(go(bucket_name, bucket_prefix))
assert sum(_result)==0, _result
print(_result)
elapsed = time.perf_counter() - s
print(f"{__file__} executed in {elapsed:0.2f} seconds.")
It will fetch list of files from S3 first and then download using aioboto with _NUM_WORKERS=50 reading data in parallel from the network.
If you want to call a bash script using python, here is a simple method to load a file from a folder in S3 bucket to a local folder (in a Linux machine) :
import boto3
import subprocess
import os
###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###
# 1.Load thes list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
# print(object_summary.key)
FILES_NAMES.append(object_summary.key)
# 2.List only new files that do not exist in local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES )-set(os.listdir(local_folder_path)))
# 3.Time to load files in your destination folder
for new_filename in new_filenames:
upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name,bucket_folder_name,new_filename ,local_folder_path)
subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
if subprocess_call != 0:
print("ALERT: loading files not working correctly, please re-check new loaded files")
From AWS S3 Docs (How do I use folders in an S3 bucket?):
In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.
For example, you can create a folder on the console named photos and store an object named myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.
To download all files from "mybucket" into the current directory respecting the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):
import boto3
import os
bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket = bucket_name)["Contents"]
for s3_object in objects:
s3_key = s3_object["Key"]
path, filename = os.path.split(s3_key)
if len(path) != 0 and not os.path.exists(path):
os.makedirs(path)
if not s3_key.endswith("/"):
download_to = path + '/' + filename if path else filename
s3.download_file(bucket_name, s3_key, download_to)
It is a very bad idea to get all files in one go, you should rather get it in batches.
One implementation which I use to fetch a particular folder (directory) from S3 is,
def get_directory(directory_path, download_path, exclude_file_names):
# prepare session
session = Session(aws_access_key_id, aws_secret_access_key, region_name)
# get instances for resource and bucket
resource = session.resource('s3')
bucket = resource.Bucket(bucket_name)
for s3_key in self.client.list_objects(Bucket=self.bucket_name, Prefix=directory_path)['Contents']:
s3_object = s3_key['Key']
if s3_object not in exclude_file_names:
bucket.download_file(file_path, download_path + str(s3_object.split('/')[-1])
and still if you want to get the whole bucket use it via CLI as #John Rotenstein mentioned as below,
aws s3 cp --recursive s3://bucket_name download_path
for objs in my_bucket.objects.all():
print(objs.key)
path='/tmp/'+os.sep.join(objs.key.split(os.sep)[:-1])
try:
if not os.path.exists(path):
os.makedirs(path)
my_bucket.download_file(objs.key, '/tmp/'+objs.key)
except FileExistsError as fe:
print(objs.key+' exists')
This code will download the content in /tmp/ directory. If you want you can change the directory.
I got the similar requirement and got help from reading few of the above solutions and across other websites, I have came up with below script, Just wanted to share if it might help anyone.
from boto3.session import Session
import os
def sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path):
session = Session(aws_access_key_id=access_key_id,aws_secret_access_key=secret_access_key)
s3 = session.resource('s3')
your_bucket = s3.Bucket(bucket_name)
for s3_file in your_bucket.objects.all():
if folder in s3_file.key:
file=os.path.join(destination_path,s3_file.key.replace('/','\\'))
if not os.path.exists(os.path.dirname(file)):
os.makedirs(os.path.dirname(file))
your_bucket.download_file(s3_file.key,file)
sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path)
Reposting #glefait 's answer with an if condition at the end to avoid os error 20. The first key it gets is the folder name itself which cannot be written in the destination path.
def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
if result.get('CommonPrefixes') is not None:
for subdir in result.get('CommonPrefixes'):
download_dir(client, resource, subdir.get('Prefix'), local, bucket)
for file in result.get('Contents', []):
print("Content: ",result)
dest_pathname = os.path.join(local, file.get('Key'))
print("Dest path: ",dest_pathname)
if not os.path.exists(os.path.dirname(dest_pathname)):
print("here last if")
os.makedirs(os.path.dirname(dest_pathname))
print("else file key: ", file.get('Key'))
if not file.get('Key') == dist:
print("Key not equal? ",file.get('Key'))
resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)enter code here
I have been running into this problem for a while and with all of the different forums I've been through I haven't see a full end-to-end snip-it of what works. So, I went ahead and took all the pieces (add some stuff on my own) and have created a full end-to-end S3 Downloader!
This will not only download files automatically but if the S3 files are in subdirectories, it will create them on the local storage. In my application's instance, I need to set permissions and owners so I have added that too (can be comment out if not needed).
This has been tested and works in a Docker environment (K8) but I have added the environmental variables in the script just in case you want to test/run it locally.
I hope this helps someone out in their quest of finding S3 Download automation. I also welcome any advice, info, etc. on how this can be better optimized if needed.
#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime
import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger
formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')
json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)
#Manual Testing Variables If Needed
#os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
#os.environ["BUCKET_NAME"] = "some_bucket"
#os.environ["AWS_ACCESS_KEY"] = "some_access_key"
#os.environ["AWS_SECRET_KEY"] = "some_secret"
#os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"
#Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
'ERROR' : logging.ERROR,
'INFO' : logging.INFO,
'DEBUG' : logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_level_selector)
#Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))
#Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
logger.debug("Secret: %s" % aws_access_secret_key)
logger.debug("Download location path: %s" % download_location_path)
#Creating Download Directory
if not os.path.exists(download_location_path):
logger.info("Making download directory")
os.makedirs(download_location_path)
#Signal Hooks are fun
class GracefulKiller:
kill_now = False
def __init__(self):
signal.signal(signal.SIGINT, self.exit_gracefully)
signal.signal(signal.SIGTERM, self.exit_gracefully)
def exit_gracefully(self, signum, frame):
self.kill_now = True
#Downloading from S3 Bucket
def download_s3_bucket():
conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
logger.debug("Connection established: ")
bucket = conn.get_bucket(bucket_name)
logger.debug("Bucket: %s" % str(bucket))
bucket_list = bucket.list()
# logger.info("Number of items to download: {0}".format(len(bucket_list)))
for s3_item in bucket_list:
key_string = str(s3_item.key)
logger.debug("S3 Bucket Item to download: %s" % key_string)
s3_path = download_location_path + "/" + key_string
logger.debug("Downloading to: %s" % s3_path)
local_dir = os.path.dirname(s3_path)
if not os.path.exists(local_dir):
logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
os.makedirs(local_dir)
logger.info("Updating local directory permissions to %s" % local_dir)
#Comment or Uncomment Permissions based on Local Usage
os.chmod(local_dir, 0o775)
os.chown(local_dir, 60001, 60001)
logger.debug("Local directory for download: %s" % local_dir)
try:
logger.info("Downloading File: %s" % key_string)
s3_item.get_contents_to_filename(s3_path)
logger.info("Successfully downloaded File: %s" % s3_path)
#Updating Permissions
logger.info("Updating Permissions for %s" % str(s3_path))
#Comment or Uncomment Permissions based on Local Usage
os.chmod(s3_path, 0o664)
os.chown(s3_path, 60001, 60001)
except (OSError, S3ResponseError) as e:
logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
# logger.error("Exception in file download from S3: {}".format(e))
continue
logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
s3_item.delete()
def main():
killer = GracefulKiller()
while not killer.kill_now:
logger.info("Checking for new files on S3 to download...")
download_s3_bucket()
logger.info("Done checking for new files, will check in 120s...")
gc.collect()
sys.stdout.flush()
time.sleep(120)
if __name__ == '__main__':
main()
There are very minor differences in the way S3 organizes files and the way Windows does.
Here is a simple self-documenting example that accounts for those differences.
Also: Think of amazon file names as a normal string. They don't really represent a folder. Amazon SIMULATES folders, so if you try to just shove a file into a NAME of a folder that doesn't exist on your system, it cannot figure out where to place it. So you must MAKE a folder on your system for each simulated folder from S3. If you have a folder within a folder, don't use "mkdir(path)" it won't work. You have to use "makedirs(path)". ANOTHER THING! -> PC file paths are weirdly formatted. Amazon uses "/" and pc uses "\" and it MUST be the same for the whole file name. Check out my code block below (WHICH SHOWS AUTHENTICATION TOO).
In my example, I have a folder in my bucket called "iTovenGUIImages/gui_media". I want to put it in a folder on my system that MAY not exist yet. The folder on my system has it's own special prefix that can be whatever you want in your system as long as it's formatted like a folder path.
import boto3
import cred
import os
locale_file_Imagedirectory = r"C:\\Temp\\App Data\\iToven AI\\" # This is where all GUI files for iToven AI exist on PC
def downloadImageDirectoryS3(remoteDirectoryName, desired_parent_folder):
my_bucket = 'itovenbucket'
s3_resource = boto3.resource('s3', aws_access_key_id=cred.AWSAccessKeyId,
aws_secret_access_key=cred.AWSSecretKey)
bucket = s3_resource.Bucket(my_bucket)
for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
pcVersionPrefix = remoteDirectoryName.replace("/", r"\\")
isolatedFileName = obj.key.replace(remoteDirectoryName, "")
clientSideFileName = desired_parent_folder+pcVersionPrefix+isolatedFileName
print(clientSideFileName) # Client-Side System File Structure
if not os.path.exists(desired_parent_folder+pcVersionPrefix): # CREATE DIRECTORIES FOR EACH FOLDER RECURSIVELY
os.makedirs(desired_parent_folder+pcVersionPrefix)
if obj.key not in desired_parent_folder+pcVersionPrefix:
bucket.download_file(obj.key, clientSideFileName) # save to new path
downloadImageDirectoryS3(r"iTovenGUIImagesPC/gui_media/", locale_file_Imagedirectory)

Python - Watch a folder for new .zip file and upload via FTP

I am working on creating a script to watch a folder, grab any new .zip files, and then upload them via FTP to a predetermined area. Right now FTP testing is being performed Locally, since the environment isnt yet created.
The strategy I am taking is to first, unzip into a local folder. Then, perform ftplib.storbinary , on the file from the local folder, to the ftpdestination. However, the unzipping process appears to be working but I am getting a "file does not exist" error, all though I can see it in the folder itself.
Also, is there anyway to unzip directly into an FTP location? I havent been able to find a way hence the approach I am taking.
Thanks, local ftp info removed from code. All paths that are relevant in this code will be changed, most likely to dynamic fashion, but for now this is a local environment
extractZip2.py
import zipfile
import ftplib
import os
import logging
import time
from socket import error as socket_error
#Logging Setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('__name__')
FTPaddress = ''
FTPusername = ''
FTPpassword = ''
ftp_destination_location = ''
path_to_watch = "C:/Users/206420055/Desktop/test2/"
before = dict ([(f,None) for f in os.listdir(path_to_watch)])
temp_destination_location = "C:/Users/206420055/Desktop/temp/"
def unzip(fullPath,temporaryPath):
with zipfile.ZipFile(fullPath, "r") as z :
logger.info("Unzipping {0}".format(fullPath))
z.extractall(temporaryPath)
logger.info("Unzipped into local directory {0}".format(temp_destination_location))
def check_or_create_ftp(session, folder):
"""
Checks to see if necessary folder for currenttab is available.
Creates the folder if not found, and enters it.
"""
if folder not in session.nlst():
logger.info('Directory for {0} does not exist, creating directory\n'.format(folder))
session.mkd(folder)
session.cwd(folder)
def check_or_create(temp_destination):
"""
Checks to see if local savepath exists. Will create savepath if not exists.
"""
if not os.path.exists(temp_destination):
logger.info('Directory for %s does not exist, creating directory\n' % temp_destination)
os.makedirs(str(temp_destination))
def transfer(address,username,password,filename,destination):
logger.info("Creating Session")
try:
session = session_init(address,username,password,destination)
except (socket_error,ftplib.error_perm) as e:
logger.error(str(e))
logger.error("Error in Session Init")
else:
try:
logger.info("Sending File {0}".format(filename))
send_file(session,filename)
except (IOError, OSError, ftplib.error_perm) as e:
logger.error(e)
def session_init(address,username,password,path):
session = ftplib.FTP(address,username,password)
check_or_create_ftp(session,path)
logger.info("Session Established")
return session
def send_file(session,filename):
file = open(filename,'rb')
logger.info('Sending File : STOR '+filename)
session.storbinary('STOR '+ filename, file)
file.close()
def delete_local_files(savepath, file):
logger.info("Cleaning Up Folder {0}".format(savepath))
os.remove(file)
while 1:
time.sleep(5)
after = dict ([(f,None) for f in os.listdir(path_to_watch)])
added = [f for f in after if not f in before]
removed = [f for f in before if not f in after]
if added: print "Added: ",", ".join(added)
before = after
check_or_create(temp_destination_location)
if added :
for file in added:
print file
if file.endswith('.zip'):
unzip(path_to_watch+file, temp_destination_location)
temp_files = os.listdir(temp_destination_location)
print("Temp Files {0}".format(temp_files))
for tf in temp_files:
print("TF {0}".format(tf))
transfer(FTPaddress,FTPusername,FTPpassword,tf,ftp_destination_location)
#delete_local_files(temp_destination_location,tf)
else:
pass
edit: adding error image
Seen above, we see the file in the temp folder. But the console obviously shows the error.
just change it to
from glob import glob
zips_in_path = dict ([(f,None) for f in glob("{base_path}/*.zip".format(base_path = path_to_watch)])
os.listdir does not include the path_to_watch part of the path it is just the filenames, however glob does.
so you could also do
after = dict ([(os.path.join(path_to_watch,f),None) for f in os.listdir(path_to_watch)])
using either of these methods you should be able to get the full path to the files in the path

Trouble making a good ftp checking python program

Please bear with me. Its only been a week I started on python. Heres the problem: I want to connect to a FTP Server. Presuming the file structure on my ftp and local directory is same. I want my python program to do the following:
1> On running the program, it should upload all the files that are NOT on the server but on my
local(Upload just the missing files-Not replacing all).
Say, I add a new directory or a new file, it should be uploaded on the server as it is.
2> It should then check for modified times of both the files on the local and the server and inform which is the latest.
NOw, I have made two programs:
1>One program that will upload ALL files from local to server, as it is. I would rather it check for missing files and then uipload just the missing files folders. NOt replace all.
2> Second program will list all files from the local, using os.walk and it will upload all files on the server without creating the correct directory structure.
All get copied to the root of server. Then it also checks for modified times.
I am in amess right now trying to JOIN these two modules into one perfect that does all I want it to.Anyone who could actually look at these codes and try joining them to what I wanana do, would be perfect. Sorry for being such a pain!!
PS:I might have not done everything the easy way!!
Code NO 1:
import sys
from ftplib import FTP
import os
def uploadDir(localdir,ftp):
"""
for each directory in an entire tree
upload simple files, recur into subdirectories
"""
localfiles = os.listdir(localdir)
for localname in localfiles:
localpath = os.path.join(localdir, localname)
print ('uploading', localpath, 'to', localname)
if not os.path.isdir(localpath):
os.chdir(localdir)
ftp.storbinary('STOR '+localname, open(localname, 'rb'))
else:
try:
ftp.mkd(localname)
print ('directory created')
except:
print ('directory not created')
ftp.cwd(localname) # change remote dir
uploadDir(localpath,ftp) # upload local subdir
ftp.cwd('..') # change back up
print ('directory exited')
def Connect(path):
ftp = FTP("127.0.0.1")
print ('Logging in.')
ftp.login('User', 'Pass')
uploadDir(path,ftp)
Connect("C:\\temp\\NVIDIA\\Test")
Code No2:
import os,pytz,smtplib
import time
from ftplib import FTP
from datetime import datetime,timedelta
from email.mime.text import MIMEText
def Connect_FTP(fileName,pathToFile): path from the local path
(dir,file) = os.path.split(fileName)
fo = open("D:\log.txt", "a+") # LOgging Important Events
os.chdir(dir)
ftp = FTP("127.0.0.1")
print ('Logging in.')
ftp.login('User', 'Pass')
l=dir.split(pathToFile)
print(l[1])
if file in ftp.nlst():
print("file2Check:"+file)
fo.write(str(datetime.now())+": File is in the Server. Checking the Versions....\n")
Check_Latest(file,fileName,ftp,fo)
else:
print("File is not on the Server. So it is being uploaded!!!")
fo.write(str(datetime.now())+": File is NOT in the Server. It is being Uploaded NOW\n")
ftp.storbinary('STOR '+file, open(file, 'rb'))
print("The End")
def Check_Latest(file2,path_on_local,ftp,fo): # Function to check the latest file, USING the "MOdified TIme"
LOcalFile = os.path.getmtime(path_on_local)
dloc=datetime.fromtimestamp(LOcalFile).strftime("%d %m %Y %H:%M:%S")
print("Local File:"+dloc)
localTimestamp=str(time.mktime(datetime.strptime(dloc, "%d %m %Y %H:%M:%S").timetuple())) # Timestamp to compare LOcalTime
modifiedTime = ftp.sendcmd('MDTM '+file2) # Using MDTM command to get the MOdified time.
IST = pytz.timezone('Asia/Kolkata')
ServTime=datetime.strptime(modifiedTime[4:], "%Y%m%d%H%M%S")
tz = pytz.timezone("UTC")
ServTime = tz.localize(ServTime)
j=str(ServTime.astimezone(IST)) # Changing TimeZone
g=datetime.strptime(j[:-6],"%Y-%m-%d %H:%M:%S")
ServerTime = g.strftime('%d %m %Y %H:%M:%S')
serverTimestamp=str(time.mktime(datetime.strptime(ServerTime, "%d %m %Y %H:%M:%S").timetuple())) # Timestamp to compare Servertime
print("ServerFile:"+ServerTime)
if serverTimestamp > localTimestamp:
print ("Old Version on The Client. You need to update your copy of the file")
fo.write(str(datetime.now())+": Old Version of the file "+file2+" on the Client\n")
return
else:
print ("The File on the Server is Outdated!!!New COpy Uploaded")
fo.write(str(datetime.now())+": The server has an outdated file: "+file2+". An email is being generated\n")
ftp.storbinary('STOR '+file2, open(file2, 'rb'))
def Connect(pathToFile):
for path, subdirs, files in os.walk(pathToFile):
for name in files:
j=os.path.join(path, name)
print(j)
Connect_FTP(j,pathToFile)
Connect("C:\temp\NVIDIA\Test")
May be this script will be useful.

Categories

Resources