I need to generate a zip file from a set of images and attach it to a model's file field in the database.
This is the code that generates the file:
def create_zip_folder(asin):
    print('creating zip folder for asin: ' + asin)
    asin_object = Asin.objects.get(asin=asin)
    # create folder
    output_dir = f"/tmp/{asin}"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    # download images
    for img in asin_object.asin_images.all():
        urllib.request.urlretrieve(img.url, f"{output_dir}/{img.id}.jpg")
    # zip all files in output_dir
    zip_file = shutil.make_archive(asin, 'zip', output_dir)
    asin_object.zip_file = zip_file
    asin_object.has_zip = True
    asin_object.save()
    # delete folder
    shutil.rmtree(output_dir)
    return True
This all works, and I can see the generated files in my editor, but when I try to access asin.zip_file.url in the template I get this error:
SuspiciousOperation at /history/
Attempted access to '/workspace/B08MFR2DRS.zip' denied.
Why is this happening? I thought the file would be uploaded to the file storage through the model, but apparently it is in a restricted folder. This happens both in development (with local file storage) and in production (with an S3 bucket as file storage).
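One likely cause, offered as a hedged sketch rather than a confirmed diagnosis: shutil.make_archive returns the absolute path of the archive it writes into the current working directory (here /workspace/B08MFR2DRS.zip), and assigning that string to the field only records it as the file's name without uploading anything. Storage backends typically reject names that resolve outside their root. The sketch below builds the archive in memory with the standard library. The helper name build_zip_bytes is hypothetical, and the commented usage at the end assumes zip_file is a Django FileField.

```python
import io
import os
import zipfile


def build_zip_bytes(source_dir):
    """Return the bytes of a zip archive containing every file under source_dir."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir so the archive has no /tmp prefix.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return buffer.getvalue()


# Assumed usage inside create_zip_folder (requires Django, not run here):
#   from django.core.files.base import ContentFile
#   data = build_zip_bytes(output_dir)
#   asin_object.zip_file.save(f"{asin}.zip", ContentFile(data), save=False)
#   asin_object.has_zip = True
#   asin_object.save()
```

Saving through the field hands the content to the configured storage, so the stored name is relative to the storage root in both the local and the S3 setup.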
I have a GCS bucket called my-gcs with inconsistently nested subfolders, such as:
parent-path/path1/path2/*
parent-path/path3/path4/path5/*
parent-path/path6/*
The files can be parquet, CSV, or other formats.
This is my function to copy an entire folder from local disk to GCS:
def upload_local_directory_to_gcs(src_path, dest_path, data_backup, file_name):
    """
    Upload the whole directory to GCS
    """
    logger.debug("Uploading directory...")
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.get_bucket(GCS_BUCKET)
    if os.path.isfile(src_path):
        blob = bucket.blob(os.path.join(dest_path, os.path.basename(src_path)))
        blob.upload_from_filename(src_path)
        return
    for item in glob.glob(src_path + '/*'):
        file_exist = check_file_exist(data_backup, file_name)
        if os.path.isfile(item):
            print(item)
            if file_exist is False:
                blob = bucket.blob(os.path.join(dest_path, os.path.basename(item)),
                                   chunk_size=10485760)
                blob.upload_from_filename(item)
            else:
                logger.warning("Skipping upload. File already existed")
        else:
            if file_exist is False:
                upload_local_directory_to_gcs(item, os.path.join(dest_path, os.path.basename(item)),
                                              data_backup, file_name)
            else:
                logger.warning("Skipping upload. File already existed")
This is the function that checks whether a specific file exists in the directory and its sub-directories:
def check_file_exist(dataset, file_name):
    """
    Check if files existed
    """
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(GCS_BUCKET)
    logger.debug("Checking if file already existed in GCS to skip upload...")
    blobs = bucket.list_blobs(prefix=f'parent-path{dataset}/')
    check_files = [blob.name for blob in blobs if file_name in blob.name]  # if '.' in blob.name
    return bool(len(check_files))
However, the code does not behave correctly. Say the path parent-path/path1/path2/* already contains a file called first_file.csv. The code skips uploading the existing file in this path, but as soon as it encounters a file that does not exist yet, it uploads that file and also overwrites the other files in all directories.
I expected it to upload only the specific files that do not exist yet, without overwriting the others.
I tried my best to explain; please help.
If you look at the documentation for the blob's name property, you will see:
The name of the blob. This corresponds to the unique path of the object in the bucket.
That means the value is not only the file name but the fully qualified path plus the name, for example path/to/file.csv.
In your loop, you check whether a file name (file.csv, for example) is contained in the blob path. Consider this case:
path/to/file.csv
path/to/to/file.csv
If you test whether file.csv exists, both blobs will match.
To fix your issue, you need to do one of the following:
Either compare target_path + file_name with blob.name for strict equality,
Or add a condition to your "if" that checks the bucket path in addition to the file name.
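The strict-equality option can be sketched as follows. This is a minimal illustration that works on plain strings; the function name blob_exists and its parameters are hypothetical and not part of the original code.

```python
import posixpath


def blob_exists(blob_names, target_path, file_name):
    """Return True only if target_path/file_name is exactly one of blob_names."""
    # GCS object names always use forward slashes, so join with posixpath.
    expected = posixpath.join(target_path, file_name)
    return any(name == expected for name in blob_names)


names = ["path/to/file.csv", "path/to/to/file.csv"]
print(blob_exists(names, "path/to", "file.csv"))     # True
print(blob_exists(names, "path/other", "file.csv"))  # False
```

With the client library, blob_names could be produced from the listing already used in check_file_exist, for example (blob.name for blob in bucket.list_blobs(prefix=target_path + "/")).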
I am trying to download multiple files from a Google Cloud Storage folder. I can download a single file, but I am unable to download multiple files. I took this reference from this link, but it does not seem to work.
The code is as follows:
# [download multiple files]
bucket_name = 'bigquery-hive-load'
# The "folder" where the files you want to download are
folder="/projects/bigquery/download/shakespeare/"
# Create this folder locally
if not os.path.exists(folder):
    os.makedirs(folder)
    # Retrieve all blobs with a prefix matching the folder
    bucket=storage_client.get_bucket(bucket_name)
    print(bucket)
    blobs=list(bucket.list_blobs(prefix=folder))
    print(blobs)
    for blob in blobs:
        if(not blob.name.endswith("/")):
            blob.download_to_filename(blob.name)
# [End download to multiple files]
Is there any way to download multiple files whose names match a pattern, or some other approach? Since I am exporting the files from BigQuery, the file names will look like this:
shakespeare-000000000000.csv.gz
shakespeare-000000000001.csv.gz
shakespeare-000000000002.csv.gz
shakespeare-000000000003.csv.gz
Reference: Working code to download single file:
# [download to single files]
edgenode_destination_uri = '/projects/bigquery/download/shakespeare-000000000000.csv.gz'
bucket_name = 'bigquery-hive-load'
gcs_bucket = storage_client.get_bucket(bucket_name)
blob = gcs_bucket.blob("shakespeare.csv.gz")
blob.download_to_filename(edgenode_destination_uri)
logging.info('Downloaded {} to {}'.format(
    gcs_bucket, edgenode_destination_uri))
# [end download to single files]
After some trial and error, I solved this and am posting the solution here as well.
bucket_name = 'mybucket'
folder='/projects/bigquery/download/shakespeare/'
delimiter='/'
file = 'shakespeare'
# Retrieve all blobs with a prefix matching the file.
bucket=storage_client.get_bucket(bucket_name)
# List blobs iterate in folder
blobs=bucket.list_blobs(prefix=file, delimiter=delimiter) # Excluding folder inside bucket
for blob in blobs:
    print(blob.name)
    destination_uri = '{}/{}'.format(folder, blob.name)
    blob.download_to_filename(destination_uri)
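For the pattern-matching part of the question, one possible approach is to filter the listed blob names with a shell-style pattern before downloading. The sketch below uses only the standard library and operates on plain strings; the helper name select_blob_names is hypothetical.

```python
import fnmatch
import posixpath


def select_blob_names(blob_names, pattern):
    """Return the names whose final path component matches a shell-style pattern."""
    return [
        name for name in blob_names
        if fnmatch.fnmatchcase(posixpath.basename(name), pattern)
    ]


names = [
    "shakespeare-000000000000.csv.gz",
    "shakespeare-000000000001.csv.gz",
    "shakespeare.csv.gz",
    "other/readme.txt",
]
# Keeps only the two numbered export shards.
print(select_blob_names(names, "shakespeare-*.csv.gz"))
```

The prefix argument of list_blobs already narrows the listing on the server side, so a filter like this would only be needed when the prefix alone is not selective enough.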
It looks like you may simply have the wrong level of indentation in your Python code. The block beginning with # Retrieve all blobs with a prefix matching the folder is within the scope of the if above, so it is never executed if the folder already exists.
Try this:
# [download multiple files]
bucket_name = 'bigquery-hive-load'
# The "folder" where the files you want to download are
folder="/projects/bigquery/download/shakespeare/"
# Create this folder locally
if not os.path.exists(folder):
    os.makedirs(folder)
# Retrieve all blobs with a prefix matching the folder
bucket=storage_client.get_bucket(bucket_name)
print(bucket)
blobs=list(bucket.list_blobs(prefix=folder))
print(blobs)
for blob in blobs:
    if(not blob.name.endswith("/")):
        blob.download_to_filename(blob.name)
# [End download to multiple files]
The following code works fine, except that a subfolder with no files inside does not appear in S3. For example,
if /home/temp/subfolder contains no files, then subfolder will not show up in S3. How can I change the code so that empty folders are also uploaded to S3?
I tried to write something (see the note in the code below), but I do not know how to call put_object() for the empty subfolder.
#!/usr/bin/env python
import os
from boto3.session import Session
path = "/home/temp"
session = Session(aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3 = session.resource('s3')
for subdir, dirs, files in os.walk(path):
    # note: if not files ......
    for file in files:
        full_path = os.path.join(subdir, file)
        with open(full_path, 'rb') as data:
            s3.Bucket('my_bucket').put_object(Key=full_path[len(path)+1:],
                                              Body=data)
Besides, I tried calling this function to check whether a subfolder or file exists. It works for files, but not for subfolders. How can I check whether a subfolder exists? (If a subfolder exists, I will not upload it.)
def check_exist(s3, bucket, key):
    try:
        s3.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        return False
    return True
BTW, I took the above code from
check if a key exists in a bucket in s3 using boto3
and
http://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
Thanks to them for sharing the code.
Directories (folders, subfolders, etc.) do not exist in S3.
When you copy the file /mydir/myfile.txt to an empty S3 bucket, only the file itself is copied to S3. The directory mydir is not created, because that string is simply part of the object name mydir/myfile.txt. The actual name is the full path; no subdirectories exist or are created.
S3 simulates directories by using a prefix when listing files in the bucket. If you specify the prefix mydir/, then all of the S3 objects whose names start with mydir/ are returned, including objects such as mydir/anotherfolder/myotherfile.txt. S3 also supports a delimiter such as /, so that the appearance of subdirectories can be created.
Note: There is no / at the beginning of a file name for S3 objects.
Listing Keys Hierarchically Using a Prefix and Delimiter
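Since folders are only key prefixes, one common convention for making an empty "folder" visible is to store a zero-byte object whose key ends in /. The sketch below, which is an illustration under that assumption and not part of the original answer, walks a local tree and plans the keys, emitting a placeholder entry for each empty directory. The helper name plan_s3_keys is hypothetical.

```python
import os


def plan_s3_keys(root):
    """Yield (key, local_path) pairs for every file under root.

    Empty directories yield (key_ending_in_slash, None) so the caller can
    create a zero-byte placeholder object for them.
    """
    for subdir, dirs, files in os.walk(root):
        rel_dir = os.path.relpath(subdir, root)
        # Build the key prefix with forward slashes and no leading slash.
        prefix = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"
        if prefix and not dirs and not files:
            yield prefix, None
        for name in files:
            yield prefix + name, os.path.join(subdir, name)


# Assumed usage with a boto3 resource like the one in the question (not run here):
#   bucket = s3.Bucket('my_bucket')
#   for key, local_path in plan_s3_keys(path):
#       if local_path is None:
#           bucket.put_object(Key=key, Body=b'')
#       else:
#           with open(local_path, 'rb') as data:
#               bucket.put_object(Key=key, Body=data)
```

Under the same convention, the existence check from the question could be tried with a key that ends in /, although it would only find folders that were created with such a placeholder.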
I'm trying to upload a whole folder to Dropbox, but only the files get uploaded. Should I create the folders programmatically, or is there a simpler way to upload a folder? Thanks.
import os
import dropbox
access_token = '***********************'
dbx = dropbox.Dropbox(access_token)
dropbox_destination = '/live'
local_directory = 'C:/Users/xoxo/Desktop/man'
for root, dirs, files in os.walk(local_directory):
    for filename in files:
        local_path = root + '/' + filename
        print("local_path", local_path)
        relative_path = os.path.relpath(local_path, local_directory)
        dropbox_path = dropbox_destination + '/' + relative_path
        # upload the file
        with open(local_path, 'rb') as f:
            dbx.files_upload(f.read(), dropbox_path)
error:
dropbox.exceptions.ApiError: ApiError('xxf84e5axxf86', UploadError('path', UploadWriteFailed(reason=WriteError('disallowed_name', None), upload_session_id='xxxxxxxxxxx')))
[Cross-linking for reference: https://www.dropboxforum.com/t5/API-support/UploadWriteFailed-reason-WriteError-disallowed-name-None/td-p/245765 ]
There are a few things to note here:
In your sample, you're only iterating over files, so you won't get dirs uploaded/created.
The /2/files/upload endpoint only accepts file uploads, not folders. If you want to create folders, use /2/files/create_folder_v2. You don't need to explicitly create folders for any parent folders in the path for files you upload via /2/files/upload though. Those will be automatically created with the upload.
Per the /2/files/upload documentation, disallowed_name means:
Dropbox will not save the file or folder because of its name.
So, it's likely you're getting this error because you're trying to upload an ignored file, e.g., ".DS_STORE". You can find more information on those in this help article under "Ignored files".
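A minimal sketch of how the upload loop could skip such names before calling the API is shown below. The set of ignored names is an illustrative assumption, not the authoritative list from the help article, and the helper name plan_dropbox_uploads is hypothetical.

```python
import os

# Illustrative assumption: a few names commonly listed as ignored by Dropbox.
# Check the help article for the authoritative list.
IGNORED_NAMES = {".ds_store", "desktop.ini", "thumbs.db", ".dropbox", ".dropbox.attr"}


def plan_dropbox_uploads(local_directory, dropbox_destination):
    """Return (local_path, dropbox_path) pairs, skipping ignored file names."""
    uploads = []
    for root, _dirs, files in os.walk(local_directory):
        for filename in sorted(files):
            if filename.lower() in IGNORED_NAMES:
                continue
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_directory)
            # Dropbox paths use forward slashes regardless of the local OS.
            dropbox_path = dropbox_destination + "/" + relative_path.replace(os.sep, "/")
            uploads.append((local_path, dropbox_path))
    return uploads
```

Each returned pair could then be passed to the same dbx.files_upload call used in the question. Replacing the local separator also keeps Windows backslashes out of the destination path.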
Is it possible to explicitly move a file using Python's googleapiclient module? I want to create the following function, given a file, its original path, and a destination path:
def move_file(service, filename, init_drive_path, drive_path, copy=False):
    """Moves a file in Google Drive from one location to another.
    service: Drive API service instance.
    filename (string): full name of file on drive
    init_drive_path (string): initial file location on Drive
    drive_path (string): the file path to move the file in on Drive
    copy (boolean): file should be saved in both locations
    Returns nothing.
    """
Currently I do this by manually downloading the file and then re-uploading it to the desired location. However, this is not practical for big files and seems like a workaround anyway.
Here's the documentation for the methods available on the google-drive-api.
EDIT: See solution below:
Found it here. You just need to retrieve the file and folder IDs and then use the update method. The removeParents parameter can be omitted if you want to leave a copy of the file in the old folder(s).
file_id = '***'
folder_id = '***'
# Retrieve the existing parents to remove
file = drive_service.files().get(fileId=file_id, fields='parents').execute()
previous_parents = ",".join(file.get('parents'))
# Move the file to the new folder
file = drive_service.files().update(
    fileId=file_id,
    addParents=folder_id,
    removeParents=previous_parents,
    fields='id, parents'
).execute()
(Note: I have not included my basic helper functions _getFileId and _getFolderId.) So my original function will look something like:
def move_file(service, filename, init_drive_path, drive_path, copy=False):
    """Moves a file in Google Drive from one location to another.
    service: Drive API service instance.
    'filename' (string): file path on local machine, or a bytestream
    to use as a file.
    'init_drive_path' (string): the file path the file was initially in on Google
    Drive (in <folder>/<folder>/<folder> format).
    'drive_path' (string): the file path to move the file in on Google
    Drive (in <folder>/<folder>/<folder> format).
    'copy' (boolean): file should be saved in both locations
    Returns nothing.
    """
    file_id = _getFileId(service, filename, init_drive_path)
    folder_id = _getFolderId(service, drive_path)
    if not file_id or not folder_id:
        raise Exception('Did not find file specified: {}/{}'.format(init_drive_path, filename))
    file = service.files().get(fileId=file_id, fields='parents').execute()
    if copy:
        previous_parents = ''
    else:
        previous_parents = ",".join(file.get('parents'))
    # Use the service passed in as an argument, not a global drive_service
    file = service.files().update(
        fileId=file_id,
        addParents=folder_id,
        removeParents=previous_parents,
        fields='id, parents'
    ).execute()
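As a possible variation, the request arguments can be built so that removeParents is left out entirely when copy is true, instead of being passed as an empty string. This is a sketch only; the helper name build_move_params is hypothetical and is not part of the Drive client library.

```python
def build_move_params(file_id, folder_id, current_parents, copy=False):
    """Build keyword arguments for files().update() that move or add a parent."""
    params = {
        "fileId": file_id,
        "addParents": folder_id,
        "fields": "id, parents",
    }
    if not copy:
        # Only ask Drive to drop the old parents when this is a true move.
        params["removeParents"] = ",".join(current_parents or [])
    return params


print(build_move_params("file123", "folderB", ["folderA"]))
print(build_move_params("file123", "folderB", ["folderA"], copy=True))
```

Inside move_file, the result could be passed along as service.files().update(**params).execute().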