I am trying to download multiple files from the Google cloud storage folder. I am able to download the single file but unable to download multiple files. I took this reference from this link but seems it is not working.
The code is as follow:
# [download multiple files]
bucket_name = 'bigquery-hive-load'
# The "folder" where the files you want to download are
folder="/projects/bigquery/download/shakespeare/"
# Create this folder locally
if not os.path.exists(folder):
os.makedirs(folder)
# Retrieve all blobs with a prefix matching the folder
bucket=storage_client.get_bucket(bucket_name)
print(bucket)
blobs=list(bucket.list_blobs(prefix=folder))
print(blobs)
for blob in blobs:
if(not blob.name.endswith("/")):
blob.download_to_filename(blob.name)
# [End download to multiple files]
Is there any way to download multiple files matching with the pattern(name) or something else. Since I am exporting the file from bigquery, the file names will be something like below:
shakespeare-000000000000.csv.gz
shakespeare-000000000001.csv.gz
shakespeare-000000000002.csv.gz
shakespeare-000000000003.csv.gz
Reference: Working code to download single file:
# [download to single files]
edgenode_destination_uri = '/projects/bigquery/download/shakespeare-000000000000.csv.gz'
bucket_name = 'bigquery-hive-load'
gcs_bucket = storage_client.get_bucket(bucket_name)
blob = gcs_bucket.blob("shakespeare.csv.gz")
blob.download_to_filename(edgenode_destination_uri)
logging.info('Downloded {} to {}'.format(
gcs_bucket, edgenode_destination_uri))
# [end download to single files]
After some trial, I solved this and couldn't stop myself from posting here as well.
bucket_name = 'mybucket'
folder='/projects/bigquery/download/shakespeare/'
delimiter='/'
file = 'shakespeare'
# Retrieve all blobs with a prefix matching the file.
bucket=storage_client.get_bucket(bucket_name)
# List blobs iterate in folder
blobs=bucket.list_blobs(prefix=file, delimiter=delimiter) # Excluding folder inside bucket
for blob in blobs:
print(blob.name)
destination_uri = '{}/{}'.format(folder, blob.name)
blob.download_to_filename(destination_uri)
It looks like you may simply have the wrong level of indentation in your python code. The block beginning with # Retrieve all blobs with a prefix matching the folder is within the scope of the if above so it's never executed if the folder already exists.
Try this:
# [download multiple files]
bucket_name = 'bigquery-hive-load'
# The "folder" where the files you want to download are
folder="/projects/bigquery/download/shakespeare/"
# Create this folder locally
if not os.path.exists(folder):
os.makedirs(folder)
# Retrieve all blobs with a prefix matching the folder
bucket=storage_client.get_bucket(bucket_name)
print(bucket)
blobs=list(bucket.list_blobs(prefix=folder))
print(blobs)
for blob in blobs:
if(not blob.name.endswith("/")):
blob.download_to_filename(blob.name)
# [End download to multiple files]
Related
Using python I am able to delete files from bucket using prefixes also but in python code, prefix means directory.
I want to delete the files from GCP bucket which starts with example.
For example:
example-2022-12-07
example-2022-12-08
I followed this(Delete Files from Google Cloud Storage) but did not get the answer.
I am trying this, but not working:
blobs = bucket.list_blobs()
fileList = [file.name for file in blobs if 'example' in file.name ]
print(fileList)
for file in fileList:
blob = blobs.blob(file)
blob.delete()
print(f"Blob {blob_name} deleted.")
You can try the following code to delete files from the Google Cloud Storage by using the blob.delete method as suggested in the Documentation.
Below is the example for what you are looking:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket(bucket_name)
# list all objects in the directory
# Add prefix as parameter to bucket.list_blobs
blobs = bucket.list_blobs(prefix=?)
for blob in blobs:
blob.delete()
print(f"Blob {blob_name} deleted.")`
You can check with this thread1 and thread2.
Hi guys I need to generate a zip file from a bunch of images and add it as a file to the database
This is the code to generate the file
def create_zip_folder(asin):
print('creating zip folder for asin: ' + asin)
asin_object = Asin.objects.get(asin=asin)
# create folder
output_dir = f"/tmp/{asin}"
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# download images
for img in asin_object.asin_images.all():
urllib.request.urlretrieve(img.url, f"{output_dir}/{img.id}.jpg")
# zip all files in output_dir
zip_file = shutil.make_archive(asin, 'zip', output_dir)
asin_object.zip_file = zip_file
asin_object.has_zip = True
asin_object.save()
# delete folder
shutil.rmtree(output_dir)
return True
This all works and I can see the files generated in my editor but when I try to access it in the template asin.zip_file.url I get this error
SuspiciousOperation at /history/
Attempted access to '/workspace/B08MFR2DRS.zip' denied.
Why is this happening? I thought the file is to be uploaded to the file storage through the model but apparently it's in a restricted folder, this happens both in development (with local file storage) and in production (with s3 bucket as file storage)
I have a GCS called my-gcs with inconsistent subfolder such as;
parent-path/path1/path2/*
parent-path/path3/path4/path5/*
parent-path/path6/*
The files can be parquet/csv or other than this.
This is my function to copy the entire folder from local to GCS:
def upload_local_directory_to_gcs(src_path, dest_path, data_backup, file_name):
"""
Upload the whole directory to GCS
"""
logger.debug("Uploading directory...")
storage_client = storage.Client.from_service_account_json(key_path)
bucket = storage_client.get_bucket(GCS_BUCKET)
if os.path.isfile(src_path):
blob = bucket.blob(os.path.join(dest_path, os.path.basename(src_path)))
blob.upload_from_filename(src_path)
return
for item in glob.glob(src_path + '/*'):
file_exist = check_file_exist(data_backup, file_name)
if os.path.isfile(item):
print(item)
if file_exist is False:
blob = bucket.blob(os.path.join(dest_path, os.path.basename(item)),
chunk_size=10485760)
blob.upload_from_filename(item)
else:
logger.warning("Skipping upload. File already existed")
else:
if file_exist is False:
upload_local_directory_to_gcs(item, os.path.join(dest_path, os.path.basename(item)),
data_backup, file_name)
else:
logger.warning("Skipping upload. File already existed")
This is the function to check if specific file exist in the directory & sub-directory:
def check_file_exist(dataset, file_name):
"""
Check if files existed
"""
storage_client = storage.Client.from_service_account_json(key_path)
bucket = storage_client.bucket(GCS_BUCKET)
logger.debug("Checking if file already existed in GCS to skip upload...")
blobs = bucket.list_blobs(prefix=f'parent-path{dataset}/')
check_files = [blob.name for blob in blobs if file_name in blob.name] # if '.' in blob.name
return bool(len(check_files))
However the code is not running correctly. Say this path parent-path/path1/path2/* already has a file called first_file.csv. It will skip uploading the existing file in this path. Until it encounters a file that not yet existed, it will upload the file and overwrite the other files for all directories as well.
Where I was expecting it to only upload specific file that is not existed yet, without overwriting the other files.
I tried my best to explain... please help.
If you have a look to the documentation, you can see that on the Name property of the blob
The name of the blob. This corresponds to the unique path of the object in the bucket.
That means the value is not only the file name, but the fully qualified path + the name path/to/file.csv
If your loop, you check if a file name (file.csv for example) is included in the blob path. Consider this case
path/to/file.csv
path/to/to/file.csv
If you test is file.csv exists, both blobs will return true.
To fix your issue, you need to
Either compare the strict equality of the target_path + file_name and the blob.name
Or include an additional condition in your "if" to include the bucket path to check in addition to the file name.
Using https://github.com/googleapis/google-cloud-python/tree/master/storage or https://github.com/GoogleCloudPlatform/appengine-gcs-client, I can delete a files by specifying its file name, but there seems not to be ways to delete folders.
Is there any ways to delete folders ?
I found this(Google Cloud Storage: How to Delete a folder (recursively) in Python) in stackvoerflow, but this answer simply deletes all the files in the folder, not deleting the folder itself.
The code mentioned in the anwser you referred works, the prefix should look like this:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('my-bucket')
blobs = bucket.list_blobs(prefix='my-folder/')
for blob in blobs:
blob.delete()
from google.cloud import storage
def delete_storage_folder(bucket_name, folder):
"""
This function deletes from GCP Storage
:param bucket_name: The bucket name in which the file is to be placed
:param folder: Folder name to be deleted
:return: returns nothing
"""
cloud_storage_client = storage.Client()
bucket = cloud_storage_client.bucket(bucket_name)
try:
bucket.delete_blobs(blobs=list(bucket.list_blobs(prefix=folder)))
except Exception as e:
print(str(e.message))
The following code works fine except if there is a subfolder, which does not have any file inside, then the subfolder will not appear in S3. e.g.
if /home/temp/subfolder has no file, then subfolder will not show in S3. how to change the code so that the empty folder is also uploaded in S3?
I tried to write sth. (see note below), but do not know how to call put_object() to the empty subfolder.
#!/usr/bin/env python
import os
from boto3.session import Session
path = "/home/temp"
session = Session(aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3 = session.resource('s3')
for subdir, dirs, files in os.walk(path):
# note: if not files ......
for file in files:
full_path = os.path.join(subdir, file)
with open(full_path, 'rb') as data:
s3.Bucket('my_bucket').put_object(Key=full_path[len(path)+1:],
Body=data)
besides, I tried to call this function to check if a subfolder or file exist or not. it works for file, but not subfolder. how to check if a subfolder exists or not? (if a subfolder exists, I will not upload)
def check_exist(s3, bucket, key):
try:
s3.Object(bucket, key).load()
except botocore.exceptions.ClientError as e:
return False
return True
BTW, I refer the above code from
check if a key exists in a bucket in s3 using boto3
and
http://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
thanks them for sharing the code.
Directories (folders, subfolders, etc.) do not exist in S3.
When you copy this file to an empty S3 bucket /mydir/myfile.txt, only the file myfile.txt is copied to S3. The directory mydir is not created as that string is part of the file name mydir/myfile.txt. The actual file name is the full path, no subdirectories exist or are created.
S3 simulates directories by using a prefix when listing files in the bucket. If you specify mydir/, then all of the S3 objects that start with mydir/ will be returned including objects such as mydir/anotherfolder/myotherfile.txt. S3 supports a delimitor such as / so that the appearance of subdirectories can be created.
Note: There is no / at the beginning of a file name for S3 objects.
Listing Keys Hierarchically Using a Prefix and Delimiter