How can I delete specific sub-directories with the Google Cloud Storage Python SDK? - python

I have the following code:
storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix="content/")
print('Blobs:')
for blob in blobs:
    print(blob.name)
My storage hierarchy is:
"content/{uid}/subContentDirectory1/anyFiles*"
"content/{uid}/subContentDirectory2/anyFiles*"
My goal is to be able to delete content/{uid}/subContentDirectory1 from the Python code.
With the above code, I am retrieving all the files in all sub-directories of content, but I don't know how to use a wildcard to delete only a specific subdirectory.
I am aware I can do it with gsutil, but this will be running in an AWS Lambda function, so I would rather use the Python SDK.
How can I do that?
EDIT: to make it clearer, {uid} is the wildcard; it can match millions of results, and under each of those results I want to delete subContentDirectory1.

The list_blobs function lets you filter blobs by the prefix that corresponds to a given subfolder, so you can retrieve all the objects (files) within it. After that, you can call the delete method on each blob to remove it.
Here is the updated code:
storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
# uid must be a concrete value here; the prefix is matched literally, not as a wildcard.
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix=f"content/{uid}/subContentDirectory1/")
print('Blobs to delete:')
for blob in blobs:
    print(blob.name)
    blob.delete()
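If the uid is not known in advance (the case described in the edit, where {uid} can match millions of values), list_blobs has no server-side wildcard, so one option is to list everything under content/ and filter the object names client-side. A minimal sketch of that approach, assuming the bucket and directory names from the question:
from google.cloud import storage

storage_client = storage.Client()

# List every object under content/ and keep only those whose third path
# segment (content/{uid}/<segment>/...) is subContentDirectory1.
for blob in storage_client.list_blobs("mybucket", prefix="content/"):
    parts = blob.name.split("/")
    if len(parts) > 2 and parts[2] == "subContentDirectory1":
        blob.delete()
For a very large number of objects this means many sequential requests; grouping the deletes with storage_client.batch() or configuring a lifecycle rule on the bucket may be faster options, depending on how quickly the objects need to disappear.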

Related

Google Cloud Storage List Blob objects with specific file name

With the help of google.cloud.storage and list_blobs I can get the list of files from a specific bucket. But I want to filter (name*.ext) the exact files from the bucket, and I was not able to find an exact solution.
For example: bucket=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get specific invoices and their metadata (name*.csv and name*.meta). Also, if I loop over all_blobs of that folder to pick out the selected files, it will be a huge volume of data and may affect performance.
It would be good if someone could help me with this.
bucket = gcs_client.get_bucket(bucket_name)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)
According to the google-cloud-storage documentation, blobs are objects that have a name attribute, so you can filter them on that attribute.
from google.cloud import storage

# storage_client = gcs client
storage_client = storage.Client()
bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)

filter_dir = "filter-string"
filtered_names = [blob.name for blob in blobs if filter_dir in blob.name]
It doesn't allow you to filter, but you can use the fields parameter to return just the names of the objects, limiting the amount of data returned and making it easy to filter.
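For example, a rough sketch of that idea (the bucket name and prefix are placeholders): fields limits the listing response to the object names plus the page token the iterator needs, while the extension filtering still happens client-side.
from google.cloud import storage

storage_client = storage.Client()

# Only object names (plus the page token) are returned by the API call;
# the pattern filtering is done on the client.
blobs = storage_client.list_blobs(
    "your-bucket-name",
    prefix="sales/",
    fields="items(name),nextPageToken",
)
csv_names = [blob.name for blob in blobs if blob.name.endswith(".csv")]
print(csv_names)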
You can filter for a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you do a command like:
gsutil ls gs://your-bucket/abc*.txt
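A rough client-side equivalent of that gsutil command, using fnmatch and a prefix to narrow the listing (the bucket name is a placeholder):
import fnmatch

from google.cloud import storage

storage_client = storage.Client()

# Narrow the listing with the fixed part of the pattern, then apply the
# wildcard match on the client.
blobs = storage_client.list_blobs("your-bucket", prefix="abc")
matches = [blob.name for blob in blobs if fnmatch.fnmatch(blob.name, "abc*.txt")]
print(matches)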
You can use the following, with name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here, name is the filename filter and .ext is your extension filter.

How do I walk through path of a google bucket in notebook?

I have a large dataset of images in a zip file. I want to list all the files so I can work with them.
A while ago I started using Google Cloud Platform and I uploaded my data to a bucket. When working locally, I used this piece of code to find the filenames of all the files in the dataset. I want to do something similar so I can load the images in my notebook.
# files directory in list
matches = []
for root, dirnames, filenames in os.walk(r"D:\LH\..."):
    for filename in fnmatch.filter(filenames, '*.nii'):
        matches.append(os.path.join(root, filename))
print(matches[0])
Since my dataset is split up into 10 different zip files, I have this piece of code to list the objects in my bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

bloblist = list_blobs('adni_data')
which gives me a list of the folders my files are in. But how can I load this data using the nilearn library? By the way, the folders contain many nested folders before the file itself is stored (every file sits in its own folder).
By iterating over all the blobs with list_blobs, you're just getting references to where the data exists in Cloud Storage -- no actual image data has been transferred between GCS and your script.
If you wanted to load the image with something like nilearn.image.load_img, you need a path to a local .nii file, so you'd need to do something like:
for blob in blobs:
    local_filename = '/tmp/' + blob.name
    blob.download_to_filename(local_filename)
    nilearn.image.load_img(local_filename)
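One caveat with the snippet above: blob names usually contain "/" separators, so the local directories have to exist before download_to_filename can write the file. A sketch that handles that, assuming the adni_data bucket from the question and that only the .nii files are needed:
import os

from google.cloud import storage
from nilearn import image

storage_client = storage.Client()

for blob in storage_client.list_blobs("adni_data"):
    if not blob.name.endswith(".nii"):
        continue
    local_filename = os.path.join("/tmp", blob.name)
    # Create the nested local directories mirrored from the blob name.
    os.makedirs(os.path.dirname(local_filename), exist_ok=True)
    blob.download_to_filename(local_filename)
    img = image.load_img(local_filename)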

Google Cloud Storage - Move file from one folder to another - By using Python

I would like to move a list of files from Google Cloud Storage to another folder:
storage_client = storage.Client()
count = 0
# Retrieve all blobs with a prefix matching the file.
bucket = storage_client.get_bucket(BUCKET_NAME)
# List blobs iterate in folder
blobs = bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # Excluding folder inside bucket
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):
        # WHAT CAN GO HERE?
        count += 1
The only useful information I found in the Google documentation is:
https://cloud.google.com/storage/docs/renaming-copying-moving-objects
According to that documentation, the only method is to copy a file from one folder to another and then delete the original.
Is there any way to actually MOVE files?
What is the best way to move all the files matching a PREFIX like *BLABLA*.csv?
P.S. I do not want to use
"gsutil mv gs://[SOURCE_BUCKET_NAME]/[SOURCE_OBJECT_NAME]
gs://[DESTINATION_BUCKET_NAME]/[DESTINATION_OBJECT_NAME]"
This could be a possible solution, as there is no move_blob function in google.cloud.storage:
from google.cloud import storage

storage_client = storage.Client()
dest_bucket = storage_client.create_bucket(bucket_to)
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
        source_bucket.delete_blob(blob.name)
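If the destination is a different folder in the same bucket rather than a different bucket, the same copy-then-delete pattern works with the prefix rewritten. A sketch, where the bucket name and the SOURCE_PREFIX/DEST_PREFIX values are hypothetical placeholders:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")  # placeholder bucket name

SOURCE_PREFIX = "incoming/"   # hypothetical source "folder"
DEST_PREFIX = "processed/"    # hypothetical destination "folder"

for blob in storage_client.list_blobs(bucket, prefix=SOURCE_PREFIX):
    new_name = DEST_PREFIX + blob.name[len(SOURCE_PREFIX):]
    bucket.copy_blob(blob, bucket, new_name=new_name)
    bucket.delete_blob(blob.name)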
You could use the rename_blob method of google.cloud.storage.Bucket; it copies the blob under the new name and deletes the old one, which effectively moves the file. Note that rename_blob works within a single bucket, and the new name must differ from the old one.
Take a look at the code below:
from google.cloud import storage

storage_client = storage.Client()
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # DEST_FOLDER_PATH is the target "folder" prefix inside the same bucket, e.g. "processed/"
        source_bucket.rename_blob(blob, new_name=DEST_FOLDER_PATH + blob.name.split('/')[-1])

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
import boto3

s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating over all of the keys in the bucket and extracting whatever structure you want to see (depending on your needs, a dict-of-dicts may be a good approach; see the sketch below).
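A sketch of that dict-of-dicts idea using the list_objects_v2 paginator ('my-bucket' is a placeholder):
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Build a nested dict from the key paths; its top-level keys are the
# top-level "folders" of the bucket.
tree = {}
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        node = tree
        for part in obj['Key'].split('/')[:-1]:
            node = node.setdefault(part, {})

print(list(tree.keys()))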
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

How do i delete a folder(blob) inside an azure container using delete_blob method of blockblobservice?

delete_blob() seems to delete only the files inside the container, and files in folders and subfolders inside the container. But I'm seeing the error below in Python while trying to delete a folder from the container.
Client-Request-ID=7950669c-2c4a-11e8-88e7-00155dbf7128 Retry policy did not allow for a retry: Server-Timestamp=Tue, 20 Mar 2018 14:25:00 GMT, Server-Request-ID=54d1a5d6-b01e-007b-5e57-c08528000000, HTTP status code=404, Exception=The specified blob does not exist.ErrorCode: BlobNotFoundBlobNotFoundThe specified blob does not exist.RequestId:54d1a5d6-b01e-007b-5e57-c08528000000Time:2018-03-20T14:25:01.2130063Z.
azure.common.AzureMissingResourceHttpError: The specified blob does not exist.ErrorCode: BlobNotFound
BlobNotFoundThe specified blob does not exist.
RequestId:54d1a5d6-b01e-007b-5e57-c08528000000
Time:2018-03-20T14:25:01.2130063Z
Could anyone please help here?
In Azure Blob Storage, a folder as such doesn't exist; it is just a prefix in a blob's name. For example, if you see a folder named images and it contains a blob called myfile.png, then the blob's name is really images/myfile.png. Because folders don't really exist (they are virtual), you can't delete a folder directly.
What you need to do is delete all the blobs in that folder individually (in other words, delete the blobs whose names begin with that virtual folder name/path). Once you have deleted all the blobs, the folder automatically goes away.
In order to accomplish this, first you need to fetch all blobs whose names start with the virtual folder path. For that, use the list_blobs method and pass the virtual folder path in the prefix parameter. This gives you a list of blobs starting with that prefix. Once you have that list, delete the blobs one by one.
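A minimal sketch of that approach with the legacy BlockBlobService SDK used in the sample below (the account, container, and folder names are placeholders):
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# List every blob under the virtual folder and delete them one by one;
# once the last blob is gone, the "folder" disappears as well.
for blob in block_blob_service.list_blobs('mycontainer', prefix='images/'):
    block_blob_service.delete_blob('mycontainer', blob.name)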
There are two things to understand from the process: you can delete specific files, folders, images, etc. (blobs) using delete_blob, but if you want to delete a container you have to use delete_container, which deletes all the blobs within it. Here's a sample I created which deletes blobs inside a path/virtual folder:
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='yraccountname', account_key='accountkey')

print("Retrieving blobs in specified container...")
blob_list = []
container = "containername"

def list_blobs(container):
    try:
        global blob_list
        content = block_blob_service.list_blobs(container)
        print("******Blobs currently in the container:**********")
        for blob in content:
            blob_list.append(blob.name)
            print(blob.name)
    except:
        print("The specified container does not exist, please check the container name or whether it exists.")

list_blobs(container)
print("The list() is:")
print(blob_list)
print("Delete this blob: ", blob_list[1])

# DELETE A SPECIFIC BLOB FROM THE CONTAINER
block_blob_service.delete_blob(container, blob_list[1], snapshot=None)
list_blobs(container)
Please refer to the code in my repo, with an explanation in the Readme section, as well as new storage scripts: https://github.com/adamsmith0016/Azure-storage
For others searching for a solution in Python: this worked for me.
First build a list of all the blobs in the folder that you want to remove.
Then, for every blob in that list, delete it by passing the container name and the blob's name.
Once all the blobs in a folder have been deleted, the folder itself is gone in Azure.
def delete_folder(self, containername, foldername):
    folders = [blob for blob in blob_service.block_blob_service.list_blobs(containername) if blob.name.startswith(foldername)]
    if len(folders) > 0:
        for folder in folders:
            blob_service.block_blob_service.delete_blob(containername, folder.name)
            print("deleted folder", folder.name)
Use list_blobs(name_starts_with=folder_name) and delete_blob()
Complete code:
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(conn_str=CONN_STR)
container_client = blob_service_client.get_container_client(AZURE_BLOBSTORE_CONTAINER)
for blob in container_client.list_blobs(name_starts_with=FOLDER_NAME):
    container_client.delete_blob(blob.name)
You cannot delete a non-empty folder in Azure Blob Storage directly, but you can achieve it by deleting the files inside the sub-folders first. The workaround below deletes the files first and works its way up to the parent folder.
from azure.storage.blob import BlockBlobService

blob_client = BlockBlobService(account_name='', account_key='')
containername = 'XXX'
foldername = 'XXX'

def delete_folder(containername, foldername):
    folders = [blob.name for blob in blob_client.list_blobs(containername, prefix=foldername)]
    folders.sort(reverse=True, key=len)
    if len(folders) > 0:
        for folder in folders:
            blob_client.delete_blob(containername, folder)
            print("deleted folder", folder)
