I have a large dataset of images in a zip file and I want to list all the files so I can work with them.
A while ago I started using Google Cloud Platform and uploaded my data to a bucket. When working locally, I used this piece of code to find the filenames of all the files in the dataset. I want to do something similar so I can load the images in my notebook.
# Collect the paths of all .nii files in the dataset directory
import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk(r"D:\LH\..."):
    for filename in fnmatch.filter(filenames, '*.nii'):
        matches.append(os.path.join(root, filename))
print(matches[0])
Since my dataset is split up into 10 different zip files, I have this piece of code to list the objects in my bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket and returns their names."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()

    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)

    names = []
    for blob in blobs:
        print(blob.name)
        names.append(blob.name)
    return names

bloblist = list_blobs('adni_data')
which gives me a list of the object names (the folder paths my files live under). But how can I load this data using the nilearn library? By the way, the dataset consists of many nested folders, and each file sits at the end of its own chain of subfolders.
By iterating over all the blobs with list_blobs, you're just getting references to where the data exists in Cloud Storage -- no actual image data has been transferred between GCS and your script.
If you wanted to load the image with something like nilearn.image.load_img, you need a path to a local .nii file, so you'd need to do something like:
import os
import nilearn.image

for blob in blobs:
    local_filename = '/tmp/' + blob.name
    os.makedirs(os.path.dirname(local_filename), exist_ok=True)  # blob names contain slashes
    blob.download_to_filename(local_filename)
    img = nilearn.image.load_img(local_filename)
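Putting the two pieces together, a rough end-to-end sketch (the bucket name 'adni_data' comes from the question; the local scratch directory and the .nii-only filter are assumptions) might look like this:
import os
import nilearn.image
from google.cloud import storage

storage_client = storage.Client()
local_dir = '/tmp/adni'  # assumed scratch directory; any writable path works

images = []
for blob in storage_client.list_blobs('adni_data'):
    if not blob.name.endswith('.nii'):
        continue  # only the NIfTI files are of interest here
    local_path = os.path.join(local_dir, blob.name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)  # blob names contain slashes
    blob.download_to_filename(local_path)
    images.append(nilearn.image.load_img(local_path))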
I have the following code:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix="content/")

print('Blobs:')
for blob in blobs:
    print(blob.name)
My storage hierarchy is:
"content/{uid}/subContentDirectory1/anyFiles*"
"content/{uid}/subContentDirectory2/anyFiles*"
My goal is to be able to delete content/{uid}/subContentDirectory1 from the Python code.
With the above code I am retrieving all the files in all sub-directories of content/, but I don't know how to use a wildcard so that I delete only a specific subdirectory.
I am aware I can do it with gsutil, but this will be running in an AWS Lambda function, so I would rather use the Python SDK.
How can I do that?
EDIT: to make it clearer, the {uid} is the wildcard; it can potentially match millions of results, and under each of these results I want to delete subContentDirectory1.
The list_blobs function lets you filter blobs by the prefix that corresponds to a given subfolder, so you can retrieve all the objects (files) within it. You can then call the delete method to remove each blob.
Here is the updated code:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix="content/{uid}/subContentDirectory1/")

print('Blobs to delete:')
for blob in blobs:
    print(blob.name)
    blob.delete()
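If the {uid} part is not known in advance (the wildcard case from the edit), one possible approach is to list everything under content/ and filter client-side on the second path segment. This sketch assumes the layout is exactly content/{uid}/subContentDirectory1/... and still issues one delete request per object, which can be slow with millions of results:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")

# The {uid} level is unknown, so list everything under content/ and filter in the client.
for blob in storage_client.list_blobs(bucket_or_name=bucket, prefix="content/"):
    parts = blob.name.split("/")  # e.g. ['content', '<uid>', 'subContentDirectory1', '<file>']
    if len(parts) > 2 and parts[2] == "subContentDirectory1":
        blob.delete()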
I have written code to move files from one bucket to another in GCS using Python. The bucket has multiple subfolders, and I am trying to move only the Day folder to a different bucket. Source path: /Bucketname/projectname/XXX/Day. Target path: /Bucketname/Archive/Day.
Is there a way to directly move/copy the Day folder without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd

def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in a single operation, because folders don't exist in Cloud Storage. All the objects live at the bucket level, and an object's name is its full path.
By convention, and for human readability, the slash / is used as a folder separator, but that's a fiction!
So you have no option other than moving all the files that share the same prefix (the "folder path"), iterating over every one of them.
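For illustration, here is a minimal sketch of that per-object loop using copy_blob plus delete instead of rename_blob; move_folder is just an illustrative name, and the arguments are assumed to mirror the question's setup:
from google.cloud import storage

def move_folder(bucket_name, new_bucket_name, source_prefix, target_prefix):
    # "Moving a folder" really means copying and deleting every object
    # whose name starts with the source prefix.
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in source_bucket.list_blobs(prefix=source_prefix):
        new_name = blob.name.replace(source_prefix, target_prefix, 1)
        source_bucket.copy_blob(blob, destination_bucket, new_name)
        blob.delete()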
I am fetching objects from Google Cloud Storage using Python; the folder contains many files (around 20000).
But I just need one particular file, which is a .json file; all the other files are in CSV format. For now I am using the following code with the prefix option:
from google.cloud import storage
import json

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="input"))
for blob in blobs:
    if '.json' in blob.name:
        filename = blob.name
        break
This approach is not stable: the file count keeps increasing, so filtering out the JSON file will take more and more time (the file name is dynamic and could be anything).
Is there any option, like a regex filter, that can be applied while fetching the data from Cloud Storage?
If you want to check the filename/extension against a regex, it's quite easy.
Just import the re module at the beginning:
import re
And check against a regex inside the loop:
for blob in blobs:
    if re.search(r'\.json$', blob.name):  # re.search, not re.match: the extension is at the end of the name
        filename = blob.name
        break
You can develop the regex at regex101.com before you bake it into your code.
BTW, I prefer to check extensions with str.endswith, which is quite fast:
for blob in blobs:
    if blob.name.endswith('.json'):
        filename = blob.name
        break
I wouldn't use
if '.json' in filename:
etc...
because it might also match other filenames, like 'compressed.json.gz'.
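To make the difference concrete (a throwaway illustration, not part of the suggested code):
name = 'compressed.json.gz'
print('.json' in name)         # True  -> the substring check matches
print(name.endswith('.json'))  # False -> the extension check does not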
With the help of google.cloud.storage and list_blobs I can get the list of files from a specific bucket. But I want to filter (name*.ext) for exact files in the bucket, and I was not able to find an exact solution.
For example: bucket_name=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get specific invoices and their metadata (name*.csv & name*.meta). Also, if I loop over all the blobs of that folder just to pick out the selected files, it will be a huge volume of data and may affect performance.
It would be good if someone could help me with a solution.
bucket = gcs_client.get_bucket(bucket_name)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)
According to the google-cloud-storage documentation, Blobs are objects that have a name attribute, so you can filter them by this attribute.
from google.cloud import storage

# storage_client = gcs client
storage_client = storage.Client()

# bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)

# filter_dir = "filter-string"
[blob.name for blob in blobs if filter_dir in blob.name]
It doesn't allow you to filter, but you can use the fields parameter to return just the names of the objects, limiting the amount of data returned and making it easy to filter.
You can filter for a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you do a command like:
gsutil ls gs://your-bucket/abc*.txt
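As a rough sketch of that idea, assuming the bucket and prefix from the earlier example (data and sales/): the fields selector below asks the API for object names only (plus the page token used for iteration), and the wildcard-style matching is then done client-side:
from google.cloud import storage

storage_client = storage.Client()

# Only object names (and the page token needed for iteration) are returned.
blobs = storage_client.list_blobs(
    "data",
    prefix="sales/",
    fields="items(name),nextPageToken",
)

# Client-side filtering, analogous to what gsutil does for wildcards.
matches = [blob.name for blob in blobs
           if blob.name.endswith(".csv") or blob.name.endswith(".meta")]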
You can use the following, taking name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here, name is the filename filter and .ext is your extension filter.
I would like to move a list of files in Google Storage to another folder:
import fnmatch
from google.cloud import storage

storage_client = storage.Client()
count = 0

# Retrieve all blobs with a prefix matching the file.
bucket = storage_client.get_bucket(BUCKET_NAME)

# List blobs iterating in the folder
blobs = bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # excluding folders inside the bucket
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):
        # WHAT CAN GO HERE?
        count += 1
The only useful information I found in the Google documentation is:
https://cloud.google.com/storage/docs/renaming-copying-moving-objects
According to this documentation, the only method is to copy the object from one folder to another and then delete the original.
Is there any way to actually MOVE files?
What is the best way to move all the files matching a PREFIX pattern like *BLABLA*.csv?
P.S. I do not want to use
"gsutil mv gs://[SOURCE_BUCKET_NAME]/[SOURCE_OBJECT_NAME]
gs://[DESTINATION_BUCKET_NAME]/[DESTINATION_OBJECT_NAME]"
This could be a possible solution, as there is no move_blob function in google.cloud.storage:
from google.cloud import storage
import fnmatch

storage_client = storage.Client()
dest_bucket = storage_client.create_bucket(bucket_to)  # or get_bucket() if the destination bucket already exists
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
        source_bucket.delete_blob(blob.name)
You could use the method rename_blob in google.cloud.storage.Bucket; it copies the file under the new name and then deletes the old one. Note that the new name has to differ from the old one, otherwise the original object is not deleted.
Take a look at the code below:
from google.cloud import storage
import fnmatch

storage_client = storage.Client()
dest_bucket = storage_client.create_bucket(bucket_to)  # or get_bucket() if the destination bucket already exists
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # DESTINATION_PATH stands for the target "folder" prefix; the new name must differ from the old one
        dest_bucket.rename_blob(blob, new_name=blob.name.replace(GS_FILES_PATH, DESTINATION_PATH))