Google Cloud Storage List Blob objects with specific file name - python

Using google.cloud.storage and list_blobs I can get the list of files from a specific bucket, but I want to filter for the exact files matching a pattern (name*.ext). I was not able to find a solution for this.
For example: bucket=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get specific invoices and their metadata (name*.csv and name*.meta). Also, if I loop over the entire all_blobs of that folder to pick out the selected files, it will be a huge volume of data and may affect performance.
It would be great if someone could help me with a solution.
bucket = gcs_client.get_bucket(bucket_name)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)

According to the google-cloud-storage documentation, Blobs are objects that have a name attribute, so you can filter them on that attribute.
from google.cloud import storage
# storage_client = gcs client
storage_client = storage.Client()
# bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
# filter_dir = "filter-string"
[blob.name for blob in blobs if filter_dir in blob.name]

It doesn't allow you to filter server-side, but you can use the fields parameter to return only the names of the objects, limiting the amount of data returned and making it easy to filter.

You can filter for a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you do a command like:
gsutil ls gs://your-bucket/abc*.txt
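For example, a minimal client-side filtering sketch along those lines; the bucket name, prefix, and glob patterns are placeholders, and the fields argument is optional and only trims the API response to object names:
import fnmatch
from google.cloud import storage

storage_client = storage.Client()

# Server-side: restrict listing to the "sales/" prefix; fields keeps the response small.
blobs = storage_client.list_blobs(
    "data",                                # bucket name (placeholder)
    prefix="sales/",
    fields="items(name),nextPageToken",
)

# Client-side: keep only the objects matching the glob patterns.
wanted = [b.name for b in blobs
          if fnmatch.fnmatch(b.name, "sales/name*.csv")
          or fnmatch.fnmatch(b.name, "sales/name*.meta")]
print(wanted)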

You can use the following, taking name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here name is the file-name filter and .ext is your extension filter.

Related

How can I delete specific sub-directories in Google Cloud Storage with the Python SDK

I have the following code:
storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix="content/")
print('Blobs:')
for blob in blobs:
    print(blob.name)
My storage hierarchy is:
"content/{uid}/subContentDirectory1/anyFiles*"
"content/{uid}/subContentDirectory2/anyFiles*"
My goal is to be able to delete content/{uid}/subContentDirectory1 from the Python code.
With the above code, I am retrieving all the files in all sub-directories of content, but I don't know how to use a wildcard so that only a specific subdirectory is deleted.
I am aware I can do it with gsutil, but this will be running in an AWS Lambda function, so I would rather use the Python SDK.
How can I do that?
EDIT: to make it clearer, {uid} is the wildcard; it can potentially match millions of results, and under each of those results I want to delete subContentDirectory1.
The list_blobs function of the Bucket class lets you filter blobs by the prefix that corresponds to a given subfolder, so you can delete all the objects (files) within it. After that, you can use the delete method to remove each blob.
Here is the updated code:
storage_client = storage.Client()
bucket = storage.Bucket(storage_client, name="mybucket")
blobs = storage_client.list_blobs(bucket_or_name=bucket, prefix="content/{uid}/subContentDirectory1/")
print('Blobs to delete:')
for blob in blobs:
    print(blob.name)
    blob.delete()
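Since {uid} is itself a wildcard, a literal prefix like the one above only works for a single known uid. One way to cover all uids is to first list the top-level prefixes under content/ (using delimiter="/") and then delete each subContentDirectory1/ prefix in turn. A rough sketch, assuming the bucket name and hierarchy from the question; with millions of uids this issues a very large number of requests, so it only illustrates the listing pattern:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("mybucket")

# First pass: collect the "content/{uid}/" prefixes. The prefixes set is only
# populated once the iterator's pages have been consumed.
iterator = storage_client.list_blobs(bucket, prefix="content/", delimiter="/")
for page in iterator.pages:
    pass
uid_prefixes = iterator.prefixes

# Second pass: delete every blob under each uid's subContentDirectory1/.
for uid_prefix in uid_prefixes:
    for blob in storage_client.list_blobs(bucket, prefix=uid_prefix + "subContentDirectory1/"):
        blob.delete()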

Create new csv file in Google Cloud Storage from cloud function

This is my first time working with Google Cloud Storage. Below I have a cloud function which is triggered whenever a csv file gets uploaded to my-folder inside my bucket. My goal is to create a new csv file in the same folder, read the contents of the uploaded csv, and convert each line to a URL that will go into the newly created csv. The problem is I'm having trouble just creating the new csv in the first place, let alone actually writing to it.
My code:
import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO
def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']

        if not file_name.endswith('.csv'):
            return
These next few lines came from an example in GCP's GitHub repo. This is where I would expect the new csv to be created, but nothing happens.
# Prepend 'URL_' to the uploaded file name for the name of the new csv
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
# Transform uploaded csv to string - this was recommended on a similar SO post, not sure if this works or is the right approach...
blob = bucket.blob(file_name)
blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob)
input_csv = csv.reader(blob)
The next line is where I get an error: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'
with open(output, 'w') as output_csv:
    csv_dict_reader = csv.DictReader(input_csv)
    csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    csv_writer.writeheader()
    line_count = 0
    for row in csv_dict_reader:
        line_count += 1
        url = ''
        ...
        # code that converts each line
        ...
        csv_writer.writerow({'URL': url})
    print(f'Total rows: {line_count}')
If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!
I would say that I have a few questions about the code and the design of the solution:
As I understand it, on the one hand the cloud function is triggered by a finalize event (Google Cloud Storage Triggers); on the other hand you would like to save a newly created file into the same bucket. On success, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?
Ontologically there is no such thing as a folder. Thus in this code:
folder_name = 'my-folder'
file_name = data['name']
the first line is a bit redundant, unless you would like to use that variable and value for something else... and file_name gets the object name including all prefixes (you may consider them as "folders").
The example you refer to - storage_compose_file.py - is about how a few objects in GCS can be composed into one. I am not sure that example is relevant for your case, unless you have some additional requirements.
Now, let's have a look at this snippet:
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
a. bucket.blob - is a factory constructor - see the API Buckets description. I am not sure you really want to use the bucket_name as part of its argument...
b. sources - becomes a list with only one element - a reference to the existing object in the GCS bucket.
c. destination.compose(sources) - is it an attempt to make a copy of the existing object? If successful - it may trigger another instance of your cloud function.
About type changes
blob = bucket.blob(file_name)
blob = blob.download_as_string()
After the first line the blob variable has the type google.cloud.storage.blob.Blob. After the second - bytes. I think Python allows such things... but would you really want that? By the way, the download_as_string method is deprecated - see the Blobs / Objects API.
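For reference, a small sketch with the non-deprecated call (download_as_text is the current replacement for download_as_string when you want a str):
blob = bucket.blob(file_name)
csv_text = blob.download_as_text()          # returns a str, decoded as UTF-8 by default
input_csv = csv.reader(StringIO(csv_text))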
About the output:
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
with open(output, 'w') as output_csv:
Bear in mind that all of this happens inside the memory of the cloud function; it has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you have to put them under the /tmp directory (see Write temporary files from Google Cloud Function). I would guess that you get the error because of this.
=> Coming to some suggestions.
You probably want to download the object into the cloud function's memory (into the /tmp directory), then process the source file and save the result next to it, and finally upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those things step by step and checking that you get the desired result at each step.
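A rough sketch of that download/process/upload flow, under the same assumptions; OUTPUT_BUCKET and make_url are placeholders, not part of the original code:
import csv
from google.cloud import storage

OUTPUT_BUCKET = 'my-output-bucket'   # hypothetical destination bucket

def generate_urls(data, context):
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(data['bucket'])
    file_name = data['name']
    if not file_name.endswith('.csv'):
        return

    # 1. Download the uploaded object to the function's writable /tmp directory.
    local_in = '/tmp/input.csv'
    local_out = '/tmp/output.csv'
    source_bucket.blob(file_name).download_to_filename(local_in)

    # 2. Process it locally.
    with open(local_in, newline='') as f_in, open(local_out, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=['URL'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in csv.reader(f_in):
            writer.writerow({'URL': make_url(row)})   # make_url: your conversion logic (placeholder)

    # 3. Upload the result to a different bucket, to avoid re-triggering this function.
    out_blob = storage_client.bucket(OUTPUT_BUCKET).blob('URL_' + file_name)
    out_blob.upload_from_filename(local_out, content_type='text/csv')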
There are two ways to save a csv to Google Cloud Storage.
Either you save it directly to GCS with the gcsfs package in "requirements.txt", or you use the container's /tmp folder and push the file from there to the GCS bucket afterwards.
Use the power of the Python package "gcsfs"
gcsfs stands for "Google Cloud Storage File System". Add
gcsfs==2021.11.1
or another version to your "requirements.txt". You do not use this package name directly in the code; installing it just allows you to save to Google Cloud Storage directly, with no interim /tmp write and push to the GCS bucket needed. You can also store the file in a sub-directory.
You can save a dataframe for example with:
df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')
or:
df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')
or use an environment variable set in the first menu step when creating the CF:
from os import environ
df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
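If you do not have a DataFrame at hand, gcsfs can also be used directly to write raw CSV text; a minimal sketch, with the bucket and object names as placeholders:
import gcsfs

fs = gcsfs.GCSFileSystem()
with fs.open('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv', 'w') as f:
    f.write('URL\n')
    f.write('https://example.com/some/converted/line\n')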
Not sure whether this is needed, but I saw an example where the gcsfs package is installed together with
fsspec==2021.11.1
and it will not hurt to add it. My tests of saving a small df to csv on GCS did not need that package, though. Since I am not sure about this helper module, here is a quote:
Purpose (of fsspec):
To produce a template or specification for a file-system interface,
that specific implementations should follow, so that applications
making use of them can rely on a common behaviour and not have to
worry about the specific internal implementation decisions with any
given backend. Many such implementations are included in this package,
or in sister projects such as s3fs and gcsfs.
In addition, if this is well-designed, then additional functionality,
such as a key-value store or FUSE mounting of the file-system
implementation may be available for all implementations "for free".
First in container's "/tmp", then push to GCS
Here is an example of how to do what the other answer suggests: store the file first in the container's /tmp (and only there, no other directory is writable) and then move it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):
from os import path

import pandas as pd
from google.cloud import storage

# function `write_to_csv_file()` not used, but it might be helpful if no df is at hand:
# def write_to_csv_file(file_path, file_content, root):
#     """ Creates a file at runtime. """
#     file_path = path.join(root, file_path)
#
#     # If the file is binary, use 'wb' instead of 'w'
#     with open(file_path, 'w') as file:
#         file.write(file_content)

def push_to_gcs(file, bucket):
    """ Writes to Google Cloud Storage. """
    file_name = file.split('/')[-1]
    print(f"Pushing {file_name} to GCS...")
    blob = bucket.blob(file_name)
    blob.upload_from_filename(file)
    print(f"File pushed to {blob.id} successfully.")

# Root path on a CF will be /workspace, while on local Windows: C:\
root = path.dirname(path.abspath(__file__))

file_name = 'test_export.csv'
# This is the main step: you *must* use `/tmp`:
file_path = '/tmp/' + file_name

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv(path.join(root, file_path), index=False)

# If you have a df anyway, `df.to_csv()` is easier.
# The following file writer should rather be used if you have records instead (here: dfAsString).
# Since we do not use the function `write_to_csv_file()`, it is commented out above,
# but it can be useful if no df is at hand.
# dfAsString = df.to_string(header=True, index=False)
# write_to_csv_file(file_path, dfAsString, root)

# Cloud Storage client: move the csv file to Cloud Storage
storage_client = storage.Client()
bucket_name = MY_GOOGLE_STORAGE_BUCKET_NAME
bucket = storage_client.get_bucket(bucket_name)
push_to_gcs(path.join(root, file_path), bucket)

How to apply regex for Google Cloud Storage Bucket using Python

I am fetching objects from Google Cloud Storage using Python; the folder contains many files (around 20000).
But I just need one particular file, which is a .json file; all other files are in csv format. For now I am using the following code with the prefix option:
from google.cloud import storage
import json
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="input"))
for blob in blobs:
    if '.json' in blob.name:
        filename = blob.name
        break
This approach is not stable, as the file count is going to increase and it will take a long time to filter out the json file (the file name is dynamic and could be anything).
Is there any option, like a regex filter, that can be applied while fetching the data from Cloud Storage?
If you want to check filename/extension against a regex it's quite easy.
Just import the 're' module at the beginning
import re
And check against a regex inside the loop:
for blob in blobs:
    if re.search(r'\.json$', blob.name):
        filename = blob.name
        break
You can develop the regex at regex101.com before you bake it into your code.
BTW - I prefer to check extensions with str.endswith which is quite fast:
for blob in blobs:
    if blob.name.endswith('.json'):
        filename = blob.name
        break
I wouldn't use
if '.json' in filename:
etc...
because it might match other filenames like 'compressed.json.gz'.
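Note that newer versions of the google-cloud-storage library (2.10.0 and later, if I remember correctly) also support a match_glob argument on list_blobs, which does the pattern matching server-side; a small sketch, assuming your files live under the "input/" prefix:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# Only objects under input/ whose names end in .json are returned.
for blob in bucket.list_blobs(match_glob="input/**.json"):
    filename = blob.name
    break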

Google Cloud Storage - Move file from one folder to another - By using Python

I would like to move a list of files from Google Storage to another folder:
storage_client = storage.Client()
count = 0
# Retrieve all blobs with a prefix matching the file.
bucket = storage_client.get_bucket(BUCKET_NAME)
# List blobs and iterate over the folder
blobs = bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # Excluding folders inside the bucket
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):
        # WHAT CAN GO HERE?
        count += 1
The only useful information I found in the Google documentation is:
https://cloud.google.com/storage/docs/renaming-copying-moving-objects
According to this documentation, the only method is to copy the object from one folder to another and then delete the original.
Is there any way to actually MOVE files?
What is the best way to move all the files based on a prefix like *BLABLA*.csv?
P.S. I do not want to use
"gsutil mv gs://[SOURCE_BUCKET_NAME]/[SOURCE_OBJECT_NAME]
gs://[DESTINATION_BUCKET_NAME]/[DESTINATION_OBJECT_NAME]"
This could be a possible solution, as there is no move_blob function in google.cloud.storage:
from google.cloud import storage

storage_client = storage.Client()
dest_bucket = storage_client.create_bucket(bucket_to)
source_bucket = storage_client.get_bucket(bucket_from)
blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested

for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
        source_bucket.delete_blob(blob.name)
You could use the rename_blob method of google.cloud.storage.Bucket; it copies the file under the new name and deletes the old one. Note that rename_blob works within a single bucket, so to move a blob into another folder you give it a new name that starts with the destination folder prefix (the new name must differ from the old one).
Take a look at the code below:
from google.cloud import storage

storage_client = storage.Client()
source_bucket = storage_client.get_bucket(bucket_from)
blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested

for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # DEST_FOLDER_PATH: the destination folder prefix, e.g. 'archive/' (placeholder)
        new_name = blob.name.replace(GS_FILES_PATH, DEST_FOLDER_PATH, 1)
        source_bucket.rename_blob(blob, new_name=new_name)

How to get a List of Files in IBM COS Bucket using Watson Studio

I have a working Python script for consolidating multiple xlsx files that I want to move to a Watson Studio project. My current code uses a path variable which is passed to glob...
path = '/Users/Me/My_Path/*.xlsx'
files = glob.glob(path)
Since credentials in Watson Studio are specific to individual files, how do I get a list of all files in my IBM COS storage bucket? I'm also wondering how to create folders to separate the files in my storage bucket?
Watson Studio Cloud provides a helper library named project-lib for working with objects in your Cloud Object Storage instance. Take a look at this documentation for using the package in Python: https://dataplatform.cloud.ibm.com/docs/content/analyze-data/project-lib-python.html
For your specific question, get_files() should do what you need. It returns a list of all the files in your bucket; you can then do pattern matching to keep only what you need. Based on this filtered list you can iterate and call get_file(file_name) for each file_name in your list.
To create a "folder" in your bucket, you need to follow a naming convention for files to create a "pseudo folder". For example, if you want to create a "data" folder of assets, you should prefix file names for objects belonging to this folder with data/.
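For example, saving an object into a pseudo "data" folder with project-lib would look roughly like this (a sketch; project is the project-lib Project object and the file name is a placeholder):
# data holds the bytes or string content to store
project.save_data("data/consolidated.xlsx", data, overwrite=True)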
The credentials in IBM Cloud Object Storage (COS) are at the COS instance level, not at the individual file level. Each COS instance can have any number of buckets, with each bucket containing files.
You can get the credentials for the COS instance from Bluemix console.
https://console.bluemix.net/docs/services/cloud-object-storage/iam/service-credentials.html#service-credentials
You can use boto3 python package to access the files.
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
import boto3

s3c = boto3.client('s3', endpoint_url='XXXXXXXXX', aws_access_key_id='XXXXXXXXXXX', aws_secret_access_key='XXXXXXXXXX')
s3c.list_objects(Bucket=bucket_name, Prefix=file_path)
s3c.download_file(Filename=filename, Bucket=bucket, Key=objectname)
s3c.upload_file(Filename=filename, Bucket=bucket, Key=objectname)
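Since list_objects returns at most 1000 keys per call, a paginator is the usual pattern for larger buckets; a short sketch reusing the s3c client and the placeholder names from above:
paginator = s3c.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=file_path):
    for obj in page.get('Contents', []):
        print(obj['Key'])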
There's probably a more Pythonic way to write this, but here is the code I wrote using project-lib, per the answer provided by @Greg Filla:
files = []  # List to hold data file names

# Get list of all file names in the storage bucket
all_files = project.get_files()  # returns a list of dictionaries

# Create list of file names to load based on prefix
for f in all_files:
    if f['name'][:3] == DataFile_Prefix:
        files.append(f['name'])

print("There are " + str(len(files)) + " data files in the storage bucket.")
