How to apply regex for Google Cloud Storage Bucket using Python

I am fetching objects from Google Cloud Storage using Python. The folder contains many files (around 20,000), but I only need one particular .json file; all the other files are in CSV format. For now I am using the following code with the prefix option:
from google.cloud import storage
import json

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="input"))
for blob in blobs:
    if '.json' in blob.name:
        filename = blob.name
        break
This approach is not reliable: the file count will keep growing, and filtering for the JSON file will take more and more time (the file name is dynamic and could be anything).
Is there any option, such as a regex filter, that can be applied while fetching the data from Cloud Storage?

If you want to check a filename/extension against a regex, it's quite easy.
Just import the re module at the beginning:
import re
And check against a regex inside the loop. Note that re.match only matches at the start of the string, so for an end-of-name pattern like this one use re.search:
for blob in blobs:
    if re.search(r'\.json$', blob.name):
        filename = blob.name
        break
You can develop the regex at regex101.com before you bake it into your code.
BTW - I prefer to check extensions with str.endswith which is quite fast:
for blob in blobs:
    if blob.name.endswith('.json'):
        filename = blob.name
        break
I wouldn't use
if '.json' in blob.name:
etc., because it might also match other filenames like 'compressed.json.gz'.
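If your version of google-cloud-storage is recent enough, list_blobs also exposes the API's matchGlob filter as a match_glob argument, which pushes the pattern to the server so only matching objects are returned at all. A minimal sketch, assuming your client library version supports match_glob (please verify before relying on it):
from google.cloud import storage

storage_client = storage.Client()
# "**" also crosses "/" separators, so this returns every object whose name
# ends in ".json" anywhere in the bucket; the filtering happens server-side
blobs = storage_client.list_blobs(bucket_name, match_glob="**.json")
for blob in blobs:
    filename = blob.name
    break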

Related

How to extract only the numeric digits from a .csv file name from a BigQuery location using Python

I have a CSV file in my Google Cloud Storage bucket and its name is like 'select_tb22322_c'. Now I need to create a Python variable MAIL and extract only 22322 from the above .csv file name, so that MAIL holds only 22322. What would the Python or BigQuery code be here? Kindly respond.
I tried some Python code but it's not working. I need a solution with Python code.
Assuming the pattern of "select_tb22322_c" is consistent, you can create a regex that captures the digits in this pattern. See the approach below:
from google.cloud import storage
import re

storage_client = storage.Client()
bucket = storage_client.get_bucket('your-bucket-name')
blobs = bucket.list_blobs()
for blob in blobs:
    word = re.match(r'\w+_tb(\d+)_\w+\.csv', blob.name)
    if word:
        MAIL = word.group(1)
        print(MAIL)
Output:
22322
NOTE: Modify the regex to be stricter so that it matches your desired CSV names precisely.
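For a quick sanity check of the capture group before wiring it to the bucket listing (the sample name below is just the one from the question with a .csv extension assumed):
import re

# group(1) is whatever (\d+) captured between "_tb" and the following "_"
match = re.match(r'\w+_tb(\d+)_\w+\.csv', 'select_tb22322_c.csv')
if match:
    print(match.group(1))  # 22322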

Moving files from one bucket to another in GCS

I have written code to move files from one bucket to another in GCS using Python. The bucket has multiple subfolders and I am trying to move only the Day folder to a different bucket. Source path: /Bucketname/projectname/XXX/Day. Target path: /Bucketname/Archive/Day.
Is there a way to directly move/copy the Day folder without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd

def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in one operation, because folders don't exist in Cloud Storage. All objects live at the bucket level, and an object's name is its full path.
By convention, and for (poor) human readability, slashes / act as folder separators, but that's an illusion!
So you have no option other than listing all the files with the same prefix (the "folder path") and iterating over them.
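Each object still has to be copied and deleted individually, but since these are independent API calls you can usually shorten the wall-clock time by issuing them concurrently. A minimal sketch under that assumption, reusing the question's bucket and prefix names:
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def archive_day_folder(bucket_name, new_bucket_name, source_prefix, target_prefix):
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)

    def move_one(blob):
        # copy into the destination bucket under the new "folder" prefix, then delete the source
        new_name = blob.name.replace(source_prefix, target_prefix, 1)
        source_bucket.copy_blob(blob, destination_bucket, new_name)
        blob.delete()

    blobs = list(source_bucket.list_blobs(prefix=source_prefix))
    # each object is still moved one by one, but the API calls run in parallel
    with ThreadPoolExecutor(max_workers=8) as executor:
        list(executor.map(move_one, blobs))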

Create new csv file in Google Cloud Storage from cloud function

This is my first time working with Google Cloud Storage. Below I have a Cloud Function which is triggered whenever a CSV file gets uploaded to my-folder inside my bucket. My goal is to create a new CSV file in the same folder, read the contents of the uploaded CSV, and convert each line to a URL that will go into the newly created CSV. The problem is that I'm having trouble just creating the new CSV in the first place, let alone actually writing to it.
My code:
import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO

def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']
        if not file_name.endswith('.csv'):
            return
These next few lines came from an example in GCP's GitHub repo. This is where I would expect the new csv to be created, but nothing happens.
# Prepend 'URL_' to the uploaded file name for the name of the new csv
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
# Transform uploaded csv to string - this was recommended on a similar SO post, not sure if this works or is the right approach...
blob = bucket.blob(file_name)
blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob)
input_csv = csv.reader(blob)
The next line is where I get an error: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'
with open(output, 'w') as output_csv:
    csv_dict_reader = csv.DictReader(input_csv, )
    csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    csv_writer.writeheader()
    line_count = 0
    for row in csv_dict_reader:
        line_count += 1
        url = ''
        ...
        # code that converts each line
        ...
        csv_writer.writerow({'URL': url})
    print(f'Total rows: {line_count}')
If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!
Probably I would say that I have a few questions about the code and the design of the solution:
As I understand it, on the one hand the cloud function is triggered by a finalize event (Google Cloud Storage Triggers); on the other hand, you would like to save a newly created file into the same bucket. Upon success, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?
Ontologically there is no such thing as a folder. Thus in this code:
folder_name = 'my-folder'
file_name = data['name']
the first line is a bit redundant, unless you would like to use that variable and value for something else... and file_name gets the object name including all prefixes (which you may consider as "folders").
The example you refer to - storage_compose_file.py - is about how a few objects in GCS can be composed into one. I am not sure that example is relevant for your case, unless you have some additional requirements.
Now, let's have a look at this snippet:
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
a. bucket.blob - is a factory constructor - see the Buckets API description. I am not sure you really want to use bucket_name as part of its argument...
b. sources - becomes a list with only one element - a reference to the existing object in the GCS bucket.
c. destination.compose(sources) - is this an attempt to make a copy of the existing object? If it succeeds, it may trigger another instance of your cloud function.
About type changes
blob = bucket.blob(file_name)
blob = blob.download_as_string()
After the first line the blob variable has the type google.cloud.storage.blob.Blob. After the second - bytes. I think Python allows such things... but would you really like it? BTW, the download_as_string method is deprecated (use download_as_bytes or download_as_text instead) - see the Blobs / Objects API.
About the output:
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
with open(output, 'w') as output_csv:
Bear in mind - all of that happens inside the memory of the cloud function; it has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you have to put them in the /tmp directory (see Write temporary files from Google Cloud Function). I would guess that you get the error because of this issue.
=> Coming to some suggestions.
You probably want to download the object into the cloud function's memory (into the /tmp directory), then process the source file and save the result next to the source, and then upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those things step by step and checking that you get the desired result at each step.
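To make those steps concrete, here is a minimal sketch of that flow (download to /tmp, write the result to /tmp, upload the result elsewhere). The output bucket name and the line-to-URL conversion are placeholders, and the output deliberately goes to a different bucket so this function does not re-trigger itself:
import csv
from google.cloud import storage

def generate_urls(data, context):
    if context.event_type != 'google.storage.object.finalize':
        return
    file_name = data['name']
    if not file_name.endswith('.csv'):
        return

    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(data['bucket'])

    # Cloud Functions only allow writes under /tmp
    local_in = '/tmp/input.csv'
    local_out = '/tmp/output.csv'
    source_bucket.blob(file_name).download_to_filename(local_in)

    with open(local_in, newline='') as src, open(local_out, 'w', newline='') as dst:
        writer = csv.DictWriter(dst, fieldnames=['URL'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in csv.reader(src):
            url = ''  # placeholder: convert `row` to a URL here
            writer.writerow({'URL': url})

    # upload into a *different* bucket so this function is not re-triggered
    output_bucket = storage_client.get_bucket('my-output-bucket')  # placeholder name
    output_bucket.blob('URL_' + file_name).upload_from_filename(local_out)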
You can save a csv in Google Cloud Storage in two ways.
Either you save it directly to GCS with the gcsfs package in "requirements.txt", or you use the container's /tmp folder and push the file to the GCS bucket from there afterwards.
Use the power of the Python package "gcsfs"
gcsfs stands for "Google Cloud Storage File System". Add
gcsfs==2021.11.1
or another version to your "requirements.txt". You do not use this package name directly in the code; instead, installing it allows you to save to Google Cloud Storage directly, with no interim /tmp-and-push-to-GCS step needed. You can also store the file in a sub-directory.
You can save a dataframe for example with:
df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')
or:
df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')
or use an environment variable that you define in the first menu step when creating the Cloud Function:
from os import environ
df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
Not sure whether this is needed, but I saw an example where the gcsfs package is installed together with
fsspec==2021.11.1
and it will not hurt to add it. My tests of saving a small df to csv on GCS did not need that package, though. Since I am not sure about this helper module, a quote:
Purpose (of fsspec):
To produce a template or specification for a file-system interface,
that specific implementations should follow, so that applications
making use of them can rely on a common behaviour and not have to
worry about the specific internal implementation decisions with any
given backend. Many such implementations are included in this package,
or in sister projects such as s3fs and gcsfs.
In addition, if this is well-designed, then additional functionality,
such as a key-value store or FUSE mounting of the file-system
implementation may be available for all implementations "for free".
First in the container's "/tmp", then push to GCS
Here is an example of how to do what the other answer suggests: storing the file first in the container's /tmp (and only there, no other directory is possible) and then moving it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):
from os import path
import pandas as pd
from google.cloud import storage

# function `write_to_csv_file()` not used but might be helpful if no df at hand:
#def write_to_csv_file(file_path, file_content, root):
#    """ Creates a file on runtime. """
#    file_path = path.join(root, file_path)
#
#    # If file is a binary, we rather use 'wb' instead of 'w'
#    with open(file_path, 'w') as file:
#        file.write(file_content)

def push_to_gcs(file, bucket):
    """ Writes to Google Cloud Storage. """
    file_name = file.split('/')[-1]
    print(f"Pushing {file_name} to GCS...")
    blob = bucket.blob(file_name)
    blob.upload_from_filename(file)
    print(f"File pushed to {blob.id} successfully.")

# Root path on CF will be /workspace, while on local Windows: C:\
root = path.dirname(path.abspath(__file__))
file_name = 'test_export.csv'

# This is the main step: you *must* use `/tmp`:
file_path = '/tmp/' + file_name

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv(path.join(root, file_path), index=False)
# If you have a df anyway, `df.to_csv()` is easier.
# The following file writer should rather be used if you have records instead
# (here: dfAsString). Since we do not use the function `write_to_csv_file()`,
# it is also commented out above, but it can be useful if no df is at hand.
# dfAsString = df.to_string(header=True, index=False)
# write_to_csv_file(file_path, dfAsString, root)

# Cloud Storage client: move the csv file to Cloud Storage
storage_client = storage.Client()
bucket_name = MY_GOOGLE_STORAGE_BUCKET_NAME  # your bucket name (placeholder)
bucket = storage_client.get_bucket(bucket_name)
push_to_gcs(path.join(root, file_path), bucket)

Google Cloud Storage List Blob objects with specific file name

With the help of google.cloud.storage and list_blobs I can get the list of files from a specific bucket. But I want to filter (name*.ext) for exact files in the bucket, and I was not able to find an exact solution.
For example: buket=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get specific invoices and their metadata (name*.csv & name*.meta). Also, if I loop over all the blobs of that folder just to pick out the selected files, it will be a huge volume of data and may affect performance.
It would be good if someone could help me with a solution.
bucket = gcs_client.get_bucket(buket)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)
According to the google-cloud-storage documentation, Blobs are objects that have a name attribute, so you can filter them by this attribute.
from google.cloud import storage
# storage_client = gcs client
storage_client = storage.Client()
# bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
# filter_dir = "filter-string"
[blob.name for blob in blobs if filter_dir in blob.name ]
It doesn't allow you to filter server-side by pattern, but you can use the fields parameter to return just the names of the objects, limiting the amount of data returned and making it easy to filter.
You can filter by a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you run a command like:
gsutil ls gs://your-bucket/abc*.txt
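Combining the two points above (ask the API for names only via fields, then match the names client-side), a sketch could look like this; the bucket, prefix, and suffixes are just taken from the question:
from google.cloud import storage

storage_client = storage.Client()
# "items(name),nextPageToken" asks the listing API to return only object names
# (nextPageToken must be kept so pagination still works); the name matching
# itself still happens client-side
blobs = storage_client.list_blobs('data', prefix='sales',
                                  fields='items(name),nextPageToken')
wanted = [blob.name for blob in blobs
          if blob.name.endswith('.csv') or blob.name.endswith('.meta')]
print(wanted)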
You can use the following, with name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here name is the file-name filter and .ext is your extension filter.

Google Cloud Storage - Move file from one folder to another - By using Python

I would like to move a list of files in Google Cloud Storage to another folder:
storage_client = storage.Client()
count = 0
# Retrieve all blobs with a prefix matching the file.
bucket = storage_client.get_bucket(BUCKET_NAME)
# List blobs iterate in folder
blobs = bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # Excluding folder inside bucket
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):
        # WHAT CAN GO HERE?
        count += 1
The only useful information I found in the Google documentation is:
https://cloud.google.com/storage/docs/renaming-copying-moving-objects
According to this documentation, the only method is to copy the object from one folder to another and then delete the original.
Is there any way to actually MOVE files?
What is the best way to move all the files based on a PREFIX like *BLABLA*.csv?
P.S. I do not want to use
"gsutil mv gs://[SOURCE_BUCKET_NAME]/[SOURCE_OBJECT_NAME]
gs://[DESTINATION_BUCKET_NAME]/[DESTINATION_OBJECT_NAME]"
This could be a possible solution, as there is no move_blob function in google.cloud.storage:
from google.cloud import storage
import fnmatch

storage_client = storage.Client()
dest_bucket = storage_client.create_bucket(bucket_to)
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # copy the object to the destination bucket, then delete the source object
        source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
        source_bucket.delete_blob(blob.name)
You could use the method rename_blob in google.cloud.storage.Bucket; it copies the file under the new name and deletes the old one, so it effectively moves the object. Note that rename_blob works within the bucket it is called on, and the new name must be different from the old one - here the difference is the target "folder" prefix (NEW_GS_FILES_PATH below is a placeholder for it).
Take a look at the code below:
from google.cloud import storage
import fnmatch

storage_client = storage.Client()
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # rename = copy under the new name + delete the original, i.e. a move
        source_bucket.rename_blob(blob, new_name=blob.name.replace(GS_FILES_PATH, NEW_GS_FILES_PATH))
