Create new csv file in Google Cloud Storage from cloud function - python

First time working with Google Cloud Storage. Below I have a cloud function which is triggered whenever a csv file gets uploaded to my-folder inside my bucket. My goal is to create a new csv file in the same folder, read the contents of the uploaded csv and convert each line to a URL that will go into the newly created csv. Problem is I'm having trouble just creating the new csv in the first place, let alone actually writing to it.
My code:
import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO
def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']

        if not file_name.endswith('.csv'):
            return
These next few lines came from an example in GCP's GitHub repo. This is where I would expect the new csv to be created, but nothing happens.
        # Prepend 'URL_' to the uploaded file name for the name of the new csv
        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)
        output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]

        # Transform uploaded csv to string - this was recommended on a similar SO post,
        # not sure if this works or is the right approach...
        blob = bucket.blob(file_name)
        blob = blob.download_as_string()
        blob = blob.decode('utf-8')
        blob = StringIO(blob)
        input_csv = csv.reader(blob)
On the next line is where I get an error: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'
        with open(output, 'w') as output_csv:
            csv_dict_reader = csv.DictReader(input_csv)
            csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
            csv_writer.writeheader()
            line_count = 0
            for row in csv_dict_reader:
                line_count += 1
                url = ''
                ...
                # code that converts each line
                ...
                csv_writer.writerow({'URL': url})
            print(f'Total rows: {line_count}')
If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!

I would say I have a few questions about the code and the design of the solution:
As I understand it, on the one hand the cloud function is triggered by a finalize event (Google Cloud Storage Triggers); on the other hand you would like to save a newly created file into the same bucket. Upon success, the appearance of that new object in the bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?
Ontologically there is no such thing as a folder in Cloud Storage. Thus in this code:
folder_name = 'my-folder'
file_name = data['name']
the first line is a bit redundant, unless you would like to use that variable and value for something else... and file_name gets the object name including all prefixes (you may consider them as "folders").
The example you refer to - storage_compose_file.py - is about how several objects in GCS can be composed into one. I am not sure that example is relevant for your case, unless you have some additional requirements.
Now, let's have a look at this snippet:
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
a. bucket.blob - is a factory constructor - see the API Buckets description. I am not sure you really want to use the bucket_name as part of its argument; the blob name should be the object path inside the bucket...
b. sources - becomes a list with only one element - a reference to the existing object in the GCS bucket.
c. destination.compose(sources) - is this an attempt to make a copy of the existing object? If successful, it may trigger another instance of your cloud function.
About type changes
blob = bucket.blob(file_name)
blob = blob.download_as_string()
After the first line the blob variable has the type google.cloud.storage.blob.Blob. After the second - bytes. Python allows such things... but would you really like it? BTW, the download_as_string method is deprecated in favour of download_as_bytes / download_as_text - see the Blobs / Objects API.
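For reference, a minimal sketch of the non-deprecated calls; the bucket and object names here are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')        # placeholder bucket name
blob = bucket.blob('my-folder/my_file.csv')    # placeholder object name

raw_bytes = blob.download_as_bytes()   # replacement for the deprecated download_as_string()
text = blob.download_as_text()         # downloads and decodes to str (UTF-8 by default)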
About the output:
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
with open(output, 'w') as output_csv:
Bear in mind - all of that happens inside the memory of the cloud function. It has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you must write them under the /tmp directory - see Write temporary files from Google Cloud Function. I would guess that you get the error because of this issue.
=> Coming to some suggestions.
You probably would like to download the object into the cloud function's memory (into the /tmp directory), then process the source file and save the result next to the source, and then upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those things step by step and checking that you get the desired result at each step, as in the sketch below.
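For illustration, a hedged sketch of that flow under the assumptions above (download to /tmp, process, upload to a different bucket); the output bucket name and the URL-building step are placeholders you would replace with your own logic:

import csv
import os

from google.cloud import storage

OUTPUT_BUCKET = 'my-output-bucket'  # placeholder: a different bucket, so this function is not re-triggered

def generate_urls(data, context):
    if context.event_type != 'google.storage.object.finalize':
        return
    file_name = data['name']
    if not file_name.endswith('.csv'):
        return

    client = storage.Client()
    source_bucket = client.get_bucket(data['bucket'])

    # 1. Download the uploaded csv into the function's writable /tmp directory
    local_in = os.path.join('/tmp', os.path.basename(file_name))
    source_bucket.blob(file_name).download_to_filename(local_in)

    # 2. Process it locally, writing the result next to the source
    local_out = os.path.join('/tmp', 'URL_' + os.path.basename(file_name))
    with open(local_in, newline='') as f_in, open(local_out, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=['URL'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in csv.DictReader(f_in):
            url = ''  # placeholder: build the URL from the row here
            writer.writerow({'URL': url})

    # 3. Upload the result to the other bucket
    out_blob = client.get_bucket(OUTPUT_BUCKET).blob('URL_' + os.path.basename(file_name))
    out_blob.upload_from_filename(local_out, content_type='text/csv')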

You can save a csv to Google Cloud Storage in two ways.
Either you save it directly to GCS with the gcsfs package added to "requirements.txt", or you write it to the container's /tmp folder and push it to the GCS bucket from there afterwards.
Use the power of the Python package "gcsfs"
gcsfs stands for "Google Cloud Storage File System". Add
gcsfs==2021.11.1
or another version to your "requirements.txt". You do not use this package name directly in the code; its installation simply allows you to save to Google Cloud Storage directly, with no interim /tmp file and no separate push to a GCS bucket needed. You can also store the file in a sub-directory.
You can save a dataframe for example with:
df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')
or:
df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')
or use an environment variable that you set in the first menu step when creating the CF:
from os import environ
df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
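For context, a minimal end-to-end sketch of the gcsfs approach; the bucket name and object path are placeholders, and gcsfs is assumed to be listed in "requirements.txt":

import pandas as pd  # pandas delegates "gs://" paths to gcsfs/fsspec under the hood

def export_csv(request):
    """ Small HTTP-triggered example that writes a dataframe straight to GCS. """
    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
    df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv', index=False)  # placeholder path
    return 'done'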
Not sure whether this is needed, but I saw an example where the gcsfs package is installed together with
fsspec==2021.11.1
and it will not hurt to add it. My tests of saving a small df to csv on GCS did not need that package, though. Since I am not sure about this helper module, here is a quote:
Purpose (of fsspec):
To produce a template or specification for a file-system interface,
that specific implementations should follow, so that applications
making use of them can rely on a common behaviour and not have to
worry about the specific internal implementation decisions with any
given backend. Many such implementations are included in this package,
or in sister projects such as s3fs and gcsfs.
In addition, if this is well-designed, then additional functionality,
such as a key-value store or FUSE mounting of the file-system
implementation may be available for all implementations "for free".
First in container's "/tmp", then push to GCS
Here is an example of how to do what the other answer suggests: store the file first in the container's /tmp (and only there, no other directory is possible) and then move it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):
from os import path

import pandas as pd
from google.cloud import storage

# function `write_to_csv_file()` not used but might be helpful if no df at hand:
# def write_to_csv_file(file_path, file_content, root):
#     """ Creates a file on runtime. """
#     file_path = path.join(root, file_path)
#
#     # If file is a binary, we rather use 'wb' instead of 'w'
#     with open(file_path, 'w') as file:
#         file.write(file_content)

def push_to_gcs(file, bucket):
    """ Writes to Google Cloud Storage. """
    file_name = file.split('/')[-1]
    print(f"Pushing {file_name} to GCS...")
    blob = bucket.blob(file_name)
    blob.upload_from_filename(file)
    print(f"File pushed to {blob.id} successfully.")

# Root path on CF will be /workspace, while on local Windows: C:\
root = path.dirname(path.abspath(__file__))
file_name = 'test_export.csv'

# This is the main step: you *must* use `/tmp`:
file_path = '/tmp/' + file_name

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv(path.join(root, file_path), index=False)

# If you have a df anyway, `df.to_csv()` is easier.
# The following file writer should rather be used if you have records instead (here: dfAsString).
# Since we do not use the function `write_to_csv_file()`, it is also commented out above,
# but it can be useful if no df is at hand.
# dfAsString = df.to_string(header=True, index=False)
# write_to_csv_file(file_path, dfAsString, root)

# Cloud Storage client: move the csv file to Cloud Storage
storage_client = storage.Client()
bucket_name = MY_GOOGLE_STORAGE_BUCKET_NAME
bucket = storage_client.get_bucket(bucket_name)
push_to_gcs(path.join(root, file_path), bucket)

Related

Moving files from one bucket to another in GCS

I have written code to move files from one bucket to another in GCS using Python. The bucket has multiple subfolders and I am trying to move only the Day folder to a different bucket.
Source path: /Bucketname/projectname/XXX/Day
Target path: /Bucketname/Archive/Day
Is there a way to directly move/copy the Day folder without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd

def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in one operation, because folders don't exist in Cloud Storage. All the objects sit at the bucket level, and the object name is the full path of the object.
By convention, and for (poor) human readability, the slash / is used as a folder separator, but it's a fake!
So you have no other option than to move all the files with the same prefix (the "folder path"), iterating over all of them, as in the sketch below.
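To illustrate, a minimal sketch of that per-object loop, rewriting the prefix as each object is moved; the bucket names and prefixes below are placeholders based on the paths in the question:

from google.cloud import storage

client = storage.Client()
src = client.get_bucket('source-bucket')        # placeholder
dst = client.get_bucket('destination-bucket')   # placeholder

old_prefix = 'projectname/XXX/Day/'   # placeholder source "folder path"
new_prefix = 'Archive/Day/'           # placeholder target "folder path"

for blob in src.list_blobs(prefix=old_prefix):
    # copy each object under its new name, then delete the original
    src.copy_blob(blob, dst, new_name=blob.name.replace(old_prefix, new_prefix, 1))
    blob.delete()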

Deleting a file from Google Cloud using fsspec

I am currently using fsspec to read and write files to my Google Cloud buckets with the code shown below:
with fs.open(gcs_file_tmp, "rb") as fp:
    gcs_file_content = fp.read()
Now I want to remove a file from a GCS bucket, but I cannot seem to find the code for it. Reading the documentation here, it seems as though there are some rm-based functions and also some delete functions, but they don't seem to work when called as fs.rm(...) or similar.
fs.open() returns an fsspec.core.OpenFile instance, which has a fs property.
When you open a GCS file, fs is a GCSFileSystem.
When you open a local file, fs is a LocalFileSystem.
As I understand it, fsspec abstracts away the differences between file systems through the class AbstractFileSystem, and we can call its delete() method.
This way we have to pass the target file path again (we can use the path property if you like).
# for GCS
f = fsspec.open("gs://bucketHoge/fuga.txt")
f.fs.delete(f.path)
# for local
f = fsspec.open("/tmp/fuga.txt")
f.fs.delete(f.path)
I wonder if there is a better way...
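Alternatively, you can get the filesystem object without opening the file first; a small sketch, assuming gcsfs is installed and using the placeholder path from above:

import fsspec

fs = fsspec.filesystem("gs")        # resolves to GCSFileSystem when gcsfs is installed
fs.rm("gs://bucketHoge/fuga.txt")   # rm() / delete() remove the object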

Google Cloud Storage List Blob objects with specific file name

With the help of google.cloud.storage and list_blobs I can get the list of files from a specific bucket. But I want to filter (name*.ext) for exact files in the bucket, and I was not able to find an exact solution.
For example: buket=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get specific invoices and their metadata (name*.csv & name*.meta). Also, if I loop over all_blobs of the entire folder to get the selected files, it will be a huge volume of data and may affect performance.
It would be good if someone could help me with this.
bucket = gcs_client.get_bucket(buket)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)
According to the google-cloud-storage documentation, Blobs are objects that have a name attribute, so you can filter them by this attribute.
from google.cloud import storage
# storage_client = gcs client
storage_client = storage.Client()
# bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
# filter_dir = "filter-string"
[blob.name for blob in blobs if filter_dir in blob.name]
The API doesn't allow you to filter by name pattern, but you can use the fields parameter to return just the names of the objects, limiting the amount of data returned and making it easy to filter client-side.
You can filter for a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you run a command like:
gsutil ls gs://your-bucket/abc*.txt
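For example, a minimal sketch combining the prefix and the fields parameter mentioned above with client-side name filtering; the bucket name, prefix, and filter strings are placeholders:

from google.cloud import storage

storage_client = storage.Client()

# Ask the API for object names only, then filter client-side
blobs = storage_client.list_blobs(
    "my-bucket",                          # placeholder bucket name
    prefix="sales/",                      # placeholder "folder" prefix
    fields="items(name),nextPageToken",   # trims the response to the names
)
matching = [b.name for b in blobs if b.name.endswith(".csv") and "invoice" in b.name]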
You can use the following, taking name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here name will be the fileName filter and .ext will be your extension filter.

Google Cloud Storage - Move file from one folder to another - By using Python

I would like to move a list of files from one Google Storage folder to another:
storage_client = storage.Client()
count = 0

# Retrieve all blobs with a prefix matching the file.
bucket = storage_client.get_bucket(BUCKET_NAME)

# List blobs iterate in folder
blobs = bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # Excluding folder inside bucket
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):
        # WHAT CAN GO HERE?
        count += 1
The only useful information which I found in Google Documentation is:
https://cloud.google.com/storage/docs/renaming-copying-moving-objects
According to this documentation, the only method is to copy the object from one folder to another and then delete the original.
Is there any way to actually MOVE files?
What is the best way to move all the files based on a prefix like *BLABLA*.csv?
P.S. I do not want to use
"gsutil mv gs://[SOURCE_BUCKET_NAME]/[SOURCE_OBJECT_NAME]
gs://[DESTINATION_BUCKET_NAME]/[DESTINATION_OBJECT_NAME]"
This could be a possible solution, as there is no move_blob function in google.cloud.storage:
from google.cloud import storage

dest_bucket = storage_client.create_bucket(bucket_to)
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
        source_bucket.delete_blob(blob.name)
You could use the method rename_blob in google.cloud.storage.Bucket; this function moves the file by copying it and then deleting the old one. Note that the source blob is only deleted when the new name differs from the old one, so the names of the files (blobs) need to be different.
Take a look at the code below:
from google.cloud import storage

dest_bucket = storage_client.create_bucket(bucket_to)
source_bucket = storage_client.get_bucket(bucket_from)

blobs = source_bucket.list_blobs(prefix=GS_FILES_PATH, delimiter='/')  # assuming this is tested
for blob in blobs:
    if fnmatch.fnmatch(blob.name, FILE_PREF):  # assuming this is tested
        # rename_blob takes the blob and the new name; it copies into the bucket it is called on
        dest_bucket.rename_blob(blob, new_name=blob.name)

Transfer files from S3 Bucket to another keeping folder structure - python boto

I have found many questions related to this with solutions using boto3; however, I am in a position where I have to use boto, running Python 2.38.
Now I can successfully transfer my files in their folders (not real folders, I know, as S3 doesn't have this concept), but I want them to be saved into a particular folder in my destination bucket.
from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket(bucket_name="destination_bucket")

    objectlist = srcBucket.list()
    for obj in objectlist:
        dstBucket.copy_key(obj.key, srcBucket.name, obj.key)
My srcBucket will contain keys like folder/subFolder/anotherSubFolder/file.txt, which when transferred will land in the dstBucket like so: destination_bucket/folder/subFolder/anotherSubFolder/file.txt
I would like it to end up in destination_bucket/targetFolder so the final directory structure would look like
destination_bucket/targetFolder/folder/subFolder/anotherSubFolder/file.txt
Hopefully I have explained this well enough and it makes sense
The first parameter of copy_key() is the name of the destination key.
Therefore, just use:
dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)
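Putting it together, a hedged sketch of the full transfer with the extra target prefix; the bucket names and targetFolder are the placeholders from the question:

from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket("destination_bucket")

    for obj in srcBucket.list():
        # copy_key(new_key_name, src_bucket_name, src_key_name)
        dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)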
