How to combine multiple files in GCS bucket with Cloud Function trigger - python

I have 3 files per date per name in this format:
'nameXX_date', here's an example:
'nameXX_01-01-20'
'nameXY_01-01-20'
'nameXZ_01-01-20'
where 'name' can be anything, and the date is whatever day the file was uploaded (almost every day).
I need to write a Cloud Function that triggers whenever a new file lands in the bucket and combines the three XX, XY, XZ files into one file named "name_date".
Here's what I've got so far:
from google.cloud import storage as gcs
import logging

bucket_id = 'bucketname'
client = gcs.Client()
bucket = client.get_bucket(bucket_id)
name = ...   # not sure how to get this from the new file
date = ...   # not sure how to get this either
outfile = f'bucketname/{name}_{date}.CSV'
blobs = []
for shard in ('XX', 'XY', 'XZ'):
    sfile = f'{name}{shard}_{date}'
    blob = bucket.blob(sfile)
    if not blob.exists():
        # this causes a retry in 60s
        raise ValueError(f'shard {sfile} not present')
    blobs.append(blob)
bucket.blob(outfile).compose(blobs)
logging.info(f'Successfully created {outfile}')
for blob in blobs:
    blob.delete()
logging.info('Deleted {} blobs'.format(len(blobs)))
The issue I'm facing is that I'm not sure how to get the name and date of the new file that landed in the bucket, so that I can find the other two matching files and combine them.
By the way, I got this code from the article below and I'm trying to adapt it here: https://medium.com/google-cloud/how-to-write-to-a-single-shard-on-google-cloud-storage-efficiently-using-cloud-dataflow-and-cloud-3aeef1732325

As I understand, the cloud function is triggered by a google.storage.object.finalize event on an object in the specific GCS bucket.
In that case your cloud function "signature" looks like (taken from the "medium" article you mentioned):
def compose_shards(data, context):
The data argument is a dictionary with plenty of details about the object (file) that has been finalized. See some details here: Google Cloud Storage Triggers
For example, data["name"] is the name of the object in question.
If you know the pattern/template/rule according to which those objects/shards are named, you can extract the relevant elements from an object/shard name, and use it to compose the target object/file name.
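To make that concrete, here is a minimal sketch (untested, and assuming the shard suffix is always exactly two characters and the name/date are separated by the last underscore, as in the question) of how the finalize event's metadata can drive the compose step from your code:
import logging
from google.cloud import storage as gcs

def compose_shards(data, context):
    """Triggered by google.storage.object.finalize on any of the three shards."""
    bucket = gcs.Client().get_bucket(data["bucket"])

    # e.g. data["name"] == "nameXX_01-01-20"
    prefix, date = data["name"].rsplit("_", 1)
    name = prefix[:-2]  # strip the two-character shard suffix (XX/XY/XZ)

    blobs = []
    for shard in ("XX", "XY", "XZ"):
        blob = bucket.blob(f"{name}{shard}_{date}")
        if not blob.exists():
            # raising makes the (retry-enabled) function try again later
            raise ValueError(f"shard {name}{shard}_{date} not present yet")
        blobs.append(blob)

    outfile = f"{name}_{date}.CSV"
    bucket.blob(outfile).compose(blobs)
    logging.info(f"Successfully created {outfile}")

    for blob in blobs:
        blob.delete()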

Related

How to process multiple CSV files from an Amazon S3 bucket in a lambda function?

In an Amazon S3 bucket, event logs are sent as a CSV file every hour. I would like to perform some brief descriptive analysis on one week's worth of data, every week (i.e. 168 files per week). The point of the analysis is to output a list of trending products for each week. I have a Python script written on my local machine which retrieves the latest 168 files from S3 using boto3 and does all the necessary wrangling etc.
But now I need to put this into a Lambda function. I will set up an EventBridge rule to trigger the Lambda function every Monday. But is it possible to read multiple files in a Lambda function using standard boto3, or do I need to do something special when defining the lambda handler function?
Here is the code from my local machine for getting the 168 files:
# import modules
import boto3
import pandas as pd
from io import StringIO

# set up aws credentials
s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXXXX',
                      aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

# name s3 bucket
my_bucket = s3.Bucket('bucket_with_data')

# get names of last week's files from the s3 bucket
files = []
for file in my_bucket.objects.all():
    files.append(file.key)
files = files[-168:]  # all files from the last 7 days (24 files * 7 days per week)

bucket_name = 'bucket_with_data'
data = []
col_list = ["user_id", "event_type", "event_time", "session_id", "event_properties.product_id"]
for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string))
    final_list = list(set(col_list) & set(temp.columns))
    temp = temp[final_list]
    data.append(temp)

# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)
So, my question is: can I just put this code into a Lambda function and it should work, or do I need to incorporate a lambda_handler function? If I need a lambda_handler function, how would I handle multiple files, given that the Lambda is triggered by a schedule rather than by a single event?
Yes, you need a Lambda function handler, otherwise it's not a Lambda function.
The Lambda runtime is looking for a specific entry point in your code and it will invoke that entry point with a specific set of parameters (event, context).
You can ignore the event and context if you choose to and simply use the boto3 SDK to list objects in a given S3 bucket (assuming your Lambda has permission to do this, via an IAM role), and then perform whatever actions you need to against those objects.
Your code example explicitly supplies boto3 with an access key and secret key, but that is not the best practice in AWS Lambda (or EC2 or other compute environment running on AWS). Instead, configure the Lambda function to assume an IAM role (that provides the necessary permissions).
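As a rough sketch (untested, reusing the bucket name and the 168-file window from the question), the existing script can sit almost unchanged inside a handler once the explicit keys are dropped in favour of the IAM role:
import boto3
import pandas as pd
from io import StringIO

s3 = boto3.resource('s3')      # credentials come from the function's IAM role
client = boto3.client('s3')

def lambda_handler(event, context):
    # the EventBridge schedule event can be ignored; just list and process objects
    my_bucket = s3.Bucket('bucket_with_data')
    keys = [obj.key for obj in my_bucket.objects.all()][-168:]

    frames = []
    for key in keys:
        body = client.get_object(Bucket='bucket_with_data', Key=key)['Body']
        frames.append(pd.read_csv(StringIO(body.read().decode('utf-8'))))

    event_data = pd.concat(frames, ignore_index=True)
    # ... trending-products analysis and output go here ...
    return {'rows': len(event_data)}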
Be aware of the Lambda function timeout, and increase it as necessary if you are going to process a lot of files. Increasing the RAM size is also potentially a good idea as it will give you correspondingly more CPU and network bandwidth.
If you're going to process a very large number of files then consider using S3 Batch.

How do you fetch files from Google cloud storage bucket after a certain file is fetched using python?

Suppose in my Google Cloud Storage bucket there are approximately 10k files, and while fetching them using Python I set max_results=100. I save the timestamp and the name of the last file using blob.updated and blob.name.
How do I make sure that the next time my Python program runs, it fetches the files after the 100th file (which has already been saved)? So basically, fetching the files after max_results=100, i.e. starting from the 101st.
I have gone through the documentation and I couldn't find anything relevant to what I want to do. I am also aware that the max_results parameter only limits how many results are returned, in my case 100.
here is the code:
storage_client = storage.Client()
bucket_name = 'json_file.json'
bucket = storage_client.get_bucket(bucket_name)

blobs = bucket.list_blobs(max_results=100)
last_file_timestamp = list()
name_list = list()
for blob in blobs:
    name_list.append(blob.name)
    last_file_timestamp.append(blob.updated)
print(name_list)
print(last_file_timestamp)
To put it in simple terms: how do I ensure that the second time my Python script is executed, it fetches the files from the bucket after the first 100 files? Is there a way? Please help.
When you query a Google API, you get a set of results and a next page token if there are more results. In that case, use this token to request the next page from the API.
Here is an example based on your code:
storage_client = storage.Client()
bucket_name = 'json_file.json'
bucket = storage_client.get_bucket(bucket_name)

# First 100 results
blobs = bucket.list_blobs(max_results=100)
for blob in blobs:
    print(blob.name)

# Next 100 results
blobs = bucket.list_blobs(page_token=blobs.next_page_token, max_results=100)
for blob in blobs:
    print(blob.name)
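Not from the original answer, but to carry this across separate runs of the script, one simple option is to persist the page token between executions. A sketch, assuming a local file is acceptable ("page_token.txt" is hypothetical; any datastore would do):
import os
from google.cloud import storage

TOKEN_FILE = "page_token.txt"  # hypothetical location for the saved token

def fetch_next_page(bucket_name, page_size=100):
    bucket = storage.Client().get_bucket(bucket_name)

    # load the token saved by the previous run, if any
    token = None
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            token = f.read().strip() or None

    blobs = bucket.list_blobs(max_results=page_size, page_token=token)
    names = [blob.name for blob in blobs]  # consuming the iterator sets next_page_token

    # save the token for the next run (empty string when there are no more pages)
    with open(TOKEN_FILE, "w") as f:
        f.write(blobs.next_page_token or "")

    return names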

Trying to move data from one Azure Blob Storage to another using a Python script

I have data in a zipped format in container A that I need to transform using a Python script, and I am trying to schedule this to run within Azure. But when writing the output to a new storage container (container B), it simply outputs a CSV with the name of the file inside rather than the data.
I've followed the tutorial on the Microsoft site exactly, but I can't get it to work. What am I missing?
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
file_n='iris.csv'
# Load iris dataset from the task node
df = pd.read_csv(file_n)
# Subset records
df = df[df['Species'] == "setosa"]
# Save the subset of the iris dataframe locally in task node
df.to_csv("iris_setosa.csv", index = False, encoding="utf-8")
# Upload iris dataset
blobService.create_blob_from_text(containerName, "iris_setosa.csv", "iris_setosa.csv")
Specifically, the final line seems to be giving me a CSV called "iris_setosa.csv" whose contents are just the string "iris_setosa.csv" in cell A1, rather than the actual data that was read in.
Update:
replace create_blob_from_text with create_blob_from_path.
create_blob_from_text creates a new blob from str/unicode, or updates the content of an existing blob. That is why you find the text "iris_setosa.csv" as the content of the new blob.
create_blob_from_path creates a new blob from a file path, or updates the content of an existing blob. It is what you want.
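For reference, the corrected upload line from the question would then look something like this (keeping the same variable names):
# upload the local file itself, not the literal string "iris_setosa.csv"
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")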
This workaround uses copy_blob and delete_blob to move an Azure blob from one container to another.
from azure.storage.blob import BlobService

def copy_azure_files(self):
    blob_service = BlobService(account_name='account_name', account_key='account_key')
    blob_name = 'iris_setosa.csv'
    copy_from_container = 'test-container'
    copy_to_container = 'demo-container'

    blob_url = blob_service.make_blob_url(copy_from_container, blob_name)
    # blob_url: https://demostorage.blob.core.windows.net/test-container/iris_setosa.csv
    blob_service.copy_blob(copy_to_container, blob_name, blob_url)

    # to move (rather than just copy) the file, also delete the source blob
    blob_service.delete_blob(copy_from_container, blob_name)

Download 1 day's Azure blob files with Python

Requirement:
Files are being uploaded into an Azure container from various machines. I need to write a Python script, scheduled daily, that downloads one day's worth of files from the container.
Code:
import datetime
import os

import pytz
from azure.storage.blob import BlobClient, ContainerClient

utc = pytz.UTC

container_connection_string = "CONNECTION_STRING"
container_service_client = ContainerClient.from_connection_string(
    conn_str=container_connection_string, container_name="CONTAINER_NAME")

start_time = datetime.datetime.now()  # assumed: start_time was not defined in the original snippet
date_folder = start_time.strftime("%d-%m-%Y")
base_path = r"DOWNLOAD_PATH"
count = 0
threshold_time = utc.localize(start_time - datetime.timedelta(days=1))

blob_list = container_service_client.list_blobs()

if not os.path.exists("{}\\{}".format(base_path, date_folder)):
    os.makedirs("{}\\{}".format(base_path, date_folder))

print("Starting")
for ind, blob in enumerate(blob_list):
    if threshold_time < blob.last_modified:
        count += 1
        print(count, blob.name)
        blob_name = blob.name
        blob = BlobClient.from_connection_string(
            conn_str=container_connection_string,
            container_name="CONTAINER_NAME",
            blob_name=blob_name)
        with open("{}\\{}\\{}".format(base_path, date_folder, blob_name), "wb") as my_blob:
            blob_data = blob.download_blob()
            blob_data.readinto(my_blob)
Problem:
The above script iterates through every blob in the container, checks whether it is less than one day old, and downloads it if so. Since 15,000+ files are uploaded daily, traversing all of them to identify today's files is very time consuming, and downloading the blobs also takes a lot of time.
With the current approach, I believe there's no other way than to enumerate the blobs and filter on the client side to find the matching blobs.
However, I do have an alternate solution. It's a bit of a convoluted solution, but I thought I would propose it nonetheless :).
Essentially, the solution involves making use of Azure Event Grid and invoking an Azure Function on the Microsoft.Storage.BlobCreated event, which fires when a blob is created or replaced. This Azure Function copies the blob to a different blob container. A new blob container is created for each day, holding only that day's blobs, which makes iterating over them much easier.
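A very rough sketch of what that Azure Function could look like (untested; it assumes an Event Grid trigger, the azure-functions and azure-storage-blob packages, and that the source blob is readable by the copy operation, e.g. within the same storage account):
import datetime

import azure.functions as func
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "STORAGE_CONNECTION_STRING"  # hypothetical app setting

def main(event: func.EventGridEvent):
    # the BlobCreated event payload carries the URL of the new blob
    blob_url = event.get_json()["url"]
    blob_name = blob_url.rsplit("/", 1)[-1]

    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)

    # one container per day, e.g. "daily-01-01-2021", so the nightly download
    # job can list just that container instead of scanning everything
    container_name = "daily-" + datetime.datetime.utcnow().strftime("%d-%m-%Y")
    try:
        service.create_container(container_name)
    except Exception:
        pass  # container already exists

    # server-side copy of the new blob into the daily container
    dest = service.get_blob_client(container=container_name, blob=blob_name)
    dest.start_copy_from_url(blob_url)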

Reading Data From Cloud Storage Via Cloud Functions

I am trying to do a quick proof of concept for building a data processing pipeline in Python. To do this, I want to build a Google Cloud Function which will be triggered when certain .csv files are dropped into Cloud Storage.
I followed along with this Cloud Functions Python tutorial, and while the sample code does trigger the Function to create some simple logs when a file is dropped, I am really stuck on what call I have to make to actually read the contents of the file. I tried to search for an SDK/API guidance document but I have not been able to find one.
In case this is relevant, once I process the .csv, I want to be able to add some data that I extract from it into GCP's Pub/Sub.
The function does not actually receive the contents of the file, just some metadata about it.
You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.
Putting that together with the tutorial you're using, you get a function like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    contents = blob.download_as_string()
    # Process the file contents, etc...
This is an alternative solution using pandas:
Cloud Function code:
import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName
    # note: reading a gs:// path with pandas requires the gcsfs package to be installed
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)
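Neither answer covers the Pub/Sub part of the question, but once the DataFrame is in hand, publishing the extracted data is a separate, small step. A sketch, assuming the google-cloud-pubsub client library and a hypothetical project/topic:
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical names

def publish_rows(dataFrame):
    # publish one message per row; Pub/Sub message data must be bytes
    for row in dataFrame.to_dict(orient="records"):
        publisher.publish(topic_path, data=str(row).encode("utf-8"))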
