I currently have a Cloud Function that executes some asynchronous code. It makes a GET request to an endpoint to retrieve some data and then stores that data in Cloud Storage. The Cloud Function is triggered by Cloud Scheduler via HTTP. When I use the Cloud Function's built-in test option, everything works fine, but when Cloud Scheduler invokes the Cloud Function, it gets invoked more than once. I can tell from the logs, which show multiple execution IDs and the print statements I have in place. Does anyone know why Cloud Scheduler is invoking the function more than once? I have Max Retry Attempts set to 0. There is a portion of my code where I use asyncio's create_task and sleep to space the tasks out in the event loop and slow down the rate of requests, and I was wondering whether this is causing Cloud Scheduler to do something unexpected:
async with aiohttp.ClientSession(headers=headers) as session:
    tasks = []
    for i in range(1, total_pages + 1):
        # Schedule each page request as a task, then pause to throttle the request rate
        tasks.append(asyncio.create_task(self.get_tasks(session=session, page=i)))
        await asyncio.sleep(delay_per_request)
    # Wait for every page request to finish before returning
    await asyncio.gather(*tasks)
For my particular case, the Cloud Function performed as expected when tested natively (using the built-in test option), but when I set up Cloud Scheduler to trigger it via HTTP, it unexpectedly ran more than once. As EdoAkse mentioned in the original thread, my Cloud Scheduler event was firing more than once. My solution was to set up a Pub/Sub topic that the Cloud Function subscribes to, so the topic triggers the Cloud Function, and have Cloud Scheduler publish to that topic. This is essentially how Google describes it in their docs.
Cloud Scheduler -> Pub/Sub Trigger -> Cloud Function
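As a minimal sketch of the Pub/Sub-triggered side (the function name and payload handling are placeholders, and the real function would run the fetch-and-store logic), a background Cloud Function subscribed to the topic looks roughly like this:
import base64

def handle_scheduler_tick(event, context):
    # Pub/Sub-triggered (background) Cloud Function; Cloud Scheduler publishes to the topic.
    # The message body arrives base64-encoded in event["data"].
    payload = base64.b64decode(event["data"]).decode("utf-8") if event.get("data") else ""
    print(f"Invoked once per message: event_id={context.event_id}, payload={payload!r}")
    # ...kick off the async fetch-and-store logic here...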
I observed a behavior where Cloud Functions were being invoked twice by Cloud Scheduler. Despite the functions being located in eu-west1, a duplicate scheduler entry existed in us-central1 for each scheduled function. Removing the duplicated jobs in us-central1 resolved my issue.
I am very new to GCP. My plan is to create a webhook target on GCP that listens for events from a third-party application, then kicks off scripts to download files from the webhook event and push them to JIRA/GitHub. During my research I read a lot about Cloud Functions, but there were also Cloud Run, App Engine, and Pub/Sub. Any suggestions on which path to follow?
Thanks!
There are use cases in which Cloud Functions, Cloud Run, and App Engine can be used interchangeably (not Pub/Sub, as it is a messaging service). There are, however, use cases that do not fit some of them well.
Cloud Functions must be triggered, and each execution is (or should be) isolated, which means you cannot expect them to keep a connection alive to your third party. Executions also have a limited duration. They tend to be atomic, so if you have complex logic spread across several of them you must be careful in your design, otherwise you will end up with a distributed solution that is very difficult to manage.
App Engine is an application you deploy and it stays permanently active, so you can maintain a connection to your third-party app.
Cloud Run is somewhere in the middle: it is triggered when it is used, but it can share a context between requests, and different requests benefit from that (keeping connections alive temporarily or caching, for instance). It also gives you more flexibility in the technologies you can use.
Pub/Sub, as mentioned, is a service to which you send information (fire and forget) and that allows you to have one or more listeners on the other side, which may be your Cloud Function, App Engine app, or Cloud Run service, to process the information and proceed.
BTW, consider using Cloud Storage for your files, especially if you expect them to persist between different service calls.
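For illustration only (the project, topic, and bucket names below are made up), an HTTP-triggered handler on Cloud Functions or Cloud Run could hand the event off to Pub/Sub and park the payload in Cloud Storage along these lines:
from google.cloud import pubsub_v1, storage

def handle_webhook(request):
    # Flask-style request object, as received by an HTTP-triggered Cloud Function
    payload = request.get_data()

    # Fire and forget: publish the raw event for downstream processing
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "webhook-events")  # assumed names
    publisher.publish(topic_path, payload)

    # Keep the payload in Cloud Storage so later steps (JIRA/GitHub push) can pick it up
    bucket = storage.Client().bucket("my-webhook-files")  # assumed bucket
    bucket.blob("events/latest.json").upload_from_string(payload)

    return "ok", 200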
I'm having trouble understanding how to use Google Cloud Tasks to replace Celery...
Currently, when you hit the endpoint api/v1/longtask, Celery kicks off the task asynchronously and the endpoint returns 200 immediately. The task runs, updates, and ends.
So with Cloud Tasks, I would call api/v1/tasks to enqueue the specific task and return 200. But the API endpoint api/v1/longtask will now time out, since the task takes 1 hour.
So do I need to adjust the timeout on the endpoint?
At this point I think a better solution is Cloud Functions, but I would like to learn what the other side of the task looks like, as the documentation only shows calling a URL. It never shows the long-task API endpoint, which in my experience times out at 60 seconds.
Thank you,
My Google Cloud Function uses a weather API to return the weather for any input location. I am currently using an HTTP trigger. Instead of calling the HTTP trigger manually to check whether the temperature is below 0°C, how can I make the Cloud Function notify me whenever the temperature drops below that threshold?
Cloud Functions are for relatively short-lived processes that are spun up in response to a supported event. Unless your weather API is a supported Cloud Functions event type, there is no way to use Cloud Functions to listen to the API until the temperature drops below zero.
The closest you can get is to run a Cloud Function periodically, or use a Cloud Scheduler job to periodically trigger a Cloud Function you write that calls the weather API, checks the temperature, and performs whatever logic you want in response. And if the weather API supports calling webhooks on temperature changes, you could also use those webhooks to call into Cloud Functions.
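A rough sketch of that periodic check, assuming a placeholder weather endpoint, response field, and notification step:
import requests

def check_temperature(request):
    # HTTP-triggered function, invoked by a Cloud Scheduler job (e.g. every hour)
    resp = requests.get(
        "https://api.example-weather.com/current",            # placeholder endpoint
        params={"location": "Berlin", "key": "MY_API_KEY"},   # placeholder params
        timeout=10,
    )
    temp_c = resp.json()["temperature_c"]                     # assumed response field

    if temp_c < 0:
        # Replace with your real notification channel (email, SMS, Pub/Sub, ...)
        print(f"ALERT: temperature dropped below freezing: {temp_c} °C")

    return "ok", 200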
I am fairly new to GCP.
I have some items in a cloud storage bucket.
I have written some Python code to access this bucket and perform update operations.
I want to make sure that whenever the Python code is triggered, it has exclusive access to the bucket so that I do not run into some sort of race condition.
For example, if I put the Python code in a Cloud Function and trigger it, I want to make sure it completes before another trigger occurs. Is this handled automatically, or do I have to do something to prevent it? If I have to add something like a semaphore, will subsequent triggers happen automatically after the semaphore is released?
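For illustration (the bucket and object names below are made up), one common way to guard a read-modify-write in Cloud Storage against concurrent invocations is a generation-match precondition, so a second writer fails instead of silently overwriting:
from google.cloud import storage
from google.api_core import exceptions

def update_object_safely(bucket_name="my-bucket", object_name="state.json"):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)

    # Fetch the current metadata (including the generation number), then the contents
    blob.reload()
    generation = blob.generation
    data = blob.download_as_text()

    new_data = data + "\nupdated"  # placeholder for the real update logic

    try:
        # The upload only succeeds if nobody else wrote the object in the meantime
        blob.upload_from_string(new_data, if_generation_match=generation)
    except exceptions.PreconditionFailed:
        # Another invocation updated the object first; retrying or backing off is up to the caller
        print("Concurrent update detected")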
Google Cloud Scheduler is a fully managed cron job scheduling service available in GCP. It is basically cron jobs that trigger at a given time. All you need to do is specify the frequency (the time at which the job needs to be triggered) and the target (HTTP, Pub/Sub, App Engine HTTP), and you can also specify a retry configuration such as max retry attempts, max retry duration, etc.
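As a hedged sketch of that configuration using the Python client (project, region, job name, and target URL below are placeholders), a job with an HTTP target and zero retries could be created like this:
from google.cloud import scheduler_v1

def create_scheduler_job():
    client = scheduler_v1.CloudSchedulerClient()
    parent = client.common_location_path("my-project", "us-central1")  # placeholders

    job = scheduler_v1.Job(
        name=f"{parent}/jobs/daily-fetch",    # placeholder job name
        schedule="0 3 * * *",                 # frequency: every day at 03:00
        time_zone="Etc/UTC",
        http_target=scheduler_v1.HttpTarget(
            uri="https://REGION-PROJECT.cloudfunctions.net/my-function",  # placeholder target
            http_method=scheduler_v1.HttpMethod.POST,
        ),
        retry_config=scheduler_v1.RetryConfig(retry_count=0),  # max retry attempts
    )
    return client.create_job(parent=parent, job=job)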
App Engine has a built-in cron service that allows you to write a simple cron.yaml containing the time at which you want the job to run and which endpoint it should hit. App Engine will ensure that the cron is executed at the time you have specified. Here's a sample cron.yaml that hits the /tasks/summary endpoint of an App Engine deployment every 24 hours.
cron:
- description: "daily summary job"
  url: /tasks/summary
  schedule: every 24 hours
All of the info supplied has been helpful. The best answer has been to use a max-concurrent-dispatches setting of 1 so that only one task is dispatched at a time.
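A minimal sketch of that queue setting with the Cloud Tasks Python client (the project, location, and queue names are placeholders):
from google.cloud import tasks_v2

def limit_queue_to_one_dispatch():
    client = tasks_v2.CloudTasksClient()
    queue = tasks_v2.Queue(
        name=client.queue_path("my-project", "us-central1", "long-task-queue"),  # placeholders
        rate_limits=tasks_v2.RateLimits(max_concurrent_dispatches=1),
    )
    # Updates the existing queue so only one task is dispatched at a time
    return client.update_queue(queue=queue)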
We are using Pub/Sub and Cloud Functions in GCP to orchestrate our data workflow.
Our workflow is something like this:
[workflow diagram: workflow_gcp]
pubsub1 and pubsub3 can be triggered at different times (e.g. 1am and 4am). They are triggered daily, from an external source (our ETL, Talend).
Our Cloud Functions basically execute SQL in BigQuery.
This works well, but we had to manually create an orchestration database to log when functions start and end (to answer the question "did function X execute OK?"). And the orchestration logic is strongly coupled with our business logic, since each Cloud Function must know which functions have to be executed before it and which Pub/Sub topic to trigger after.
So we're looking for a solution that separates the orchestration logic from the business logic.
I found that Composer (Airflow) could be a solution, but:
it can't run Cloud Functions natively (and via the API it's very limited: 16 calls per 100 seconds per project)
we could use BigQuery inside Airflow with the BigQuery operators, but then the orchestration and business logic would be strongly coupled again
So what is the best practice in our case?
Thanks for your help
You can use Cloud Composer (Airflow) and still reuse most of your existing setup.
Firstly, you can keep all your existing Cloud Functions and use HTTP triggers (or others you prefer) to trigger them from Airflow. The only change you will need to make is to implement a Pub/Sub sensor in Airflow so that it triggers your Cloud Functions (thereby ensuring you control the orchestration end to end).
Your solution will be an Airflow DAG that triggers the Cloud Functions based on the Pub/Sub messages, reports back to Airflow whether the functions were successful, and then, if both were successful, triggers the third Cloud Function with an HTTP trigger or similar, just the same.
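As a rough sketch (DAG name, project, subscription names, and Airflow connection IDs below are hypothetical, and exact import paths depend on your provider package versions), such a DAG could look like this:
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="bq_workflow",              # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,            # driven by the sensors, not a cron schedule
    catchup=False,
) as dag:
    # Wait for the messages published by the external ETL (subscription names assumed)
    wait_pubsub1 = PubSubPullSensor(
        task_id="wait_pubsub1", project_id="my-project",
        subscription="pubsub1-sub", ack_messages=True,
    )
    wait_pubsub3 = PubSubPullSensor(
        task_id="wait_pubsub3", project_id="my-project",
        subscription="pubsub3-sub", ack_messages=True,
    )
    # Call the existing Cloud Functions through their HTTP triggers
    run_fn1 = SimpleHttpOperator(
        task_id="run_fn1", http_conn_id="cf_one", endpoint="/", method="POST",
    )
    run_fn3 = SimpleHttpOperator(
        task_id="run_fn3", http_conn_id="cf_three", endpoint="/", method="POST",
    )
    # The final function only runs if both upstream functions succeeded
    run_final = SimpleHttpOperator(
        task_id="run_final", http_conn_id="cf_final", endpoint="/", method="POST",
    )
    wait_pubsub1 >> run_fn1
    wait_pubsub3 >> run_fn3
    [run_fn1, run_fn3] >> run_final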
A final note, which is not immediately intuitive: Airflow is not meant to run the jobs itself, it is meant to orchestrate and manage dependencies. Having Cloud Functions triggered by Airflow is not an anti-pattern, it is actually a best practice.
In your case, you could absolutely rewrite a few things and use the BigQuery operators, since you don't do any processing, just the triggering of queries/jobs, but the concept stays the same: the best practice is leveraging Airflow to make sure things happen when, and in the order, you need, not to process those things itself. (Hope that makes sense.)
As an alternative to Airflow, I would look at Argo Workflows: https://github.com/argoproj/argo
It doesn't have the cost overhead that Composer has, especially for smaller workloads.
I would have:
Created a deployment that reads the Pub/Sub messages from the external tool, and deployed it to Kubernetes.
Executed a workflow based on the message. Each step in the workflow could be a Cloud Function packaged in Docker.
(I would replace the Cloud Function with a Kubernetes job, which is then triggered by the workflow.)
It is pretty straightforward to package a Cloud Function with Docker and run it in Kubernetes.
There are prebuilt Docker images with gsutil/bq/gcloud, so you could create bash scripts that use the "bq" command line to execute things inside BigQuery.