I have a simple Python script, and I would like to run thousands of instances of it on GCP at the same time. The script is triggered by the $Universe scheduler with something like "python main.py --date '2022_01'".
What architecture and technologies should I use to achieve this?
PS: I cannot drop $Universe, but I'm not against suggestions to use other technologies.
My solution:
I already have a $Universe server running all the time.
Create a Pub/Sub topic
Create a permanent Compute Engine instance that listens to Pub/Sub all the time
$Universe sends thousands of events to Pub/Sub (a publishing sketch follows this list)
The Compute Engine instance triggers the creation of a Python Docker image on another Compute Engine instance
Scale the creation of the Docker images (I don't know how to do this)
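To illustrate the publishing step, I mean something roughly like this, assuming the google-cloud-pubsub client library (the project and topic names are just examples):

```python
# Sketch of the publishing step: $Universe publishes one message per run.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "script-runs")  # example names

# One message per script instance; the payload carries the CLI arguments.
future = publisher.publish(
    topic_path,
    json.dumps({"date": "2022_01"}).encode("utf-8"),
)
future.result()  # block until Pub/Sub has accepted the message
```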
Is this a good architecture?
How do I scale this kind of process?
Thank you :)
It can be very difficult to discuss architecture and design questions, as they are usually heavily dependent on the context, scope, functional and non-functional requirements, cost, available skills and knowledge, and so on...
Personally, I would prefer to stay with an entirely serverless approach if possible.
For example, use Cloud Scheduler (serverless cron jobs) to send messages to a Pub/Sub topic, with a Cloud Function (or something else) on the other side that is triggered by each message.
Whether it should be a Cloud Function or something else, and what exactly it should do and how, depends on your case.
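For instance, a minimal Pub/Sub-triggered Cloud Function (1st-gen Python background-function signature) could look roughly like this; run_job() is just a placeholder for the existing main.py logic:

```python
import base64
import json

def run_job(date: str) -> None:
    # Placeholder for whatever main.py currently does for one date.
    print(f"Processing {date}")

def handle_message(event, context):
    """Background Cloud Function triggered by a message on the Pub/Sub topic.
    The message data arrives base64-encoded."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    run_job(payload["date"])
```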
As I understand it, you will have a lot of simultaneous calls to custom Python code, triggered by an orchestrator ($Universe), and you want to run it on the GCP platform.
Like @al-dann, I would go with a serverless approach in order to reduce cost.
As I also understand it, Pub/Sub does not seem necessary: you could easily trigger the function from an HTTP call and avoid Pub/Sub.
Pub/Sub is only necessary to get some delivery guarantee (at-least-once processing), but you can get the same behaviour if $Universe validates the HTTP request for every call (check the HTTP response code and body, and retry if they don't match the expected result).
If you want exactly-once processing, you will need more tooling; you are getting close to event streaming (which, as I understand it, could be a good fit here). In that case, on a fully GCP stack, I would go with Pub/Sub and Dataflow, which can guarantee exactly-once processing, or with Kafka and Kafka Streams or Flink.
If at-least-once processing is fine for you, I would go with the HTTP version, which I think will be simpler to maintain. You have 3 serverless options for that case:
App Engine standard: scales to 0 and you pay for CPU usage; it can be more affordable than the functions below if requests are constrained to short periods (a few hours per day, since the same hardware will process many requests)
Cloud Functions: you pay per request (plus CPU, memory, network, ...) and don't have to think about anything other than code, but the executed code is constrained to a proprietary solution.
Cloud Run: my preferred option, since the pricing is the same as Cloud Functions but you gain portability; the application is a simple Docker image that you can move easily (to Kubernetes, Compute Engine, ...) and you can change the execution engine depending on cost (if the load changes between the study and the real world). A minimal sketch follows below.
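For the Cloud Run option, the script would be wrapped in a small HTTP service; Flask, the /run route and the process_date() helper below are illustrative choices, not a prescribed layout, and a non-2xx response gives $Universe something to retry on:

```python
# main.py - minimal HTTP wrapper around the existing script for Cloud Run.
import os

from flask import Flask, request

app = Flask(__name__)

def process_date(date: str) -> None:
    # Placeholder for the existing "python main.py --date ..." logic.
    print(f"Processing {date}")

@app.route("/run", methods=["POST"])
def run():
    payload = request.get_json(silent=True) or {}
    date = payload.get("date")
    if not date:
        # Non-2xx response so the caller knows to retry or fix the request.
        return ("missing 'date'", 400)
    process_date(date)
    return ("ok", 200)

if __name__ == "__main__":
    # Cloud Run injects the listening port via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```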
I have a project in which the backend is written in FastAPI and the frontend uses React. My goal is to add a component that monitors some of the PC/server's performance metrics in real time. Right now the project is still under development in a local environment, so basically I'll need to fetch the CPU/GPU usage and RAM (from my PC) in Python and then send them to my React app. My question is: what is the cheapest way to accomplish this? Is setting up an API and issuing a GET request every ten seconds a good approach, or are there better ones?
Explanation & Tips:
I know exactly what you're describing. I also made a mobile app using Flutter and Python, and I have been trying to get multiple servers to host the API instead of one. I personally think Node.js is worth checking out, since it allows clustering, which is extremely powerful.

If you want to stick with Python, the best way to get memory usage is psutil, like this: memory = psutil.virtual_memory().percent. For the CPU usage you would have to do some sort of caching or multithreading, because you cannot get the CPU usage without a delay: cpu = psutil.cpu_percent(interval=1).

If you want your API to be fast, then the periodic approach is bad: it will slow down your server, and if you do anything wrong on the client side you could end up DDoSing your own API, which is an embarrassing thing I did when I first published my app. The best approach is to only call the API when it is needed. For example, Flutter has cached widgets, which were very useful because I only had to fetch that piece of data once every few hours.
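A rough sketch of the psutil + caching idea for a FastAPI backend (the /metrics route name and the one-second polling loop are just illustrative choices):

```python
import threading

import psutil
from fastapi import FastAPI

app = FastAPI()
_cached_cpu = 0.0

def _poll_cpu() -> None:
    # psutil.cpu_percent(interval=1) blocks for a second, so keep it off the
    # request path and cache the latest reading from a background thread.
    global _cached_cpu
    while True:
        _cached_cpu = psutil.cpu_percent(interval=1)

threading.Thread(target=_poll_cpu, daemon=True).start()

@app.get("/metrics")
def metrics():
    # Memory is cheap to read on demand; CPU comes from the cached value.
    return {
        "cpu_percent": _cached_cpu,
        "memory_percent": psutil.virtual_memory().percent,
    }
```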
Key Points:
- Only call the API when it is crucial to do so.
- Python cannot get the CPU usage instantaneously; psutil needs a short sampling interval.
- Node performed better than my Flask API (not FastAPI).
- Use client-side caching if possible.
We are using Pub/Sub and Cloud Functions in GCP to orchestrate our data workflow.
Our workflow is something like this:
[workflow_gcp diagram]
pubsub1 and pubsub3 can be triggered at different times (e.g. 1am and 4am). They are triggered daily from an external source (our ETL tool, Talend).
Our Cloud Functions basically execute SQL in BigQuery.
This is working well, but we had to manually create an orchestration database to log when functions start and end (to answer the question "did function X execute OK?"). And the orchestration logic is strongly coupled with our business logic, since each Cloud Function must know which functions have to be executed before it and which Pub/Sub topic to trigger afterwards.
So we're looking for a solution that separates the orchestration logic from the business logic.
I found that Composer (Airflow) could be a solution, but:
it can't run Cloud Functions natively (and calling them via the API is very limited: 16 calls per 100 seconds per project)
we could use BigQuery inside Airflow with the BigQuery operators, but orchestration and business logic would be strongly coupled again
So what is the best practice in our case?
Thanks for your help
You can use Cloud Composer (Airflow) and still reuse most of your existing setup.
Firstly, you can keep all your existing Cloud Functions and use HTTP triggers (or others you prefer) to trigger them from Airflow. The only change you will need to make is to implement a Pub/Sub sensor in Airflow, so that it triggers your Cloud Functions (thereby ensuring you control orchestration from end to end of your process).
Your solution would then be an Airflow DAG that triggers the Cloud Functions based on the Pub/Sub messages, reports back to Airflow whether the functions were successful, and then, if both were successful, triggers the third Cloud Function with an HTTP trigger or similar, just the same.
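A rough sketch of such a DAG, assuming the Google and HTTP Airflow provider packages are installed (the subscription name, connection ID and endpoints below are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="bq_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for the message your ETL publishes, instead of triggering the
    # first function from Pub/Sub directly.
    wait_for_pubsub1 = PubSubPullSensor(
        task_id="wait_for_pubsub1",
        project_id="my-project",
        subscription="pubsub1-sub",
        ack_messages=True,
    )

    # Call the existing Cloud Functions over HTTP; a failed call fails the task,
    # so Airflow records whether each function executed OK.
    trigger_function1 = SimpleHttpOperator(
        task_id="trigger_function1",
        http_conn_id="cloud_functions",  # base URL of your functions
        endpoint="function1",
        method="POST",
    )
    trigger_function2 = SimpleHttpOperator(
        task_id="trigger_function2",
        http_conn_id="cloud_functions",
        endpoint="function2",
        method="POST",
    )

    # The orchestration (ordering, dependencies) lives here, not in the functions.
    wait_for_pubsub1 >> trigger_function1 >> trigger_function2
```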
A final note, which is not immediately intuitive: Airflow is not meant to run the jobs itself; it is meant to orchestrate and manage dependencies. The fact that you use Cloud Functions triggered by Airflow is not an anti-pattern, it is actually a best practice.
In your case, you could absolutely rewrite a few things and use the BigQuery operators, since you don't do any processing, just trigger queries/jobs, but the concept stays the same: the best practice is leveraging Airflow to make sure things happen when and in the order you need, not to process those things itself. (Hope that made sense.)
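If you did go the BigQuery-operator route, a task inside the DAG above could look roughly like this (operator from the Google provider package; the task_id and SQL are placeholders):

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

run_query1 = BigQueryInsertJobOperator(
    task_id="run_query1",
    configuration={
        "query": {
            "query": "SELECT ...",  # your existing SQL, moved out of the Cloud Function
            "useLegacySql": False,
        }
    },
)
```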
As an alternative to Airflow, I would have looked at Argo Workflows -> https://github.com/argoproj/argo
It doesn't have the cost overhead that Composer has, especially for smaller workloads.
I would have:
Created a deployment that reads the Pub/Sub messages from the external tool, and deployed it to Kubernetes.
Executed a workflow based on the message. Each step in the workflow could be a Cloud Function, packaged in Docker.
(I would have replaced the Cloud Function with a Kubernetes job, which is then triggered by the workflow.)
It is pretty straightforward to package a Cloud Function with Docker and run it in Kubernetes.
There are prebuilt Docker images with gsutil/bq/gcloud, so you could create bash scripts that use the "bq" command line tool to execute things inside BigQuery (a sketch of such a step follows).
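Since this thread is Python-centric, here is the same idea as a small Python entrypoint rather than a bash script, assuming the container image already has the bq CLI installed (the file name and argument handling are just illustrative):

```python
# entrypoint.py - hypothetical entrypoint for one workflow step.
import subprocess
import sys

def run_bq(sql: str) -> None:
    # Shell out to the "bq" CLI to run a standard-SQL query in BigQuery.
    subprocess.run(
        ["bq", "query", "--use_legacy_sql=false", sql],
        check=True,  # a non-zero exit makes the workflow step fail visibly
    )

if __name__ == "__main__":
    # The workflow passes the SQL statement as the first argument.
    run_bq(sys.argv[1])
```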
I am interested in implementing a compute service in the cloud for an application I'm working on. The idea is that there are 3 modules in the service: a compute manager that receives requests (with input data) and triggers Azure Function computes (the computes are the 2nd 'module'). Both modules share the same blob storage for the scripts to be run and for the input/output data (JSON) of the compute.
I want to draw up a basic diagram but need to understand a few things first. Is what I described above possible, or must Azure Functions have their own separate storage? Can Azure Functions have concurrent executions of the same script with different data?
I'm new to Azure, so what I've been learning about Azure Functions hasn't yet answered my questions. I'm also unsure how to minimise cost. The functions won't run often.
I hope someone could shed some light on this for me :)
Thanks
In fact, Azure Functions itself has many kinds of triggers, for example an HTTP trigger, a Storage trigger, or a Service Bus trigger.
So I think you can use it without your compute manager if one of the built-in triggers meets your requirements.
At the same time, all functions can share the same storage account; you just need to use the correct storage account connection string.
Finally, as your functions will not run often, I suggest you use the Azure Functions Consumption plan. When you're using the Consumption plan, instances of the Azure Functions host are dynamically added and removed based on the number of incoming events.
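As a rough sketch, an HTTP-triggered Python function can read the scripts and input data from the storage account it shares with the compute manager; the container/blob names and the "SHARED_STORAGE_CONNECTION" app setting below are illustrative, and the bindings configuration (function.json) is omitted:

```python
import json
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()  # input data posted by the compute manager

    # Same storage account as the compute manager: only the connection string
    # (kept as an app setting) has to point at it.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["SHARED_STORAGE_CONNECTION"])
    script = (blob_service
              .get_blob_client(container="scripts", blob=payload["script_name"])
              .download_blob()
              .readall())

    # ... run the script against payload["input"] and write the JSON output
    # back to the shared storage account ...
    return func.HttpResponse(json.dumps({"status": "accepted"}),
                             mimetype="application/json")
```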
I'm a bit lost in the jungle of documentation, offers, and services. I'm wondering what the infrastructure should look like, and it would be very helpful to get a nudge in the right direction.
We have a Python script with PyTorch that runs a prediction. The script has to be triggered by an HTTP request. Preferably, the samples to run the prediction on should also come from the same requester. It has to return the prediction as fast as possible.
What is the best / easiest / fastest way of doing this?
We have the script sitting in a Container Registry for now. Can we use that? Azure Kubernetes Service? Azure Container Instances (is this fast enough)?
And as for the trigger, should we use an Azure Function or a Logic App?
Thank you!
Azure Functions V2 has just launched a private preview for writing Functions using Python. You can find some instructions for how to play around with it here. This would probably be one of the most simple ways to execute this script with an HTTP request. Note that since it is in private preview, I would hesitate to recommend using it in a production scenario.
Another caveat to note with Azure Functions is that there will be a cold start whenever a new instance of your function application is created. This should be on the order of ~2-4 seconds, and should only happen on the first request after the application has not seen much traffic for a while, or when a new instance is created to scale your application up to receive more traffic. You can avoid this cold start by running your function on a dedicated App Service plan, but at that point you lose a lot of the benefits of Azure Functions.
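Once Python support is available to you, the prediction function could look roughly like this: load the model once at import time so that warm invocations skip the expensive part and only the first request on a new instance pays the cold-start plus model-load cost (the model file name and input format below are placeholders):

```python
import json

import azure.functions as func
import torch

# Loaded once per function-host instance and reused across invocations.
MODEL = torch.jit.load("model.pt")
MODEL.eval()

def main(req: func.HttpRequest) -> func.HttpResponse:
    samples = req.get_json()["samples"]  # samples sent by the requester
    with torch.no_grad():
        prediction = MODEL(torch.tensor(samples, dtype=torch.float32))
    return func.HttpResponse(json.dumps({"prediction": prediction.tolist()}),
                             mimetype="application/json")
```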
I'm trying to write a custom Source using the Python Dataflow SDK to read JSON data in parallel from a REST endpoint.
E.g. for a given set of IDs, I need to retrieve the data from:
https://foo.com/api/results/1
https://foo.com/api/results/2
...
https://foo.com/api/results/{maxID}
The key features I need are monitoring and rate limiting: even though I need parallelism (either thread/process-based or using async/coroutines), I also need to make sure my job stays "polite" towards the API endpoint, effectively avoiding an involuntary DDoS.
Using psq, I should be able to implement some kind of rate-limiting mechanism, but then I'd lose the ability to monitor progress and ETA using the Dataflow Service Monitoring.
It seems that, although they work well together, monitoring isn't unified between Google Cloud Dataflow and Google Cloud Pub/Sub (which uses Google Stackdriver Monitoring).
How should I go about building a massively parallel HTTP consumer workflow that implements rate limiting and has web-based monitoring?
Dataflow does not currently have a built-in way of doing global rate-limiting, but you can use the Source API to do this. The key concept is that each split of a Source will be processed by a single thread at most, so you can implement local rate limiting separately for each split.
This solution does not use Pub/Sub at all, so you can exclusively use the Dataflow Monitoring UI. If you want to set up alerts based on particular events in your pipeline, you could do something like this
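A rough sketch of that per-split rate limiting, using the Apache Beam Python SDK's BoundedSource API (the endpoint, ID range and max_qps are placeholders):

```python
import time

import requests
from apache_beam.io import iobase
from apache_beam.io.range_trackers import OffsetRangeTracker

class RateLimitedRestSource(iobase.BoundedSource):
    """Reads https://foo.com/api/results/{id} for ids in [start, stop),
    throttled to max_qps requests per second within each split."""

    def __init__(self, start_id, stop_id, max_qps=1.0):
        self._start = start_id
        self._stop = stop_id
        self._max_qps = max_qps

    def estimate_size(self):
        return self._stop - self._start

    def get_range_tracker(self, start_position, stop_position):
        start = self._start if start_position is None else start_position
        stop = self._stop if stop_position is None else stop_position
        return OffsetRangeTracker(start, stop)

    def split(self, desired_bundle_size, start_position=None, stop_position=None):
        start = self._start if start_position is None else start_position
        stop = self._stop if stop_position is None else stop_position
        while start < stop:
            end = min(start + desired_bundle_size, stop)
            yield iobase.SourceBundle(weight=end - start, source=self,
                                      start_position=start, stop_position=end)
            start = end

    def read(self, range_tracker):
        for result_id in range(range_tracker.start_position(),
                               range_tracker.stop_position()):
            if not range_tracker.try_claim(result_id):
                return
            # Each split is read by at most one thread, so sleeping here caps
            # this split's request rate at max_qps (local rate limiting).
            time.sleep(1.0 / self._max_qps)
            yield requests.get(
                "https://foo.com/api/results/%d" % result_id).json()
```

The overall request rate is then roughly max_qps times the number of splits, so choose the bundle size and max_qps with the endpoint's limits in mind.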