HttpIO - Consuming external resources in a Dataflow transform - python

I'm trying to write a custom Source using the Python Dataflow SDK to read JSON data in parallel from a REST endpoint.
E.g. for a given set of IDs, I need to retrieve the data from:
https://foo.com/api/results/1
https://foo.com/api/results/2
...
https://foo.com/api/results/{maxID}
The key features I need are monitoring and rate limiting: even though I need parallelism (either thread/process-based or using async/coroutines), I also need to make sure my job stays "polite" towards the API endpoint, effectively avoiding an involuntary DDoS.
Using psq, I should be able to implement some kind of rate-limit mechanism, but then I'd lose the ability to monitor progress & ETA using the Dataflow Service Monitoring.
It seems that, although they work well together, monitoring isn't unified between Google Cloud Dataflow and Google Cloud Pub/Sub (which uses Google Stackdriver Monitoring).
How should I go about building a massively-parallel HTTP consumer workflow which implements rate limiting and has web-based monitoring?

Dataflow does not currently have a built-in way of doing global rate-limiting, but you can use the Source API to do this. The key concept is that each split of a Source will be processed by a single thread at most, so you can implement local rate limiting separately for each split.
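As a rough illustration of per-split local rate limiting, here is a minimal sketch in plain Python. It is not Dataflow SDK code: `LocalRateLimiter`, `read_split` and the `fetch` callable are hypothetical names, and the per-split rate is an assumption you would have to derive yourself (for example, the overall request budget divided by the number of splits that may run at once).

```python
import time


class LocalRateLimiter:
    """Allow at most `max_per_second` calls per second within one thread."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been sent yet

    def wait(self):
        """Block until the next request may be sent, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.min_interval


def read_split(ids, fetch, limiter):
    """Fetch every ID of one split sequentially, pacing the calls.

    `fetch` is a hypothetical callable that takes a URL and returns the
    parsed result; one limiter instance is meant to serve one split only.
    """
    for result_id in ids:
        limiter.wait()
        yield fetch("https://foo.com/api/results/%d" % result_id)
```

Because each split is read by one thread at most, the limiter needs no locking. The clock and sleep functions are injectable only so the pacing can be checked without actually waiting.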
This solution does not use Pub/Sub at all, so you can exclusively use the Dataflow Monitoring UI. If you want to set up alerts based on particular events in your pipeline, you could do something like this

Related

Hosting webhook target GCP Cloud Function

I am very new to GCP. My plan is to create a webhook target on GCP that listens for events from a third-party application, then kicks off scripts to download files from the webhook event and push them to JIRA/GitHub. During my research I read a lot about Cloud Functions, but there are also Cloud Run, App Engine and Pub/Sub. Any suggestions on which path to follow?
Thanks!
There are use cases in which Cloud Functions, Cloud Run and App Engine can be used interchangeably (not Pub/Sub, as it is a messaging service). There are, however, use cases that some of them do not fit well.
Cloud Functions must be triggered, and each execution is (or should be) isolated, which implies you cannot expect a function to keep a connection alive to your third party. They also have a limited time per execution. They tend to be atomic, so if you have complex logic spanning several of them you must be careful in your design; otherwise you will end up with a distributed solution that is very difficult to manage.
App Engine is an application you deploy and it is permanently active, therefore you can maintain a connection to your third-party app.
Cloud Run is somewhere in the middle: it is triggered when it is used, but it can share a context, and different requests benefit from that (temporarily keeping connections alive or caching, for instance). It also has more capabilities in terms of the technologies you can use.
Pub/Sub, as mentioned, is a service to which you can send information (fire and forget). It allows you to have one or more listeners on the other side, which may be your Cloud Function, App Engine or Cloud Run service, to process the information and proceed.
BTW, consider using Cloud Storage for your files, especially if you expect them to be there between different service calls.

Run & scale simple python scripts on Google Cloud Platform

I have a simple Python script and I would like to run thousands of instances of it on GCP (at the same time). This script is triggered by the $Universe scheduler, with something like "python main.py --date '2022_01'".
What architecture and technology should I use to achieve this?
PS: I cannot drop $Universe, but I'm not against suggestions to use other technologies.
My solution:
I already have a $Universe server running all the time.
Create a Pub/Sub topic
Create a permanent Compute Engine instance that listens to Pub/Sub all the time
$Universe sends thousands of events to Pub/Sub
The Compute Engine instance triggers the creation of a Python Docker image on another Compute Engine instance
Scale the creation of the Docker images (I don't know how to do it)
Is it a good architecture?
How to scale this kind of process?
Thank you :)
It can be very difficult to discuss architecture and design questions, as they usually depend heavily on the context, scope, functional and non-functional requirements, cost, available skills and knowledge, and so on...
Personally, I would prefer to stay with an entirely serverless approach if possible.
For example, use Cloud Scheduler (serverless cron jobs), which sends messages to a Pub/Sub topic, on the other side of which there is a Cloud Function (or something else) that is triggered by the message.
Whether it should be a Cloud Function or something else, and what it should do and how, depends on your case.
As I understand it, you will have a lot of simultaneous calls to custom Python code, triggered by an orchestrator ($Universe), and you want it to run on GCP.
Like @al-dann, I would go with a serverless approach in order to reduce the cost.
As I also understand it, Pub/Sub does not seem necessary: you could easily trigger the function with any HTTP call and avoid Pub/Sub.
Pub/Sub is necessary only to get some guarantee (at-least-once processing), but you can have the same behaviour if $Universe validates the HTTP response for every call (look at the HTTP response code and body, and retry if they do not match the expected result).
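To make the "validate and retry" idea concrete, here is a minimal sketch of what the caller side could look like, using only the Python standard library. The function name `trigger_with_retry`, the POST method, the retry count and the backoff values are all assumptions for illustration; how $Universe would actually invoke it is not covered here.

```python
import time
import urllib.error
import urllib.request


def trigger_with_retry(url, attempts=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """POST to `url` until it answers with a 2xx status; return the body."""
    last_error = None
    for attempt in range(attempts):
        try:
            request = urllib.request.Request(url, data=b"", method="POST")
            with opener(request, timeout=30) as response:
                if 200 <= response.status < 300:
                    return response.read()
                last_error = RuntimeError(
                    "unexpected status %d" % response.status)
        except OSError as exc:
            # URLError and HTTPError are subclasses of OSError, so network
            # failures and 4xx/5xx responses both end up here.
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("giving up after %d attempts" % attempts) from last_error
```

Retrying means the same request can reach the service more than once, so the job behind the endpoint should be safe to run twice for the same date.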
If you want exactly-once processing, you will need more tooling; you are then close to event streaming (which could be a good use case, as far as I understand). In that case, staying fully on GCP, I would go with Pub/Sub and Dataflow, which can guarantee exactly-once processing, or with Kafka and Kafka Streams, or Flink.
If at-least-once processing is fine for you, I would go with the HTTP version, which I think will be simple to maintain. You have 3 serverless options for that case:
App Engine standard: scales to 0, you pay for the CPU usage, and it can be more affordable than Cloud Functions below if the requests are confined to a short period (a few hours per day), since the same hardware will process many requests.
Cloud Functions: you pay per request (plus CPU, memory, network, ...) and don't have to think about anything other than code, but the code you run is tied to a proprietary solution.
Cloud Run: my preferred one, since it has the same pricing as Cloud Functions but you gain portability. The application is a simple Docker image that you can move easily (to Kubernetes, Compute Engine, ...), and you can change the execution engine depending on cost (if the load differs between the study and the real world).

Is Jenkins the right tool for real time streaming pipeline?

Say I would like to scrape JSON-formatted information from a URL which refreshes every 2 minutes. I need to run my pipeline (written in Python) continuously, every 2 minutes, to capture the data without missing any. In this case the pipeline does real-time processing.
Currently, I'm using Jenkins to run the pipeline every 2 minutes, but I do not think this is correct practice, as Jenkins is meant for CI/CD pipelines and I doubt mine is a CI/CD pipeline. Even though I know there is a Jenkins pipeline plugin, I still think using the plugin is conceptually incorrect.
So, what is the best data processing tool in this case? At this point, no data transformation is needed, but I believe that in the future, as the process gets more complex, transformation will be required. Just FYI, the data will be pumped into Azure Blob Storage.
Jenkins is meant for CI/CD pipelines and not for scheduling jobs. So yes, you are right, Jenkins isn't the right tool for what you want to achieve.
I suppose you are using Microsoft Azure.
If you are using Databricks, use a Spark Streaming application and create the job.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
The other approach is to use Azure Data Factory with a scheduled trigger for the job. As you want to run every 2 minutes, I am not sure what the cost involved will be.
As mentioned Jenkins probably should not be used as a distributed Cron system... Let it do what it's good at - CI.
You can boot any VM and create a Cron script or you can use a scheduler like Airflow
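If you go the plain-VM route and would rather keep one long-lived Python process than a cron entry, the scheduling part can be as small as the sketch below. This is an alternative to cron or Airflow, not how either of them works; `run_every` and its parameters are hypothetical names, and `job` stands in for your existing pipeline entry point.

```python
import time


def run_every(interval_seconds, job, iterations=None,
              clock=time.monotonic, sleep=time.sleep):
    """Call `job()` at a fixed interval, compensating for the job's runtime.

    With iterations=None the loop runs until the process is stopped.
    """
    next_run = clock()
    count = 0
    while iterations is None or count < iterations:
        job()
        count += 1
        # Schedule relative to the planned start, not the finish time,
        # so a slow run does not push every later run back.
        next_run += interval_seconds
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
```

Usage would look like `run_every(120, scrape_once)`, where `scrape_once` is whatever function downloads the JSON and writes it out. Unlike cron, a crash here stops all future runs, so the process would need to be supervised (systemd or similar).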
FWIW, "every 2 minutes" is not "real time"
Regarding the data ending up in Azure Blob Storage: if you are set on using Kafka (via Event Hub), you could use Kafka Connect to scrape the site, send the data into a Kafka topic, then use a sink connector to send the data to WASB.

Azure functions: Can I implement my architecture and how do I minimize cost?

I am interested in implementing a compute service in the cloud for an application I'm working on. The idea is that there are 3 modules in the service. A compute manager receives requests (with input data) and triggers Azure Functions computes (the computes are the 2nd 'module'). Both modules share the same blob storage for the scripts to be run and the input/output data (JSON) for the compute.
I want to draw up a basic diagram but need to understand a few things first. Is what I described above possible, or must Azure Functions have their own separate storage? Can Azure Functions run concurrent executions of the same script with different data?
I'm new to Azure, so what I've been learning about Azure Functions hasn't yet answered my questions. I'm also unsure how to minimise cost. The functions won't run often.
I hope someone could shed some light on this for me :)
Thanks
In fact, Azure Functions itself has many kinds of triggers, for example the HTTP trigger, Storage trigger, or Service Bus trigger.
So I think you can use it without your compute manager if one of the built-in triggers meets your requirements.
At the same time, all functions can share the same storage account. You just need to use the correct storage account connection string.
Finally, as your functions will not run often, I suggest you use the Azure Functions Consumption plan. When you're using the Consumption plan, instances of the Azure Functions host are dynamically added and removed based on the number of incoming events.

GCP: Where to schedule PubSub subscriber which writes to BigQuery

I need to write to BigQuery from PubSub in Python. I tested some async subscriber code and it works fine. But this needs to run continuously and I am not 100% sure where to schedule this. I have been using Cloud Composer (Airflow) but it doesn't look like an ideal fit and it looks like Dataflow is the one recommended by GCP? Is that correct?
Or is there a way to run this from Cloud Composer reliably? I think I can run it once but I want to make sure it runs again in case it fails for some reason.
The two best ways to accomplish this goal would be by either using Cloud Functions or by using Cloud Dataflow. For Cloud Functions, you would set up a trigger on the Pub/Sub topic and then in your code write to BigQuery. It would look similar to the tutorial on streaming from Cloud Storage to BigQuery, except the input would be Pub/Sub messages. For Dataflow, you could use one of the Google-provided, open-source templates to write Pub/Sub messages to BigQuery.
Cloud Dataflow would probably be better suited if your throughput is high (thousands of messages per second) and consistent. If you have low or infrequent throughput, Cloud Functions would likely be a better fit. Either of these solutions would run constantly and write the messages to BigQuery when available.
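For the Cloud Functions route, the function body can be quite small. The sketch below assumes a Pub/Sub-triggered background function (which receives the message payload base64-encoded in `event["data"]`) and assumes each message is a JSON object whose keys match the BigQuery table columns. `message_to_row`, `make_handler` and the injected `insert_rows` callable are hypothetical names; in a real deployment `insert_rows` could wrap the BigQuery client's `insert_rows_json`, which returns a list of errors that is empty on success.

```python
import base64
import json


def message_to_row(event):
    """Decode a Pub/Sub-triggered event into a dict usable as a table row."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)


def make_handler(insert_rows):
    """Build the function entry point around an injected row writer.

    `insert_rows` takes a list of row dicts and returns a list of errors
    (empty when everything was written).
    """
    def handle(event, context):
        errors = insert_rows([message_to_row(event)])
        if errors:
            # Raising reports the invocation as failed instead of
            # silently dropping the message.
            raise RuntimeError("BigQuery insert failed: %s" % errors)
    return handle
```

Keeping the writer injectable means the decoding logic can be exercised locally without any GCP credentials.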
