I need to write to BigQuery from PubSub in Python. I tested some async subscriber code and it works fine. But this needs to run continuously and I am not 100% sure where to schedule this. I have been using Cloud Composer (Airflow) but it doesn't look like an ideal fit and it looks like Dataflow is the one recommended by GCP? Is that correct?
Or is there a way to run this from Cloud Composer reliably? I think I can run it once but I want to make sure it runs again in case it fails for some reason.
The two best ways to accomplish this goal would be by either using Cloud Functions or by using Cloud Dataflow. For Cloud Functions, you would set up a trigger on the Pub/Sub topic and then in your code write to BigQuery. It would look similar to the tutorial on streaming from Cloud Storage to BigQuery, except the input would be Pub/Sub messages. For Dataflow, you could use one of the Google-provided, open-source templates to write Pub/Sub messages to BigQuery.
Cloud Dataflow would probably be better suited if your throughput is high (thousands of messages per second) and consistent. If you have low or infrequent throughput, Cloud Functions would likely be a better fit. Either of these solutions would run constantly and write the messages to BigQuery when available.
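A minimal sketch of the Cloud Functions option, assuming a Pub/Sub-triggered background function whose messages are JSON objects that already match the table schema. The table ID `my-project.my_dataset.my_table` and the function names are placeholders, not something from the question; the `client` parameter exists only so the logic can be exercised without real credentials.

```python
import base64
import json


def decode_pubsub_event(event):
    """Turn a Pub/Sub background-function event into one BigQuery row (a dict)."""
    # Pub/Sub delivers the message body base64-encoded under the "data" key.
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)


def pubsub_to_bigquery(event, context, client=None,
                       table_id="my-project.my_dataset.my_table"):
    """Entry point: decode the message and stream it into BigQuery."""
    if client is None:
        # Imported lazily so the decoding logic can be tested without the library.
        from google.cloud import bigquery
        client = bigquery.Client()
    row = decode_pubsub_event(event)
    # insert_rows_json returns a list of per-row errors (empty on success).
    errors = client.insert_rows_json(table_id, [row])
    if errors:
        # Raising marks the invocation as failed, so the message can be
        # redelivered if retries are enabled on the function.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return row
```

Because a failed invocation may be retried, the same message can be written more than once; deduplicate downstream if that matters.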
I have a simple Python script, and I would like to run thousands of instances of it on GCP at the same time. The script is triggered by the $Universe scheduler with something like "python main.py --date '2022_01'".
What architecture and technology should I use to achieve this?
PS: I cannot drop $Universe, but I'm open to suggestions for other technologies.
My solution:
I already have a $Universe server running all the time.
Create a Pub/Sub topic
Create a permanent Compute Engine instance that listens to Pub/Sub all the time
$Universe sends thousands of events to Pub/Sub
The Compute Engine instance triggers the creation of a Python Docker image on another Compute Engine instance
Scale the creation of the Docker images (I don't know how to do this)
Is it a good architecture?
How to scale this kind of process?
Thank you :)
Architecture and design questions can be very difficult to discuss, as they usually depend heavily on the context, scope, functional and non-functional requirements, cost, available skills and knowledge, and so on.
Personally, I would prefer to stay with an entirely serverless approach if possible.
For example, use Cloud Scheduler (serverless cron jobs) to send messages to a Pub/Sub topic, with a Cloud Function (or something else) on the other side that is triggered by each message.
Whether it should be a Cloud Function or something else, and what it should do and how, depends on your case.
As I understand it, you will have a lot of simultaneous calls to custom Python code, triggered by an orchestrator ($Universe), and you want to run it on GCP.
Like @al-dann, I would go for a serverless approach in order to reduce the cost.
As I also understand it, Pub/Sub does not seem to be necessary: you could easily trigger the function with a plain HTTP call and avoid Pub/Sub.
Pub/Sub is only necessary to get a delivery guarantee (at-least-once processing), but you can get the same behaviour if $Universe validates the HTTP response for every call (check the HTTP response code and body, and retry if they do not match the expected result).
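The retry-on-unexpected-response idea can be sketched as follows. This is only an illustration on the caller's side: the function names, the "HTTP 200 means success" rule, and the exponential backoff are assumptions, and the real check would depend on what the endpoint returns.

```python
import time
import urllib.error
import urllib.request


def http_post(url, body, timeout=10):
    """Send one POST and return (status_code, response_body)."""
    request = urllib.request.Request(url, data=body, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report them as a status.
        return err.code, err.read()


def call_until_acknowledged(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns HTTP 200, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = send()
        except OSError:
            # Connection errors and timeouts count as a failed attempt.
            status, body = None, b""
        if status == 200:
            return body
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"no successful response after {max_attempts} attempts")
```

A caller would use it as `call_until_acknowledged(lambda: http_post(url, payload))`. Since a request may be processed and then retried anyway (for example when the response is lost), the function behind the endpoint should be idempotent.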
If you want exactly-once processing, you will need more tooling; you are then close to event streaming (which could be a good use case, as far as I understand). In that case, staying fully on GCP, I would go with Pub/Sub and Dataflow, which can guarantee exactly-once processing, or with Kafka and Kafka Streams or Flink.
If at-least-once processing is fine for you, I would go with the HTTP version, which I think will be simple to maintain. You have three serverless options in that case:
App Engine standard: scales to 0, and you pay for CPU usage. It can be more affordable than Cloud Functions below if the requests are confined to a short period (a few hours per day), since the same hardware processes many requests.
Cloud Functions: you pay per request (plus CPU, memory, network, ...) and don't have to think about anything other than code, but the code you run is tied to a proprietary solution.
Cloud Run: my preferred one, since it has the same pricing as Cloud Functions but you gain portability. The application is a simple Docker image that you can easily move (to Kubernetes, Compute Engine, ...), so you can change the execution engine depending on cost (if the load differs between the study and the real world).
We are using pubsub and cloud functions in GCP to orchestrate our data workflow.
Our workflow is something like :
[Image: workflow_gcp]
pubsub1 and pubsub3 can be triggered at different times (ex: 1am and 4am). They are triggered daily, from an external source (our ETL, Talend).
Our cloud functions basically execute SQL in BigQuery.
This works well, but we had to manually create an orchestration database to log when functions start and end (to answer the question "did function X execute OK?"). The orchestration logic is also strongly coupled with our business logic, since each Cloud Function must know which functions have to run before it and which Pub/Sub topic to trigger afterwards.
So we're looking for a solution that separates the orchestration logic from the business logic.
I found that Composer (Airflow) could be a solution, but:
it can't run Cloud Functions natively (and via the API it's very limited: 16 calls per 100 seconds per project)
we can use BigQuery inside Airflow with the BigQuery operators, but the orchestration and business logic would be strongly coupled again
So what is the best practise in our case?
Thanks for your help
You can use Cloud Composer (Airflow) and still reuse most of your existing setup.
First, you can keep all your existing Cloud Functions and use HTTP triggers (or whatever you prefer) to trigger them from Airflow. The only change you will need to make is to implement a Pub/Sub sensor in Airflow so that it triggers your Cloud Functions (ensuring that you control orchestration from end to end of your process).
Your solution will be an Airflow DAG that triggers the Cloud Functions based on the Pub/Sub messages, reports back to Airflow whether the functions succeeded, and then, if both succeeded, triggers the third Cloud Function with an HTTP trigger or similar.
A final note, which is not immediately intuitive: Airflow is not meant to run the jobs itself; it is meant to orchestrate and manage dependencies. Using Cloud Functions triggered by Airflow is not an anti-pattern; it is actually a best practice.
In your case, you could certainly rewrite a few things and use the BigQuery operators, since you don't do any processing and only trigger queries/jobs. The concept still holds, though: the best practice is to use Airflow to make sure things happen when and in the order you need, not to process those things itself. (I hope that makes sense.)
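To make the separation concrete, here is a small plain-Python illustration (not Airflow code) of keeping the dependency graph apart from the work itself. The task names and the `run_workflow` helper are hypothetical; in Airflow, the DAG definition plays the role of `dependencies` and the operators play the role of `tasks`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task once, after all of its upstream tasks, and log the order.

    tasks: name -> zero-argument callable (the business logic, e.g. an HTTP
           call to a Cloud Function).
    dependencies: name -> set of task names that must finish first.
    """
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        # An exception here stops the run, so downstream tasks never start.
        tasks[name]()
        log.append(name)
    return log
```

With `dependencies = {"function_3": {"function_1", "function_2"}}`, the third function only runs after the first two have completed, and none of the functions needs to know about the others.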
As an alternative to Airflow, I would look at Argo Workflows: https://github.com/argoproj/argo
It doesn't have the cost overhead that Composer has, especially for smaller workloads.
I would:
Create a deployment that reads the Pub/Sub messages from the external tool, and deploy it to Kubernetes.
Execute a workflow based on the message. Each step in the workflow could be a Cloud Function packaged in Docker.
(I would replace the Cloud Function with a Kubernetes job, which is then triggered by the workflow.)
It is pretty straightforward to package a Cloud Function with Docker and run it in Kubernetes.
There are prebuilt Docker images with gsutil/bq/gcloud, so you could write bash scripts that use the "bq" command-line tool to execute things in BigQuery.
Say I would like to scrape JSON-formatted information from a URL that refreshes every 2 minutes. I need to run my pipeline (written in Python) continuously, every 2 minutes, to capture the data without missing any. In this sense the pipeline does real-time processing.
Currently, I'm using Jenkins to run the pipeline every 2 minutes, but I do not think this is correct practice, since Jenkins is meant for CI/CD pipelines and I doubt mine is one. Even though I know there is a Jenkins pipeline plugin, I still think using it is conceptually incorrect.
So, what is the best data processing tool in this case? At this point, no data transformation is needed, but I believe transformation will be required in the future as the process gets more complex. Just FYI, the data will be pumped into Azure Blob Storage.
Jenkins is meant for CI/CD pipelines and not for scheduling jobs. So yes, you are right, Jenkins isn't the right tool for what you want to achieve.
I suppose you are using Microsoft Azure.
If you are using Databricks, write a Spark Streaming application and create a job for it.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
The other approach is to use Azure Data Factory with a scheduled trigger for the job. As you want to run every 2 minutes, I am not sure what the cost will be.
As mentioned, Jenkins probably should not be used as a distributed cron system. Let it do what it's good at: CI.
You can boot any VM and create a cron script, or you can use a scheduler like Airflow.
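If a cron entry or Airflow is not available on the VM, a fixed-interval loop in the script itself is another way to get the same effect. The sketch below is an assumption-laden illustration: `run_every` and its parameters are made-up names, and `clock` and `sleep` are injectable only to make the timing logic testable.

```python
import time


def run_every(interval_seconds, job, iterations,
              clock=time.monotonic, sleep=time.sleep):
    """Call job() `iterations` times, one run per interval, without drift."""
    next_run = clock()
    for _ in range(iterations):
        job()
        # Schedule relative to the planned start, not the finish time,
        # so a slow run does not push every later run back.
        next_run += interval_seconds
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
```

For the 2-minute case this would be called as `run_every(120, scrape_once, iterations)`, where `scrape_once` is a placeholder for the pipeline's entry point. Unlike cron or Airflow, nothing restarts this loop if the process dies, so it still needs a supervisor.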
FWIW, "every 2 minutes" is not "real time"
Just FYI, the data will pump into azure blob storage.
If you are set on using Kafka (via Event Hub), you could use Kafka Connect to scrape the site, send data into a Kafka topic, then use a sink connector to send data to WASB
I designed a beam / dataflow pipeline using the beam python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops right after it has started (within one second) without a specific error message when using the DataflowRunner (it works with the DirectRunner executed on my computer). When using a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq cli tool as suggested to get more information about the BQ load job does not work. The job cannot be found (and I doubt that it has been created at all due to instant failure).
I suppose I am running into some kind of quota / BigQuery restriction, or even an out-of-memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/)
Limitations
BigQueryIO currently has the following limitations.
You can’t sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam’s Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause for this issue.
I'd also like to try out a partition function, but I could not find any Python examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than "A work item was attempted 4 times without success". For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see whether the workers are indeed running out of memory or whether your error is happening elsewhere.
Good luck!
As far as I know, there is no option for diagnosing OOM in Cloud Dataflow with the Apache Beam Python SDK (it is possible with the Java SDK). I recommend opening a feature request in the Cloud Dataflow issue tracker to get more details for this kind of issue.
In addition to checking the Dataflow job logs, I recommend monitoring your pipeline with Stackdriver Monitoring, which reports resource usage per job (such as total memory usage time).
Regarding the use of the Partition function in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
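import apache_beam as beam

# get_percentile is assumed to be a user-defined helper (not shown here)
# that returns a value in [0, 100) for an element.
def partition_fn(element, num_partitions):
    # Called once per element; must return an int in [0, num_partitions).
    return int(get_percentile(element) * num_partitions / 100)

# input_data is the PCollection to split; the result is a tuple of 3 PCollections.
partition = input_data | beam.Partition(partition_fn, 3)
for x in range(3):
    # Each partition gets its own uniquely labelled write step.
    partition[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)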
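The snippet above relies on a `get_percentile` helper that it does not define. As one possible stand-in, here is a sketch of a partition function that spreads elements evenly by hashing them. It assumes the elements are JSON-serializable dicts; `partition_by_hash` is a hypothetical name, and the checksum-based split is a substitute for, not a reconstruction of, the percentile logic.

```python
import json
import zlib


def partition_by_hash(element, num_partitions):
    """Return a stable partition index in [0, num_partitions) for a dict element."""
    # sort_keys makes the serialization independent of key order, and crc32
    # (unlike the built-in hash) gives the same value on every worker.
    key = json.dumps(element, sort_keys=True).encode("utf-8")
    return zlib.crc32(key) % num_partitions
```

It has the `(element, num_partitions)` signature that `beam.Partition` expects, so it could be passed as `beam.Partition(partition_by_hash, 3)`.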
I'm trying to write a custom Source using the Python Dataflow SDK to read JSON data in parallel from a REST endpoint.
E.g. for a given set of IDs, I need to retrieve the data from:
https://foo.com/api/results/1
https://foo.com/api/results/2
...
https://foo.com/api/results/{maxID}
The key features I need are monitoring and rate limiting: even though I need parallelism (either thread/process-based or using async/coroutines), I also need to make sure my job stays "polite" towards the API endpoint, effectively avoiding an involuntary DDoS.
Using psq, I should be able to implement some kind of rate-limiting mechanism, but then I'd lose the ability to monitor progress and ETA using the Dataflow Service Monitoring.
It seems that, although they work well together, monitoring isn't unified between Google Cloud Dataflow and Google Cloud Pub/Sub (which uses Google Stackdriver Monitoring).
How should I go about building a massively parallel HTTP consumer workflow that implements rate limiting and has web-based monitoring?
Dataflow does not currently have a built-in way of doing global rate-limiting, but you can use the Source API to do this. The key concept is that each split of a Source will be processed by a single thread at most, so you can implement local rate limiting separately for each split.
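A per-split limiter of the kind described could look like the sketch below. It is plain Python with no Dataflow API involved; `LocalRateLimiter` is a hypothetical helper, and `clock` and `sleep` are injectable only for testing. The idea is that each split's reader creates its own instance and calls `wait()` before every HTTP request.

```python
import time


class LocalRateLimiter:
    """Allow at most `max_per_second` calls per second within one thread."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Because the limit is local, the total request rate is roughly the per-split rate multiplied by the number of splits being read concurrently, so the per-split rate has to be chosen with that in mind.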
This solution does not use Pub/Sub at all, so you can exclusively use the Dataflow Monitoring UI. If you want to set up alerts based on particular events in your pipeline, you could do something like this