I designed a Beam/Dataflow pipeline using the Beam Python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops within one second of starting, without a specific error message, when using the DataflowRunner (it works with the DirectRunner executed on my computer). When using a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq cli tool as suggested to get more information about the BQ load job does not work. The job cannot be found (and I doubt that it has been created at all due to instant failure).
I suspect I am running into some kind of quota / BigQuery restriction or even an out-of-memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/)
Limitations
BigQueryIO currently has the following limitations.
You can’t sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam’s Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause for this issue.
I'd also like to try out a Partition function, but I did not find any Python examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than "A work item was attempted 4 times without success". For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see if they are indeed running out of memory, or if your error is happening elsewhere.
Good luck!
As far as I know, there is currently no option to diagnose OOM errors in Cloud Dataflow with Apache Beam's Python SDK (it is possible with the Java SDK). I recommend opening a feature request in the Cloud Dataflow issue tracker to get more detail for this kind of issue.
In addition to checking the Dataflow job logs, I recommend monitoring your pipeline with the Stackdriver Monitoring tool, which reports resource usage per job (such as the Total memory usage time).
Regarding the Partition function in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
def partition_fn(element, num_partitions):
    # get_percentile is the helper from the Beam documentation sample; it must
    # return a number from 0 up to (but not including) 100 for the element.
    return int(get_percentile(element) * num_partitions / 100)

partition = input_data | beam.Partition(partition_fn, 3)
for x in range(3):
    partition[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
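The snippet above depends on get_percentile, which comes from the documentation sample and is not defined here. If an even spread across tables is all that is needed, a deterministic hash can choose the partition instead. This is only a sketch under assumptions: elements are JSON-serialisable dicts, and the names hash_partition_fn, write_partitions and the "_0", "_1", ... table suffixes are made up for illustration.

```python
import json
import zlib


def hash_partition_fn(element, num_partitions):
    """Map an element to a partition index in [0, num_partitions)."""
    # Serialise with sorted keys so equal dicts always hash to the same value.
    payload = json.dumps(element, sort_keys=True).encode("utf-8")
    return zlib.crc32(payload) % num_partitions


def write_partitions(rows, table_prefix, table_schema, num_partitions=3):
    """Split `rows` (a PCollection of dicts) and write each part to its own table.

    `table_prefix` is a hypothetical "project:dataset.table" stem; the numeric
    suffix is an assumed naming scheme, not something Beam requires.
    """
    # Imported here so hash_partition_fn can be used and tested without Beam.
    import apache_beam as beam

    parts = rows | "Partition" >> beam.Partition(hash_partition_fn, num_partitions)
    for i in range(num_partitions):
        parts[i] | "WritePartition %d" % i >> beam.io.WriteToBigQuery(
            "%s_%d" % (table_prefix, i),
            schema=table_schema,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
```

Writing each partition to its own table follows the workaround quoted from the documentation; whether that actually avoids the failure in the question is untested.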
Related
I am using python sdk and trying to broadcast a spacy model (~50MB). The job will run on Dataflow.
I am new to beam, and based on my understanding: we cannot load large objects in map function and we cannot load them before submitting jobs since job sizes are capped. Below is a workaround to "lazy-loading" large objects on workers.
ner_model = (
    pipeline
    | "ner_model" >> beam.Create([None])
    | beam.Map(lambda x: spacy.load("en_core_web_md"))
)
(
    pipeline
    | bq_input_op
    | beam.Map(use_model_to_extract_person, beam.pvalue.AsSingleton(ner_model))
    | bq_output_op
)
but I got
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging.
However, there are no Stackdriver logs generated at all. Am I on the right track?
Edit:
I am using apache-beam 2.23.0
The issue might be that your worker does not have enough memory. You could probably solve it by using a worker with more memory. Currently the default worker is n1-standard-1 with only 3.75 GB RAM.
The related PipelineOption is:
workerMachineType String
The Compute Engine machine type that Dataflow uses when starting
worker VMs. You can use any of the available Compute Engine machine
type families as well as custom machine types.
See here for more information.
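The option name quoted above is the Java spelling. In the Python SDK the corresponding flag is --machine_type (--worker_machine_type is accepted as an alias). A minimal sketch of passing it follows; the helper names dataflow_args and make_options are hypothetical, and n1-standard-4 (15 GB RAM) is just an example size, not a recommendation.

```python
def dataflow_args(project, region, temp_location, machine_type="n1-standard-4"):
    """Build command-line style flags for a Dataflow run with a larger worker."""
    return [
        "--runner=DataflowRunner",
        "--project=%s" % project,
        "--region=%s" % region,
        "--temp_location=%s" % temp_location,
        # Python SDK flag for the worker VM type; the default value is only an example.
        "--machine_type=%s" % machine_type,
    ]


def make_options(args):
    """Turn the flag list into PipelineOptions (requires apache_beam)."""
    # Imported lazily so dataflow_args stays usable without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    return PipelineOptions(args)
```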
If you have a large, static model to load, you could try using a DoFn and loading it in DoFn.setup rather than passing it as a side input.
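A rough sketch of that suggestion, assuming each element is a plain text string and that the goal is to pull out PERSON entities (the original use_model_to_extract_person is not shown, so extract_persons and make_person_dofn are stand-in names):

```python
def extract_persons(nlp, text):
    """Return the PERSON entity strings that `nlp` finds in `text`."""
    return [ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"]


def make_person_dofn(model_name="en_core_web_md"):
    """Build a DoFn that loads the spaCy model in setup(), on the worker."""
    # Imported lazily so extract_persons can be tested without Beam installed.
    import apache_beam as beam

    class ExtractPersons(beam.DoFn):
        def setup(self):
            # Runs on the worker once per DoFn instance, so the model is never
            # pickled into the job graph or shipped as a side input.
            import spacy
            self._nlp = spacy.load(model_name)

        def process(self, text):
            yield extract_persons(self._nlp, text)

    return ExtractPersons()
```

In the pipeline this would replace the side-input Map with something like beam.ParDo(make_person_dofn()). The spaCy model package still has to be installed on the workers, which this sketch does not cover.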
Say I would like to scrape JSON-format information from a URL which refreshes every 2 minutes. I need to run my pipeline (written in Python) every 2 minutes to capture the data without missing any. In this case the pipeline is real-time processing.
Currently, I'm using Jenkins to run the pipeline every 2 minutes, but I do not think this is correct practice, since Jenkins is meant for CI/CD pipelines and I doubt mine is a CI/CD pipeline. Even though I know there is a Jenkins pipeline plugin, I still think using the plugin is conceptually incorrect.
So, what is the best data processing tool in this case? At this point, no data transformation is needed. I believe that in the future, as the process gets more complex, transformation will be required. Just FYI, the data will pump into azure blob storage.
Jenkins is meant for CI/CD pipelines and not for scheduling jobs. So yes, you are right, Jenkins isn't the right tool for what you want to achieve.
I suppose you are using Microsoft Azure.
If you are using Databricks, use a Spark Streaming application and create the job.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
The other approach is to use Azure Data Factory with a Scheduled Trigger for the job. As you want to run every 2 minutes, I am not sure what the cost involved will be.
As mentioned Jenkins probably should not be used as a distributed Cron system... Let it do what it's good at - CI.
You can boot any VM and create a Cron script or you can use a scheduler like Airflow
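For the cron route, the script that cron calls could look roughly like this. It is a sketch with assumptions: the snapshot is written to a local directory (uploading to Azure Blob Storage is left out), and fetch_json, snapshot_name, run_once and the paths in the crontab comment are all made-up names.

```python
import json
import os
import time
import urllib.request


def fetch_json(url, timeout=30):
    """Download `url` and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def snapshot_name(epoch_seconds, prefix="snapshot"):
    """Build a UTC-timestamped file name so runs at different times stay separate."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch_seconds))
    return "%s_%s.json" % (prefix, stamp)


def run_once(url, out_dir, fetch=fetch_json, now=time.time):
    """Fetch one snapshot and store it under `out_dir`; returns the file path."""
    data = fetch(url)
    path = os.path.join(out_dir, snapshot_name(now()))
    with open(path, "w") as handle:
        json.dump(data, handle)
    return path


# Example crontab entry (paths are placeholders):
#   */2 * * * * /usr/bin/python3 /opt/scraper/scrape.py
```

Keeping each run as its own timestamped file makes it easy to see afterwards whether a two-minute slot was missed.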
FWIW, "every 2 minutes" is not "real time"
Just FYI, the data will pump into azure blob storage.
If you are set on using Kafka (via Event Hub), you could use Kafka Connect to scrape the site, send data into a Kafka topic, then use a sink connector to send data to WASB
I'm having a few problems running a relatively vanilla Dataflow job from an AI Platform Notebook (the job is meant to take data from BigQuery > cleanse and prep > write to a CSV in GCS):
options = {'staging_location': '/staging/location/',
           'temp_location': '/temp/location/',
           'job_name': 'dataflow_pipeline_job',
           'project': PROJECT,
           'teardown_policy': 'TEARDOWN_ALWAYS',
           'max_num_workers': 3,
           'region': REGION,
           'subnetwork': 'regions/<REGION>/subnetworks/<SUBNETWORK>',
           'no_save_main_session': True}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
p = beam.Pipeline('DataflowRunner', options=opts)
(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.Write(beam.io.WriteToText('OUTPUT_DIR/out.csv')))
p.run()
Error returned from stackdriver:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Following warning:
S01:eval_out/WriteToText/Write/WriteImpl/DoOnce/Read+out/WriteToText/Write/WriteImpl/InitializeWrite failed.
Unfortunately not much else other than that. Other things to note:
The job ran locally without any error
The network is running in custom mode but is the default network
Python Version == 3.5.6
Python Apache Beam version == 2.16.0
The AI Platform Notebook is in fact a GCE instance with a Deep Learning VM image deployed on top (with a container-optimised OS); we then used port forwarding to access the Jupyter environment
The service account requesting the job (Compute Engine default service account) has the necessary permissions required to complete this
Notebook instance, dataflow job, GCS bucket are all in europe-west1
I've also tried running this on a standard AI Platform Notebook and still the same problem.
Any help would be much appreciated! Please let me know if there is any other info I can provide which will help.
I've realised that my error is the same as the following:
Why do Dataflow steps not start?
The reason my job has gotten stuck is because the write to gcs step runs first even though it is meant to run last. Any ideas on how to fix this?
Upon code inspection, I noticed that the syntax of the ‘WriteToText transform’ used does not match the one suggested in the Apache beam docs.
Please follow the “WriteToText” argument syntax as outlined in here.
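To make the syntax point concrete: WriteToText is itself a transform and is applied directly with | rather than being wrapped in beam.io.Write(...). A minimal sketch, assuming the rows arrive as dicts and that fieldnames lists the columns to export (to_csv_line and write_csv are illustrative names, not the asker's to_csv):

```python
import csv
import io


def to_csv_line(row, fieldnames):
    """Render one row (a dict) as a single CSV line without a trailing newline."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="")
    writer.writerow([row.get(name) for name in fieldnames])
    return buf.getvalue()


def write_csv(rows, fieldnames, output_prefix):
    """Format `rows` (a PCollection of dicts) as CSV and write sharded text files."""
    # Imported lazily so to_csv_line can be tested without Beam installed.
    import apache_beam as beam

    return (
        rows
        | "csv" >> beam.Map(to_csv_line, fieldnames)
        # WriteToText is applied directly; it adds the newline after each line.
        | "out" >> beam.io.WriteToText(
            output_prefix,
            file_name_suffix=".csv",
            header=",".join(fieldnames)))
```

Here output_prefix would be something like a gs:// path prefix; Beam appends shard numbers and the suffix to it.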
The suggested workaround is to consider using BQ to CSV file export option available in batch mode.
There are even more export options available. The full list can be found in “the data formats and compression types” documentation here.
I need to write to BigQuery from PubSub in Python. I tested some async subscriber code and it works fine. But this needs to run continuously and I am not 100% sure where to schedule this. I have been using Cloud Composer (Airflow) but it doesn't look like an ideal fit and it looks like Dataflow is the one recommended by GCP? Is that correct?
Or is there a way to run this from Cloud Composer reliably? I think I can run it once but I want to make sure it runs again in case it fails for some reason.
The two best ways to accomplish this goal would be by either using Cloud Functions or by using Cloud Dataflow. For Cloud Functions, you would set up a trigger on the Pub/Sub topic and then in your code write to BigQuery. It would look similar to the tutorial on streaming from Cloud Storage to BigQuery, except the input would be Pub/Sub messages. For Dataflow, you could use one of the Google-provided, open-source templates to write Pub/Sub messages to BigQuery.
Cloud Dataflow would probably be better suited if your throughput is high (thousands of messages per second) and consistent. If you have low or infrequent throughput, Cloud Functions would likely be a better fit. Either of these solutions would run constantly and write the messages to BigQuery when available.
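For the Cloud Functions option, a Pub/Sub-triggered background function receives the message payload base64-encoded in event["data"]. The sketch below assumes each message body is a JSON object whose keys match the table columns; make_handler, decode_pubsub_event and the table id are hypothetical names, and the BigQuery client is passed in so that the logic can be exercised without GCP access.

```python
import base64
import json


def decode_pubsub_event(event):
    """Decode the base64 payload of a Pub/Sub-triggered event into a dict."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def make_handler(client, table_id):
    """Build the function entry point around a BigQuery client and target table."""

    def handler(event, context):
        row = decode_pubsub_event(event)
        # insert_rows_json returns a list of per-row errors (empty on success).
        errors = client.insert_rows_json(table_id, [row])
        if errors:
            # Raising marks the invocation as failed so it can be retried.
            raise RuntimeError("BigQuery insert failed: %s" % errors)

    return handler


# In a deployed function this might be wired up as (assumed names):
#   from google.cloud import bigquery
#   handler = make_handler(bigquery.Client(), "my-project.my_dataset.my_table")
```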
We currently have a Python Apache Beam pipeline working and able to be run locally. We are now in the process of having the pipeline run on Google Cloud Dataflow and be fully automated, but have found a limitation in Dataflow/Apache Beam's pipeline monitoring.
Currently, Cloud Dataflow has two ways of monitoring the status of your pipeline(s): through the web UI or through gcloud on the command line. Neither works well for a fully automated solution where we can account for loss-less file processing.
Looking at Apache Beam's github they have a file, internal/apiclient.py that shows there is a function used to get the status of a job, get_job.
The one instance that we have found get_job used is in runners/dataflow_runner.py.
The end goal is to use this API to get the status of a job or several jobs that we automatically trigger to run to ensure they are all eventually processed successfully through the pipeline.
Can anyone explain to us how this API can be used after we run our pipeline (p.run())? We do not understand where runner in response = runner.dataflow_client.get_job(job_id) comes from.
If someone could provide a larger understanding of how we can access this API call while setting up / running our pipeline that would be great!
I ended up just fiddling around with the code and found how to get the job details. Our next step is to see if there is a way to get a list of all of the jobs.
# Module paths follow the file names mentioned in the question
# (internal/apiclient.py, runners/dataflow_runner.py); they may differ between SDK versions.
from apache_beam.internal import apiclient
from apache_beam.runners import dataflow_runner

# start the pipeline process
pipeline = p.run()
# get the job_id for the current pipeline and store it somewhere
job_id = pipeline.job_id()
# setup a job_version variable (either batch or streaming)
job_version = dataflow_runner.DataflowPipelineRunner.BATCH_ENVIRONMENT_MAJOR_VERSION
# setup "runner" which is just a dictionary, I call it local
local = {}
# create a dataflow_client
local['dataflow_client'] = apiclient.DataflowApplicationClient(pipeline_options, job_version)
# get the job details from the dataflow_client
print(local['dataflow_client'].get_job(job_id))
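To turn a one-off get_job call into the automated check described in the question, the lookup can be wrapped in a polling loop. This is a sketch only: wait_for_job is a made-up helper, and the caller is assumed to supply a get_state callable that returns the job state as one of the Dataflow JOB_STATE_* strings (how to extract that string from the get_job response is not shown in the thread).

```python
import time

# Dataflow job states after which the job will not change any further.
TERMINAL_STATES = {
    "JOB_STATE_DONE",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_UPDATED",
    "JOB_STATE_DRAINED",
}


def wait_for_job(get_state, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `get_state()` until it returns a terminal state, then return it."""
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError("job still in state %s after %s seconds"
                               % (state, timeout_seconds))
        sleep(poll_seconds)
```

A caller would treat anything other than JOB_STATE_DONE as a failure and, for example, re-queue the input files for that job.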