Apache Beam pipeline reads from Kafka but hangs when writing to BigQuery - Python

I am creating a pipeline that reads from Kafka using beam_nuggets.io and writes to BigQuery using Apache Beam's WriteToBigQuery.
I am currently running this locally using the DirectRunner to test some of the functionality and concepts. It reads from Kafka with no issue; however, when writing to BigQuery it logs the message "Refreshing access_token" and then nothing happens.
What is really odd is that if I remove the Kafka read and replace it with a simple beam.Create(...), the pipeline successfully refreshes the token and writes to BigQuery as expected.
An extract of the code can be seen below:
messages = (
    p
    | "KafkaConsumer" >> kafkaio.KafkaConsume({
        "topic": "test",
        "bootstrap_servers": "localhost:9092",
        "auto_offset_reset": "earliest"})
    # | "ManualCreate" >> beam.Create([{"name": "ManualTest", "desc": "a test"}])
    | 'Get message' >> beam.Map(lambda x: x[1])
    | 'Parse' >> beam.Map(parse_json)
)

messages | "Write to BigQuery" >> beam.io.WriteToBigQuery(
    pipeline_options.table_spec.get(),
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    batch_size=1)

messages | 'Writing to stdout' >> beam.Map(print)
As an additional point, when running this locally I have the environment variable GOOGLE_APPLICATION_CREDENTIALS set to the location of my service account key file.
Any help in working out what might be causing this issue would be greatly appreciated.

Assuming this is a streaming pipeline, can you try setting an appropriate window or trigger according to the instructions here?
Even though you are not directly using GroupByKey operations in your pipeline, beam.io.WriteToBigQuery uses GroupByKey transforms in its implementation, so you have to do the above to make sure that the data in your pipeline gets propagated appropriately.
When you used the Create transform, the watermark went from zero to infinity, which allowed your pipeline to complete.
Also, note that the Kafka implementation in beam_nuggets.io is a very straightforward one that uses a Beam DoFn to read. Many Beam streaming runners require sources to implement additional streaming features (for example, checkpointing) to operate correctly. I suggest trying out kafka.py available in the Beam repo, which provides a more complete Kafka unbounded read implementation.
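For illustration, a minimal sketch of applying a window/trigger before the BigQuery write, reusing the names from the question; the trigger values are illustrative, not a verified fix:
import apache_beam as beam
from apache_beam.transforms import trigger, window

# Illustrative: give the stream a processing-time trigger so the GroupByKey
# inside WriteToBigQuery has a chance to fire on the DirectRunner.
windowed = (
    messages
    | "Window" >> beam.WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
)

windowed | "Write to BigQuery" >> beam.io.WriteToBigQuery(
    pipeline_options.table_spec.get(),
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    batch_size=1)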

Related

Apache Beam/Dataflow: pipeline doesn't use most recent version of side input (streaming pipeline with global window and frequently updated side input)

I am using Apache Beam (SDK 2.40.0) with the GCP Dataflow runner and a streaming pipeline. I need to use a configuration for processing my data that can be altered at any time. Therefore, I'm loading it every 2 minutes (acceptable delay) as a side input like this:
configs = (
    p
    | PeriodicImpulse(fire_interval=120, apply_windowing=False)
    | "Global Window" >> beam.WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING
    )
    | 'Get Side Input' >> beam.ParDo(GetConfigsFn())
)
With an additional print statement I have verified that the configs are successfully loaded every 2 minutes and output into a PCollection.
I use the configs in another step where I process PubSub messages like this (I left out all irrelevant steps; the messages are in a global window as well):
msgs_with_config = (
    pubsub_messages
    | 'Merge data and configs' >> beam.ParDo(
        AddConfigFromSideInputFn(),
        config_dict=beam.pvalue.AsDict(configs))
)
The problem I am facing is that the 'Merge data and configs' step uses older versions of the configs instead of the most recent one. It takes an arbitrary amount of time (anywhere from a few minutes to 20 minutes to several hours) until a newer version of the configs is used. My suspicion is that the side input is cached somewhere and is not reloaded for every processed message.
Is this a valid explanation for this behaviour, and is it expected? Are there other possible reasons for it?
How can I avoid this behaviour, so that the most recent side input version is always used?
Yes, side inputs are cached in the Dataflow workers and reused across bundles. If you actually need faster re-loading, I would suggest refactoring the pipeline to perform a join of two windowed PCollections instead of using one PCollection as a side input, for example using the CoGroupByKey transform.
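A minimal sketch of that kind of join, assuming messages and configs can be keyed on a common key and that suitable windowing/triggering is applied to both sides; the key, transform names, and merge logic here are hypothetical:
keyed_msgs = pubsub_messages | "KeyMsgs" >> beam.Map(lambda m: ("config", m))
keyed_cfgs = configs | "KeyCfgs" >> beam.Map(lambda c: ("config", c))

msgs_with_config = (
    {"messages": keyed_msgs, "configs": keyed_cfgs}
    | "JoinOnKey" >> beam.CoGroupByKey()
    | "MergeDataAndConfigs" >> beam.FlatMap(
        lambda kv: [(msg, cfg)
                    for msg in kv[1]["messages"]
                    for cfg in kv[1]["configs"]])
)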

Low parallelism when running Apache Beam wordcount pipeline on Spark with Python SDK

I am quite experienced with Spark cluster configuration and running PySpark pipelines, but I'm just starting with Beam. So, I am trying to do an apples-to-apples comparison between PySpark and the Beam Python SDK on a Spark PortableRunner (running on top of the same small Spark cluster, 4 workers each with 4 cores and 8GB RAM), and I've settled on a wordcount job for a reasonably large dataset, storing the results in a Parquet table.
I have thus downloaded 50GB of Wikipedia text files, split across about 100 uncompressed files, and stored them in the directory /mnt/nfs_drive/wiki_files/ (/mnt/nfs_drive is an NFS drive mounted on all workers).
First, I am running the following Pyspark wordcount script:
from pyspark.sql import SparkSession, Row
from operator import add

wiki_files = '/mnt/nfs_drive/wiki_files/*'

spark = SparkSession.builder.appName("WordCountSpark").getOrCreate()

spark_counts = spark.read.text(wiki_files).rdd.map(lambda r: r['value']) \
    .flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .map(lambda x: Row(word=x[0], count=x[1]))

spark.createDataFrame(spark_counts).write.parquet(path='/mnt/nfs_drive/spark_output', mode='overwrite')
The script runs perfectly well and outputs the parquet files at the desired location in about 8 minutes. The main stage (reading and splitting tokens) is divided into a reasonable number of tasks, so the cluster is used efficiently.
I am now trying to achieve the same with Beam and the portable runner. First, I have started the Spark job server (on the Spark master node) with the following command:
docker run --rm --net=host -e SPARK_EXECUTOR_MEMORY=8g apache/beam_spark_job_server:2.25.0 --spark-master-url=spark://localhost:7077
Then, on the master and worker nodes, I am running the SDK Harness as follows:
docker run --net=host -d --rm -v /mnt/nfs_drive:/mnt/nfs_drive apache/beam_python3.6_sdk:2.25.0 --worker_pool
Now that the Spark cluster is set up to run Beam pipelines, I can submit the following script:
import apache_beam as beam
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import fileio

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--job_name=WordCountBeam"
])

wiki_files = '/mnt/nfs_drive/wiki_files/*'

p = beam.Pipeline(options=options)

beam_counts = (
    p
    | fileio.MatchFiles(wiki_files)
    | beam.Map(lambda x: x.path)
    | beam.io.ReadAllFromText()
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)

_ = beam_counts | 'Write' >> beam.io.WriteToParquet(
    '/mnt/nfs_drive/beam_output',
    pyarrow.schema([('word', pyarrow.binary()), ('count', pyarrow.int64())])
)

result = p.run().wait_until_finish()
result = p.run().wait_until_finish()
The code is submitted successfully; I can see the job in the Spark UI and the workers are executing it. However, even when left running for more than 1 hour, it does not produce any output!
I thus wanted to make sure there is no problem with my setup, so I ran the exact same script on a smaller dataset (just 1 Wiki file). It completes successfully in about 3.5 minutes (the Spark wordcount on the same dataset takes 16 seconds!).
I wondered how Beam could be that much slower, so I started looking at the DAG submitted to Spark by the Beam pipeline via the job server. I noticed that the Spark job spends most of its time in a single stage, which is split into just 2 tasks.
Printing debugging lines shows that this is the task where the "heavy lifting" (i.e. reading lines from the wiki files and splitting tokens) is performed; since it happens in only 2 tasks, the work is distributed across at most 2 workers. What is also interesting is that running on the large 50GB dataset results in exactly the same DAG with exactly the same number of tasks.
I am quite unsure how to proceed. It seems like the Beam pipeline has reduced parallelism, but I'm not sure whether this is due to a sub-optimal translation of the pipeline by the job server, or whether I should specify my PTransforms in some other way to increase the parallelism on Spark.
Any suggestion appreciated!
The file IO part of the pipeline can be simplified by using apache_beam.io.textio.ReadFromText(file_pattern='/mnt/nfs_drive/wiki_files/*').
Fusion is another reason that could prevent parallelism. The solution is to throw in an apache_beam.transforms.util.Reshuffle after reading in all the files.
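A minimal sketch combining both suggestions, reusing the names from the question (illustrative only, not a benchmarked fix):
import apache_beam as beam
from apache_beam.transforms.util import Reshuffle

beam_counts = (
    p
    | 'Read' >> beam.io.ReadFromText('/mnt/nfs_drive/wiki_files/*')  # simpler file IO
    | 'BreakFusion' >> Reshuffle()  # redistributes elements so fusion does not limit parallelism
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)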
It took a while, but I figured out what the issue is and found a workaround.
The underlying problem is in Beam's portable runner, specifically where the Beam job is translated into a Spark job.
The translation code (executed by the job server) splits stages into tasks based on calls to sparkContext().defaultParallelism(). The job server does not configure the default parallelism explicitly (and does not allow the user to set it through pipeline options), so in theory it falls back to configuring the parallelism based on the number of executors (see the explanation here: https://spark.apache.org/docs/latest/configuration.html#execution-behavior). This seems to be the intent of the translation code when it calls defaultParallelism().
In practice, however, it is well known that calling sparkContext().defaultParallelism() too soon when relying on this fallback can return a lower number than expected, since the executors might not have registered with the context yet. In particular, calling defaultParallelism() too soon returns 2, and stages are then split into only 2 tasks.
My "dirty hack" workaround thus consists of modifying the source code of the job server to simply add a 3-second delay after instantiating the SparkContext and before doing anything else:
$ git diff v2.25.0
diff --git a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
index aa12192..faaa4d3 100644
--- a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
+++ b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
@@ -95,7 +95,13 @@ public final class SparkContextFactory {
     conf.setAppName(contextOptions.getAppName());
     // register immutable collections serializers because the SDK uses them.
     conf.set("spark.kryo.registrator", SparkRunnerKryoRegistrator.class.getName());
-    return new JavaSparkContext(conf);
+    JavaSparkContext jsc = new JavaSparkContext(conf);
+    try {
+      Thread.sleep(3000);
+    } catch (InterruptedException e) {
+    }
+    return jsc;
   }
 }
}
After recompiling the job server and launching it with this change, all calls to defaultParallelism() are made after the executors have registered, and the stages are nicely split into 16 tasks (the same as the number of executors). As expected, the job now completes much faster since there are many more workers doing the work (it is, however, still 3 times slower than the pure Spark wordcount).
While this works, it is of course not a great solution. A much better solution would be one of the following:
change the translation engine so that it can deduce the number of tasks based on the number of available executors in a more robust way;
allow the user to configure, via pipeline options, the default parallelism to be used by the job server when translating jobs (this is what the Flink portable runner does).
Until a better solution is in place, this issue clearly prevents any use of the Beam Spark job server in a production cluster. I will post the issue to Beam's ticket queue so that a better solution can be implemented (hopefully soon).

Apache Beam: broadcast a spaCy model as a side input in Dataflow

I am using the Python SDK and trying to broadcast a spaCy model (~50MB). The job will run on Dataflow.
I am new to Beam and, based on my understanding, we cannot load large objects inside a map function, and we cannot load them before submitting the job since job sizes are capped. Below is a workaround to "lazy-load" large objects on the workers.
ner_model = (
    pipeline
    | "ner_model" >> beam.Create([None])
    | beam.Map(lambda x: spacy.load("en_core_web_md"))
)

(
    pipeline
    | bq_input_op
    | beam.Map(use_model_to_extract_person, beam.pvalue.AsSingleton(ner_model))
    | bq_output_op
)
but I got
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging.
However, there are no Stackdriver logs generated at all. Am I on the right track?
Edit:
I am using apache-beam 2.23.0
The issue might be that your worker does not have enough memory. You could probably solve it by using a worker with more memory. Currently the default worker is n1-standard-1 with only 3.75 GB of RAM.
The related pipeline option is:
workerMachineType (String): The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.
See here for more information.
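In the Beam Python SDK the corresponding flag is --machine_type (also accepted as --worker_machine_type); a minimal sketch of passing it, with illustrative project, region, and bucket names:
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative: request a larger worker machine type for the Dataflow job.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # hypothetical project id
    "--region=us-central1",                # hypothetical region
    "--temp_location=gs://my-bucket/tmp",  # hypothetical bucket
    "--machine_type=n1-standard-4",        # more RAM than the default n1-standard-1
])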
If you have a large, static model to load, you could try using a DoFn and loading it in DoFn.setup rather than passing it as a side input.
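A minimal sketch of that pattern, assuming en_core_web_md is installed on the workers (for example via a requirements file or a custom container); the DoFn name and element format are illustrative:
import apache_beam as beam
import spacy

class ExtractPersonsFn(beam.DoFn):
    """Loads the spaCy model once per DoFn instance, not once per element."""

    def setup(self):
        # Called once when the DoFn instance is initialised on a worker.
        self._nlp = spacy.load("en_core_web_md")

    def process(self, element):
        # Assumes each element is a dict with a "text" field (illustrative).
        doc = self._nlp(element["text"])
        yield [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

# Usage, replacing the side-input pattern from the question:
# pipeline | bq_input_op | beam.ParDo(ExtractPersonsFn()) | bq_output_op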

Dataflow BigQuery Insert Job fails instantly with big dataset

I designed a beam / dataflow pipeline using the beam python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops right after it has been started (within one second) without a specific error message when using the DataflowRunner (it works with the DirectRunner executed on my computer). When using a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the...
Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /Wr...
Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq CLI tool as suggested to get more information about the BQ load job does not work. The job cannot be found (and I doubt that it was created at all, given the instant failure).
I suppose I am running into some kind of quota / BQ restriction or even an out-of-memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/):
Limitations
BigQueryIO currently has the following limitations.
You can't sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam's Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause of this issue.
I'd also like to try out a Partition function, but did not find any Python source code examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than "A work item was attempted 4 times without success". For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see whether the workers are indeed running out of memory, or whether your error is happening elsewhere.
Good luck!
As far as I know, there is no available option to diagnose OOM in Cloud Dataflow with Apache Beam's Python SDK (it is possible with the Java SDK). I recommend you open a feature request in the Cloud Dataflow issue tracker to get more details about this kind of issue.
In addition to checking the Dataflow job log files, I recommend monitoring your pipeline with the Stackdriver Monitoring tool, which provides the resource usage per job (such as the total memory usage time).
Regarding the Partition function usage in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
def partition_fn(input_data, num_partitions):
    # get_percentile comes from the Beam documentation sample; it is assumed to return a value in [0, 100).
    return int(get_percentile(input_data) * num_partitions / 100)

partition = input_data | beam.Partition(partition_fn, 3)

for x in range(3):
    partition[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

Python Apache Beam Pipeline Status API Call

We currently have a Python Apache Beam pipeline working and able to be run locally. We are now in the process of having the pipeline run on Google Cloud Dataflow and be fully automated, but have found a limitation in Dataflow/Apache Beam's pipeline monitoring.
Currently, Cloud Dataflow has two ways of monitoring your pipelines' status: either through the UI or through gcloud on the command line. Neither of these works well for a fully automated solution where we need to account for lossless file processing.
Looking at Apache Beam's GitHub repository, the file internal/apiclient.py shows there is a function used to get the status of a job: get_job.
The one place we have found get_job used is in runners/dataflow_runner.py.
The end goal is to use this API to get the status of a job or several jobs that we automatically trigger to run to ensure they are all eventually processed successfully through the pipeline.
Can anyone explain to us how this API can be used after we run our pipeline (p.run())? We do not understand where runner in response = runner.dataflow_client.get_job(job_id) comes from.
If someone could provide a larger understanding of how we can access this API call while setting up / running our pipeline that would be great!
I ended up just fiddling around with the code and found out how to get the job details. Our next step is to see if there is a way to get a list of all of the jobs.
# start the pipeline process
pipeline = p.run()

# get the job_id for the current pipeline and store it somewhere
job_id = pipeline.job_id()

# set up a job_version variable (either batch or streaming)
job_version = dataflow_runner.DataflowPipelineRunner.BATCH_ENVIRONMENT_MAJOR_VERSION

# set up "runner", which is just a dictionary; I call it local
local = {}

# create a dataflow_client
local['dataflow_client'] = apiclient.DataflowApplicationClient(pipeline_options, job_version)

# get the job details from the dataflow_client
print(local['dataflow_client'].get_job(job_id))
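On newer SDK versions, a less internal way to watch a job is to poll the PipelineResult returned by p.run(); a minimal sketch (the polling interval is illustrative):
import time
from apache_beam.runners.runner import PipelineState

result = p.run()

# Poll the job status until it reaches a terminal state (DONE, FAILED, CANCELLED, ...).
while not PipelineState.is_terminal(result.state):
    time.sleep(30)  # illustrative polling interval

print("Job finished with state:", result.state)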
