Apache Beam: broadcast a spaCy model as a side input in Dataflow - Python

I am using the Python SDK and trying to broadcast a spaCy model (~50 MB). The job will run on Dataflow.
I am new to Beam. Based on my understanding, we cannot load large objects inside a map function, and we cannot load them before submitting the job because submitted job sizes are capped. The workaround below "lazy-loads" the large object on the workers.
ner_model = (
    pipeline
    | "ner_model" >> beam.Create([None])
    | beam.Map(lambda x: spacy.load("en_core_web_md"))
)

(
    pipeline
    | bq_input_op
    | beam.Map(use_model_to_extract_person, beam.pvalue.AsSingleton(ner_model))
    | bq_output_op
)
but I got
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging.
However, there are no Stackdriver logs generated at all. Am I on the right track?
Edit:
I am using apache-beam 2.23.0

The issue might be that your worker does not have enough memory. You could probably solve it by using a worker with more memory. Currently the default worker is n1-standard-1, with only 3.75 GB of RAM.
The related pipeline option is:
workerMachineType (String)
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.
See here for more information.
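For example, a larger machine type can be requested through the Python pipeline options when launching on Dataflow (a minimal sketch; the project, region, and bucket values are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                 # placeholder
    "--region=us-central1",                 # placeholder
    "--temp_location=gs://my-bucket/tmp",   # placeholder
    "--worker_machine_type=n1-standard-4",  # more RAM than the default n1-standard-1
])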

If you have a large, static model to load, you could try using a DoFn and loading the model in DoFn.setup rather than passing it as a side input.
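A rough sketch of that approach, assuming the same en_core_web_md model and a hypothetical "text" field on the input rows:

import apache_beam as beam
import spacy

class ExtractPersonFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance on the worker, so the ~50 MB model is
        # loaded lazily on the worker rather than shipped with the job.
        self._nlp = spacy.load("en_core_web_md")

    def process(self, element):
        doc = self._nlp(element["text"])  # hypothetical field name
        yield [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

# In the pipeline, beam.ParDo(ExtractPersonFn()) would replace the
# beam.Map(...) + AsSingleton side input from the question.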

Related

Apache beam pipeline reads from kafka but hangs when writing to bigquery

I am creating a pipeline that reads from Kafka using beam_nuggets.io and writes to BigQuery using Apache Beam's WriteToBigQuery.
I am currently running this locally using the DirectRunner to test some of the functionality and concepts. It reads from Kafka with no issue; however, when writing to BigQuery it logs the message "Refreshing access_token" and then nothing happens.
What is really odd is that if I remove the Kafka read and replace it with a simple beam.Create(...), it successfully refreshes the token and writes to BigQuery as expected.
Extract of the code can be seen below:
messages = (p
            | "KafkaConsumer" >> kafkaio.KafkaConsume({"topic": "test",
                                                       "bootstrap_servers": "localhost:9092",
                                                       "auto_offset_reset": "earliest"})
            # | "ManualCreate" >> beam.Create([{"name": "ManualTest", "desc": "a test"}])
            | 'Get message' >> beam.Map(lambda x: x[1])
            | 'Parse' >> beam.Map(parse_json)
            )

messages | "Write to BigQuery" >> beam.io.WriteToBigQuery(
    pipeline_options.table_spec.get(),
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    batch_size=1)

messages | 'Writing to stdout' >> beam.Map(print)
As an additional point, when running this locally I have the environment variable "GOOGLE_APPLICATION_CREDENTIALS" set to the location of my service account.
Any help in working out what might be causing this issue would be greatly appreciated.
Assuming a streaming pipeline, can you try setting an appropriate window or trigger according to the instructions here?
Even though you are not directly using GroupByKey operations in your pipeline, beam.io.WriteToBigQuery uses GroupByKey transforms in its implementation, so you have to do the above to make sure that the data in your pipeline gets propagated appropriately.
When you used the Create transform, the watermark would go from zero to infinity which allowed your pipeline to complete.
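For example, a fixed window could be inserted before the write (a minimal sketch; the 60-second window size is an arbitrary choice):

from apache_beam.transforms import window

windowed = (messages
            | 'Window' >> beam.WindowInto(window.FixedWindows(60)))

# then write `windowed` (instead of `messages`) with the same WriteToBigQuery transform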
Also, note that Kafka implementation in beam_nuggets.io is a very straightforward Kafka implementation that uses a Beam DoFn to read. Many Beam streaming runners require sources to implement other streaming features to operate correctly (for example, checkpointing). I suggest trying out kafka.py available in Beam repo which provides a more complete Kafka unbounded read implementation.

Debugging a slow PyTorch GPU Inference Pipeline on Beam/Google Cloud Dataflow

We are attempting to use Google Cloud Dataflow to build a simple GPU-based classification pipeline that looks like this: Pub/Sub request comes in with link to a file on GCS → Read data from GCS → Chop up and batch data → Run inference in PyTorch.
Background
We deploy our pipeline on Dataflow with a custom Docker image adapted from the pytorch-minimal sample.
We ingest Pub/Sub messages and download the audio files from GCS using pathy, then chop the audio into chunks for classification.
We've adapted Beam's relatively new RunInference function. Currently, there is no GPU support for RunInference on Dataflow
(see open issue https://issues.apache.org/jira/browse/BEAM-13986). When the Beam pipeline is built locally before being deployed to Dataflow, the model initialization step doesn't detect a CUDA environment and defaults to a CPU device for inference. That configuration then gets propagated to the Dataflow execution environment, which is properly GPU-enabled. So we force a GPU device when one is requested, without a CUDA device check. Other than that, the code is the same as the general RunInference code: a BatchElements operation followed by a ParDo that calls the model.
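For context, the device-selection change amounts to something like the following (a sketch of the idea only; force_gpu is a hypothetical flag, and the actual RunInference code differs):

import torch

def select_device(force_gpu: bool) -> torch.device:
    # Skip the CUDA availability probe when the GPU is explicitly requested,
    # since pipeline construction happens on a CPU-only machine.
    if force_gpu:
        return torch.device("cuda")
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")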
Problem
Everything works, more or less, but GPU inference is very slow – much slower than what we measure when the same GPU instance processes batches directly on Google Cloud Compute Engine.
We're looking for advice on how to debug and speed up the pipeline. We suspect the issue has to do with threading as well as how Beam/Dataflow manages load across the pipeline stages.

We kept running into CUDA OOM issues with multiple threads trying to access the GPU in the ParDo function, so we launch our jobs with --num_workers=1 --experiment="use_runner_v2" --experiment="no_use_multiple_sdk_containers" to avoid multi-processing altogether. We saw that this 2021 Beam Summit talk on using Dataflow for local ML batch inference recommended going even further and using a single worker thread (--number_of_worker_harness_threads=1). However, we ideally don't want to do that: it's common practice in ML pipelines like these to have multiple threads doing the I/O work of downloading data from the bucket and preparing batches so that the GPU never sits idle.

Unfortunately, there seems to be no way to tell Beam to use a certain maximum number of threads per stage (?), so the best solution we could come up with is to protect the GPU with a Semaphore, like so:
class _RunInferenceDoFn(beam.DoFn, Generic[ExampleT, PredictionT]):
    ...

    def _get_semaphore(self):
        def get_semaphore():
            logging.info('initializing semaphore...')
            return Semaphore(1)
        return self._shared_semaphore.acquire(get_semaphore)

    def setup(self):
        ...
        self._model = self._load_model()
        self._semaphore = self._get_semaphore()

    def process(self, batch, inference_args):
        ...
        logging.info('trying to acquire semaphore...')
        self._semaphore.acquire()
        logging.info('semaphore acquired')
        start_time = _to_microseconds(self._clock.time_ns())
        result_generator = self._model_handler.run_inference(
            batch, self._model, inference_args)
        end_time = _to_microseconds(self._clock.time_ns())
        self._semaphore.release()
        ...
We make three odd observations in that setup:
Beam always uses the minimum possible batch size we allow; if we specify a batch size of min 8 max 32, it'll always choose a batch size of at most 8, sometimes lower.
The inference timed here is still much, much slower when we allow multiple threads (--number_of_worker_harness_threads=10) than when we single-thread (--number_of_worker_harness_threads=1): 2.7s per batch vs. 0.4s per batch, both of which are a bit slower than running on Compute Engine directly.
In the multi-threaded setup, we keep seeing occasional CUDA OOM errors despite using a conservative batch size.
Would appreciate any and all debugging guidance for how to make this work! Right now, the whole pipeline is so slow that we have resorted to just running things in batches on Compute Engine again :/ – but there must be a way to make this work on Dataflow, right?
For reference:
Single-threaded job:
catalin-debug-classifier-test-1660143139 (Job ID: 2022-08-10_07_53_06-5898402459767488826)
Multi-threaded job:
catalin-debug-classifier-10threads-32batch-1660156741 (Job ID: 2022-08-10_11_39_50-2452382118954657386)
Thanks for trying RunInference!
I believe the problems you were encountering have been documented in the following issues. Can you please confirm whether this is the case or, if it's different, explain the errors you were getting? We intend to work on these soon.
Map state_dict to the correct device during loading in PytorchModelHandler
Warn user about automatic GPU to CPU conversion.
Beam always uses the minimum possible batch size we allow; if we specify a batch size of min 8 max 32, it'll always choose a batch size of at most 8, sometimes lower.
The way that BatchElements decides on the size is by "profiling the time taken by (fused) downstream operations". Please see here and here for more information. It could be due to the specific size/nature of your data that creates a certain type of pattern of historical timings that causes this. Curious: are the data in your pipeline very similar?
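For reference, the bounds themselves are the min_batch_size/max_batch_size arguments of BatchElements (a minimal sketch; RunInference wires this up internally, and `examples` here is just a placeholder PCollection):

import apache_beam as beam

batched = (examples  # placeholder: an existing PCollection of elements
           | 'Batch' >> beam.BatchElements(min_batch_size=8, max_batch_size=32))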
The inference timed here is still much much slower when allow multiple threads (--number_of_worker_harness_threads=1) than when we single-thread (--number_of_worker_harness_threads=10). 2.7s per batch vs. 0.4s per batch, both of which are a bit slower than running on compute engine directly.
Just to clarify (probably a typo): did you mean "single threads (--number_of_worker_harness_threads=1) than when we multi-thread (--number_of_worker_harness_threads=10)" take 2.7s per batch vs. 0.4s per batch, respectively?
Some other questions:
How big is the model that you are using?
What types of GPUs have you tried this on?
We are currently looking into a similar issue with respect to GPUs and multithreading in the PR of our TensorRT RunInference. Here's a thread on that discussion. Something we are actively looking into to manage threads is to use start_bundle and finish_bundle. Please stay tuned on an update on this.
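For illustration, these are the per-bundle hooks on a DoFn where such thread/resource management could live (a sketch only, not the actual RunInference implementation; load_model is a hypothetical loader):

class ManagedInferenceDoFn(beam.DoFn):
    def setup(self):
        # called once per DoFn instance
        self._model = load_model()  # hypothetical loader

    def start_bundle(self):
        # called before each bundle of elements; a place to acquire GPU-related resources
        pass

    def process(self, batch):
        yield self._model(batch)

    def finish_bundle(self):
        # called after each bundle; a place to release resources or flush buffers
        pass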
And let me dig into the logs of the job to see what is going on, and get back to you as soon as I can.

Low parallelism when running Apache Beam wordcount pipeline on Spark with Python SDK

I am quite experienced with Spark cluster configuration and running PySpark pipelines, but I'm just starting with Beam. So, I am trying to do an apples-to-apples comparison between PySpark and the Beam Python SDK on a Spark PortableRunner (running on top of the same small Spark cluster: 4 workers, each with 4 cores and 8 GB RAM), and I've settled on a wordcount job for a reasonably large dataset, storing the results in a Parquet table.
I have thus downloaded 50 GB of Wikipedia text files, split across about 100 uncompressed files, and stored them in the directory /mnt/nfs_drive/wiki_files/ (/mnt/nfs_drive is an NFS drive mounted on all workers).
First, I am running the following Pyspark wordcount script:
from pyspark.sql import SparkSession, Row
from operator import add

wiki_files = '/mnt/nfs_drive/wiki_files/*'
spark = SparkSession.builder.appName("WordCountSpark").getOrCreate()

spark_counts = spark.read.text(wiki_files).rdd.map(lambda r: r['value']) \
    .flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .map(lambda x: Row(word=x[0], count=x[1]))

spark.createDataFrame(spark_counts).write.parquet(path='/mnt/nfs_drive/spark_output', mode='overwrite')
The script runs perfectly well and outputs the Parquet files at the desired location in about 8 minutes. The main stage (reading and splitting tokens) is divided into a reasonable number of tasks, so the cluster is used efficiently:
I am now trying to achieve the same with Beam and the portable runner. First, I have started the Spark job server (on the Spark master node) with the following command:
docker run --rm --net=host -e SPARK_EXECUTOR_MEMORY=8g apache/beam_spark_job_server:2.25.0 --spark-master-url=spark://localhost:7077
Then, on the master and worker nodes, I am running the SDK Harness as follows:
docker run --net=host -d --rm -v /mnt/nfs_drive:/mnt/nfs_drive apache/beam_python3.6_sdk:2.25.0 --worker_pool
Now that the Spark cluster is set up to run Beam pipelines, I can submit the following script:
import apache_beam as beam
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import fileio
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
    "--job_name=WordCountBeam"
])

wiki_files = '/mnt/nfs_drive/wiki_files/*'
p = beam.Pipeline(options=options)

beam_counts = (
    p
    | fileio.MatchFiles(wiki_files)
    | beam.Map(lambda x: x.path)
    | beam.io.ReadAllFromText()
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)

_ = beam_counts | 'Write' >> beam.io.WriteToParquet(
    '/mnt/nfs_drive/beam_output',
    pyarrow.schema(
        [('word', pyarrow.binary()), ('count', pyarrow.int64())]
    )
)
result = p.run().wait_until_finish()
The code is submitted successfully, I can see the job on the Spark UI and the workers are executing it. However, even if left running for more than 1 hour, it does not produce any output!
I thus wanted to make sure that there is not a problem with my setup, so I've run the exact same script on a smaller dataset (just 1 Wiki file). This completes successfully in about 3.5 minutes (Spark wordcount on the same dataset takes 16s!).
I wondered how Beam could be that much slower, so I started looking at the DAG submitted to Spark by the Beam pipeline via the job server. I noticed that the Spark job spends most of its time in the following stage:
This stage is split into just 2 tasks, as shown here:
Printing debugging lines shows that this task is where the "heavy lifting" (i.e. reading lines from the wiki files and splitting tokens) is performed; however, since this happens in only 2 tasks, the work will be distributed over at most 2 workers. What's also interesting is that running on the large 50 GB dataset results in exactly the same DAG with exactly the same number of tasks.
I am quite unsure how to proceed further. It seems like the Beam pipeline has reduced parallelism, but I'm not sure if this is due to sub-optimal translation of the pipeline by the job server, or whether I should specify my PTransforms in some other way to increase the parallelism on Spark.
Any suggestion appreciated!
The file IO part of the pipeline can be simplified by using apache_beam.io.textio.ReadFromText(file_pattern='/mnt/nfs_drive/wiki_files/*').
Fusion is another reason that could prevent parallelism. The solution is to throw in an apache_beam.transforms.util.Reshuffle after reading in all the files, as in the sketch below.
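A minimal sketch of both suggestions applied to the pipeline from the question:

import apache_beam as beam

beam_counts = (
    p
    | 'Read' >> beam.io.ReadFromText('/mnt/nfs_drive/wiki_files/*')
    | 'Reshuffle' >> beam.Reshuffle()  # breaks fusion so downstream steps can be parallelized
    | 'ExtractWords' >> beam.FlatMap(lambda x: x.split(' '))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda x: {'word': x[0], 'count': x[1]})
)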
It took a while, but I figured out what the issue is and found a workaround.
The underlying problem is in Beam's portable runner, specifically where the Beam job is translated into a Spark job.
The translation code (executed by the job server) splits stages into tasks based on calls to sparkContext().defaultParallelism(). The job server does not configure default parallelism explicitly (and does not allow the user to set it through pipeline options), so in theory it falls back to configuring the parallelism based on the number of executors (see the explanation here: https://spark.apache.org/docs/latest/configuration.html#execution-behavior). This seems to be the goal of the translation code when calling defaultParallelism().
In practice, however, it is well known that, when relying on this fallback mechanism, calling sparkContext().defaultParallelism() too soon can return lower numbers than expected, since executors might not have registered with the context yet. In particular, calling defaultParallelism() too soon will return 2, and stages will be split into only 2 tasks.
My "dirty hack" workaround thus consists in modifying the source code of the job server by simply adding a delay of 3 seconds after instantiating SparkContext and before doing anything else:
$ git diff v2.25.0
diff --git a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
index aa12192..faaa4d3 100644
--- a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
+++ b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkContextFactory.java
@@ -95,7 +95,13 @@ public final class SparkContextFactory {
     conf.setAppName(contextOptions.getAppName());
     // register immutable collections serializers because the SDK uses them.
     conf.set("spark.kryo.registrator", SparkRunnerKryoRegistrator.class.getName());
-    return new JavaSparkContext(conf);
+    JavaSparkContext jsc = new JavaSparkContext(conf);
+    try {
+      Thread.sleep(3000);
+    } catch (InterruptedException e) {
+    }
+    return jsc;
   }
 }
After recompiling the job server and launching it with this change, all the calls to defaultParallelism() are made after the executors are registered, and the stages are nicely split into 16 tasks (the same as the number of executors). As expected, the job now completes much faster since there are many more workers doing the work (it is, however, still 3 times slower than the pure Spark wordcount).
While this works, it is of course not a great solution. A much better solution would be one of the following:
change the translation engine so that it can deduce the number of tasks based on the number of available executors in a more robust way;
allow the user to configure, via pipeline options, the default parallelism to be used by the job server for translating jobs (this is what's done by the Flink portable runner).
Until a better solution is in place, this issue effectively prevents using the Beam Spark job server in a production cluster. I will post the issue to Beam's ticket queue so that a better solution can be implemented (hopefully soon).

Dataflow BigQuery Insert Job fails instantly with big dataset

I designed a beam / dataflow pipeline using the beam python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops right after it has been started (within one second) without a specific error message when using the DataflowRunner (it works with the DirectRunner executed on my computer). With a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the...
Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /Wr...
Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq cli tool as suggested to get more information about the BQ load job does not work. The job cannot be found (and I doubt that it has been created at all due to instant failure).
I suppose I run into some kind of quota / bq restriction or even an out of memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/)
Limitations
BigQueryIO currently has the following limitations.
You can’t sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam’s Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause for this issue.
I'd also like to try out a Partition function but did not find any Python source code examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than A work item was attempted 4 times without success. For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see if they are indeed running out of memory, or if your error is happening elsewhere.
Good luck!
As far as I know, there is no available option to diagnose OOM in Cloud Dataflow with Apache Beam's Python SDK (it is possible with the Java SDK). I recommend opening a feature request in the Cloud Dataflow issue tracker to get more details for this kind of issue.
In addition to checking the Dataflow job log files, I recommend monitoring your pipeline with the Stackdriver Monitoring tool, which provides resource usage per job (such as total memory usage time).
Regarding the Partition function usage in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
def partition_fn(element, num_partitions):
    # Assign each element to a partition bucket based on its percentile
    # (get_percentile is assumed to come from the question's own code)
    return int(get_percentile(element) * num_partitions / 100)

partition = input_data | beam.Partition(partition_fn, 3)

for x in range(3):
    partition[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

Python Apache Beam Pipeline Status API Call

We currently have a Python Apache Beam pipeline working and able to run locally. We are now in the process of having the pipeline run on Google Cloud Dataflow and be fully automated, but have found a limitation in Dataflow/Apache Beam's pipeline monitoring.
Currently, Cloud Dataflow has two ways of monitoring your pipeline(s) status: through their UI or through gcloud on the command line. Neither of these solutions works well for a fully automated setup where we need to account for loss-less file processing.
Looking at Apache Beam's GitHub repository, the file internal/apiclient.py contains a function used to get the status of a job: get_job.
The one instance that we have found get_job used is in runners/dataflow_runner.py.
The end goal is to use this API to get the status of a job or several jobs that we automatically trigger to run to ensure they are all eventually processed successfully through the pipeline.
Can anyone explain to us how this API can be used after we run our pipeline (p.run())? We do not understand where runner in response = runner.dataflow_client.get_job(job_id) comes from.
If someone could provide a larger understanding of how we can access this API call while setting up / running our pipeline that would be great!
I ended up just fiddling around with the code and found how to get the job details. Our next step is to see if there is a way to get a list of all of the jobs.
# start the pipeline process
pipeline = p.run()

# get the job_id for the current pipeline and store it somewhere
job_id = pipeline.job_id()

# set up a job_version variable (either batch or streaming)
job_version = dataflow_runner.DataflowPipelineRunner.BATCH_ENVIRONMENT_MAJOR_VERSION

# set up "runner", which is just a dictionary; I call it local
local = {}

# create a dataflow_client
local['dataflow_client'] = apiclient.DataflowApplicationClient(pipeline_options, job_version)

# get the job details from the dataflow_client
print(local['dataflow_client'].get_job(job_id))
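For comparison, the result returned by p.run() also exposes the job state directly, which may be enough for simple status polling (a minimal sketch; whether this suffices depends on how the jobs are triggered):

from apache_beam.runners.runner import PipelineState

result = p.run()
print(result.state)                  # e.g. RUNNING
result.wait_until_finish()           # blocks until the job reaches a terminal state
print(result.state == PipelineState.DONE)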
