I'm having a few problems running a relatively vanilla Dataflow job from an AI Platform Notebook (the job is meant to take data from BigQuery > cleanse and prep > write to a CSV in GCS):
options = {'staging_location': '/staging/location/',
           'temp_location': '/temp/location/',
           'job_name': 'dataflow_pipeline_job',
           'project': PROJECT,
           'teardown_policy': 'TEARDOWN_ALWAYS',
           'max_num_workers': 3,
           'region': REGION,
           'subnetwork': 'regions/<REGION>/subnetworks/<SUBNETWORK>',
           'no_save_main_session': True}

opts = beam.pipeline.PipelineOptions(flags=[], **options)
p = beam.Pipeline('DataflowRunner', options=opts)

(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.Write(beam.io.WriteToText('OUTPUT_DIR/out.csv')))

p.run()
Error returned from Stackdriver:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Followed by this warning:
S01:eval_out/WriteToText/Write/WriteImpl/DoOnce/Read+out/WriteToText/Write/WriteImpl/InitializeWrite failed.
Unfortunately there's not much else beyond that. Other things to note:
The job ran locally without any error
The network is running in custom mode but is the default network
Python Version == 3.5.6
Python Apache Beam version == 2.16.0
The AI Platform Notebook is in fact a GCE instance with a Deep Learning VM image deployed on top (with a container-optimised OS); we then used port forwarding to access the Jupyter environment
The service account requesting the job (Compute Engine default service account) has the necessary permissions required to complete this
Notebook instance, dataflow job, GCS bucket are all in europe-west1
I've also tried running this on a standard AI Platform Notebook and still hit the same problem.
Any help would be much appreciated! Please let me know if there is any other info I can provide which will help.
I've realised that my error is the same as the following:
Why do Dataflow steps not start?
The reason my job has gotten stuck is that the write-to-GCS step runs first even though it is meant to run last. Any ideas on how to fix this?
Upon code inspection, I noticed that the syntax of the WriteToText transform used does not match the one suggested in the Apache Beam docs.
Please follow the WriteToText argument syntax as outlined here.
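For reference, a minimal sketch of what the write step could look like with WriteToText applied directly as a transform (reusing selquery and to_csv from the question; the gs:// prefix is a placeholder for your output bucket):

(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.WriteToText('gs://YOUR_BUCKET/output/out', file_name_suffix='.csv'))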
The suggested workaround is to consider using the BigQuery-to-CSV file export option, which is available in batch mode.
There are even more export options available; the full list can be found in the "data formats and compression types" documentation here.
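For illustration, a minimal sketch of that workaround using the BigQuery client library rather than Dataflow (the table name, bucket path and location are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT)
extract_job = client.extract_table(
    'PROJECT.DATASET.TABLE',               # source table (CSV is the default export format)
    'gs://YOUR_BUCKET/exports/out-*.csv',  # destination files in GCS
    location='europe-west1',               # should match the dataset's location
)
extract_job.result()  # block until the export job completes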
Related
Currently I automatically start a VM after running a Cloud Function via this code:
import googleapiclient.discovery

def start_vm(event, context):
    # Build the Compute Engine API client and start the named instance
    compute = googleapiclient.discovery.build('compute', 'v1')
    result = compute.instances().start(project='PROJECT', zone='ZONE', instance='NAME').execute()
Now I am looking for a way to deliver a message or a parameter at the same time. After the VM starts, a different piece of code should run based on the added message/parameter. Does anyone know how to achieve this?
Any help is appreciated.
Thank you.
You can use guest attributes. The Cloud Function adds the guest attribute and then starts the VM.
In the startup script, you read the data from the guest attributes and then use it to decide what to do.
The other solution is to start a webserver in the VM and then POST a request to this webserver.
This solution is better if you have several tasks to perform on the VM. But take care of security if you expose a webserver: expose it only internally and use a VPC connector on your Cloud Function to reach your VM.
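To illustrate the first approach, here is a minimal sketch that uses custom instance metadata set by the Cloud Function (a close relative of guest attributes, which the startup script can read back from the metadata server); the start_vm_with_param name and the task key are illustrative, not an established API of your project:

import googleapiclient.discovery

def start_vm_with_param(event, context):
    # Hypothetical Cloud Function: attach a parameter as custom metadata, then start the VM.
    compute = googleapiclient.discovery.build('compute', 'v1')
    instance = compute.instances().get(
        project='PROJECT', zone='ZONE', instance='NAME').execute()

    # Replace/insert the 'task' key while keeping the existing metadata (and its fingerprint).
    metadata = instance['metadata']
    items = [i for i in metadata.get('items', []) if i['key'] != 'task']
    items.append({'key': 'task', 'value': 'run-report'})  # the message/parameter
    metadata['items'] = items

    compute.instances().setMetadata(
        project='PROJECT', zone='ZONE', instance='NAME', body=metadata).execute()
    compute.instances().start(
        project='PROJECT', zone='ZONE', instance='NAME').execute()

On the VM, the startup script can then read the value from the metadata server at http://metadata.google.internal/computeMetadata/v1/instance/attributes/task (with the Metadata-Flavor: Google header) and branch on it.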
I am using the Python SDK and trying to broadcast a spaCy model (~50 MB). The job will run on Dataflow.
I am new to Beam, and based on my understanding we cannot load large objects inside the map function, and we cannot load them before submitting the job since job sizes are capped. Below is a workaround to "lazy-load" large objects on the workers.
ner_model = (
    pipeline
    | "ner_model" >> beam.Create([None])
    | beam.Map(lambda x: spacy.load("en_core_web_md"))
)

(
    pipeline
    | bq_input_op
    | beam.Map(use_model_to_extract_person, beam.pvalue.AsSingleton(ner_model))
    | bq_output_op
)
but I got
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging.
However, there are no Stackdriver logs generated at all. Am I on the right track?
Edit:
I am using apache-beam 2.23.0
The issue might be that your worker does not have enough memory. You could probably solve it by using a worker with more memory; currently the default worker is n1-standard-1 with only 3.75 GB of RAM.
The related PipelineOption is:
workerMachineType (String)
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.
See here for more information.
If you have a large, static model to load, you could try using a DoFn and loading the model in DoFn.setup rather than passing it in as a side input.
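To illustrate, here is a small sketch of that pattern, assuming each input element has a "text" field and that PERSON entities are what you want to extract (the class and field names are illustrative, not the asker's actual code):

import apache_beam as beam
import spacy

class ExtractPersons(beam.DoFn):
    """Illustrative DoFn that loads the spaCy model once per worker in setup()."""

    def setup(self):
        # Runs once per DoFn instance on the worker, before any bundles are processed,
        # so the ~50 MB model never has to be shipped with the job submission.
        self._nlp = spacy.load("en_core_web_md")

    def process(self, element):
        doc = self._nlp(element["text"])  # assumes each element carries a "text" field
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                yield {"person": ent.text}

# Hypothetical usage replacing the side-input pattern above:
# pipeline | bq_input_op | beam.ParDo(ExtractPersons()) | bq_output_op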
I am trying to train a model using Google Cloud Platform (GCP).
I chose the standard-1 scale tier (using the basic tier gives memory exceptions, which I think is due to the size (2.6 GB) of the data), but my job fails after a log line of "Finished tearing down training program" even though it is still downloading the data into the VM from Cloud Storage.
It doesn't provide any traceback as to what the error might be.
I have my data stored in Cloud Storage, and to make it available I use os.system('gsutil -m cp -r location_of_data_in_cloud_storage ' + os.getcwd()) to copy the data onto the assigned VM so that it can be accessed directly by the program. This data is then fed into the model.fit_generator() method through a generator.
As can be seen, the 2.6 GB of data hasn't been downloaded completely, but the job fails before that!
For anyone else who stumbles upon this question in the future (possibly me ;) ): the problem above was occurring because the machine couldn't handle the compute, so I had to scale up from the basic scale tier to the standard_p100 machine in GCP, which solved the problem!
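For reference, a hedged sketch of how such a machine can be requested programmatically through the AI Platform Training API, interpreting standard_p100 as the master machine type under the CUSTOM scale tier (the job ID, bucket path, module name and region are placeholders; the same settings can also be passed to gcloud ai-platform jobs submit training):

import googleapiclient.discovery

ml = googleapiclient.discovery.build('ml', 'v1')
job_spec = {
    'jobId': 'my_training_job',
    'trainingInput': {
        'scaleTier': 'CUSTOM',
        'masterType': 'standard_p100',  # machine with a P100 GPU, as used above
        'packageUris': ['gs://YOUR_BUCKET/packages/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',
        'region': 'europe-west1',
    },
}
ml.projects().jobs().create(parent='projects/PROJECT', body=job_spec).execute()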
I designed a beam / dataflow pipeline using the beam python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops right after it has been started (within one second) without a specific error message when using the DataflowRunner (it works with the DirectRunner executed on my computer). When using a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the...
Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /Wr...
Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq CLI tool as suggested to get more information about the BQ load job does not work: the job cannot be found (and I doubt that it was created at all, given the instant failure).
I suppose I am running into some kind of quota/BigQuery restriction or even an out-of-memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/):
Limitations
BigQueryIO currently has the following limitations.
You can't sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam's Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause of this issue.
I'd also like to try out a Partition function, but I did not find any Python source code examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google Cloud console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than "A work item was attempted 4 times without success". For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see if they are indeed running out of memory, or if your error is happening elsewhere.
Good luck!
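If you'd rather pull the same worker logs programmatically, here is a small sketch using the Cloud Logging client library (the project, job ID and filter are assumptions you would adapt):

from google.cloud import logging

client = logging.Client(project='PROJECT')
log_filter = (
    'resource.type="dataflow_step" '
    'AND resource.labels.job_id="JOB_ID" '
    'AND severity>=WARNING'
)
for entry in client.list_entries(filter_=log_filter):
    # Print the timestamp, severity and raw payload of each matching worker log entry
    print(entry.timestamp, entry.severity, entry.payload)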
As far as I know, there is no option available to diagnose OOM in Cloud Dataflow with Apache Beam's Python SDK (it is possible with the Java SDK). I recommend you open a feature request in the Cloud Dataflow issue tracker to get more details on this kind of issue.
In addition to checking the Dataflow job log files, I recommend monitoring your pipeline by using the Stackdriver Monitoring tool, which provides the resource usage per job (such as total memory usage time).
Regarding the use of the Partition function in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
def partition_fn(element, num_partitions):
    # get_percentile is assumed to return a value in [0, 100) for the given element
    return int(get_percentile(element) * num_partitions / 100)

partitions = input_data | beam.Partition(partition_fn, 3)

for x in range(3):
    partitions[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
We currently have a Python Apache Beam pipeline working and able to be run locally. We are now in the process of having the pipeline run on Google Cloud Dataflow and be fully automated, but have found a limitation in Dataflow/Apache Beam's pipeline monitoring.
Currently, Cloud Dataflow has two ways of monitoring your pipelines' status: either through the UI or through gcloud on the command line. Neither of these works well for a fully automated solution where we need to account for lossless file processing.
Looking at Apache Beam's GitHub repository, there is a file, internal/apiclient.py, which contains a function used to get the status of a job: get_job.
The one place we have found get_job used is in runners/dataflow_runner.py.
The end goal is to use this API to get the status of a job or several jobs that we automatically trigger to run to ensure they are all eventually processed successfully through the pipeline.
Can anyone explain to us how this API can be used after we run our pipeline (p.run())? We do not understand where runner in response = runner.dataflow_client.get_job(job_id) comes from.
If someone could provide a larger understanding of how we can access this API call while setting up / running our pipeline that would be great!
I ended up just fiddling around with the code and found how to get the job details. Our next step is to see if there is a way to get a list of all of the jobs.
# Note: apiclient and dataflow_runner come from the SDK's internal modules
# (internal/apiclient.py and runners/dataflow_runner.py mentioned above);
# the exact import paths vary between SDK versions.

# start the pipeline process
pipeline = p.run()
# get the job_id for the current pipeline and store it somewhere
job_id = pipeline.job_id()
# set up a job_version variable (either batch or streaming)
job_version = dataflow_runner.DataflowPipelineRunner.BATCH_ENVIRONMENT_MAJOR_VERSION
# set up "runner", which is just a dictionary; I call it local
local = {}
# create a dataflow_client
local['dataflow_client'] = apiclient.DataflowApplicationClient(pipeline_options, job_version)
# get the job details from the dataflow_client
print(local['dataflow_client'].get_job(job_id))
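As for listing all jobs (the next step mentioned above), a sketch that avoids the SDK's internal modules and instead uses the public Dataflow REST API through the Google API client could look like this (PROJECT and REGION are placeholders):

import googleapiclient.discovery

dataflow = googleapiclient.discovery.build('dataflow', 'v1b3')
response = dataflow.projects().locations().jobs().list(
    projectId='PROJECT', location='REGION').execute()
for job in response.get('jobs', []):
    # Each entry carries the job's id, name and current state (e.g. JOB_STATE_DONE)
    print(job['id'], job['name'], job['currentState'])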