Apache Beam Dataflow runner throwing setup error - Python

We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:
A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.
However, we could not find detailed worker-startup logs.
We tried increasing the memory size, worker count, etc., but we still get the same error.
Here is the command we use:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2
Pipeline snippet:
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)
The above pipeline just loads data from BigQuery and filters it based on a column value. It works like a charm with DirectRunner but fails on Dataflow.
Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving this issue.
Update:
Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.

We resolved this setup error by passing a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job starts but eventually fails, because it cannot find the package on the workers. The Beam Python SDK sometimes does not throw an explicit error message in this case; instead, it retries the job and then fails. To get your code running as a package, you need to pass the --setup_file argument, pointing to a setup.py that lists your dependencies. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
If you have a privately hosted Python package dependency, pass --extra_package with the path to the package.tar.gz file. An even better way is to store it in a GCS bucket and pass that path here.
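For reference, a minimal sketch combining both options; the bucket paths and the package file name are placeholders, not values from the original job:
python run.py \
  --project=xyz \
  --runner=DataflowRunner \
  --staging_location=gs://xyz/staging \
  --temp_location=gs://xyz/temp \
  --setup_file=./setup.py \
  --extra_package=gs://xyz/packages/my_private_pkg-1.0.0.tar.gz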
I have written an example project to get started with Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example
Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366

I'm building a prediction pipeline using Apache Beam/Dataflow. I need to include the model files inside the dependencies available to the remote workers. The Dataflow job failed with the same error log:
Error message from worker: A setup error was detected in beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx. Please refer to the worker-startup log for detailed information.
However, this error message didn't give any details about the worker-startup log. Eventually I found a way to get the worker log and solve the problem.
Dataflow creates Compute Engine VMs to run jobs and saves logs on them, so we can access a VM to see those logs. We can connect to the VM used by our Dataflow job from the GCP console via SSH and then check the boot-json.log file located in /var/log/dataflow/taskrunner/harness:
$ cd /var/log/dataflow/taskrunner/harness
$ cat boot-json.log
One caveat: when running in batch mode, the VM created by Dataflow is ephemeral and is shut down when the job fails. Once the VM is gone, we can't access it anymore. However, a failing work item is retried 4 times, so normally there is enough time to open boot-json.log and see what is going on.
At last, I put my Python setup solution here that may help someone else:
main.py
import os
...
# Resolve the model file shipped with the package, relative to this module.
model_path = os.path.dirname(os.path.abspath(__file__)) + '/models/net.pd'
# pipeline code
...
MANIFEST.in
include models/*.*
setup.py complete example
import setuptools

REQUIRED_PACKAGES = [...]

setuptools.setup(
    ...
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    package_data={"models": ["models/*"]},
    ...
)
Run Dataflow pipelines
$ python main.py --setup_file=/absolute/path/to/setup.py ...

Related

How to use Python to connect to an Oracle database by ApacheBeam?

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

with beam.Pipeline() as p:
    result = (p
              | 'Read from jdbc' >> ReadFromJdbc(
                  fetch_size=None,
                  table_name='table_name',
                  driver_class_name='oracle.jdbc.driver.OracleDriver',
                  jdbc_url='jdbc:oracle:thin:@localhost:1521:orcl',
                  username='xxx',
                  password='xxx',
                  query='select * from table_name'
              )
              | beam.Map(print)
              )
When I run the above code, the following error occurs:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'C:\\Users\\YFater/.apache_beam/cache/jars\\beam-sdks-java-extensions-schemaio-expansion-service-2.29.0.jar', '51933']
ERROR:apache_beam.utils.subprocess_server:Error bringing up service
Apache Beam needs to use a Java expansion service in order to use JDBC from Python.
You get this error because Beam cannot launch the expansion service.
To fix it, install the Java runtime on the computer where you run Apache Beam, and make sure java is on your PATH.
If the problem persists after installing Java (or if you already have it installed), the JAR files Beam downloaded are probably corrupt (maybe the download was interrupted or the file was truncated because the disk was full). In that case, just remove the contents of the $HOME/.apache_beam/cache/jars directory and re-run the Beam pipeline.
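On Linux or macOS, clearing the cache looks like this (on Windows, delete the contents of the corresponding folder under your user profile instead):
rm -rf $HOME/.apache_beam/cache/jars/*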
Add the classpath parameter to ReadFromJdbc.
Example:
classpath=['~/.apache_beam/cache/jars/ojdbc8.jar'],
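Putting the two answers together, a minimal sketch of the read with an explicit driver jar; the jar path, connection string, and credentials are placeholders, and the classpath argument assumes a Beam version that supports it, as suggested above:
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

with beam.Pipeline() as p:
    rows = (p
            | 'Read from jdbc' >> ReadFromJdbc(
                table_name='table_name',
                driver_class_name='oracle.jdbc.driver.OracleDriver',
                jdbc_url='jdbc:oracle:thin:@localhost:1521:orcl',
                username='xxx',
                password='xxx',
                query='select * from table_name',
                # Path to the Oracle JDBC driver jar; point this at wherever the jar actually lives.
                classpath=['~/.apache_beam/cache/jars/ojdbc8.jar'],
            )
            | beam.Map(print))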

Unable to find module after trying to install custom wheel using POST_BUILD_COMMAND in Zip Deploy task for a Python Function App on Azure

Intro
My scenario is that I want to re-use shared code from a repo in Azure DevOps across multiple projects. I've built a pipeline that produces a wheel as an artifact so I can download it to other pipelines.
The situation
Currently I have successfully set up a pipeline that deploys the Python Function App. The app is running fine and stable. I use SCM_DO_BUILD_DURING_DEPLOYMENT=1 and ENABLE_ORYX_BUILD=1 to achieve this.
I am now in the position that I want to use the artifact (Python/pip wheel) as mentioned in the intro.
I've added a step in the pipeline and I am able to download the artifact successfully. The next step is ensuring that the artifact is installed during my Python Function App Zip Deployment. And that is where I am stuck at.
The structure of my zip looks like:
__app__
| - MyFirstFunction
| | - __init__.py
| | - function.json
| | - example.py
| - MySecondFunction
| | - __init__.py
| | - function.json
| - wheels
| | - my_wheel-20201014.10-py3-none-any.whl <<< this is my wheel
| - host.json
| - requirements.txt
The problem
I've tried adding commands like POST_BUILD_COMMAND and PRE_BUILD_COMMAND to get pip to install the wheel, but it seems the package is not found (by Oryx/Kudu) when I use the command:
-POST_BUILD_COMMAND "python -m pip install --find-links=home/site/wwwroot/wheels my_wheel"
Azure DevOps does not throw any exception or error message. But when I execute the function, I get an exception saying:
Failure Exception: ModuleNotFoundError: No module named 'my_wheel'.
My question is: how can I change my solution so that the build installs my_wheel correctly?
Sidenote: Unfortunately I am not able to use the Artifacts feed from Azure DevOps to publish my_wheel and let pip consume that feed.
Here is how my custom wheel works in VS Code locally:
Navigate to your DevOps project, edit the pipeline YAML file, and add a script step that installs the wheel file:
pip install my_wheel-20201014.10-py3-none-any.whl
Then enable App Service logs and open the Log Stream to check whether it works on Azure.
I solved my issue by checking out the repository of my shared code and including the shared code in the function app package.
I also replaced the AzureFunctionApp@1 task with the AzureCLI@2 task and now deploy the function app with an az functionapp deployment source config-zip command. I set the application settings via a separate AzureAppServiceSettings@1 step in the pipeline.
AzureCLI@2:
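A rough sketch of the deployment command run inside that AzureCLI@2 step; the resource group, app name, and zip path are placeholders:
az functionapp deployment source config-zip \
  --resource-group <RESOURCE_GROUP> \
  --name <FUNCTION_APP_NAME> \
  --src ./function_app.zip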
It is not the exact way I wanted to solve this because I still have to include the requirements of the shared code in the root requirements.txt as well.
Switching from the AzureFunctionApp@1 task to AzureCLI@2 gives me more feedback in the pipeline. The result should be the same.

How to pass requirements.txt parameter in Dataflow when Dataflow is being triggered by Cloud Function?

Objective: I have a Dataflow template (written in Python) that depends on pandas and nltk, and I want to trigger the Dataflow job from a Cloud Function. For this purpose, I have uploaded the code to a bucket and I am ready to specify the template location in the Cloud Function.
Problem: How do I pass the requirements_file parameter (which you would normally pass to install third-party libraries) when triggering a Dataflow job from a Cloud Function using the Google API discovery module?
Prerequisites: I know this can be done when launching a job from the local machine by specifying a local directory path, but when I try to specify a GCS path such as --requirements_file gs://bucket/requirements.txt, it gives me an error saying:
The file gs://bucket/requirements.txt cannot be found. It was specified in the --requirements_file command line option.
A Dataflow template is not the Python or Java code itself; it is a compiled version of the code you've written in Python or Java. So when you create your template, you can pass your requirements.txt in the arguments as you normally do, as shown below:
python dataflow-using-cf.py \
--runner DataflowRunner \
--project <PROJECT_ID> \
--staging_location gs://<BUCKET_NAME>/staging \
--temp_location gs://<BUCKET_NAME>/temp \
--template_location ./template1 \
--requirements_file ./requirements.txt
The above command creates a file named template1 which, if you open it, contains a JSON structure. This file is the compiled version of the Dataflow code you've written; during the compilation process, Beam reads your requirements.txt from your local directory and bakes the dependency steps into the template. You can then upload the template to a bucket and provide that path to the Cloud Function; you don't have to worry about the requirements.txt file after creating the template.
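For completeness, a minimal sketch of launching the staged template from a Cloud Function with the Google API discovery client; the project ID, region, bucket path, and job name are placeholders, not values from the question:
from googleapiclient.discovery import build

def trigger_dataflow(request):
    # Build the Dataflow API client (uses the function's default service account).
    service = build('dataflow', 'v1b3', cache_discovery=False)
    launch = service.projects().locations().templates().launch(
        projectId='my-project',                          # placeholder project
        location='us-central1',                          # placeholder region
        gcsPath='gs://my-bucket/templates/template1',    # path to the compiled template
        body={'jobName': 'dataflow-from-cloud-function'},
    )
    response = launch.execute()
    return str(response)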

Apache Airflow KubernetesExecutor and KubernetesPodOperator: xcom pushes not working

I have an Apache Airflow instance deployed in a Kubernetes cluster: webserver, scheduler and PostgreSQL. It uses custom Helm charts built upon Bitnami's, with some changes.
Airflow is running with the KubernetesExecutor. All my DAGs use PythonOperator and KubernetesPodOperator (formerly DockerOperator, before moving to k8s). XCom pushes work correctly only with PythonOperator; with KubernetesPodOperator I'm getting errors at the end of its execution (all DAGs are affected):
[2019-12-06 15:12:40,116] {logging_mixin.py:112} INFO - [2019-12-06 15:12:40,116] {pod_launcher.py:217} INFO - Running command... cat /airflow/xcom/return.json
[2019-12-06 15:12:40,201] {logging_mixin.py:112} INFO - [2019-12-06 15:12:40,201] {pod_launcher.py:224} INFO - cat: can't open '/airflow/xcom/return.json': No such file or directory
So it seems that this file is not created.
I've also tried to override the post_execute method to create this file there and json.dump the results, but it didn't help; the error persists.
I would appreciate any suggestions on how to resolve it.
UPDATE: I've also copy-pasted this code into my DAG https://github.com/apache/airflow/blob/36f3bfb0619cc78698280f6ec3bc985f84e58343/tests/contrib/minikube/test_kubernetes_pod_operator.py#L315, and I'm still getting this error even when using the apache/airflow unit-test code.
I should also mention that my Kubernetes version is 1.11.10 and Minikube is 1.5.2.
I changed the database (PostgreSQL) dependency and version to a newer one and got it working.
By default, the xcom_push argument of the KubernetesPodOperator is True, which causes Airflow to try to read /airflow/xcom/return.json from the executed containers. Just change it to False:
KubernetesPodOperator(
    ...
    xcom_push=False
)
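Alternatively, if you actually need the XCom value, the container itself has to write it to /airflow/xcom/return.json before exiting. A minimal sketch under that assumption, for an Airflow 1.10-style KubernetesPodOperator; the image, namespace, and JSON payload are placeholders:
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

push_xcom = KubernetesPodOperator(
    task_id='push_xcom_example',
    name='push-xcom-example',
    namespace='default',        # placeholder namespace
    image='python:3.7-slim',    # placeholder image
    cmds=['bash', '-cx'],
    # The XCom sidecar reads this file after the main container finishes,
    # so the command must create it itself.
    arguments=["mkdir -p /airflow/xcom && echo '{\"result\": 42}' > /airflow/xcom/return.json"],
    xcom_push=True,
)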

How to enable cloudbuild.yaml for zip-based CloudFunction deployments?

Given some generic Python code, structured like ...
cloudbuild.yaml
requirements.txt
functions/
  folder_a/
    test/
      main_test.py
    main.py
If I'm ...
creating a .zip from the above folder and
using either Terraform's google_cloudfunctions_function resource or gcloud functions deploy to upload/deploy the function
... it seems the Cloud Build configuration (cloudbuild.yaml) included in the .zip is never considered during the build (i.e. while/prior to resolving requirements.txt).
I've set up cloudbuild.yaml to grant access to a private GitHub repository (which contains a dependency listed in requirements.txt). Unfortunately, the build fails with (Terraform output):
Error: Error waiting for Updating CloudFunctions Function: Error code 3, message: Build failed: {"error": {"canonicalCode": "INVALID_ARGUMENT", "errorMessage": "pip_download_wheels had stderr output:\nCommand \"git clone -q ssh://git@github.com/SomeWhere/SomeThing.git /tmp/pip-req-build-a29nsum1\" failed with error code 128 in None\n\nerror: pip_download_wheels returned code: 1", "errorType": "InternalError", "errorId": "92DCE9EA"}}
According to the Cloud Build docs, a cloudbuild.yaml can be specified using gcloud builds submit --config=cloudbuild.yaml. Is there any way to supply that parameter to gcloud functions deploy (or even Terraform), too? I'd like to stay with the current, "transparent" code build, i.e. I do not want to set up Cloud Build separately but just upload my zip and have the code built and deployed "automatically", while respecting cloudbuild.yaml.
It looks like you're trying to authenticate to a private Git repo via SSH. This is unfortunately not currently supported by Cloud Functions.
The alternative would be to vendor your private dependency into the directory before creating your .zip file.
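A rough sketch of that vendoring step; the package directory name and destination path are assumptions about your layout, and the repository URL is taken from the error above:
# Clone the private dependency locally (SSH auth happens on your machine, not in Cloud Build).
git clone ssh://git@github.com/SomeWhere/SomeThing.git /tmp/SomeThing
# Copy the importable package directory next to your function's main.py so it ships inside the .zip.
cp -r /tmp/SomeThing/something ./functions/folder_a/something
# Then drop the git+ssh line for this dependency from requirements.txt before zipping.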
