Apache Airflow KubernetesExecutor and KubernetesPodOperator: XCom pushes not working - python

I have an Apache Airflow instance deployed in a Kubernetes cluster: webserver, scheduler and PostgreSQL. I'm using custom Helm charts built upon Bitnami's, with some changes.
Airflow runs with the KubernetesExecutor. All my DAGs use PythonOperator and KubernetesPodOperator (formerly DockerOperator, before moving to Kubernetes). XCom pushes work correctly only with PythonOperator; with KubernetesPodOperator I get errors at the end of its execution (all DAGs are affected):
[2019-12-06 15:12:40,116] {logging_mixin.py:112} INFO - [2019-12-06 15:12:40,116] {pod_launcher.py:217} INFO - Running command... cat /airflow/xcom/return.json
[2019-12-06 15:12:40,201] {logging_mixin.py:112} INFO - [2019-12-06 15:12:40,201] {pod_launcher.py:224} INFO - cat: can't open '/airflow/xcom/return.json': No such file or directory
So it seems that this file is not created.
I've also tried overriding the post_execute method to create this file and json.dump the results there, but it didn't help: the error persists.
I would appreciate any suggestions on how to resolve it.
UPDATE: I've also copy-pasted the code from https://github.com/apache/airflow/blob/36f3bfb0619cc78698280f6ec3bc985f84e58343/tests/contrib/minikube/test_kubernetes_pod_operator.py#L315 into my DAG, and I'm still getting this error even with the apache/airflow unit-test code.
I should also mention that my Kubernetes version is 1.11.10 and Minikube is 1.5.2.

I changed the database (PostgreSQL) dependency to a newer version and got it working.

By default the xcom_push argument of the KubernetesPodOperator is True, which causes Airflow to try to read /airflow/xcom/return.json from the executed containers. If you don't need the XCom result, just change it to False:
KubernetesPodOperator(
    ...,
    xcom_push=False,
)
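Alternatively, if you do want the XCom value, the container itself has to write its result to /airflow/xcom/return.json before it exits, which is exactly the file Airflow cats in the log above. A minimal sketch, assuming the Airflow 1.10.x contrib import path; the image, namespace, ids and the dag object are placeholders:
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

write_xcom = KubernetesPodOperator(
    task_id="write_xcom",    # placeholder task id
    name="write-xcom",
    namespace="default",     # placeholder namespace
    image="alpine:3.10",     # placeholder image
    cmds=["sh", "-c"],
    arguments=["mkdir -p /airflow/xcom && echo '{\"result\": 42}' > /airflow/xcom/return.json"],
    xcom_push=True,          # Airflow reads /airflow/xcom/return.json after the main container exits
    dag=dag,                 # assumes a DAG object named dag is in scope
)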

Related

Python multiprocessing: AttributeError: Can't pickle local object

I wrote a ChatOps bot for the collaboration tool Mattermost using this framework. Now I'm trying to write and run integration tests based on their examples. By cloning the git repository you can run the tests yourself. Their docker-compose.yml file will only work on a Linux machine. If you want to reproduce it on a Mac, you'll have to edit the docker-compose.yml to:
version: "3.7"
services:
app:
container_name: "mattermost-bot-test"
build: .
command: ./mm/docker-entry.sh
ports:
- "8065:8065"
extra_hosts:
- "dockerhost:127.0.0.1"
After running docker-compose up -d, Mattermost is available at localhost:8065. I took one simple test from their project and copied it into base-test.py. You can see my source code here. After starting the test with pytest --capture=no --log-cli-level=DEBUG . it returns the following error: AttributeError: Can't pickle local object 'start_bot.<locals>.run_bot'. This error also shows up for the same test case in their project. The error happens at line 92 of the utils.py file.
What am I doing wrong here?
I don't know if you already went down this road, but I think you might get past the pickling error by making run_bot take the bot it calls bot.run() on as an argument, and then passing that bot to the process.
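To make the suggestion concrete, here is a minimal sketch of the idea (not the mmpy_bot test code itself): move run_bot to module level and hand it the bot via args, so the spawn start method (the default on macOS in recent Python versions) can pickle the target. The bot object itself must also be picklable for this to work.
import multiprocessing


def run_bot(bot):
    # Top-level function: importable and therefore picklable by the "spawn" start method.
    bot.run()


def start_bot(bot):
    # Pass the bot as an argument instead of closing over it in a nested function.
    proc = multiprocessing.Process(target=run_bot, args=(bot,))
    proc.start()
    return proc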
Take a look at the Actions tab on that GitHub repository. Pytest seems to execute correctly (ignoring the exceptions on the webhook test).
Here is a recent run you can use to compare your environment set-up: https://github.com/attzonko/mmpy_bot/runs/4289644769?check_suite_focus=true

FERNET_KEY configuration is missing when creating a new environment with the same DAGS

I'm using Composer (Airflow) in Google Cloud. I want to create a new environment and bring my existing DAGs and Variables from the old environment into the new one.
To accomplish this I do the following:
I check several of my variables and export them to a JSON file.
In my new environment I import this same JSON file.
I use gsutil to upload the same DAGs to the new environment.
However, in the new environment all of my DAGs are breaking with a 'FERNET_KEY configuration is missing' error. My best guess is that this is related to importing variables that were encrypted with a different Fernet key, but I'm unsure.
Has anyone encountered this issue before? If so, how did you fix it?
I can reliably reproduce the issue in Composer 1.9 / Airflow 1.10.6 by performing the following actions:
Create a new Composer Cluster
Upload a DAG that references an Airflow Connection
Set an Environment Variable in Composer
Wait for airflow-scheduler and airflow-worker to restart
Aside from 'FERNET_KEY configuration is missing', the issue manifests itself in the following Airflow error banners:
Broken DAG: [/home/airflow/gcs/dags/MY_DAG.py] in invalid literal for int() with base 10: 'XXX'
Broken DAG: [/home/airflow/gcs/dags/MY_DAG.py] Expecting value: line 1 column 1 (char 0)
The root cause of the issue is that adding a new environment variable removes the AIRFLOW__CORE__FERNET_KEY environment variable from the airflow-scheduler and airflow-worker Kubernetes Deployment Spec Pod Templates:
- name: AIRFLOW__CORE__FERNET_KEY
  valueFrom:
    secretKeyRef:
      key: fernet_key
      name: airflow-secrets
As a workaround, it's possible to apply a Kubernetes Deployment Spec Patch:
$ cat config/composer_airflow_scheduler_fernet_key_patch.yaml
spec:
  template:
    spec:
      containers:
      - name: airflow-scheduler
        env:
        - name: AIRFLOW__CORE__FERNET_KEY
          valueFrom:
            secretKeyRef:
              key: fernet_key
              name: airflow-secrets
$ kubectl patch deployment airflow-scheduler --namespace=$AIRFLOW_ENV_GKE_NAMESPACE --patch "$(cat config/composer_airflow_scheduler_fernet_key_patch.yaml)"
NOTE: This patch must also be applied to airflow-worker.
We got the same error about FERNET_KEYs. I think there is a bug in the new version (composer-1.9.0). They say 'The Fernet Key is now stored in Kubernetes Secrets instead of the Config Map.'
Even if you re-enter your connections, they still don't work.
They have already fixed the issue in version 1.9.1:
https://cloud.google.com/composer/docs/release-notes
As per the documentation, the Fernet key is generated by Composer and is intended to be unique. The fernet key value can be retrieved from the Composer Airflow configuration (Composer bucket -> airflow.cfg). You need to check whether fernet_key exists there.
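As a rough illustration of that check, a minimal sketch is below; the local file path is just an assumption after copying airflow.cfg down from the Composer bucket (for example with gsutil).
import configparser

# airflow.cfg is INI-style; disable interpolation because some values
# (e.g. connection strings) may contain '%' characters.
cfg = configparser.ConfigParser(interpolation=None)
cfg.read("airflow.cfg")  # assumed to be copied locally from the Composer bucket

fernet_key = cfg.get("core", "fernet_key", fallback="")
print("fernet_key is set" if fernet_key else "fernet_key is missing")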
There is a known issue due to a race condition in binary rollouts that can cause a new fernet key to be set in the webserver, making previously encrypted values in the metadata database unable to be decrypted.
What you can try is to recreate the key (the Composer object path) in the Airflow UI under Admin -> Variables.

Apache beam DataFlow runner throwing setup error

We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:
A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.
But we could not find detailed worker-startup logs.
We tried increasing the memory size, worker count, etc., but we still get the same error.
Here is the command we use:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2
Pipeline snippet:
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)
The above pipeline just loads data from BigQuery and filters it based on a column value. It works like a charm with DirectRunner but fails on Dataflow.
Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving this issue.
Update:
Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job starts but eventually fails because the workers cannot find the package. The Beam Python SDK sometimes does not throw an explicit error message for these cases; instead, it retries the job and fails. To get your code running with a package, you need to pass the --setup_file argument, pointing at a setup.py that lists the dependencies. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
If you have a privately hosted Python package dependency, pass --extra_package with the path to the package.tar.gz file. A better way is to store it in a GCS bucket and pass that path.
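For reference, the same flags can also be wired up programmatically through PipelineOptions. This is just a sketch; the bucket paths and package file name below are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--project=xyz",
    "--runner=DataflowRunner",
    "--staging_location=gs://xyz/staging",
    "--temp_location=gs://xyz/temp",
    "--setup_file=./setup.py",                       # instead of --requirements_file
    "--extra_package=gs://xyz/deps/package.tar.gz",  # privately hosted dependency (placeholder path)
])

with beam.Pipeline(options=options) as pipeline:
    ...  # pipeline transforms go here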
I have written an example project to get started with Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example
Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366
I'm building a prediction pipeline using Apache Beam/Dataflow. I need to include the model files inside the dependencies available to the remote workers. The Dataflow job failed with the same error log:
Error message from worker: A setup error was detected in beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx. Please refer to the worker-startup log for detailed information.
However, this error message didn't give any details about the worker-startup log. Finally, I found a way to access the worker log and solve the problem.
Dataflow creates Compute Engine VMs to run jobs and saves logs on them, so we can access a VM to see its logs. We can connect to the VM used by our Dataflow job from the GCP console via SSH. Then we can check the boot-json.log file located in /var/log/dataflow/taskrunner/harness:
$ cd /var/log/dataflow/taskrunner/harness
$ cat boot-json.log
One caveat: when running in batch mode, the VM created by Dataflow is ephemeral and is shut down when the job fails. Once the VM is gone, we can't access it anymore. However, a failing work item is retried 4 times, so normally there is enough time to open boot-json.log and see what is going on.
Finally, here is my Python setup solution, which may help someone else:
main.py
...
model_path = os.path.dirname(os.path.abspath(__file__)) + '/models/net.pd'
# pipeline code
...
MANIFEST.in
include models/*.*
setup.py complete example
REQUIRED_PACKAGES = [...]

setuptools.setup(
    ...
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    package_data={"models": ["models/*"]},
    ...
)
Run Dataflow pipelines
$ python main.py --setup_file=/absolute/path/to/setup.py ...

Airflow quickstart not working

Hi, I've just started using Airflow, but I can't manage to make the task from the quickstart run: airflow run example_bash_operator runme_0 2015-01-01.
I created a conda environment with Python 2.7.6 and installed Airflow through pip, which installed airflow==1.8.0. Then I ran the commands listed here: https://airflow.incubator.apache.org/start.html.
When I try to run the first task instance, nothing seems to happen when I look at the UI. Here's the output of the command:
(airflow) ✔ se7entyse7en in ~/Projects/airflow  $ airflow run example_bash_operator runme_0 2015-01-01
[2017-07-28 12:06:22,992] {__init__.py:57} INFO - Using executor SequentialExecutor
Sending to executor.
[2017-07-28 12:06:23,950] {__init__.py:57} INFO - Using executor SequentialExecutor
Logging into: /Users/se7entyse7en/airflow/logs/example_bash_operator/runme_0/2015-01-01T00:00:00
On the other hand, the backfill works fine: airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02.
What am I missing?
I've just found that if a single task is run, it is listed under Browse > Task Instances as part of any DAG.
The run command is used to run a single task instance.
But it will only be able to run if you have cleared any previous runs.
To clear the run:
Go to the Airflow UI (Graph View).
Click on the particular task and click Clear.
Now you will be able to run the task with the command that you initially used.
To view the logs for this task you can run:
vi /Users/se7entyse7en/airflow/logs/example_bash_operator/runme_0/2015-01-01T00:00:00
I had a task like:
t2 = BashOperator(
    task_id='sleep',
    depends_on_past=False,
    bash_command='sleep 35',
    dag=dag)
I was able to see the changes in the state of the task as it was getting executed.

What is causing this djcelery error: NotRegistered?

Running this script in a Django shell:
import processors.topics.tasks as t
t.test.delay()
Gives this error:
NotRegistered: 'processors.topics.tasks.test'
The weird thing is that chorus.processors.topics.tasks.test is definitely included in the [Tasks] printout when I run
python celeryd --verbosity=2 --loglevel=INFO --purge
Why am I getting the error?
It has to do with the way you are importing the task. For example, you are importing the task from the project instead of the app: chorus.processors.topics.tasks instead of processors.topics.tasks. This creates problems for Celery, since the task name will be different on the client and the server.
Here are the docs that relate to this issue.
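One way to sidestep the mismatch, assuming a djcelery-era Celery where the celery.task decorator is available, is to pin the task name explicitly so that the client and the worker agree regardless of which import path loads the module. A sketch, not the project's actual code:
# processors/topics/tasks.py
from celery.task import task


@task(name="processors.topics.tasks.test")
def test():
    # With an explicit name, the registered task name no longer depends on
    # whether the module was imported as processors... or chorus.processors...
    return "ok"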
