Is it possible to update/overwrite the Airflow ['dag_run'].conf? - python

We typically start Airflow DAGs with the trigger_dag CLI command. For example:
airflow trigger_dag my_dag --conf '{"field1": 1, "field2": 2}'
We access this conf in our operators using context['dag_run'].conf.
Sometimes when the DAG breaks at some task, we'd like to "update" the conf and restart the broken task (and downstream dependencies) with this new conf. For example:
new conf --> {"field1": 3, "field2": 4}
Is it possible to "update" the dag_run conf with a new JSON string like this?
Would be interested in hearing thoughts on this, other solutions, or potentially ways to avoid this situation to begin with.
Working with Apache Airflow v1.10.3
Thank you very much in advance.

Updating conf after a dag run has been created isn't as straightforward as reading from conf, because conf is read from the dag_run metadata table whenever it's used after a dag run has been created. While Variables have methods to both write to and read from a metadata table, dag runs only let you read.
I agree that Variables are a useful tool, but when you have k=v pairs that you only want to use for a single run, it gets complicated and messy.
Below is an operator that will let you update a dag_run's conf after instantiation (tested in v1.10.10):
#! /usr/bin/env python3
"""Operator to overwrite a dag run's conf after creation."""
import os
from typing import Dict
from airflow.models import BaseOperator
from airflow.utils.db import provide_session
from airflow.utils.decorators import apply_defaults
from airflow.utils.operator_helpers import context_to_airflow_vars
class UpdateConfOperator(BaseOperator):
    """Updates an existing DagRun's conf with `given_conf`.
    Args:
        given_conf: A dictionary of k:v values to update a DagRun's conf with. Templated.
        replace: Whether or not `given_conf` should replace conf (True)
            or be used to update the existing conf (False).
            Defaults to True.
    """
    template_fields = ("given_conf",)
    ui_color = "#ffefeb"
    @apply_defaults
    def __init__(self, given_conf: Dict, replace: bool = True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.given_conf = given_conf
        self.replace = replace
    @staticmethod
    def update_conf(given_conf: Dict, replace: bool = True, **context) -> None:
        @provide_session
        def save_to_db(dag_run, session):
            session.add(dag_run)
            session.commit()
            dag_run.refresh_from_db()
        dag_run = context["dag_run"]
        # When there's no conf provided,
        # conf will be None if scheduled or {} if manually triggered
        if replace or not dag_run.conf:
            dag_run.conf = given_conf
        elif dag_run.conf:
            # Note: dag_run.conf.update(given_conf) doesn't work
            dag_run.conf = {**dag_run.conf, **given_conf}
        save_to_db(dag_run)
    def execute(self, context):
        # Export context to make it available for callables to use.
        airflow_context_vars = context_to_airflow_vars(context, in_env_var_format=True)
        self.log.debug(
            "Exporting the following env vars:\n%s",
            "\n".join(["{}={}".format(k, v) for k, v in airflow_context_vars.items()]),
        )
        os.environ.update(airflow_context_vars)
        self.update_conf(given_conf=self.given_conf, replace=self.replace, **context)
Example usage:
# Import paths as in Airflow 1.10
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
CONF = {"field1": 3, "field2": 4}
with DAG(
    "some_dag",
    # schedule_interval="*/1 * * * *",
    schedule_interval=None,
    max_active_runs=1,
    catchup=False,
) as dag:
    t_update_conf = UpdateConfOperator(
        task_id="update_conf", given_conf=CONF,
    )
    t_print_conf = BashOperator(
        task_id="print_conf",
        bash_command="echo {{ dag_run['conf'] }}",
    )
    t_update_conf >> t_print_conf
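The replace-versus-merge rule that the operator applies can be checked in isolation. The following is a minimal sketch in plain Python with no Airflow imports; `merge_conf` is a hypothetical helper name used only for illustration, not part of Airflow or of the operator above.

```python
from typing import Dict, Optional


def merge_conf(existing: Optional[Dict], given: Dict, replace: bool = True) -> Dict:
    """Return the conf that would be stored (hypothetical helper)."""
    # A scheduled run has conf None and a manual run without --conf has {},
    # so both cases fall through to "use given as-is".
    if replace or not existing:
        return dict(given)
    # Build a new dict instead of mutating `existing` in place.
    return {**existing, **given}


original = {"field1": 1, "field2": 2}
replaced = merge_conf(original, {"field1": 3}, replace=True)
merged = merge_conf(original, {"field1": 3}, replace=False)
```

With `replace=True` the old keys are dropped entirely; with `replace=False` only the overlapping keys are overwritten.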

This seems like a good use case for Airflow Variables. If you read your configs from Variables, you can easily see and modify the configuration inputs from the Airflow UI itself.
You can even get creative and automate updating the config (which is now stored in a Variable) from another Airflow task before re-running a task or DAG. See With code, how do you update and airflow variable

Related

Airflow - What do I do when I have a variable amount of Work that needs to be handled by a DAG?

I have a sensor task that listens to files being created in S3.
After a poke I may have 3 files, after another poke I might have another 5 files.
I want to create a DAG (or multiple DAGs) that listens for work requests and creates other tasks or DAGs to handle that amount of work.
I wish I could access the xcom or dag_run variable from the DAG definition (see pseudo-code as follows):
def wait_for_s3_data(ti, **kwargs):
    s3_wrapper = S3Wrapper()
    work_load = s3_wrapper.work()
    # work_load: {"filename1.json": "s3/key/filename1.json", ....}
    ti.xcom_push(key="work_load", value=work_load)
    return len(work_load) > 0
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    dag_run.conf['work_load'] = work_load
    s3_wrapper.move_messages_from_waiting_to_processing(work_load)
with DAG(
    "ListenAndCallWorkers",
    description="This DAG waits for work request from s3",
    schedule_interval="@once",
    max_active_runs=1,
) as dag:
    wait_for_s3_data: PythonSensor = PythonSensor(
        task_id="wait_for_s3_data",
        python_callable=wait_for_s3_data,
        timeout=60,
        poke_interval=30,
        retries=2,
        mode="reschedule",
    )
    get_data_task = PythonOperator(
        task_id="GetData",
        python_callable=query.get_work,
        provide_context=True,
    )
    work_load = "{{ dag_run.conf['work_load'] }}"  # <--- I WISH I COULD DO THIS
    do_work_tasks = [
        TriggerDagRunOperator(
            task_id=f"TriggerDoWork_{work}",
            trigger_dag_id="Work",  # Ensure this equals the dag_id of the DAG to trigger
            conf={"work": keypath},
        )
        for work, keypath in work_load.items()
    ]
    wait_for_s3_data >> get_data_task >> do_work_tasks
I know I cannot do that.
I also tried to define my own custom MultiTriggerDAG object (as in this https://stackoverflow.com/a/51790697/1494511). But at that step I still don't have access to the amount of work that needs to be done.
Another idea:
I am considering building a DAG with N doWork tasks and passing work to up to N of them via xcom:
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    # work_load is a dict, so take its first N items rather than slicing it
    selected = dict(list(work_load.items())[:N])
    for i, keypath in enumerate(selected.values(), start=1):
        dag_run.conf[f'work_{i}'] = keypath
    s3_wrapper.move_messages_from_waiting_to_processing(selected)
This idea would get the job done, but it sounds very inefficient
Related questions:
This is the same question as I have, but no code is presented on how to solve it:
Airflow: Proper way to run DAG for each file
This answer looks like it would solve the problem, but it seems to be related to Airflow versions lower than 2.2.2
How do we trigger multiple airflow dags using TriggerDagRunOperator?

Non-JSON-serializable params deprecated in Airflow 2.3.0. How should I pass non-Json-serializable objects now?

I am currently using the params kwarg in my DAG objects to pass extra configurations to my tasks, which are PythonDecoratedOperators.
These configurations (stored in python dictionaries) include datetime objects or even lambda functions that I'm able to handle from a configuration file. This allows me to easily change some important aspects without touching the rest of my code.
This practice has been deprecated in Airflow 2.3.0 and will be removed in future versions.
My question is: how should I proceed in the future? Is there a better way to handle this?
A very simple example in a single file would be:
# Packages
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context
import logging
# Here is my configuration dict.
config = {
    'value': 5,
    'operation': lambda x: x**2
}
default_args = {
    'start_date': days_ago(1)
}
@dag(schedule_interval=None, default_args=default_args, catchup=False, params=config)  # I pass my config dict using the params kwarg.
def dag_example():
    @task
    def extract_message(**kwargs) -> int:
        context = get_current_context()
        config = context['params']
        value = config['value']
        logging.info('The value is:' + str(value))
        return value
    @task
    def process_data(value: int) -> int:
        context = get_current_context()
        config = context['params']
        operation = config['operation']
        value = operation(value)
        return value
    @task
    def store_data(value: int) -> None:
        with open('/opt/airflow/data/try1.text', 'a+') as file:
            print(value, file=file)
    store_data(process_data(extract_message()))
dag_example = dag_example()
if __name__ == '__main__':
    from airflow.utils.state import State
    dag_example.clear(dag_run_state=State.NONE)
    dag_example.run()
With this, I would obviously receive the following warning:
[2022-05-04, 13:09:26 CEST] {warnings.py:109} WARNING - /home/---/.local/lib/python3.8/site-packages/---/models/param.py:59: DeprecationWarning: The use of non-json-serializable params is deprecated and will be removed in a future release
My naive solution would be to pass config as a param in your python functions.
I would do something like this (note I renamed value because it was being overridden):
@task
def process_data(key, config) -> int:
    operation = config['operation']
    value = operation(key)
    return value
process_data(key=extract_message(), config=config)
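If the params themselves should stay JSON-serializable, one alternative (not taken from the answer above) is to keep only names in the config and resolve them to callables inside the task. A minimal sketch in plain Python, assuming the set of operations is known in advance; `OPERATIONS` and `resolve_operation` are hypothetical names:

```python
import json

# Hypothetical registry mapping JSON-friendly names to callables.
OPERATIONS = {
    "square": lambda x: x ** 2,
    "double": lambda x: x * 2,
}

# The config now holds only JSON-serializable values.
config = {"value": 5, "operation": "square"}


def resolve_operation(name: str):
    """Look up a callable by name, failing loudly on unknown names."""
    try:
        return OPERATIONS[name]
    except KeyError:
        raise ValueError(f"Unknown operation: {name!r}") from None


# Round-trips through JSON, which a dict containing a lambda would not.
serialized = json.dumps(config)
result = resolve_operation(config["operation"])(config["value"])
```

The same idea could cover datetime values by storing them as ISO 8601 strings and parsing them with `datetime.fromisoformat` inside the task.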

Why does this code to get Airflow context get run on DAG import?

I have an Airflow DAG where I need to get the parameters the DAG was triggered with from the Airflow context.
Previously, I had the code to get those parameters within a DAG step (I'm using the Taskflow API from Airflow 2) -- similar to this:
from typing import Dict, Any, List
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.utils.dates import days_ago
default_args = {"owner": "airflow"}
@dag(
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval=None,
    tags=["my_pipeline"],
)
def my_pipeline():
    @task(multiple_outputs=True)
    def get_params() -> Dict[str, Any]:
        context = get_current_context()
        params = context["params"]
        assert isinstance(params, dict)
        return params
    params = get_params()
pipeline = my_pipeline()
This worked as expected.
However, I needed to get these parameters in several steps, so I thought it would be a good idea to move code to get them into a separate function in the global scope, like this:
# ...
from airflow.operators.python import get_current_context
# other top-level code here
def get_params() -> Dict[str, Any]:
    context = get_current_context()
    params = context["params"]
    return params
@dag(...)
def my_pipeline():
    @task()
    def get_data():
        params = get_params()
    # other DAG tasks here
    get_data()
pipeline = my_pipeline()
Now, this breaks right on DAG import, with the following error (names changed to match the examples above):
Broken DAG: [/home/airflow/gcs/dags/my_pipeline.py] Traceback (most recent call last):
File "/home/airflow/gcs/dags/my_pipeline.py", line 26, in get_params
context = get_context()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/operators/python.py", line 467, in get_context
raise AirflowException(
airflow.exceptions.AirflowException: Current context was requested but no context was found! Are you running within an airflow task?
And I get what the error is saying and how to fix it (move the code to get context back inside a @task). But my question is -- why does the error come up right on DAG import?
get_params doesn't get called anywhere outside of other tasks, and those tasks are obviously not run until the DAG runs. So why does the code in get_params run at all right when the DAG gets imported?
At this point, I want to understand this just because the fact that this error comes up when it comes up is breaking my understanding of how Python modules are evaluated on import. Code within a function shouldn't run until the function is run, and the only error that can come up before it's run is SyntaxError (and maybe some other core errors that I'm not remembering right now).
Is Airflow doing some special magic, or is there something simpler going on that I'm missing?
I am running Airflow 2.1.2 managed by Google Cloud Composer 1.17.2.
Unfortunately I am not able to reproduce your issue. The similar code below parses, renders a DAG, and completes successfully on Airflow 2.0, 2.1, and 2.2:
from datetime import datetime
from typing import Any, Dict
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
def get_params() -> Dict[str, Any]:
    context = get_current_context()
    params = context["params"]
    return params
@dag(
    dag_id="get_current_context_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    params={"my_param": "param_value"},
)
def my_pipeline():
    @task()
    def get_data():
        params = get_params()
        print(params)
    get_data()
pipeline = my_pipeline()
However, context objects are directly accessible in task-decorated functions. You can update the task signature(s) to include an arg for params=None (default value used so the file parses without a TypeError exception) and then apply whatever logic you need with that arg. This can be done with ti, dag_run, etc. too. Perhaps this helps?
@dag(
    dag_id="get_current_context_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    params={"my_param": "param_value"},
)
def my_pipeline():
    @task()
    def get_data(params=None):
        print(params)
    get_data()
pipeline = my_pipeline()
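To illustrate how a default of `params=None` can be filled in at run time, here is a simplified stand-in in plain Python. It is not Airflow's implementation; `call_with_context` is a hypothetical helper that passes a function only the context keys its signature declares.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_context(func: Callable, context: Dict[str, Any]) -> Any:
    """Call func with the context entries named in its signature (hypothetical)."""
    wanted = inspect.signature(func).parameters
    kwargs = {name: context[name] for name in wanted if name in context}
    return func(**kwargs)


def get_data(params=None):
    # At definition time params is None; at call time the runner supplies it.
    return params


fake_context = {"params": {"my_param": "param_value"}, "ti": object()}
received = call_with_context(get_data, fake_context)
```

Because `get_data` declares only `params`, the `ti` entry is never passed; calling `get_data()` with no arguments still works and returns `None`, which is why the file parses without a TypeError.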

Why am I getting transient errors when trying to use DAG.get_dagrun() in Airflow/Google Composer?

I've been looking at ways to access the dag run config JSON and build my actual DAG and its underlying tasks dynamically depending on what's there.
As the Jinja templating is somewhat limited for my use, I've opted to use 'vanilla' Python, using functions to build out my tasks.
The backbone of all this is being able to access the config JSON, which I found out how to do here: https://stackoverflow.com/a/68455786/5687904
However, as I am using Airflow 1.10.12 (Composer 1.13.3), I had to adapt the above to use older/deprecated attributes instead, so what I ended up with is:
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf
I got this to work in a new DAG for testing, here a minimum working example with any private data stripped:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.models import Variable
from dependencies.airflow_utils import (
    DBT_IMAGE
)
from dependencies.kube_secrets import (
    GIT_DATA_TESTS_PRIVATE_KEY
)
# Default arguments for the DAG
default_args = {
    "depends_on_past": False,
    "owner": "airflow",
    "retries": 0,
    "start_date": datetime(2021, 5, 7, 0, 0, 0),
    'dataflow_default_options': {
        'project': 'my-gcp_project',
        'region': 'europe-west1'
    }
}
# Create the DAG
dag = DAG("test_conf_strings2", default_args=default_args, schedule_interval=None)
# DBT task creation function
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf
def dynamic_full_refresh_strings(conf, arguments):
    if conf.get("full-refresh") and 'dbt snapshot' in arguments:
        return ' --vars "full-refresh: true"'
    elif conf.get("full-refresh"):
        return conf.get("full-refresh")
    else:
        return ""
def task_dbt_run(conf, name, arguments, **kwargs):
    return KubernetesPodOperator(
        image=DBT_IMAGE,
        task_id="dbt_run_{}".format(name),
        name="dbt_run_{}".format(name),
        secrets=[
            GIT_DATA_TESTS_PRIVATE_KEY,
        ],
        startup_timeout_seconds=540,
        arguments=[arguments + dynamic_full_refresh_strings(conf, arguments)],
        dag=dag,
        get_logs=True,
        image_pull_policy="Always",
        resources={"request_memory": "512Mi", "request_cpu": "250m"},
        retries=3,
        namespace="default",
        cmds=["/bin/bash", "-c"]
    )
# DBT commands
dbt_bqtoscore = f"""
{clone_repo_simplified_cmd} &&
cd bigqueryprocessing/data &&
dbt run --profiles-dir .dbt --models execution_engine_filter"""
# Create all tasks for the dag
dbt_run_bqtoscore = task_dbt_run(conf, "bqtoscore", dbt_bqtoscore)
# Task dependencies setting
dbt_run_bqtoscore
However, when I tried adding this logic to my main DAG I started getting 'NoneType' object has no attribute 'get'.
After checking everything like a madman and doing a lot of diffchecker I confirmed there is no difference.
To ensure I am not going entirely crazy I even copied my working testing DAG and just changed the name to something else so it doesn't conflict with the original.
I got the error again, for essentially 1:1 copy of the dag!
So what's happening here judging by the error is that the same code for conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf produces different results in dags whose only difference is the dag name.
In my working tests I get the correct JSON I pass or simply {} if nothing is passed hence no error.
But in the erroring ones it is a None which causes the issue.
Does anybody have any ideas what might be happening here?
Or at least ideas of what tests/debugging I should do to dig deeper?
Add a PythonOperator task before the main task that computes what dynamic_full_refresh_strings returns, and pass that result from the first task to the second (using XCom push/pull, setting it in dag_run.conf, or any other way).
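Independently of where the logic runs, the helper from the question can be made tolerant of a missing conf, since the question reports `None` in the failing DAGs. A minimal sketch that keeps the original branching and only adds the guard:

```python
from typing import Optional


def dynamic_full_refresh_strings(conf: Optional[dict], arguments: str) -> str:
    """Return the extra dbt argument string for a run (guarded against conf=None)."""
    conf = conf or {}  # treat a missing conf like an empty one
    full_refresh = conf.get("full-refresh")
    if full_refresh and "dbt snapshot" in arguments:
        return ' --vars "full-refresh: true"'
    if full_refresh:
        return full_refresh
    return ""
```

This only removes the AttributeError. It does not make a lookup of the run at DAG-parse time reliable, so computing the value inside a task remains the safer route.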

Airflow - Proper way to handle DAGs callbacks

I have a DAG, and whenever it succeeds or fails I want it to trigger a method which posts to Slack.
My DAG args is like below:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack there are more than 100 messages each minute, as if the callbacks were being evaluated at each scheduler heartbeat. Each time, both the success and the failure method run, as if the same task instance had both worked and not worked (not fine).
How should I properly use the on_failure_callback and on_success_callback to handle dags statuses and call a custom method?
The reason it's creating the messages is because when you are defining your default_args, you are executing the functions. You need to just pass the function definition without executing it.
Since the function has an argument, it'll get a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial
success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)
default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
# Airflow calls each callback with the context dict as its single argument.
def success_msg(context):
    slack.slack_message(happy_message)
def failure_msg(context):
    slack.slack_message(sad_message)
default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either method, note how just the function definition failure_msg and success_msg are being passed, not the result they give when executed.
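The difference between calling a function and passing it can be seen without Airflow. In this minimal sketch, `slack_message` is a stand-in that records its calls instead of posting, and the final line only simulates a framework invoking the stored callback with a context dict:

```python
from functools import partial

calls = []


def slack_message(text, context=None):
    """Stand-in for slack.slack_message; records instead of posting."""
    calls.append(text)


# Calling the function while building the dict runs it immediately
# and stores its return value (None) as the "callback".
eager_args = {"on_failure_callback": slack_message("sad")}
calls_after_eager = list(calls)

# Passing a partial stores a callable; nothing runs until it is invoked.
lazy_args = {"on_failure_callback": partial(slack_message, "sad")}
calls_after_lazy = list(calls)

# Simulate the framework invoking the callback with a context dict.
lazy_args["on_failure_callback"]({"dag_run": None})
```

In the eager version a message is sent every time the file is evaluated, and the stored callback is `None`. In the lazy version no message is sent until the callback is actually invoked.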
default_args is applied to every task, so callbacks set there become per-task callbacks.
To get one callback per DAG run, set on_success_callback and on_failure_callback on the DAG object itself, outside of "default_args".
What is the slack method you are referring to? The scheduler parses your DAG file every heartbeat, so if slack is some function defined in your code, it is going to get run every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that will trigger based on failure or success of the parent task.
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
