Airflow - Proper way to handle DAGs callbacks - python

I have a DAG, and whenever it succeeds or fails I want it to trigger a method that posts to Slack.
My DAG default args look like this:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack there are more than 100 messages each minute, as if the callbacks were being evaluated at every scheduler heartbeat, and both the success and the failure method fired for the same task instance, as if it had both worked and not worked (not fine).
How should I properly use on_failure_callback and on_success_callback to handle DAG statuses and call a custom method?

The reason it's creating the messages is that when you define your default_args, you are executing the functions. You need to pass the function itself, without calling it.
Since the function takes an argument, it gets a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial

success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
def success_msg(context):
    # Airflow passes the task context to callbacks, hence the context argument
    slack.slack_message(happy_message)

def failure_msg(context):
    slack.slack_message(sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either approach, note that only the function references failure_msg and success_msg are passed, not the result of calling them.

default_args is expanded at the task level, so these become per-task callbacks.
To get a single DAG-level callback, set the attribute on the DAG itself, outside of default_args.
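For example, a rough sketch reusing the failure_msg/success_msg references from the answer above (assuming an Airflow version where the DAG constructor accepts these callback arguments):
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False,
    on_success_callback=success_msg,  # fires once per successful DAG run
    on_failure_callback=failure_msg,  # fires once per failed DAG run
)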

What is the slack method you are referring to? The scheduler parses your DAG file every heartbeat, so if slack is some function called at module level in your code, it is going to get run every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that trigger based on the failure or success of the parent task (a sketch follows below).
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
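A rough sketch of that trigger-rule approach, assuming a single upstream etl_task and reusing the slack helper from your question (task names and Airflow 1.x import paths are illustrative):
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

notify_success = PythonOperator(
    task_id='notify_success',
    python_callable=lambda: slack.slack_message(happy_message),
    trigger_rule=TriggerRule.ALL_SUCCESS,  # runs only if the upstream ETL task succeeded
    dag=dag,
)

notify_failure = PythonOperator(
    task_id='notify_failure',
    python_callable=lambda: slack.slack_message(sad_message),
    trigger_rule=TriggerRule.ONE_FAILED,  # runs if the upstream ETL task failed
    dag=dag,
)

etl_task >> [notify_success, notify_failure]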

Related

Airflow - What do I do when I have a variable amount of Work that needs to be handled by a DAG?

I have a sensor task that listens to files being created in S3.
After a poke I may have 3 files, after another poke I might have another 5 files.
I want to create a DAG (or multiple DAGs) that listens for work requests and creates other tasks or DAGs to handle that amount of work.
I wish I could access the xcom or dag_run variable from the DAG definition (see the pseudo-code below):
def wait_for_s3_data(ti, **kwargs):
    s3_wrapper = S3Wrapper()
    work_load = s3_wrapper.work()
    # work_load: {"filename1.json": "s3/key/filename1.json", ....}
    ti.xcom_push(key="work_load", value=work_load)
    return len(work_load) > 0

def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    dag_run.conf['work_load'] = work_load
    s3_wrapper.move_messages_from_waiting_to_processing(work_load)
with DAG(
    "ListenAndCallWorkers",
    description="This DAG waits for work request from s3",
    schedule_interval="@once",
    max_active_runs=1,
) as dag:

    wait_for_s3_data: PythonSensor = PythonSensor(
        task_id="wait_for_s3_data",
        python_callable=wait_for_s3_data,
        timeout=60,
        poke_interval=30,
        retries=2,
        mode="reschedule",
    )

    get_data_task = PythonOperator(
        task_id="GetData",
        python_callable=query.get_work,
        provide_context=True,
    )

    work_load = "{{ dag_run.conf['work_load'] }}"  # <--- I WISH I COULD DO THIS

    do_work_tasks = [
        TriggerDagRunOperator(
            task_id=f"TriggerDoWork_{work}",
            trigger_dag_id="Work",  # Ensure this equals the dag_id of the DAG to trigger
            conf={"work": keypath},
        )
        for work, keypath in work_load.items()
    ]

    wait_for_s3_data >> get_data_task >> do_work_tasks
I know I cannot do that.
I also tried to define my own custom MultiTriggerDAG object (as in https://stackoverflow.com/a/51790697/1494511), but at that step I still don't have access to the amount of work that needs to be done.
Another idea:
I am considering building a DAG with N doWork tasks and passing work to up to N of them via XCom:
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    i = 1
    for work, keypath in work_load.items():
        dag_run.conf[f'work_{i}'] = keypath
        i += 1
        if i > N:
            break
    s3_wrapper.move_messages_from_waiting_to_processing(work_load[:N])
This idea would get the job done, but it sounds very inefficient.
Related questions:
This is the same question I have, but no code is presented on how to solve it:
Airflow: Proper way to run DAG for each file
This answer looks like it would solve the problem, but it seems to be related to Airflow versions lower than 2.2.2
How do we trigger multiple airflow dags using TriggerDagRunOperator?
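If upgrading is an option, Airflow 2.3+ dynamic task mapping looks like it could express this "one task per file" pattern natively; a minimal, self-contained sketch with illustrative names (the work list is hard-coded where the sensor's XCom would normally go):
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def listen_and_do_work():

    @task
    def build_work_list():
        # Would normally come from the XCom pushed by wait_for_s3_data;
        # hard-coded here to keep the sketch self-contained.
        work_load = {"filename1.json": "s3/key/filename1.json"}
        return list(work_load.values())

    @task
    def do_work(keypath: str):
        # One mapped task instance is created per key path at run time.
        print(f"processing {keypath}")

    do_work.expand(keypath=build_work_list())

listen_and_do_work()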

Why does this code to get Airflow context get run on DAG import?

I have an Airflow DAG where I need to get the parameters the DAG was triggered with from the Airflow context.
Previously, I had the code to get those parameters within a DAG step (I'm using the Taskflow API from Airflow 2) -- similar to this:
from typing import Dict, Any, List

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.utils.dates import days_ago

default_args = {"owner": "airflow"}

@dag(
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval=None,
    tags=["my_pipeline"],
)
def my_pipeline():
    @task(multiple_outputs=True)
    def get_params() -> Dict[str, Any]:
        context = get_current_context()
        params = context["params"]
        assert isinstance(params, dict)
        return params

    params = get_params()

pipeline = my_pipeline()
This worked as expected.
However, I needed to get these parameters in several steps, so I thought it would be a good idea to move the code that gets them into a separate function in the global scope, like this:
# ...
from airflow.operators.python import get_current_context

# other top-level code here

def get_params() -> Dict[str, Any]:
    context = get_current_context()
    params = context["params"]
    return params

@dag(...)
def my_pipeline():
    @task()
    def get_data():
        params = get_params()
    # other DAG tasks here
    get_data()

pipeline = my_pipeline()
Now, this breaks right on DAG import, with the following error (names changed to match the examples above):
Broken DAG: [/home/airflow/gcs/dags/my_pipeline.py] Traceback (most recent call last):
  File "/home/airflow/gcs/dags/my_pipeline.py", line 26, in get_params
    context = get_current_context()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/operators/python.py", line 467, in get_current_context
    raise AirflowException(
airflow.exceptions.AirflowException: Current context was requested but no context was found! Are you running within an airflow task?
And I get what the error is saying and how to fix it (move the code to get context back inside a @task). But my question is -- why does the error come up right on DAG import?
get_params doesn't get called anywhere outside of other tasks, and those tasks are obviously not run until the DAG runs. So why does the code in get_params run at all right when the DAG gets imported?
At this point I mostly want to understand this, because the fact that the error comes up at import time breaks my understanding of how Python modules are evaluated on import. Code inside a function shouldn't run until the function is called, and the only error that can come up before that is a SyntaxError (and maybe a few other core errors I'm not remembering right now).
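For reference, this is the plain-Python behaviour I'm describing (a made-up module, purely for illustration):
# my_module.py
print("runs at import time")            # module-level code executes on import

def my_func():
    print("runs only when called")      # a function body does not execute on import
    raise RuntimeError("errors in here only surface when the function is called")

# elsewhere:
# import my_module      -> prints "runs at import time" and nothing else
# my_module.my_func()   -> prints "runs only when called", then raises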
Is Airflow doing some special magic, or is there something simpler going on that I'm missing?
I am running Airflow 2.1.2 managed by Google Cloud Composer 1.17.2.
Unfortunately I am not able to reproduce your issue. The similar code below parses, renders a DAG, and completes successfully on Airflow 2.0, 2.1, and 2.2:
from datetime import datetime
from typing import Any, Dict

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

def get_params() -> Dict[str, Any]:
    context = get_current_context()
    params = context["params"]
    return params

@dag(
    dag_id="get_current_context_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    params={"my_param": "param_value"},
)
def my_pipeline():
    @task()
    def get_data():
        params = get_params()
        print(params)

    get_data()

pipeline = my_pipeline()
Task log snippet: (screenshot omitted)
However, context objects are directly accessible in task-decorated functions. You can update the task signature(s) to include an arg for params=None (default value used so the file parses without a TypeError exception) and then apply whatever logic you need with that arg. This can be done with ti, dag_run, etc. too. Perhaps this helps?
@dag(
    dag_id="get_current_context_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    params={"my_param": "param_value"},
)
def my_pipeline():
    @task()
    def get_data(params=None):
        print(params)

    get_data()

pipeline = my_pipeline()

Why am I getting transient errors when trying to use DAG.get_dagrun() in Airflow/Google Composer?

I've been looking at ways to access the dag run config JSON and build my actual DAG and its underlying tasks dynamically depending on what's there.
As the Jinja templating is somewhat limited for my use case, I've opted to use 'vanilla' Python, using functions to build out my tasks.
The backbone of all this is being able to access the config JSON, which I found out how to do here: https://stackoverflow.com/a/68455786/5687904
However, as I am using Airflow 1.10.12 (Composer 1.13.3), I had to adapt the above a bit to use older/deprecated attributes instead, so what I got to is:
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf
I got this to work in a new DAG for testing; here is a minimal working example with any private data stripped:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.models import Variable
from dependencies.airflow_utils import (
    DBT_IMAGE
)
from dependencies.kube_secrets import (
    GIT_DATA_TESTS_PRIVATE_KEY
)

# Default arguments for the DAG
default_args = {
    "depends_on_past": False,
    "owner": "airflow",
    "retries": 0,
    "start_date": datetime(2021, 5, 7, 0, 0, 0),
    'dataflow_default_options': {
        'project': 'my-gcp_project',
        'region': 'europe-west1'
    }
}

# Create the DAG
dag = DAG("test_conf_strings2", default_args=default_args, schedule_interval=None)

# DBT task creation function
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf

def dynamic_full_refresh_strings(conf, arguments):
    if conf.get("full-refresh") and 'dbt snapshot' in arguments:
        return ' --vars "full-refresh: true"'
    elif conf.get("full-refresh"):
        return conf.get("full-refresh")
    else:
        return ""

def task_dbt_run(conf, name, arguments, **kwargs):
    return KubernetesPodOperator(
        image=DBT_IMAGE,
        task_id="dbt_run_{}".format(name),
        name="dbt_run_{}".format(name),
        secrets=[
            GIT_DATA_TESTS_PRIVATE_KEY,
        ],
        startup_timeout_seconds=540,
        arguments=[arguments + dynamic_full_refresh_strings(conf, arguments)],
        dag=dag,
        get_logs=True,
        image_pull_policy="Always",
        resources={"request_memory": "512Mi", "request_cpu": "250m"},
        retries=3,
        namespace="default",
        cmds=["/bin/bash", "-c"]
    )

# DBT commands
dbt_bqtoscore = f"""
{clone_repo_simplified_cmd} &&
cd bigqueryprocessing/data &&
dbt run --profiles-dir .dbt --models execution_engine_filter"""

# Create all tasks for the dag
dbt_run_bqtoscore = task_dbt_run(conf, "bqtoscore", dbt_bqtoscore)

# Task dependencies setting
dbt_run_bqtoscore
However, when I tried adding this logic to my main DAG I started getting 'NoneType' object has no attribute 'get'.
After checking everything like a madman and doing a lot of diffchecker I confirmed there is no difference.
To ensure I am not going entirely crazy I even copied my working testing DAG and just changed the name to something else so it doesn't conflict with the original.
I got the error again, for an essentially 1:1 copy of the DAG!
So, judging by the error, what's happening here is that the same code, conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf, produces different results in DAGs whose only difference is the DAG name.
In my working tests I get the correct JSON I pass, or simply {} if nothing is passed, hence no error.
But in the erroring ones it is None, which causes the issue.
Does anybody have any ideas what might be happening here?
Or at least ideas of what tests/debugging I should do to dig deeper?
Add a PythonOperator task prior to the main task which basically calculates what dynamic_full_refresh_strings returns, and pass that info from the first task to the second (using XCom push/pull, setting it in dag_run.conf, or any other way); a sketch of the XCom variant follows.
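A rough sketch of the XCom variant, reusing dynamic_full_refresh_strings and the dbt_bqtoscore command from your DAG (task names are illustrative, Airflow 1.10-style):
from airflow.operators.python_operator import PythonOperator

def compute_refresh_string(**context):
    # Read the run config here, at execution time, when a DagRun actually exists.
    conf = context["dag_run"].conf or {}
    return dynamic_full_refresh_strings(conf, dbt_bqtoscore)  # return value is pushed to XCom

compute_refresh = PythonOperator(
    task_id="compute_refresh_string",
    python_callable=compute_refresh_string,
    provide_context=True,  # needed on Airflow 1.10.x so the context is passed in
    dag=dag,
)

dbt_run_bqtoscore = KubernetesPodOperator(
    image=DBT_IMAGE,
    task_id="dbt_run_bqtoscore",
    name="dbt_run_bqtoscore",
    # arguments is a templated field, so the XCom can be pulled at run time:
    arguments=[dbt_bqtoscore + "{{ ti.xcom_pull(task_ids='compute_refresh_string') }}"],
    namespace="default",
    cmds=["/bin/bash", "-c"],
    dag=dag,
)

compute_refresh >> dbt_run_bqtoscore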

APScheduler job is not starting as scheduled

I'm trying to schedule a job to start every minute.
I have the scheduler defined in a scheduler.py script:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor

executors = {
    'default': ThreadPoolExecutor(10),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 5
}
scheduler = BackgroundScheduler(executors=executors, job_defaults=job_defaults)
I initialize the scheduler in the __init__.py of the module like this:
from scheduler import scheduler
scheduler.start()
I want to start a scheduled job on a specific action, like this:
def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.scheduled_job(func=TestScheduler(),
                            trigger='interval',
                            minutes=1,
                            id=job_id
                            )

def TestScheduler():
    for i in range(0, 29):
        starttime = time()
        print "test"
        sleep(1.0 - ((time() - starttime) % 1.0))
First: when I execute the AddJob() function in the Python console, it starts to run as expected, but not in the background; the console is blocked until the TestScheduler function ends after 30 seconds. I was expecting it to run in the background because it's a background scheduler.
Second: the job never starts again even when specifying a repeat interval of 1 minute.
What am I missing?
UPDATE
I found the issue thanks to another thread. The wrong line is this:
scheduler.scheduled_job(func=TestScheduler(),
                        trigger='interval',
                        minutes=1,
                        id=job_id
                        )
I changed it to:
scheduler.add_job(func=TestScheduler,
                  trigger='interval',
                  minutes=1,
                  id=job_id
                  )
TestScheduler() becomes TestScheduler. Using TestScheduler() causes the result of calling TestScheduler() to be passed to add_job() as the func argument, rather than the function itself.
The first problem seems to be that you are initializing the scheduler inside the __init__.py, which doesn't seem to be the recommended way.
Code that exists in the __init__.py gets executed the first time a module from the specific folder gets imported. For example, imagine this structure:
my_module
|--__init__.py
|--test.py
with __init__.py:
from scheduler import scheduler
scheduler.start()
the scheduler.start() command gets executed when you run from my_module import something. So it either doesn't start at all from __init__.py or it starts many times (depending on the rest of your code!).
Another problem must be the use of the scheduler.scheduled_job() method. If you read the documentation on adding jobs, you will observe that the recommended way is to use the add_job() method; scheduled_job() is a decorator provided for convenience.
I would suggest something like this:
Keep my_scheduler.py as is.
Remove the scheduler.start() line from __init__.py.
Change your main file as follows:
from my_scheduler import scheduler

if not scheduler.running:  # Clause suggested by @CyrilleMODIANO
    scheduler.start()

def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id
    )
...

Airflow : Passing a dynamic value to Sub DAG operator

I am new to Airflow.
I have come across a scenario where the parent DAG needs to pass some dynamic number (let's say n) to a SubDAG.
The SubDAG will then use this number to dynamically create n parallel tasks.
The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of ways:
Option 1 (using XCom pull)
I have tried to pass it as an XCom value, but for some reason the SubDAG is not resolving to the passed value.
Parent DAG file:
def load_dag(**kwargs):
    number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
    dag_data = json.dumps({
        "number_of_runs": number_of_runs
    })
    return dag_data

# ------------------ Tasks ------------------------------

load_config = PythonOperator(
    task_id='load_config',
    provide_context=True,
    python_callable=load_dag,
    dag=dag)

t1 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config') }}'"),
    default_args=default_args,
    dag=dag,
)
Sub DAG file:
def sub_dag(parent_dag_name, child_dag_name, args, num_of_runs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=None)

    variabe_names = {}
    for i in range(num_of_runs):
        variabe_names['task' + str(i + 1)] = DummyOperator(
            task_id='dummy_task',
            dag=dag_subdag,
        )

    return dag_subdag
Option 2
I have also tried to pass number_of_runs as a global variable, which was not working.
Option 3
We also tried writing this value to a data file, but the SubDAG throws a "file doesn't exist" error. This might be because we are generating this file dynamically.
Can someone help me with this?
I've done it with Option 3. The key is to return a valid DAG with no tasks if the file does not exist. load_config will then generate a file with your number of tasks, or more information if needed. Your subdag factory would look something like:
def subdag(...):
    sdag = DAG('%s.%s' % (parent, child), default_args=args, schedule_interval=timedelta(hours=1))
    file_path = "/path/to/generated/file"
    if os.path.exists(file_path):
        data_file = open(file_path)
        list_tasks = data_file.readlines()
        for task in list_tasks:
            DummyOperator(
                task_id='task_' + task.strip(),  # strip the trailing newline from readlines()
                default_args=args,
                dag=sdag,
            )
    return sdag
At DAG parse time you will see a subdag with no tasks. At DAG execution time, after load_config is done, you will see your dynamically generated subdag. A sketch of the load_config side follows.
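A rough sketch of what load_config could do under this approach, assuming the number of tasks arrives in dag_run.conf as in your question (the callable name and file path are illustrative):
def write_work_file(**kwargs):
    # Runs at execution time, so dag_run.conf is available here.
    number_of_runs = int(kwargs['dag_run'].conf['number_of_runs'])
    with open("/path/to/generated/file", "w") as data_file:
        for i in range(number_of_runs):
            data_file.write("{}\n".format(i + 1))  # one line per task; read back by subdag()

load_config = PythonOperator(
    task_id='load_config',
    python_callable=write_work_file,
    provide_context=True,
    dag=dag,
)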
Option 1 should work if you just change the call to xcom_pull to include the dag_id of the parent DAG. By default the xcom_pull call will look for the task_id 'load_config' in its own DAG, where it doesn't exist.
So change the xcom_pull macro to:
subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config', dag_id='" + PARENT_DAG_NAME + "') }}'"),
If the filename you are writing to is not dynamic (e.g. you are writing over the same file over and over again for each task instance), Jaime's answer will work:
file_path = "/path/to/generated/file"
But if you need a unique filename, or want different content written to the file by each task instance for tasks executed in parallel, Airflow will not work for this case, since there is no way to pass the execution date or a variable outside of a template. Take a look at this post.
Take a look at my answer here, in which I describe a way to create a task dynamically based on the results of a previously executed task using xcoms and subdags.
