I'm trying to customize SSHOperator as CustomSSHOperator, because I need to assign dynamic values to the ssh_conn_id and pool parameters of SSHOperator. However, these two are not in template_fields, so I've created a custom class like the one below:
class CustomSSHOperator(SSHOperator):
    template_fields: Sequence[str] = ('command', 'remote_host', 'ssh_conn_id', 'pool')
    template_fields_renderers = {"command": "bash", "remote_host": "str", "ssh_conn_id": "str", "pool": "str"}

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
And I'm creating dag like below
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"

with DAG(dag_id="my_dag", tags=["Project", "Team"],
         start_date=datetime(2022, 9, 27), schedule_interval=None,
         ) as dag:
    tasks = []
    vm1_task = CustomSSHOperator(task_id='vm1_task',
                                 # ssh_conn_id='vm-112',
                                 # pool='vm-112',
                                 ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
                                 pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
                                 get_pty=True,
                                 command="d=$(date) && echo $d > my_file.txt"
                                 )
    vm2_task = CustomSSHOperator(task_id='vm2_task',
                                 # ssh_conn_id='vm-140',
                                 # pool='vm-140',
                                 ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
                                 pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
                                 get_pty=True,
                                 command="d=$(date) && echo $d > my_file.txt"
                                 )
Basically, I can see the rendered values in the UI. However, my tasks stay waiting, as in the image.
I should also point out that if I change the DAG as below (populating only pool with a static value, while ssh_conn_id is still dynamic), it works:
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"

with DAG(dag_id="my_dag", tags=["Project", "Team"], start_date=datetime(2022, 9, 27), schedule_interval=None,) as dag:
    tasks = []
    vm1_task = CustomSSHOperator(task_id='vm1_task',
                                 # ssh_conn_id='vm-112',
                                 pool='vm-112',
                                 ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
                                 # pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
                                 get_pty=True,
                                 command="d=$(date) && echo $d > my_file.txt"
                                 )
    vm2_task = CustomSSHOperator(task_id='vm2_task',
                                 # ssh_conn_id='vm-140',
                                 pool='vm-140',
                                 ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
                                 # pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
                                 get_pty=True,
                                 command="d=$(date) && echo $d > my_file.txt"
                                 )
The dag_run.conf parameter is {"vm1": "112", "vm2": "140"}
I couldn't find the reason. I'd appreciate any suggestions.
Template fields are rendered after the task has been scheduled, while the task pool field is used before the task is scheduled (by the Airflow scheduler itself).
This is the reason why a template cannot be used for the pool field. See also this discussion.
What is happening in your case is that the task remains stuck in the scheduled state because it is associated with a non-existent pool: the scheduler sees the literal, unrendered string vm-{{dag_run.conf['vm1']}}, because it reads the pool before rendering takes place.
You should have evidence of this in the scheduler logs:
Tasks using non-existent pool 'vm-{{dag_run.conf['vm1']}}' will not be scheduled
As a proof, you can create a new pool named exactly vm-{{dag_run.conf['vm1']}} and you will see that the task will be executed.
Only later is the pool field rendered, which is why you see the expected rendered values in the UI. But that is not what the scheduler saw.
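The gap between what the scheduler reads and what is rendered later can be reproduced with plain Python. This is only a minimal sketch: VM_POOL comes from the question, while existing_pools is a hypothetical set standing in for the pools defined in Airflow.

```python
# Pool name as built at DAG-parse time. str.format fills {vm_name} only;
# the Jinja expression is inserted as plain text and is not evaluated here.
VM_POOL = "vm-{vm_name}"
pool_seen_by_scheduler = VM_POOL.format(vm_name="{{dag_run.conf['vm1']}}")

# Hypothetical set of pools defined in Airflow (assumption for illustration).
existing_pools = {"default_pool", "vm-112", "vm-140"}

print(pool_seen_by_scheduler)
print(pool_seen_by_scheduler in existing_pools)
```

The literal string still contains the braces, so it matches none of the real pool names.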
Related
I have a sensor task that listens to files being created in S3.
After a poke I may have 3 files, after another poke I might have another 5 files.
I want to create a DAG (or multiple DAGs) that listens for work requests and creates other tasks or DAGs to handle that amount of work.
I wish I could access the xcom or dag_run variable from the DAG definition (see pseudo-code as follows):
def wait_for_s3_data(ti, **kwargs):
    s3_wrapper = S3Wrapper()
    work_load = s3_wrapper.work()
    # work_load: {"filename1.json": "s3/key/filename1.json", ....}
    ti.xcom_push(key="work_load", value=work_load)
    return len(work_load) > 0

def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    dag_run.conf['work_load'] = work_load
    s3_wrapper.move_messages_from_waiting_to_processing(work_load)

with DAG(
    "ListenAndCallWorkers",
    description="This DAG waits for work request from s3",
    schedule_interval="@once",
    max_active_runs=1,
) as dag:
    wait_for_s3_data: PythonSensor = PythonSensor(
        task_id="wait_for_s3_data",
        python_callable=wait_for_s3_data,
        timeout=60,
        poke_interval=30,
        retries=2,
        mode="reschedule",
    )
    get_data_task = PythonOperator(
        task_id="GetData",
        python_callable=query.get_work,
        provide_context=True,
    )
    work_load = "{{ dag_run.conf['work_load'] }}"  # <--- I WISH I COULD DO THIS
    do_work_tasks = [
        TriggerDagRunOperator(
            task_id=f"TriggerDoWork_{work}",
            trigger_dag_id="Work",  # Ensure this equals the dag_id of the DAG to trigger
            conf={"work": keypath},
        )
        for work, keypath in work_load.items()
    ]
    wait_for_s3_data >> get_data_task >> do_work_tasks
I know I cannot do that.
I also tried to define my own custom MultiTriggerDAG object (as in https://stackoverflow.com/a/51790697/1494511). But at that step I still don't have access to the amount of work that needs to be done.
Another idea:
I am considering building a DAG with N doWork tasks and passing work to up to N of them via XCom:
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    i = 1
    for work, keypath in work_load.items():
        dag_run.conf[f'work_{i}'] = keypath
        i += 1
        if i > N:
            break
    # A dict cannot be sliced directly, so take the first N items explicitly
    s3_wrapper.move_messages_from_waiting_to_processing(dict(list(work_load.items())[:N]))
This idea would get the job done, but it sounds very inefficient
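The "at most N items" step of this idea can be sketched in isolation with itertools.islice. The helper name take_first_n and the sample keys are hypothetical, chosen only for illustration.

```python
from itertools import islice


def take_first_n(work_load: dict, n: int) -> dict:
    """Return a new dict holding at most the first n items of work_load."""
    return dict(islice(work_load.items(), n))


# Hypothetical work load in the same shape as the question's example.
work_load = {
    "a.json": "s3/key/a.json",
    "b.json": "s3/key/b.json",
    "c.json": "s3/key/c.json",
}
batch = take_first_n(work_load, 2)
print(batch)
```

Because dicts preserve insertion order, the batch holds the first two entries and the remaining one stays available for a later run.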
Related questions:
This is the same question as I have, but no code is presented on how to solve it:
Airflow: Proper way to run DAG for each file
This answer looks like it would solve the problem, but it seems to be related to Airflow versions lower than 2.2.2
How do we trigger multiple airflow dags using TriggerDagRunOperator?
I've been looking at ways to access the DAG run config JSON and build my actual DAG and underlying tasks dynamically depending on what's there.
As Jinja templating is somewhat limited for my use, I've opted to use 'vanilla' Python, using functions to build out my tasks.
The backbone of all this is being able to access the config JSON, which I found out how to do here: https://stackoverflow.com/a/68455786/5687904
However, as I am using Airflow 1.10.12 (Composer 1.13.3), I had to edit the above a bit to use older/deprecated attributes instead, so what I ended up with is:
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf
I got this to work in a new DAG for testing; here is a minimal working example with any private data stripped:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.models import Variable
from dependencies.airflow_utils import (
    DBT_IMAGE
)
from dependencies.kube_secrets import (
    GIT_DATA_TESTS_PRIVATE_KEY
)
# Default arguments for the DAG
default_args = {
    "depends_on_past": False,
    "owner": "airflow",
    "retries": 0,
    "start_date": datetime(2021, 5, 7, 0, 0, 0),
    'dataflow_default_options': {
        'project': 'my-gcp_project',
        'region': 'europe-west1'
    }
}
# Create the DAG
dag = DAG("test_conf_strings2", default_args=default_args, schedule_interval=None)
# DBT task creation function
conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf
def dynamic_full_refresh_strings(conf, arguments):
    if conf.get("full-refresh") and 'dbt snapshot' in arguments:
        return ' --vars "full-refresh: true"'
    elif conf.get("full-refresh"):
        return conf.get("full-refresh")
    else:
        return ""

def task_dbt_run(conf, name, arguments, **kwargs):
    return KubernetesPodOperator(
        image=DBT_IMAGE,
        task_id="dbt_run_{}".format(name),
        name="dbt_run_{}".format(name),
        secrets=[
            GIT_DATA_TESTS_PRIVATE_KEY,
        ],
        startup_timeout_seconds=540,
        arguments=[arguments + dynamic_full_refresh_strings(conf, arguments)],
        dag=dag,
        get_logs=True,
        image_pull_policy="Always",
        resources={"request_memory": "512Mi", "request_cpu": "250m"},
        retries=3,
        namespace="default",
        cmds=["/bin/bash", "-c"]
    )
# DBT commands
dbt_bqtoscore = f"""
{clone_repo_simplified_cmd} &&
cd bigqueryprocessing/data &&
dbt run --profiles-dir .dbt --models execution_engine_filter"""
# Create all tasks for the dag
dbt_run_bqtoscore = task_dbt_run(conf, "bqtoscore", dbt_bqtoscore)
# Task dependencies setting
dbt_run_bqtoscore
However, when I tried adding this logic to my main DAG I started getting 'NoneType' object has no attribute 'get'.
After checking everything like a madman and doing a lot of diffchecker I confirmed there is no difference.
To ensure I am not going entirely crazy I even copied my working testing DAG and just changed the name to something else so it doesn't conflict with the original.
I got the error again, for essentially 1:1 copy of the dag!
So what's happening here judging by the error is that the same code for conf = dag.get_dagrun(execution_date=dag.latest_execution_date).conf produces different results in dags whose only difference is the dag name.
In my working tests I get the correct JSON I pass or simply {} if nothing is passed hence no error.
But in the erroring ones it is a None which causes the issue.
Does anybody have any ideas what might be happening here?
Or at least ideas of what tests/debugging I should do to dig deeper?
Add a PythonOperator task before the main task that calculates what dynamic_full_refresh_strings returns, and pass that result from the first task to the second (using XCom push/pull, setting it in dag_run.conf, or any other way).
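As a rough sketch of what that first task's callable could compute, the helper below mirrors the question's dynamic_full_refresh_strings but tolerates a conf of None. The name full_refresh_suffix is hypothetical, and converting the value with str() is an assumption rather than something the original code does.

```python
from typing import Optional


def full_refresh_suffix(conf: Optional[dict], arguments: str) -> str:
    """Return the extra text to append to a dbt command, based on the run's conf."""
    conf = conf or {}  # conf may be None, so fall back to an empty dict
    flag = conf.get("full-refresh")
    if flag and "dbt snapshot" in arguments:
        return ' --vars "full-refresh: true"'
    if flag:
        return str(flag)  # assumption: the value is meant to be appended as text
    return ""


print(repr(full_refresh_suffix(None, "dbt run")))
print(repr(full_refresh_suffix({"full-refresh": True}, "dbt snapshot")))
```

Called from a python_callable with the run's dag_run.conf, the returned string could then be handed to the downstream task through XCom.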
We typically start Airflow DAGs with the trigger_dag CLI command. For example:
airflow trigger_dag my_dag --conf '{"field1": 1, "field2": 2}'
We access this conf in our operators using context['dag_run'].conf
Sometimes when the DAG breaks at some task, we'd like to "update" the conf and restart the broken task (and downstream dependencies) with this new conf. For example:
new conf --> {"field1": 3, "field2": 4}
Is it possible to “update” the dag_run conf with a new json string like this?
Would be interested in hearing thoughts on this, other solutions, or potentially ways to avoid this situation to begin with.
Working with Apache Airflow v1.10.3
Thank you very much in advance.
Updating conf after a dag run has been created isn't as straightforward as reading from conf, because conf is read from the dag_run metadata table whenever it's used after a dag run has been created. While Variables have methods to both write to and read from a metadata table, dag runs only let you read.
I agree that Variables are a useful tool, but when you have k=v pairs that you only want to use for a single run, it gets complicated and messy.
Below is an operator that will let you update a dag_run's conf after instantiation (tested in v1.10.10):
#! /usr/bin/env python3
"""Operator to overwrite a dag run's conf after creation."""
import os
from typing import Dict

from airflow.models import BaseOperator
from airflow.utils.db import provide_session
from airflow.utils.decorators import apply_defaults
from airflow.utils.operator_helpers import context_to_airflow_vars


class UpdateConfOperator(BaseOperator):
    """Updates an existing DagRun's conf with `given_conf`.

    Args:
        given_conf: A dictionary of k:v values to update a DagRun's conf with. Templated.
        replace: Whether or not `given_conf` should replace conf (True)
            or be used to update the existing conf (False).
            Defaults to True.
    """

    template_fields = ("given_conf",)
    ui_color = "#ffefeb"

    @apply_defaults
    def __init__(self, given_conf: Dict, replace: bool = True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.given_conf = given_conf
        self.replace = replace

    @staticmethod
    def update_conf(given_conf: Dict, replace: bool = True, **context) -> None:
        @provide_session
        def save_to_db(dag_run, session):
            session.add(dag_run)
            session.commit()
            dag_run.refresh_from_db()

        dag_run = context["dag_run"]
        # When there's no conf provided,
        # conf will be None if scheduled or {} if manually triggered
        if replace or not dag_run.conf:
            dag_run.conf = given_conf
        elif dag_run.conf:
            # Note: dag_run.conf.update(given_conf) doesn't work
            dag_run.conf = {**dag_run.conf, **given_conf}
        save_to_db(dag_run)

    def execute(self, context):
        # Export context to make it available for callables to use.
        airflow_context_vars = context_to_airflow_vars(context, in_env_var_format=True)
        self.log.debug(
            "Exporting the following env vars:\n%s",
            "\n".join(["{}={}".format(k, v) for k, v in airflow_context_vars.items()]),
        )
        os.environ.update(airflow_context_vars)
        self.update_conf(given_conf=self.given_conf, replace=self.replace, **context)
Example usage:
CONF = {"field1": 3, "field2": 4}
with DAG(
    "some_dag",
    # schedule_interval="*/1 * * * *",
    schedule_interval=None,
    max_active_runs=1,
    catchup=False,
) as dag:
    t_update_conf = UpdateConfOperator(
        task_id="update_conf", given_conf=CONF,
    )
    t_print_conf = BashOperator(
        task_id="print_conf",
        bash_command="echo {{ dag_run['conf'] }}",
    )
    t_update_conf >> t_print_conf
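The effect of the replace flag can be seen with plain dictionaries, without a database. This is only a sketch of the branching in update_conf; merge_conf is a hypothetical helper, not part of the operator.

```python
from typing import Optional


def merge_conf(existing: Optional[dict], given: dict, replace: bool = True) -> dict:
    """Mirror update_conf's branching: replace outright, or merge over the old conf."""
    if replace or not existing:
        return dict(given)
    # Build a new dict; keys in `given` win over keys in `existing`.
    return {**existing, **given}


old_conf = {"field1": 1, "field2": 2}
print(merge_conf(old_conf, {"field1": 3}, replace=True))
print(merge_conf(old_conf, {"field1": 3}, replace=False))
```

With replace=True only the new keys survive; with replace=False the old field2 is kept alongside the updated field1.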
This seems like a good use-case of Airflow Variables. If you were to read your configs from Variables you can easily see and modify the configuration inputs from the Airflow UI itself.
You can even get creative and automate updating that config (which is now stored in a Variable) before re-running a task or DAG, via another Airflow task. See "With code, how do you update and airflow variable".
I have a DAG and, whenever it succeeds or fails, I want it to trigger a method which posts to Slack.
My DAG args are as below:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack there are more than 100 messages each minute, as if the callbacks were being evaluated at every scheduler heartbeat. For every log entry, both the success and the failure method ran, as if the same task instance had both succeeded and failed (which is not right).
How should I properly use the on_failure_callback and on_success_callback to handle dags statuses and call a custom method?
The reason it's creating the messages is because when you are defining your default_args, you are executing the functions. You need to just pass the function definition without executing it.
Since the function has an argument, it'll get a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial

success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
# Airflow passes the task context to the callback, so the wrappers accept it
def success_msg(context):
    slack.slack_message(happy_message)

def failure_msg(context):
    slack.slack_message(sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either method, note how just the function definition failure_msg and success_msg are being passed, not the result they give when executed.
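The difference between calling a function and passing it can be checked without Airflow or Slack. In this sketch, slack_message is a hypothetical stand-in that records messages in a list instead of posting them, and the context dict is invented for illustration.

```python
from functools import partial

sent = []  # stands in for messages posted to Slack


def slack_message(text, context=None):
    """Hypothetical stand-in for slack.slack_message; records instead of posting."""
    sent.append(text)


# Calling the function while building the dict runs it immediately,
# and the dict ends up holding its return value (None), not a callback.
called_args = {"on_failure_callback": slack_message("sent at definition time")}

# Passing a partial stores a callable; nothing is sent until it is invoked.
deferred_args = {"on_failure_callback": partial(slack_message, "sent on failure")}
messages_before_failure = list(sent)

# Simulate the callback being invoked with a context dict when a task fails.
deferred_args["on_failure_callback"]({"dag_run": "hypothetical"})
print(called_args["on_failure_callback"], messages_before_failure, sent)
```

Only the first form sends a message every time the file is evaluated, which matches the flood of messages described in the question.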
default_args is applied at the task level, so a callback set there becomes a per-task callback.
To get a DAG-level callback, set the attribute on the DAG itself, outside of default_args.
What is the slack method you are referring to? The scheduler parses your DAG file every heartbeat, so if slack is some function defined in your code, it is going to get run every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that will trigger based on failure or success of the parent task.
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
I am new to Airflow.
I have come across a scenario where a parent DAG needs to pass some dynamic number (let's say n) to a SubDAG, and the SubDAG will use this number to dynamically create n parallel tasks.
The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of ways:
Option - 1 (Using XCom pull)
I have tried to pass it as an XCom value, but for some reason the SubDAG is not resolving to the passed value.
Parent Dag File
def load_dag(**kwargs):
    number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
    dag_data = json.dumps({
        "number_of_runs": number_of_runs
    })
    return dag_data

# ------------------ Tasks ------------------------------
load_config = PythonOperator(
    task_id='load_config',
    provide_context=True,
    python_callable=load_dag,
    dag=dag)

t1 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config') }}'" ),
    default_args=default_args,
    dag=dag,
)
Sub Dag File
def sub_dag(parent_dag_name, child_dag_name, args, num_of_runs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=None)
    variabe_names = {}
    for i in range(num_of_runs):
        variabe_names['task' + str(i + 1)] = DummyOperator(
            task_id='dummy_task',
            dag=dag_subdag,
        )
    return dag_subdag
Option - 2
I have also tried to pass number_of_runs as a global variable, which did not work.
Option - 3
We also tried to write this value to a data file, but the SubDAG throws a "File doesn't exist" error. This might be because we are generating this file dynamically.
Can someone help me with this?
I've done it with Option 3. The key is to return a valid dag with no tasks, if the file does not exist. So load_config will generate a file with your number of tasks or more information if needed. Your subdag factory would look something like:
def subdag(...):
    sdag = DAG('%s.%s' % (parent, child), default_args=args, schedule_interval=timedelta(hours=1))
    file_path = "/path/to/generated/file"
    if os.path.exists(file_path):
        with open(file_path) as data_file:
            list_tasks = data_file.readlines()
        for task in list_tasks:
            DummyOperator(
                task_id='task_' + task.strip(),  # strip the trailing newline
                default_args=args,
                dag=sdag,
            )
    return sdag
At DAG generation you will see a subdag with no tasks. At DAG execution, after load_config is done, you can see your dynamically generated subdag.
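The "return something valid even when the file is missing" part of this approach can be tried on its own. This is a minimal sketch; read_task_names and the tasks.txt file name are hypothetical, and a temporary directory stands in for the real location.

```python
import os
import tempfile


def read_task_names(file_path: str) -> list:
    """Return one task name per non-empty line, or [] if the file is missing."""
    if not os.path.exists(file_path):
        return []
    with open(file_path) as data_file:
        return [line.strip() for line in data_file if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tasks.txt")
    before = read_task_names(path)  # file not generated yet
    with open(path, "w") as out:
        out.write("a\nb\n\n")
    after = read_task_names(path)  # file now exists

print(before, after)
```

An empty list before the file exists corresponds to the subdag with no tasks; once the file is written, each line yields one task name.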
Option 1 should work if you just change the call to xcom_pull to include the dag_id of the parent DAG. By default, the xcom_pull call looks for the task_id 'load_config' in its own DAG, where it doesn't exist.
So change the xcom_pull macro to:
subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config', dag_id='" + PARENT_DAG_NAME + "') }}'" ),
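Building the template string in a separate variable makes it easier to check that the quotes and parentheses balance before it is handed to the subdag factory. A small sketch, where the value of PARENT_DAG_NAME is hypothetical:

```python
PARENT_DAG_NAME = "parent_dag"  # hypothetical DAG id, for illustration only

# Concatenate the parent DAG id into the Jinja expression as plain text.
xcom_template = (
    "'{{ ti.xcom_pull(task_ids='load_config', dag_id='"
    + PARENT_DAG_NAME
    + "') }}'"
)
print(xcom_template)
```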
If the filename you are writing to is not dynamic (e.g. you are writing over the same file over and over again for each task instance), Jaime's answer will work:
file_path = "/path/to/generated/file"
But if you need a unique filename or want different content written to the file by each task instance for tasks executed in parallel, airflow will not work for this case, since there is no way to pass the execution date or variable outside of a template. Take a look at this post.
Take a look at my answer here, in which I describe a way to create a task dynamically based on the results of a previously executed task using xcoms and subdags.