I am new to Airflow.
I have come across a scenario where a parent DAG needs to pass some dynamic number (let's say n) to a sub DAG, which will then use this number to dynamically create n parallel tasks.
The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of ways:
Option 1 (using XCom pull)
I tried to pass the number as an XCom value, but for some reason the sub DAG does not resolve it to the passed value.
Parent DAG file:
def load_dag(**kwargs):
    number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
    dag_data = json.dumps({
        "number_of_runs": number_of_runs
    })
    return dag_data
# ------------------ Tasks ------------------------------
load_config = PythonOperator(
    task_id='load_config',
    provide_context=True,
    python_callable=load_dag,
    dag=dag)

t1 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args,
                   "'{{ ti.xcom_pull(task_ids='load_config') }}'"),
    default_args=default_args,
    dag=dag,
)
Sub DAG file:
def sub_dag(parent_dag_name, child_dag_name, args, num_of_runs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=None)

    variable_names = {}
    for i in range(num_of_runs):
        variable_names['task' + str(i + 1)] = DummyOperator(
            task_id='dummy_task_' + str(i + 1),  # task ids must be unique within a DAG
            dag=dag_subdag,
        )
    return dag_subdag
Option 2
I also tried to pass number_of_runs as a global variable, which did not work.
Option 3
We also tried writing this value to a data file, but the sub DAG throws a "file doesn't exist" error, probably because we generate this file dynamically.
Can someone help me with this?
I've done it with Option 3. The key is to return a valid DAG with no tasks if the file does not exist. So load_config will generate a file with your number of tasks, or more information if needed. Your subdag factory would look something like:
def subdag(...):
    sdag = DAG('%s.%s' % (parent, child), default_args=args, schedule_interval=timedelta(hours=1))
    file_path = "/path/to/generated/file"
    if os.path.exists(file_path):
        with open(file_path) as data_file:
            list_tasks = data_file.readlines()
        for task in list_tasks:
            DummyOperator(
                task_id='task_' + task.strip(),  # strip the newline so the task id is valid
                default_args=args,
                dag=sdag,
            )
    return sdag
At DAG generation you will see a subdag with no tasks. At DAG execution, after load_config is done, you can see your dynamically generated subdag.
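For completeness, here is a hedged sketch of what the load_config task's callable might do; the file path, conf key, and function name are assumptions, not from the original post. It writes one line per task for the factory above to read back:

def load_config_fn(**kwargs):
    # Sketch only: derive the task count from the trigger conf and persist it.
    number_of_runs = int(kwargs['dag_run'].conf['number_of_runs'])
    with open("/path/to/generated/file", "w") as data_file:
        for i in range(number_of_runs):
            data_file.write("%d\n" % (i + 1))  # one task name per line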
Option 1 should work if you just change the call to xcom_pull to include the dag_id of the parent DAG. By default, the xcom_pull call will look for the task_id 'load_config' in its own DAG, which doesn't exist.
So change the XCom call macro to:
subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args,
               "'{{ ti.xcom_pull(task_ids='load_config', dag_id='" + PARENT_DAG_NAME + "') }}'"),
If the filename you are writing to is not dynamic (e.g. you are writing over the same file over and over again for each task instance), Jaime's answer will work:
file_path = "/path/to/generated/file"
But if you need a unique filename, or want different content written to the file by each task instance for tasks executed in parallel, Airflow will not work for this case, since there is no way to pass the execution date or a variable outside of a template. Take a look at this post.
Take a look at my answer here, in which I describe a way to create a task dynamically based on the results of a previously executed task using xcoms and subdags.
I am trying to customize SSHOperator as a CustomSSHOperator, because I need to assign dynamic values to the ssh_conn_id and pool variables of SSHOperator. However, these two are not in template_fields, so I've created a custom class like below:
class CustomSSHOperator(SSHOperator):
    template_fields: Sequence[str] = ('command', 'remote_host', 'ssh_conn_id', 'pool')
    template_fields_renderers = {"command": "bash", "remote_host": "str", "ssh_conn_id": "str", "pool": "str"}

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
And I'm creating the DAG like below:
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"

with DAG(dag_id="my_dag", tags=["Project", "Team"],
         start_date=datetime(2022, 9, 27), schedule_interval=None,
         ) as dag:
    tasks = []
    vm1_task = CustomSSHOperator(
        task_id='vm1_task',
        # ssh_conn_id='vm-112',
        # pool='vm-112',
        ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
        pool=VM_POOL.format(vm_name="{{dag_run.conf['vm1']}}"),
        get_pty=True,
        command="d=$(date) && echo $d > my_file.txt"
    )
    vm2_task = CustomSSHOperator(
        task_id='vm2_task',
        # ssh_conn_id='vm-140',
        # pool='vm-140',
        ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
        pool=VM_POOL.format(vm_name="{{dag_run.conf['vm2']}}"),
        get_pty=True,
        command="d=$(date) && echo $d > my_file.txt"
    )
Basically, I can see the rendered values in the UI. However, my tasks just keep waiting, as in the image.
I should also point out that if I change the DAG as below (populating the pool variable statically; ssh_conn_id is still dynamic), it works:
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"

with DAG(dag_id="my_dag", tags=["Project", "Team"],
         start_date=datetime(2022, 9, 27), schedule_interval=None,
         ) as dag:
    tasks = []
    vm1_task = CustomSSHOperator(
        task_id='vm1_task',
        # ssh_conn_id='vm-112',
        pool='vm-112',
        ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
        # pool=VM_POOL.format(vm_name="{{dag_run.conf['vm1']}}"),
        get_pty=True,
        command="d=$(date) && echo $d > my_file.txt"
    )
    vm2_task = CustomSSHOperator(
        task_id='vm2_task',
        # ssh_conn_id='vm-140',
        pool='vm-140',
        ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
        # pool=VM_POOL.format(vm_name="{{dag_run.conf['vm2']}}"),
        get_pty=True,
        command="d=$(date) && echo $d > my_file.txt"
    )
The dag_run.conf parameter is {"vm1": "112", "vm2": "140"}.
I couldn't find the reason. I'd appreciate any suggestions.
Template fields are rendered after the task has been scheduled, while the task pool field is used before the task is scheduled (by the Airflow scheduler itself).
This is the reason why a template cannot be used for the pool field. See also this discussion.
What is happening in your case is that the task remains stuck in the scheduled state because it is associated with a non-existent pool (it is actually vm-{{dag_run.conf['vm1']}}, i.e. evaluated before rendering).
You should have evidence of this in the scheduler logs:
Tasks using non-existent pool 'vm-{{dag_run.conf['vm1']}}' will not be scheduled
As proof, you can create a new pool named exactly vm-{{dag_run.conf['vm1']}} and you will see that the task gets executed.
Only later is the pool field rendered, which is why you see the expected rendered values in the UI. But that's not what the scheduler saw.
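If you do need per-VM pools in practice, here is a minimal workaround sketch, assuming the set of VMs is known when the DAG file is parsed (the names below are hypothetical):

KNOWN_VMS = ["112", "140"]  # assumption: fixed at parse time, not read from dag_run.conf

with DAG(dag_id="my_dag", start_date=datetime(2022, 9, 27), schedule_interval=None) as dag:
    for vm in KNOWN_VMS:
        CustomSSHOperator(
            task_id=f"vm{vm}_task",
            pool=f"vm-{vm}",         # a static string the scheduler can match to an existing pool
            ssh_conn_id=f"vm-{vm}",  # no template needed once the VM is fixed at parse time
            get_pty=True,
            command="d=$(date) && echo $d > my_file.txt",
        )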
I have a sensor task that listens to files being created in S3.
After a poke I may have 3 files, after another poke I might have another 5 files.
I want to create a DAG (or multiple DAGs) that listens to work requests and creates other tasks or DAGs to handle that amount of work.
I wish I could access the xcom or dag_run variable from the DAG definition (see the pseudo-code that follows):
def wait_for_s3_data(ti, **kwargs):
    s3_wrapper = S3Wrapper()
    work_load = s3_wrapper.work()
    # work_load: {"filename1.json": "s3/key/filename1.json", ....}
    ti.xcom_push(key="work_load", value=work_load)
    return len(work_load) > 0

def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    dag_run.conf['work_load'] = work_load
    s3_wrapper.move_messages_from_waiting_to_processing(work_load)
with DAG(
    "ListenAndCallWorkers",
    description="This DAG waits for work requests from s3",
    schedule_interval="@once",
    max_active_runs=1,
) as dag:
    wait_for_s3_data: PythonSensor = PythonSensor(
        task_id="wait_for_s3_data",
        python_callable=wait_for_s3_data,
        timeout=60,
        poke_interval=30,
        retries=2,
        mode="reschedule",
    )
    get_data_task = PythonOperator(
        task_id="GetData",
        python_callable=query.get_work,
        provide_context=True,
    )
    work_load = "{{ dag_run.conf['work_load'] }}"  # <--- I WISH I COULD DO THIS
    do_work_tasks = [
        TriggerDagRunOperator(
            task_id=f"TriggerDoWork_{work}",
            trigger_dag_id="Work",  # Ensure this equals the dag_id of the DAG to trigger
            conf={"work": keypath},
        )
        for work, keypath in work_load.items()
    ]
    wait_for_s3_data >> get_data_task >> do_work_tasks
I know I cannot do that.
I also tried to define my own custom MultiTriggerDAG object (as in https://stackoverflow.com/a/51790697/1494511), but at that step I still don't have access to the amount of work that needs to be done.
Another idea:
I am considering building a DAG with N doWork tasks and passing work to up to N of them via XCom:
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    i = 1
    for work, keypath in work_load.items():
        dag_run.conf[f'work_{i}'] = keypath
        i += 1
        if i > N:
            break
    first_n = dict(list(work_load.items())[:N])  # a dict cannot be sliced directly
    s3_wrapper.move_messages_from_waiting_to_processing(first_n)
This idea would get the job done, but it sounds very inefficient
Related questions:
This is the same question as I have, but no code is presented on how to solve it:
Airflow: Proper way to run DAG for each file
This answer looks like it would solve the problem, but it seems to be related to Airflow versions lower than 2.2.2
How do we trigger multiple airflow dags using TriggerDagRunOperator?
I have a dag called my_dag.py that utilizes the S3KeySensor in Airflow 2 to check if an s3 key exists. When I use the sensor directly inside the dag, it works:
with TaskGroup('check_exists') as check_exists:
    path = 's3://my-bucket/data/my_file'
    poke_interval = 30
    timeout = 60 * 60
    mode = 'reschedule'
    dependency_name = 'my_file'

    S3KeySensor(
        task_id='check_' + dependency_name + '_exists',
        bucket_key=path,
        poke_interval=poke_interval,
        timeout=timeout,
        mode=mode
    )
The log of the above looks like:
[2022-05-03, 19:51:26 UTC] {s3.py:105} INFO - Poking for key : s3://my-bucket/data/my_file
[2022-05-03, 19:51:26 UTC] {base_aws.py:90} INFO - Retrieving region_name from Connection.extra_config['region_name']
[2022-05-03, 19:51:27 UTC] {taskinstance.py:1701} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
This is correct. The reschedule is expected, because the file does not exist yet.
However, I want to check any number of paths in other dags, so I moved the sensor into a function called test in another file called helpers.py. I use a PythonOperator in my_dag.py within the task group that calls test. It looks like this:
with TaskGroup('check_exists') as check_exists:
    path = 's3://my-bucket/data/my_file'
    dependency_name = 'my_file'

    wait_for_dependencies = PythonOperator(
        task_id='wait_for_my_file',
        python_callable=test,
        op_kwargs={
            'dependency_name': dependency_name,
            'path': path
        },
        dag=dag
    )

    wait_for_dependencies
The function test in helpers.py looks like:
def test(dependency_name, path, poke_interval=30, timeout=60 * 60, mode='reschedule'):
    S3KeySensor(
        task_id='check_' + dependency_name + '_exists',
        bucket_key=path,
        poke_interval=poke_interval,
        timeout=timeout,
        mode=mode
    )
However, when I run the dag, the step is marked as success even though the file is not there. The logs show:
[2022-05-03, 20:07:54 UTC] {python.py:175} INFO - Done. Returned value was: None
[2022-05-03, 20:07:54 UTC] {taskinstance.py:1282} INFO - Marking task as SUCCESS.
It seems Airflow doesn't like using a sensor via a PythonOperator. Is this true? Or am I doing something wrong?
My goal is to loop through multiple paths and check if each one exists. However, I do this in other dags, which is why I'm putting the sensor in a function that resides in another file.
If there are alternative ideas to doing this, I'm open!
Thanks for your help!
This will not work as you expect.
You created a case of an operator inside an operator. See this answer for information about what this means.
In your case you wrapped the S3KeySensor with a PythonOperator. This means that when the PythonOperator runs, it only executes the init function of S3KeySensor; it doesn't invoke the logic of the operator itself.
Using an operator inside an operator is bad practice.
Your case is even more extreme, as you are trying to use a sensor inside an operator. Sensors need to invoke the poke() function for every poking cycle.
To simplify: you cannot enjoy the power of a sensor with mode='reschedule' when you set it up this way, because reschedule means you want to release the worker when the condition is not yet met, and the PythonOperator doesn't know how to do that.
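In miniature, here is what the wrapped version actually executes; the comments map it to the log lines in the question:

def test(dependency_name, path, poke_interval=30, timeout=60 * 60, mode='reschedule'):
    sensor = S3KeySensor(
        task_id='check_' + dependency_name + '_exists',
        bucket_key=path,
        poke_interval=poke_interval,
        timeout=timeout,
        mode=mode,
    )
    # Only __init__ has run; sensor.poke(context) is never invoked, and the
    # function returns None, so the PythonOperator logs
    # "Done. Returned value was: None" and the task is marked SUCCESS.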
How to solve your issue:
Option 1:
From the code you showed you can simply do:
with TaskGroup('check_exists') as check_exists:
    path = 's3://my-bucket/data/my_file'
    dependency_name = 'my_file'

    S3KeySensor(
        task_id='check_' + dependency_name + '_exists',
        bucket_key=path,
        poke_interval=30,
        timeout=60 * 60,
        mode='reschedule'
    )
I didn't see a reason why this can't work for you.
Option 2:
If for some reason option 1 is not good for you, then create a custom sensor that also accepts dependency_name and path, and use it like any other operator.
I didn't test it but something like the following should work:
class MyS3KeySensor(S3KeySensor):
    def __init__(
        self,
        *,
        dependency_name: str = None,
        path: str = None,
        **kwargs,
    ):
        super().__init__(
            task_id='check_' + dependency_name + '_exists',  # derive the task id
            bucket_key=path,  # the full s3:// URI belongs in bucket_key
            **kwargs,
        )
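A hypothetical usage, mirroring the original task group (untested, same caveat as above):

with TaskGroup('check_exists') as check_exists:
    MyS3KeySensor(
        dependency_name='my_file',
        path='s3://my-bucket/data/my_file',
        poke_interval=30,
        timeout=60 * 60,
        mode='reschedule'
    )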
I'm having issues with the Airflow 1.10 BranchPythonOperator. I have a DAG that scans a cloud bucket and processes files if found. If a file is missing, it hits the no_file_found dummy operator and completes; otherwise it moves forward to some parsing steps.
With a single file this workflow works great. My issue arises when I add the same logic for a second file. Currently the check_for_Post_Performance task returns clean_headers_for_GCM, and I'm at a total loss how that happens. From the outline below it should have only two paths forward: clean_headers_for_Post_Performance or no_file_found.
I create these tasks dynamically from a list of file names. I loop through each filename and build the following operators:
def build_check(filename):
    return BranchPythonOperator(
        task_id=f'check_for_{file_name}'.replace(' ', '_'),
        python_callable=check_file_exists,
        op_kwargs={'filename': filename},
        provide_context=True,
        dag=dag
    )

def check_file_exists(filename, **context):
    xcom_value = context['ti'].xcom_pull(task_ids='list_files')
    if any(filename in s for s in xcom_value):
        return f'clean_headers_for_{file_name}'.replace(' ', '_')
    else:
        return 'no_file_found'
I've checked the rendered task template to confirm that 'Post Performance' is passed for the filename variable, but when looking at the logs I see the following:
[2021-12-02 20:15:56,742] {logging_mixin.py:120} INFO - Running <TaskInstance: example_dag.check_for_Post_Performance 2021-12-02T20:14:50.724084+00:00 [running]> on host 21d0393eb686
[2021-12-02 20:15:56,766] {python_operator.py:114} INFO - Done. Returned value was: clean_headers_for_GCM
[2021-12-02 20:15:56,767] {skipmixin.py:122} INFO - Following branch clean_headers_for_GCM
[2021-12-02 20:15:56,773] {skipmixin.py:158} INFO - Skipping tasks ['no_file_found', 'clean_headers_for_Post_Performance']
My best guess is that the function isn't created on each loop iteration like I think it is, or some trigger rule is tripping me up. How can I have each file in my source list reach either no_file_found or its clean_headers task independently of the others?
EDIT
Here is the code I use to build the tasks from a static list:
for file_name, table_name in FILES().items():
    import_to_bq = import_file(file_name, table_name)
    clean_headers_task = clean_headers(file_name)

    start_import >> list_files >> build_check(file_name) >> [clean_headers_task, no_file]
    clean_headers_task >> import_to_bq >> archive_file(file_name)
Perhaps it's the difference between file_name and filename? The task IDs use file_name while the argument is filename, so inside check_file_exists the name file_name resolves to the module-level loop variable, which by the time the task runs holds the last value the loop assigned (GCM). Both functions should use filename:
def build_check(filename):
    return BranchPythonOperator(
        task_id=f'check_for_{filename}'.replace(' ', '_'),
        python_callable=check_file_exists,
        op_kwargs={'filename': filename},
        provide_context=True,
        dag=dag
    )

def check_file_exists(filename, **context):
    xcom_value = context['ti'].xcom_pull(task_ids='list_files')
    if any(filename in s for s in xcom_value):
        return f'clean_headers_for_{filename}'.replace(' ', '_')
    else:
        return 'no_file_found'
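To see why the original always returned the last file's branch, here is the scoping pitfall in miniature (plain Python, independent of Airflow; illustrative only):

callbacks = [lambda: file_name for file_name in ['Post Performance', 'GCM']]
print(callbacks[0]())  # 'GCM' -- the lambda reads file_name late, after the loop finished

callbacks = [lambda f=file_name: f for file_name in ['Post Performance', 'GCM']]
print(callbacks[0]())  # 'Post Performance' -- a default argument binds per iteration

Passing filename through op_kwargs works for the same reason as the default argument: the value is bound per task at the moment the operator is built.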
I have a DAG, and whenever it succeeds or fails, I want it to trigger a method which posts to Slack.
My DAG args are like below:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack, there are more than 100 messages each minute, as if it were evaluating at each scheduler heartbeat, running both the success and failure methods as if the same task instance had both worked and failed (not fine).
How should I properly use on_failure_callback and on_success_callback to handle DAG statuses and call a custom method?
The reason it's creating the messages is that when you define your default_args, you are executing the functions. You need to pass just the function reference, without executing it.
Since the function has an argument, it'll get a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial

success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
def success_msg(context):
    # Airflow passes a context dict to callbacks, so accept it here
    slack.slack_message(happy_message)

def failure_msg(context):
    slack.slack_message(sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either method, note how just the function reference failure_msg or success_msg is being passed, not the result it gives when executed. One caveat: Airflow invokes callbacks with a context dict. The wrapper version absorbs it explicitly; the partial version will forward it to slack.slack_message as an extra positional argument, so make sure that function tolerates it.
default_args is applied at the task level, so these become per-task callbacks.
To get a single callback per DAG run instead, apply the attribute at the DAG level, outside of default_args.
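A minimal sketch of that, reusing the failure_msg/success_msg helpers from the previous answer (DAG-level callbacks fire once per DAG run):

dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,          # no callbacks in here
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False,
    on_failure_callback=failure_msg,    # called once if the DAG run fails
    on_success_callback=success_msg,    # called once if the DAG run succeeds
)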
What is the slack method you are referring to? The scheduler parses your DAG file every heartbeat, so if the slack function is called somewhere at the top level of your code, it is going to get run every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that will trigger based on failure or success of the parent task; see the sketch after the docs quote below.
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
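And a hedged sketch of that trigger-rule approach (etl_task and the slack helpers are assumptions carried over from the question, not a drop-in snippet):

notify_success = PythonOperator(
    task_id='notify_success',
    python_callable=lambda: slack.slack_message(happy_message),
    trigger_rule='all_success',  # runs only if etl_task succeeded
    dag=dag,
)
notify_failure = PythonOperator(
    task_id='notify_failure',
    python_callable=lambda: slack.slack_message(sad_message),
    trigger_rule='one_failed',   # runs only if etl_task failed
    dag=dag,
)
etl_task >> [notify_success, notify_failure]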