I am looping over a series of status tests running in Apache Airflow. Under certain conditions I want to publish a message. Something like this is what I have now:
for count, test in enumerate(test_list):
    test = StatusTest(test['_id'], test['status'])
    check_status_task = PythonOperator(
        task_id='run_status_checker_' + str(count),
        python_callable=run_status_checker,
        op_kwargs={'status_test': test},
        provide_context=True,
        xcom_push=True,
        retries=0,
        dag=dag)
    pub_results_task = PythonOperator(
        task_id='pub_results_' + str(count),
        python_callable=pub_result,
        #op_kwargs={'task_id':'run_status_checker_'+str(count)},
        provide_context=True,
        trigger_rule='all_done',
        dag=dag
    )
    check_status_task >> pub_results_task
Code that calls status checker, gets responses, etc:
def run_status_checker(ti, **kwargs):
    status_test_conn = MongoHook(conn_id='test_selector_mongo')
    status_test = kwargs.pop('status_test', None)
    is_up = check_test_status(status_test)
    status, response = is_up
    if status:
        if status_test.already_failed(status_test_conn):
            status_test.status = 'true'
            status_test.update_status(status_test_conn)
            message = {"Test {0} passed".format(status_test.uid)}
            ti.xcom_push(XCOM_CHECK_STATUS_KEY, message)
    else:
        if status_test.already_down(status_test_conn) and 'false' in status_test.status:
            status_test.update_sty_down(status_test_conn, response=response)
        else:
            status_test.status = 'false'
            status_test.update_status(status_test_conn)
            message = {status_test.uid}
            ti.xcom_push(XCOM_CHECK_STATUS_KEY, message)
    status_test_conn.close_conn()
Code that would do message publishing:
def pub_result(dag, ti, **context):
    message = ti.xcom_pull(
        task_ids=context['task_id'],
        key=XCOM_CHECK_STATUS_KEY
    )
    message_con = pika.BlockingConnection(pika.URLParameters(os.environ['BROKERURL']))
    channel = message_con.channel()
    channel.queue_declare(queue='status_test', durable=True, auto_delete=False, exclusive=False)
    channel.basic_publish(
        exchange='',
        routing_key='outage',
        body=json.dumps(message),
        properties=pika.BasicProperties(delivery_mode=2)
    )
    message_con.close()
How do I tell Airflow to only do the publishing part of the workflow if certain conditions are met such as:
If there is a message, publish it (i.e. run the publish task).
If not, don't do anything.
I was thinking I could just check the value in the XCom and publish if there is something, or do nothing if it is empty. However, I wanted to see if there is a proper way to do it in Airflow.
You could use the BranchPythonOperator to create a task that decides whether to call pub_result or a DummyOperator task that does nothing. Here is an example from the Airflow GitHub repo.
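For instance, a minimal sketch of that idea wired into the loop from the question (check_message, the skip_pub_results_* task ids and the Airflow 1.x import paths are my assumptions, not a definitive implementation):

from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator


def check_message(count, **context):
    """Return the task_id to follow, based on whether the checker pushed a message."""
    message = context['ti'].xcom_pull(
        task_ids='run_status_checker_' + str(count),
        key=XCOM_CHECK_STATUS_KEY,
    )
    return 'pub_results_' + str(count) if message else 'skip_pub_results_' + str(count)


for count, test in enumerate(test_list):
    # check_status_task and pub_results_task defined as in the question, except
    # pub_results_task keeps the default trigger rule rather than 'all_done',
    # otherwise it would still run when the branch skips it.
    branch_task = BranchPythonOperator(
        task_id='check_message_' + str(count),
        python_callable=check_message,
        op_kwargs={'count': count},
        provide_context=True,
        dag=dag,
    )
    skip_task = DummyOperator(task_id='skip_pub_results_' + str(count), dag=dag)

    check_status_task >> branch_task >> [pub_results_task, skip_task]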
Related
Tasks that do the same thing were created in one DAG using a for loop. Each of them should branch into one of two downstream tasks depending on that task's own result. However, every branch task created in the for loop pulls the XCom of the last task instead of its own. How can each task created in the for loop return (and have pulled) its own XCom?
Each task a, b, c returns xcom_a, xcom_b, and xcom_c respectively, but the branch tasks all get xcom_c. What should I do?
default_args = {'start_date': days_ago(1)}

dag = DAG(
    dag_id='batch_test',
    default_args=default_args,
    schedule_interval=None)

def count(**context):
    name = context['params']['name']
    dict = {'a': 50,
            'b': 100,
            'c': 150}
    if dict[name] < 100:
        task_id = f'add_{name}'
        return task_id
    elif dict[name] >= 100:
        task_id = f'times_{name}'
        return task_id

def branch(**context):
    task_id = context['ti'].xcom_pull(task_ids=f'count_task_{name}')
    return task_id

def add(**context):
    ans = context['ti'].xcom_pull(task_ids=f'branch_task_{name}')
    ans_dict = {'add_a': 50 + 100,
                'add_b': 100 + 100,
                'add_c': 150 + 100}
    ans = ans_dict[ans]
    return print(ans)

def times(**context):
    ans = context['ti'].xcom_pull(task_ids=f'branch_task_{name}')
    ans_dict = {'times_a': 50 * 100,
                'times_b': 100 * 100,
                'times_c': 150 * 100}
    ans = ans_dict[ans]
    return print(ans)

name_list = ['a', 'b', 'c']

for name in name_list:
    exec_count_task = PythonOperator(
        task_id=f'count_task_{name}',
        python_callable=count,
        provide_context=True,
        params={'name': name},
        dag=dag
    )
    exec_branch_task = BranchPythonOperator(
        task_id=f'branch_task_{name}',
        python_callable=branch,
        provide_context=True,
        dag=dag
    )
    exec_add_count = PythonOperator(
        task_id=f'add_{name}',
        python_callable=add,
        provide_context=True,
        dag=dag
    )
    exec_times_count = PythonOperator(
        task_id=f'times_{name}',
        python_callable=times,
        provide_context=True,
        dag=dag
    )

    exec_count_task >> exec_branch_task >> [exec_add_count, exec_times_count]
I want this:
task_a >> branch_a (BranchPythonOperator, pulls the XCom returned by task_a) >> [task_a1, task_a2]
task_b >> branch_b (BranchPythonOperator, pulls the XCom returned by task_b) >> [task_b1, task_b2]
task_c >> branch_c (BranchPythonOperator, pulls the XCom returned by task_c) >> [task_c1, task_c2]
but I get this:
task_a >> branch_a (BranchPythonOperator, pulls the XCom returned by task_c) >> [task_a1, task_a2]
task_b >> branch_b (BranchPythonOperator, pulls the XCom returned by task_c) >> [task_b1, task_b2]
task_c >> branch_c (BranchPythonOperator, pulls the XCom returned by task_c) >> [task_c1, task_c2]
I'm unable to reproduce the behavior you describe using classic operators and the TaskFlow API. If you are able to add more context and code of what you are actually executing that would be most helpful.
In the meantime, here are the examples I used, in case they give you some guidance for troubleshooting. I added a task at the end of each stream to check that the first task indeed pushes its expected value.
Classic Operators
from pendulum import datetime

from airflow.models import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils.trigger_rule import TriggerRule


with DAG(dag_id="multiple_branch_loop", start_date=datetime(2023, 1, 1), schedule=None):

    def xcom_push(val):
        return val

    def func():
        ...

    def choose(val):
        return f"task_{val}"

    def check_xcom_output_from_first(val, expected_val):
        assert val == expected_val

    stuff = ["a", "b", "c"]
    for i in stuff:
        first = PythonOperator(task_id=f"first_task_{i}", python_callable=xcom_push, op_kwargs={"val": i})
        branch = BranchPythonOperator(task_id=f"branch_{i}", python_callable=choose, op_kwargs={"val": i})
        second = PythonOperator(task_id=f"task_{i}", python_callable=func)
        third = PythonOperator(task_id=f"task_{i}a", python_callable=func)
        check = PythonOperator(
            task_id=f"check_{i}",
            trigger_rule=TriggerRule.ALL_DONE,
            python_callable=check_xcom_output_from_first,
            op_kwargs={"val": first.output, "expected_val": i},
        )

        first >> branch >> [second, third] >> check
The check_* tasks succeed, meaning the first task in a given stream does push its own value and not the last stream's.
TaskFlow API
from pendulum import datetime

from airflow.decorators import dag, task
from airflow.utils.trigger_rule import TriggerRule


@dag(start_date=datetime(2023, 1, 1), schedule=None)
def multiple_branch_loop():
    @task()
    def xcom_push(val):
        return val

    @task()
    def func():
        ...

    @task.branch()
    def choose(val):
        return f"task_{val}"

    @task(trigger_rule=TriggerRule.ALL_DONE)
    def check_xcom_output_from_first(val, expected_val):
        assert val == expected_val

    stuff = ["a", "b", "c"]
    for i in stuff:
        first = xcom_push.override(task_id=f"first_task_{i}")(val=i)
        branch = choose.override(task_id=f"branch_{i}")(val=first)
        second = func.override(task_id=f"task_{i}")()
        third = func.override(task_id=f"task_{i}a")()
        check = check_xcom_output_from_first.override(task_id=f"check_{i}")(val=first, expected_val=i)

        first >> branch >> [second, third] >> check


multiple_branch_loop()
The same expected behavior is confirmed in the check_* tasks.
Your functions branch, add, and times don't define name themselves, so it is taken from the global scope, which at the time the functions execute holds the last value of the for name in name_list loop. This is a common trap, explained e.g. here: tkinter creating buttons in for loop passing command arguments
To fix it, you can either pull name from context as in count, or provide it via op_args or op_kwargs when you create the respective operator, as in the answer by Josh Fell:
first = PythonOperator(task_id=f"first_task_{i}", python_callable=xcom_push, op_kwargs={"val": i})
branch = BranchPythonOperator(task_id=f"branch_{i}", python_callable=choose, op_kwargs={"val": i})
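For example, applied to the DAG in the question, passing name through op_kwargs might look like this (a sketch of the idea only; the add and times callables and operators would change the same way):

def branch(name, **context):
    # `name` now comes from op_kwargs, not from the module-level loop variable
    return context['ti'].xcom_pull(task_ids=f'count_task_{name}')


for name in name_list:
    exec_branch_task = BranchPythonOperator(
        task_id=f'branch_task_{name}',
        python_callable=branch,
        op_kwargs={'name': name},   # binds the current loop value at definition time
        provide_context=True,
        dag=dag,
    )
    # apply the same change to the add_{name} and times_{name} operators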
I am trying to use a PythonOperator to fetch a list of filenames that contain the run-date string, and then download those files using the SFTP-to-S3 operator. Is there a better way to do this? With the following code I get the error name ti not found.
def get_files(**kwargs):
    sftp_hook = SFTPHook(ftp_conn_id='conn')
    str_date = kwargs["date"]
    files = []
    with sftp_hook.get_conn() as conn:
        for entry in conn.listdir_attr():
            mode = entry.st_mode
            if S_ISREG(mode) and str_date in entry.filename:
                files.append(entry.filename)
    return files  # list of files to download
with dag:
    date = '{{ next_ds_nodash }}'

    source_files = PythonOperator(task_id="get_files",
                                  python_callable=get_files,
                                  op_kwargs={'date': date},
                                  provide_context=True,
                                  dag=dag)

    file_list = ti.xcom_pull(task_ids='get_files', key='files')

    collect = []
    for file in file_list:
        op = SFTPToS3Operator(task_id=f"download_{file}",
                              sftp_conn_id="conn",
                              sftp_path=f"path1/{file}" if 'key' in file else f"path2/{file}",
                              s3_conn_id=aws_conn_id,
                              s3_bucket=s3_bucket,
                              s3_key=f"/temp/{date}/{file}",
                              dag=dag)
        collect.append(op)

    collect.set_upstream(source_files)
According to the XCom documentation, XComs are similar to Variables and therefore must be serialized and deserialized; use json.dumps() and json.loads() to do so.
Additionally, you should be doing the xcom_pull inside another task instead of in the DAG definition itself.
The DAG file should only wire up the various task_ids; perform all operations within tasks and chain them together in the DAG.
Here is an example of the proper use of XComs.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'example_xcom',
    schedule_interval="@once",
    start_date=days_ago(2),
    default_args={'owner': 'airflow'},
    tags=['example'],
)

value_1 = [1, 2, 3]
value_2 = {'a': 'b'}

def push(**kwargs):
    """Pushes an XCom without a specific target"""
    kwargs['ti'].xcom_push(key='value from pusher 1', value=value_1)

def push_by_returning(**kwargs):
    """Pushes an XCom without a specific target, just by returning it"""
    return value_2

def puller(**kwargs):
    """Pull all previously pushed XComs and check if the pushed values match the pulled values."""
    ti = kwargs['ti']

    # get value_1
    pulled_value_1 = ti.xcom_pull(key=None, task_ids='push')
    if pulled_value_1 != value_1:
        raise ValueError(f'The two values differ {pulled_value_1} and {value_1}')

    # get value_2
    pulled_value_2 = ti.xcom_pull(task_ids='push_by_returning')
    if pulled_value_2 != value_2:
        raise ValueError(f'The two values differ {pulled_value_2} and {value_2}')

    # get both value_1 and value_2
    pulled_value_1, pulled_value_2 = ti.xcom_pull(key=None, task_ids=['push', 'push_by_returning'])
    if pulled_value_1 != value_1:
        raise ValueError(f'The two values differ {pulled_value_1} and {value_1}')
    if pulled_value_2 != value_2:
        raise ValueError(f'The two values differ {pulled_value_2} and {value_2}')

push1 = PythonOperator(
    task_id='push',
    dag=dag,
    python_callable=push,
)

push2 = PythonOperator(
    task_id='push_by_returning',
    dag=dag,
    python_callable=push_by_returning,
)

pull = PythonOperator(
    task_id='puller',
    dag=dag,
    python_callable=puller,
)

pull << [push1, push2]
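One way to apply this to your DAG is to do the downloads inside a second task that pulls the list, rather than creating one operator per file at parse time. Below is a rough sketch using the hooks directly; download_files and download_task are names I made up, the import paths and connection-argument names vary by Airflow/provider version, and staging files under /tmp is an assumption:

# Assumed Airflow 2 provider import paths; adjust to your Airflow version.
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def download_files(date, **kwargs):
    """Pull the file list pushed by get_files and copy each file to S3."""
    ti = kwargs['ti']
    file_list = ti.xcom_pull(task_ids='get_files')  # the return value of get_files

    sftp_hook = SFTPHook(ftp_conn_id='conn')
    s3_hook = S3Hook(aws_conn_id=aws_conn_id)

    for file in file_list:
        sftp_path = f"path1/{file}" if 'key' in file else f"path2/{file}"
        local_path = f"/tmp/{file}"                     # assumption: local scratch space is fine
        sftp_hook.retrieve_file(sftp_path, local_path)  # download from SFTP
        s3_hook.load_file(filename=local_path,
                          key=f"/temp/{date}/{file}",
                          bucket_name=s3_bucket,
                          replace=True)


download_task = PythonOperator(task_id="download_files",
                               python_callable=download_files,
                               op_kwargs={'date': '{{ next_ds_nodash }}'},
                               provide_context=True,
                               dag=dag)

source_files >> download_task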
I just tried an xcom_pull to print the return value of a task.
Basically, I ran a SQL task, select 123123123, which returns 123123123. My DAG is:
<<args and DAG details goes here>>
def puller(**kwargs):
    ti = kwargs['ti']
    pulled_value_1 = ti.xcom_pull(task_ids='push_result')
    print("VALUE IN PULLER : ", pulled_value_1)

def get_dag_ids(**kwargs):
    postgres_hook = PostgresHook(postgres_conn_id='cloudsqlpg')
    records = postgres_hook.get_records(sql='select 123123123')
    return records

t1 = PythonOperator(
    task_id="push_result",
    python_callable=get_dag_ids,
    provide_context=True,
    dag=dag
)

pull = PythonOperator(
    task_id='pullee',
    dag=dag,
    python_callable=puller,
    provide_context=True,
)

t1 >> pull
I got the output from task t1 as below:
INFO - Done. Returned value was: [(123123123,)]
I just need the value 123123123. Is it an array? How do I extract the value from this result?
get_records() returns a list of rows, each row being a tuple of column values, so you may want to unpack it into a single flat list/tuple, [123123123, ..., n]:
list(itertools.chain(*records))
tuple(itertools.chain(*records))
Then it'll be easier to access each result, i.e. records[0].
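For example, inside the puller task above this might look like the following (a sketch; the flattening assumes each row has a single column):

import itertools


def puller(**kwargs):
    ti = kwargs['ti']
    records = ti.xcom_pull(task_ids='push_result')  # [(123123123,)]

    flat = list(itertools.chain(*records))          # [123123123]
    print("VALUE IN PULLER : ", flat[0])            # 123123123

    # or index straight into the first row and first column:
    print(records[0][0])                            # 123123123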
I have an Airflow job like the one below:
import time
job_id = int(time.time())
airflow_job1 = PythonOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job2 = BashOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job1 >> airflow_job2
I know that every time the script is launched I get a new job_id, which is used in each Airflow task. But what happens if I rerun from the middle: say airflow_job1 fails, I fix the problem and rerun airflow_job1 from the UI. Is a new job_id generated for the rerun, or does Airflow reuse the last job_id?
Actually, I checked with a simple case:
# global parameter
job_id = int(time.time())

def airflow_job1(job_id, **context):
    print("in airflow_job1, current timestamp: %s" % job_id)

def airflow_job2(job_id, **context):
    print("in airflow_job2, current timestamp: %s" % job_id)

airflow_job1 = PythonOperator(
    task_id='airflow_job1',
    provide_context=True,
    python_callable=airflow_job1,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job2 = PythonOperator(
    task_id='airflow_job2',
    provide_context=True,
    python_callable=airflow_job2,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job1 >> airflow_job2
I find that job_id in airflow_job1 and airflow_job2 are different, even within the same run.
So the conclusion is that we shouldn't set a global parameter this way; use xcom_push / xcom_pull to share the value instead.
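A minimal sketch of that XCom approach, assuming Airflow 2 import paths and a helper task named generate_job_id (both the names and the wiring are illustrative, not from the original DAG):

import time

from airflow.operators.python import PythonOperator


def generate_job_id(**context):
    # Returning the value pushes it to XCom under the key 'return_value'.
    return int(time.time())


def use_job_id(**context):
    job_id = context['ti'].xcom_pull(task_ids='generate_job_id')
    print("in airflow_job1, current timestamp: %s" % job_id)


gen_job_id = PythonOperator(task_id='generate_job_id',
                            python_callable=generate_job_id,
                            dag=dag)

airflow_job1 = PythonOperator(task_id='airflow_job1',
                              python_callable=use_job_id,
                              dag=dag)

# every downstream task that pulls from 'generate_job_id' sees the same job_id,
# even when an individual task is rerun within the same DAG run
gen_job_id >> airflow_job1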
import celery

def temptask(n):
    header = list(tempsubtask.si(i) for i in range(n))
    callback = templink.si('printed at last?')
    r = celery.chord(celery.group(header))(callback)
    return r

@task()
def tempsubtask(i):
    print i
    for x in range(i):
        time.sleep(2)
        current_task.update_state(
            state='PROGRESS', meta={'completed': x, 'total': i})

@task()
def templink(x):
    print 'this should be run at last %s' % x

# executing temptask
r = temptask(100)
I want access to the progress status updated by tempsubtask. How can I go about achieving that?
I've had a similar question. Most examples on the net are outdated and the docs didn't help much, but the docs link to the sources, and reading those did help me.
My objective was to organize parallel tasks in groups. The groups would have to be executed sequentially, in order.
So I decided to generate the task ids separately, before starting any tasks, and only then assign them. I'm using Celery 4.3.0.
Here's a brief example.
Firstly I needed a dummy task to make execution sequential and to be able to check the state of a certain group. As this is used as a callback, it will complete only after all other tasks in the group.
@celery.task(bind=True, name="app.tasks.dummy_task")
def dummy_task(self, results=None, *args, **kwargs):
    return results
My comments here explain how I assign ids.
from celery.utils import uuid
from celery import group, chord, chain

# Generating task ids,
# which can be saved to a db, sent to the client and so on
#
# This is done before executing any tasks
task_id_1 = uuid()
task_id_2 = uuid()

chord_callback_id_1 = uuid()
chord_callback_id_2 = uuid()

workflow_id = None

# Generating groups, using signatures;
# a group may contain any number of tasks
group_1 = group(
    [
        celery.signature(
            'app.tasks.real_task',
            args=(),
            kwargs={'email': some_email, 'data': some_data},
            options=({'task_id': task_id_1})
        )
    ]
)

group_2 = group(
    [
        celery.signature(
            'app.tasks.real_task',
            args=(),
            kwargs={'email': some_email, 'data': some_data},
            options=({'task_id': task_id_2})
        )
    ]
)

# Creating callback tasks which simply relay the result,
# using the task ids generated before
#
# The dummy task starts only after all tasks in its group are completed;
# this way we know that the group is completed
chord_callback = celery.signature(
    'app.tasks.dummy_task',
    options=({'task_id': chord_callback_id_1})
)

chord_callback_2 = celery.signature(
    'app.tasks.dummy_task',
    options=({'task_id': chord_callback_id_2})
)

# we can monitor each step's status
# by the id of its chord callback
step1 = chord(group_1, body=chord_callback)
step2 = chord(group_2, body=chord_callback_2)

# start the workflow execution;
# the steps will execute sequentially
workflow = chain(step1, step2)()

# the id of the last chord callback
workflow_id = workflow.id

# return any ids you need
print(workflow_id)
That's how I can check the status of any task in my app.
# This is a simplified example
# some code is omitted
from celery.result import AsyncResult


def task_status(task_id=None):
    # Possible states:
    # PENDING
    # RECEIVED
    # STARTED
    # SUCCESS
    # FAILURE
    # REVOKED
    # RETRY
    task = AsyncResult(task_id)

    response = {
        'state': task.state,
    }

    return jsonify(response), 200
After hours of googling I stumbled upon http://www.manasupo.com/2012/03/chord-progress-in-celery.html . Though the solution there didn't work for me out of the box, it did inspire me to try something similar.
from celery.utils import uuid
from celery import chord


class ProgressChord(chord):

    def __call__(self, body=None, **kwargs):
        _chord = self.type
        body = (body or self.kwargs['body']).clone()
        kwargs = dict(self.kwargs, body=body, **kwargs)
        if _chord.app.conf.CELERY_ALWAYS_EAGER:
            return self.apply((), kwargs)
        callback_id = body.options.setdefault('task_id', uuid())
        r = _chord(**kwargs)
        return _chord.AsyncResult(callback_id), r
and instead of executing celery.chord I use ProgressChord as follows:
def temptask(n):
    header = list(tempsubtask.si(i) for i in range(n))
    callback = templink.si('printed at last?')
    r = ProgressChord(celery.group(header))(callback)
    return r
The returned value r is a tuple containing both the callback's AsyncResult and a GroupResult. So success looked something like this:
In [3]: r
Out[3]:
(<AsyncResult: bf87507c-14cb-4ac4-8070-d32e4ff326a6>,
<GroupResult: af69e131-5a93-492d-b985-267484651d95 [4672cbbb-8ec3-4a9e-971a-275807124fae, a236e55f-b312-485c-a816-499d39d7de41, e825a072-b23c-43f2-b920-350413fd5c9e, e3f8378d-fd02-4a34-934b-39a5a735871d, c4f7093b-9f1a-4e5e-b90d-66f83b9c97c4, d5c7dc2c-4e10-4e71-ba2b-055a33e15f02, 07b1c6f7-fe95-4c1f-b0ba-6bc82bceaa4e, 00966cb8-41c2-4e95-b5e7-d8604c000927, e039c78e-6647-4c8d-b59b-e9baf73171a0, 6cfdef0a-25a2-4905-a40e-fea9c7940044]>)
I inherited from and overrode celery.chord instead of celery.task.chords.Chord because I couldn't find the latter's source anywhere.
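With that pair you can then derive a progress figure from the group result; a small sketch (completed_count() and the results list are standard GroupResult attributes, the rest mirrors the code above):

# r is the (AsyncResult, GroupResult) pair returned by ProgressChord above
callback_result, group_result = r

total = len(group_result.results)       # number of header tasks
done = group_result.completed_count()   # header tasks that have succeeded so far
print('%d/%d subtasks complete' % (done, total))

# the PROGRESS metadata pushed via update_state is visible per subtask:
for child in group_result.results:
    if child.state == 'PROGRESS':
        print(child.info)               # {'completed': x, 'total': i}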
This is an old problem, and I wasted several days looking for a better, more modern solution. In my current project I have to track group progress separately and release a lock in the final callback.
The current solution is much simpler (but harder to guess); the relevant lines are commented at the end:
@celery_app.task(name="_scheduler", track_started=True, ignore_result=False)
def _scheduler():
    lock = cache.lock("test_lock")
    if not lock.acquire(blocking=False):
        return {"Error": "Job already in progress"}
    lock_code = lock.local.token.decode("utf-8")

    tasks = []
    for x in range(100):
        tasks.append(calculator.s())

    _group = group(*tasks)
    _chord = chord(_group)(_get_results.s(token=lock_code))

    group_results = _chord.parent  # This is the actual group inside the chord
    group_results.save()           # I save it to the usual result backend and can track progress with it

    return _chord  # can return anything, I only need the chord
I am working with Celery 5.1.
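To read the progress elsewhere, the saved group can be restored by its id; a sketch, assuming you stored group_results.id somewhere the monitoring code can reach (the names here are illustrative):

from celery.result import GroupResult

# group_id is group_results.id from the _scheduler task above
saved_group = GroupResult.restore(group_id, app=celery_app)

total = len(saved_group.results)
done = saved_group.completed_count()
print(f"{done}/{total} calculator tasks finished")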