Chaining/Grouping Celery Tasks in separate docker containers - python

I've been using Celery for a couple of months and have stumbled upon a case where I just can't find any info or even an example of what I intend to achieve.
In this specific case I have a Docker container running an API and two other separate containers with Celery workers.
I have my queues and tasks defined and I call a task with the send_task method. Example:
r = celery_app.send_task('task_a')
Similarly, I have another container with "task_b" that can be called the same way as "task_a".
I'm defining my tasks by updating the configuration of my Celery app and specifying their respective queues, since they run in separate containers.
Example:
celery_app.conf.update({
    'broker_url': 'amqp://admin:mypass@rabbit:5672',
    'result_backend': 'redis://redis:6379/0',
    'imports': (
        'tasks_a_dev',
        'tasks_b_dev',
    ),
    'task_routes': {
        'task_a': {'queue': 'qtasks_a_dev'},
        'task_b': {'queue': 'qtasks_b_dev'},
    },
    'task_serializer': 'json',
    'result_serializer': 'json',
    'accept_content': ['json']
})
Is there any way I can chain these two tasks together while passing the result of task_a to task_b?

If you use Celery > v3.0.0, you can use chaining.
So if you wanted task_b to run with the result of task_a, you could do the following.
import time
from celery import chain

@task()
def task_a(a, b):
    time.sleep(5)
    return a + b

@task()
def task_b(a, b):
    time.sleep(5)
    return a + b

# the result of the first job will be the first argument of the second job
res = chain(task_a.s(1, 2), task_b.s(3)).apply_async()

# Alternatively, you could do the following
res_2 = (task_a.s(1, 2) | task_b.s(3)).apply_async()

# check the result status to get the value
if res.status == u'SUCCESS':
    print("result:", res.get())

Related

Airflow Custom Operator Waiting In Scheduled Status

I'm trying to customize SSHOperator as a CustomSSHOperator, because I need to assign dynamic values to the ssh_conn_id and pool variables of SSHOperator. However, these two are not in template_fields, so I've created a custom class like below:
class CustomSSHOperator(SSHOperator):
    template_fields: Sequence[str] = ('command', 'remote_host', 'ssh_conn_id', 'pool')
    template_fields_renderers = {"command": "bash", "remote_host": "str", "ssh_conn_id": "str", "pool": "str"}

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
And I'm creating the DAG like below:
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"
with DAG(dag_id="my_dag", tags=["Project", "Team"],
start_date=datetime(2022, 9, 27), schedule_interval=None,
) as dag:
tasks = []
vm1_task = CustomSSHOperator(task_id='vm1_task',
# ssh_conn_id='vm-112',
#pool='vm-112',
ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
get_pty=True,
command="d=$(date) && echo $d > my_file.txt"
)
vm2_task = CustomSSHOperator(task_id='vm2_task',
# ssh_conn_id='vm-140',
#pool='vm-140',
ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
get_pty=True,
command="d=$(date) && echo $d > my_file.txt"
)
Basically, I can see the rendered values in the UI. However, my tasks keep waiting, as in the image.
I should also point out that if I change the DAG like below (populating the pool variable statically; ssh_conn_id is still dynamic), it works:
VM_CONN_ID = "vm-{vm_name}"
VM_POOL = "vm-{vm_name}"
with DAG(dag_id="my_dag", tags=["Project", "Team"], start_date=datetime(2022, 9, 27), schedule_interval=None,) as dag:
tasks = []
vm1_task = CustomSSHOperator(task_id='vm1_task',
# ssh_conn_id='vm-112',
pool='vm-112',
ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
#pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm1']}}"),
get_pty=True,
command="d=$(date) && echo $d > my_file.txt"
)
vm2_task = CustomSSHOperator(task_id='vm2_task',
# ssh_conn_id='vm-140',
pool='vm-140',
ssh_conn_id=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
#pool=VM_CONN_ID.format(vm_name="{{dag_run.conf['vm2']}}"),
get_pty=True,
command="d=$(date) && echo $d > my_file.txt"
)
The dag_run.conf parameter is {"vm1": "112", "vm2": "140"}.
I couldn't find the reason. I'd appreciate any suggestions.
Template fields are rendered after the task has been scheduled, while the task pool field is used before the task is scheduled (by the Airflow scheduler itself).
This is the reason why a template cannot be used for the pool field. See also this discussion.
What is happening in your case is that the task remains stuck in the scheduled state because it is associated with a non-existent pool (the pool name is actually vm-{{dag_run.conf['vm1']}}, i.e. the value as evaluated before rendering).
You should have evidence of this in the scheduler logs:
Tasks using non-existent pool 'vm-{{dag_run.conf['vm1']}}' will not be scheduled
As proof, you can create a new pool named exactly vm-{{dag_run.conf['vm1']}} and you will see that the task gets executed.
Only later is the pool field rendered, which is why you see the expected rendered values in the UI. But that's not what the scheduler saw.
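If per-VM pools are still needed, one possible workaround (not part of the answer above; assumes the Airflow 2 pools CLI) is to create the pools ahead of time and reference them statically in each task, for example:
airflow pools set vm-112 1 "Pool for VM 112"
airflow pools set vm-140 1 "Pool for VM 140"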

How do you return the result of a completed celery task and store the data in variables?

I have two Flask modules, app.py and tasks.py.
I set up Celery in tasks.py to complete a Selenium webdriver request (which takes about 20 seconds). My goal is simply to return the result of that request to app.py.
Running the Celery worker on another terminal, I can see in the console that the Celery task completes successfully and prints all the data I need from the selenium request. However, now I just want to return the task result to app.py.
How do I obtain the celery worker results data from tasks.py and store each result element as a variable in app.py?
app.py:
I define the marketplace and call the task function and request the indexed results:
import tasks
marketplace = 'cheddar_block_games'
# This is what I am trying to get back:
price_check = tasks.scope(marketplace[0])
image = tasks.scope(marketplace[1])
tasks.py:
from celery import Celery
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

celery = Celery(broker='redis://127.0.0.1:6379')

@celery.task()
def scope(marketplace):
    # `web` is the Selenium webdriver instance created elsewhere in tasks.py
    web.get(f'https://magiceden.io/marketplace/{marketplace}')
    price_check = WebDriverWait(web, 30).until(EC.visibility_of_element_located((By.XPATH, "/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[3]/div[2]/div[4]/div/div[2]/div[1]/div[2]/div/div[2]/div/div[2]/div/span/div[2]/div/span[1]"))).text
    image = WebDriverWait(web, 30).until(EC.visibility_of_element_located((By.XPATH, "/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[3]/div[2]/div[4]/div/div[2]/div[1]/div[2]/div/div[1]/div/div/img")))
    return (price_check, image)
This answer might be relevant:
https://stackoverflow.com/a/30760142/9347535
app.py should call the task e.g. using scope.delay or scope.apply_async. You could then fetch the task result with AsyncResult.get():
https://docs.celeryq.dev/en/latest/userguide/tasks.html#result-backends
Since the task returns a tuple, you can store each variable by unpacking it:
https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences
The result would be something like this:
import tasks
marketplace = 'cheddar_block_games'
result = tasks.scope.delay(marketplace)
price_check, image = result.get()

Airflow - What do I do when I have a variable amount of Work that needs to be handled by a DAG?

I have a sensor task that listens to files being created in S3.
After a poke I may have 3 files, after another poke I might have another 5 files.
I want to create a DAG (or multiple DAGs) that listens for work requests and creates other tasks or DAGs to handle that amount of work.
I wish I could access the xcom or dag_run variable from the DAG definition (see pseudo-code as follows):
def wait_for_s3_data(ti, **kwargs):
    s3_wrapper = S3Wrapper()
    work_load = s3_wrapper.work()
    # work_load: {"filename1.json": "s3/key/filename1.json", ....}
    ti.xcom_push(key="work_load", value=work_load)
    return len(work_load) > 0

def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    dag_run.conf['work_load'] = work_load
    s3_wrapper.move_messages_from_waiting_to_processing(work_load)

with DAG(
    "ListenAndCallWorkers",
    description="This DAG waits for work request from s3",
    schedule_interval="@once",
    max_active_runs=1,
) as dag:
    wait_for_s3_data: PythonSensor = PythonSensor(
        task_id="wait_for_s3_data",
        python_callable=wait_for_s3_data,
        timeout=60,
        poke_interval=30,
        retries=2,
        mode="reschedule",
    )
    get_data_task = PythonOperator(
        task_id="GetData",
        python_callable=query.get_work,
        provide_context=True,
    )
    work_load = "{{ dag_run.conf['work_load'] }}"  # <--- I WISH I COULD DO THIS
    do_work_tasks = [
        TriggerDagRunOperator(
            task_id=f"TriggerDoWork_{work}",
            trigger_dag_id="Work",  # Ensure this equals the dag_id of the DAG to trigger
            conf={"work": keypath},
        )
        for work, keypath in work_load.items()
    ]
    wait_for_s3_data >> get_data_task >> do_work_tasks
I know I cannot do that.
I also tried to define my own custom MultiTriggerDAG object (as in https://stackoverflow.com/a/51790697/1494511), but at that step I still don't have access to the amount of work that needs to be done.
Another idea:
I am considering building a DAG with N doWork tasks, and passing work to up to N of them via XCom:
def get_work(self, dag_run, ti, **_):
    s3_wrapper = S3Wrapper()
    work_load = ti.xcom_pull(key="work_load")
    i = 1
    for work, keypath in work_load.items():
        dag_run.conf[f'work_{i}'] = keypath
        i += 1
        if i > N:
            break
    s3_wrapper.move_messages_from_waiting_to_processing(work_load[:N])
This idea would get the job done, but it sounds very inefficient
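As a side note (not part of the original question): on Airflow 2.3+ this kind of variable fan-out is what dynamic task mapping is designed for. A rough sketch under that assumption, reusing the S3Wrapper helper from the pseudo-code above and omitting the sensor step for brevity:
from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

@dag(dag_id="ListenAndCallWorkers_mapped", start_date=datetime(2022, 1, 1), schedule_interval="@once")
def listen_and_call_workers():
    @task
    def get_work():
        # S3Wrapper is the helper from the question's pseudo-code
        s3_wrapper = S3Wrapper()
        work_load = s3_wrapper.work()  # {"filename1.json": "s3/key/filename1.json", ...}
        s3_wrapper.move_messages_from_waiting_to_processing(work_load)
        return [{"work": keypath} for keypath in work_load.values()]

    # One TriggerDagRunOperator instance is mapped per item returned by get_work()
    TriggerDagRunOperator.partial(
        task_id="TriggerDoWork",
        trigger_dag_id="Work",
    ).expand(conf=get_work())

listen_and_call_workers()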
Related questions:
This is the same question as I have, but no code is presented on how to solve it:
Airflow: Proper way to run DAG for each file
This answer looks like it would solve the problem, but it seems to be related to Airflow versions lower than 2.2.2
How do we trigger multiple airflow dags using TriggerDagRunOperator?

Airflow - Proper way to handle DAGs callbacks

I have a DAG, and whenever it succeeds or fails, I want it to trigger a method that posts to Slack.
My DAG args are like below:
default_args = {
    [...]
    'on_failure_callback': slack.slack_message(sad_message),
    'on_success_callback': slack.slack_message(happy_message),
    [...]
}
And the DAG definition itself:
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False
)
But when I check Slack there are more than 100 messages each minute, as if it were evaluating at every scheduler heartbeat, and for each run it executed both the success and the failure methods, as if the same task instance had both worked and not worked (not fine).
How should I properly use on_failure_callback and on_success_callback to handle DAG statuses and call a custom method?
The reason it's creating the messages is that when you define your default_args, you are executing the functions. You need to pass the function itself, without calling it.
Since the function has an argument, it'll get a little trickier. You can either define two partial functions or define two wrapper functions.
So you can either do:
from functools import partial

success_msg = partial(slack.slack_message, happy_message)
failure_msg = partial(slack.slack_message, sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
or
def success_msg(context):
    # Airflow passes the task context to the callback
    slack.slack_message(happy_message)

def failure_msg(context):
    slack.slack_message(sad_message)

default_args = {
    [...]
    'on_failure_callback': failure_msg,
    'on_success_callback': success_msg,
    [...]
}
In either approach, note that only the function objects failure_msg and success_msg are passed, not the result of calling them.
default_args is expanded at the task level, so these become per-task callbacks.
Apply the attributes at the DAG level, outside of default_args, if you want one callback per DAG run (see the sketch below).
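A minimal sketch of that, reusing the failure_msg/success_msg callables defined in the previous answer (DAG-level callbacks fire once per DAG run):
dag = DAG(
    dag_id=dag_name_id,
    default_args=default_args,
    description='load data from mysql to S3',
    schedule_interval='*/10 * * * *',
    catchup=False,
    on_failure_callback=failure_msg,   # DAG-level callbacks, outside default_args
    on_success_callback=success_msg,
)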
What is the slack method you are referring to? The scheduler parses your DAG file every heartbeat, so if slack.slack_message(...) is called at the top level of your code, it is going to get run every heartbeat.
A few things you can try:
Define the functions you want to call as PythonOperators and then call them at the task level instead of at the DAG level.
You could also use TriggerRules to set tasks downstream of your ETL task that will trigger based on failure or success of the parent task.
From the docs:
defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy}
You can find an example of how this would look here (full disclosure - I'm the author).
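For illustration, a hedged sketch of the TriggerRule approach (Airflow 2 import paths; etl_task and the message variables are placeholders for your own tasks and Slack helpers):
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

notify_success = PythonOperator(
    task_id='notify_success',
    python_callable=lambda: slack.slack_message(happy_message),
    trigger_rule=TriggerRule.ALL_SUCCESS,  # runs only if the upstream task succeeded
    dag=dag,
)
notify_failure = PythonOperator(
    task_id='notify_failure',
    python_callable=lambda: slack.slack_message(sad_message),
    trigger_rule=TriggerRule.ONE_FAILED,  # runs if the upstream task failed
    dag=dag,
)
etl_task >> [notify_success, notify_failure]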

APScheduler job is not starting as scheduled

I'm trying to schedule a job to start every minute.
I have the scheduler defined in a scheduler.py script:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor

executors = {
    'default': ThreadPoolExecutor(10),
    'processpool': ProcessPoolExecutor(5)
}
job_defaults = {
    'coalesce': False,
    'max_instances': 5
}
scheduler = BackgroundScheduler(executors=executors, job_defaults=job_defaults)
I initialize the scheduler in the __init__.py of the module like this:
from scheduler import scheduler
scheduler.start()
I want to start a scheduled job on a specific action, like this:
from time import time, sleep

def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.scheduled_job(func=TestScheduler(),
                            trigger='interval',
                            minutes=1,
                            id=job_id
                            )

def TestScheduler():
    for i in range(0, 29):
        starttime = time()
        print("test")
        sleep(1.0 - ((time() - starttime) % 1.0))
First: when I execute the AddJob() function in the Python console, it starts to run as expected, but not in the background; the console is blocked until the TestScheduler function ends after 30 seconds. I was expecting it to run in the background because it's a BackgroundScheduler.
Second: the job never starts again even when specifying a repeat interval of 1 minute.
What am I missing?
UPDATE
I found the issue thanks to another thread. The wrong line is this:
scheduler.scheduled_job(func=TestScheduler(),
                        trigger='interval',
                        minutes=1,
                        id=job_id
                        )
I changed it to:
scheduler.add_job(func=TestScheduler,
                  trigger='interval',
                  minutes=1,
                  id=job_id
                  )
TestScheduler() becomes TestScheduler. Using TestScheduler() causes the result of calling TestScheduler() to be passed as the func argument of add_job().
The first problem seems to be that you are initializing the scheduler inside the __init__.py, which doesn't seem to be the recommended way.
Code that exists in the __init__.py gets executed the first time a module from the specific folder gets imported. For example, imagine this structure:
my_module
|--__init__.py
|--test.py
with __init__.py:
from scheduler import scheduler
scheduler.start()
the scheduler.start() command gets executed when you do from my_module import something. So the scheduler either never starts from __init__.py or it starts multiple times (depending on the rest of your code!).
Another problem is the use of the scheduler.scheduled_job() method. If you read the documentation on adding jobs, you will see that the recommended way is to use the add_job() method; scheduled_job() is a decorator provided for convenience.
I would suggest something like this:
Keep your scheduler module (referred to as my_scheduler.py below) as is.
Remove the scheduler.start() line from __init__.py.
Change your main file as follows:
from my_scheduler import scheduler

if not scheduler.running:  # Clause suggested by @CyrilleMODIANO
    scheduler.start()

def AddJob():
    dbid = repository.database.GetDbid()
    job_id = 'CollectData_{0}'.format(dbid)
    scheduler.add_job(
        func=TestScheduler,
        trigger='interval',
        minutes=1,
        id=job_id
    )

...
