Get celery task ids in advanced workflow - python

I need to realise following scenario:
Execute task A
Execute multiple task B in parallel with different arguments
Wait for all tasks to finish
Execute multiple task B in parallel with different arguments
Wait for all tasks to finish
Execute task C
I have achieved this by implementing chain of chords, here is simplified code:
# inside run() method of ATask
chord_chain = []
for taskB_group in taskB.groups.all():
tasks = [BTask().si(id=taskB_model.id) for taskB_model in taskB_group.children.all()]
if len(tasks):
chord_chain.append(chord(tasks, _dummy_callback.s()))
chord_chain.append(CTask().si(execution_id))
chain(chord_chain)()
The problem is that I need to have ability to call revoke(terminate=True) on all BTasks in any point of time. The lower level problem is that I can't get to BTask celery ids.
Tried to get BTask ids via chain result = chain(chord_chain)(). But I didn't found that information in returned AsyncResult object. Is it possible to get chain children ids from this object? (result.children is None)
Tried to get BTask ids via ATask AsyncResult, but it seems that children property only contains results of first chord and not the rest of tasks.
>>> r=AsyncResult(#ATask.id#)
>>> r.children
[<GroupResult: 5599ae69-4de0-45c0-afbe-b0e573631abc [#BTask.id#, #BTask.id#]>,
<AsyncResult: #chord_unlock.id#>]

Solved by flagging ATask related model with aborted status flag and adding check at start of BTask.

Related

Celery chord callback isn't always launched

I'm trying to use a chord to launch a report update after the update is completed.
#shared_task(autoretry_for=(Exception,), retry_backoff=True, retry_kwargs {'max_retries': 5})
def upload(df: pd.DataFrame, **kwargs):
ed = EntityDataPoint(df, **kwargs)
uploadtasks, source, subtype = ed.upload_database()
chord(uploadtasks)(final_report.si(logger=ed.logger,
source=source,
subtype=subtype,
index=ed.index))
With uploadtask being :
g = group(unwrap_upload_bulk.s(obj = self, data = self.data.iloc[i:i+chunk_size])
for i in range(0, len(self.data), chunk_size))
When the header of the chord has more than 2 elements, the first two subtasks succeed, and the rest of the tasks in the group and the callback are not launched, without any error being sent anywhere, and without any information in the celery workers logs. After inspecting the workers, with celery inspect active, scheduled, there doesn't seem to be any waiting task in the queue.
If the header (the group) has 2 or less elements, there is no problem, the group tasks finishes, the callback is called.
It does not seem depend on the size of the elements (if each subtask in the group is sending 100 rows, we still have the same behavior for 1000 rows).
If I just launch the group tasks, without the chord and the callback, the tasks succeed without any error.
I tried using different syntaxes for the chord, and it doesn't seem to change anything.
I tried using the group.link feature to see what would happen,and the group seems to finish when doing this, but the callback doesn't happen after all the group tasks are finished ofc since it's not designed for that as I understood from the documentation, so it's not completely the behavior I want.
I'm using Celery 5.2.3 with a Redis 7.0.0 broker and a Django 3.2.13 backend with psql, with python 3.9.
Everything is running on seperate docker containers.
It seems that using the group directly as the header of the chord was creating the problem. It was probably using the first task in the group as the header, and the second as the callback (though I can't understand why that didn't caused some error with the arguments of theses tasks).
Instead of returning :
group(unwrap_upload_bulk.s(obj = self, data = self.data.iloc[i:i+chunk_size])
for i in range(0, len(self.data), chunk_size))
I now return :
[unwrap_upload_bulk.s(obj = self, data = self.data.iloc[i:i+chunk_size])
for i in range(0, len(self.data), chunk_size)]
And it works just as expected.

Can Celery pass a Status Update to a non-Blocking Caller?

I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
celery -A tasks worker --loglevel=info
"""
from celery import Celery
app = Celery("tasks", broker="pyamqp://guest#localhost//", backend="rpc://")
#app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
for x, y in pairs:
self.update_state(state="SUM", meta={"x":x, "y":y, "x+y":x+y})
return len(pairs)
def handle_message(message):
if message["status"] == "SUM":
x = message["result"]["x"]
y = message["result"]["y"]
print(f"Message: {x} + {y} = {x+y}")
def non_looping(*pairs):
task = add_pairs_of_numbers.delay(pairs)
result = task.get(on_message=handle_message)
print(result)
def looping(*pairs):
task = add_pairs_of_numbers.delay(pairs)
print(task)
while True:
pass
if __name__ == "__main__":
import sys
if sys.argv[1:] and sys.argv[1] == "looping":
looping((3,4), (2,7), (5,5))
else:
non_looping((3,4), (2,7), (5,5))
If you run just ./tasks it executes the non_looping function. This does the standard Celery thing: makes a delayed call to the worker function and then uses get to wait for the result. A handle_message callback function prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./task.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
task = add_pairs_of_numbers.delay(pairs, callback=handle_message) # There is no such callback.
print(task)
while True:
pass
In the Celery documentation and all the examples I can find online (for example this), there is no way to define a callback function as part of the delay call or its apply_async equivalent. You can only specify one as part of a get callback. That's making me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
It's a pattern that is not supported by celery although you can (somewhat) trick it out by posting custom state updates to your task as described here.
Use update_state() to update a task’s state:.
def upload_files(self, filenames):
for i, file in enumerate(filenames):
if not self.request.called_directly:
self.update_state(state='PROGRESS',
meta={'current': i, 'total': len(filenames)})```
The reason that celery does not support such a pattern is that task producers (callers) are strongly decoupled from the task consumers (workers) with the only communications between the two being the broker to support communication from producers to consumers and the result backend supporting communications from consumers to producers. The closest you can get currently is with polling a task state or writing a custom result backend that will allow you to post events either via AMP RPC or redis subscriptions.

Python Celery: How do I join task results out of order?

I have a simple project where I create a bunch of chunks of work that's not related to each other, create tasks, pass them to Redis, and have a number of workers spread out over a Docker Swarm chew through the queue of long-running tasks. When the workers finish they dump their completed work in an NFS share and send back a text value to the Celery client.
I'm using celery.result.ResultSet's .join() function on the resultset array of asyncresult objects. The join() includes a callback that (for now) simply prints the result.
My problem is join() blocks until it receives each asyncresult value in the order it was given. My swarm is made up of a number of hosts that are vastly different machines, and it's important to me to have results come back as they finish, not in order or once they are all complete.
Is there a way via Celery to properly trigger a callback function as tasks are finished? I've looked at a lot of examples online and seems like my only option is to try my luck with asyncio, but Python is not exactly my strong suite.
Func for creating tasks and ResultSet obj:
def populateQueue(encodeTasks):
r = ResultSet([])
taskHandles = {}
for task in encodeTasks:
try:
ret = encode.delay(task)
r.add(ret)
logging.debug("Task ID: " + str(ret.task_id))
taskHandles[ret.task_id] = ret
except:
logging.info("populateQueue fail: " + str(task.traceback))
logging.info("Tasks queued: " + str(len(taskHandles)))
return taskHandles, r
Part of main() which waits for results:
frameCountTotal = getFrameCount(targetFile)
encodeTasks = buildCmdString(targetFile, frameCountTotal, clientCount)
taskHandles, retSet = populateQueue(encodeTasks)
logging.info("Waiting on tasks...")
retSet.join(callback=testCallback)
Thanks in advance
Found an answer to my own question:
ResultSet has another method called join_native(), which I think uses more specific API calls to the broker as long as that broker is one of several known products (RabbitMQ, Redis, etc). Celery's documentation just says that it gives better performance if you meet the broker requirement. What the docs don't say is that it allows for out-of-order returns (at least on Redis, haven't tried RMQ).

Getting all task IDs from nested chains and chords

I'm using Celery 3.1.9 with a Redis backend. The job that I'm running is made of several subtasks which run in chords and chains. The structure looks like this:
prepare
download data (a chord of 2 workers)
parse and store downloaded data
long-running chord of 4 workers
finalize
generate report
Each item in the list is a subtask, they are all chained together. Steps 2 and 4 are chords. The whole thing is wired up by creating a chord for step 4 whose callback is a chain of 4 -> 6, then a chord is created for step 2, whose callback is 3 -> first chord. Then, finally a chain is created 1 -> second chord. This chain is then started with delay() and its ID is stored in the database.
The problem is two-fold. First I want to be able to revoke the whole thing and second I want to have a custom on_failure on my Task class that does some cleanup, and reports the failure to the user.
Currently I store the chain's task ID. I thought I could use this to revoke the chain. Also, in case of an error I wanted to walk the chain to its root (in the on_failure handler) to retrieve the relevant record from the database. This doesn't work, because when you re-create an instance of AsyncResult with just the ID of the task, its parent attribute is None.
The second thing I tried was to store the result of serializable() called on the outer chain's result. This however, does not return the entire tree of AsyncResult objects, it just returns the IDs of the first level in the chain (so not the IDs of the children in the chords.)
The third thing I tried was to implement serializable() myself, but as it turns out, the reason why the original method doesn't go further than 2 levels is because the chain's children are celery.canvas.chord objects, instead of AsyncResult instances.
An illustration of the problem:
chord([
foo.si(),
foo.si(),
foo.si(),
], bar.si() | bar.si())
res = chord.apply_async()
pprint(res.serializable())
Prints the following:
(('50c9eb94-7a63-49dc-b491-6fce5fed3713',
('d95a82b7-c107-4a2c-81eb-296dc3fb88c3',
[(('7c72310b-afc7-4010-9de4-e64cd9d30281', None), None),
(('2cb80041-ff29-45fe-b40c-2781b17e59dd', None), None),
(('e85ab83d-dd44-44b5-b79a-2bbf83c4332f', None), None)])),
None)
The first ID is the ID of the callback chain, the second ID is from the chord task itself, and the last three are the actual tasks inside the chord. But I can't get at the result from the task inside the callback chain (i.e. the ID of the two bar.si() calls).
Is there any way to get at the actual task IDs?
One hacky way is calling the tasks with apply_async, save the task ids and wait for them manually. In this way you will have complete control of happens but you should only wait for async tasks as last resort. Now you can access task id, return value, etc. For example something like this:
task1 = a_task.apply_async()
task2 = b_task.apply_async()
task3 = c_task.apply_async()
tasks = [task1, task2, task3]
for task in tasks:
task.wait()
I have a nested dag which is a mixture of groups and chains. The following recursive method works well to get the task_ids and their result:
import celery
def get_task_id_result_tuple_list(run_dag, with_result=True):
task_id_result_list = []
# for groups, parents are first task, then iterate over the children
if isinstance(run_dag, celery.result.GroupResult):
entry = (run_dag.parent, run_dag.parent.result) if with_result else run_dag.parent
task_id_result_list.append(entry)
children = run_dag.children
for child in children:
task_id_result_list.extend(get_task_id_result_tuple_list(child, with_result))
# for AsyncResults, append parents in reverse
elif isinstance(run_dag, celery.result.AsyncResult):
ch = run_dag
ch_list = [(ch, ch.result)] if with_result else [ch]
while ch.parent is not None:
ch = ch.parent
entry = (ch, ch.result) if with_result else ch
ch_list.append(entry)
# remember to reverse the list to get the calling order
task_id_result_list.extend(reversed(ch_list))
return task_id_result_list
# dag is the nested celery structure of chains and groups
run_dag = dag.apply_async()
task_id_result_tuples = get_task_id_result_tuple_list(run_dag)
task_id_only = get_task_id_result_tuple_list(run_dag, False)
NOTE: I have not tested this with chords yet but I imagine it would either work as is or would maybe need another conditional branch to handle that.

Reporting yielded results of long-running Celery task

Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
""" Serve (possibly incomplete) results of a session's latest run. """
session = request.session
try: # Load most recent task
task_result = session["task_result"]
except KeyError: # Already cleared, or doesn't exist
if "results" not in session:
session["status"] = "No job submitted"
else: # Extract data from Asynchronous Tasks
session["status"] = task_result.status
if task_result.ready():
session["results"] = task_result.get()
render_task = task_result.children[0]
# Decorate with rendering results
session["render_status"] = render_task.status
if render_task.ready():
session["results"].render_output = render_task.get()
del(request.session["task_result"]) # Don't need any more
return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.backend.mark_as_started(
report_progress.request.id,
progress=progress)
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.update_state(state='PROGRESS', meta={'progress': progress})
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
Celery part:
def long_func(*args, **kwargs):
i = 0
while True:
yield i
do_something_here(*args, **kwargs)
i += 1
#task()
def test_yield_task(task_id=None, **kwargs):
the_progress = 0
for the_progress in long_func(**kwargs):
cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
do_someting()
If you do not like to use cache, or it's impossible, you can use db, file or any other place which celery worker and server side will have both accesss. With cache it's a simplest solution, but workers and server have to use the same cache.
A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.

Categories

Resources