I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
celery -A tasks worker --loglevel=info
"""
from celery import Celery
app = Celery("tasks", broker="pyamqp://guest@localhost//", backend="rpc://")

@app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
    for x, y in pairs:
        self.update_state(state="SUM", meta={"x": x, "y": y, "x+y": x + y})
    return len(pairs)
def handle_message(message):
    if message["status"] == "SUM":
        x = message["result"]["x"]
        y = message["result"]["y"]
        print(f"Message: {x} + {y} = {x+y}")
def non_looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    result = task.get(on_message=handle_message)
    print(result)

def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    print(task)
    while True:
        pass

if __name__ == "__main__":
    import sys
    if sys.argv[1:] and sys.argv[1] == "looping":
        looping((3, 4), (2, 7), (5, 5))
    else:
        non_looping((3, 4), (2, 7), (5, 5))
If you run just ./tasks.py it executes the non_looping function. This does the standard Celery thing: makes a delayed call to the worker function and then uses get to wait for the result. A handle_message callback function prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./tasks.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs, callback=handle_message)  # There is no such callback.
    print(task)
    while True:
        pass
In the Celery documentation and all the examples I can find online, there is no way to define a callback function as part of the delay call or its apply_async equivalent; you can only pass one to get. That makes me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
It's a pattern that is not supported by Celery, although you can (somewhat) trick it out by posting custom state updates to your task as described here.
Use update_state() to update a task's state:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
The reason that Celery does not support such a pattern is that task producers (callers) are strongly decoupled from the task consumers (workers): the only communication between the two is the broker, carrying messages from producers to consumers, and the result backend, carrying results from consumers back to producers. The closest you can get currently is polling the task state, or writing a custom result backend that lets you push events, for example via AMQP RPC or Redis subscriptions.
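To make the polling approach concrete in the asker's Flask scenario, here is a rough sketch (my own illustration, not part of the original answer). It assumes the tasks module from the question, a persistent and queryable result backend such as Redis or a database instead of rpc://, and made-up /add and /status/<task_id> routes:

# Sketch only: a non-blocking Flask status endpoint that polls the task state.
# Assumes the Celery app and add_pairs_of_numbers task from the question above,
# configured with a queryable result backend (e.g. Redis) rather than rpc://.
from flask import Flask, jsonify
from celery.result import AsyncResult
from tasks import app as celery_app, add_pairs_of_numbers

flask_app = Flask(__name__)

@flask_app.route("/add", methods=["POST"])
def start():
    task = add_pairs_of_numbers.delay([(3, 4), (2, 7), (5, 5)])
    return jsonify({"task_id": task.id}), 202   # return immediately, no blocking get()

@flask_app.route("/status/<task_id>")
def status(task_id):
    result = AsyncResult(task_id, app=celery_app)
    payload = {"state": result.state}
    if result.state == "SUM":        # the custom state; its meta dict is in result.info
        payload["meta"] = result.info
    return jsonify(payload)

The progress bar then polls /status/<task_id> from the client instead of the server holding a callback open.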
I would like to create a periodic task which queries the database, gets data from a data provider, makes some API requests, formats documents and sends them using another API.
The result of the previous task should be chained into the next task. I gathered from the documentation that I have to use chain, group and chord in order to organise this.
But I also got this from the documentation: don't run a subtask from within a task, because it might cause deadlocks.
So, the question is: how do I run subtasks inside a periodic task?
@app.task(name='envoy_emit_subscription', ignore_result=False)
def emit_subscriptions(frequency):
    # resulting queryset is the source for other tasks
    return Subscription.objects.filter(definition__frequency=1).values_list('pk', flat=True)

@app.task(name='envoy_query_data_provider', ignore_result=False)
def query_data_provider(pk):
    # gets the key from the chain and returns received data
    return "data"

@app.task(name='envoy_format_subscription', ignore_result=False)
def format_subscription(data):
    # formats documents
    return "formatted text"

@app.task(name='envoy_send_subscription', ignore_result=False)
def send_subscription(text):
    return send_text_somehow(text)
Sorry for the noob question, but I'm a noob in Celery, indeed.
Something like this maybe?
import time

while True:
    my_celery_chord()
    time.sleep(...)
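To make that concrete with the tasks from the question: dispatching subtasks asynchronously from inside a (periodic) task is fine - the deadlock warning is about waiting synchronously on subtask results - so something along these lines is possible. This is only a sketch; the periodic task name and the per-subscription chain are my assumptions, built from the asker's task definitions:

# Sketch, assuming the four tasks defined in the question.
# The periodic task only *dispatches* the per-subscription chains; it never
# blocks on their results, so the deadlock warning does not apply.
from celery import chain

@app.task(name='envoy_emit_subscriptions_periodic')
def emit_subscriptions_periodic():
    # Same queryset as emit_subscriptions in the question
    pks = Subscription.objects.filter(
        definition__frequency=1
    ).values_list('pk', flat=True)
    for pk in pks:
        # query -> format -> send, each task feeding its return value forward;
        # apply_async dispatches the chain without waiting on it.
        chain(
            query_data_provider.s(pk),
            format_subscription.s(),
            send_subscription.s(),
        ).apply_async()

Register emit_subscriptions_periodic in the beat schedule and the whole pipeline runs without any task waiting on another.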
We have a distributed architecture based on RabbitMQ and Celery.
We can launch in parallel multiple tasks without any issue. The scalability is good.
Now we need to control the tasks remotely: PAUSE, RESUME, CANCEL.
The only solution we found is to make an RPC call from the Celery task to another task that returns the command after a DB request. The Celery task and the RPC task are not on the same machine, and only the RPC task has access to the DB.
Do you have any advice how to improve it and easily communicate with an ongoing task?
Thank you
EDIT:
In fact we would like to do something like in the picture below. It's easy to do the Blue configuration or the Orange, but we don't know how to do both simultaneously.
Workers are subscribing to a common Jobs queue and each worker has its own Admin queue declared on an exchange.
EDIT:
If this is not possible with Celery, I'm open to a solution with other frameworks like python-rq.
It looks like the Control Bus pattern.
For better scalability, and to reduce the number of RPC calls, I recommend reversing the logic: the PAUSE, RESUME and CANCEL commands are pushed to the Celery workers through a control bus when the state change occurs. The Celery app then stores the current command state locally (in memory, on the filesystem, ...). If task states must be kept even after a stop/start of the app, it will involve more work to keep both apps synchronized (e.g. synchronization at startup).
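As a rough sketch of that idea (my own assumption about the wiring, not the answerer's code): push the command over the worker's admin/broadcast queue into a small local store, and have the long-running task check that store at convenient checkpoints instead of making an RPC/DB call each time. All names below (CONTROL_DIR, set_command, checkpoint) are made up for illustration:

import os
import time

# Sketch only: assumes an existing Celery `app`. The store is a tiny
# filesystem directory shared by all pool processes on one worker machine.
CONTROL_DIR = "/tmp/task_control"

@app.task(name='control.set_command')
def set_command(job_id, command):
    # Route this task to the worker's own admin/broadcast queue so every
    # worker machine records the command locally.
    os.makedirs(CONTROL_DIR, exist_ok=True)
    with open(os.path.join(CONTROL_DIR, str(job_id)), "w") as f:
        f.write(command)  # "PAUSE", "RESUME" or "CANCEL"

def current_command(job_id):
    try:
        with open(os.path.join(CONTROL_DIR, str(job_id))) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def checkpoint(job_id):
    # Long-running tasks call this at convenient points in their work loop.
    if current_command(job_id) == "CANCEL":
        raise RuntimeError("job {} cancelled".format(job_id))
    while current_command(job_id) == "PAUSE":
        time.sleep(1)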
I'd like to demonstrate a general approach to implementing pause-able (and resume-able) ongoing celery tasks through the workflow pattern. Note: Original answer written here. Re-writing here due to this post being very relevant.
Concept
With celery workflows - you can design your entire operation to be divided into a chain of tasks. It doesn't necessarily have to be purely a chain, but it should follow the general concept of one task after another task (or task group) finishes.
Once you have a workflow like that, you can finally define points to pause at throughout your workflow. At each of these points, you can check whether or not the frontend user has requested the operation to pause and continue accordingly. The concept is this:-
A complex and time consuming operation O is split into 5 celery tasks - T1, T2, T3, T4, and T5 - each of these tasks (except the first one) depend on the return value of the previous task.
Let's assume we define points to pause after every single task, so the workflow looks like-
T1 executes
T1 completes, check if user has requested pause
If user has not requested pause - continue
If user has requested pause, serialize the remaining workflow chain and store it somewhere to continue later
... and so on. Since there's a pause point after each task, that check is performed after every one of them (except the last one of course).
But this is only theory, I struggled to find an implementation of this anywhere online so here's what I came up with-
Implementation
from typing import Any, Optional
from celery import shared_task
from celery.canvas import Signature, chain, signature
@shared_task(bind=True)
def pause_or_continue(
    self, retval: Optional[Any] = None, clause: dict = None, callback: dict = None
):
    # Task to use for deciding whether to pause the operation chain
    if signature(clause)(retval):
        # Pause requested, call given callback with retval and remaining chain
        # chain should be reversed as the order of execution follows from end to start
        signature(callback)(retval, self.request.chain[::-1])
        self.request.chain = None
    else:
        # Continue to the next task in chain
        return retval
def tappable(ch: chain, clause: Signature, callback: Signature, nth: Optional[int] = 1):
    '''
    Make an operation workflow chain pause-able/resume-able by inserting
    the pause_or_continue task for every nth task in the given chain

    ch: chain
        The workflow chain

    clause: Signature
        Signature of a task that takes one argument - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed)
        - and returns a boolean indicating whether or not the operation should pause
        Should return True if the operation should pause, False if it should continue normally

    callback: Signature
        Signature of a task that takes 2 arguments - the return value of the
        last executed task in the workflow (if any - otherwise `None` is passed) and
        the remaining chain of the operation workflow as a json dict object
        No return value is expected
        This task will be called when `clause` returns `True` (i.e. the task is pausing)
        The return value and the remaining chain can be handled accordingly by this task

    nth: int
        Check `clause` after every nth task in the chain
        Default value is 1, i.e. check `clause` after every task
        Hence, by default, the user given `clause` is called and checked after every task

    NOTE: The passed in chain is mutated in place
    Returns the mutated chain
    '''
    newch = []
    for n, sig in enumerate(ch.tasks):
        if n != 0 and n % nth == nth - 1:
            newch.append(pause_or_continue.s(clause=clause, callback=callback))
        newch.append(sig)
    ch.tasks = tuple(newch)
    return ch
Explanation - pause_or_continue
Here pause_or_continue is the aforementioned pause point. It's a task that will be called at specific intervals (intervals as in task intervals, not as in time intervals). This task then calls a user provided function (actually a task) - clause - to check whether or not the task should continue.
If the clause function (actually a task) returns True, the user provided callback function is called, the latest return value (if any - None otherwise) is passed onto this callback, as well as the remaining chain of tasks. The callback does what it needs to do and pause_or_continue sets self.request.chain to None, which tells celery "The task chain is now empty - everything is finished".
If the clause function (actually a task) returns False, the return value from the previous task (if any - None otherwise) is returned back for the next task to receive - and the chain goes on. Hence the workflow continues.
Why are clause and callback task signatures and not regular functions?
Both clause and callback are called directly - without delay or apply_async. They are executed in the current process, in the current context. So they behave exactly like normal functions - then why use signatures?
The answer is serialization. You can't conveniently pass a regular function object to a celery task. But you can pass a task signature. That's exactly what I'm doing here. Both clause and callback should be a regular signature object of a celery task.
What is self.request.chain?
self.request.chain stores a list of dicts (representing jsons as the celery task serializer is json by default) - each of them representing a task signature. Each task from this list is executed in reverse order. Which is why, the list is reversed before passing to the user provided callback function (actually a task) - the user probably expects the order of tasks to be left to right.
Quick note: irrelevant to this discussion, but if you're using the link parameter of apply_async to construct a chain instead of the chain primitive itself, then self.request.callback is the property to be modified (i.e. set to None to remove the callback and stop the chain) instead of self.request.chain.
Explanation - tappable
tappable is just a basic function that takes a chain (which is the only workflow primitive covered here, for brevity) and inserts pause_or_continue after every nth task. You can insert them wherever you want really; it is up to you to define pause points in your operation. This is just an example!
For each chain object, the actual signatures of the tasks (in order, going from left to right) are stored in the .tasks property. It's a tuple of task signatures. So all we have to do is take this tuple, convert it into a list, insert the pause points, and convert it back to a tuple to assign to the chain. Then return the modified chain object.
The clause and callback are also attached to the pause_or_continue signature. Normal Celery stuff.
That covers the primary concept, but to showcase a real project using this pattern (and also to showcase the resuming part of a paused task), here's a small demo of all the necessary resources
Usage
This example usage assumes the concept of a basic web server with a database. Whenever an operation (i.e workflow chain) is started, it's assigned an id and stored into the database. The schema of that table looks like-
-- Create operations table
-- Keeps track of operations and the users that started them
CREATE TABLE operations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    requester_id INTEGER NOT NULL,
    completion TEXT NOT NULL,
    workflow_store TEXT,
    result TEXT,
    FOREIGN KEY (requester_id) REFERENCES user (id)
);
The only field that needs to be known about right now, is completion. It just stores the status of the operation-
When the operation starts and a db entry is created, this is set to IN PROGRESS
When a user requests pause, the route controller (i.e view) modifies this to REQUESTING PAUSE
When the operation actually gets paused and callback (from tappable, inside pause_or_continue) is called, the callback should modify this to PAUSED
When the task is completed, this should be modified to COMPLETED
An example of clause
@celery.task()
def should_pause(_, operation_id: int):
    # This is the `clause` to be used for `tappable`
    # i.e it lets celery know whether to pause or continue
    db = get_db()
    # Check the database to see if user has requested pause on the operation
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    return operation["completion"] == "REQUESTING PAUSE"
This is the task to call at the pause points, to determine whether or not to pause. It's a task that takes 2 parameters... well, sort of. The first one is mandatory: tappable requires the clause to have one (and exactly one) argument, so it can pass the previous task's return value to it (even if that return value is None). In this example the return value isn't needed, so we can just ignore it.
The second parameter is an operation id. See, all this clause does - is check a database for the operation (the workflow) entry and see if it has the status REQUESTING PAUSE. To do that, it needs to know the operation id. But clause should be a task with one argument, what gives?
Well, good thing signatures can be partial. When the task is first started and a tappable chain is created, the operation id is known, and hence we can do should_pause.s(operation_id) to get the signature of a task that takes one remaining parameter - the return value of the previous task. That qualifies as a clause!
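As a toy illustration of that (not from the original answer):

# operation_id is bound when the signature is created; the previous task's
# return value fills the remaining first argument when the signature is called.
operation_id = 42      # placeholder
prev_retval = None     # placeholder for the previous task's return value

clause = should_pause.s(operation_id)  # signature with one argument left open
clause(prev_retval)                    # equivalent to should_pause(prev_retval, operation_id)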
An example of callback
import os
import json
from typing import Any, List
@celery.task()
def save_state(retval: Any, chains: List[dict], operation_id: int):
    # This is the `callback` to be used for `tappable`
    # i.e this is called when an operation is pausing
    db = get_db()
    # Prepare directories to store the workflow
    operation_dir = os.path.join(app.config["OPERATIONS"], f"{operation_id}")
    workflow_file = os.path.join(operation_dir, "workflow.json")
    if not os.path.isdir(operation_dir):
        os.makedirs(operation_dir, exist_ok=True)
    # Store the remaining workflow chain, serialized into json
    with open(workflow_file, "w") as f:
        json.dump(chains, f)
    # Store the result from the last task and the workflow json path
    db.execute(
        """
        UPDATE operations
        SET completion = ?,
            workflow_store = ?,
            result = ?
        WHERE id = ?
        """,
        ("PAUSED", workflow_file, f"{retval}", operation_id),
    )
    db.commit()
And here's the task to be called when the task is being paused. Remember, this should take the last executed task's return value and the remaining list of signatures (in order, from left to right). There's an extra param - operation_id - once again. The explanation for this is the same as the one for clause.
This function stores the remaining chain in a json file (since it's a list of dicts). Remember, you can use a different serializer - I'm using json since it's the default task serializer used by celery.
After storing the remaining chain, it updates the completion status to PAUSED and also logs the path to the json file into the db.
Now, let's see these in action-
An example of starting the workflow
def start_operation(user_id, *operation_args, **operation_kwargs):
    db = get_db()
    operation_id: int = db.execute(
        "INSERT INTO operations (requester_id, completion) VALUES (?, ?)",
        (user_id, "IN PROGRESS"),
    ).lastrowid
    # Convert a regular workflow chain to a tappable one
    tappable_workflow = tappable(
        (T1.s() | T2.s() | T3.s() | T4.s() | T5.s(operation_id)),
        should_pause.s(operation_id),
        save_state.s(operation_id),
    )
    # Start the chain (i.e send task to celery to run asynchronously)
    tappable_workflow(*operation_args, **operation_kwargs)
    db.commit()
    return operation_id
A function that takes in a user id and starts an operation workflow. This is more or less an impractical dummy function modeled around a view/route controller. But I think it gets the general idea through.
Assume T[1-4] are all unit tasks of the operation, each taking the previous task's return as an argument. Just an example of a regular celery chain, feel free to go wild with your chains.
T5 is a task that saves the final result (the result from T4) to the database. So along with the return value from T4 it needs the operation_id, which is passed into the signature.
An example of pausing the workflow
def pause(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "IN PROGRESS":
        # Pause only if the operation is in progress
        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("REQUESTING PAUSE", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
This employs the previously mentioned concept of modifying the db entry to change completion to REQUESTING PAUSE. Once this is committed, the next time pause_or_continue calls should_pause, it'll know that the user has requested the operation to pause and it'll do so accordingly.
An example of resuming the workflow
def resume(operation_id):
    db = get_db()
    operation = db.execute(
        "SELECT * FROM operations WHERE id = ?", (operation_id,)
    ).fetchone()
    if operation and operation["completion"] == "PAUSED":
        # Resume only if the operation is paused
        with open(operation["workflow_store"]) as f:
            # Load the remaining workflow from the json
            workflow_json = json.load(f)
        # Load the chain from the json (i.e deserialize)
        workflow_chain = chain(signature(x) for x in workflow_json)
        # Start the chain and feed in the last executed task result
        workflow_chain(operation["result"])
        db.execute(
            """
            UPDATE operations
            SET completion = ?
            WHERE id = ?
            """,
            ("IN PROGRESS", operation_id),
        )
        db.commit()
        return 'success'
    return 'invalid id'
Recall that, when the operation is paused, the remaining workflow is stored in a json. Since we are currently restricting the workflow to a chain object, we know this json is a list of signatures that should be turned back into a chain. So we deserialize it accordingly and send it to the celery worker.
Note that, this remaining workflow still has the pause_or_continue tasks as they were originally - so this workflow itself, is once again pause-able/resume-able. When it pauses, the workflow.json will simply be updated.
I want to send a "pause" signal to a long running task in Celery and I'm trying to figure out the best way to do it. In the view I can pull an instance of the object from the database and tell that to save, but it's not the same as the instance of the object in Celery. The object doesn't check back to see if it's paused.
Polling the database from within the long-running class and task feels weird and impractical so I'm looking at sending my instance a message. I looked at using pubsub but I would prefer to use Django signals as it's already a Django project. I might be approaching this the wrong way.
Here's an example that does not work:
Models.py
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            if not self.is_paused:
                self.processed_files += 1
                time.sleep(1)
        # Task complete, let's save.
        self.save()
Views.py
def pause_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = True
    lrc.save()
    return HttpResponse(json.dumps({'is_paused': lrc.is_paused}))

def resume_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = False
    lrc.save()
    # Pretend this is a Celery task
    lrc.long_task()
So if I modify models.py to use signals, I can add these lines but it still does not quite work:
pause_signal = django.dispatch.Signal(providing_args=['is_paused'])

@django.dispatch.receiver(pause_signal)
def pause_callback(sender, **kwargs):
    if 'is_paused' in kwargs:
        sender.is_paused = kwargs['is_paused']
        sender.save()
That doesn't affect the instantiated class that's already running either. How can I tell the instance of my model running within the task to pause?
A Celery task runs in a separate process. Django signals are the standard "observer" pattern and work within a single process, so there is no way to organize communication between processes using signals. You need to reload the object from the database to know whether its properties have changed.
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def get_is_paused(self):
        # Reload from the database so changes made by other processes are visible
        db_obj = LongRunningClass.objects.get(pk=self.pk)
        return db_obj.is_paused

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            if not self.get_is_paused():
                self.processed_files += 1
                time.sleep(1)
        # Task complete, let's save.
        self.save()
Not very good design - you would be better off moving long_task somewhere else and operating on a freshly loaded LongRunningClass instance, but it will do the job. You could also add some memcache here if you don't want to hit your database so often.
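Picking up that memcache suggestion, here is a rough sketch (my own assumption, using Django's cache framework; the key name and the 5-second timeout are made up):

# Hedged sketch: a cached replacement for LongRunningClass.get_is_paused above,
# so the pause flag doesn't hit the database on every loop iteration.
# Assumes Django's cache framework is configured (e.g. with Memcached).
from django.core.cache import cache

def get_is_paused(self):
    key = "lrc:{}:is_paused".format(self.pk)
    is_paused = cache.get(key)
    if is_paused is None:
        is_paused = LongRunningClass.objects.get(pk=self.pk).is_paused
        cache.set(key, is_paused, timeout=5)  # re-read from the DB at most every 5 seconds
    return is_paused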
BTW: I'm not 100% sure, but you may have another design issue here. It is a rather rare case to have really long-running tasks built around this kind of loop. Think about removing the loop from your program (you have queues!). Take a look:
@celery.task  # schedule via celery beat, e.g. every 2 minutes: adds XX files for processing each run
def scheduled_task(lr_pk):
    lr = LongRunningClass.objects.get(pk=lr_pk)
    if not lr.is_paused:
        remaining_files = lr.total_files - lr.processed_files
        for i in xrange(min(lr.files_per_iteration, remaining_files)):
            process_file.delay(lr.pk, i)

@celery.task(rate_limit='1/m', queue='process_file')  # processing each file
def process_file(lr_pk, i):
    # do something with i
    lr = LongRunningClass.objects.get(pk=lr_pk)
    lr.processed_files += 1
    lr.save()
You have to set up celery beat and create a separate queue for this type of task to implement this solution. But as a result you will have a lot of control over your program - rate limits, parallel execution - and your code will not hang on sleep(1). If you create another model for each file you can track which files are processed and which are not, handle errors, etc.
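For reference, a rough sketch of the celery beat and routing configuration this implies (old-style django-celery settings; the schedule entry name, the 2-minute interval and LR_PK are placeholders of mine):

from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'schedule-file-processing': {
        'task': 'scheduled_task',         # use the fully-qualified registered task name
        'schedule': timedelta(minutes=2),
        'args': (LR_PK,),                 # placeholder: pk of the LongRunningClass row to drive
    },
}

CELERY_ROUTES = {
    'process_file': {'queue': 'process_file'},  # keep file processing on its own queue
}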
Take a look at celery.contrib.abortable -- this is an alternate base class for Celery tasks that implements a signal between caller and task to handle terminations, and it could also be used to implement a "pause".
When the caller calls abort(), a status is marked in the backend. The task calls self.is_aborted() to see if that special status has been set, and then implements whatever action is appropriate (terminate, pause, ignore etc.). The action is under the task's control; this is not automated task termination.
This could be used as-is if it is sensible for the specific task to interpret the ABORT signal as a request for a pause. Or you could extend the class to provide more signals, not just the existing ABORT.
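A minimal sketch of what that looks like (based on the celery.contrib.abortable API; the task body and the Celery app are assumptions of mine):

# Minimal sketch of celery.contrib.abortable usage. Assumes a Celery `app`
# configured with a persistent result backend (e.g. Redis or a database).
import time
from celery.contrib.abortable import AbortableTask, AbortableAsyncResult

@app.task(bind=True, base=AbortableTask)
def long_job(self, n):
    for i in range(n):
        if self.is_aborted():
            # Interpret ABORT as a request to stop (or pause) at this point;
            # persist progress here if you want to resume later.
            return {"stopped_at": i}
        time.sleep(1)  # stand-in for one unit of real work
    return {"stopped_at": n}

# Caller side (e.g. a view): flag the running task.
result = long_job.delay(3600)
AbortableAsyncResult(result.id).abort()  # long_job sees is_aborted() == True shortly after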
I have to run some long (2-3 days, I think) tasks over my Django ORM data. I looked around and didn't find any good solutions.
django-tasks - http://code.google.com/p/django-tasks/ - is not well documented, and I have no idea how to use it.
celery - http://ask.github.com/celery/ - is excessive for my tasks. Is it good for long-running tasks?
So, what I need to do is get all the data, or parts of the data, from my database, like:
Entry.objects.all()
and then execute the same function for each object in the QuerySet.
I think it will run for around 2-3 days.
So, maybe someone can explain to me how to build this.
P.S.: at the moment I have only one idea: use cron and the database to store the process execution timeline.
Use Celery Sub-Tasks. This will allow you to start a long-running task (with many short-running subtasks underneath it), and keep good data on its execution status within Celery's task result store. As an added bonus, subtasks will be spread across worker processes, allowing you to take full advantage of multi-core servers or even multiple servers in order to reduce task runtime.
http://ask.github.com/celery/userguide/tasksets.html#task-sets
http://docs.celeryproject.org/en/latest/reference/celery.task.sets.html
EDIT: example:
import time, logging as log
from celery.task import task
from celery.task.sets import TaskSet
from app import Entry

@task(send_error_emails=True)
def long_running_analysis():
    entries = list(Entry.objects.all().values('id'))
    num_entries = len(entries)
    taskset = TaskSet(analyse_entry.subtask((entry['id'],)) for entry in entries)
    results = taskset.apply_async()
    while not results.ready():
        time.sleep(10000)
        log.info("long_running_analysis is %d%% complete",
                 results.completed_count() * 100 / num_entries)
    if results.failed():
        log.error("Analysis Failed!")
    result_set = results.join()  # brings back results in
                                 # the order of entries
    # perform collating or count or percentage calculations here
    log.error("Analysis Complete!")

@task
def analyse_entry(id):  # inputs must be serialisable
    logger = analyse_entry.get_logger()
    entry = Entry.objects.get(id=id)
    try:
        analysis = entry.analyse()
        logger.info("'%s' found to be %s.", entry, analysis['status'])
        return analysis  # must be a dict or serialisable.
    except Exception as e:
        logger.error("Could not process '%s': %s", entry, e)
        return None
If your calculations cannot be segregated into per-entry tasks, you can always set it up so that one subtask performs tallies and another subtask performs a different analysis type. This will still work, and will still allow you to benefit from parallelism.