I have two Celery nodes on two machines (n1, n2), and my tasks are enqueued from a third machine (main).
The main machine may not know the available node names.
My question is whether there is any guarantee that a chain of tasks will run on a single node.
res = chain(generate.s(filePath1, filePath2), mix.s(), sort.s())
The problem is that the tasks use local data files that are node-specific.
My guess is that a chain is probably like a chord, for which the docs explicitly say there is no guarantee of running on a single node.
If my guess about chains is right, my next question is: would the following be a good alternative to chains?
single task = guaranteed single node
@app.task
def my_chain_of_tasks():
    celery.current_app.send_task('mymodel.tasks.generate', args=[filePath1, filePath2]).get()
    celery.current_app.send_task('mymodel.tasks.mix').get()
    # do these 2 in parallel:
    res1 = celery.current_app.send_task('mymodel.tasks.sort')
    res2 = celery.current_app.send_task('mymodel.tasks.email_in_parallel')
    res1.get()
    return res2.get()
Or will this still send the tasks through the message queue and cause the same problem?
You are calling .get() on a task inside another task, which is counterproductive. Also, there is no guarantee that all of those tasks will be executed on a single node.
If you want certain tasks to be executed by a particular node, you can route them to a dedicated queue.
CELERY_ROUTES = {
    'mymodel.task.task1': {'queue': 'queue1'},
    'mymodel.task.task2': {'queue': 'queue2'}
}
Now you can start two workers to consume them:
celery -A your_proj worker -Q queue1
celery -A your_proj worker -Q queue2
Now every task1 will be executed by the first worker and every task2 by the second.
Docs: http://celery.readthedocs.org/en/latest/userguide/routing.html#manual-routing
Related
I have a task that talks to an external API. The JSON response is quite large, and I have to make this call multiple times, followed by further Python processing. To make this less time-consuming, I initially tried:
from multiprocessing import Process
def make_call(*args, **kwargs):
    pass
def make_another(*args, **kwargs):
    pass
def get_calls():
    return make_call, make_another
def task(*args, **kwargs):
    # one child process per callable returned by get_calls()
    procs = [Process(target=func, args=(), kwargs={}) for func in get_calls()]
    _start = [proc.start() for proc in procs]
    _join = [proc.join() for proc in procs]
#
transaction.on_commit(lambda: task.delay())
However, I ran into an AssertionError: daemonic processes are not allowed to have children. What would be my best approach to speed up a celery task with additional processes?
A Celery worker already creates many processes. Take advantage of the many worker processes instead of creating child processes. You can delegate work amongst the celery workers instead. This will result in a more stable/reliable execution.
You could either create many tasks from your client code or use Celery's primitives, such as groups, chains, and chords, to parallelize and sequence the work. These primitives can also be composed with one another.
For example, in your scenario, you may have two tasks: one to make the API call(s), make_api_call, and another to parse the response, parse_response. You can chain these together.
# chain another task when a task completes successfully
res = make_api_call.apply_async((0,), link=parse_response.s())
# chain syntax 1
result_1 = chain(make_api_call.s(1), parse_response.s())
# syntax 2 with | operator
result_2 = make_api_call.s(2) | parse_response.s()
# can group chains
job = group([
    chain(make_api_call.s(i), parse_response.s())
    for i in range(3)
])
result = job.apply_async()
This is just a generic example. You can create tasks and compose them however your workflow requires. See "Canvas: Designing Work-flows" in the Celery documentation for more information.
How do I run this kind of Celery task properly?
from celery import uuid
@app.task
def add(x):
    return x + 1
def some_func():
    result = 'result'
    for i in range(10):
        task_id = uuid()
        add.apply_async((i,), task_id=task_id)
    return result
I need each task to start only after the previous one has completed.
I tried using time.sleep(), but then some_func() does not return the result until all tasks have completed. I need the result returned immediately while the 10 tasks run sequentially in the background.
Celery has group(), but it runs tasks in parallel.
Finally, I solved it by using immutable signatures and a chain:
tasks = [
    add.si(x).set(task_id=uuid())
    for x in range(10)
]
chain(*tasks).apply_async()
If some_func() runs outside Celery (say, in a script acting as a "producer" that just sends those tasks for execution), then nothing stops you from calling .get() on each AsyncResult to wait for the task to finish, and looping as many times as you like.
If, however, you want to execute that loop as some sort of Celery workflow, then you have to build a Chain and use it.
I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
celery -A tasks worker --loglevel=info
"""
from celery import Celery
app = Celery("tasks", broker="pyamqp://guest@localhost//", backend="rpc://")
@app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
    for x, y in pairs:
        self.update_state(state="SUM", meta={"x":x, "y":y, "x+y":x+y})
    return len(pairs)
def handle_message(message):
    if message["status"] == "SUM":
        x = message["result"]["x"]
        y = message["result"]["y"]
        print(f"Message: {x} + {y} = {x+y}")
def non_looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    result = task.get(on_message=handle_message)
    print(result)
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    print(task)
    while True:
        pass
if __name__ == "__main__":
    import sys
    if sys.argv[1:] and sys.argv[1] == "looping":
        looping((3,4), (2,7), (5,5))
    else:
        non_looping((3,4), (2,7), (5,5))
If you run just ./tasks.py, it executes the non_looping function. This does the standard Celery thing: makes a delayed call to the worker function and then uses get to wait for the result. A handle_message callback function prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./tasks.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs, callback=handle_message) # There is no such callback.
    print(task)
    while True:
        pass
In the Celery documentation and all the examples I can find online (for example this), there is no way to define a callback function as part of the delay call or its apply_async equivalent. You can only specify one as part of a get callback. That's making me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
It's a pattern that Celery does not support, although you can (somewhat) work around it by posting custom state updates from your task, as described here.
Use update_state() to update a task's state:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
The reason Celery does not support such a pattern is that task producers (callers) are strongly decoupled from task consumers (workers). The only communication channels between the two are the broker, which carries messages from producers to consumers, and the result backend, which carries results from consumers back to producers. The closest you can get currently is polling the task state, or writing a custom result backend that lets you post events via either AMQP RPC or Redis subscriptions.
I've got a couple of tasks in my tasks.py in Celery.
import uuid
# this should go to the 'math' queue
@app.task
def add(x, y):
    task_uuid = str(uuid.uuid4())  # separate name so the uuid module is not shadowed
    result = x + y
    return {'id': task_uuid, 'result': result}
# this should go to the 'info' queue
@app.task
def notification(calculation):
    print(repr(calculation))
What I'd like to do is place each of these tasks in a separate Celery queue and then assign a number of workers on each queue.
The problem is that I don't know how to pass work from one queue to another from within my code.
So, for instance, when an add task finishes execution, I need a way to place the resulting Python dictionary on the info queue for further processing. How should I do that?
Thanks in advance.
EDIT -CLARIFICATION-
As I said in the comments the question essentially becomes how can a worker place data retrieved from queue A to queue B.
You can try it like this.
Wherever you call the task, you can specify which queue it goes to:
add.apply_async(queue="queuename1")
notification.apply_async(queue="queuename2")
This way you can put tasks in separate queues.
Start a worker for each queue:
celery -A proj worker -Q queuename1 -l info
celery -A proj worker -Q queuename2 -l info
Note that the default queue is named celery, so any task sent without a queue name goes to the celery queue. If you have such tasks, a worker must consume that queue as well:
celery -A proj worker -Q queuename1,celery -l info
To get the behavior you asked for
If you want to pass the result of one task to another:
result = add.apply_async(queue="queuename1")
result = result.get()  # this contains the return value of the task
Then
notification.apply_async(args=[result], queue="queuename2")
I have a complicated scenario I need to tackle.
I'm using Celery to run tasks in parallel. My tasks involve HTTP requests, and I'm planning to use Celery together with eventlet for that purpose.
Let me explain my scenario:
I have two tasks that can run in parallel and a third task that needs to work on the output of those two. I'm therefore using a Celery group to run the two tasks and a Celery chain to pass their output to the third task once they finish.
Now it gets complicated: the third task needs to spawn multiple tasks that I would like to run in parallel, and I would like to collect all of their outputs and process them in another task.
So I created a group for the multiple tasks together with a chain to process all of the information.
I guess I'm missing some basic information about Celery's concurrency primitives. I originally had a single Celery task that worked well, but I needed to make it faster.
This is a simplified sample of the code:
@app.task
def task2():
    return "aaaa"
@app.task
def task3():
    return "bbbb"
@app.task
def task4():
    work = group(...) | task5.s(...)
    work()
@app.task
def task1():
    tasks = [task2.s(a, b), task3.s(c, d)]
    work = group(tasks) | task4.s()
    return work()
This is how I start the operation:
task = task1.apply_async(kwargs=kwargs, queue='queue1')
I save task.id and poll the server every 30 seconds to see whether results are available:
results = task1.AsyncResult(task_id)
if results.ready():
    res = results.get()