Can anybody explain to me how celery group works? - python

When I use celery group and chains to schedule tasks as below
(group([group_task]) | sum_task).apply_async()
the group tasks can be executed on many workers. After all the group tasks have finished, sum_task begins to execute (maybe on another worker). So
can somebody tell me how celery knows that the group tasks are all finished before it starts the sum_task?

You could specify a different queue for each chained task and for the group/chord callback task.
A snippet like:
#shared_task(name="analyze_atom", queue="atom")
def analyze_atom(image_urls, targetdir=target_path, studentuid=None):
return {}
#shared_task(name="summary_up", queue="summary")
def summary_up(rets, studentuid, images):
return {}
chord(analyze_atom.s([image]) for image in images)(summary_up.s(studentuid, images))
And, while the tasks are running, you can inspect the broker content. Assuming you are using RabbitMQ as the broker, you can check queue depth with the RabbitMQ management plugin, or via the pyrabbit interface, as in this snippet:
from pyrabbit.api import Client

cl = Client('localhost:15672', 'guest', 'guest')
count = cl.get_queue_depth('/', 'summary')  # check the depth of the 'summary' queue
cl.get_messages('/', 'paperanalyzer')       # fetch the messages currently in a queue
And, you should have a result backend; with it you can get every task's result by its task id.
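For example, a minimal sketch of looking up a result by id (assuming a configured result backend; app here stands for your Celery application instance and task_id is an id you stored earlier):

from celery.result import AsyncResult

res = AsyncResult(task_id, app=app)
print(res.state)       # e.g. PENDING, STARTED, SUCCESS, FAILURE
if res.ready():
    print(res.result)  # the value the task returned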
I think with the techniques above it's easy to inspect how a Celery task progresses.
Good luck :-)

Related

How to Inspect the Queue Processing a Celery Task

I'm currently leveraging celery for periodic tasks. I am new to celery. I have two workers running two different queues: one for slow background jobs and one for jobs users queue up in the application.
I am monitoring my tasks on Datadog because it's an easy way to confirm my workers are running appropriately.
What I want to do is record, after each task completes, which queue the task was completed on.
@after_task_publish.connect
def on_task_publish(sender=None, headers=None, body=None, **kwargs):
    statsd.increment("celery.on_task_publish.start.increment")
    task = celery.tasks.get(sender)
    queue_name = task.queue
    statsd.increment("celery.on_task_publish.increment", tags=[f"{queue_name}:{task}"])
The function above is something I implemented after researching the Celery docs and some StackOverflow posts, but it's not working as intended: I get the first statsd increment, but the remaining code does not execute.
I am wondering if there is a simpler way to inspect, inside or after each task completes, which queue processed the task.
Since your question says is there a way to inspect inside/after each task completes, I'm assuming you haven't tried the Celery result backend yet. You could check out this feature, which is provided by Celery itself: the Celery result backend (task result backend).
It is very useful for storing results of your celery tasks.
Read through this => https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-result-backend-settings
Once you get an idea of how to set up this result backend, search for the result_extended key (in the same link) to be able to add queue names to your task return values.
A number of backends are available; you can have these results go to any of:
SQL DB / NoSQL DB / S3 / Azure / Elasticsearch / etc.
I have made use of this result backend feature with Elasticsearch, and that is where my task results are stored.
It is just a matter of adding a few configurations to your settings.py file as per your requirements. It worked really well for my application. I also have a weekly cron that clears only the successful task results, since we don't need them anymore, so only the failed results remain visible.
These were the main keys for my requirement: task_track_started and task_acks_late, along with result_backend.
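For reference, a minimal sketch of such a configuration in settings.py (assuming the Celery app loads Django settings with namespace="CELERY"; the Elasticsearch URL and index names are placeholders, not the poster's actual values):

# settings.py -- illustrative Celery configuration
CELERY_RESULT_BACKEND = "elasticsearch://localhost:9200/celery_results/task"  # placeholder URL
CELERY_RESULT_EXTENDED = True      # also store task name, args, kwargs, worker, queue, etc.
CELERY_TASK_TRACK_STARTED = True   # report the STARTED state, not just PENDING/SUCCESS
CELERY_TASK_ACKS_LATE = True       # acknowledge the message only after the task has run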

How to get task id from celery in case of scheduled tasks (beat)

To access the info of a celery task, I need the task_id. When a celery task is started manually, I can easily get the id of this task with task.id (and then write it to the DB or do something else).
If I use celery beat, which periodically sends tasks to the worker, that does not seem to be possible.
So my question is: how do I get the id of the task at the moment beat sends it to celery's worker?
At the moment the worker receives the task, the console shows the task id. So my worry is that at the moment the task is sent by beat to the worker, it has no task id yet.
Manual case to get task_id:
task = tasks.LongRunningTask.delay(username_from_formTargetsLaden, password_from_formTargetsLaden, url_from_formTargetsLaden)
task_id = task.id
Perhaps some of you have an idea?
I found an answer for that little issue:
If you need the task id of a task which was initially sent by beat, you can simply add an inspect call to your (scheduled) worker task.
Configure Periodic-Task
This is the schedule which "reminds" celery every day at 11:08 am (UTC) to kick off the task.
@celery.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    test = sender.add_periodic_task(crontab(minute=8, hour=11), CheckLists.s(app.config['USR'], app.config['PWD']))
Task to be executed periodically
This is the scheduled task which will be executed by celery after the worker received the "reminder" from beat.
@celery.task(bind=True)
def CheckLists(self, arg1, arg2):
    # get the task_id of the scheduled task 'CheckLists'
    i = inspect()
    activetasks = i.active()
    list_of_tasks = {'activetasks': activetasks}
    task_id = list_of_tasks['activetasks']['celery@DESKTOP-XXXXX'][0]['id']  # adapt this section depending on environment (local, webserver, etc...)
    task_type = "CHECK_LISTS"
    task_id_to_db = Tasks(task_id, task_type)
    db.session.add(task_id_to_db)
    db.session.commit()
    long_runnning_task
    [...more task relevant code here...]
So I'm making use of app.control.inspect, which lets you inspect running workers. It uses remote control commands under the hood.
With i.active() you get a dictionary which you can easily parse.
As long as I don't find documentation on an easier way to get the task_id of a periodic task, I'll stick to this solution.
After you have saved the task id you can easily poll the task status, etc., via AJAX for instance.
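For instance, a minimal sketch of the polling side (assuming a result backend is configured; celery here is the application object from above, and the function name is illustrative):

from celery.result import AsyncResult

def task_status(task_id):
    # look up the state of a previously saved task id in the result backend
    res = AsyncResult(task_id, app=celery)
    return {'task_id': task_id, 'state': res.state}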
Hope that helps you guys :)

Using Celery to execute method on startup

Currently, we are using Celery & RabbitMQ to perform repeatable tasks on Ubuntu 14.04 servers and everything is working great. Celery picks up tasks from RMQ and executes the correct method. We have 12 Celery workers constantly monitoring RMQ queues. We have a new requirement where we want to execute 1 method in Celery only once or say once a day. Is this possible to do? I don't want to look at possibly other technologies as we are invested in Celery/RMQ at the moment.
Thanks in advance.
For every task, you can store a boolean value which keeps track of whether it has been executed for the day; you can store this data in a DB or some file store.
Maintain a cron job that runs daily and resets every task's value to false (false meaning the task has not been executed that day).
Create a celery task_prerun signal handler that returns if the task is already done for the day, and otherwise continues task processing:
from django.db import models

class TaskModel(models.Model):
    task = models.CharField(max_length=200)
    is_executed = models.BooleanField(default=False)

from celery.signals import task_prerun

@task_prerun.connect
def task_setup(signal=None, sender=None, task_id=None, task=None, args=None, kwargs=None):
    # this method executes before every celery task
    task_obj = TaskModel.objects.get(task=task.name)
    if task_obj.is_executed:
        return
Celery beat is made exactly for this requirement: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
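For example, a minimal sketch of a daily beat schedule entry (the task name and time are illustrative):

from celery.schedules import crontab

app.conf.beat_schedule = {
    'run-once-a-day': {
        'task': 'myapp.tasks.daily_job',        # illustrative task name
        'schedule': crontab(hour=3, minute=0),  # every day at 03:00
    },
}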

How can I run a task after other (already started) tasks are finished

I'm new to Celery and I'm trying to understand if it can solve my problem.
I need to start a number of tasks (An) and then run another task (B) after these are done. The problem is that the An tasks are added sequentially, and I don't want to wait for the last one to be added before I start the first one. Can I configure task B to execute after the An tasks are done?
Now to the real scenario:
Task An - process a file uploaded by a user (added after each file is uploaded)
Task B - do something with the results of processing all uploaded files
Alternative solutions are welcome as well.
For sure you can do this; celery canvas supports many options, including the behaviour you require, running a task after a group of tasks. It is called a "chord", e.g.:
from celery import chord
from tasks import task_upload1, task_upload2, task_upload3, final_execution

result = chord([task_upload1.s(), task_upload2.s(), task_upload3.s()])(final_execution.s())
get_required_result = result.get()
you can refer to this link for more details
With RabbitMQ you can get the exact behaviour using message acknowledgments and the aggregator pattern.
You start a worker that consumes messages (A) and does some work (process a file uploaded by a user, in your case), but doesn't send an ack when finished. Instead it takes the next message from the queue, and if it's an A task again, it does the same thing. At some point it will receive task B; it can then process all the previous A results, all at once, and send acks for all of them.
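A rough sketch of that pattern at the RabbitMQ level using pika (the queue name and helper functions are illustrative, not from the original post):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()
ch.queue_declare(queue='uploads', durable=True)

results = []  # accumulated results of the A messages seen so far

def on_message(channel, method, properties, body):
    if properties.type == 'B':
        handle_all(results)                     # assumed aggregation step
        # ack B and every unacked A message before it in one go
        channel.basic_ack(method.delivery_tag, multiple=True)
        results.clear()
    else:
        results.append(process_file(body))      # assumed per-file processing; ack deferred

ch.basic_consume(queue='uploads', on_message_callback=on_message)
ch.start_consuming()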
Unfortunately, this scenario can't be done with Celery, because you have to specify all the A tasks and the final B task (chains, chords, callbacks, etc.) at creation time.
Alternatively, you can save the Task.id of each successful A task in a separate store (not a Celery queue) and process those when executing task B. Celery can fit this algorithm.
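A minimal sketch of that alternative (the Redis key, helper functions, and task names are illustrative assumptions):

import redis
from celery.result import AsyncResult

r = redis.Redis()

@app.task(bind=True)
def task_a(self, path):
    process_file(path)                               # assumed per-file processing
    r.rpush('pending_a_task_ids', self.request.id)   # remember which A tasks ran

@app.task(bind=True)
def task_b(self):
    ids = [i.decode() for i in r.lrange('pending_a_task_ids', 0, -1)]
    if not all(AsyncResult(i, app=self.app).ready() for i in ids):
        raise self.retry(countdown=60)               # some A tasks still running; try again later
    return summarize([AsyncResult(i, app=self.app).result for i in ids])  # assumed aggregation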

How to inspect and cancel Celery tasks by task name

I'm using Celery (3.0.15) with Redis as a broker.
Is there a straightforward way to query the number of tasks with a given name that exist in a Celery queue?
And, as a followup, is there a way to cancel all tasks with a given name that exist in a Celery queue?
I've been through the Monitoring and Management Guide and don't see a solution there.
# Retrieve tasks
# Reference: http://docs.celeryproject.org/en/latest/reference/celery.events.state.html
query = celery.events.state.tasks_by_type(your_task_name)
# Kill tasks
# Reference: http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks
for uuid, task in query:
    celery.control.revoke(uuid, terminate=True)
There is one issue that earlier answers have not addressed and that may throw people off if they are not aware of it.
Among the solutions already posted, I'd use Danielle's with one minor modification: I'd import the task into my file and use its .name attribute to get the task name to pass to .tasks_by_type().
app.control.revoke(
    [uuid for uuid, _ in
     celery.events.state.State().tasks_by_type(task.name)])
However, this solution will ignore tasks that have been scheduled for future execution. Like some people who commented on other answers, when I checked what .tasks_by_type() returns, I got an empty list. And indeed my queues were empty. But I knew that there were tasks scheduled to be executed in the future, and these were my primary target. I could see them by executing celery -A [app] inspect scheduled, but they were unaffected by the code above.
I managed to revoke the scheduled tasks by doing this:
app.control.revoke(
    [scheduled["request"]["id"] for scheduled in
     chain.from_iterable(app.control.inspect().scheduled()
                         .itervalues())])  # .itervalues() is Python 2; use .values() on Python 3
app.control.inspect().scheduled() returns a dictionary whose keys are worker names and values are lists of scheduling information (hence, the need for chain.from_iterable which is imported from itertools). The task information is in the "request" field of the scheduling information and "id" contains the task id. Note that even after revocation, the scheduled task will still show among the scheduled tasks. Scheduled tasks that are revoked won't get removed from the list of scheduled tasks until their timers expire or until Celery performs some cleanup operation. (Restarting workers triggers such cleanup.)
You can do this in one request:
app.control.revoke([
    uuid
    for uuid, _ in
    celery.events.state.State().tasks_by_type(task_name)
])
As usual with Celery, none of the answers here worked for me at all, so I did my usual thing and hacked together a solution that just inspects redis directly. Here we go...
# First, get a list of tasks from redis:
import redis, json
r = redis.Redis(
    host=settings.REDIS_HOST,
    port=settings.REDIS_PORT,
    db=settings.REDIS_DATABASES['CELERY'],
)
l = r.lrange('celery', 0, -1)

# Now import the task you want so you can get its name
from my_django.tasks import my_task

# Now, import your celery app and iterate over all tasks
# from redis and nuke the ones that have a matching name.
from my_django.celery_init import app
for task in l:
    task_headers = json.loads(task)['headers']
    task_name = task_headers["task"]
    if task_name == my_task.name:
        task_id = task_headers['id']
        print("Terminating: %s" % task_id)
        app.control.revoke(task_id, terminate=True)
Note that revoking in this way might not revoke prefetched tasks, so you might not see results immediately.
Also, this answer doesn't support prioritized tasks. If you want to modify it to do that, you'll want some of the tips in my other answer that hacks redis.
It looks like flower provides monitoring:
https://github.com/mher/flower
Real-time monitoring using Celery Events:
Task progress and history
Ability to show task details (arguments, start time, runtime, and more)
Graphs and statistics
Remote Control:
View worker status and statistics
Shutdown and restart worker instances
Control worker pool size and autoscale settings
View and modify the queues a worker instance consumes from
View currently running tasks
View scheduled tasks (ETA/countdown)
View reserved and revoked tasks
Apply time and rate limits
Configuration viewer
Revoke or terminate tasks
HTTP API
OpenID authentication
