Somehow identify celery workers that are shutting down gracefully - python

I have three Celery workers as follows, each running on a different ECS node:
Producer: Keeps generating & sending tasks to the consumer worker. Each task is expected to take several minutes to compute and has a database record.
Consumer: Receives computation tasks and immediately starts execution.
Watchdog: Periodically inspects database records, finds out computation tasks that are executing, and then does celery inspect active to verify whether there is actually a worker carrying out the computation.
We ensured that when the Consumer node is being terminated, the Celery worker on it begins a graceful shutdown, so that the ongoing computation can finish normally. Because Celery unregisters a gracefully stopping worker, the Consumer becomes invisible to the Watchdog, which will mistakenly think a computation task has mysteriously been lost... even though the Consumer is still working on the task.
Is it possible to let a Celery worker broadcast an "I am dying" message upon receiving a warm shutdown signal? Or even better, can we somehow let the Watchdog worker still see shutting workers?

Yes, it is possible. Nodes in a Celery cluster I am responsible for do something similar. Here is a snippet:
from celery.signals import worker_ready, worker_shutdown

@worker_shutdown.connect
def handle_worker_shutdown(**kwargs):
    _handle_worker_shutdown(app, _LOGGER, **kwargs)

@worker_ready.connect
def handle_worker_ready(**kwargs):
    _handle_worker_ready(app, _LOGGER, **kwargs)
There are a few other very useful signals you should look at, but these two are essential. Maybe worker_shutting_down is more suitable for your use case...
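For the original use case, a hedged sketch of a worker_shutting_down handler; mark_worker_as_draining is a hypothetical helper that would record the draining node somewhere the Watchdog already looks (for example, the database that holds the task records):

import socket

from celery.signals import worker_shutting_down

@worker_shutting_down.connect
def handle_worker_shutting_down(sig=None, how=None, exitcode=None, **kwargs):
    # "how" is "Warm" for a graceful (warm) shutdown.
    # mark_worker_as_draining is hypothetical: it could flag a row in the same
    # database the Watchdog inspects, so a draining worker is treated as
    # "still finishing its task" rather than "vanished".
    mark_worker_as_draining(host=socket.gethostname(), how=how)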

Related

Using dask-jobqueue with SGECluster.adapt() and submitting tasks before any worker is active

I have set up an SGECluster scheduler with the correct settings and confirmed I can connect to the dashboard and submit jobs to my SGE queue. I would like to use the adapt method to scale the number of workers depending on the incoming task load. These tasks are generally not related, so they can be run by individual workers in their own processes.
I've noticed that the scheduler does not appear to register tasks (at least in the dashboard) until a worker is available. If that first worker takes some time to become available and I submit tasks to the scheduler, it will not know that it needs to scale and therefore the extra workers will end up at the back of the queue. Is it possible to prompt the scheduler to recognize that tasks have arrived before the first worker has connected to the scheduler, and to put in queue requests for workers appropriately?
I can get the workers to queue if I use scale(n) instead of adapt.
from dask_jobqueue import SGECluster

cluster = SGECluster(
    queue=queue_name,
    memory=maximum_memory,
    processes=worker_processes,
    env_extra=env_list,
    scheduler_options=scheduler_options,
    log_directory=log_dir,
    job_name=name,
    walltime=walltime,
    resource_spec=f"{mem_spec}={maximum_memory}",
    job_extra=job_extra_list,
)

# if the first worker takes ages to begin running, then only one worker will be requested
# and tasks submitted in the interim do not adjust the scheduler behaviour
# cluster.adapt(minimum=1, maximum=20)

# queues up the requested workers straight away but doesn't adapt to load
cluster.scale(20)
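For completeness, a hedged sketch of how adapt() is usually paired with a Client, reusing the cluster built above; do_work is a placeholder for the real function and the bounds mirror the commented-out line:

from dask.distributed import Client

def do_work(x):
    # placeholder for the actual CPU-bound task
    return x * x

client = Client(cluster)               # connect to the SGECluster defined above
cluster.adapt(minimum=1, maximum=20)   # same bounds as the commented-out line

futures = client.map(do_work, range(100))
results = client.gather(futures)       # blocks until the workers have run everything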

Celery timeout in Django

There are altogether eight tasks running in Celery at different periods. All of them are event-driven tasks: after a certain event, they get fired, and the particular task works continuously until certain conditions are satisfied.
I have registered a task which checks certain conditions for almost two minutes. It works fine most of the time, but sometimes the expected behavior of the task is not achieved.
The signature of the task is as below:
tasks.py
import time
from celery import shared_task
@shared_task()
def some_celery_task(a, b):
    main_time_end = time.time() + 120
    while time.time() < main_time_end:
        ...
        # some db operations here with the given function arguments 'a' and 'b'
        # this part of the task gets executed most of the time
    if time.time() > main_time_end:
        ...
        # some db operations here.
        # this is the part of the task that sometimes doesn't get executed
views.py
# the rest of the view is not shown here
# only the part where the task is invoked
some_celery_task.apply_async(args=(5, 9), countdown=0)
I am confused about the Celery task timeout scenarios. Does a timeout mean the task will stop where it timed out, or will it retry automatically?
It would be a great help if you could share a clear picture of how timeouts and retries work.
What could be the reason behind the scenario explained above? Any help on this question will be highly appreciated. Thank you.
Check the Celery documentation on Tasks - the basics are documented very well.
If a task fails or is terminated, it will have the states.FAILURE status. It will not be retried unless you specifically code it to be. If logging is correctly configured, you might see exception messages in the logs in case of timeouts or other exceptions.
When the Celery task TIME_LIMIT is exceeded, the task is terminated right away:
The worker processing the task will be killed and replaced with a new one.
Also, a TimeLimitExceeded exception will be raised with a message like Task handler raised error: "TimeLimitExceeded(2700)"
If the Celery SOFT_TIME_LIMIT is set, is smaller than TIME_LIMIT, and is exceeded, then a SoftTimeLimitExceeded exception is raised, allowing it to be caught in the task so you can perform clean-up actions.
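For instance, a minimal sketch of catching the soft limit inside a task; the limit values and the clean-up step are illustrative, not taken from the question:

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=120, time_limit=150)  # illustrative limits
def some_celery_task(a, b):
    try:
        ...  # long-running db operations
    except SoftTimeLimitExceeded:
        # The soft limit fired: clean up / mark the record before the hard
        # time_limit kills the worker process.
        ...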
When a worker consumes a message (task) from the broker queue, the broker needs to know that the message was consumed successfully. To confirm successful consumption, the worker acknowledges (ACKs) the message to the broker. Until a message is acknowledged it is not deleted from the broker, but it is also not available for consumption ("invisible"). If not acknowledged, the message will be re-delivered to the broker queue and become available for consumption again.
The logic for redelivering unacknowledged messages depends on the broker:
An AMQP (RabbitMQ) broker tracks the connection status with the worker, and if the connection is lost it returns the message to the queue.
A Redis or SQS broker has its own timeout after which the message is re-delivered to the queue if not ACKed.
By default, the Celery worker acknowledges a message right at the start of the task.
If ACKS_LATE is set, the worker acknowledges to the broker only after successfully executing the task.
One can RETRY a task by catching an exception in the task and sending the same task back to the broker for re-execution; the same task with the same id will then be queued at the broker. The countdown option lets you specify a delay before the task is retried.
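For example, a hedged sketch of a retry with a countdown; the exception type, max_retries, and delay are illustrative:

from celery import shared_task
from django.db import OperationalError

@shared_task(bind=True, max_retries=3)
def some_celery_task(self, a, b):
    try:
        ...  # db operations
    except OperationalError as exc:
        # Re-queue the same task (same id) and run it again after 10 seconds.
        raise self.retry(exc=exc, countdown=10)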
Celery task execution and other options can be set globally in settings.py or per task as arguments.
The recommended way is to design tasks and logic with such events in mind, treat them as normal (though not actually expected) occurrences, and be ready for them:
tasks may fail (the next run of the same task may do the work for both, or check that the specific work was not done and re-fire the task)
the same task may run again (idempotency)
similar tasks can run simultaneously (locking; see the sketch after this list)
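A minimal sketch of the locking and idempotency ideas using Django's cache; the key name, timeout, and the assumption of a Memcached/Redis cache backend (where cache.add is atomic) are all illustrative:

from celery import shared_task
from django.core.cache import cache

@shared_task
def some_celery_task(a, b):
    lock_key = f"some_celery_task:{a}:{b}"
    # cache.add only succeeds if the key does not exist yet, so a concurrent
    # duplicate of the same task simply returns instead of repeating the work.
    if not cache.add(lock_key, "locked", timeout=180):
        return
    try:
        ...  # the actual work, written so re-running it is safe (idempotent)
    finally:
        cache.delete(lock_key)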

Celery: why is there a multi-second time gap from when a task is accepted and when it starts to execute?

I have a celery task:
@app.task(bind=True, soft_time_limit=FreeSWITCHConstants.EXECUTE_ATTEMPTS_LOCAL_SOFT_TIME_LIMIT)
def execute_attempt_local(self, attempt_id, provider_id, **kwargs):
    print "PERF - entering execute_attempt_local"
    ...
that is processed by a (remote) worker with the following config:
celery -A mycompany.web.taskapp worker n -Q execute_attempts-2 --autoscale=4,60
This task gets spawned thousands at a time and has historically completed in 1-3s (it's a mostly I/O bound task).
Recently, as our app's overall usage has increased, this task's completion time has increased to 5-8s on average, and I'm trying to understand what's taking up the extra time. I noticed that for many tasks taking 5-8 seconds, ~4s elapses between the task being accepted and the first line of the task executing:
[2019-09-24 13:15:16,627: DEBUG/MainProcess] Task accepted: mycompany.ivr.freeswitch.tasks.execute_attempt_local[d7585570-e0c9-4bbf-b3b1-63c8c5cd88cc] pid:7086
...
[2019-09-24 13:15:22,180: WARNING/ForkPoolWorker-60] PERF - entering execute_attempt_local
What is happening in that 4s? I'm assuming I have a Celery config issue and somewhere there is a lack of resources for these tasks to process quicker. Any ideas what could be slowing them down?
There are several possible reasons why this is happening. It may take some time for the autoscaler to kick in, so depending on your load you may not have enough worker processes to run your tasks when they are sent; they then wait in the queue for some time (it can even be minutes or hours) until worker processes become available.
You can easily monitor this by looking at how many tasks are waiting in the queue. If the queue is always empty, that means your tasks are executed immediately. If not, you may want to add new workers to your cluster.
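A hedged sketch of checking that backlog, assuming a Redis broker and the queue name from the worker command above:

import redis

# Adjust host/port/db to match the broker URL your workers use.
r = redis.Redis(host="localhost", port=6379, db=0)

# With the Redis broker a Celery queue is a plain list, so LLEN is the number
# of messages still waiting to be picked up by a worker.
backlog = r.llen("execute_attempts-2")
print("tasks waiting in queue:", backlog)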

Python distributed tasks with multiple queues

So the project I am working on requires a distributed task system to process CPU-intensive tasks. This is relatively straightforward: spin up Celery, throw all the tasks in a queue, and have Celery do the rest.
The issue I have is that every user needs their own queue, and items within each user's queue must be processed synchronously. So if there is a task in a user's queue already processing, wait until it is finished before allowing a worker to pick up the next.
The closest I've come to something like this is having a fixed set of queues, and assigning them to users. Then having the users tasks picked off by celery workers fixed to a certain queue with a concurrency of 1.
The problem with this system is that I can't scale my workers to process a backlog of user tasks.
Is there a way I can configure celery to do what I want, or perhaps another task system exists that does what I want?
Edit:
Currently I use the following command to spawn my celery workers with a concurrency of one on a fixed set of queues
celery multi start 4 -A app.celery -Q:1 queue_1 -Q:2 queue_2 -Q:3 queue_3 -Q:4 queue_4 --logfile=celery.log --concurrency=1
I then store a queue name on the user object, and when the user starts a process I queue a task to the queue stored on the user object. This gives me my synchronous tasks.
The downside is that when multiple users share queues, tasks build up and never get processed.
I'd like to have say 5 workers, and a queue per user object. Then have the workers just hop over the queues, but never have more than 1 worker on a single queue at a time.
I use chain (see the Celery docs) to execute tasks in a specific order:
chain = task1_task.si(account_pk) | task2_task.si(account_pk) | task3_task.si(account_pk)
chain()
So, for a specific user I execute task1; when it is finished I execute task2, and when that finishes I execute task3.
It will run on any available worker :)
For stopping a chain midway:
self.request.callbacks = None
return
And don't forget to bind your task:
@app.task(bind=True)
def task2_task(self, account_pk):
    ...
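Putting those pieces together, a hedged sketch of what this could look like; task1_task and task3_task are assumed to be defined the same way, and should_stop is a hypothetical condition:

from celery import Celery

app = Celery("app")  # broker/backend configuration omitted

def should_stop(account_pk):
    # Hypothetical check deciding whether the rest of the chain should run.
    return False

@app.task(bind=True)
def task2_task(self, account_pk):
    if should_stop(account_pk):
        # Dropping the callbacks stops the remaining tasks in the chain.
        # (In some Celery versions the remaining chain lives on
        # self.request.chain instead of self.request.callbacks.)
        self.request.callbacks = None
        return
    ...  # normal work for task2

def run_workflow(account_pk):
    # task1 -> task2 -> task3 for one user, in order, on any free worker.
    flow = task1_task.si(account_pk) | task2_task.si(account_pk) | task3_task.si(account_pk)
    flow.apply_async()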

Some confusions regarding celery in python

I have divided Celery into the following parts:
Celery
Celery worker
Celery daemon
Broker: RabbitMQ or SQS
Queue
Result backend
Celery monitor (Flower)
My Understanding
When I call a Celery task in Django, e.g. tasks.add(1,2), Celery adds that task to the queue. I am confused whether that is 4 or 5 in the above list.
When the task goes to the queue, the worker gets that task and deletes it from the queue.
The result of that task is saved in the result backend.
My Confusions
What's the difference between the Celery daemon and the Celery worker?
Is RabbitMQ doing the work of the queue? Does that mean tasks get saved in RabbitMQ or SQS?
What does Flower do? Does it monitor workers, tasks, queues, or results?
First, just to explain briefly how it works. You have a Celery client running in your code. You call tasks.add(1,2) and a new Celery task is created. That task is transferred by the broker to the queue. Yes, the queue is persisted in RabbitMQ or SQS. The Celery daemon is always running and listening for new tasks. When there is a new task in the queue, it starts a new Celery worker to perform the work.
To answer your questions:
The Celery daemon is always running, and it starts Celery workers.
Yes, RabbitMQ or SQS is doing the work of a queue.
With the Celery monitor (Flower) you can see how many tasks are running, how many are completed, the size of the queue, etc. (A minimal sketch of the overall flow follows below.)
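To make that flow concrete, a minimal sketch; the module layout, broker URL, and result backend are assumptions, and add mirrors the tasks.add(1,2) example from the question:

# tasks.py
from celery import Celery

# The broker (where the queue lives) and the result backend are assumptions.
app = Celery("tasks", broker="amqp://localhost", backend="rpc://")

@app.task
def add(x, y):
    return x + y

# In Django (or any client code): .delay() sends a message to the broker queue.
# A worker started with `celery -A tasks worker` picks it up, runs add(1, 2),
# and writes the return value to the result backend.
result = add.delay(1, 2)
print(result.get(timeout=10))  # -> 3, fetched from the result backend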
I think the answer from nstoitsev has good intentions but creates some confusion.
So let's try to clarify a bit.
A Celery worker is the Celery process responsible for executing the tasks; when configured to run in the background it is often called a Celery daemon. So you can consider the two the same thing.
To clarify the confusion in nstoitsev's answer: each worker can have a concurrency parameter that can be bigger than 1. When this is the case, each Celery worker can create N child processes (up to the concurrency parameter) to execute tasks in parallel; these are often also called workers.
The broker holds queues and exchanges. This means that a Celery worker is able to connect to the broker using a protocol called AMQP and publish or consume messages.
Flower is able to monitor a Celery cluster using the broker itself. Basically, it is capable of receiving events from all the workers. Flower also works if you have the result backend disabled, which, by the way, is the default behavior (see the Celery result backend documentation).
Hope this helps.
