I have no experience with Celery, so I'm looking into whether my use case is solvable with it.
The client will submit a job to Celery, and this job will be executed as a Celery task. The client then has to send a keepalive every 30 seconds to keep the job active. Once the job is no longer refreshed by a keepalive message, it will be cancelled.
I can think of two solutions:
1. Each job-task created will have a hard time limit of 30 seconds. When the client sends a keepalive, the router will send a message to the relevant worker to reset the hard time limit.
2. Each job-task will have no time limit. For each job-task, a separate watchdog task will be launched with a delay of 30 seconds. If a new keepalive arrives from the client, the watchdog task is cancelled and recreated, again with a delay of 30 seconds. If the watchdog actually executes, it kills the job-task, eliminating it from the system.
Solution 1 is much simpler, but I'm not sure how to reset the task time limit. Solution 2 seems more correct, but I'm afraid there will be various race conditions. The watchdog tasks should probably run in a separate queue reserved for watchdogs only.
Is something like this possible, either with one of my solutions or some other approach?
In my understanding, you just need a middleware (one part to receive the keepalives, one to control the task); celery.app.control.Control.terminate will help here:
Get the task_id of the task when you call apply_async (destination or all workers).
Listen for the keepalive from the client.
If the time limit passes without a keepalive, terminate the task with app.control.terminate(task_id, reply=True), as sketched below.
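A minimal sketch of that approach, assuming a Redis broker and an in-memory keepalive registry for brevity (the names record_keepalive, sweep_stale_jobs, and last_seen are illustrative, not part of Celery); sweep_stale_jobs would be run periodically, e.g. from a beat task:

    import time

    from celery import Celery

    app = Celery('jobs', broker='redis://localhost:6379/0')  # assumed broker URL

    # task_id -> timestamp of the last keepalive; in practice this would live in
    # Redis or a database shared with the web layer, not in process memory.
    last_seen = {}

    def record_keepalive(task_id):
        """Called by the web layer whenever the client pings for a given job."""
        last_seen[task_id] = time.time()

    def sweep_stale_jobs(timeout=30):
        """Terminate every job whose last keepalive is older than `timeout` seconds."""
        now = time.time()
        for task_id, seen in list(last_seen.items()):
            if now - seen > timeout:
                # Revokes the task and signals the worker process if it is
                # already running it.
                app.control.terminate(task_id, signal='SIGTERM')
                del last_seen[task_id]

Note that terminate both revokes the task (so it won't start later) and signals the worker process if the task is already executing.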
I am using Celery to parallelize the execution of a Python function calling a third-party API.
This API requires waiting at least 3 seconds between calls.
Is there a way to configure the message broker (RabbitMQ or Redis) to respect this delay between worker calls?
In Celery, you can use the countdown option of apply_async. See https://docs.celeryproject.org/en/stable/userguide/calling.html#eta-and-countdown
RabbitMQ has a few different options for supporting delayed messages including the delayed message exchange plugin and dead lettering. Unfortunately, Celery doesn't support either of these methods. What it does instead is delay execution of the message until the countdown time is reached. The message itself is immediately sent to the worker.
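For illustration, a minimal sketch of the countdown approach; the task name call_third_party_api and the broker URL are assumptions:

    from celery import Celery

    app = Celery('api_calls', broker='amqp://localhost')  # assumed broker URL

    @app.task
    def call_third_party_api(payload):
        ...  # the actual third-party API call goes here

    def enqueue_batch(payloads):
        # Give each task an ETA 3 seconds after the previous one. As noted above,
        # the messages are delivered to the worker immediately; the worker simply
        # waits until each countdown has elapsed before executing the task.
        for i, payload in enumerate(payloads):
            call_third_party_api.apply_async(args=(payload,), countdown=3 * i)

Keep in mind this only staggers the tasks within one batch; with several workers or overlapping batches it does not guarantee a global 3-second spacing.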
I would check out https://docs.celeryproject.org/en/stable/userguide/tasks.html#retrying
Specify a retry delay of 3 seconds, and set a limit on the number of retries as part of the task definition, as shown in the docs.
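A hedged sketch of that retry pattern; RateLimitError and call_api are placeholders for whatever exception the third-party client actually raises and whatever function performs the call:

    from celery import shared_task

    class RateLimitError(Exception):
        """Placeholder for the exception the third-party client raises."""

    def call_api(payload):
        """Stand-in for the real third-party API call."""
        ...

    @shared_task(bind=True, max_retries=5, default_retry_delay=3)
    def call_api_task(self, payload):
        try:
            return call_api(payload)
        except RateLimitError as exc:
            # Re-queue this task to run again 3 seconds from now.
            raise self.retry(exc=exc, countdown=3)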
There are altogether eight tasks running in Celery at different periods. All of them are event-driven: after a certain event, they get fired, and the particular task works continuously until certain conditions are satisfied.
I have registered a task that checks for certain conditions for almost two minutes. It works fine most of the time, but sometimes the expected behavior of the task is not attained.
The signature of the task is as below:
tasks.py
import time
from celery import shared_task

@shared_task()
def some_celery_task(a, b):
    main_time_end = time.time() + 120
    while time.time() < main_time_end:
        ...
        # some db operations here with the given function arguments 'a' and 'b'
        # this part of the task gets executed most of the time
    if time.time() > main_time_end:
        ...
        # some db operations here
        # this is the part of the task that sometimes doesn't get executed
views.py
# the other parts of the view are not shown here;
# only the part that invokes the task
some_celery_task.apply_async(args=(5, 9), countdown=0)
I am confused about the Celery task timeout scenarios. Does the task stop where it timed out, or does it retry automatically?
It would be a great help if you could give me a clear idea about timeouts and retries.
What could be the reason behind the scenario explained above? Any help will be highly appreciated. Thank you.
Check the Celery documentation on Tasks - the basics are documented very well.
If a task fails or is terminated, it will have the states.FAILURE status. It will not be retried unless retrying is specifically coded. If logging is correctly configured, you might see exception messages in the logs in case of timeouts or other code exceptions.
When the Celery task TIME_LIMIT is exceeded, the task is terminated right away:
The worker processing the task will be killed and replaced with a new one.
Also, a TimeLimitExceeded exception will be raised, with a message like Task handler raised error: "TimeLimitExceeded(2700)".
If the Celery SOFT_TIME_LIMIT is set, is smaller than TIME_LIMIT, and is exceeded, then a SoftTimeLimitExceeded exception will be raised; it can be caught in the task to perform clean-up actions.
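For example, a hedged sketch with both limits set per task (the limit values and the do_work/clean_up helpers are illustrative):

    from celery import shared_task
    from celery.exceptions import SoftTimeLimitExceeded

    def do_work(a, b):
        ...  # stand-in for the real task body

    def clean_up(a, b):
        ...  # stand-in for whatever clean-up is needed

    @shared_task(soft_time_limit=110, time_limit=120)
    def guarded_task(a, b):
        try:
            do_work(a, b)
        except SoftTimeLimitExceeded:
            # Raised at 110 s, leaving 10 s before the hard limit kills the process.
            clean_up(a, b)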
When a worker consumes a message (task) from the broker queue, the broker needs to know that the message was consumed successfully. To confirm successful consumption, the worker sends an acknowledgement (ACK) to the broker. Until the message is acknowledged, it is not deleted from the broker, but it is also not available for consumption ("invisible"). If it is never acknowledged, the message will be re-delivered to the broker queue and become available for consumption again.
The logic for redelivering unacknowledged messages depends on the broker:
The AMQP (RabbitMQ) broker tracks the connection status with the worker; if the connection is lost, it returns the message back to the queue.
The Redis or SQS brokers have their own timeout after which a message is re-delivered to the broker queue if not ACKed (sketched below for Redis).
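For the Redis broker, that timeout is the visibility_timeout transport option; a minimal sketch (the one-hour value is illustrative and should exceed your longest expected task runtime):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')  # assumed broker URL

    # Unacknowledged messages are re-delivered after this many seconds.
    app.conf.broker_transport_options = {'visibility_timeout': 3600}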
By default, the Celery worker acknowledges the message right at the start of the task.
If ACKS_LATE is set, the worker acknowledges to the broker only after successfully executing the task.
One can RETRY a task by catching an exception in the task and sending the same task back to the broker for re-execution; this same task, with the same id, will then be queued at the broker. The countdown option allows specifying a delay before the task is retried.
Celery Task Execution and other Options can be set globally in settings.py or per task as arguments.
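A hedged sketch of both places, assuming a Django project where the Celery app is configured with config_from_object('django.conf:settings', namespace='CELERY'); the values themselves are illustrative:

    # Globally, in settings.py (the CELERY_ prefix comes from the assumed namespace):
    CELERY_TASK_ACKS_LATE = True
    CELERY_TASK_TIME_LIMIT = 2700
    CELERY_TASK_SOFT_TIME_LIMIT = 2600

    # Or per task, as decorator arguments:
    from celery import shared_task

    @shared_task(acks_late=True, time_limit=2700, soft_time_limit=2600)
    def some_celery_task(a, b):
        ...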
The recommended way is to design tasks and logic with such events in mind, treat them as legitimate things that can happen at some point (even if not actually expected), and be ready:
tasks may fail (the next run of the same task may do the work for both, or check that specific work was not done and re-fire the task)
the same task may run again (idempotency)
similar tasks can run simultaneously (locking; see the sketch below)
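As a hedged sketch of the locking point, using a Redis key as a mutex (Redis, the key scheme, and the do_work helper are assumptions; Celery itself provides no built-in lock):

    import redis
    from celery import shared_task

    redis_client = redis.Redis()  # assumed local Redis instance

    def do_work(resource_id):
        ...  # stand-in for the real task body

    @shared_task(bind=True)
    def exclusive_task(self, resource_id):
        # The lock expires after 300 s so a crashed worker cannot hold it forever.
        lock = redis_client.lock('lock:exclusive_task:%s' % resource_id, timeout=300)
        if not lock.acquire(blocking=False):
            # Another copy of this task is already working on this resource.
            return
        try:
            do_work(resource_id)
        finally:
            lock.release()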
I have a Django application (API) running in production, served by uWSGI with 8 processes (workers). To monitor them I use uwsgitop. Every day, from time to time, one worker falls into the BUSY state, stays there for around five minutes, consumes all of the memory, and kills the whole instance. The problem is that I do not know how to debug what the worker is doing at that particular moment or what function it is executing. Is there a fast and proper way to find out the function and the request that it is handling?
One can send the SIGUSR2 signal to a uWSGI worker; the current request is then printed into the log file, along with a native (sadly not Python) backtrace.
Using RabbitMQ and pika (Python), I am running a job queuing system that feeds nodes (asynchronous consumers) with tasks. Each message that defines a task is only acknowledged once that task is completed.
Sometimes I need to perform updates on these nodes, so I have created an exit mode, in which the node waits for its tasks to finish and then exits gracefully. I can then perform my maintenance work.
So that a node does not get more messages from RabbitMQ while in this exit mode, I let it call the basic_cancel method before waiting for the jobs to finish.
The effect of this method is described in the pika documentation:
This method cancels a consumer. This does not affect already
delivered messages, but it does mean the server will not send any more
messages for that consumer. The client may receive an arbitrary number
of messages in between sending the cancel method and receiving the
cancel-ok reply. It may also be sent from the server to the client in
the event of the consumer being unexpectedly cancelled (i.e. cancelled
for any reason other than the server receiving the corresponding
basic.cancel from the client). This allows clients to be notified of
the loss of consumers due to events such as queue deletion.
So if you read "already delivered messages" as messages already received but not necessarily acknowledged, the tasks that the exit mode waits for should not be requeued, even if the consumer node running them cancels itself out of the queuing system.
My code for the stop function of my async consumer class (taken from the pika example) is similar to this:
def stop(self):
    """Cleanly shutdown the connection to RabbitMQ by stopping the consumer
    with RabbitMQ. When RabbitMQ confirms the cancellation, on_cancelok
    will be invoked by pika, which will then close the channel and
    connection. The IOLoop is started again because this method is invoked
    when CTRL-C is pressed raising a KeyboardInterrupt exception. This
    exception stops the IOLoop which needs to be running for pika to
    communicate with RabbitMQ. All of the commands issued prior to starting
    the IOLoop will be buffered but not processed.
    """
    LOGGER.info('Stopping')
    self._closing = True
    self.stop_consuming()
    LOGGER.info('Waiting for all running jobs to complete')
    for index, thread in enumerate(self.threads):
        if thread.is_alive():
            thread.join()
            # also tried with a while loop that waits 10s as long as the
            # thread is still alive
        LOGGER.info('Thread {} has finished'.format(index))
    # also tried moving the call to stop consuming up to this point
    if self._connection is not None:
        self._connection.ioloop.start()
    LOGGER.info('Closing connection')
    self.close_connection()
My issue is that after the consumer cancellation, the async consumer appears to no longer be sending heartbeats, even if I perform the cancellation after the loop where I wait for my tasks (threads) to finish.
I have read about a process_data_events function for BlockingConnection, but I could not find such a function. Is the ioloop of the SelectConnection class the equivalent for an async consumer?
As my node in exit mode no longer sends heartbeats, the tasks it is currently performing will be requeued by RabbitMQ once the heartbeat timeout is reached. I would like to keep this heartbeat setting untouched, as it is not an issue when I am not in exit mode (my heartbeat here is about 100 s, and my tasks might take as much as 2 hours to complete).
Looking at the RabbitMQ logs, the heartbeat is indeed the reason:
=ERROR REPORT==== 12-Apr-2017::19:24:23 ===
closing AMQP connection (.....) :
missed heartbeats from client, timeout: 100s
The only workaround I can think of is acknowledging the messages corresponding to the tasks still running when in exit mode, and hoping that these tasks will not fail...
Is there any method of the channel or connection that I can use to send some heartbeats manually while waiting?
Could the issue be that the time.sleep() or thread.join() methods (from the Python threading package) are completely blocking and do not allow other threads to do what they need? I use them in other applications and they don't seem to behave that way.
As this issue only appears in exit mode, I guess something in the stop function causes the consumer to stop sending heartbeats, but since I have also tried (without any success) calling the stop_consuming method only after the wait-on-running-tasks loop, I don't see what the root of this issue could be.
Thanks a lot for your help!
It turns out the stop_consuming function was calling basic_cancel asynchronously with a callback to the channel.close() function, which caused my application to stop its RabbitMQ interaction and RabbitMQ to requeue the unacked messages. I actually realized this because the threads trying to later acknowledge the remaining tasks were raising an error: the channel was now set to None and thus no longer had an ack method.
Hope it helps someone!
I want to write a consumer with a SelectConnection.
We have several devices in our network infrastructure that close connections after a certain time, therefore I want to use the heartbeat functionality.
As far as I know, the IOLoop runs on the main thread, so heartbeat frames cannot be processed while this thread is processing a message.
My idea is to create several worker threads that process the messages so that the main thread can handle the IOLoop. The processing of a message takes a lot of resources, so only a certain number of messages should be processed at once. Instead of storing the remaining messages on the client side, I would like to leave them in the queue.
Is there a way to interrupt the consumption of messages, without interrupting the heartbeat?
I am not an expert on SelectConnection for pika, but you could implement this by setting the consumer prefetch (QoS) to the desired number of processes.
This basically means that once a message comes in, you offload it to a process or thread, and once the message has been processed, you acknowledge it.
As an example, if you set the QoS to 10, the client will pull at most 10 messages and won't pull any new ones until at least one of those has been acknowledged.
The important part here is that you would need to acknowledge messages only once you are finished processing them.
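A minimal sketch of that, using a BlockingConnection for brevity (the same basic_qos call applies to a channel obtained from a SelectConnection); the queue name, prefetch value, and process helper are illustrative:

    import pika

    def process(body):
        ...  # stand-in for the actual heavy processing of the message

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='task_queue', durable=True)

    # At most 10 unacknowledged messages are delivered to this consumer at a time;
    # the rest stay in the queue until one of them is acked.
    channel.basic_qos(prefetch_count=10)

    def on_message(ch, method, properties, body):
        # In this simplified sketch the processing runs inline in the callback;
        # the ack is sent only after it has finished, so RabbitMQ can release
        # the next message from the queue.
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue='task_queue', on_message_callback=on_message)
    channel.start_consuming()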