Preventing duplicate tasks when scheduling with Celery Beat - python

I have a task that I run periodically (each minute) via Celery Beat. On occasion, the task will take longer than a minute to finish its execution, which results in the scheduler adding the task to the queue while it is already running.
Is there a way I can avoid the scheduler adding tasks to the queue if those tasks are already running?
Edit: I have seen Celery Beat: Limit to single task instance at a time
Note that my question is different. I'm asking how to avoid my task being enqueued, while that question is asking how to avoid the task being run multiple times.

I haven't had this particular problem, but a similar one where I had to avoid tasks being applied when a task of the same kind was already running or queued, though without Celery Beat. I went down a similar route as the answer you've linked, with a locking mechanism. Unfortunately it won't be that easy here, since you want to avoid the enqueueing itself.
As far as I know Celery doesn't support anything like this out of the box. Your best bet is probably to write a custom scheduler which inherits from Scheduler and overrides the apply_entry or apply_async method. In there you'd need a locking mechanism to check whether the task is already running, i.e. set and release a lock inside the task, and check for that lock in apply_entry/apply_async. You could use RedLock if you already have Redis running.
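A rough sketch of that idea, assuming Redis is available and that the task itself sets a lock key (with an expiry) when it starts and deletes it when it finishes; the class name and key scheme are illustrative, not part of Celery:
import redis
from celery.beat import Scheduler

class LockAwareScheduler(Scheduler):
    """Beat scheduler that skips an entry while its task still holds a lock."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redis = redis.Redis()  # adjust connection settings as needed

    def apply_entry(self, entry, producer=None):
        # The task is expected to SET "lock:<task name>" with a TTL on start
        # and DELETE it on completion; the scheduler only checks for the key.
        if self.redis.exists(f"lock:{entry.task}"):
            return  # previous run still in progress, skip this tick
        super().apply_entry(entry, producer=producer)
You would then point beat at it, e.g. celery beat -S myapp.scheduler.LockAwareScheduler.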

Related

How to check if celery task is already running before running it again with beat?

I have a periodic task scheduled to run every 10 minutes. Sometimes this task completes in 2-3 minutes, sometimes it takes 20 minutes.
Is there any way, using Celery Beat, to not launch the task if the previous run hasn't completed yet? I don't see an option for it in the interval settings.
No, Celery Beat knows nothing about the running tasks.
One way to achieve what you are trying to do is to link the task to itself. apply_async(), for example, has the optional parameters link and link_error, which take a signature (it can be a single task too) to run if the task finishes successfully (link) or unsuccessfully (link_error).
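As a hedged illustration (the task body and the 10-minute delay are placeholders):
from celery import shared_task

@shared_task
def mytask():
    ...  # the actual work goes here

# Enqueue once; each successful run then schedules the next one.
# .si() builds an immutable signature, .set(countdown=600) delays it ~10 minutes.
mytask.apply_async(link=mytask.si().set(countdown=600))
Note that with link the chain stops if a run fails; a link_error callback would be needed to reschedule after failures as well.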
What I use is the following - I schedule the task to run frequently (say every 5 minutes), and I use a distributed lock to make sure I only ever have one instance of the task running.
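A minimal sketch of that pattern, assuming Redis; the key name, timeout and task body are only examples:
import redis
from celery import shared_task

r = redis.Redis()

@shared_task
def periodic_job():
    # nx=True: only succeeds if the key does not exist yet;
    # ex=...: expire the lock so a crashed worker cannot hold it forever.
    if not r.set("lock:periodic_job", "1", nx=True, ex=15 * 60):
        return  # another instance is still running, skip this tick
    try:
        do_the_actual_work()  # placeholder for the real work
    finally:
        r.delete("lock:periodic_job")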
Finally a reminder - you can always implement your own scheduler, and use it in your beat configuration. I was thinking about doing this in the past for exactly the same thing you want, but decided that the solution I already have is good enough for me.
You can try this
It provides you with a singleton base class for your tasks.
I use Celery with Django models and I implemented a boolean has_task_running at the model level. Then, with Celery signals, I flip the flag to True when the before_task_publish signal is triggered and back to False when the task terminates. Not simple but flexible.
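A rough sketch of those signal handlers; JobStatus and its fields are a hypothetical Django model standing in for whatever model carries the flag:
from celery.signals import before_task_publish, task_postrun

from myapp.models import JobStatus  # hypothetical model with a has_task_running BooleanField

@before_task_publish.connect
def mark_running(sender=None, **kwargs):
    # sender is the task name (a string) at publish time
    JobStatus.objects.filter(task_name=sender).update(has_task_running=True)

@task_postrun.connect
def mark_done(sender=None, **kwargs):
    # sender is the task instance here; task_postrun fires on success and failure
    JobStatus.objects.filter(task_name=sender.name).update(has_task_running=False)
The code that publishes the task would then check has_task_running before calling delay()/apply_async().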

What happens when Celery task code is changed before a prefetched task executes?

Does Celery pick up changes to the task code even if the task has already been prefetched under the old code?
No, you must reload the workers.

celery and long running tasks

I just watched a YouTube video where the presenter mentioned that one should design Celery tasks to be short. Tasks running several minutes are bad.
Is this correct? What I do see is that I have some long running tasks which take, say, 10 minutes to finish. When these kinds of tasks are scheduled frequently, the queue is swamped and no other tasks get scheduled. Is this the reason?
If so, what should be used for long running tasks?
Long running tasks aren't great, but it's by no means appropriate to say they are bad. The best way to handle long running tasks is to create a queue for just those tasks and have them run on a separate worker from the short tasks.
The problem with long running tasks is that you have to wait for them when you're pushing a new software version on your server. If you don't wait, your task may run possibly incompatible code, especially if you pickled some complex object as a parameter (which is strongly discouraged).
As @user2097159 said, it's good practice to keep the long running tasks in a dedicated queue. You should do that by routing with "settings.CELERY_ROUTES"; more info here
If you can estimate how long a task should run, I recommend using soft_time_limit per task, so you are able to handle the timeout yourself.
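Putting both suggestions together, a sketch of what the routing plus a per-task soft limit could look like (queue and task names are placeholders):
# settings.py
CELERY_ROUTES = {
    "tasks.long_running": {"queue": "long"},
}

# tasks.py
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=600)  # raise SoftTimeLimitExceeded after ~10 minutes
def long_running():
    try:
        crunch_numbers()        # placeholder for the real work
    except SoftTimeLimitExceeded:
        save_partial_results()  # placeholder clean-up

# dedicated worker for that queue:
#   celery -A proj worker -Q long -c 2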
There is a gist from a talk I gave here
Augment the basic Task definition to optionally treat the task instantiation as a generator, and check for TERM or soft timeout on every iteration through the generator. Generically inject a "state" dict kwarg into tasks that support it. If it's the first time the task is run, allocate a new one in the results cache; otherwise, look up the existing one from the results cache.
In your task, figure out a good place to yield which results in short execution times. Update the state parameter as necessary.
When control returns to the master task class, check for TERM or soft timeout, and if there is one, save off the state object and respond to the signal.
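A much-simplified sketch of that idea, using Redis as the results cache and keying the state by task name for brevity; all names here are illustrative rather than taken from the gist:
import json

import redis
from celery import Task, shared_task
from celery.exceptions import SoftTimeLimitExceeded

store = redis.Redis()

class ResumableTask(Task):
    # Drives a generator-style task body and persists `state` on soft timeout.

    def __call__(self, *args, **kwargs):
        raw = store.get(f"state:{self.name}")
        state = json.loads(raw) if raw else {}  # resume any previous checkpoint
        gen = self.run(*args, state=state, **kwargs)
        try:
            for _ in gen:  # each yield marks a checkpoint
                pass
            store.delete(f"state:{self.name}")  # finished, discard the checkpoint
        except SoftTimeLimitExceeded:
            store.set(f"state:{self.name}", json.dumps(state))  # save progress
            raise

@shared_task(bind=True, base=ResumableTask, soft_time_limit=60)
def crunch(self, items, state=None):
    for index in range(state.get("index", 0), len(items)):
        process_item(items[index])  # placeholder for the real work
        state["index"] = index + 1
        yield  # safe point to resume from on the next run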

Is there any way to make sure certain tasks are not executed in parallel?

I'm writing a Celery task that will run some tests on the pull requests created in BitBucket.
My problem is that if a pull request is updated before my task finishes, it will trigger the task again, so I can end up having two tasks running tests on the same pull request at the same time.
Is there any way I can prevent this, and make sure that if a task processing a certain pull request is already in progress, the new task waits for it to finish before starting?
As I monitor multiple repos, each with multiple PRs, I would like an event coming from a different repo or a different pull request to start and run straight away.
I only need to queue it if a task for the same pull request from the same repo is already in progress.
Any idea if this is possible with celery?
The simplest way to achieve this is to set worker concurrency to 1 so that only one task gets executed at a time.
Route the tasks to a separate queue.
your_task.apply_async(args=(foo,), queue='bar')
Then start your worker with a concurrency of one:
celery worker -Q bar -c 1
See also Celery - one task in one second
You are looking for a mutex. For Celery, there are celery_mutex and celery_once. In particular, celery_once claims to do what you ask, but I do not have experience with it.
You could also use Python's multiprocessing, which has a global mutex implementation, or use a shared storage you already have.
If the tasks run on the same machine, the operating system has locking mechanisms.
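For example, on a single Unix machine an advisory file lock is enough (the lock file path and task body are placeholders):
import fcntl

from celery import shared_task

@shared_task
def run_tests_for_pr(repo, pr_id):
    lock_path = f"/tmp/pr-{repo}-{pr_id}.lock"
    with open(lock_path, "w") as fh:
        try:
            # non-blocking exclusive lock; raises if another process holds it
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return  # the same PR is already being tested on this machine
        do_the_tests(repo, pr_id)  # placeholder for the real test run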

Is there a distributed task queue in Python that enables me to kill hanging tasks that are not willing to cooperate?

Basically I have a lot of tasks (in batches of about 1000) and the execution times of these tasks can vary widely (from less than a second to 10 minutes). I know that if a task has been executing for more than a minute I can kill it. These tasks are steps in the optimization of some data mining model (but are independent of each other) and spend most of the time inside some C extension function, so they would not cooperate if I tried to kill them gracefully.
Is there a distributed task queue that fits that scheme --- AFAIK celery only allows aborting tasks that are willing to cooperate. But I might be wrong.
I recently asked a similar question about killing hanging functions in pure Python: Kill hanging function in Python in multithreaded environment.
I guess I could subclass the Celery task so it spawns a new process and then executes its payload, aborting execution if it takes too long, but then I would be killed by the overhead of initializing a new interpreter.
Celery supports time limiting. You can use time limits to kill long running tasks. Besides killing tasks, you can use soft limits, which let you handle the SoftTimeLimitExceeded exception inside the task and terminate it cleanly.
from celery.task import task
from celery.exceptions import SoftTimeLimitExceeded

@task(soft_time_limit=60)  # the soft limit can also come from configuration
def mytask():
    try:
        do_work()
    except SoftTimeLimitExceeded:
        clean_up_in_a_hurry()
Pistil allows managing multiple processes, including killing uncooperative ones.
But:
it's beta software, even though it powers gunicorn, which is reliable
I don't know how it handles 1000 processes
Communication between processes is not included yet, so you'll have to set up your own, using for example zeromq
Another possibility is to use the timer signal so it raises an exception after 36000 seconds. But signals are not triggered if something holds the GIL, which your C program might do.
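An illustration of that timer-signal idea, using only the standard library; as noted above, the handler will not fire while a C extension holds the GIL, and signals only work in the main thread:
import signal

class TaskTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise TaskTimeout("task exceeded its time budget")

signal.signal(signal.SIGALRM, _on_alarm)
signal.alarm(60)  # deliver SIGALRM (and raise TaskTimeout) after 60 seconds
try:
    run_optimisation_step()  # placeholder for the long-running work
finally:
    signal.alarm(0)  # always cancel a pending alarm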
When you revoke a celery task, you can provide it with an optional terminate=True keyword.
task.revoke(terminate=True)
It doesn't exactly fit your requirements since it's not done by the process itself, but you should be able to either extend the task class to be able to commit suicide, or have a recurring cleanup task or process killing off tasks that have not completed on time.
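One hedged sketch of such a recurring clean-up: remember when each task was sent, then periodically revoke anything that has been out for longer than the limit. The Redis bookkeeping and names are made up for illustration:
import time

import redis

from myapp.celery import app  # hypothetical Celery app instance

store = redis.Redis()

def dispatch_tracked(task, *args, **kwargs):
    result = task.apply_async(args=args, kwargs=kwargs)
    store.zadd("dispatched", {result.id: time.time()})  # remember the send time
    return result

@app.task
def kill_stuck_tasks(max_seconds=60):
    cutoff = time.time() - max_seconds
    for task_id in store.zrangebyscore("dispatched", 0, cutoff):
        # terminate=True signals (SIGTERM by default) the worker child process
        app.control.revoke(task_id.decode(), terminate=True)
        store.zrem("dispatched", task_id)
Tasks that finish normally could clear their own entry (for example from a task_postrun handler) so the clean-up never touches them.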
