I want to use django-celery-beat library to make some changes in my database periodically. I set task to run each 10 minutes. Everything working fine till my task takes less than 10 minutes, if it lasts longer next tasks starts while first one is doing calculations and it couses an error.
my tasks loks like that:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify
#shared_task()
def run_function():
build_block = BlockchainVerify()
return "Database updated"
is there a way to avoid starting the same task if previous wasn't done ?
There is definitely a way. It's locking.
There is whole page in the celery documentation - Ensuring a task is only executed one at a time.
Shortly explained - you can use some cache or even database to put lock in and then every time some task starts just check if this lock is still in use or has been already released.
Be aware of that the task may fail or run longer than expected. Task failure may be handled by adding some expiration to the lock. And set the lock expiration to be long enough just in case the task is still running.
There already is a good thread on SO - link.
Related
I am trying to figure out how to configure a periodic task in celery to be scheduled to run on load regardless of interval.
For example,
beat_schedule = {
'my-task': {
'task': 'module.my_task',
'schedule': 60.0,
},
}
will wait 60 seconds after the beat is started to run for the first time.
This is problematic for a longer interval, such as an hour, that can do work that is immediately valuable but is not needed "fresh" at shorter intervals.
This question addresses this issue but neither of the answers are satisfactory:
Adding startup lag for the task to be en-queued is both undesirable performance-wise and bad for maintainability since the initial run and schedule are now separated.
Re-implementing the schedule within the task is bad for maintainability.
This seems to me something that should be obvious, so I am quite surprised that that SO question is all I can find on the matter. I am unable to figure this out from the docs and the celery github issues so I wonder if I am missing something obvious.
Edit:
There seems to be more to the story here, because after trying a different task with an hour interval, it ran immediately as the project celery is started.
If I stop and clear the queue with celery purge -A proj -f then start celery again, the task does not run within the heartbeat interval. This would makes sense because the worker handles the messages but beat has its own schedule record celerybeat-schedule which would be unaffected by the purge.
If I delete celerybeat-schedule and restart beat the task still does not run. Starting celery beat with a non-default schedule db location also does not cause the task to run. The next time the task runs is one hour from the time I started the new beat (14:59) not one hour from the first start time of the task (13:47).
There seems to be some state that is not documented well or is unknown that is the basis of this issue. My question can also be stated as: how do you force beat to clear its record of last runs?.
I am also concerned that while running the worker and beat, running celery -A proj inspect scheduled gives - empty - but presumably the task had to be scheduled at some point because it gets run.
I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case, one an out of the box solution by celery, and another one that you implement yourself.
You can do something like this with celery workers. In particular, you only create two worker processes with concurrency=1 (or well, one with concurrency=2, but that's gonna be threads, not different processes), this way, only two jobs can be done asynchronously. Now you need a way to raise exceptions if both jobs are occupied, then you use inspect, to count the number of active tasks and throw exceptions if required. For implementation, you can checkout this SO post.
You might also be interested in rate limits.
You can do it all yourself, using a locking solution of choice. In particular, a nice implementation that makes sure only two processes are running with redis (and redis-py) is as simple as the following. (Considering you know redis, since you know celery)
from redis import StrictRedis
redis = StrictRedis('localhost', '6379')
locks = ['compute:lock1', 'compute:lock2']
for key in locks:
lock = redis.lock(key, blocking_timeout=5)
acquired = lock.acquire()
if acquired:
do_huge_computation()
lock.release()
break
print("Gonna try next possible slot")
if not acquired:
raise SystemLimitsReached("Already at max capacity !")
This way you make sure only two running processes can exist in the system. A third processes will block in the line lock.acquire() for blocking_timeout seconds, if the locking was successful, acquired would be True, else it's False and you'd tell your user to wait !
I had the same requirement sometime in the past and what I ended up coding was something like the solution above. In particular
This has the least amount of race conditions possible
It's easy to read
Doesn't depend on a sysadmin, suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneous running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this one with vanila celery.
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help to make sure that you do not have more than 2 active tasks even if you will have errors in the code.
Now you have 2 ways to implement task number validation:
You can use inspect to get number of active and scheduled tasks:
from celery import current_app
def start_job():
inspect = current_app.control.inspect()
active_tasks = inspect.active() or {}
scheduled_tasks = inspect.scheduled() or {}
worker_key = 'celery#%s' % <worker_name>
worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
if len(worker_tasks) >= 2:
raise MyCustomException('It is impossible to start more than 2 tasks.')
else:
my_task.delay()
You can store number of currently executing tasks in DB and validate task execution based on it.
Second approach could be better if you want to scale your functionality - introduce premium users or do not allow to execute 2 requests by one user.
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set the time out for the task waiting in the queue, which is set CELERY_EVENT_QUEUE_TTL and the queue length limit according to how to limit number of tasks in queue and stop feeding when full?.
Therefore, when the two work running jobs, and the task in the queue waiting like 10 sec or any period time you like, the task will be time out. Or if the queue has been fulfilled, new arrival tasks will be dropped out.
Third
you need extra things to deal with notifying "users when jobs are rejected because the server is busy".
Dead Letter Exchanges is what you need. Every time a task is failed because of the queue length limit or message timeout. "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route to another queue, once this queue receive the dead lettered message, you can send a notification message to users.
I want to add a task to the queue at app startup, currently adding a scheduler.queue_task(...) to the main db.py file. This is not ideal as I had to define the task function in this file.
I also want the task to repeat every 2 minutes continuously.
I would like to know what is the best practices for this?
As stated in web2py doc, to rerun task continuously, you just have to specify it at task queuing time :
scheduler.queue_task(your_function,
pargs=your_args,
timeout = 120, # just in case
period=120, # as you want to run it every 2 minutes
immediate=True, # starts task ASAP
repeats=0 # just does the infinite repeat magic
)
To queue it at startup, you might want to use web2py cron feature this simple way:
#reboot root *your_controller/your_function_that_calls_queue_task
Do not forget to enable this feature (-Y, more details in the doc).
There is no real mechanism for this within web2py it seems.
There are a few hacks one could do to continuously repeat tasks or schedule at startup but as far as I can see the web2py scheduler needs alot of work.
Best option is to just abondon this web2py feature and use celery or similar for advanced usage.
Is there any common solution to store and reuse celery task results without executing tasks again? I have many http fetch tasks in my metasearch project and wish to reduce number of useless http requests (they can take long time and return same results) by store results of first one and fire it back without real fetching. Also it will be very useful to does not start new fetch task when the same one is already in progress. Instead of running new job app has to return AsyncResult by id (id is unique and generated by task call args) of already pending task.
Looks like I need to define new apply_async(Celery.send_task) behavior for tasks with same task_id:
if task with given task_id doesn't started yet then start it
if task with given task_id already started return AsyncResult(task_id) without actually run task
#task decorator should accept new ttl
kwarg to determine cache time (only for redis backend?)
Looks like the simplest answer is to store your results in a cache (like a database) and first ask for the result from your cache else fire the http request.
I don't think there's something specific to celery that can perform this.
Edit:
To comply with the fact that you the tasks are sent at the same time an additional thing would be to build a lock for celery task (see Celery Task Lock receipt).
In your case you want to give the lock a name containing the task name and the url name. And you can use whatever system you want for cache if visible by all your workers (Redis in your case?)
I'm using deferred to put tasks in the default queue in a AppEngine app similar to this approach.
Im naming the task with a timestamp that changes every 5 second. During that time a lot of calls are made to the queue with the same name resulting in a TaskAlreadyExistsError which is fine. The problem is when I check the quotas the "Task Queue API Calls" are increasing for every call made, not only those who actually are put in the queue.
I mean if you look at the quota: Task Queue API Calls: 34,017 of 100,000 and compare to the actual queue calls: /_ah/queue/deferred - 2.49K
Here is the code that handles the queue:
try:
deferred.defer(function_call, params, _name=task_name, _countdown=int(interval/2))
except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
pass
I suppose that is the way it works. Is there a good way to solve the problem with the quota? Can I use memcache to store the task_name and check if the task has been added besides the above try/catch? Or is there a way to check if the task already exists without using Task Queue Api Calls?
Thanks for pointing out that this is the case, because I didn't realise, but the same problem must be affecting me.
As far as I can see yeah, throwing something into memcache comprised of the taskname should work fine, and then if you want to reduce those hits on memcache you can store the flag locally also within the instance.
The "good way" to solve the quota problem is eliminating destined-to-fail calls to Task Queue API.
_name you are using changes every 5 seconds which might not be a bottleneck if you increase the execution rate of your Task Queue. But you also add Tasks using _countdown.