I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: one an out-of-the-box solution using Celery, and another one that you implement yourself.
You can do something like this with Celery workers. In particular, you create only two worker processes with concurrency=1 (or one worker with concurrency=2, but then the two slots are threads rather than separate processes); this way, only two jobs can run asynchronously. Next you need a way to raise an exception when both slots are occupied: use inspect to count the number of active tasks and raise an exception if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
You can do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two processes are running with Redis (and redis-py) is as simple as the following (assuming you know Redis, since you already use Celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']
acquired = False
for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        try:
            do_huge_computation()
        finally:
            lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity!")
This way you make sure only two running processes can exist in the system. A third process will block on the lock.acquire() line for up to blocking_timeout seconds; if locking succeeded, acquired is True, otherwise it's False and you'd tell your user to wait!
I had the same requirement some time ago, and what I ended up coding was something like the solution above. In particular:
It has the least amount of race conditions possible.
It's easy to read.
It doesn't break if a sysadmin suddenly doubles the concurrency of workers under load and blows up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this one with vanilla Celery.
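For the per-user variant, a small helper (hypothetical; the naming scheme follows the keys above) can derive the lock keys, which you would then feed into the same acquire loop:

```python
def user_lock_keys(user_id, max_jobs=2):
    # Build per-user lock keys, e.g. 'compute:42:lock1', 'compute:42:lock2'.
    # Each user gets max_jobs slots instead of sharing two global ones.
    return ['compute:%s:lock%d' % (user_id, i) for i in range(1, max_jobs + 1)]
```

Iterating over user_lock_keys(request.user.id) in the loop above then caps concurrency per user rather than globally.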
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This helps make sure that you never have more than 2 active tasks, even if there are errors in your code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in the DB and validate task execution based on it.
The second approach could be better if you want to extend the functionality later - for example, to introduce premium users or to prevent a single user from running 2 requests at once.
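As an illustration of the second approach, here is a minimal sketch. In a real deployment the counter would live in a DB row guarded by an atomic conditional UPDATE (or select_for_update in Django), not in process memory; the class below is only a stand-in to show the check-then-increment logic:

```python
class JobCounter:
    """In-memory stand-in for a DB row tracking running jobs.
    In production, use an atomic UPDATE ... SET running = running + 1
    WHERE running < limit, and check the affected row count."""

    def __init__(self, limit=2):
        self.limit = limit
        self.running = 0

    def try_start(self):
        # Refuse the job if we are already at capacity.
        if self.running >= self.limit:
            return False
        self.running += 1
        return True

    def finish(self):
        # Free a slot when a job ends (never go below zero).
        self.running = max(0, self.running - 1)
```

The task entry point calls try_start() and raises the "server is busy" error when it returns False, and finish() in a finally block when the job ends.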
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue via CELERY_EVENT_QUEUE_TTL, and set the queue length limit as described in "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when the two workers are busy running jobs, a task waiting in the queue for, say, 10 seconds (or any period you like) will time out. And if the queue is full, newly arriving tasks will be dropped.
Third
You need something extra to notify "users when jobs are rejected because the server is busy".
Dead Letter Exchanges are what you need: a task is dead-lettered every time it fails because of the queue length limit or message timeout. "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route to another queue; once that queue receives the dead-lettered message, you can send a notification message to the user.
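With RabbitMQ as the broker, the three pieces above (message TTL, queue length limit, dead-lettering) map to queue arguments like the following sketch; the values and the exchange name are illustrative, not taken from the original post:

```python
# RabbitMQ queue arguments implementing the limits described above
# (declare them on the task queue, e.g. via kombu's queue_arguments).
queue_arguments = {
    'x-message-ttl': 10000,                  # drop tasks that wait > 10 s
    'x-max-length': 2,                       # at most 2 tasks queued
    'x-dead-letter-exchange': 'notify.dlx',  # dropped tasks are rerouted here
}
```

A consumer bound to the dead-letter exchange can then send the "server is busy" notification to the user.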
Related
I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as my task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and that causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one hasn't finished?
There is definitely a way. It's locking.
There is a whole page in the Celery documentation - Ensuring a task is only executed one at a time.
Shortly explained: you can use some cache or even a database to hold a lock, and every time a task starts, check whether the lock is still in use or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock; just set the expiration long enough that the lock doesn't vanish while the task is still running.
There already is a good thread on SO - link.
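A minimal sketch of that docs pattern follows, with a plain dict standing in for the shared cache; in production you would use django.core.cache or Redis, whose add/SETNX operations are atomic across processes (the dict version is only self-contained, not multi-process safe):

```python
from contextlib import contextmanager

_cache = {}  # stand-in for a shared cache backend

def cache_add(key, value):
    """Mimics cache.add: set the key only if absent, report success."""
    if key in _cache:
        return False
    _cache[key] = value
    return True

@contextmanager
def task_lock(lock_id):
    # Acquire the lock (or fail fast), and release it only if we held it.
    acquired = cache_add(lock_id, "locked")
    try:
        yield acquired
    finally:
        if acquired:
            _cache.pop(lock_id, None)

def run_function():
    with task_lock("run_function-lock") as acquired:
        if not acquired:
            return "Skipped: previous run still in progress"
        # ... the actual BlockchainVerify() work would go here ...
        return "Database updated"
```

With a real cache backend you would also pass an expiration to add(), per the caveat above.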
When I add a task to the queue, I give it a unique name, like longprocess-{id}-{timestamp}. The id is the database id of the entry being worked on, and the timestamp ensures I don't have colliding names in the queue.
The issue is that the user can stop/resume the longprocess if he wants, so in the stop request I'd like to list all tasks whose names start with longprocess-1 (for {id} = 1) and stop all of them (expecting 1 entry).
I can target a task with :
q = taskqueue.Queue('longprocess')
q.delete_tasks(taskqueue.Task(name='longprocess-{0}'.format(longprocess.id)))
But of course, this doesn't work because the name is incorrect (it's missing its -{timestamp} part).
Is there something like a q.search('longprocess-1-*') that I would loop over and delete ?
Thank you for your help.
No, there is nothing like q.search('longprocess-1-*'), and there could not be (not that it's technically impossible - it's just not reasonable) due to the nature of queues (in principle; otherwise a queue would just be a DB table).
The advantage (and limitation) of queues is that they are FIFO (first-in-first-out) - not strictly, sometimes with extensions like a "delay" parameter for a task. Either way, the task scheduler/dispatcher/coordinator does not need to care about deleting tasks from the middle of the queue and can concentrate on a limited number of tasks at the head of the queue. From this specialization we gain the speed, cost effectiveness and reliability of the queue concept.
It's your job to handle how you cancel a task. You have at least 2 options:
Store somewhere a task name and use it to delete the task from a queue.
Store somewhere the intention (request) to cancel a task. When the task hits the worker, check the flag and, if needed, just ignore the task.
You can use a combination of these 2 methods for the edge case where a task has been dispatched to a worker but has not completed yet. But in most cases it's not worth the effort.
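A sketch of option 2 (names are invented for illustration): record the cancellation request somewhere the worker can see, and have the worker treat flagged tasks as no-ops:

```python
cancelled = set()  # in production: a datastore/memcache entry keyed by job id

def request_cancel(job_id):
    # Called from the stop request; cheap regardless of queue contents.
    cancelled.add(job_id)

def worker(job_id):
    # The task still gets dequeued and delivered, but does nothing.
    if job_id in cancelled:
        return "ignored"
    return "processed"
```

This sidesteps the name-matching problem entirely: the queue never needs to know which tasks were cancelled.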
By the way, lots of message queuing systems don't have "task deletion" at all. As a Russian saying goes, "a word is not a bird - once it's out, you cannot put it back".
from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')

a_num = 0

@app.task
def addone():
    global a_num
    a_num = a_num + 1
    return a_num
This is the code I used to test Celery.
I expected the return value to increase every time I use addone().
But it's always 1.
Why?
Results:
python
>>> from tasks import addone
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
By default, when a worker is started, Celery starts it with a concurrency of 4, which means it has 4 processes started to handle task requests. (Plus a process that controls the other processes.) I don't know what algorithm is used to assign task requests to the processes started for a worker, but eventually, if you execute addone.delay().get() enough times, you'll see the number get greater than 1. What happens is that each process (not each task) gets its own copy of a_num. When I try it here, my fifth execution of addone.delay().get() returns 2.
You could force the number to increment each time by starting your worker with a single process to handle requests. (e.g. celery -A tasks worker -c1) However, if you ever restart your worker, the numbering will be reset to 0. Moreover, I would not design code that works only if the number of processes handling requests is forced to be 1. One day down the road a colleague decides that multiple processes should handle the requests for the tasks and then things break. (Big fat warnings in comments in the code can only do so much.)
At the end of the day, such state should be shared in a cache, like Redis, or a database used as a cache, which would work for the code in your question.
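As a sketch of that idea, move the counter behind a shared store. In production the store would be a redis client (whose incr is atomic server-side, e.g. StrictRedis('localhost', 6379)); the in-memory class below is only a stand-in to keep the example self-contained:

```python
class InMemoryCounter:
    """Minimal stand-in for Redis; redis_client.incr(key) has the same shape."""

    def __init__(self):
        self._values = {}

    def incr(self, key):
        # Atomic in Redis; here just a plain increment for illustration.
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

store = InMemoryCounter()  # in production: StrictRedis('localhost', 6379)

def addone():
    # Every call, from any worker process, hits the same shared store,
    # so the number keeps increasing regardless of which process serves it.
    return store.incr('a_num')
```

With a real Redis store behind it, the sequence 1, 2, 3, ... survives worker restarts and any concurrency setting.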
However, in a comment you wrote:
Let's say I want to use a task to send something. Instead of connecting every time in the task, I want to share a global connection.
Storing the connection in a cache won't work. I would strongly advocate having each process that Celery starts use its own connection rather than try to share it among processes. The connection does not need to be reopened with each new task request. It is opened once per process, and then each task request served by this process reuses the connection.
In many cases, trying to share the same connection among processes (through sharing virtual memory through a fork, for instance) would flat out not work anyway. Connections often carry state with them (e.g. whether a database connection is in autocommit mode). If two parts of the code expect the connection to be in different states, the code will operate inconsistently.
The tasks run asynchronously in separate worker processes, so each process starts with its own copy of a_num set to 0.
If you want to work with shared values, I suggest a value store or database of some sort.
I have a task that runs every night and updates users from an external system. How can I prevent my server from flooding the external system with requests?
My code:
@task()
def update_users():
    # Get all users
    users = User.objects.all()
    for userobject in users:
        # send to update task:
        update_user.apply_async(args=[userobject.username], countdown=15)
Is there any way to "slow down" the for loop, or is it possible to make Celery not execute a task if there is already a task running?
First, you need to specify what exactly you mean by "flooding" the service. Is it the fact that many requests end up being fired concurrently at one server? If that is the case, a common pattern is to apply a pool of workers with a fixed size N. Using this approach, there is a guarantee that your service is queried with at most N requests concurrently. That is, at any point in time, there will be no more than N outstanding requests. This effectively throttles your request rate.
You can then play with N and do some benchmarking and see which number is reasonable in your specific case.
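A self-contained sketch of the fixed-size pool idea, using a thread pool in place of real Celery workers (query_service stands in for the actual HTTP request to the external system; the concurrency tracking exists only to demonstrate the guarantee):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N = 4  # pool size: at most N outstanding requests at any time

peak = 0      # highest concurrency observed, for demonstration
current = 0
counter_lock = threading.Lock()

def query_service(user):
    """Stand-in for the real external request; tracks peak concurrency."""
    global peak, current
    with counter_lock:
        current += 1
        peak = max(peak, current)
    # ... the actual HTTP call to the external system would go here ...
    with counter_lock:
        current -= 1
    return user

# The pool guarantees no more than N tasks run concurrently.
with ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(query_service, range(100)))
```

Tuning N is exactly the benchmarking step described above: raise it until the external service starts pushing back, then step down.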
You could use a lock on your task, forcing it to execute only once at a time across the whole pool of workers. You can check out this Celery recipe.
Putting a time.sleep in won't help, because there's a chance those tasks will execute at the same time anyway, say if there is some delay on the queue.
You could use time.sleep to delay your task:
import time

@task()
def update_users():
    # Get all users
    users = User.objects.all()
    for userobject in users:
        # send to update task:
        update_user.apply_async(args=[userobject.username], countdown=15)
        time.sleep(1)
This will delay the for loop by 1 second per iteration.
Say my app has a page on which people can add comments.
Say after each comment is added, a taskqueue task is inserted.
So if 100 comments are added, 100 taskqueue insertions are made.
(note: the above is a hypothetical example to illustrate my question)
Say I wanted to ensure that the number of insertions is kept to a minimum (so I don't run into the 10k insertion limit).
Could I do something as follows?
a) As each comment is added, call taskqueue.add(name="stickytask", url="/blah"). Since this is a named task, it will not be re-inserted if a task of the same name already exists.
b) The /blah worker URL reads the newly added comments, processes the first one, and then, if more comments remain to be processed, returns a status code other than 200. This ensures that the task is retried, and at the next try it will process the next comment, and so on.
So all 100 comments are processed with 1 or a few taskqueue insertions. (Note: if there is a lull in activity where no new comments are added and all comments are processed, then the next added comment will result in a new taskqueue insertion.)
However, the docs (see snippet below) note that "the system will back off gradually". Does this mean that for each non-200 HTTP status code returned, a delay is inserted before the next retry?
From the docs:
If the execution of a particular Task fails (by returning any HTTP status code other than 200 OK), App Engine will attempt to retry until it succeeds. The system will back off gradually so as not to flood your application with too many requests, but it will retry failed tasks at least once a day at minimum.
There's no reason to fake a failure (and incur backoff etc.) - that's a hacky and fragile arrangement. If you fear that simply scheduling a task per new comment might exceed the task queues' currently strict limits, then "batch up" as-yet-unprocessed comments in the store (and possibly also in memcache, I guess, for a potential speedup, but that's optional) and don't schedule any task at that time.
Rather, keep a cron job executing (say) every minute, which can deal with some comments or schedule an appropriate number of tasks to deal with pending comments - since you schedule tasks from just one cron job, it's easy to ensure you never schedule over 10,000 per day.
Don't let task queues make you forget that cron is also there: a good architecture for "batch-like" processing will generally use both cron jobs and queued tasks to simplify its overall design.
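The batching part of this architecture can be sketched as a small helper that a per-minute cron handler would call; handle is whatever actually processes one comment, and the names and batch size are mine, not from the original answer:

```python
def process_pending(comments, handle, batch_size=50):
    """Process at most batch_size pending comments this run;
    return the leftovers for the next cron invocation."""
    for comment in comments[:batch_size]:
        handle(comment)
    return comments[batch_size:]
```

The cron handler loads the pending batch from the store, calls process_pending, and writes the remainder back, so no run ever exceeds a fixed amount of work.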
To maximize the amount of useful work accomplished in a single request (from either a queued task or a cron one), consider an approach based on monitoring your CPU usage - when CPU is the factor limiting the work you can perform per request, this can help you get as many small schedulable units of work done in a single request as is prudently feasible. I think this approach is more solid than waiting for an OverQuotaError, catching it and rapidly closing up, as that may have other consequences out of your app's control.