I have a task that runs every night and updates users from an external system. How can I prevent my server from flooding the external system with requests?
My code:
@task()
def update_users():
    # Get all users
    users = User.objects.all()
    for userobject in users:
        # Send to the update task
        update_user.apply_async(args=[userobject.username], countdown=15)
Is there any way to "slow down" the for loop, or is it possible to make Celery not execute a task if there is already one running?
First, you need to specify what exactly you mean by "flooding" the service. Is it the fact that many requests end up being fired concurrently at one server? If that is the case, a common pattern is to use a pool of workers with a fixed size N. This approach guarantees that your service is queried with at most N requests concurrently; that is, at any point in time there will be no more than N outstanding requests. This effectively throttles your request rate.
You can then play with N, do some benchmarking, and see which number is reasonable in your specific case.
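For example, with Celery you could route the update task to a dedicated queue and run a single worker with a fixed concurrency against it. A rough sketch (the queue name external, the task path myapp.tasks.update_user and the proj app name are placeholders of mine, not from the question):
# celeryconfig.py (sketch): send update_user to its own queue
task_routes = {
    'myapp.tasks.update_user': {'queue': 'external'},
}
Then start the only consumer of that queue with, say, N=4:
celery -A proj worker -Q external --concurrency=4
With this setup at most 4 update_user tasks can hit the external system at the same time, no matter how quickly the nightly loop enqueues them.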
You could use a lock on your task, forcing it to execute only once at a time across the whole pool of workers. You can check this Celery recipe.
Putting a time.sleep in won't help you, because there's still a chance that those tasks will execute at the same time, say if there is some delay on the queue.
You could use time.sleep to delay your tasks:
import time

@task()
def update_users():
    # Get all users
    users = User.objects.all()
    for userobject in users:
        # Send to the update task
        update_user.apply_async(args=[userobject.username], countdown=15)
        time.sleep(1)
This will pause the for loop for 1 second after each dispatched task.
I've been working with Python/Django and things are going well, but I'm very curious about how Django handles asynchronous work: for example, what happens if a database query takes 10 seconds to execute and another request arrives at the server during those 10 seconds?
def my_api(id):
    do_something()
    user = User.objects.get(id=id)  # takes 10 seconds to execute
    do_something_with_user()
I've used a queryset method that fetches something from the database, and let's say it takes 10 seconds to complete. What happens when another request arrives at the Django server, and how does Django respond to it? Will the request be pending until the first query finishes, or will it be handled in parallel? What is the underlying concept here?
How does Python handle these kinds of things under the hood?
Handling requests is the main job of the web server you are using. Django applications are served by default through WSGI or ASGI.
Each request coming into those web servers gets its own session and is handled in a separate thread, so the requests themselves do not conflict, but race conditions on shared resources (OS resources, queues, stacks, ...) are still possible.
So consider a function that sleeps for 2 seconds:
import time

def function1():
    time.sleep(2)
User 1 and user 2 both request this function through an API (e.g. GET test/API).
Two threads start, one for each request, and if both start at roughly 0:0:0 then both finish at roughly 0:0:2.
So how about async? Async requests are handled concurrently (not in parallel).
Assume another API endpoint calls that sleepy function twice (e.g. GET test/API2).
The first call takes 2 seconds (an async function being awaited), and the second call takes 2 seconds as well.
If we call this at 0:0:0, it finishes at around 0:0:2 again.
Done synchronously, it would finish at 0:0:4.
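To make that concrete, here is a small self-contained sketch with plain asyncio (not tied to any Django view) showing the two sleepy calls finishing in about 2 seconds instead of 4:
import asyncio
import time

async def sleepy():
    await asyncio.sleep(2)  # stands in for slow I/O

async def main():
    start = time.monotonic()
    # Both awaits run concurrently, so the total is ~2 seconds, not ~4.
    await asyncio.gather(sleepy(), sleepy())
    print(f"took {time.monotonic() - start:.1f}s")

asyncio.run(main())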
So finally, what about the database?
There are many ways to handle that, and a popular one is database connection pooling. Without a pool, each query request generally opens a new connection to the database (one per thread) and runs its queries over it.
A connection pool reduces runtime because it keeps some ready-to-use connections to the database: whenever there is a query to run, a connection is borrowed from the pool and handed back afterwards, whereas without the pool each connection only lives for the duration of its job and is closed as soon as the job is done.
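Django itself does not ship a full connection pool. As a rough illustration of reducing per-request connection cost (my own suggestion, not part of the answer above), you can at least keep connections alive between requests with CONN_MAX_AGE, and put an external pooler such as pgbouncer in front of the database if you need real pooling:
# settings.py (sketch) - reuse database connections across requests
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',          # placeholder credentials
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'CONN_MAX_AGE': 60,      # keep each connection open for 60 seconds
    }
}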
I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as my task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing its calculations, and that causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one isn't done?
There is definitely a way. It's locking.
There is a whole page in the Celery documentation about this - Ensuring a task is only executed one at a time.
Briefly explained - you can use a cache or even the database to hold a lock, and every time the task starts, check whether that lock is still in use or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock, and the expiration should be set long enough to cover the case where the task is still legitimately running.
There already is a good thread on SO - link.
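A minimal sketch of that locking idea using Django's cache, loosely following the Celery documentation recipe; the lock key name and the 30-minute expiration are assumptions of mine:
from celery import shared_task
from django.core.cache import cache

from .utils.database_blockchain import BlockchainVerify

LOCK_EXPIRE = 60 * 30  # seconds; should exceed the longest expected run

@shared_task()
def run_function():
    lock_id = 'run_function-lock'
    # cache.add only succeeds if the key does not exist yet (atomic on memcached/redis backends)
    if not cache.add(lock_id, 'locked', LOCK_EXPIRE):
        return "Skipped - previous run still in progress"
    try:
        build_block = BlockchainVerify()
        return "Database updated"
    finally:
        cache.delete(lock_id)  # release the lock even if the task fails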
I want to query an API (which is time-consuming) with lots of items (~100), but not all at once. Instead I want a little delay between the queries.
What I currently have is a task that gets executed asynchronously, iterates over the queries, and waits some time after each iteration:
import time

from celery import shared_task

@shared_task
def query_api_multiple(values):
    delay_between_queries = 1
    query_results = []
    for value in values:
        time.sleep(delay_between_queries)
        response = query_api(value)
        if response['result']:
            query_results.append(response)
    return query_results
My question is: when multiple of those requests come in, will the second one be executed after the first has finished, or while the first is still running? And if they do run at the same time, how can I make sure they don't?
You should not use time.sleep but rate limit your task instead:
Task.rate_limit
Set the rate limit for this task type (limits the
number of tasks that can be run in a given time frame).
The rate limits can be specified in seconds, minutes or hours by
appending “/s”, “/m” or “/h” to the value. Tasks will be evenly
distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum
delay of 600ms between starting two tasks on the same worker instance.
So if you want to limit it to 1 query per second, try this:
@shared_task(rate_limit='1/s')
def query_api_multiple(values):
    ...
Yes, if you create multiple tasks then they may run at the same time.
You can rate limit on a per-task-type basis with Celery if you want to limit the number of tasks that run per period of time. Alternatively, you could implement a rate limiting pattern using something like Redis, combined with Celery retries, if you need more flexibility than what Celery provides out of the box.
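For example, a rough sketch of that Redis-plus-retries variant; the key prefix, the 10-calls-per-minute budget and the 15-second retry countdown are arbitrary choices of mine, and the sketch assumes one API call per task so that each call can be retried independently:
import time

from celery import shared_task
from redis import StrictRedis

redis_client = StrictRedis('localhost', 6379)

@shared_task(bind=True, max_retries=None)
def query_api_rate_limited(self, value):
    # One counter per minute-sized window; expire it so old windows clean themselves up.
    window = int(time.time() // 60)
    key = 'api-rate:%d' % window
    count = redis_client.incr(key)
    redis_client.expire(key, 120)
    if count > 10:  # more than 10 calls in this window: back off and retry later
        raise self.retry(countdown=15)
    return query_api(value)  # the API helper from the question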
I am trying to come up with a notification service for a list of events for which the data is available in the database every few minutes and gets updated by some mechanism. 2 minutes before the next event, I need to read this database and send the data out to my subscribers as a reminder that the event is about to start. These times are not fixed; they depend on the time of the next event.
Right now I am creating a Celery worker for every user who subscribes. I make that specific Celery worker go to sleep until the next event, at which point it resumes and sends out the message.
Something like this:
nextEventDelay = events.getTimeToNextEventInSeconds()
sleep(nextEventDelay)
SendEventNotification()
But I know this is not good. For one or two people it works, but for 1000 users, spawning 1000 workers will not be good.
So my solution? I am thinking of creating a single worker process that monitors the database for subscribers and, once a notification is due to be sent out, reads from the database and sends it to them. But this takes care of only one event. Should I keep this in an infinite loop to notify about the next event?
I am using Celery with Redis for async task management. The application is a Python Flask application. Let me know if you need any more info. Thanks.
Using Celery beat you could run a job every x seconds to check whether any events are within two minutes of starting. You could then trigger your 'reminder' jobs from that task.
Here is the documentation for periodic celery tasks.
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
I would suggest you stay far away from long-running Celery tasks, as I have not had a great experience with them.
Here is some untested pseudo code to get you started.
from celery import Celery
from celery.schedules import crontab

app = Celery()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Check for upcoming events every 20 seconds
    sender.add_periodic_task(20.0, trigger_reminders.s(), name='check for upcoming events')

@app.task
def trigger_reminders(*args, **kwargs):
    upcoming_events = get_upcoming_events()
    for event in upcoming_events:
        send_notification.delay(event)

@app.task
def send_notification(*args, **kwargs):
    # Send the user notification
    ...
I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see any way in the documentation to accomplish this particular setup.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: one out-of-the-box solution from Celery, and another that you implement yourself.
You can do something like this with Celery workers. In particular, you create only two worker processes with concurrency=1 (or one worker with concurrency=2, though depending on the pool those may be threads rather than separate processes); this way, only two jobs can run asynchronously at once. Then you need a way to raise an exception if both slots are occupied: you use inspect to count the number of active tasks and raise an exception if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
You can also do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two jobs are ever running, using Redis (and redis-py), is as simple as the following (assuming you know Redis, since you know Celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)

# Two lock keys == two available job slots in the whole system
locks = ['compute:lock1', 'compute:lock2']

for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity!")
This way you make sure only two running jobs can exist in the system. A third process will block on the lock.acquire() line for blocking_timeout seconds; if the locking was successful, acquired will be True, otherwise it's False and you tell your user to wait!
I had the same requirement some time ago, and what I ended up coding was something like the solution above. In particular:
It has the fewest race conditions possible.
It's easy to read.
It doesn't depend on a sysadmin suddenly doubling the concurrency of the workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, simply by changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this with vanilla Celery.
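For instance, the per-user variant only changes how the lock keys are built; user_id and the slot count of 2 here are placeholders:
def lock_keys_for(user_id, slots=2):
    # Each user gets their own set of job slots
    return ['compute:%s:lock%d' % (user_id, i) for i in range(1, slots + 1)]

# lock_keys_for(42) -> ['compute:42:lock1', 'compute:42:lock2']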
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This helps make sure that you do not have more than 2 active tasks even if there are errors in your code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in the DB and validate task execution based on it.
The second approach could be better if you want to extend the functionality later - for example, to introduce premium users, or to not allow 2 simultaneous requests from one user.
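A rough sketch of that second approach, with a small counter model I am inventing here (JobSlot) and a row lock to avoid races; MyCustomException and my_task are the same names as in the snippet above:
from django.db import models, transaction

class JobSlot(models.Model):
    # A single row holding how many heavy jobs are currently running
    running = models.PositiveIntegerField(default=0)

def try_start_job():
    with transaction.atomic():
        slot, _ = JobSlot.objects.select_for_update().get_or_create(pk=1)
        if slot.running >= 2:
            raise MyCustomException('It is impossible to start more than 2 tasks.')
        slot.running += 1
        slot.save()
    my_task.delay()  # the task must decrement the counter when it finishes (or fails)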
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue, which is configured with CELERY_EVENT_QUEUE_TTL, and a queue length limit, as described in "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when the two workers are busy running jobs and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), that task will time out. Or, if the queue is already full, newly arriving tasks will be dropped.
Third
You need something extra to deal with notifying "users when jobs are rejected because the server is busy".
Dead Letter Exchanges are what you need here: every time a task fails because of the queue length limit or a message timeout, "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route to another queue; once that queue receives the dead-lettered message, you can send a notification message to the user.
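A minimal sketch of such a queue declaration with kombu in your Celery configuration, assuming RabbitMQ as the broker; the queue names, the 10-second TTL and the length limit of 2 are placeholders of mine:
from kombu import Exchange, Queue

# Main queue for the heavy jobs: bounded length, 10s message TTL,
# rejected/expired messages are re-routed to the 'busy' exchange.
task_queues = (
    Queue(
        'heavy_jobs',
        Exchange('heavy_jobs'),
        routing_key='heavy_jobs',
        queue_arguments={
            'x-max-length': 2,
            'x-message-ttl': 10000,                 # milliseconds
            'x-dead-letter-exchange': 'busy',
            'x-dead-letter-routing-key': 'busy',
        },
    ),
    # A consumer on this queue can notify the user that the server is busy.
    Queue('busy_notifications', Exchange('busy'), routing_key='busy'),
)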