I want to query an api (which is time consuming) with lots of items (~100) but not all at once. Instead I want a little delay between the queries.
What I currently have is a task that gets executed asynchronously and iterates over the queries and after each iteration waits some time:
@shared_task
def query_api_multiple(values):
    delay_between_queries = 1
    query_results = []
    for value in values:
        time.sleep(delay_between_queries)
        response = query_api(value)
        if response['result']:
            query_results.append(response)
    return query_results
My question is: when multiple of those requests come in, will the second one get executed after the first has finished, or while the first is still running? And if they do run at the same time, how can I make them run one after another?
You should not use time.sleep; rate limit your task instead:
Task.rate_limit
Set the rate limit for this task type (limits the
number of tasks that can be run in a given time frame).
The rate limits can be specified in seconds, minutes or hours by
appending “/s”, “/m” or “/h” to the value. Tasks will be evenly
distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum
delay of 600ms between starting two tasks on the same worker instance.
So if you want to limit it to 1 query per second, try this:
@shared_task(rate_limit='1/s')
def query_api_multiple(values):
    ...
Yes, if you create multiple tasks then they may run at the same time.
You can rate limit on a per-task-type basis with Celery if you want to limit the number of tasks that run per period of time. Alternatively, you could implement a rate limiting pattern using something like Redis, combined with Celery retries, if you need more flexibility than what Celery provides out of the box.
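For illustration, a rough sketch of such a pattern with redis-py and Celery retries (the Redis connection details, the key name, and the 1-call-per-second window are assumptions for this example, not anything Celery provides):

import time

import redis
from celery import shared_task

r = redis.Redis(host='localhost', port=6379)  # assumed Redis instance

@shared_task(bind=True, max_retries=None)
def query_api_throttled(self, value):
    # count calls in the current one-second window (key name is illustrative)
    window_key = 'query_api:%d' % int(time.time())
    calls = r.incr(window_key)
    r.expire(window_key, 2)
    if calls > 1:
        # the slot for this second is taken: retry shortly instead of sleeping
        raise self.retry(countdown=1)
    return query_api(value)  # query_api as defined in the question above

Unlike rate_limit, which is enforced per worker instance, a shared counter like this throttles across all workers.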
I'm trying to make a task that runs at a certain time. Example:
A customer borrowed a book on 01/01/2023 15:00, so the task should run exactly one week later and charge a fee if the book hasn't been returned.
How can I make tasks run at specific, different times?
I'm trying to use Django Celery with RabbitMQ, but I've only managed to run tasks on a fixed schedule, e.g. every 60 minutes, not at arbitrary times.
You can check Celery beat for your case https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
In general, what you can do is create:
a task that runs every 60 minutes and checks your DB instances for any upcoming notifications in the next hour;
if there are any, trigger a Celery task to notify the user at the right time (a minimal beat schedule sketch follows below).
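A minimal beat schedule sketch for the 60-minute check (the task path myapp.tasks.check_upcoming_charges is a placeholder for your own task, and the setting name assumes the usual Django + Celery setup with app.config_from_object('django.conf:settings', namespace='CELERY')):

from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'check-upcoming-charges': {
        'task': 'myapp.tasks.check_upcoming_charges',  # placeholder task
        'schedule': crontab(minute=0),  # run at the start of every hour
    },
}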
Check out periodic tasks with celery: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
Just create a task that is executed every minute (or every five minutes), checks whether a charge should be applied, and applies it to all matching objects.
I currently have an application running on appengine and I am executing a few jobs using the deferred library, some of these tasks run daily, while some of them are executed once a month. Most of these tasks query Datastore to retrieve documents and then store the entities in an index (Search API). Some of these tables are replaced monthly and I have to run these tasks on all entities (4~5M).
One example of such a task is:
def addCompaniesToIndex(cursor=None, n_entities=0, mindate=None):
    # get index
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().\
                             fetch_page(BATCH_SIZE,
                                        start_cursor=cursor)
    doc_list = []
    for i in range(0, len(cps)):
        cp = cps[i]
        # create an Index Document using the Datastore entity
        # this document has only about 5 text fields and one date field
        cp_doc = getCompanyDocument(cp)
        doc_list.append(cp_doc)

    index = search.Index(name='Company')
    index.put(doc_list)

    n_entities += len(doc_list)

    if more:
        logging.debug('Company: %d added to index', n_entities)
        #to_put[:] = []
        doc_list[:] = []
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       n_entities=n_entities,
                       mindate=mindate)
    else:
        logging.debug('Finished Company index creation (%d processed)', n_entities)
When I run one task only, the execution takes around 4-5s per deferred task, so indexing my 5M entities would take about 35 hours.
Another thing is that when I run an update on another index (e.g. one of the daily updates) using a different deferred task on the same queue, both run a lot slower and start taking about 10-15 seconds per deferred call, which is just unbearable.
My question is: is there a way to do this faster and scale the push queue to more than one job running each time? Or should I use a different approach for this problem?
Thanks in advance,
By placing the if more statement at the end of the addCompaniesToIndex() function you're practically serializing the task execution: the next deferred task is not created until the current deferred task has completed indexing its share of docs.
What you could do is move the if more statement right after the Company.query().fetch_page() call where you obtain (most of) the variables needed for the next deferred task execution.
This way the next deferred task would be created and enqueued (long) before the current one completes, so their processing can potentially be overlapping/staggered. You will need some other modifications as well, for example handling the n_entities variable which loses its current meaning in the updated scenario - but that's more or less cosmetic/informational, not essential to the actual doc indexing operation.
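Something along these lines (a simplified sketch keeping the names from the question; n_entities is dropped since it no longer has a stable meaning in the staggered version):

def addCompaniesToIndex(cursor=None, mindate=None):
    BATCH_SIZE = 200
    cps, next_cursor, more = Company.query().fetch_page(BATCH_SIZE,
                                                        start_cursor=cursor)

    # enqueue the next batch right away, before the slow indexing work,
    # so consecutive deferred tasks can overlap
    if more:
        deferred.defer(addCompaniesToIndex,
                       cursor=next_cursor,
                       mindate=mindate)

    doc_list = [getCompanyDocument(cp) for cp in cps]
    search.Index(name='Company').put(doc_list)
    logging.debug('Company: %d docs added to index in this batch', len(doc_list))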
If the number of deferred tasks is very high there is a risk of queueing too many of them simultaneously, which could cause an "explosion" in the number of instances GAE would spawn to handle them. If that is not desired, you can "throttle" the rate at which the deferred tasks are spawned by delaying their execution a bit, see https://stackoverflow.com/a/38958475/4495081.
I think I finally managed to get around this issue by using two queues and the idea proposed in the previous answer.
On the first queue we only query the main entities (with keys_only) and launch another task on a second queue for those keys. The first task then relaunches itself on queue 1 with the next_cursor.
The second queue gets the entity keys and does all the queries and inserts into Full Text Search/BigQuery/PubSub (this is slow, ~15s per group of 100 keys).
I tried using only one queue as well but the processing throughput was not as good. I believe that this might come from the fact that we have slow and fast tasks running on the same queue and the scheduler might not work as well in this case.
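Roughly, the split looks like this (a sketch only; the queue names, batch size, and function names are examples, not the exact production code):

from google.appengine.api import search
from google.appengine.ext import deferred, ndb

def scanCompanies(cursor=None):
    # fast task on queue 'scan': fetch keys only and fan out the slow work
    keys, next_cursor, more = Company.query().fetch_page(
        100, start_cursor=cursor, keys_only=True)
    deferred.defer(indexCompanies, [k.urlsafe() for k in keys], _queue='index')
    if more:
        deferred.defer(scanCompanies, cursor=next_cursor, _queue='scan')

def indexCompanies(urlsafe_keys):
    # slow task on queue 'index': load the entities and push the documents
    cps = ndb.get_multi([ndb.Key(urlsafe=k) for k in urlsafe_keys])
    doc_list = [getCompanyDocument(cp) for cp in cps if cp is not None]
    search.Index(name='Company').put(doc_list)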
I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: one is an out-of-the-box solution provided by Celery, and the other one you implement yourself.
You can do something like this with Celery workers. In particular, you only create two worker processes with concurrency=1 (or a single worker with concurrency=2, though with the default prefork pool that still forks two child processes under one worker); this way, only two jobs can run asynchronously. Now you need a way to raise an exception if both slots are occupied; you can use inspect to count the number of active tasks and throw an exception if required. For the implementation, you can check out this SO post.
You might also be interested in rate limits.
You can do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two processes are running with Redis (and redis-py) is as simple as the following (assuming you know Redis, since you know Celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']

for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity !")
This way you make sure only two running processes can exist in the system. A third process will block at the lock.acquire() line for up to blocking_timeout seconds per slot; if the locking was successful, acquired will be True, otherwise it is False and you tell your user to wait!
I had the same requirement some time in the past and what I ended up coding was something like the solution above. In particular:
it has the fewest race conditions possible;
it's easy to read;
it doesn't depend on a sysadmin suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this with vanilla Celery.
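For the per-user variant, only the key construction changes, e.g.:

# two slots per user instead of two global slots (user_id is assumed to be
# the id of the requesting user)
locks = ['compute:%s:lock1' % user_id, 'compute:%s:lock2' % user_id]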
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help make sure that you do not have more than 2 active tasks even if there are errors in your code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in the DB and validate task execution based on it.
The second approach could be better if you want to scale this functionality later, for example to introduce premium users or to not allow one user to execute 2 requests at the same time.
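A sketch of the DB-based variant, using a made-up JobSlot model that holds a counter of running jobs (the model, MAX_JOBS value, and the surrounding names MyCustomException and my_task are illustrative, reusing the names from the snippet above):

from django.db import transaction
from django.db.models import F

MAX_JOBS = 2

def start_job():
    # JobSlot is a hypothetical model with: running = models.IntegerField(default=0)
    with transaction.atomic():
        slot = JobSlot.objects.select_for_update().get(pk=1)
        if slot.running >= MAX_JOBS:
            raise MyCustomException('It is impossible to start more than 2 tasks.')
        slot.running = F('running') + 1
        slot.save()
    my_task.delay()  # the task must decrement the counter when it finishes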
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue via CELERY_EVENT_QUEUE_TTL, and a queue length limit as described in "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when two workers are running jobs and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), the task will time out. Or, if the queue is full, newly arriving tasks will be dropped.
Third
You need something extra to notify users when jobs are rejected because the server is busy.
Dead letter exchanges are what you need, for every task that fails because of the queue length limit or message timeout: "Messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route dead-lettered messages to another queue; once that queue receives the dead-lettered message, you can send a notification to the user.
I have a problem regarding Python's APScheduler.
I'm running a task that pulls data from a DB. The DB's response time varies because of different operations on it from different sources, and predicting when the response time will be low is not possible.
For example, when running
scheduler.add_interval_job(self.readFromDb, start_date = now(), seconds=60)
The seconds parameter stops the task if it didn't finish, and starts the next one.
Is there a way of changing the seconds parameter dynamically? Or should I use the default value of 0?
cheers
The "seconds" parameter does not in any way limit how long the job can take, and it certainly does not terminate it prematurely. However, it will, with the default settings, prevent another instance of the job from being spawned if the previous instance is taking longer than the specified interval (60 seconds here). The way I see it, you have two options here:
Ignore the fact that a new instance of the task sometimes fails to start
Increase the max_instances parameter from the default of 1 so more than one instance of the task can run concurrently
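For the second option, it is just an extra keyword argument on the call from the question (max_instances=3 is only an example value; self.readFromDb and now() are taken from the question's snippet):

# allow up to 3 overlapping runs of readFromDb instead of the default 1
scheduler.add_interval_job(self.readFromDb,
                           start_date=now(),
                           seconds=60,
                           max_instances=3)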
I have a task that runs every night and updates users from an external system. How can I prevent my server from flooding the external system with requests?
My code:
@task()
def update_users():
    # get all users
    users = User.objects.all()
    for userobject in users:
        # send to update task
        update_user.apply_async(args=[userobject.username,], countdown=15)
Is there any way to "slow down" the for loop, or is it possible to make Celery not execute a task if there is already a task running?
First, you need to specify what exactly you mean by "flooding" the service. Is it the fact that many requests end up being fired concurrently at one server? If that is the case, a common pattern is to apply a pool of workers with a fixed size N. Using this approach, there is a guarantee that your service is queried with at most N requests concurrently. That is, at any point in time, there will be no more than N outstanding requests. This effectively throttles your request rate.
You can then play with N and do some benchmarking and see which number is reasonable in your specific case.
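With Celery, one way to get such a fixed-size pool is to route the per-user task to a dedicated queue and consume it with a single worker whose concurrency is capped at N (the queue name, task path, and N=4 below are illustrative assumptions):

# settings.py -- route only the external-API task to its own queue
CELERY_ROUTES = {
    'myapp.tasks.update_user': {'queue': 'external_api'},  # placeholder path
}

Then start one worker for that queue, e.g. celery -A proj worker -Q external_api --concurrency=4, so at most 4 requests hit the external system at any moment.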
You could use a lock on your task, forcing it to execute only once at a time for the whole pool of workers. You can check this celery recipe.
Putting a time.sleep in won't help you, because there's a chance that those tasks will execute at the same time, say if there is some delay on the queue.
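The linked recipe boils down to something like this (a sketch using Django's cache as the lock backend; the lock id and timeout are illustrative, and the @task() style matches the question's older Celery usage):

from celery import task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 10  # let the lock expire eventually in case of a crash

@task()
def update_users():
    lock_id = 'update_users-lock'
    # cache.add is atomic: it only succeeds if the key does not exist yet
    if not cache.add(lock_id, 'true', LOCK_EXPIRE):
        return  # another update_users run is already in progress
    try:
        for userobject in User.objects.all():
            update_user.apply_async(args=[userobject.username])
    finally:
        cache.delete(lock_id)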
You could use time.sleep to delay your task:
import time

@task()
def update_users():
    # get all users
    users = User.objects.all()
    for userobject in users:
        # send to update task
        update_user.apply_async(args=[userobject.username,], countdown=15)
        time.sleep(1)
This will delay the for loop by 1 second between each dispatch.