I'm looking at Celery to perform a defined set of tasks spread over multiple machines. Each machine can process any one of several tasks, but some of the tasks will require more machine resources than others. Is there a way to manage these resources using Celery?
Celery doesn't provide a means of measuring current/past resource utilization of workers and adjusting the amount of work they perform based on those measurements. However, you do have a few knobs to turn with Celery that can result in more predictable and more evenly distributed resource utilization (YMMV).
If you have tasks that have no performance requirement, you might consider limiting the number of tasks that can be performed over a given period of time with rate limiting.
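For example, a per-task rate limit can be attached in the task decorator. A minimal sketch (the app, broker URL and task here are assumptions, not your code):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')  # hypothetical broker URL

    @app.task(rate_limit='10/m')  # at most 10 of these per minute, per worker
    def heavy_report(record_id):
        # placeholder for the resource-hungry work
        return record_id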
Another option is to use celery queues to your advantage. Depending on your needs, you might create a queue for light tasks and one for heavy tasks and then have workers with more horsepower listen to the heavy queue and those with less listen to the light queue (or more workers listening on heavy, less on light).
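A sketch of that idea (queue and task names are hypothetical); each worker is then started against only the queue it should consume from:

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')  # hypothetical broker URL

    # Route tasks to dedicated queues.
    app.conf.task_routes = {
        'proj.tasks.heavy_task': {'queue': 'heavy'},
        'proj.tasks.light_task': {'queue': 'light'},
    }

    # Beefy machines:   celery -A proj worker -Q heavy -c 8
    # Smaller machines: celery -A proj worker -Q light -c 2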
Related
I implemented a script in which, every day, I process several URLs and perform many I/O operations; it subclasses threading.Thread and starts a number of threads (say 32).
The workload varies day by day but as soon as the processing starts I am sure that no more tasks will be added to the input queue.
Also, my script does not support any front-end (at least for now).
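Roughly, the current setup looks like this (a simplified sketch; the per-URL work and the placeholder URLs are made up):

    import threading
    import queue
    import urllib.request

    NUM_THREADS = 32

    def process_url(url):
        # Placeholder for the real per-URL work (HTTP requests, DB writes, ...).
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    class UrlWorker(threading.Thread):
        """Pulls URLs from the shared input queue until it is drained, then exits."""
        def __init__(self, tasks):
            super().__init__(daemon=True)
            self.tasks = tasks

        def run(self):
            while True:
                try:
                    url = self.tasks.get_nowait()
                except queue.Empty:
                    return  # no new work is added once processing has started
                try:
                    process_url(url)
                finally:
                    self.tasks.task_done()

    if __name__ == '__main__':
        tasks = queue.Queue()
        for url in ['https://example.com/a', 'https://example.com/b']:  # placeholder URLs
            tasks.put(url)

        workers = [UrlWorker(tasks) for _ in range(NUM_THREADS)]
        for w in workers:
            w.start()
        tasks.join()  # block until every queued URL has been processed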
I feel, though, that this solution will not scale easily to multiple processes / machines, and I would like to give Celery (or any distributed task queue) a shot, but I always read that it is better suited for long-running tasks running in the background to avoid blocking a UI.
On the other hand, I have also read that having many small tasks is not a problem with Celery.
What’s your thought on this? Would it be easy to scale Celery workers across processes / machines?
I have a few Celery workers that perform tasks that are not always that fast. The tasks are usually a bunch of HTTP requests and DB queries (using psycopg2 behind SQLAlchemy). I'm running in Kubernetes and the CPU usage is always fairly low (0.01 or so). Celery automatically sets the concurrency to 2 (the number of cores of a single node), but I was wondering whether it would make sense to manually increase this number.
I always read that the concurrency (processes?) should be the same as the number of cores, but if the worker does not use a whole core, couldn't it be more, like concurrency=10? Or would that make no difference, and am I just missing the point of processes and concurrency?
I couldn't find information on that. Thanks.
Everything you read is true. Celery automatically sets the concurrency to the number of cores, as it assumes that each worker process will use an entire core (CPU-intensive tasks).
It sounds like you can increase the concurrency, since your tasks are mostly I/O bound (and the CPU is idle).
To be on the safe side, I would do it gradually: increase to 5 first, monitor, make sure CPU usage is fine, and then go to 10.
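For reference, a sketch of raising it explicitly, either in the config or on the command line (the module and app names are hypothetical):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')  # hypothetical broker URL

    # Override the default (number of cores) for I/O-bound workloads.
    app.conf.worker_concurrency = 10

    # Equivalent command-line form:
    #   celery -A proj worker --concurrency=10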
I was wondering whether it is possible to set a different prefetch multiplier per queue.
I have 2 queues: one has really short-running tasks, the other slightly longer ones. The queue for shorter tasks needs to be prioritized over the other one.
To ensure that prioritization works reliably, this has to be set in the Celery config:
task_acks_late = True
worker_prefetch_multiplier = 1
However, that really hurts performance for the fast task queue. Would it be possible to configure things so that when a worker is fetching from the fast task queue, worker_prefetch_multiplier is 4, and when it is fetching from the slow task queue, worker_prefetch_multiplier is 1?
I am not sure if it is possible to define different prefetch limits per queue since the Celery documentation seems to set these limits per worker.
However, we are solving this issue by starting a different worker for each queue. You can define different prefetch limits per worker - if one worker only uses one queue you can thus also define different prefetch limits as well as worker concurrencies per queue. This also has the added benefit that your long-running tasks would not block worker processing time for the short-running tasks.
If you are by any chance thinking about using celery-batches to speed up processing of the short-running tasks even further, the queue separation into different workers becomes even more important, since you will then want quite a high prefetch limit defined for that worker (note: you will eventually run out of memory if your prefetch limit is 0 and you have a very full queue).
In our case, we are running our workers in a containerized environment. This enables us to even define the resource allocation (memory / CPU) independently for each worker / queue.
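A sketch of that setup (queue names, task names and the numbers are made up): define the two queues once, then start one dedicated worker per queue, each with its own prefetch multiplier and concurrency:

    from celery import Celery
    from kombu import Queue

    app = Celery('proj', broker='redis://localhost:6379/0')  # hypothetical broker URL

    app.conf.task_acks_late = True
    app.conf.task_queues = (
        Queue('fast'),  # short-running tasks
        Queue('slow'),  # longer-running tasks
    )
    app.conf.task_routes = {
        'proj.tasks.quick_task': {'queue': 'fast'},  # hypothetical task names
        'proj.tasks.long_task':  {'queue': 'slow'},
    }

    # One worker (or container) per queue, each with its own prefetch and concurrency:
    #   celery -A proj worker -Q fast --prefetch-multiplier=4 --concurrency=8
    #   celery -A proj worker -Q slow --prefetch-multiplier=1 --concurrency=2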
I have a Celery cluster made up of machines with 8-core processors. Each machine has one worker that is set to a concurrency factor of 8 (-c8).
I often see nodes with a lot of reserved tasks, but only one or two are running simultaneously. My tasks are often long-running with a lot of compute and I/O.
Any ideas as to why this is happening, and what I can do to increase the number of tasks simultaneously running? Does celery throttle the number of active tasks based on system load? I looked through the documentation but came up short.
Thanks to banana, I think I found the answer.
Some of my tasks were spawning subprocesses, which Celery counts in its concurrency.
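For anyone debugging something similar, a quick way to compare active versus reserved tasks per node is the inspect API (a sketch; the app and broker URL are assumptions):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')  # hypothetical broker URL

    insp = app.control.inspect()
    active = insp.active() or {}      # tasks currently executing on each node
    reserved = insp.reserved() or {}  # tasks prefetched but not yet started

    for node in active:
        print(node,
              'active:', len(active.get(node, [])),
              'reserved:', len(reserved.get(node, [])))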
I have a python (2.6.5 64-bit, Windows 2008 Server R2) app that launches worker processes. The parent process puts jobs in a job queue, from which workers pick them up. Similarly it has a results queue. Each worker performs its job by querying a server. CPU usage by the workers is low.
When the number of workers grows, CPU usage on the servers actually shrinks. The servers themselves are not the bottleneck, as I can load them up further from other applications.
Anyone else seen similar behavior? Is there an issue with python multiprocessing queues when a large number of processes are reading or writing to the same queues?
Two different ideas for performance constraints:
The bottleneck is the workers fighting each other and the parent for access to the job queue.
The bottleneck is connection rate-limits (syn-flood protection) on the servers.
Gathering more information:
Profile the amount of work done: tasks completed per second, use this as your core performance metric.
Use packet capture to view the network activity for network-level delays.
Have your workers document how long they wait for access to the job queue (the sketch below includes a simple wait timer).
Possible improvements:
Have your workers use persistent connections if available/applicable (e.g. HTTP).
Split the tasks into multiple job queues fed to pools of workers.
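A minimal sketch of the multiple-job-queues idea (all names and numbers are made up); it also has each worker record how long it spends blocked on its queue, per the measurement suggestion above:

    import multiprocessing as mp
    import time

    def do_query(job):
        return job * 2  # placeholder for the real server call

    def worker(jobs, results):
        """Consume jobs from one dedicated queue, recording time spent waiting on it."""
        waited = 0.0
        while True:
            t0 = time.monotonic()
            job = jobs.get()  # blocks until a job (or the None sentinel) arrives
            waited += time.monotonic() - t0
            if job is None:
                break
            results.put(do_query(job))
        print('worker waited %.3fs on its job queue' % waited)

    if __name__ == '__main__':
        NUM_QUEUES = 4
        WORKERS_PER_QUEUE = 4
        N_JOBS = 1000

        results = mp.Queue()
        job_queues = [mp.Queue() for _ in range(NUM_QUEUES)]
        procs = [mp.Process(target=worker, args=(q, results))
                 for q in job_queues
                 for _ in range(WORKERS_PER_QUEUE)]
        for p in procs:
            p.start()

        # Spread jobs round-robin over the queues instead of one shared queue.
        for i in range(N_JOBS):
            job_queues[i % NUM_QUEUES].put(i)

        # One sentinel per worker on its queue tells it to exit.
        for q in job_queues:
            for _ in range(WORKERS_PER_QUEUE):
                q.put(None)

        # Drain the results before joining, then wait for the workers to exit.
        collected = [results.get() for _ in range(N_JOBS)]
        for p in procs:
            p.join()
        print(len(collected), 'results collected')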
It is hard to say exactly what is going on without all the details.
However, remember that the real concurrency is bounded by the actual number of hardware threads. If the number of processes launched is much larger than the actual number of hardware threads, at some point the context-switching overhead will be more than the benefit of having more concurrent processes.
Creating a new thread is a very expensive operation.
One of the simplest ways to control a lot of parallel network connections is to use stackless (green) threads with asynchronous socket support. Python has great support and a bunch of libraries for that.
My favorite one is gevent, which has a great and completely transparent monkey-patching utility.
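A minimal sketch of that approach with gevent (the URLs are placeholders): monkey.patch_all() makes the standard socket module cooperative, so many requests can be in flight from a single OS thread:

    from gevent import monkey
    monkey.patch_all()  # must run before importing anything that uses sockets

    import gevent
    import urllib.request

    def fetch(url):
        # Looks like blocking I/O, but yields to other greenlets while waiting.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.status

    urls = ['https://example.com/%d' % i for i in range(100)]  # placeholder URLs
    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs, timeout=30)
    print([job.value for job in jobs if job.value is not None])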