We are running Celery behind Supervisor and start it with
celeryd --events --loglevel=INFO --concurrency=2
This, however, creates a process graph that is up to three layers deep and contains up to 7 celeryd processes (Supervisor spawns one celeryd, which spawns several others, which again spawn processes). Our machine has two CPU cores.
Are all of these processes working on tasks? Are maybe some of them just worker pools? How is the --concurrency setting connected to the number of processes actually spawned?
You shouldn't have 7 processes if --concurrency is 2.
The actual processes started are:
The main consumer process, which delegates work to the worker pool
The worker pool (its size is what --concurrency decides)
So that is 3 processes with a concurrency of two.
In addition, a very lightweight process used to clean up semaphores is started
if force_execv is enabled (which it is by default if you're using some transport
other than redis or rabbitmq).
Note that in some cases process listings also include threads:
the worker may start several threads when using transports other than rabbitmq/redis,
including one Mediator thread that is always started unless CELERY_DISABLE_RATE_LIMITS is enabled.
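If you want to check what is actually running on your box, a quick sanity check is to walk the process tree from the main worker's PID. This is only a sketch: it assumes the third-party psutil package, and the PID below is a placeholder for whatever Supervisor reports for celeryd.
import psutil

main_pid = 12345  # placeholder: the celeryd PID shown by "supervisorctl status"
main = psutil.Process(main_pid)
for child in main.children(recursive=True):
    print(child.pid, child.name())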
Related
I'm using Celery to run a mix of small and big tasks.
Setup:
I'm using separate queues to handle small, medium, and large tasks independently.
There are different celery workers catering to each of the different queues.
Celery 5.2.7, Python 3.8.10
Using Redis as the broker.
Late ack set to True
Prefetch count set to 1
Visibility timeout set to max.
Celery worker started with: celery -A celeryapp worker --concurrency=1 -Ofair -l INFO -E -Q bigtask-queue -n big#%h
I'm facing an issue where the tasks are getting duplicated across multiple workers of the same type. I'm auto-scaling based on the load on the CPU.
For example, when I have 4 tasks with a maximum of 4 workers, each of those 4 tasks is queued for execution on each of the 4 workers. That is, each task gets executed 4 times, once on each machine, sequentially.
What I want is for them to execute just once. If one worker has picked up a task from the queue, the same task shouldn't be picked up by another worker. A new task should be picked up only once the new node is up.
I have played with the existing answers: setting the visibility timeout to the maximum value, setting the prefetch count to 1, and setting late ack to True. Nothing has helped.
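For reference, this is roughly what those settings look like in my Celery config; the broker URL and the concrete visibility timeout value below are just placeholders, not my real values.
from celery import Celery
app = Celery('celeryapp', broker='redis://localhost:6379/0')        # placeholder broker URL
app.conf.task_acks_late = True                                      # late ack
app.conf.worker_prefetch_multiplier = 1                             # prefetch count of 1
app.conf.broker_transport_options = {'visibility_timeout': 43200}   # seconds; set to the chosen maximum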
What am I missing?
Does Celery not recognize that the same task has already been picked up by another worker?
Would using a flag in Redis for each task's status work? Wouldn't there be a race condition if multiple workers are already running?
Are there any other solutions?
Do you have a celery beat worker running?
Something like this:
celery -A run.celery worker --loglevel=info --autoscale=5,2 -n app#beatworker --beat
We had the same problem, but I don't remember now how it was resolved. Try adding this separate worker with the --beat option; there should be only one --beat instance running.
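If keeping track of which worker carries the --beat flag is error-prone, another option (just a suggestion, not what we ended up doing) is to run beat as its own dedicated process, which makes it easy to guarantee there is only one:
celery -A run.celery beat --loglevel=info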
In an environment with 8 cores, Celery should be able to process 8 incoming tasks in parallel by default. But sometimes when new tasks are received, Celery places them behind a long-running task.
I played around with the default configuration, letting one worker consume from one queue.
celery -A proj worker --loglevel=INFO --concurrency=8
Is my understanding wrong that one worker with a concurrency of 8 is able to process 8 tasks from one queue in parallel?
What is the preferred way to set up Celery to prevent the behaviour described above?
To put it simply, concurrency is the number of jobs running on a worker, while prefetch is the number of jobs sitting in a queue on the worker itself. You have one of two options here. The first is to set the prefetch multiplier down to 1; this means the worker will only keep, in your case, 8 additional jobs in its queue. The second, which I would recommend, is to create two different queues: one for your short-running tasks and another for your long-running tasks.
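Here is a sketch of both options; the setting names come from Celery's configuration reference, but the task name, queue names and broker URL are made up for illustration.
from celery import Celery
app = Celery('proj', broker='redis://localhost:6379/0')    # placeholder broker URL
# Option 1: each pool process reserves only one extra task.
app.conf.worker_prefetch_multiplier = 1
# Option 2: route long-running tasks to their own queue, consumed by a separate worker.
app.conf.task_routes = {
    'proj.tasks.long_running_task': {'queue': 'long'},     # hypothetical task name
}
# Then run one worker per queue, e.g.:
#   celery -A proj worker -Q celery -c 8
#   celery -A proj worker -Q long -c 2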
In Python multiprocessing, I am able to create a multiprocessing pool of, say, 30 processes to process some long-running equation on some IDs. The code below spawns 30 processes on an 8-core machine and the load average never exceeds 2.0. In fact, 30 consumers is only a limit because the server hosting the PostgreSQL database with the IDs has 32 cores, so I know I can spawn more processes if my database could handle it.
from multiprocessing import Pool
number_of_consumers = 30
pool = Pool(number_of_consumers)
I have taken the time to set up Celery but am unable to recreate the 30 processes. I thought setting the concurrency, e.g. -c 30, would create 30 processes, but if I'm not wrong that means I'm saying I have 32 processors which I intend to use, which is wrong as I only have 8! Also, I'm seeing the load average hitting 10.0 on an 8-core machine, which is bad.
[program:my_app]
command = /opt/apps/venv/my_app/bin/celery -A celery_conf.celeryapp worker -Q app_queue -n app_worker --concurrency=30 -l info
So, when using Celery, how can I recreate my 30 processes on an 8-core machine?
Edit: Qualifying the Confusion
I thought I'd attach an image to illustrate my confusion about server load when discussing Celery and Python multiprocessing. The server I'm using has 8 cores. Using Python multiprocessing and spawning 30 processes, the load average as seen in the attached diagram is at 0.22, meaning (if my Linux knowledge serves me right) that my script is using one core to spawn the 30 processes, hence a very low load average.
My understanding of the --concurrency=30 option in Celery is that it tells Celery how many cores it will use rather than how many processes it should spawn. Am I right on that? Is there a way to instruct Celery to use 2 cores and spawn 15 processes per core, giving me a total of 30 concurrent processes, so that my server load remains low?
A Celery worker consists of:
Message consumer
Worker Pool
The message consumer fetches the tasks from the broker and sends them to the workers in the pool.
The --concurrency or -c argument specifies the number of processes in that pool, so if you're using the prefork pool (which is the default), then with --concurrency=30 you already have 30 processes in the pool. You can check by looking at the worker output when it starts; it should contain something like:
concurrency: 30 (prefork)
A note from the docs on concurrency:
Number of processes (multiprocessing/prefork pool)
More pool processes are usually better, but there’s a cut-off point where adding more pool processes affects performance in negative ways. There is even some evidence to support that having multiple worker instances running, may perform better than having a single worker. For example 3 workers with 10 pool processes each. You need to experiment to find the numbers that works best for you, as this varies based on application, work load, task run times and other factors.
If you want to start multiple worker instances you should look at celery multi, or start them manually using celery worker.
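As a sketch (the project name and the numbers are made up), celery multi can start several worker instances in one go, for example 3 workers with 10 pool processes each:
celery multi start 3 -A proj -l INFO -c 10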
I am running a series of long-running, heavy-weight Celery tasks (which spawn multiple subprocesses) in a queue with CELERYD_CONCURRENCY = 4. Initially, 4 tasks are started as they should be. However, as tasks finish, no new tasks are started until more finish, and soon Celery keeps the number of active tasks down to 1 or 2 until all tasks are complete (confirmed by Celery Flower).
When I only run simple tasks such as the default Celery add function everything works as expected.
Do the subprocesses started by Celery tasks (with the same process group ID as the task) count toward filling up the concurrency slots? Is there any way to make sure Celery only counts the tasks themselves?
Celery uses prefork as the default execution pool, and every time you spawn a subprocess (another fork), it counts toward the number of concurrent processes running, i.e. the number in CELERYD_CONCURRENCY.
The way to avoid this is to use eventlet, which will allow you to spawn multiple async calls from each task, as long as your tasks don't have any calls that block, like subprocess.communicate.
To optimize further, you can try splitting the tasks that use subprocess.communicate into a different queue served by a worker using prefork, and everything else that doesn't block into a worker using eventlet.
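A rough sketch of that split, with made-up task names, queue names and broker URL (the worker commands in the comments are illustrative, not a drop-in config):
from celery import Celery
app = Celery('proj', broker='amqp://')                          # placeholder broker URL
app.conf.task_routes = {
    'proj.tasks.run_shell_command': {'queue': 'blocking'},      # the task that calls subprocess.communicate
    'proj.tasks.*': {'queue': 'green'},                         # everything else
}
# Consume each queue with a different pool implementation, e.g.:
#   celery -A proj worker -Q blocking -P prefork -c 4
#   celery -A proj worker -Q green -P eventlet -c 100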
If I deploy Django using Gunicorn with the Eventlet worker type, and only use one process, what happens under the hood to service the 1000 (by default) worker connections? What parts of Django are copied into each thread? Are any parts copied?
If you set workers = 1 in your gunicorn configuration, two processes will be created: 1 master process and 1 worker process.
When you use worker_class = eventlet, the simultaneous connections are handled by green threads. Green threads are not like real threads. In simple terms, green threads are functions (coroutines) that yield whenever the function encounters an I/O operation.
So nothing is copied. You just need to worry about making every I/O operation 'green'.
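For reference, the setup described above corresponds roughly to a gunicorn config like this (the file name and values are illustrative; worker_connections already defaults to 1000):
# gunicorn.conf.py
workers = 1                  # one master process plus one worker process
worker_class = "eventlet"    # connections inside that worker are handled by green threads
worker_connections = 1000    # the default ceiling mentioned in the question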