How to implement a multiprocessing pool using Celery - python

With Python multiprocessing, I am able to create a pool of, say, 30 processes to run some long-running equation on some IDs. The code below spawns 30 processes on an 8-core machine and the load_average never exceeds 2.0. In fact, 30 consumers is a self-imposed limit: the server hosting the PostgreSQL database that holds the IDs has 32 cores, so I know I could spawn more processes if the database could handle it.
from multiprocessing import Pool
number_of_consumers = 30
pool = Pool(number_of_consumers)
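For context, the full pattern looks roughly like this; a minimal sketch, assuming a hypothetical process_id function standing in for the long-running equation and a placeholder list of IDs:
from multiprocessing import Pool

def process_id(record_id):
    # placeholder for the long-running equation run against one ID
    return record_id * 2

if __name__ == "__main__":
    ids = range(1000)  # placeholder for the IDs fetched from PostgreSQL
    number_of_consumers = 30
    with Pool(number_of_consumers) as pool:
        results = pool.map(process_id, ids)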
I have taken the time to set up Celery but I am unable to recreate the 30 processes. I thought setting the concurrency, e.g. -c 30, would create 30 processes, but if I am not wrong that would mean I have 32 processors which I intend to use, which is wrong as I only have 8! I am also seeing the load_average hit 10.0 on an 8-core machine, which is bad.
[program:my_app]
command = /opt/apps/venv/my_app/bin/celery -A celery_conf.celeryapp worker -Q app_queue -n app_worker --concurrency=30 -l info
So, when using Celery, how can I recreate my 30 processes on an 8-core machine?
Edit: Qualifying the Confusion
I thought I'd attach an image to illustrate my confusion about server load when discussing Celery and Python multiprocessing. The server I am using has 8 cores. Using Python multiprocessing and spawning 30 processes, the load average as seen in the attached diagram is 0.22, meaning, if my Linux knowledge serves me right, that my script is using one core to spawn the 30 processes and hence a very low load_average.
My understanding of the --concurrency=30 option in Celery is that it tells Celery how many cores it will use rather than how many processes it should spawn. Am I right about that? Is there a way to instruct Celery to use 2 cores and spawn 15 processes per core, giving me a total of 30 concurrent processes, so that my server load remains low?

A Celery worker consists of:
Message consumer
Worker Pool
The message consumer fetches the tasks from the broker and sends them to the workers in the pool.
The --concurrency or -c argument specifies the number of processes in that pool. If you're using the prefork pool (the default), then --concurrency=30 already gives you 30 processes in the pool. You can check by looking at the worker output when it starts; it should contain something like:
concurrency: 30 (prefork)
A note from the docs on concurrency:
Number of processes (multiprocessing/prefork pool)
More pool processes are usually better, but there’s a cut-off point where adding more pool processes affects performance in negative ways. There is even some evidence to support that having multiple worker instances running, may perform better than having a single worker. For example 3 workers with 10 pool processes each. You need to experiment to find the numbers that works best for you, as this varies based on application, work load, task run times and other factors.
If you want to start multiple worker instances you should look at celery multi, or start them manually using celery worker.
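For example, a hedged sketch of starting 3 worker instances with 10 pool processes each via celery multi, where the worker names w1 w2 w3 are arbitrary and the app and queue names are taken from the question above:
celery multi start w1 w2 w3 -A celery_conf.celeryapp -Q app_queue -l info -c 10
celery multi stopwait w1 w2 w3 -A celery_conf.celeryapp
The first command starts three named workers with 10 prefork processes each (still 30 processes in total, but spread over 3 consumers); the second shuts them down gracefully.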

Related

Does the number of CPU threads limit locust USERS?

I'm using Python + Locust for performance testing. I mostly use Java, and in Java 1 CPU thread = 1 Java thread, so if I have a VM with 12 threads, I can perform only 12 actions in parallel.
But Locust has a USERS parameter, which stands for "Peak number of concurrent Locust users". Does it work the same way? If I set USERS = 25 but the VM has only 12 threads, does that mean it will execute only 12 actions in parallel and the rest will wait until a thread finishes?
Locust uses gevent, which makes I/O asynchronous. A single Locust/Python process can only use one CPU thread (a slight oversimplification), but it can make concurrent HTTP requests: when a request is made by one user, control is immediately handed over to other running users, which can in turn trigger other requests.
This is fundamentally different from Java (which is threaded but often synchronous), but similar to JavaScript.
As long as you run enough Locust worker processes, this is a very efficient approach, and a single process can handle thousands of concurrent users (in fact, the number of users is almost never a limitation; the number of requests per second is the limiting factor).
See Locust's documentation (https://docs.locust.io/en/stable/running-locust-distributed.html)
Because Python cannot fully utilize more than one core per process (see GIL), you should typically run one worker instance per processor core on the worker machines in order to utilize all their computing power.
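For illustration, a minimal sketch of that setup: a trivial locustfile plus one worker process per core (the file name and target host are assumptions):
# locustfile.py
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    host = "https://example.com"  # assumed target; replace with the system under test
    wait_time = between(1, 2)     # each simulated user waits 1-2 s between tasks

    @task
    def index(self):
        # one user "action"; while this request is in flight, gevent runs other users
        self.client.get("/")
Started with something like:
locust -f locustfile.py --master
locust -f locustfile.py --worker
Repeat the --worker command once per CPU core, as the quoted documentation suggests.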

Is it common to run 20 Python workers which use Redis as a queue?

This program listens to a Redis queue. If there is data in Redis, the workers start doing their jobs. All these jobs have to run simultaneously, which is why each worker listens to one particular Redis queue.
My question is: is it common to run more than 20 workers listening to Redis?
python /usr/src/worker1.py
python /usr/src/worker2.py
python /usr/src/worker3.py
python /usr/src/worker4.py
python /usr/src/worker5.py
....
....
python /usr/src/worker6.py
Having multiple worker processes (and when I mean "multiple" I'm talking hundreds or more), possibly running on different machines, fetching jobs from a job queue is indeed a common pattern nowadays. There even are whole packages/frameworks devoted to such workflows, like for example Celery.
What is less common is trying to write the whole task queues system from scratch in a seemingly ad-hoc way instead of using a dedicated task queues system like Celery, ZeroMQ or something similar.
If your workers need to do a long task with data, it's a solution, but each piece of data must be handled by a single worker.
This way, you can easily distribute your tasks (without threads, etc.); it's better if your workers don't all run on the same server.
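For reference, each of those worker scripts in this pattern is typically little more than a blocking-pop loop; a minimal sketch using redis-py, where the queue name and handle_job function are assumptions:
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def handle_job(payload):
    # placeholder for the actual work done on one job
    print("processing", payload)

while True:
    # blpop blocks until an item is pushed onto the queue
    _queue, raw = r.blpop("queue:worker1")
    handle_job(json.loads(raw))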

How to understand how workers are being used up in gunicorn

I basically would like to know how gunicorn workers work. I have a server with 4 workers on a machine with 4 GB RAM and 2 CPUs, with nginx as the frontend serving requests and acting as a reverse proxy. Simultaneous requests are being sent to the server.
I wish to know how the workers are being used: if there are four requests, are they load balanced across the four workers, as in one request per worker?
Also, how do I check how much memory each worker is using? I have set max requests to 100. Using this max of 100 requests, will all 4 workers be reloaded even if only 1 worker has reached 100 requests?
How do I get more insight into the workers, such as each worker's memory use and the number of requests it is currently handling?
Short answer: Depends on the worker type and gunicorn configuration.
Long answer:
Yes, as long as there are workers available. When gunicorn is started, the -w option configures the number of workers, the implementation of which varies depending on the worker type. Some worker types use threads, others use event loops and are asynchronous. Performance varies depending on the type; in general, async via an event loop is preferred as it is lighter on resources and more performant.
Each worker is forked from the main gunicorn process. Memory use can be seen with ps thread output, for example ps -fL -p <gunicorn pid>. Max requests is per worker according to the documentation, so only the worker that reaches 100 requests will be restarted.
There is a stats collecting library for gunicorn though I have not used it myself.
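As an illustration, the settings discussed above can be kept in a config file; a minimal sketch of a gunicorn.conf.py matching the described setup (the app module name in the run command is an assumption):
# gunicorn.conf.py
workers = 4                # the -w setting: number of worker processes
worker_class = "sync"      # or "gevent"/"eventlet" for async event-loop workers
max_requests = 100         # restart a worker after it has handled 100 requests
max_requests_jitter = 10   # stagger the restarts so all workers don't recycle at once
Run it with gunicorn -c gunicorn.conf.py myapp:app, then inspect per-worker memory with the ps command shown above.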

How to configure Celery to execute tasks concurrently from one queue

In an environment with 8 cores, Celery should be able to process 8 incoming tasks in parallel by default. But sometimes, when new tasks are received, Celery places them behind a long-running process.
I played around with default configuration, letting one worker consume from one queue.
celery -A proj worker --loglevel=INFO --concurrency=8
Is my understanding wrong, that one worker with a concurrency of 8 is able to process 8 tasks from one queue in parallel?
What is the preferred way to set up Celery to prevent the behaviour described above?
To put it simply, concurrency is the number of jobs running on a worker, while prefetch is the number of jobs sitting in a queue on the worker itself. You have one of two options here. The first is to set the prefetch multiplier down to 1, which means the worker will only keep, in your case, 8 additional jobs in its queue. The second, which I would recommend, is to create 2 different queues: one for your short-running tasks and another for your long-running tasks. Both options are sketched below.
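A minimal sketch of both options, assuming a Celery 4+ style configuration where proj, the broker URL and the task names are placeholders:
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Option 1: each pool process prefetches only one extra task
app.conf.worker_prefetch_multiplier = 1

# Option 2: route long-running tasks to their own queue
app.conf.task_routes = {
    "proj.tasks.long_task": {"queue": "long"},
    "proj.tasks.short_task": {"queue": "short"},
}
With option 2 you would start one worker per queue, for example:
celery -A proj worker -Q short --concurrency=8 -n short_worker
celery -A proj worker -Q long --concurrency=2 -n long_worker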

Does the number of celeryd processes depend on the --concurrency setting?

We are running Celery behind Supervisor and start it with
celeryd --events --loglevel=INFO --concurrency=2
This, however, creates a process graph that is up to three layers deep and contains up to 7 celeryd processes (Supervisor spawns one celeryd, which spawns several others, which again spawn processes). Our machine has two CPU cores.
Are all of these processes working on tasks? Are maybe some of them just worker pools? How is the --concurrency setting connected to the number of processes actually spawned?
You shouldn't have 7 processes if --concurrency is 2.
The processes actually started are:
The main consumer process, which delegates work to the worker pool
The worker pool processes (this is the number that --concurrency decides)
So that is 3 processes with a concurrency of two.
In addition, a very lightweight process used to clean up semaphores is started if force_execv is enabled (which it is by default if you're using a transport other than redis or rabbitmq).
Note that in some cases process listings also include threads: the worker may start several threads if using transports other than rabbitmq/redis, including one Mediator thread that is always started unless CELERY_DISABLE_RATE_LIMITS is enabled.
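To check what was actually started on your machine, it can help to compare a plain process listing with a per-thread listing (standard Linux ps options, nothing Celery-specific):
ps auxf | grep celery     # process tree: the main consumer plus the pool processes
ps -eLf | grep celery     # one line per thread, which explains the extra entries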
