Python multiprocessing queue scaling to large numbers of workers

Python multiprocessing queue scaling to large numbers of workers - python

I have a python (2.6.5 64-bit, Windows 2008 Server R2) app that launches worker processes. The parent process puts jobs in a job queue, from which workers pick them up. Similarly it has a results queue. Each worker performs its job by querying a server. CPU usage by the workers is low.
When the number of workers grows, CPU usage on the servers actually shrinks. The servers themselves are not the bottleneck, as I can load them up further from other applications.
Anyone else seen similar behavior? Is there an issue with python multiprocessing queues when a large number of processes are reading or writing to the same queues?

Two different ideas for performance constraints:
The bottleneck is the workers fighting each other and the parent for access to the job queue.
The bottleneck is connection rate-limits (syn-flood protection) on the servers.
Gathering more information:
Profile the amount of work done: tasks completed per second, use this as your core performance metric.
Use packet capture to view the network activity for network-level delays.
Have your workers document how long they wait for access to the job queue.
Possible improvements:
Have your workers use persistent connections if available/applicable (e.g. HTTP).
Split the tasks into multiple job queues fed to pools of workers.

Not exactly sure what is going on unless you provide all the details.
However, remember that the real concurrency is bounded by the actual number of hardware threads. If the number of processes launched is much larger than the actual number of hardware threads, at some point the context-switching overhead will be more than the benefit of having more concurrent processes.

Creating of new thead is very expensive operation.
One of the simplest ways for controling a lot of paralell network connections is to use stackless threads with support of asyncronical sockets. Python had great support and a bunch of libraries for that.
My favorite one is gevent, which has a great and comletely transparent monkey-patching utility.

Related

Celery vs threading for short tasks (5 seconds each)

I implemented a script in which every day I process several urls and make many I/O operations, and I am subclassing threading.Thread and starting a number of threads (say 32).
The workload varies day by day but as soon as the processing starts I am sure that no more tasks will be added to the input queue.
Also, my script is not supporting any front-end (at least for now).
I feel though that this solution will not be so easily scalable in the case of multiple processes / machines and would like to give Celery (or any distributed task queue) a shot, but I always read that it’s better suited for long-running tasks running in the background to avoid blocking a UI.
On the other hand, I have also read that having many small tasks is not a problem with Celery.
What’s your thought on this? Would be easy to scale Celery workers possibly across processes / machines?

Celery not using all concurrent slots

I have a Celery cluster made up of machines with 8-core processors. Each machine has one worker that is set to a concurrency factor of 8 (-c8).
I often see nodes with a lot of reserved tasks, but only one or two are running simultaneously. My tasks are often long-running with a lot of compute and I/O.
Any ideas as to why this is happening, and what I can do to increase the number of tasks simultaneously running? Does celery throttle the number of active tasks based on system load? I looked through the documentation but came up short.

Thanks to banana, I think I found the answer.
Some of my tasks were spawning subprocesses, which Celery counts in its concurrency.

How to tell uWSGI to prefer processes to threads for load balancing

I've installed Nginx + uWSGI + Django on a VDS with 3 CPU cores. uWSGI is configured for 6 processes and 5 threads per process. Now I want to tell uWSGI to use processes for load balancing until all processes are busy, and then to use threads if needed. It seems uWSGI prefer threads, and I have not found any config option to change this behaviour. First process takes over 100% CPU time, second one takes about 20%, and another processes are mostly not used.
Our site receives 40 r/s. Actually even having 3 processes without threads is anough to handle all requests usually. But request processing hangs from time to time for various reasons like locked shared resources, etc. In such cases we have -1 process. Users don't like to wait and click the link again and again. As a result all processes hangs and all users have to wait.
I'd add even more threads to make the server more robust. But the problem is probably python GIL. Threads wan't use all CPU cores. So multiple processes work much better for load balancing. But threads may help a lot in case of locked shared resources and i/o wait delays. A process may do much work while one of it's thread is locked.
I don't want to decrease time limits until there is no another solution. It is possible to solve this problem with threads in theory, and I don't want to show error messages to user or to make him waiting on every request until there is no another choice.

So, the solution is:
Upgrade uWSGI to recent stable version (as roberto suggested).
Use --thunder-lock option.
Now I'm running with 50 threads per process and all requests are distributed between processes equally.

Every process is effectively a thread, as threads are execution contexts of the same process.
For such a reason there is nothing like "a process executes it instead of a thread". Even without threads your process has 1 execution context (a thread). What i would investigate is why you get (perceived) poor performances when using multiple threads per process. Are you sure you are using a stable (with solid threading support) uWSGI release ? (1.4.x or 1.9.x)
Have you thought about dynamically spawning more processes when the server is overloaded ? Check the uWSGI cheaper modes, there are various algorithm available. Maybe one will fit your situation.
The GIL is not a problem for you, as from what you describe the problem is the lack of threads for managing new requests (even if from your numbers it looks you may have a too much heavy lock contention on something else)

Green-threads and thread in Python

As Wikipedia states:
Green threads emulate multi-threaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support.
Python's threads are implemented as pthreads (kernel threads),
and because of the global interpreter lock (GIL), a Python process only runs one thread at a time.
[QUESTION]
But in the case of Green-threads (or so-called greenlet or tasklets),
Does the GIL affect them? Can there be more than one greenlet
running at a time?
What are the pitfalls of using greenlets or tasklets?
If I use greenlets, how many of them can a process can handle? (I am wondering because in a single process you can open threads up to
ulimit(-s, -v) set in your *ix system.)
I need a little insight, and it would help if someone could share their experience, or guide me to the right path.

You can think of greenlets more like cooperative threads. What this means is that there is no scheduler pre-emptively switching between your threads at any given moment - instead your greenlets voluntarily/explicitly give up control to one another at specified points in your code.
Does the GIL affect them? Can there be more than one greenlet running
at a time?
Only one code path is running at a time - the advantage is you have ultimate control over which one that is.
What are the pitfalls of using greenlets or tasklets?
You need to be more careful - a badly written greenlet will not yield control to other greenlets. On the other hand, since you know when a greenlet will context switch, you may be able to get away with not creating locks for shared data-structures.
If I use greenlets, how many of them can a process can handle? (I am wondering because in a single process you can open threads up to umask limit set in your *ix system.)
With regular threads, the more you have the more scheduler overhead you have. Also regular threads still have a relatively high context-switch overhead. Greenlets do not have this overhead associated with them. From the bottle documentation:
Most servers limit the size of their worker pools to a relatively low
number of concurrent threads, due to the high overhead involved in
switching between and creating new threads. While threads are cheap
compared to processes (forks), they are still expensive to create for
each new connection.
The gevent module adds greenlets to the mix. Greenlets behave similar
to traditional threads, but are very cheap to create. A gevent-based
server can spawn thousands of greenlets (one for each connection) with
almost no overhead. Blocking individual greenlets has no impact on the
servers ability to accept new requests. The number of concurrent
connections is virtually unlimited.
There's also some further reading here if you're interested:
http://sdiehl.github.io/gevent-tutorial/

I assume you're talking about evenlet/gevent greenlets
1) There can be only one greenlet running
2) It's cooperative multithreading, which means that if a greenlet is stuck in an infinite loop, your entire program is stuck, typically greenlets are scheduled either explicitly or during I/O
3) A lot more than threads, it depends of the amount of RAM available

Celery and multithreaded tasks

I'm looking at Celery to perform a defined set of tasks spread over multiple machines. Each machine can process any one of several tasks, but some of the tasks will require more machine resources than others. Is there a way to manage these resources using Celery?

Celery doesn't provide a means of measuring current/past resource utilization of workers and adjusting the amount of work they perform based on those measurements. However, you do have a few knobs to turn with Celery that can result in more predictable and more evenly distributed resource utilization (YMMV).
If you have tasks that have no performance requirement, you might consider limiting the number of tasks that can be performed over a given period of time with rate limiting.
Another option is to use celery queues to your advantage. Depending on your needs, you might create a queue for light tasks and one for heavy tasks and then have workers with more horsepower listen to the heavy queue and those with less listen to the light queue (or more workers listening on heavy, less on light).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.