I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8 or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using a separate process for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 Python interpreters may exceed memory limits. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and then individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (as there's a big single-threaded bottleneck), and this system may introduce substantial overhead if some commands are very trivial but must cross an IPC boundary two times and wait for a free process to get a response.
Using Jython/IronPython and multithreading; the lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound and your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
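To make the reactor idea concrete, here's a bare-bones sketch in the spirit of that HOWTO; the port number and echo-style handling are just placeholders for your protocol:

    import select, socket

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('127.0.0.1', 4242))
    server.listen(128)

    sockets = [server]
    while True:
        # Sleep until at least one socket is readable, then service each one.
        readable, _, _ = select.select(sockets, [], [])
        for sock in readable:
            if sock is server:
                conn, addr = sock.accept()
                sockets.append(conn)
            else:
                data = sock.recv(4096)
                if data:
                    sock.sendall(data)  # echo; a real server would parse a command here
                else:                   # empty read means the client disconnected
                    sockets.remove(sock)
                    sock.close()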
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
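For a flavor of what that looks like, here's a minimal sketch using asyncio's callback-based protocol API; the "auth" command and the newline framing are made-up examples for a server like yours, not part of asyncio:

    import asyncio

    class CommandProtocol(asyncio.Protocol):
        # One instance per connection, so per-session state lives here.
        def connection_made(self, transport):
            self.transport = transport
            self.user = None  # set once the session authenticates

        def data_received(self, data):
            # Hypothetical framing: one newline-terminated command at a time.
            for line in data.decode().splitlines():
                self.transport.write((self.handle(line) + '\n').encode())

        def handle(self, command):
            if command.startswith('auth '):
                self.user = command.split(' ', 1)[1]
                return 'ok'
            return 'you are %s' % self.user

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(
        loop.create_server(CommandProtocol, '127.0.0.1', 4242))
    loop.run_forever()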
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads on top of a single-threaded reactor.
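As a sketch of the gevent style (assuming gevent is installed; monkey.patch_all swaps the stdlib's blocking primitives for cooperative ones):

    from gevent import monkey
    monkey.patch_all()  # make socket, time.sleep, etc. yield to other greenlets

    import gevent
    from urllib.request import urlopen

    def fetch(url):
        # Reads like blocking code, but other greenlets run while this one waits.
        return len(urlopen(url).read())

    jobs = [gevent.spawn(fetch, 'http://example.com/') for _ in range(100)]
    gevent.joinall(jobs)
    print([job.value for job in jobs])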
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major bottleneck if you try to do it with something like select, and as I said above, running 10000 threads or processes is not going to work either. So, you need a good proactor or reactor for your platform, as described above.
Second, only a few of those connections will be active at any given time.
First, for simplicity, let's assume it's all CPU-bound. In that case you want processes; in particular, you want a pool of N processes, where N is the number of cores. You get that by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4 of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If the tasks aren't that consistent or predictable, you can still almost always pretend they are: measure the average CPU fraction over a bunch of tasks, and use n/avg processes. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can twiddle it empirically.
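A sketch of that sizing logic; the 0.25 CPU fraction is a stand-in for whatever you actually measure, and run_command is a hypothetical task function:

    import multiprocessing
    from concurrent.futures import ProcessPoolExecutor

    n = multiprocessing.cpu_count()
    avg = 0.25  # measured: tasks spend ~1/4 of their time burning CPU
    pool = ProcessPoolExecutor(max_workers=int(n / avg))  # 4N workers here

    # Each session's handler then submits its work and waits for the result:
    # response = pool.submit(run_command, session_state, command).result()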
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.
To my limited understanding, performance is what drives multithreaded programming in most cases, though not all (irrespective of Java or Python).
I was reading this enlightening article on the GIL on SO. The article summarizes that Python uses the GIL mechanism, i.e. only a single thread can execute Python bytecode at any given time.
This makes single-threaded applications really fast.
My question is as follows:
Since only one thread is served at a given point, do the multiprocessing or threading modules provide a way to overcome this limitation imposed by the GIL? If not, what features do they provide for doing real multitask work?
A question was asked in the comments on the accepted answer of that post, but never answered. I had the same question in mind:
"So at any point in time, only one thread will be serving content to a client... so there's no point in actually using multithreading to improve performance, right?"
You're right about the GIL: there is no point in using multithreading for CPU-bound computation, since the CPU will only ever be used by one thread.
But that statement may have enlightened you: if your computation is not CPU-bound, you can take advantage of multithreading.
A typical example is when your application spends most of its time waiting for something.
One of many, many examples of a non-CPU-bound program:
Say you want to build a web crawler: you have to crawl many, many websites and store them in a database. What costs time? Waiting for the servers to send data, actually downloading the data, and storing it in the database; nothing CPU-bound there. So you can build a faster crawler using a pool of crawler threads instead of a single crawler. Typically, when one website is almost down and very slow to respond (~30 s), a single-threaded application will sit waiting for that website, and you're stuck. In a multithreaded application, the other threads keep crawling, and that's cool. A sketch of this follows.
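A minimal sketch of that crawler with a thread pool; the URL list is a placeholder, and a real crawler would store the body in the database instead of printing:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls = ['http://example.com/', 'http://example.org/']  # placeholder list

    def fetch(url):
        # The GIL is released while a thread blocks on the network,
        # so all 20 workers can be downloading at once.
        with urlopen(url, timeout=30) as resp:
            return url, resp.read()

    with ThreadPoolExecutor(max_workers=20) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body))  # store in the database here instead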
On the other hand, since there is one GIL per process, you can use multiprocessing to do CPU-bound computation.
As a side note, there exist some more or less complete implementations of Python without the GIL. I'd like to mention one that I think is well on its way to achieving something cool: PyPy STM. A search for "get rid of the GIL" will easily turn up plenty of threads on the subject.
Multiprocessing side-steps the GIL issue because code runs in a separate process while the GIL is only concerned with a single process. Within a process, multithreading may be faster to the extent that threads are waiting for some relatively slow resource like the disk or network.
A quick Google search yielded this informative slideshow: http://www.dabeaz.com/python/UnderstandingGIL.pdf
But what it fails to present is the fact that all threads are contained within a process. And because of the GIL, a CPython process can only execute Python bytecode on one CPU (or core) at a time. So while the GIL manages the threads within a process and doesn't always deliver the expected performance, at large scales it should still perform better than single-threaded operation.
The GIL is always a hot topic in Python but usually not an obstacle; it makes most programs much safer. If you want real computational performance, try PyOpenCL: any modern real-world high-performance number crunching should be done on GPUs (OpenCL also runs happily on CPUs), and it has no GIL issues.
If you want to do multithreading in Python to improve I/O-bound performance, the GIL is not an issue there.
Lastly, if you want to use multiple CPUs to speed up pure number crunching in a Pythonic fashion, use multiprocessing.
But it's still not as fast as coding your multithreaded application in assembly. Good luck not making typos.
The current Python application that I'm working on needs to use 1000+ threads (Python's threading module). Not that any single thread is working at max CPU cycles; this is just a web-server load-test app I'm creating, i.e., emulating 200 Firefox clients all logging into a web server and downloading small web components, basically emulating humans that operate in seconds as opposed to microseconds.
So I was reading through various topics such as "how many threads does Python support on Linux / Windows, etc.?" and I saw a lot of varied answers. One user said it's all about memory, and that the Linux kernel by default only sets aside 8 MB for each thread; if that is exceeded, threads start being killed by the kernel.
One person stated that this is a non-issue for CPython because only one thread is running at a time anyway (because of the GIL), so we can specify a gazillion threads??? What's the actual truth on this?
"One thread is running at a time because of the GIL." Well, sort of. The GIL means that only one thread can be executing Python code at a time. However, any number of threads could be doing IO, various other syscalls, or other code that doesn't hold the GIL.
It sounds like your threads will be doing mostly network I/O, and any number of threads can do I/O simultaneously. The GIL competition might be pretty fierce with 1000 threads, but you can always create multiple Python processes and divide the I/O threads between them (i.e., fork a couple times before you start).
"The Linux kernel by default only sets aside 8Meg for threads." I'm not sure where you heard that. Maybe what you actually heard was "On Linux, the default stack size is often 8 MiB," which is true. Each thread will use up 8 MiB of address space for stack (no problem on 64-bit) plus kernel resources for the additional memory maps and the thread process itself. You can change the stack size using the threading.stack_size library function, which helps if you have a lot of threads that don't make deep calls.
    >>> import threading
    >>> threading.stack_size()
    0                                    # 0 means the platform default, often 8 MiB
    >>> threading.stack_size(64 * 1024)  # 64 KiB stacks for threads created from now on
    0                                    # the call returns the previous setting
Others in this thread have suggested using an asynchronous / nonblocking framework. Well, you can do that. However, on the modern Linux kernel, a multithreaded model is competitive with asynchronous (select/poll/epoll) I/O multiplexing techniques. Rewriting your code to use an asynchronous model is a non-trivial amount of work, so I'd only do it if I couldn't get the required performance from a threaded model. If your threads are really trying to simulate human latency (e.g., spend most of their time sleeping), there are a lot of scenarios in which the asynchronous approach is actually slower. I'm not sure if this applies to Python, where the reduced GIL contention alone might merit the switch.
Both of those are partially true:
Each thread does have a stack, and you can run out of address space for the stack if you create enough threads.
Python also does have something called a GIL, which only allows one Python thread to run at a time. However, once Python code calls into C code, that C code can run while a different Python thread runs. However, threads in Python are still real OS threads, and the stack-space limit still applies.
If you're planning on having many connections, rather than using many threads, consider using an asynchronous design. Twisted would probably work well here.
We're all aware of the horrors of the GIL, and I've seen a lot of discussion about the right time to use the multiprocessing module, but I still don't feel that I have a good intuition about when threading in Python (focusing mainly on CPython) is the right answer.
What are instances in which the GIL is not a significant bottleneck? What are the types of use cases where threading is the most appropriate answer?
Threading really only makes sense if you have a lot of blocking I/O going on. If that's the case, then some threads can sleep while other threads work. If threads are CPU-bound, you're not likely to see much benefit from multithreading.
Note that the multiprocessing module, while more difficult to code for, makes use of separate processes and therefore doesn't suffer the downsides of the GIL.
Since you seem to be looking for examples, here are some off the top of my head, plus some grabbed from searching for CPU-bound and I/O-bound examples (I can't seem to find many). I am no expert, so please feel free to correct anything I've miscategorized. It's also worth noting that advancing technology can move a problem from one category to another. (There's a small sketch contrasting the two categories after the lists.)
CPU Bound Tasks (use multiprocessing)
Numerical methods/approximations for mathematical functions (calculating digits of pi, etc.)
Image processing
Performing convolutions
Calculating transforms for graphics programming (possibly handled by GPU)
Audio/video compression/decompression
I/O Bound Tasks (threading is probably OK)
Sending data across a network
Writing to/reading from the disk
Asking for user input
Audio/video streaming
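Here's the small sketch promised above contrasting the two categories; sum-of-squares and a download stand in for real work, and the URL is a placeholder:

    from multiprocessing import Pool
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def cpu_bound(n):
        # Pure-Python arithmetic holds the GIL, so use processes.
        return sum(i * i for i in range(n))

    def io_bound(url):
        # Network reads release the GIL, so threads are fine.
        return len(urlopen(url, timeout=10).read())

    if __name__ == '__main__':
        with Pool() as procs:
            print(procs.map(cpu_bound, [10 ** 6] * 8))
        with ThreadPoolExecutor(max_workers=8) as threads:
            print(list(threads.map(io_bound, ['http://example.com/'] * 8)))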
The GIL prevents Python from executing bytecode on more than one thread at a time.
If your code releases the GIL before jumping into a C extension, other Python threads can continue while the C code runs, as with the blocking I/O that other people have mentioned.
ctypes does this automatically, and so does NumPy. So if your code uses them a lot, it may not be significantly restricted by the GIL.
Besides CPU-bound and I/O-bound tasks, there are still more use cases. For example, threads enable concurrent tasks. A lot of GUI programming falls into this category: the main loop has to stay responsive to mouse events, so any time you have a task that takes a while and you don't want it to freeze the UI, you run it on a separate thread. It is less about performance and more about concurrency.
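A common shape for that pattern, sketched with tkinter (any toolkit works similarly): the slow work runs on a worker thread, and the main loop polls a queue for the result.

    import queue, threading, time
    import tkinter as tk

    results = queue.Queue()

    def slow_task():
        time.sleep(3)          # stands in for the long-running work
        results.put('done')    # hand the result back to the GUI thread

    root = tk.Tk()
    label = tk.Label(root, text='working...')
    label.pack()
    threading.Thread(target=slow_task, daemon=True).start()

    def poll():
        try:
            label.config(text=results.get_nowait())  # touch widgets only here
        except queue.Empty:
            root.after(100, poll)  # check again in 100 ms; UI stays responsive

    root.after(100, poll)
    root.mainloop()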
I currently have a couple of small applications of fewer than 500 lines each. I intend to eventually run them on a small Linux ARM box. Is it better to combine these applications and use threading, or to continue to have them as two separate applications?
These applications, plus a very small website, use a small SQLite database. Only one of the applications writes; everything else currently does only reads. Due to constraints of the target box I am using Python 2.6.
I am using SQLite to prevent data loss over several days of use. There is no direct interaction between the two applications, though there is the potential for database-locking issues, especially during periodic data maintenance. Preventing these issues is a concern, as is the footprint of the two applications, since the target devices are pretty small.
It depends on whether you need them to share data and how involved the sharing is. Other than that, from a speed point of view, threading won't give you much of an advantage over separate processes on a multi-core machine.
If sharing can easily take place via a flat file or database then just let them be separate rather than complicating via threading.
For performance, I'd suggest threads: a process consumes far more resources than a thread, so threads are faster to create and need less memory (useful in an embedded environment). Of course, you'll have to deal with the usual traps of multithreaded programming (concurrent access is solved with locks, but locks can lead to deadlock...).
If you plan to use many libraries that make low-level calls, perhaps with C extensions that don't properly release the GIL (Global Interpreter Lock), then processes can be a better solution, allowing your applications to keep running even when one of them is blocked on the GIL.
If you need to pass data between the two, you could use the Queues and other mechanisms in the multiprocessing module.
It's often much simpler to use multiprocessing rather than sharing memory or objects using the threading module.
If you don't need to pass data between your programs, just run them separately (and release any locks on files or databases as soon as possible to reduce contention).
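For illustration, a minimal sketch of passing data through a multiprocessing.Queue; the acquirer/writer roles mirror the two hypothetical applications, shown in Python 3 syntax:

    from multiprocessing import Process, Queue

    def acquirer(q):
        for reading in range(5):  # placeholder for real acquired data
            q.put(reading)
        q.put(None)               # sentinel: no more data

    def writer(q):
        # Keeping all writes in one process avoids database lock contention.
        while True:
            item = q.get()
            if item is None:
                break
            print('storing %s' % item)  # a real app would INSERT into SQLite

    if __name__ == '__main__':
        q = Queue()
        procs = [Process(target=acquirer, args=(q,)),
                 Process(target=writer, args=(q,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()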
I have decided to adopt a process-based rather than a threaded approach to resolving this issue. The primary factor in this decision is simplicity. The second factor is that while one of these applications will be carrying out data acquisition, the other will be communicating with a modem on an ad-hoc basis (receiving calls); I don't have control over the calling application, and based on my investigations there is the potential for a lot to go wrong.
There are a couple of factors that may change the approach further down the line, primarily the need for the two processes to interact to prevent data contention on the database. Secondly, if resources (memory/disk space/CPU) become an issue due to the size of the device, a single application would give me the ability to manage these problems better.
That said, the data-acquisition application is already threaded; this allows the parent thread to manage the worker thread when exceptions arise, as the device will not be in a managed environment.
I have a Python application that grabs a collection of data, and for each piece of data in that collection it performs a task. The task takes some time to complete because there is a delay involved. Because of this delay, I don't want the pieces of data to be processed sequentially; I want them all to be handled in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble; often some of the tasks would never actually fire.
If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
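A minimal sketch of that threaded pattern, with a Queue handing data to the workers; do_task stands in for the delayed task:

    import threading, queue

    work = queue.Queue()

    def do_task(item):
        pass  # placeholder for the task with the built-in delay

    def worker():
        while True:
            item = work.get()
            if item is None:   # sentinel tells this thread to exit
                break
            do_task(item)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for item in range(100):    # the collection of data
        work.put(item)
    for _ in threads:
        work.put(None)         # one sentinel per thread
    for t in threads:
        t.join()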
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has the portability issues indicated above. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It takes a different approach from PyPy, which may yield better (or just different) speedups.
Tasks run sequentially, but you get the illusion that they run in parallel. Tasks are good for file or connection I/O because they are lightweight.
Multiprocessing with a Pool may be the right solution for you, because processes run truly in parallel, which makes them very good for intensive computing; each process runs on its own CPU (or core).
Setting up multiprocessing can be very easy:
    from multiprocessing import Pool

    def worker(input_item):
        output = do_some_work(input_item)  # do_some_work() stands for your real task
        return output

    if __name__ == '__main__':
        # Pool() makes one process per CPU (or core); use Pool(4) to force 4 processes.
        pool = Pool()
        # input_list is your collection of inputs; map launches everything automatically.
        list_of_results = pool.map(worker, input_list)
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
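A sketch of that approach, assuming a hypothetical worker.py that reads one item from stdin and prints its result:

    import subprocess

    items = ['a', 'b', 'c']  # the small collection of data
    procs = []
    for item in items:
        out = open('result-%s.txt' % item, 'w')
        p = subprocess.Popen(['python', 'worker.py'],  # worker.py is hypothetical
                             stdin=subprocess.PIPE, stdout=out)
        p.stdin.write((item + '\n').encode())
        p.stdin.close()
        procs.append((p, out))

    for p, out in procs:
        p.wait()   # or poll() in a loop to implement a timeout
        out.close()
    # now merge the result-*.txt files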
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule() calls in there (yielding to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
Using CPython's threading model will not give you any performance improvement, because the threads are not actually executed in parallel; the GIL serializes them (it exists largely to protect CPython's reference-counted memory management). Multiprocessing would allow parallel execution. Obviously in that case you need multiple cores available to farm your parallel jobs out to.
There is much more information available in this related question.
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally, and feed them to several processes of your program. (i.e. several processes instead of threads)
IronPython has real multithreading, unlike CPython with its GIL. So depending on what you're doing, it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the person recommending Stackless Python: I'm not an expert on it, but it seems to me that they're talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.
You may want to look at Twisted. It is designed for asynchronous network tasks.