We're all aware of the horrors of the GIL, and I've seen a lot of discussion about the right time to use the multiprocessing module, but I still don't feel that I have a good intuition about when threading in Python (focusing mainly on CPython) is the right answer.
What are instances in which the GIL is not a significant bottleneck? What are the types of use cases where threading is the most appropriate answer?
Threading really only makes sense if you have a lot of blocking I/O going on. If that's the case, then some threads can sleep while other threads work. If threads are CPU-bound, you're not likely to see much benefit from multithreading.
Note that the multiprocessing module, while more difficult to code for, makes use of separate processes and therefore doesn't suffer the downsides of the GIL.
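To make the blocking I/O point concrete, here is a minimal sketch. It assumes time.sleep is a fair stand-in for a blocking call (it releases the GIL, as real socket or disk waits do); fake_io_call, run_sequential and run_threaded are names made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Run the calls one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_io_call(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays):
    """Run the calls in a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_io_call, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 5
    _, seq_time = run_sequential(delays)
    _, thr_time = run_threaded(delays)
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The threaded run should take roughly as long as the slowest single call, while the sequential run takes the sum of all of them. Swap the sleep for a pure-Python computation and the gap should largely disappear.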
Since you seem to be looking for examples, here are some off the top of my head and grabbed from searching for CPU-bound and I/O-bound examples (I can't seem to find many). I am no expert, so please feel free to correct anything I've miscategorized. It's also worth noting that advancing technology could move a problem from one category to another.
CPU Bound Tasks (use multiprocessing)
Numerical methods/approximations for mathematical functions (calculating digits of pi, etc.)
Image processing
Performing convolutions
Calculating transforms for graphics programming (possibly handled by GPU)
Audio/video compression/decompression
I/O Bound Tasks (threading is probably OK)
Sending data across a network
Writing to/reading from the disk
Asking for user input
Audio/video streaming
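For the CPU-bound column, a rough sketch of farming work out to processes might look like the following. It uses concurrent.futures.ProcessPoolExecutor from the standard library; count_primes is a toy workload invented for this example, not something from the lists above.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits):
    """Run count_primes for each limit in a separate worker process."""
    # Each worker process has its own interpreter and therefore its own GIL.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(count_primes, limits))
```

Call count_primes_parallel([10_000, 20_000, 30_000]) from under an if __name__ == "__main__": guard, since on platforms that spawn worker processes the main module gets re-imported in each child.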
The GIL prevents Python from running more than one thread of Python bytecode at a time.
If your code releases the GIL before jumping into a C extension, other Python threads can continue while the C code runs, just as with the blocking I/O that other people have mentioned.
ctypes does this automatically for ordinary foreign function calls, and NumPy does it for many of its operations. So if your code uses them a lot, it may not be significantly restricted by the GIL.
Besides CPU-bound and I/O-bound tasks, there are still more use cases. For example, threads enable concurrent tasks. A lot of GUI programming falls into this category. The main loop has to stay responsive to mouse events, so any time you have a task that takes a while and you don't want to freeze the UI, you do it on a separate thread. It is less about performance and more about concurrency.
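A toolkit-free sketch of that pattern, under the assumption that a polling loop can stand in for a GUI main loop: the slow job runs on a worker thread and reports back through a queue, while the loop keeps ticking. slow_task and main_loop are illustrative names only.

```python
import queue
import threading
import time


def slow_task(result_queue):
    """Stand-in for a long-running job started from a UI event."""
    time.sleep(0.3)
    result_queue.put("done")


def main_loop():
    """Toy 'event loop' that keeps ticking while the worker thread runs."""
    results = queue.Queue()
    worker = threading.Thread(target=slow_task, args=(results,), daemon=True)
    worker.start()

    ticks = 0
    while True:
        try:
            # Poll for the result without blocking the loop for long.
            outcome = results.get(timeout=0.05)
            break
        except queue.Empty:
            ticks += 1  # a real GUI would process mouse/keyboard events here
    worker.join()
    return outcome, ticks
```

Real toolkits have their own rules about which thread may touch widgets, so the hand-off through a queue (rather than updating the UI from the worker) is usually the safer shape.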
I’ve got a burning question. Recently I’ve been learning asyncio in Python and found it very useful and efficient, but here is my question: is it efficient to use it for “normal” things?
It’s obvious that using asynchronous operations for making requests, handling requests (in APIs), and working on files will give us a performance gain. But what about other operations? For example, if I want to do a lot of complicated mathematical operations or just standard operations (without files and web), would asyncio help me anyway? Is there any reason why we should use it outside our apps where we are not making requests and doing all this web or files stuff?
I’m wondering because in college the teachers never mentioned that we wouldn’t gain anything by using it for just math or standard (local, non-file, non-web) operations, and I thought that we benefit from it almost always. Am I totally wrong? Is it that way just in Python or in every other language?
asyncio is, first of all, a convenient way to run multiple execution flows (compared to common alternatives like callbacks and threads).
Why would someone want to run multiple execution flows? Usually to gain performance, for example:
You don't want to waste time waiting for one network request to finish, so you start another concurrently, gaining performance.
You don't want to waste time waiting for one OS thread to finish, so you start another concurrently. In Python, due to the GIL, you won't gain performance with threads for CPU-bound operations. But they can still be useful for network stuff, or specifically in asyncio as a common way to run something blocking without freezing the event loop.
You don't want to waste time waiting for one OS process to finish, so you start another concurrently.
The last item is a way to gain performance even for purely CPU-bound operations (if the machine has multiple cores). You can see an example here (third option). asyncio here, again, is just a tool for conveniently managing execution flows. Nothing stops you from using a plain ProcessPoolExecutor and de facto callbacks, as shown here.
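As a small sketch of running something blocking without freezing the event loop: loop.run_in_executor pushes the blocking call to a worker while other coroutines keep running. blocking_call and heartbeat are hypothetical helpers for this demonstration, and passing None uses the loop's default thread pool.

```python
import asyncio
import time


def blocking_call(seconds):
    """Stand-in for a blocking library call that is not asyncio-aware."""
    time.sleep(seconds)
    return seconds


async def heartbeat(counter):
    """Keeps running while the blocking call is off in a worker thread."""
    while True:
        counter["beats"] += 1
        await asyncio.sleep(0.02)


async def main():
    loop = asyncio.get_running_loop()
    counter = {"beats": 0}
    beat_task = asyncio.create_task(heartbeat(counter))
    # None selects the loop's default ThreadPoolExecutor.
    result = await loop.run_in_executor(None, blocking_call, 0.2)
    beat_task.cancel()
    return result, counter["beats"]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

For CPU-bound work the same call accepts a concurrent.futures.ProcessPoolExecutor instance in place of None, which is the "OS process" case from the list above.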
I have been having no problems with performance with Python's Global Interpreter Lock. I've had to make a few things thread-safe - despite common advice, the GIL does NOT automatically guarantee thread-safety - but I've got a program commonly running upwards of 10 threads, where all of them can be active at any time, including together. It is a somewhat complex asynchronous messaging system.
I understand multiprocessing and am even using Celery in this program, but the solution would have to be very convoluted to work through multiprocessing for this problem set.
I'm running 2.7 and using recursive locks despite their performance penalties.
My question is this: will I run into scaling problems with the GIL? I have seen no performance problems with it so far. Measuring this is...problematic. Is there a number of threads or something similar that you hit and it just starts choking? Does GIL performance differ significantly from executing multi-threaded code on a single-core CPU?
Thanks!
The GIL is a complex topic, and the exact behavior in your case is hard to explain without your code, so I cannot tell you whether you will run into trouble in the future. I can only advise you to move your project to a recent version of Python 3 if possible. Many improvements have been made to the GIL in Python 3.
There is no magic number of threads at which Python will break. The general rule is just: the more threads, the more problems. The most complicated step is going from one to two.
The GIL is released in some situations, especially when C code is executed or I/O is done. This allows code to run in parallel. With the advanced features of modern CPUs, it wouldn't be wise to limit your code to just one CPU.
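On the question's point that the GIL does not by itself guarantee thread safety, a minimal sketch of the usual fix is below. Counter, hammer and run are names invented for this example; the assumption is that the shared state is a simple counter updated by several threads.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # `self.value += 1` compiles to several bytecodes; a thread switch can
        # land between the read and the write, so the update needs a lock.
        with self._lock:
            self.value += 1


def hammer(counter, times):
    """Increment the shared counter `times` times."""
    for _ in range(times):
        counter.increment()


def run(num_threads=8, times=10_000):
    """Start the threads, wait for them, and return the final count."""
    counter = Counter()
    threads = [threading.Thread(target=hammer, args=(counter, times))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

A plain threading.Lock is enough here because increment never re-enters itself; threading.RLock is only needed when the same thread may acquire the lock again while already holding it.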
In my little understanding, it is the performance factor that drives programming for multi-threading in most cases but not all. (irrespective of Java or Python).
I was reading this enlightening article on the GIL on SO. The article summarizes that Python adopts the GIL mechanism, i.e. only a single thread can execute Python bytecode at any given time.
This makes single-threaded applications faster.
My question is as follows:
Since only one thread is served at a given point, does the multiprocessing or threading module provide a way to overcome this limitation imposed by the GIL? If not, what features do they provide for doing real multitasking work?
There was a question asked in the comments section of the accepted answer to the above post, but no answer has been given. I had this question in mind too:
^so at any time point of time, only one thread will be serving content to client...
so no point of actually using multithreading to improve performance. right?
You're right about the GIL: there is no point in using multithreading for CPU-bound computation, as the CPU will only be used by one thread at a time.
But that statement may have given you a hint: if your computation is not CPU-bound, you may take advantage of multithreading.
A typical example is when your application spends most of its time waiting for something.
One of many, many examples of a non-CPU-bound program:
Say you want to build a web crawler. You have to crawl many, many websites and store them in a database. What costs time? Waiting for the servers to send data, actually downloading the data, and storing it in the database. Nothing CPU-bound here. Here you may get a faster crawler using a pool of crawlers instead of one single crawler. Take the typical case where one website is almost down and very slow to respond (~30s): during this time, a single-threaded application will wait for the website, and you're stuck. In a multithreaded application, other threads will continue crawling, and that's cool.
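A sketch of the crawler-pool idea, with the network simulated so it runs anywhere: the site names and latencies in SIMULATED_LATENCY are made up, and fetch just sleeps instead of issuing a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sites and response times in seconds; no real requests are made.
SIMULATED_LATENCY = {
    "fast-a.example": 0.05,
    "slow.example": 0.5,
    "fast-b.example": 0.05,
}


def fetch(site):
    """Stand-in for an HTTP download: sleeps for the simulated latency."""
    time.sleep(SIMULATED_LATENCY[site])
    return site, f"<html>{site}</html>"


def crawl(sites, workers=3):
    """Fetch all sites with a thread pool; return them in completion order."""
    finished_order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, s) for s in sites]
        for future in as_completed(futures):
            site, _page = future.result()
            finished_order.append(site)  # a real crawler would store the page
    return finished_order
```

With this setup the slow site finishes last without holding up the fast ones, which is exactly the "you're stuck" case that a single-threaded crawler runs into.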
On the other hand, as there is one GIL per process, you may use multiprocessing to do CPU-bound computation.
As a side note, there are some more or less partial implementations of Python without the GIL. I'd like to mention one that I think is well on its way to achieving something cool: PyPy STM. Searching for "get rid of the GIL", you'll easily find a lot of threads about the subject.
Multiprocessing side-steps the GIL issue because code runs in a separate process while the GIL is only concerned with a single process. Within a process, multithreading may be faster to the extent that threads are waiting for some relatively slow resource like the disk or network.
A quick Google search yielded this informative slideshow. http://www.dabeaz.com/python/UnderstandingGIL.pdf
But what it fails to present is the fact that all threads are contained within a process, and in CPython only the thread holding that process's GIL can execute Python bytecode at any moment, however many cores are available. So while the per-process GIL manages the threads in that process and doesn't always deliver the expected performance, threading should, at large scales and for workloads that spend time waiting, still perform better than single-threaded operation.
The GIL is always a hot topic in Python, but it is usually a non-issue. It makes most programs much safer. If you want real computational performance, try PyOpenCL. Any modern real-world high-performance number crunching should be done on GPUs (OpenCL also runs happily on CPUs). It has no GIL issues.
If you want to do multithreading in Python to improve I/O-bound performance, the GIL is not an issue there.
Lastly, if you want to utilize multiple CPUs to increase the performance of your pure number crunching, and in a Pythonic fashion, use multiprocessing.
But it's still not as fast as coding your multithreaded application in assembly. Good luck not making typos.
Following up on David Beazley's paper regarding Python and GIL, would it be a good practice to limit a Python program (CPython with GIL and all) to a single CPU in a Windows based multi-core system?
Would it boost performance?
UPDATE: Assume multiple threads are used (not sure if it makes a difference)
The paper does indeed imply that limiting a program to a single core would improve performance in that particular case. However, there are a number of concerns that you would need to deal with:
His tests are mainly for compute intensive threads rather than IO bound threads. If the threads you are using often block voluntarily (such as in a web server waiting for a client) then you don't run into GIL issues at all.
The GIL issues deal specifically with threads and not processes. I may be reading your question wrong, but you seem to be asking about restricting all Python programs to a single core. Programs using processes for parallelism don't suffer from GIL issues and restricting them to a single core will make them slower.
The GIL is drastically different in Python 3.2 (as David mentions in this video). The GIL was changed explicitly to deal with such issues. While it still has problems, it no longer has this problem.
In summary, the only time you want to complicate your life by forcing the OS to restrict the program to a single core is when you are running a:
Multithreaded
Compute Intensive
Lower than Python 3.2
program on a multicore machine.
Bias: For parallel computing involving heavy CPU processing, I much prefer message passing and cooperating processes to thread programming (of course, it depends on the problem).
You shouldn't limit your programs to one core. Beazley was just demonstrating a specific problem that performed poorly under those unique conditions (those conditions being an I/O-bound thread and a CPU-bound thread competing against each other). Ideally you want to avoid those conditions by using a different method (import multiprocessing).
I think the best solution is to put your CPU bound tasks in other processes using the multiprocessing module so that they utilize their own cores, and IO bound tasks in threads (or microthreads/coroutines, if you read his interesting paper on that: http://www.dabeaz.com/coroutines/) since the GIL is best suited for those types of tasks.
Conclusion: Python threads are best suited for IO bound tasks, NOT CPU bound.
I have a Python application that grabs a collection of data and, for each piece of data in that collection, performs a task. The task takes some time to complete as there is a delay involved. Because of this delay, I don't want the task to run for each piece of data one after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble: often some of the tasks would never actually fire.
If you are truly compute bound, using the multiprocessing module is probably the lightest weight solution (in terms of both memory consumption and implementation difficulty.)
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
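A minimal sketch of handing data to threads through thread-safe queues follows. The module is called queue in Python 3 (Queue in Python 2); worker and process_all are illustrative names, doubling the item is a placeholder for the real task, and the sketch assumes None never appears as a real work item because it is used as the stop signal.

```python
import queue
import threading


def worker(tasks, results):
    """Pull items off the task queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this thread
            break
        results.put((item, item * 2))  # placeholder for the real task


def process_all(items, num_threads=4):
    """Process every item on a small pool of threads; return {item: result}."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(items)))
```

Joining every thread before reading the results is what makes sure each task has actually fired and finished, which may be relevant to the "some of the tasks would never actually fire" symptom if the main program was exiting early.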
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements.)
Stackless Python is also a nice idea. Stackless has portability issues as indicated above. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It is taking an approach different to PyPy, which may yield better (or just different) speedups.
Threaded tasks run more or less sequentially, but you have the illusion that they run in parallel. They are a good fit for file or connection I/O, and they are lightweight.
Multiprocessing with Pool may be the right solution for you, because processes really do run in parallel, so they are very good for intensive computing: each process runs on one CPU (or core).
Setting up multiprocessing can be very easy:
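from multiprocessing import Pool

def worker(input_item):
    output = input_item * 2  # placeholder for the real work (do_some_work in the original)
    return output

if __name__ == "__main__":  # guard needed where child processes are spawned, e.g. Windows
    input_list = [1, 2, 3, 4]  # example data
    # Pool() makes one process per CPU (or core); use Pool(4) to force 4 processes.
    with Pool() as pool:
        list_of_results = pool.map(worker, input_list)  # launches all tasks automatically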
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and simply write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
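A rough sketch of the subprocess approach, with two simplifications that are my assumptions rather than part of the answer above: the child program is a one-liner passed via -c (CHILD_CODE is hypothetical), and results come back over stdout pipes instead of output files.

```python
import subprocess
import sys

# Hypothetical child program: squares the number given as its first argument.
CHILD_CODE = "import sys; print(int(sys.argv[1]) ** 2)"


def run_children(values, timeout=30):
    """Start one child process per value, then collect their outputs."""
    # Start every child first so they all run concurrently.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(value)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for value in values
    ]
    results = []
    for proc in procs:
        # communicate() waits for the child and raises TimeoutExpired if it hangs.
        out, _ = proc.communicate(timeout=timeout)
        results.append(int(out))
    return results
```

For larger outputs, writing to files as described above avoids holding everything in pipes, and the merge step then just reads those files once all children have exited.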
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
In Stackless, you don't have threads, but tasklets or greenlets which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
Using CPython's threading model will not give you any performance improvement for CPU-bound work, because the threads are not actually executed in parallel: the GIL, which exists largely to protect CPython's reference-counting memory management, lets only one thread run Python bytecode at a time. Multiprocessing would allow parallel execution. Obviously in this case you have to have multiple cores available to farm out your parallel jobs to.
There is much more information available in this related question.
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally, and feed them to several processes of your program. (i.e. several processes instead of threads)
IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends Stackless Python: I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.
You may want to look at Twisted. It is designed for asynchronous network tasks.