I watched an excellent presentation on the GIL and how, when running in the interpreter, only a single thread can run at a time. It also seemed that Python is not very intelligent about switching between threads.
If I am threading some operation that only runs in the interpreter and is not particularly CPU heavy, and I use a thread lock where only 1 thread can run at a time for this relatively short interpreter-bound operation, will that lock actually make anything run slower, compared with the case where the lock is not necessary and all threads can run concurrently?
If all but 1 threads are locked, will the python interpreter know not to context switch?
Edit:
By 'making things run slower' I mean: if Python is context switching to a bunch of locked threads, that will (maybe) decrease performance even if the threads don't actually run.
Larry Hastings (a core CPython Developer) has a great talk that covers this subject called "Python's Infamous GIL". If you skip to 11:40ish he gives the answer to your question.
From the talk: Python threads cooperate with the GIL through a simple counter. After every 100 bytecodes (interpreter "ticks") the currently executing thread is supposed to release the GIL to give other threads a chance to run. This behavior is essentially broken in Python 2.7 because of the thread release/acquire mechanism. It has been fixed in Python 3 (since 3.2 the GIL is released on a time-based switch interval rather than a bytecode count).
When you use a thread lock, Python will only execute the threads that are not blocked on it. So if several threads share 1 lock, only one of them will execute the locked section at a time. Python will not resume a blocked thread until that thread can acquire the lock. Locks are there so you can share state between threads without introducing bugs.
If you have several threads and only 1 can run at a time because of a lock, then in theory your program will take longer to execute. In practice you should benchmark, because the results may surprise you.
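For reference, here is a small sketch of how the Python 3 mechanism is exposed. It assumes CPython 3.2 or later, where the switch is time based. The 10 ms value is an arbitrary number for illustration, not a recommendation.

```python
import sys

# CPython 3.2+ asks the running thread to drop the GIL after a time
# interval instead of after a fixed number of bytecodes.
original = sys.getswitchinterval()
print(f"switch interval: {original} s")

# The interval is tunable; 0.01 is an arbitrary value for illustration.
sys.setswitchinterval(0.01)
changed = sys.getswitchinterval()
print(f"changed to: {changed} s")

# Restore the previous setting so the rest of the program is unaffected.
sys.setswitchinterval(original)
```

A longer interval means fewer forced GIL handoffs between CPU-bound threads, at the cost of slower response for threads waiting to run.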
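A minimal sketch of such a benchmark, assuming Python 3. The helper name run, the thread count and the iteration count are made up for illustration. Each thread writes to its own slot, so the unlocked variant is still correct and the only difference being measured is the lock itself.

```python
import threading
import time

def run(n_threads, iterations, lock=None):
    """Run a short interpreter-bound loop in several threads.

    Returns (elapsed seconds, total number of increments).
    """
    totals = [0] * n_threads

    def worker(index):
        for _ in range(iterations):
            if lock is not None:
                with lock:          # only one thread at a time in here
                    totals[index] += 1
            else:
                totals[index] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, sum(totals)

unlocked_time, unlocked_total = run(4, 50_000)
locked_time, locked_total = run(4, 50_000, lock=threading.Lock())
print(f"no lock: {unlocked_time:.3f} s, with lock: {locked_time:.3f} s")
```

The numbers depend heavily on the Python version, the OS and the machine, which is exactly why it is worth measuring rather than guessing.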
python is not very intelligent about switching between threads
Python threads work a certain way :-)
if I use a thread lock where only 1 thread can run at a time... will that lock actually make anything run slower
Err, no because there is nothing else runnable, so nothing else could run slower.
If all but 1 threads are locked, will the python interpreter know not to context switch?
Yes. The kernel knows which threads are runnable. If no other threads can run then, logically speaking (as far as the thread is concerned), the Python interpreter won't context switch away from the only runnable thread. The thread doesn't know when it has been switched away from (how could it? It isn't running).
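One way to check this for yourself is to compare CPU time with wall-clock time while a group of threads is blocked on a lock. This is only a sketch, assuming Python 3; the thread count and the half-second wait are arbitrary.

```python
import threading
import time

lock = threading.Lock()

def waiter():
    # Blocks inside the OS until the lock is free; it does not spin.
    with lock:
        pass

lock.acquire()                      # the main thread holds the lock
waiters = [threading.Thread(target=waiter) for _ in range(8)]
for t in waiters:
    t.start()

cpu_start = time.process_time()
wall_start = time.perf_counter()
time.sleep(0.5)                     # eight threads are blocked meanwhile
cpu_used = time.process_time() - cpu_start
wall_used = time.perf_counter() - wall_start

lock.release()
for t in waiters:
    t.join()

print(f"wall: {wall_used:.2f} s, CPU: {cpu_used:.3f} s")
```

If the blocked threads were being scheduled and spinning, the CPU time would approach or exceed the wall time. When they are simply not runnable it stays close to zero.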
Related
I have the following code and when I run it, the first function returns Done after 5 seconds and the second one after 10 seconds, not 15. This, logically speaking, means they both run at the same time, yet everyone says threading is not parallel running. Can someone shed some light on what's happening in the background, please?
import threading
import time

def dummy(param):
    time.sleep(param)
    print('done')

param1 = 10
param2 = 5

thread1 = threading.Thread(target=dummy, args=(param1,))
thread2 = threading.Thread(target=dummy, args=(param2,))

thread1.start()
thread2.start()

thread1.join()
thread2.join()
I don't think that this is a good test. Sleeping gives up the CPU as far as I know, so the first sleep releases the CPU to the second thread and then that one starts sleeping. You aren't doing work on multiple threads at once. Both threads are sleeping, not running and doing work.
People say that you can't use multithreading to run code in parallel because in CPython, the Global Interpreter Lock (GIL) prevents multiple bytecode instructions from running simultaneously in different threads in the same process. That means that two threads can't do work at precisely the same time.
You can however have I/O tasks running in parallel, since, for example, waiting for a socket to return data doesn't require heavy work on the CPU. I believe for the purposes of the explanation here, sleeping the thread can be thought of as closer to waiting on long-running I/O than having the CPU do work. That means that yes, the two sleeps can happen in parallel.
Carcigenicate gave a good answer that covers much of the 'what is happening in the background' part of the question. I'll try to expand on it a bit.
Threads are started in the background of your main execution, no matter whether there are multiple cores or not, or whether the GIL is active or not.
Your thread.start() calls return immediately and are run one right after the other, practically at the same time in your example. So after 10 seconds both are done. Threads always work like that.
If there is only one core, the operating system gives each thread a slice of time in rapid rotation, maybe every millisecond or so. If you use a Python with a GIL (the default, official implementation from python.org, called CPython), multiple cores are not used at the exact same time for Python code that holds the lock. It is possible to release the lock in C code called from Python, and AFAIK libraries for reading from disk or network, for example, do that. For your Python code, maybe it runs a bit of one thread, then a bit of the other, but it's still practically simultaneous on your gigahertz-range processor.
Now, if you want to test the performance benefit of running multiple threads, for example one worker thread per core, then you must test with a function that does some work, even if it just counts numbers. Then if you run many in parallel versus sequentially, you'll see differences depending on the number of cores and whether the GIL is there or not. I thought PyPy doesn't have a GIL but apparently it does, https://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-gil-why . IronPython and Jython do not; I'd test IronPython for a non-GIL Python, https://ironpython.net/
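A counting test along those lines could look like this. It is only a sketch, assuming Python 3; the loop size N is arbitrary and should be raised until the run takes a second or so on your machine.

```python
import threading
import time

def count(n):
    # Pure-Python busy loop: it needs the interpreter for every step.
    while n > 0:
        n -= 1

N = 2_000_000

# Two runs back to back in the main thread.
start = time.perf_counter()
count(N)
count(N)
sequential = time.perf_counter() - start

# The same two runs, each in its own thread.
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f} s, two threads: {threaded:.2f} s")
```

On an interpreter with a GIL the two timings should come out roughly similar, while on one without a GIL and with at least two cores the threaded run can be noticeably faster. Replacing count with time.sleep reproduces the behaviour from the question instead.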
I'm adding python scripting support to an application.
This application has an API which is not thread safe, and I cannot change this aspect.
One requirement I have is being able to run multiple independent scripts, thus I have to run sub-interpreters in separate threads.
Although, due to the GIL in CPython, no more than one Python thread runs at a time, whatever thread holds the GIL will still run concurrently with the main thread, and this will cause problems due to the thread-unsafe API of the application.
To summarize: I'm looking for a way to run all python code (__main__, threads, every sub-interpreter) in the main thread.
How can this be solved?
Should the main thread always hold the GIL, and have a function that, in a cooperative-multitasking fashion, releases it and reacquires it x milliseconds later, thus allowing the interpreter to do some work? This doesn't look right: such a function would consume x milliseconds even when Python has no work to do.
I am running an HTTP server (homemade, in C++) that embeds a Python interpreter for server-side scripting. This is a forking server, but I don't use any threading in any parent process. I don't do any weird things with the Python interpreter (other than the forks).
In one of the scripts, however, in another thread, a call to time.sleep(0.1) can take up to one minute, especially the first call.
while not self.should_stop():
    # other code
    print "[PYTHON]: Sleeping"
    time.sleep(0.1)
    print "[PYTHON]: Slept, checking should_stop"
I know that this is where it's hanging, because the logs show only the first print, and the second much, much later.
Additional information:
the CPU is not pegged (~5%)
this is Python 2.7 on Ubuntu
These are threading threads; I do use locks and events where necessary.
I don't import threading in any process that will ever do a fork
Python is initialized before the forks; this works great elsewhere (no problems in the last 6 months)
Python can run only one threading.Thread at a time, so if there are many threads, the interpreter has to constantly switch between them: one thread runs while the others are frozen or, in other words, interrupted.
But an interrupted thread isn't told that it has been frozen; it sort of falls unconscious for a while and then is woken up and continues its work from where it was interrupted. So 0.5 seconds for one particular thread may in fact turn out to be longer in real time.
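To see how much a sleep overshoots under contention, you can time it while other threads keep the interpreter busy. This is a sketch in Python 3 (the question uses 2.7, where the effect is generally worse); the four busy threads and the 0.1 s request are arbitrary choices.

```python
import threading
import time

stop = threading.Event()

def busy():
    # CPU-bound loop that competes for the GIL until told to stop.
    while not stop.is_set():
        sum(range(1000))

workers = [threading.Thread(target=busy) for _ in range(4)]
for w in workers:
    w.start()

requested = 0.1
start = time.perf_counter()
time.sleep(requested)               # the GIL is released while sleeping
elapsed = time.perf_counter() - start

stop.set()
for w in workers:
    w.join()

print(f"requested {requested} s, measured {elapsed:.3f} s")
```

The sleeping thread cannot report back until it gets the GIL again, so the measured time is the requested time plus however long it had to wait for its turn.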
Fixed!
As it turns out, the main thread (the one embedding the interpreter, in C++) doesn't actually release the GIL when it's not executing Python code (as I had assumed it would). You actually have to release the GIL manually, with Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, as specified here.
This makes the runtime release the GIL so other threads can run during IO-intensive tasks (like, in my case, reading from or writing to the network). You must not run any Python code between the two macros, though.
I was wondering if python threads run concurrently or in parallel?
For example, if I have two tasks and run them inside two threads will they be running simultaneously or will they be scheduled to run concurrently?
I'm aware of GIL and that the threads are using just one CPU core.
This is a complicated question that needs a lot of explanation. I'm going to stick with CPython simply because it's the most widely used and what I have experience with.
A Python thread is a system thread that needs the Python interpreter to execute its code, as bytecode, at runtime. The GIL is an interpreter-specific (in this case, CPython) lock that forces each thread to acquire a lock on the interpreter, preventing two threads from running Python code at the same time no matter what core they're on.
A single CPU core cannot run more than one thread at a time. You need multiple cores to even talk sensibly about parallelism. Concurrency is not the same as parallelism - the former means that operations from two threads can be interleaved before either has finished, without the threads needing to start at the same time, while the latter means operations that actually execute at the same time. If that confuses you, better descriptions about the difference are here.
There are ways to introduce concurrency in a single-core CPU - namely, have threads that suspend (put themselves to sleep) and resume when needed - but there is no way to introduce parallelism with a single core.
Because of all this, the answer is: it depends.
System threads are inherently designed to be concurrent - there wouldn't be much point in having an operating system otherwise. Whether or not they are actually executed this way depends on the task: is there an atomic lock somewhere? (As we shall see, there is!)
Threads that execute CPU-bound computations - where there is a lot of code being executed, and the interpreter is invoked for each instruction along the way - acquire the GIL, which prevents other threads from doing the same. So, in that circumstance, only one thread works at a time across all cores, because no other thread can acquire the interpreter.
That being said, threads don't need to keep the GIL until they are finished, instead acquiring and releasing the lock as/when needed. It is possible for two threads to interleave their operations, because the GIL could be released at the end of a code block, grabbed by the other thread, released at the end of that code block, and so on. They won't run in parallel - but they can certainly be run concurrently.
I/O bound threads, on the other hand, spend a large amount of their time simply waiting for requests to complete. These threads don't hold the GIL while they wait - why would they, when there's nothing to interpret? - so certainly you can have multiple I/O-waiting threads run in parallel, one core per thread. The minute Python code needs to be executed again, however, (maybe you need to handle your request?) the thread has to take the GIL again.
Processes in Python sidestep the GIL, because each process is its own bundle of resources and threads. Each process has its own interpreter, and therefore each thread in a process only has to compete with its own immediate process siblings for the GIL. That is why process-based parallelism is the recommended way to go in Python, even though it consumes more resources overall.
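As a rough illustration of process-based parallelism, here is a sketch using the standard library's ProcessPoolExecutor, assuming Python 3. The function count_up and the job sizes are made up for the example.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def count_up(n):
    # CPU-bound work; each worker process has its own interpreter and GIL.
    total = 0
    for i in range(n):
        total += i
    return total

def main():
    jobs = [1_000_000] * 4
    # One worker per core; the OS decides which core runs which process.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(count_up, jobs))
    print(f"{len(results)} jobs finished")
    return results

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module rather than forking.
    main()
```

Arguments and results are pickled and copied between processes, which is part of the extra resource cost mentioned above.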
The Upshot
So two tasks in two threads could run in parallel provided they don't need access to the CPython interpreter. This could happen if they are waiting for I/O requests or are making use of a suitable other language (say, C) extension that doesn't require the Python interpreter, using a foreign function interface.
All threads can run concurrently in the sense of interleaved atomic operations. Exactly how atomic these interleavings can be - is the GIL released after a code block? After every line? - depends on the task and the thread. Python threads don't have to execute serially - one thread finishes, and then the other starts - so there is concurrency in that sense.
In CPython, the threads are real OS threads, and are scheduled to run concurrently by the operating system. However, as you noted the GIL means that only one thread will be executing instructions at a time.
Let me explain what all that means. Threads run inside the same virtual machine, and hence run on the same physical machine. Processes can run on the same physical machine or on another physical machine. If you architect your application around threads, you’ve done nothing to access multiple machines. So, you can scale to as many cores as are on the single machine (which will be quite a few over time), but to really reach web scales, you’ll need to solve the multiple machine problem anyway.
So I just finished watching this talk on the Python Global Interpreter Lock (GIL) http://blip.tv/file/2232410.
The gist of it is that the GIL is a pretty good design for single core systems (Python essentially leaves the thread handling/scheduling up to the operating system). But it can seriously backfire on multi-core systems, where you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.
So since the GIL limits us to basically executing a Python program on one CPU, my thought is: why not accept this and simply use taskset on Linux to set the affinity of the program to a certain core/cpu on the system (especially in a situation with multiple Python apps running on a multi-core system)?
So ultimately my question is this: has anyone tried using taskset on Linux with Python applications (especially when running multiple applications on a Linux system so that multiple cores can be used with one or two Python applications bound to a specific core) and if so what were the results? Is it worth doing? Does it make things worse for certain workloads? I plan to do this and test it out (basically see if the program takes more or less time to run) but would love to hear from others about your experiences.
Addition: David Beazley (the guy giving the talk in the linked video) pointed out that some C/C++ extensions manually release the GIL, and if these extensions are optimized for multi-core (e.g. scientific or numeric data analysis/etc.) then, with the process pinned by taskset, rather than getting the benefits of multi-core for number crunching the extension would be effectively crippled in that it is limited to a single core (thus potentially slowing your program down significantly). On the other hand if you aren't using extensions such as this
The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests), so having a pool of worker threads is a GREAT way to squeeze performance out of a box: a thread fires off an HTTP request, gives up the GIL while it waits on I/O, and another thread can do its thing. That part of the program can easily run 100+ threads without hurting the CPU much and lets me actually use the network bandwidth that is available. As for Stackless Python/etc, I'm not overly interested in rewriting the program or replacing my Python stack (availability would also be a concern).
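The worker-thread pattern described here can be sketched as follows, assuming Python 3. The function fake_request and the example.invalid URLs are hypothetical stand-ins: no real HTTP call is made, and time.sleep plays the role of waiting on the socket.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(url):
    # Stand-in for a blocking HTTP call; time.sleep releases the GIL
    # much like waiting on a socket would.
    time.sleep(0.2)
    return url, 200

urls = [f"http://example.invalid/page/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    responses = list(pool.map(fake_request, urls))
elapsed = time.perf_counter() - start

print(f"{len(responses)} simulated requests in {elapsed:.2f} s")
```

Run one after another, the twenty waits would add up to about four seconds; with one thread per request they overlap, because a waiting thread does not hold the GIL.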
[*] Only the main thread can receive signals so if you send a ctrl-C the Python interpreter basically tries to get the main thread to run so it can handle the signal, but since it doesn't directly control which thread is run (this is left to the operating system) it basically tells the OS to keep switching threads until it eventually hits the main thread (which if you are unlucky may take a while).
Another solution is:
http://docs.python.org/library/multiprocessing.html
Note 1: This is not a limitation of the Python language, but of the CPython implementation.
Note 2: With regard to affinity, your OS shouldn't have a problem doing that itself.
I have never heard of anyone using taskset for a performance gain with Python. Doesn't mean it can't happen in your case, but definitely publish your results so others can critique your benchmarking methods and provide validation.
Personally though, I would decouple your I/O threads from the CPU bound threads using a message queue. That way your front end is now completely network I/O bound (some with HTTP interface, some with message queue interface) and ideal for your threading situation. Then the CPU intense processes can either use multiprocessing or just be individual processes waiting for work to arrive on the message queue.
In the longer term you might also want to consider replacing your threaded I/O front-end with Twisted or something like Eventlet because, even if they won't help performance, they should improve scalability. Your back-end is now already scalable because you can run your message queue over any number of machines+cpus as needed.
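The decoupling suggested in this answer can be sketched in miniature with the standard library, assuming Python 3. Note the substitutions: queue.Queue stands in for a real message queue, and a single consumer thread stands in for the separate CPU-bound processes. The names frontend and cpu_worker are invented for the example.

```python
import queue
import threading

jobs = queue.Queue()
results = queue.Queue()
SENTINEL = None  # tells the worker to shut down

def frontend(request_id):
    # I/O-facing thread: accepts a request and hands the heavy part off.
    jobs.put((request_id, 10_000))

def cpu_worker():
    # Stand-in for a separate CPU-bound process consuming from the queue.
    while True:
        job = jobs.get()
        if job is SENTINEL:
            break
        request_id, n = job
        results.put((request_id, sum(i * i for i in range(n))))

worker = threading.Thread(target=cpu_worker)
worker.start()

frontends = [threading.Thread(target=frontend, args=(i,)) for i in range(5)]
for t in frontends:
    t.start()
for t in frontends:
    t.join()

jobs.put(SENTINEL)   # queued after all five jobs, so they are processed first
worker.join()

done = sorted(results.get() for _ in range(5))
print([request_id for request_id, _ in done])
```

With a real broker the front-end and the workers only share the queue, so the workers can be moved to other processes or machines without touching the front-end code.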
An interesting solution is the experiment reported by Ryan Kelly on his blog: http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2/
The results seem very satisfactory.
I've found the following rule of thumb sufficient over the years: If the workers are dependent on some shared state, I use one multiprocessing process per core (CPU bound), and per core a fixed pool of worker threads (I/O bound). The OS will take care of assigning the different Python processes to the cores.
The Python GIL is per Python interpreter. That means the only way to avoid problems with it while doing multiprocessing is simply starting multiple interpreters (i.e. using separate processes instead of threads for concurrency) and then using some other IPC primitive for communication between the processes (such as sockets). That being said, the GIL is not a problem when using threads with blocking I/O calls.
The main problem of the GIL, as mentioned earlier, is that you can't execute 2 different Python code threads at the same time. A thread blocking on a blocking I/O call is blocked and hence not executing Python code. This means it is not holding the GIL. If you have two CPU intensive tasks in separate Python threads, that's where the GIL kills multi-processing in Python (only the CPython implementation, as pointed out earlier), because the GIL stops CPU #1 from executing a Python thread while CPU #0 is busy executing the other Python thread.
Until such time as the GIL is removed from Python, co-routines may be used in place of threads. I have it on good authority that this strategy has been implemented by two successful start-ups, using greenlets in at least one case.
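As an illustration of the coroutine approach, here is a sketch using asyncio from the standard library (Python 3.7+) rather than greenlets; the idea of cooperative switching at wait points is the same. The names fake_io and main, and the 0.2 s delay, are made up for the example.

```python
import asyncio
import time

async def fake_io(task_id, delay):
    # await hands control back to the event loop while "waiting".
    await asyncio.sleep(delay)
    return task_id

async def main():
    # Run ten waits concurrently in a single thread.
    return await asyncio.gather(*(fake_io(i, 0.2) for i in range(10)))

start = time.perf_counter()
finished = asyncio.run(main())
elapsed = time.perf_counter() - start

print(f"{len(finished)} coroutines in {elapsed:.2f} s")
```

Everything runs in one thread, so there is no GIL contention at all, but a coroutine that does heavy computation without awaiting will stall all the others.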
This is a pretty old question, but since this post is always on the result list every time I search for information related to Python and performance on multi-core systems, I couldn't let it pass without sharing my thoughts.
You can use the multiprocessing module, which, rather than creating a thread for each task, creates another CPython interpreter process to run your code.
That lets your application take advantage of multicore systems.
The only problem I see with this approach is that you will have considerable overhead from creating an entire new process, with its own stack and memory. (http://en.wikipedia.org/wiki/Thread_(computing)#How_threads_differ_from_processes)
Python Multiprocessing module:
http://docs.python.org/dev/library/multiprocessing.html
"The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box..."
About this, I guess that you can have a pool of processes too: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers
Att,
Leo