I was wondering if python threads run concurrently or in parallel?
For example, if I have two tasks and run them inside two threads will they be running simultaneously or will they be scheduled to run concurrently?
I'm aware of GIL and that the threads are using just one CPU core.
This is a complicated question with a lot of explication needed. I'm going to stick with CPython simply because it's the most widely used and what I have experience with.
A Python thread is a system thread that requires the Python interpreter natively to execute its contents into bytecode at runtime. The GIL is an interpreter-specific (in this case, CPython) lock that forces each thread to acquire a lock on the interpreter, preventing two threads from running at the same time no matter what core they're on.
No CPU core can run more than one thread at a time. You need multiple cores to even talk sensibly about parallelism. Concurrency is not the same as parallelism - the former implies operations between two threads can be interleaved before either are finished but where neither thread need not start at the same time, while the latter implies operations that can be started at the same time. If that confuses you, better descriptions about the difference are here.
There are ways to introduce concurrency in a single-core CPU - namely, have threads that suspend (put themselves to sleep) and resume when needed - but there is no way to introduce parallelism with a single core.
Because of these facts, as a consequence, it depends.
System threads are inherently designed to be concurrent - there wouldn't be much point in having an operating system otherwise. Whether or not they are actually executed this way depends on the task: is there an atomic lock somewhere? (As we shall see, there is!)
Threads that execute CPU-bound computations - where there is a lot of code being executed, and concurrently the interpreter is dynamically invoked for each line - obtain a lock on the GIL that prevents other threads from executing the same. So, in that circumstance, only one thread works at a time across all cores, because no other thread can acquire the interpreter.
That being said, threads don't need to keep the GIL until they are finished, instead acquiring and releasing the lock as/when needed. It is possible for two threads to interleave their operations, because the GIL could be released at the end of a code block, grabbed by the other thread, released at the end of that code block, and so on. They won't run in parallel - but they can certainly be run concurrently.
I/O bound threads, on the other hand, spend a large amount of their time simply waiting for requests to complete. These threads don't acquire the GIL - why would they, when there's nothing to interpret? - so certainly you can have multiple I/O-waiting threads run in parallel, one core per thread. The minute code needs to be compiled to bytecode, however, (maybe you need to handle your request?) up goes the GIL again.
Processes in Python survive the GIL, because they're a collection of resources bundled with threads. Each process has its own interpreter, and therefore each thread in a process only has to compete with its own immediate process siblings for the GIL. That is why process-based parallelism is the recommended way to go in Python, even though it consumes more resources overall.
The Upshot
So two tasks in two threads could run in parallel provided they don't need access to the CPython interpreter. This could happen if they are waiting for I/O requests or are making use of a suitable other language (say, C) extension that doesn't require the Python interpreter, using a foreign function interface.
All threads can run concurrently in the sense of interleaved atomic operations. Exactly how atomic these interleavings can be - is the GIL released after a code block? After every line? - depends on the task and the thread. Python threads don't have to execute serially - one thread finishes, and then the other starts - so there is concurrency in that sense.
In CPython, the threads are real OS threads, and are scheduled to run concurrently by the operating system. However, as you noted the GIL means that only one thread will be executing instructions at a time.
Let me explain what all that means. Threads run inside the same virtual machine, and hence run on the same physical machine. Processes can run on the same physical machine or in another physical machine. If you architect your application around threads, you’ve done nothing to access multiple machines. So, you can scale to as many cores are on the single machine (which will be quite a few over time), but to really reach web scales, you’ll need to solve the multiple machine problem anyway.
Related
I was learning about multi-processing and multi-threading.
From what I understand, threads run on the same core, so I was wondering if I create multiple processes inside a child thread will they be limited to that single core too?
I'm using python, so this is a question about that specific language but I would like to know if it is the same thing with other languages?
I'm not a pyhton expert but I expect this is like in other languages, because it's an OS feature in general.
Process
A process is executed by the OS and owns one thread which will be executed. This is in general your programm. You can start more threads inside your process to do some heavy calculations or whatever you have to do.
But they belong to the process.
Thread
One or more threads are owned by a process and execution will be distributed across all cores.
Now to your question
When you create a given number of threads these threads should in general be distributed across all your cores. They're not limited to the core who's executing the phyton interpreter.
Even when you create a subprocess from your phyton code the process can and should run on other cores.
You can read more about the gernal concept here:
Preemptive multitasking
There're some libraries in different languages who abstract a thread to something often called a Task or something else.
For these special cases it's possible that they're just running inside the thread they were created in.
For example. In the DotNet world there's a Thread and a Task. Often people are misusing the term thread when they're talking about a Task, which in general runns inside the thread it was created.
Every program is represented through one process. A process is the execution context one or multiple threads operate in. All threads in one process share the same tranche of virtual memory assigned to the process.
Python (refering to CPython, e.g. Jython and IronPython have no GIL) is special because it has the global interpreter lock (GIL), which prevents threaded python code from being run on multiple cores in parallel. Only code releasing the GIL can operate truely parallel (I/O operations and some C-extensions like numpy). That's why you will have to use the multiprocessing module for cpu-bound python-code you need to run in parallel. Processes startet with the multiprocessing module then will run it's own python interpreter instance so you can process code truely parallel.
Note that even a single threaded python-application can run on different cores, not in parallel but sequentially, in case the OS re-schedules execution to another core after a context switch took place.
Back to your question:
if I create multiple processes inside a child thread will they be limited to that single core too?
You don't create processes inside a thread, you spawn new independent python-processes with the same limitations of the original python process and on which cores threads of the new processes will execute is up to the OS (...as long you don't manipulate the core-affinity of a process, but let's not go there).
AFAIK, the threading.Thread instances can't actually run in parallell, due to the Global Interpreter Lock, which forces only one thread to be able to run at any time (except for when blocking on I/O operations).
ParalellPython uses the threading module.
If I however submit multiple local jobs to it, it DOES execute them in parallel, or at least so it would seem.
I have 8 cores, and if I start 8 jobs to simply run empty loops, they all take up 12-13% of the CPU (meaning they each get executed on one core, and I can see this in my task manager)
Does anyone know how this can happen?
As the linked page says,
PP module overcomes this limitation and provides a simple way to write parallel python applications. Internally ppsmp uses processes and IPC (Inter Process Communications) to organize parallel computations
So the actual parallelism must be due to invoking multiple processes, as one would expect.
How do I know to which process my Python process has been bound?
Alone these same lines, are child processes going to execute on the same core (i.e. CPU) that the parent is currently executing?
Processes and native OS threads are only bound to specific processors if somebody specifically requests for that to happen. By default, processes and threads can (and will) be scheduled on any available processor.
Modern operating systems use pre-emptive multi-threading and can interrupt a thread's execution at any moment. When that thread is next scheduled to run, it can be executed on a different processor. This is known as a context switch. The thread's entire execution context is stored away by the operating system and then when the thread is re-scheduled, the execution context is restored.
Because of all this, it makes no real sense to ask what processor your thread is executing on since the answer can change at any moment. Even during the execution of the function that queried which the current thread's processor.
Again, by default, there's no relationship between the processors that two separate processes execute on. The two processes could execute on the same processor, or different processors. It all depends on how the OS decides to schedule the different threads.
In the comments you state:
The Python process will execute on only one core due to the GIL lock.
That statement is simply incorrect. For example, a section of Python code would claim the GIL, get context switched around all the available processors, and then release the GIL.
Right at the start of the answer I said alluded to the possibility of binding a process or thread to a particular processor. For example, on Windows you can use SetProcessAffinityMask and SetThreadAffinityMask to do this. However, it is unusual to do this. I can only recall ever doing this once, and that was to ensure that an execution of CPUID run on a specific processor. In the normal run of things, processes and threads have affinity with all processors.
In another comment you say:
I am creating the child processes to use multi cores of the CPU.
In which case you have nothing to worry about. Typically you would create as many processes as there are logical processors. The OS scheduler is sensible and will schedule each different process to run on a different processor. And thus make the optimal use of the available hardware resources.
As Wikipedia states:
Green threads emulate multi-threaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support.
Python's threads are implemented as pthreads (kernel threads),
and because of the global interpreter lock (GIL), a Python process only runs one thread at a time.
[QUESTION]
But in the case of Green-threads (or so-called greenlet or tasklets),
Does the GIL affect them? Can there be more than one greenlet
running at a time?
What are the pitfalls of using greenlets or tasklets?
If I use greenlets, how many of them can a process can handle? (I am wondering because in a single process you can open threads up to
ulimit(-s, -v) set in your *ix system.)
I need a little insight, and it would help if someone could share their experience, or guide me to the right path.
You can think of greenlets more like cooperative threads. What this means is that there is no scheduler pre-emptively switching between your threads at any given moment - instead your greenlets voluntarily/explicitly give up control to one another at specified points in your code.
Does the GIL affect them? Can there be more than one greenlet running
at a time?
Only one code path is running at a time - the advantage is you have ultimate control over which one that is.
What are the pitfalls of using greenlets or tasklets?
You need to be more careful - a badly written greenlet will not yield control to other greenlets. On the other hand, since you know when a greenlet will context switch, you may be able to get away with not creating locks for shared data-structures.
If I use greenlets, how many of them can a process can handle? (I am wondering because in a single process you can open threads up to umask limit set in your *ix system.)
With regular threads, the more you have the more scheduler overhead you have. Also regular threads still have a relatively high context-switch overhead. Greenlets do not have this overhead associated with them. From the bottle documentation:
Most servers limit the size of their worker pools to a relatively low
number of concurrent threads, due to the high overhead involved in
switching between and creating new threads. While threads are cheap
compared to processes (forks), they are still expensive to create for
each new connection.
The gevent module adds greenlets to the mix. Greenlets behave similar
to traditional threads, but are very cheap to create. A gevent-based
server can spawn thousands of greenlets (one for each connection) with
almost no overhead. Blocking individual greenlets has no impact on the
servers ability to accept new requests. The number of concurrent
connections is virtually unlimited.
There's also some further reading here if you're interested:
http://sdiehl.github.io/gevent-tutorial/
I assume you're talking about evenlet/gevent greenlets
1) There can be only one greenlet running
2) It's cooperative multithreading, which means that if a greenlet is stuck in an infinite loop, your entire program is stuck, typically greenlets are scheduled either explicitly or during I/O
3) A lot more than threads, it depends of the amount of RAM available
So I just finished watching this talk on the Python Global Interpreter Lock (GIL) http://blip.tv/file/2232410.
The gist of it is that the GIL is a pretty good design for single core systems (Python essentially leaves the thread handling/scheduling up to the operating system). But that this can seriously backfire on multi-core systems and you end up with IO intensive threads being heavily blocked by CPU intensive threads, the expense of context switching, the ctrl-C problem[*] and so on.
So since the GIL limits us to basically executing a Python program on one CPU my thought is why not accept this and simply use taskset on Linux to set the affinity of the program to a certain core/cpu on the system (especially in a situation with multiple Python apps running on a multi-core system)?
So ultimately my question is this: has anyone tried using taskset on Linux with Python applications (especially when running multiple applications on a Linux system so that multiple cores can be used with one or two Python applications bound to a specific core) and if so what were the results? is it worth doing? Does it make things worse for certain workloads? I plan to do this and test it out (basically see if the program takes more or less time to run) but would love to hear from others as to your experiences.
Addition: David Beazley (the guy giving the talk in the linked video) pointed out that some C/C++ extensions manually release the GIL lock and if these extensions are optimized for multi-core (i.e. scientific or numeric data analysis/etc.) then rather than getting the benefits of multi-core for number crunching the extension would be effectively crippled in that it is limited to a single core (thus potentially slowing your program down significantly). On the other hand if you aren't using extensions such as this
The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box since a thread fires off an HTTP request and then since it's waiting on I/O gives up the GIL and another thread can do it's thing, so that part of the program can easily run 100+ threads without hurting the CPU much and let me actually use the network bandwidth that is available. As for stackless Python/etc I'm not overly interested in rewriting the program or replacing my Python stack (availability would also be a concern).
[*] Only the main thread can receive signals so if you send a ctrl-C the Python interpreter basically tries to get the main thread to run so it can handle the signal, but since it doesn't directly control which thread is run (this is left to the operating system) it basically tells the OS to keep switching threads until it eventually hits the main thread (which if you are unlucky may take a while).
Another solution is:
http://docs.python.org/library/multiprocessing.html
Note 1: This is not a limitation of the Python language, but of CPython implementation.
Note 2: With regard to affinity, your OS shouldn't have a problem doing that itself.
I have never heard of anyone using taskset for a performance gain with Python. Doesn't mean it can't happen in your case, but definitely publish your results so others can critique your benchmarking methods and provide validation.
Personally though, I would decouple your I/O threads from the CPU bound threads using a message queue. That way your front end is now completely network I/O bound (some with HTTP interface, some with message queue interface) and ideal for your threading situation. Then the CPU intense processes can either use multiprocessing or just be individual processes waiting for work to arrive on the message queue.
In the longer term you might also want to consider replacing your threaded I/O front-end with Twisted or some thing like eventlets because, even if they won't help performance they should improve scalability. Your back-end is now already scalable because you can run your message queue over any number of machines+cpus as needed.
An interesting solution is the experiment reported by Ryan Kelly on his blog: http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2/
The results seems very satisfactory.
I've found the following rule of thumb sufficient over the years: If the workers are dependent on some shared state, I use one multiprocessing process per core (CPU bound), and per core a fix pool of worker threads (I/O bound). The OS will take care of assigining the different Python processes to the cores.
The Python GIL is per Python interpreter. That means the only to avoid problems with it while doing multiprocessing is simply starting multiple interpreters (i.e. using seperate processes instead of threads for concurrency) and then using some other IPC primitive for communication between the processes (such as sockets). That being said, the GIL is not a problem when using threads with blocking I/O calls.
The main problem of the GIL as mentioned earlier is that you can't execute 2 different python code threads at the same time. A thread blocking on a blocking I/O call is blocked and hence not executin python code. This means it is not blocking the GIL. If you have two CPU intensive tasks in seperate python threads, that's where the GIL kills multi-processing in Python (only the CPython implementation, as pointed out earlier). Because the GIL stops CPU #1 from executing a python thread while CPU #0 is busy executing the other python thread.
Until such time as the GIL is removed from Python, co-routines may be used in place of threads. I have it on good authority that this strategy has been implemented by two successful start-ups, using greenlets in at least one case.
This is a pretty old question but since everytime I search about information related to python and performance on multi-core systems this post is always on the result list, I would not let this past before me an do not share my thoughts.
You can use the multiprocessing module that rather than create threads for each task, it creates another process of cpython compier interpreting your code.
It would make your application to take advantage of multicore systems.
The only problem that I see on this approach is that you will have a considerable overhead by creating an entire new process stack on memory. (http://en.wikipedia.org/wiki/Thread_(computing)#How_threads_differ_from_processes)
Python Multiprocessing module:
http://docs.python.org/dev/library/multiprocessing.html
"The reason I am not using the multiprocessing module is that (in this case) part of the program is heavily network I/O bound (HTTP requests) so having a pool of worker threads is a GREAT way to squeeze performance out of a box..."
About this, I guess that you can have also a pool of process too: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers
Att,
Leo