I'm reading through Grok the GIL and it has the following statement in the discussion about locking:
So long as no thread holds a lock while it sleeps, does I/O, or some other GIL-dropping operation, you should use the coarsest, simplest locks possible. Other threads couldn't have run in parallel anyway.
It comes just after a discussion about preemptive multitasking. What prevents the preemptive dropping of the GIL from happening while you have a lock? Or is that not what this statement is referring to?
I asked the author of the piece and it comes down to the difference between dropping the GIL because you are waiting on an external operation vs. an internal preemption: https://opensource.com/article/17/4/grok-gil#comment-136186
Hi! Nothing prevents a thread from preemptively dropping the GIL while it holds a lock. Let's call that Thread A, and let's say there's also a Thread B. If Thread A holds a lock and gets preempted, then maybe Thread B could run instead of Thread A.
If Thread B is waiting for the lock that Thread A is holding, then Thread B is not waiting for the GIL. In that case Thread A reacquires the GIL immediately after dropping it, and Thread A continues.
If Thread B is not waiting for the lock that Thread A is holding, then Thread B might acquire the GIL and run.
My point about coarse locks, however, is this: no two threads can ever execute Python in parallel, because of the GIL. So using fine-grained locks doesn't improve throughput. This is in contrast to a language like Java or C, where fine-grained locks allow greater parallelism, and therefore greater throughput.
I still needed some clarification, and he did confirm this:
If I'm understanding you correctly, the intent of the statement I referenced was to avoid holding locks around external operations, where you could then block multiple threads if they all depended on that lock.
For the preemptive example, Thread A isn't blocked by anything external, so processing just goes back and forth, similar to cooperative multitasking.
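To make the coarse-lock advice concrete, here is a minimal sketch (the names are mine, not from the article). One coarse lock guarding all shared state costs nothing in throughput under the GIL, compared to a lock per variable, provided the lock is never held across a GIL-dropping call:

import threading

# One coarse lock for all shared state. Under the GIL this is no slower
# than fine-grained locks, since only one thread can execute Python
# bytecode at a time anyway.
state_lock = threading.Lock()
hits = 0
misses = 0

def record(hit):
    global hits, misses
    with state_lock:
        if hit:
            hits += 1
        else:
            misses += 1
    # Per the quoted advice: do any sleeping or I/O outside the lock,
    # or every thread that needs state_lock will stall behind it.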
Related
I understand the purpose of joining a thread, I'm asking about resource use. My specific use-case here is that I have a long-running process that needs to spawn many threads, and during operation, checks if they have terminated and then cleans them up. The main thread waits on inotify events and spawns threads based on those, so it can't block on join() calls, because it needs to block on inotify calls.
I know that with pthreads, for instance, not joining a terminated thread will cause a resource leak:
PTHREAD_JOIN(3): Failure to join with a thread that is joinable (i.e., one that is not detached), produces a "zombie thread". Avoid doing this, since each zombie thread consumes some system resources, and when enough zombie threads have accumulated, it will no longer be possible to create new threads (or processes).
Python's documentation says no such thing, but it also doesn't specify whether join() can be disregarded without issue if many threads are expected to end on their own without being joined during normal operation.
I'm wondering, can I simply take my thread list and do the following:
threads = [thread for thread in threads if thread.is_alive()]
For each check, or will this leak? Or must I do the following?
alive_threads = list()
for thread in threads:
    if thread.is_alive():
        alive_threads.append(thread)
    else:
        thread.join()
threads = alive_threads
TLDR: No, it won't leak. A Thread cleans up its underlying resources by itself.
Thread.join merely waits for the thread to end, it does not perform cleanup. Basically, each Thread has a lock that is released when the thread is done and subsequently cleaned up. Thread.join just waits for the lock to be released.
There is some minor cleanup done by Thread.join, namely removing the lock and setting a flag to mark the thread as dead. This is an optimisation to avoid needlessly waiting for the lock. These are internal, however, and also performed by all other public methods relying on the lock and flag. Finally, this cleanup is functionally equivalent to a Thread being cleaned up automatically due to garbage collection.
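A quick way to convince yourself (my own sketch, not from the answer above): spawn short-lived threads, prune the list with is_alive() and never call join(); the interpreter's thread count drops back down once the workers finish.

import threading
import time

def worker():
    time.sleep(0.01)  # stand-in for a short-lived task

threads = []
for _ in range(100):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)
    # Prune finished threads without ever joining them.
    threads = [t for t in threads if t.is_alive()]

time.sleep(0.1)  # let the stragglers finish
print(threading.active_count())  # back to 1 (just the main thread);
                                 # the finished threads were reclaimed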
My Python program is definitely CPU-bound, but 40% to 55% of the time is spent in C code in the z3 solver (which doesn't know anything about the GIL), where each single call to the C function (z3_optimize_check) takes almost a minute to complete (so far the parallel_enable parameter still results in this function working in single-threaded mode and blocking the main thread).
I can't use multiprocessing, as z3 objects aren't serialization-friendly (unless someone here can prove otherwise). As there are several tasks (where each task adds more z3 work to a dict for other tasks), I initially set up multithreading directly. But the GIL definitely hurts performance more than it helps (especially with hyperthreading), despite the huge time spent in the solver.
But if I set up a blocking mutex manually (through threading.Lock.acquire()) in the z3py module just after the switch from C code, which would allow another thread to run only if all other threads are performing solver work, would this remove the GIL performance penalty (since there would be only one thread at a time executing Python code, and it would always be the same one until the lock is released before z3_optimize_check)?
I mean, would using threading.Lock.acquire() trigger calls to PyEval_SaveThread() as if z3 were doing it directly?
so far the parallel_enable parameter still results in this function working in single-threaded mode and blocking the main thread
I think you are misunderstanding that. z3 running in parallel mode means that you call it from a single Python thread, and it then spawns multiple OS-level threads for itself, does the job, cleans up the threads, and returns the result to you. It does not miraculously enable Python to run without the GIL.
From the viewpoint of Python, it still does one thing at a time, and that one thing is making the call to z3. And it holds the GIL the entire time. So if you see more than one CPU core/thread utilized while the calculation is running, that is the effect of z3's parallel mode, internally branching out to multiple threads.
There is another mechanism, releasing the GIL, which is what blocking I/O operations do. It does not happen by magic; there is a call pair for it:
PyThreadState* PyEval_SaveThread()
Release the global interpreter lock (if it has been created) and reset the thread state to NULL, returning the previous thread state (which is not NULL). If the lock has been created, the current thread must have acquired it.
void PyEval_RestoreThread(PyThreadState *tstate)
Acquire the global interpreter lock (if it has been created) and set the thread state to tstate, which must not be NULL. If the lock has been created, the current thread must not have acquired it, otherwise deadlock ensues.
These are C calls, so they are accessible to extension developers. When developers know that the code will run for a long time without needing to access Python internals, PyEval_SaveThread() can be used, and then Python can proceed with other Python threads. When the long operation is done, the thread can re-introduce itself and apply for the GIL using PyEval_RestoreThread().
But, these things happen only if developers make them happen. And with z3 it might not be the case.
To provide a direct answer to your question: no, Python code cannot release the GIL and keep it released, as the GIL is the lock that a Python thread has to hold while it proceeds. So whenever a Python "instruction" returns, the GIL is held again.
Apparently I somehow managed not to include the link I wanted: the calls are documented at https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock (and the linked section discusses what I briefly summarized).
Z3 is open source (https://github.com/Z3Prover/z3), and its source code contains neither PyEval_SaveThread nor the wrapper macro Py_BEGIN_ALLOW_THREADS.
But it has a parallel Python example, btw: https://github.com/Z3Prover/z3/blob/master/examples/python/parallel.py, with
from multiprocessing.pool import ThreadPool
So I would assume that calling z3 from multiple threads has been tested and works (note that multiprocessing.pool.ThreadPool is thread-based, despite the module name).
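If you want to check empirically whether a given call, such as the z3_optimize_check call mentioned in the question, releases the GIL, a rough probe (my own sketch, not anything from z3) is to run pure-Python work in a second thread and see whether it makes progress during the call:

import threading
import time

def gil_probe(blocking_call, *args):
    # Count pure-Python loop iterations in a side thread while
    # blocking_call runs. A count near zero means blocking_call
    # held the GIL for its whole duration.
    ticks = 0
    done = threading.Event()

    def count():
        nonlocal ticks
        while not done.is_set():
            ticks += 1  # every increment needs the GIL

    t = threading.Thread(target=count)
    t.start()
    start = time.perf_counter()
    blocking_call(*args)  # the call under suspicion
    elapsed = time.perf_counter() - start
    done.set()
    t.join()
    print(f"{ticks} ticks in {elapsed:.2f}s (near zero => GIL was held)")

gil_probe(time.sleep, 1.0)  # sleep releases the GIL, so many ticks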
The Python threading documentation states that "...threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously", apparently because I/O-bound processes can avoid the GIL that prevents threads from concurrent execution in CPU-bound tasks.
But what I don't understand is that an I/O task still uses the CPU. So how could it not encounter the same issues? Is it because the I/O-bound task will not require memory management?
All of Python's blocking I/O primitives release the GIL while waiting for the I/O block to resolve -- it's as simple as that! They will of course need to acquire the GIL again before going on to execute further Python code, but for the long-in-terms-of-machine-cycles intervals in which they're just waiting for some I/O syscall, they don't need the GIL, so they don't hold on to it!
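You can watch this happen with nothing more than time.sleep(), which blocks in a C-level call and releases the GIL while it waits:

import threading
import time

def io_bound():
    time.sleep(1)  # blocking call; the GIL is dropped while waiting

workers = [threading.Thread(target=io_bound) for _ in range(10)]
start = time.perf_counter()
for t in workers:
    t.start()
for t in workers:
    t.join()
print(f"10 one-second waits took {time.perf_counter() - start:.2f}s")
# Prints ~1 second, not ~10: the waits overlapped because each thread
# released the GIL for the duration of its sleep.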
The GIL in CPython[1] is only concerned with Python code being executed. A thread-safe C extension that uses a lot of CPU might release the GIL as long as it doesn't need to interact with the Python runtime.
As soon as the C code needs to 'talk' to Python (read: call back into the Python runtime) then it needs to acquire the GIL again - that is, the GIL is to establish protection/atomic behavior for the "interpreter" (and I use the term loosely) and is not to prevent native/non-Python code from running concurrently.
Releasing the GIL around I/O (blocking or not, using CPU or not) is the same thing - until the data is moved into Python there is no reason to acquire the GIL.
[1] The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
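A concrete standard-library example (my illustration, not from the quote above): CPython's hashlib releases the GIL while hashing buffers larger than a couple of kilobytes, so CPU-heavy hashing threads really do run in parallel:

import hashlib
import threading
import time

data = b"x" * (128 * 1024 * 1024)  # 128 MiB, far above the ~2 KiB
                                   # threshold for hashlib's GIL release

def hash_it():
    hashlib.sha256(data).hexdigest()  # tight C loop, runs without the GIL

start = time.perf_counter()
hash_it()
hash_it()
serial = time.perf_counter() - start

threads = [threading.Thread(target=hash_it) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
# On a multi-core machine the threaded run takes roughly half the serial
# time, because the hashing happens outside the GIL.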
I'm curious. I've been programming in Python for years. When I run a command that blocks on I/O (whether it's a hard-disk read or a network request), or blocks while waiting on a lock to be released, how is that implemented? How does the thread know when to reacquire the GIL and start running again?
I wonder whether this is implemented by constantly checking ("Is the output here now? Is it here now? What about now?"), which I imagine would be wasteful, or in some more elegant way.
There is no need to repeatedly check for either I/O completion or for lock release.
An I/O completion, signaled by a hardware interrupt to a driver, or a lock release, signaled by a software interrupt from another thread, will make threads waiting on those operations ready 'immediately', quite possibly running, and quite possibly preempting another thread in the process. Essentially, after either a software or hardware interrupt, the OS can decide to interrupt-return to a different thread than the one that was interrupted.
The high I/O performance of this mechanism, eliminating any polling or checking, is 99% of the reason for putting up with the pain of preemptive multitaskers.
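You can observe the no-polling behavior from Python itself (a small sketch of mine): a thread parked in Event.wait() consumes essentially no CPU, yet is made runnable almost instantly when another thread calls set():

import threading
import time

evt = threading.Event()
released_at = 0.0

def waiter():
    evt.wait()  # parks in an OS-level wait; no busy polling
    latency = time.perf_counter() - released_at
    print(f"woken {latency * 1e6:.0f} microseconds after set()")

t = threading.Thread(target=waiter)
t.start()
time.sleep(0.5)  # the waiter sits idle here, using ~0% CPU

released_at = time.perf_counter()
evt.set()  # the software "interrupt": the OS wakes the waiter at once
t.join()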