Handling the GIL when calling a Python lambda from a C++ function


The question
Is pybind11 somehow magically doing the work of PyGILState_Ensure() and PyGILState_Release()? And if not, how should I do it?
More details
There are many questions regarding passing a python function to C++ as a callback using pybind11, but I haven't found one that explains the use of the GIL with pybind11.
The documentation is pretty clear about the GIL:
[...] However, when threads are created from C (for example by a third-party library with its own thread management), they don’t hold the GIL, nor is there a thread state structure for them.
If you need to call Python code from these threads (often this will be part of a callback API provided by the aforementioned third-party library), you must first register these threads with the interpreter by creating a thread state data structure, then acquiring the GIL, and finally storing their thread state pointer, before you can start using the Python/C API.
I can easily bind a C++ function that takes a callback:
py::class_<SomeApi> some_api(m, "SomeApi");
some_api
    .def(py::init<>())
    .def("mode", &SomeApi::subscribe_mode, "Subscribe to 'mode' updates.");
With the corresponding C++ function being something like:
void subscribe_mode(const std::function<void(Mode mode)>& mode_callback);
But because pybind11 cannot know about the threading happening in my C++ implementation, I suppose it cannot handle the GIL for me. Therefore, if mode_callback is called by a thread created from C++, does that mean that I should write a wrapper to SomeApi::subscribe_mode that uses PyGILState_Ensure() and PyGILState_Release() for each call?
This answer seems to be doing something similar, but still slightly different: instead of "taking the GIL" when calling the callback, it seems like it "releases the GIL" when starting/stopping the thread. Still I'm wondering if there exists something like py::call_guard<py::gil_scoped_acquire>() that would do exactly what I (believe I) need, i.e. wrapping my callback with PyGILState_Ensure() and PyGILState_Release().

In general
pybind11 tries to do the Right Thing: the GIL will be held whenever pybind11 knows that it is calling a Python function, and in C++ code that is called from Python via pybind11. The only times you need to explicitly acquire the GIL when using pybind11 are when you are writing C++ code that accesses Python and will be called from other C++ code, or when you have explicitly dropped the GIL.
std::function wrapper
The wrapper for std::function always acquires the GIL via gil_scoped_acquire when the function is called, so your Python callback will always be called with the GIL held, regardless of which thread it is called from.
If gil_scoped_acquire is called from a thread that does not currently have a GIL thread state associated with it, then it will create a new thread state. As a side effect, if nothing else in the thread acquires the thread state and increments the reference count, then once your function exits the GIL will be released by the destructor of gil_scoped_acquire and then it will delete the thread state associated with that thread.
If you're only calling the function once from another thread, this isn't a problem. If you're calling the callback often, it will create/delete the thread state a lot, which probably isn't great for performance. It would be better to cause the thread state to be created when your thread starts (or even easier, start the thread from Python and call your C++ code from python).

Related

Would setting a mutex manually improve performance?

My Python program is definitely CPU bound, but 40% to 55% of the time is spent in C code in the z3 solver (which doesn't know anything about the GIL), where each single call to the C function (z3_optimize_check) takes almost a minute to complete (so far the parallel_enable parameter still results in this function working in single-thread mode and blocking the main thread).
I can't use multiprocessing as z3 objects aren't serialization friendly (unless someone here can prove otherwise). As there are several tasks (where each task adds more z3 work to a dict for other tasks), I initially set up multithreading directly. But the GIL definitely hurts performance more than it helps (especially with hyperthreading), despite the huge amount of time spent in the solver.
But if I set up a blocking mutex manually (through threading.Lock.acquire()) in the z3py module just after the switch back from C code, which would allow another thread to run only if all other threads are performing solver work, would this remove the GIL performance penalty (since there would be only one thread at a time executing Python code, and it would always be the same one until the lock is released before z3_optimize_check)?
I mean, would using threading.Lock.acquire() trigger calls to PyEval_SaveThread() as if z3 were doing it directly?

so far the parallel_enable parameter still results in this function working in single-thread mode and blocking the main thread
I think you are misunderstanding that. z3 running in parallel mode means that you call it from a single Python thread, and it then spawns multiple OS-level threads for itself, does the job, cleans up the threads and returns the result to you. It does not miraculously enable Python to run without the GIL.
From the viewpoint of Python, it still does one thing at a time, and that one thing is making the call to z3. And it is holding the GIL for the entire time. So if you see more than one CPU core/thread utilized while the calculation is running, that is the effect of z3's parallel mode internally branching into multiple threads.
Releasing the GIL, as blocking I/O operations do, is a separate matter. It does not happen by magic; there is a call pair for that:
PyThreadState* PyEval_SaveThread()
Release the global interpreter lock (if it has been created) and reset the thread state to NULL, returning the previous thread state (which is not NULL). If the lock has been created, the current thread must have acquired it.
void PyEval_RestoreThread(PyThreadState *tstate)
Acquire the global interpreter lock (if it has been created) and set the thread state to tstate, which must not be NULL. If the lock has been created, the current thread must not have acquired it, otherwise deadlock ensues.
These are C calls, so they are accessible to extension developers. When developers know that the code will run for a long time without needing access to Python internals, PyEval_SaveThread() can be used, and then Python can proceed with other Python threads. When the long-running work is done, the thread can re-introduce itself and reacquire the GIL using PyEval_RestoreThread().
But, these things happen only if developers make them happen. And with z3 it might not be the case.
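The effect of such a release can be observed from pure Python. The following is a minimal sketch, not tied to z3: it uses time.sleep(), a standard library call implemented in C that releases the GIL while it waits, as a stand-in for any extension function that calls PyEval_SaveThread(). The helper names blocking_call and run_in_threads are made up for this illustration.

```python
import threading
import time


def blocking_call(seconds):
    # time.sleep() releases the GIL while waiting, so other Python
    # threads are free to run during the call.
    time.sleep(seconds)


def run_in_threads(n_threads, seconds):
    """Run blocking_call in n_threads threads and return the wall time."""
    threads = [
        threading.Thread(target=blocking_call, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_in_threads(4, 0.3)
    # If the GIL were held during the call, this would take about 1.2 s.
    print(f"4 threads x 0.3 s took {elapsed:.2f} s")
```

If the C function held the GIL for its whole duration, the four calls would run one after another instead of overlapping.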
To provide a direct answer to your question: no, Python code cannot release the GIL and keep it released, as the GIL is the lock that a Python thread has to hold in order to proceed. So whenever a Python "instruction" returns, the GIL is held again.
Apparently I somehow managed to not include the link I wanted to: these functions are documented at https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock (and the linked paragraph discusses what I briefly summarized).
Z3 is open source (https://github.com/Z3Prover/z3), and the source code contains neither PyEval_SaveThread nor the wrapper shortcut Py_BEGIN_ALLOW_THREADS.
But, it has a parallel Python example, btw. https://github.com/Z3Prover/z3/blob/master/examples/python/parallel.py, with
from multiprocessing.pool import ThreadPool
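So I would assume that it is tested and working with multiprocessing.pool.ThreadPool, which, despite living in the multiprocessing package, runs its workers as threads of the same process.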
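It is worth checking what that import actually provides. The sketch below (the helper names where_am_i and probe are hypothetical, and no z3 code is involved) shows that ThreadPool workers run in the same process as the caller, on different threads, so nothing has to be pickled across a process boundary.

```python
import os
import threading
from multiprocessing.pool import ThreadPool


def where_am_i(_):
    # Report the process id and thread id that executed this task.
    return os.getpid(), threading.get_ident()


def probe(n_workers=4, n_tasks=16):
    """Return the sets of process ids and thread ids used by the pool."""
    with ThreadPool(n_workers) as pool:
        results = pool.map(where_am_i, range(n_tasks))
    pids = {pid for pid, _ in results}
    thread_ids = {tid for _, tid in results}
    return pids, thread_ids


if __name__ == "__main__":
    pids, thread_ids = probe()
    print("same process:", pids == {os.getpid()})
    print("worker threads used:", len(thread_ids))
```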

Python interpreter yielding control back to C caller on asynchronous operation

This is for a networking daemon, where each incoming request runs through an interpreter, with a lightweight request-specific stack. The interpreter allows the request to yield control when waiting on blocking I/O operations. In this way the requests operate very similarly to coroutines in other languages. A single POSIX thread may have several thousand requests in yielded or runnable states, but only a single request actively making progress.
With other embedded languages such as Lua, it's possible to yield control back to the C caller. This is one of the reasons why NGINX utilises Lua for its embedded scripting language.
I'm wondering if there's a way to achieve something similar with Python, when a python thread is waiting for a condition to be asynchronously satisfied.
I don't think it's realistic for Python to expose the details of the asynchronous condition to the C caller, and have the C caller notify the Python interpreter when the condition was satisfied. But even if Python returned control with no information regarding the asynchronous condition, it may allow the C caller to utilise multiple Python thread states as green threads.
The idea would be to attach a thread state to each request, and have the python interpreter inform the C caller when a particular thread and therefore request, was runnable. The most obvious (but likely worst/most naive) way of doing this would be for the C caller to poll the Python interpreter, allowing Python to check if any async conditions had been satisfied, and returning a list of runnable thread states. The C caller would then swap in a runnable thread state, and call the Python interpreter to continue execution.
I'd be grateful for any ideas on this. Even knowing whether it's possible for a Python coroutine to yield to a C caller, and have the C caller resume the coroutine would be useful.
EDIT
No points for suggesting running Python in a separate process and sending requests to it via a pipe or network socket. That's cheating.
EDIT 2
Looks like someone else implemented a mechanism similar to the one I was suggesting, between Emscripten and Python.
https://github.com/emscripten-core/emscripten/issues/9279

One potential solution is using asyncio's run_coroutine_threadsafe() function.
For every application thread, you have a shadow Python interpreter thread. These are separate OS threads that share an interpreter, but with separate PyThreadStates.
In the Python thread, you create a new event loop, write out a reference to the loop object to a shared variable, and call loop.run_forever() after installing an appropriate mechanism to stop the loop gracefully.
In the application thread, you wrap module calls to the Python script you want to run in a coroutine and use asyncio.run_coroutine_threadsafe() to submit them to the Python interpreter thread (using the handle from the shared variable). The application thread adds a callback to the Future it receives via the add_done_callback.
The application request is then yielded, which means its execution is suspended and the application thread can process a new application request.
The add_done_callback callback calls an application C function which signals the application thread that processing of a particular application request is complete. The application request is then placed back into the application's runnable queue for execution to continue.
I'll update the answer once I have a complete, polished solution and have fully tested the aspects whose thread safety is questionable. But for now, this does seem like a viable solution.
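The Python side of this scheme can be sketched roughly as follows. This is only an illustration under assumptions: handle_request stands in for the real module call, on_done stands in for the application C function that re-queues the request, and the "application thread" is simply the main Python thread here rather than a C thread.

```python
import asyncio
import threading


def start_loop_thread():
    """Create an event loop and run it forever in a shadow thread."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


async def handle_request(request_id):
    # Hypothetical request handler; the sleep stands in for async I/O.
    await asyncio.sleep(0.01)
    return f"request {request_id} done"


def submit(loop, request_id, on_done):
    """Submit a request from the application thread and register a callback."""
    future = asyncio.run_coroutine_threadsafe(handle_request(request_id), loop)
    # The callback runs in the loop thread once the coroutine has finished.
    future.add_done_callback(lambda f: on_done(f.result()))
    return future


def stop_loop(loop, thread):
    """Stop the loop gracefully from another thread and clean up."""
    loop.call_soon_threadsafe(loop.stop)
    thread.join()
    loop.close()


if __name__ == "__main__":
    loop, thread = start_loop_thread()
    completed = []
    futures = [submit(loop, i, completed.append) for i in range(3)]
    for f in futures:
        f.result(timeout=5)
    stop_loop(loop, thread)
    print(sorted(completed))
```

In the real daemon the application thread would not block on f.result(); it would yield the request and continue with other requests until the done callback signals completion.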

Calling Py_Initialize() in multiple threads

I am embedding Python in a multi-threaded C++ application. Is it safe to call Py_Initialize() in multiple threads, or should I call it in the main thread?

The Py_Initialize() code contains:
if (initialized)
    return;
initialized = 1;
The documentation for the function also says:
https://docs.python.org/2/c-api/init.html#c.Py_Initialize
This is a no-op when called for a second time (without calling Py_Finalize() first).
My recommendation though is you only do it from the main thread, although depending on what you are doing, it can get complicated.
The problem is that signal handlers are only triggered in the context of the main Python thread, that is, whichever thread called Py_Initialize(). So if that is a transient thread that is used once and then discarded, there is no chance of ever having signal handlers called. So you have to give some thought to how you handle signals.
Also be careful about using lots of transient threads created in C code using the native thread API and calling into the Python interpreter, as each will create data in the Python interpreter. That data will accumulate if you keep creating and discarding these external threads. If you are calling in from external threads, you should endeavour to use a thread pool instead and keep reusing existing threads.

Can you have a race condition in Python while there is a GIL?

My understanding is that due to the Global Interpreter Lock (GIL) in CPython, only one thread can ever be executing at any one time. Does this automatically protect against race conditions, such as the lost update problem, or not?

Due to the GIL, there is only ever one thread per process active to execute Python bytecode; the bytecode evaluation loop is protected by it.
A running thread is asked to release the lock after sys.getswitchinterval() seconds (5 ms by default) when other threads are waiting for it, at which point a thread switch can take place. This means that for Python code, a thread switch can still take place, but only between bytecode instructions. Any code that relies on thread safety needs to take this into account. Actions that can be done in one bytecode can be thread safe; everything else is not.
Even a single byte code instruction can trigger other Python code; for example the line object[index] can trigger a __getitem__ call on a custom class, implemented itself in Python. Thus a single BINARY_SUBSCR opcode is not necessarily thread safe, depending on the object type.
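To make the lost update case concrete, here is a small sketch (the names counter, unsafe_increment, safe_increment and run_safe are chosen for this example). Disassembling counter += 1 shows that it compiles to several bytecode instructions, so a thread switch can land between the load and the store; guarding the update with a threading.Lock removes the race.

```python
import dis
import threading

counter = 0
counter_lock = threading.Lock()


def unsafe_increment():
    global counter
    # Compiles to separate load, add and store instructions, so another
    # thread may update counter between the load and the store.
    counter += 1


def safe_increment(n):
    global counter
    for _ in range(n):
        # The lock makes the read-modify-write sequence atomic with
        # respect to other threads that use the same lock.
        with counter_lock:
            counter += 1


def run_safe(n_threads=4, n_increments=10000):
    """Increment the shared counter from several threads under the lock."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=safe_increment, args=(n_increments,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    dis.dis(unsafe_increment)  # exact opcodes vary by Python version
    print(run_safe())
```

Whether the unlocked version actually loses updates in a given run depends on the interpreter version and timing, which is exactly why it cannot be relied upon.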

How can I check whether a thread currently holds the GIL?

I tried to find a function that tells me whether the current thread has the global interpreter lock or not.
The Python/C-API documentation does not seem to contain such a function.
My current solution is to just acquire the lock using PyGILState_Ensure() before releasing it using PyEval_SaveThread to not try releasing a lock that wasn't acquired by the current thread.
(btw. what does "issues a fatal error" mean?)
Background of this question: I have a multithreaded application which embeds Python. If a thread is closed without releasing the lock (which might occur due to crashes), other threads are not able to run any more. Thus, when cleaning up/closing the thread, I would like to check whether the lock is held by this thread and release it in this case.
Thanks in advance for answers!

If you are using (or can use) Python 3.4, there's a new function for the exact same purpose:
if (PyGILState_Check()) {
    /* I have the GIL */
}
https://docs.python.org/3/c-api/init.html?highlight=pygilstate_check#c.PyGILState_Check
Return 1 if the current thread is holding the GIL and 0 otherwise. This function can be called from any thread at any time. Only if it has had its Python thread state initialized and currently is holding the GIL will it return 1. This is mainly a helper/diagnostic function. It can be useful for example in callback contexts or memory allocation functions when knowing that the GIL is locked can allow the caller to perform sensitive actions or otherwise behave differently.
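For quick experiments, the same function can also be reached from Python through ctypes. This is a sketch that assumes CPython 3.4 or later; the helper name holds_gil is made up. Because ctypes.pythonapi is a PyDLL, the GIL is not released around the call, so from ordinary Python code the answer is always true. The check only becomes informative when made from C code that may or may not hold the GIL.

```python
import ctypes
import threading


def holds_gil():
    """Ask the interpreter whether the calling thread holds the GIL."""
    check = ctypes.pythonapi.PyGILState_Check
    check.restype = ctypes.c_int
    check.argtypes = []
    # ctypes.pythonapi calls are made with the GIL held.
    return bool(check())


if __name__ == "__main__":
    print("main thread:", holds_gil())

    results = []
    worker = threading.Thread(target=lambda: results.append(holds_gil()))
    worker.start()
    worker.join()
    print("worker thread:", results[0])
```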
In Python 2, you can try something like the following:
int PyGILState_Check2(void) {
    PyThreadState * tstate = _PyThreadState_Current;
    return tstate && (tstate == PyGILState_GetThisThreadState());
}
It seems to work well in the cases I have tried.
https://github.com/pankajp/pygilstate_check/blob/master/_pygilstate_check.c#L9

I don't know exactly what you are looking for, but you should consider using the macro pair Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS. With these macros you can make sure that the code between them does not hold the GIL, so a random crash inside that region will not leave the GIL locked.
