I'd like people's opinions on which direction to take among several possible solutions for implementing inter-thread named-pipe communication.
I'm working on a solution for the following:
A 3rd party binary on AIX calls a shared object.
I build this shared object against the Python 2.7.5 API, so I have a Python thread (64-bit).
So the stack is:
3rd p binary -> my shared object / dll 'python-bridge' -> python 2.7.5 interpreter (persistent)
From custom code inside the 3rd party binary (in a proprietary language), I initialize the Python interpreter through the python-bridge, precompile Python code blocks through the python-bridge, and execute these bits of code using PyEval_EvalCode in the bridge.
The python interpreter stays alive during the session, and is closed just before the session ends.
Simple sequential Python code works fine, and fast. After the call to the shared object's method, all Python references are decreased (inside the method) and no garbage remains. The precompiled Python module stays in memory and works fine. However, I also need to interact with streaming data from the main executable. That executable (of which I don't have the source code) supports FIFO communication through a named pipe, which I want to use for inter-thread communication.
Since the named pipe is blocking, I need a separate thread.
I came up with three or four alternatives (feel free to suggest more):
Use the multiprocessing module within Python
Make my own C thread using pthread_create, and use Python inside it (carefully; I know about the non-thread-safe issues)
Make my own C thread using pthread_create, parse the named pipe from C, and call the Python interpreter's main thread from there
(maybe possible?) Use Python's simpler threading module (which isn't 'pure' threading), and release the GIL at the end of the API call to the bridge. (I haven't dared to do this; I need someone with insight here. A simple test with threading and sleep shows it works within the Python call, but the named-pipe thread does nothing after control returns to the main non-Python process.)
What do you suggest?
I'm trying option 1 at the moment, with some success, but it 'feels' a bit bloated to spawn a new process just for parsing a named pipe.
Thanks for your help, Tijs
Answering my own question:
I've implemented this (a while back) using option 4. It works well and is very stable.
Releasing the GIL wasn't happening in my first attempt, because I hadn't initialized threading.
After that, smooth sailing.
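For reference, here's a minimal sketch of what that fix might look like in the bridge (the function names are hypothetical and error handling is omitted; on Python 2.7 the explicit PyEval_InitThreads call is required):

#include <Python.h>

static PyThreadState *g_saved = NULL;

/* Called once when the 3rd-party binary loads the bridge. */
void bridge_init(void)
{
    Py_Initialize();
    PyEval_InitThreads();            /* create the GIL machinery; without this,
                                        Python threads are never scheduled */
    g_saved = PyEval_SaveThread();   /* release the GIL before returning to the
                                        host, so the pipe-reading thread can run */
}

/* Called for each precompiled block the host wants to execute. */
void bridge_exec(const char *code)
{
    PyEval_RestoreThread(g_saved);   /* reacquire the GIL for this call */
    PyRun_SimpleString(code);        /* stand-in for the real PyEval_EvalCode */
    g_saved = PyEval_SaveThread();   /* release again: the named-pipe reader
                                        keeps running in the background */
}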
Related
I am working on some changes to a library which embeds Python which require me to utilize sub-interpreters in order to support resetting the python state, while avoiding calling Py_Finalize (since calling Py_Initialize afterwards is a no-no).
I am only somewhat familiar with the library, but I am increasingly discovering places where PyGILState_Ensure and other PyGILState_* functions are being used to acquire the GIL in response to some external callback. Some of these callbacks originate from outside Python, so our thread certainly doesn't hold the GIL, but sometimes the callback originates from within Python, so we definitely hold the GIL.
After switching to sub-interpreters, I almost immediately saw a deadlock on a line calling PyGILState_Ensure, since it called PyEval_RestoreThread even though the code was clearly already executing from within Python (and so the GIL was held).
For what it's worth, I have verified that a line calling PyEval_RestoreThread does get executed before this call to PyGILState_Ensure (well before the first call into Python).
I am using Python 3.8.2. Clearly, the documentation wasn't lying when it said:
Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but mixing multiple interpreters and the PyGILState_* API is unsupported.
It would be quite a lot of work to refactor the library so that it tracks internally whether the GIL is held, and that seems rather silly. There should be a way to determine whether the GIL is held! The only function I can find is PyGILState_Check, but that's a member of the forbidden PyGILState API, so I'm not sure it will work. Is there a canonical way to do this with sub-interpreters?
I've been pondering this line in the documentation:
Also note that combining this functionality with PyGILState_* APIs is delicate, because these APIs assume a bijection between Python thread states and OS-level threads, an assumption broken by the presence of sub-interpreters.
I suspect the issue is that the PyGILState_* API relies on thread-local storage internally.
I've come to think that it's not really possible to tell whether the GIL is held at all. There's no central static place where Python records that the GIL is held, because it's always held by someone: either by "you" (in your external code) or by running Python code. So "is the GIL held?" isn't the question the PyGILState API is asking. It's asking "does this thread hold the GIL?", which makes it easier to have multiple non-Python threads interacting with the interpreter.
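For the single-interpreter case, that per-thread question is exactly what PyGILState_Check answers. A hedged sketch (it's still part of the PyGILState family, so the sub-interpreter caveats quoted above apply to it too; the Python-side handle_callback is hypothetical):

#include <Python.h>

void call_python_from_callback(void)
{
    if (PyGILState_Check()) {
        /* this thread already holds the GIL; safe to call the C API directly */
        PyRun_SimpleString("handle_callback()");
    } else {
        PyGILState_STATE st = PyGILState_Ensure();  /* acquire for this thread */
        PyRun_SimpleString("handle_callback()");
        PyGILState_Release(st);
    }
}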
I overcame this issue by restoring the bijection as best I could: a separate thread per sub-interpreter, with the order of operations strictly as follows (a sketch follows the list):
Grab the GIL in the main thread, either explicitly or via Py_Initialize (if this is the first time). Be very careful: the thread state from Py_Initialize must only ever be used in the main thread. Don't restore it on another thread; some module might use the PyGILState_* API and the deadlock will happen again.
Create the thread. I just used std::thread.
Spawn the sub-interpreter with Py_NewInterpreter. Be very careful: this gives you a new thread state, and as with the main thread state, it must only ever be used from this thread.
Release the GIL in the new thread when you're ready for Python to do its thing.
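A minimal sketch of that sequence, assuming Python 3.8 (PyEval_AcquireLock/PyEval_ReleaseLock are deprecated but still present there, and they let a thread with no thread state yet take the GIL without borrowing a foreign one). This illustrates the ordering, not a hardened implementation:

#include <Python.h>
#include <thread>

int main()
{
    /* 1. Grab the GIL in the main thread; Py_Initialize() leaves it held.
          main_ts must only ever be used on the main thread. */
    Py_Initialize();
    PyThreadState *main_ts = PyEval_SaveThread();   /* release the GIL */

    /* 2. Create the thread. */
    std::thread worker([] {
        PyEval_AcquireLock();                 /* take the GIL, no thread state */

        /* 3. Spawn the sub-interpreter; its thread state is created here and
              must never leave this thread. */
        PyThreadState *sub_ts = Py_NewInterpreter();

        /* 4. ...run Python, releasing the GIL whenever you're idle... */
        PyRun_SimpleString("print('hello from a sub-interpreter')");

        Py_EndInterpreter(sub_ts);            /* tear down the sub-interpreter */
        PyEval_ReleaseLock();                 /* give the GIL back */
    });
    worker.join();

    PyEval_RestoreThread(main_ts);            /* main thread takes the GIL back */
    Py_Finalize();
    return 0;
}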
Now, there's some gotchas I discovered:
asyncio in Python 3.8-3.9 has a use-after-free bug where the first interpreter to load it ends up managing some shared resources. If that interpreter is ended (releasing those resources) and a new interpreter imports asyncio, there is a segfault. I overcame this by manually loading asyncio through the C API in the main interpreter, since that one lives forever (see the sketch after this list).
Many libraries, including numpy, lxml, and several networking libraries, have trouble with multiple sub-interpreters. I believe Python itself enforces this: importing any of them raises an ImportError with the message "Interpreter change detected - This module can only be loaded into one interpreter per process." This so far seems to be an insurmountable issue for me, since I do require numpy in my application.
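A sketch of the asyncio workaround from the first gotcha: import it once through the C API in the main interpreter, right after initialization, while the GIL is still held.

#include <Python.h>

/* Load asyncio in the main interpreter (which lives forever) so its shared
   C-level resources are owned by an interpreter that is never torn down. */
static int preload_asyncio(void)
{
    PyObject *mod = PyImport_ImportModule("asyncio");
    if (mod == NULL) {
        PyErr_Print();
        return -1;
    }
    Py_DECREF(mod);   /* the module itself stays cached in sys.modules */
    return 0;
}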
I am using Python, and I use threads via the threading.Thread wrapper. At some point the Python interpreter converts my code into .pyc bytecode with its JIT. (Please provide a reference to a Python bytecode standard; as far as I know no such standard exists, just as there is no standard for the language itself.)
Then these virtual instructions are executed. The real instructions for Intel CPUs are x86/x64 instructions, and for ARM CPUs they are AArch64/AArch32 instructions.
My problem: I want to perform an action within the Python programming language that enforces an ordering constraint between memory operations.
What I want to know:
Q1: How can I emit an instruction such as mfence when the Python program is running on an x86/x64 CPU, or something like atomic_thread_fence() in LLVM IR?
Q2: How can I specify that some memory is volatile and should not be kept in a CPU register for optimization purposes?
CPython does not have a JIT - though it may do one day.
So your code is only converted into bytecode, which will be interpreted, and not into actual Intel/etc. machine code.
Additionally, Python has what's known as the GIL (Global Interpreter Lock), meaning that even if you have multiple Python threads in a process, only one of them can be interpreting code at once - though this may also change one day. Threads have frequently been useful for doing I/O, because I/O operations can happen at the same time, but these days asyncio is a good competitor for that job.
So, in response to Q1, it doesn't make any sense to "put an mfence in Python code" (or the like).
Instead, if you want to enforce ordering constraints between one bit of code being executed and another, what you probably want are higher-level strategies, such as Python's threading.Lock, queue.Queue, or their equivalents in the asyncio world.
As for Q2: are you thinking of the C/C++ keyword volatile? It is frequently, and mistakenly, thought of as a way to make memory access atomic for use in multithreading - it isn't. All that C/C++ volatile does is ensure that memory reads and writes happen exactly as written rather than being optimized out. What is your use case? There are all sorts of strategies for optimizing Python code, but the question is impossible to answer without knowing what you're actually trying to do.
Answers to comments
The CPU executes instructions, and somebody has to emit those instructions. By "a JIT" I mean the part inside the Python interpreter that emits instructions at the end of the day.
CPython is an interpreter - it does not emit instructions. JITs do emit instructions, but as stated above, CPython does not have a JIT. When you run a Python script, CPython will compile your text-based .py file into bytecode, and then it will spend the rest of its time working through the bytecode and doing what the bytecode says. The only machine instructions being executed are those that are emitted by whoever compiled CPython.
If you compile a Python script to a .pyc and then execute that, CPython will do exactly the same, it will just skip the "compile your text-based .py file into bytecode" part as it's already done it - the result of that step is stored in the .pyc file.
I was a bit vague in naming. Do you mean that in Python the instruction is re-executed each time the interpreter encounters it?
A real CPU executing machine code will "re-execute" each instruction as it reads it, sure. CPython will do the same thing with Python bytecode. (CPython will execute a whole bunch of real machine code instructions each time it reads one Python instruction.)
Thanks, I have found this notice at https://docs.python.org/3/extending/extending.html: "CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once." OK, so when a Python thread goes into native code through C/C++ bindings, what happens then? Q1-A: can another Python thread execute during that time? Q1-B: what happens if I create another thread inside the C++ code?
Native code can release the GIL but must lock it again before returning to Python code.
Typically, native code that does some CPU-intensive work or does some I/O that requires waiting would release the GIL while it does that work or waits on that I/O. This allows Python code on another Python thread to run at the same time. But at no point does Python code run on two threads at once. That's why it makes no sense to put native ordering instructions and the like in Python code.
Native code that needs to use native ordering instructions for its own purposes will obviously do that, but that is C/C++ code, not Python code. If you want to know how to put native ordering instructions in a C/C++ Python extension, then look up how to do it in any C/C++ code - the fact that it's a Python extension is irrelevant.
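For illustration, here is roughly what an ordering constraint looks like in native C++ code; nothing like it is expressible, or needed, at the Python level:

#include <atomic>

std::atomic<int> ready{0};
int payload = 0;   /* plain data handed from producer to consumer */

void producer()
{
    payload = 42;
    /* Release fence: the store to payload cannot be reordered past the
       store to ready below. */
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

int consumer()
{
    while (ready.load(std::memory_order_relaxed) == 0) { /* spin */ }
    /* Acquire fence: pairs with the release fence, so the read of payload
       below is guaranteed to see 42. */
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}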
Basically, either write Python code and use high-level strategies like what I mentioned above, or write a native extension in C/C++ and do whatever you need to do in that language.
I need to learn more about the GIL; it seems there is a good study of it by David Beazley: https://dabeaz.com/python/UnderstandingGIL.pdf. But @Keiji - you may be wrong about Q1: CPython threads seem to be real threads, and if the implementor of a C/C++ extension (as in almost all libraries for Python) decides to release the GIL, that is possible... So Q1 still makes sense...
I've covered this above. Python code can't interact with native code in a way that would require putting native ordering instructions in Python code.
Back to the question - I meant volatile in the C++ sense of suppressing the compiler optimization that keeps variables in registers. In C++ it guarantees neither atomicity nor a memory fence. So, regarding volatile: how can I specify it for an integer variable or a user-defined type?
If you want to make something in C/C++ be volatile, use the volatile keyword. If you're writing Python code, it doesn't make any sense to make something volatile.
About Python Threads:
Python threads are tricky, first of all. The interpreter uses real POSIX/WinAPI threads: threading.Thread is a real OS thread under the hood.
The thread execution model is quite specific and has been described as "cooperative multitasking" by one enthusiast, David Beazley (https://www.dabeaz.com/about.html).
As he explains in https://www.dabeaz.com/python/UnderstandingGIL.pdf:
When a thread is waiting on I/O, it releases the global lock (the GIL); the lock is released around the system call.
Next, the Python VM performs a periodic "check" (a "tick"). If a thread is CPU-bound, it keeps executing until the check, at which point it releases the GIL and the runnable threads compete to reacquire it. (In CPython 2 this check happens every 100 bytecode instructions; since CPython 3.2 it is time-based, every 5 ms by default.)
There is no thread scheduler inside Python; the operating system schedules the threads.
Multithreading in Python can in fact hurt performance for CPU-bound code.
I have a really specific need :
I want to create a python console with a Qt Widget, and to be able to have several independent interpreters.
Now let me try to explain where my problems are and all the attempts I made, in order from the ones I'd most like to get working to those I could fall back on by default.
The first point is that all functions in the Python C API (PyRun_*, PyEval_*, ...) require the GIL to be held, which forbids any concurrent code interpretation from C (or I'd be really glad to be wrong!!! :D)
Therefore, I tried another approach than the "usual way": I made a loop in Python that calls read() on my special file and evals the result. This function (implemented as a compiled extension) blocks until there is data to read. (Actually, it's currently a busy-wait loop in C code rather than a pthread-based condition.)
Then, with PyRun_SimpleString(), I launch my loop in another thread. This is where the problem is: my read function, in addition to blocking the current thread (which is totally normal), blocks the whole interpreter, and PyRun_SimpleString() doesn't return...
Finally, I have one last idea, which risks being relatively slow: a dedicated C++ thread that runs the interpreter, with everything for input/output managed from Python. This could be a loop that creates jobs whenever a console needs to execute a command. It doesn't seem really hard to do, but I'd rather ask you first: is there a way to make the possibilities above work, is there another approach I haven't thought of, or is my last idea the best one?
One alternative is to just reuse code from IPython and its Qt console. This assumes that by "independent interpreters" you mean they won't share memory. IPython runs the Python interpreters in multiple processes and communicates with them over TCP or Unix domain sockets with the help of ZeroMQ.
Also, from your question I'm not sure if you're aware of the common blocking I/O idiom in Python C extensions:
Py_BEGIN_ALLOW_THREADS
... Do some blocking I/O operation ...
Py_END_ALLOW_THREADS
This releases the GIL so that other threads can execute Python code while your function is blocking. See Python/C API Reference Manual: Thread State and the Global Interpreter Lock.
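As a concrete (hypothetical) example, an extension function wrapping a blocking read could look like this (Python 3 API):

#include <Python.h>
#include <unistd.h>

/* Read one byte from a file descriptor, releasing the GIL while the read
   blocks so that other Python threads can run in the meantime. */
static PyObject *read_one_byte(PyObject *self, PyObject *args)
{
    int fd;
    char buf;
    ssize_t n;

    if (!PyArg_ParseTuple(args, "i", &fd))
        return NULL;

    Py_BEGIN_ALLOW_THREADS        /* GIL released here */
    n = read(fd, &buf, 1);        /* other Python threads may run now */
    Py_END_ALLOW_THREADS          /* GIL reacquired here */

    if (n < 0)
        return PyErr_SetFromErrno(PyExc_OSError);
    return PyBytes_FromStringAndSize(&buf, (Py_ssize_t)n);
}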
If your main requirement is to have several interpreters independent from each other, you'd probably be better off using fork() and exec() than multithreading.
This way each interpreter lives in its own address space and cannot disturb the others.
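A rough sketch of that approach: each interpreter runs in its own child process; the communication channel (pipes, sockets, ...) is left out here.

#include <sys/types.h>
#include <unistd.h>

/* Spawn one independent interpreter per console as a child process. */
pid_t spawn_interpreter(const char *script)
{
    pid_t pid = fork();
    if (pid == 0) {                                   /* child */
        execlp("python", "python", script, (char *)NULL);
        _exit(127);                                   /* only reached if exec failed */
    }
    return pid;                                       /* parent keeps the child's pid */
}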
In one of my recent projects I happened to have in the same process (a Python program, its a multithreaded application):
a Boost::Python module for a library linked against the AVT PvAPI SDK, i.e. in the widest sense a driver for getting image frames from a camera. This library (the PvAPI SDK) raises a SIGALRM internally every few milliseconds.
another plain Python module intended to do some serial I/O using pyserial. This in turn uses Python's select.select, a wrapper around POSIX select, which turned out to be interrupted (errno == EINTR) every time the signal was raised by the other module.
the same issue could be observed on any call to Python's time.sleep, or rather the POSIX sleep that it uses internally.
These issues are apparently not present on Windows, as sleep and select are not interrupted by signals there (according to the documentation). And they are not much of a problem in C/C++, since you can (and should) restart the calls when they have been interrupted.
However, as the Python implementation (source code: Modules/selectmodule.c) doesn't handle this case (EINTR), am I forced to implement my own C/C++ serial driver and sleep function for use from Python? Or to move away from Python for this project? Python makes programming so much easier, so I'm very interested whether anyone has had similar problems and found a nice workaround or simple fix. Right now I don't have the capacity to work out the necessary fixes for the Python modules myself. Or maybe I've missed some other option?
Any ideas?
Try using signal.siginterrupt to make the SIGALRM signal restart system calls automatically. Or you could use signal.signal(signal.SIGALRM, signal.SIG_IGN) to ignore the alarm signal, assuming you're not using it for anything.
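For reference, the same effect at the C level: installing the SIGALRM handler with SA_RESTART makes interrupted system calls restart automatically instead of failing with EINTR. This only helps if you control how the SDK installs its handler; the handler body below is a placeholder.

#include <signal.h>

static void on_alarm(int signum)
{
    (void)signum;   /* the SDK's real handler work would go here */
}

int install_restarting_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart syscalls interrupted by SIGALRM */
    return sigaction(SIGALRM, &sa, NULL);
}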
In the meantime, a PEP was published to address this issue: PEP 475.
By the look of it, the behavior was fixed in Python 3.5, in 2015.
As the title says: does Python's cStringIO protect its internal structures for multithreaded use?
Thank you.
Take a look at this excellent work explaining the GIL, then note that cStringIO is written purely in C and that its calls don't release the GIL.
This means the running thread won't voluntarily switch during read()/write() (with the current virtual machine implementation). (The OS will preempt the thread, but other Python threads won't be able to acquire the GIL.)
Taking a look at the source (Python-2.7.1/Modules/cStringIO.c), there is no mention of protecting the internals. When in doubt, look at the source :)
I assume you are talking about the CPython implementation of Python.
In CPython there is a global interpreter lock which means that only a single thread of Python code can execute at a time. Code written in C will therefore also be effectively single threaded unless it explicitly releases the global lock.
What that means is that if you have multiple Python threads all using cStringIO simultaneously, there won't be any problem, as only a single call to a cStringIO method can be active at a time and cStringIO never releases the lock. However, if you call it directly from C code running outside the locked environment, you will have problems. Also, if you do anything more complex than just reading or writing you will have issues; e.g. if you start using seek, your calls may overlap in unexpected ways.
Also note that some methods such as writelines can invoke Python code from inside the method so in that case you might get other output interleaved inside a single call to writelines.
That is true for most of the standard Python objects: you can safely use objects from multiple threads as the individual operations won't break, but the order in which things happen won't be defined.
It is as "thread-safe" as file operations can be (which means: not much). The Python implementation you're using has a Global Interpreter Lock (GIL), which guarantees that each individual file operation on a cStringIO object will not be interrupted by another thread. It does not, however, guarantee that concurrent file operations from multiple threads won't be interleaved.
No, it is not currently thread-safe.