Does python threading expose issues of memory visibility and statement reordering as Java does? Since I can't find any reference to a "Python Memory Model" or anything like that, despite the fact that lots of people are writing multithreaded Python code, I'm guessing that these gotchas don't exist here. No volatile keyword, for instance. But it doesn't seem to be stated explicitly anywhere that, for instance, a change in a variable in one thread is immediately visible to all other threads.
Maybe this stuff is all very obvious to Python programmers, but as a fearful Java programmer, I require a little extra reassurance :)
There is no formal model for Python's threading (hey, after all, there wasn't one for Java's for years... hopefully, one will also eventually be written for Python).
In practice, no Python implementation performs advanced optimizations such as statement reordering or temporarily treating shared variables as thread-local -- and you can count on these semantic constraints even though they are not formally guaranteed.
CPython in particular, as @Rawheiser mentions, uses a global interpreter lock; other implementations (PyPy, IronPython, Jython, ...) do not (so they can use multiple cores effectively with a threading model, while CPython requires multiprocessing for the same purpose), so you should not count on the GIL if you want to write code that's portable across Python implementations. In particular, you shouldn't count on the "atomicity" of operations that only happen to be atomic in CPython because of the GIL, such as dictionary accesses -- in other Python implementations, multiple threads might be modifying a dict at once and cause errors unless you protect the dict with a lock or the like.
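For example, a minimal sketch of protecting a shared dict with a lock (the shared_counts and record_hit names are made up for illustration):

    import threading

    shared_counts = {}              # dict shared between threads
    counts_lock = threading.Lock()  # guards every access to shared_counts

    def record_hit(key):
        # The read-modify-write below is not atomic across Python
        # implementations, even if CPython's GIL happens to make the
        # individual dict operations atomic.
        with counts_lock:
            shared_counts[key] = shared_counts.get(key, 0) + 1

    threads = [threading.Thread(target=record_hit, args=("page",)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared_counts)  # {'page': 8}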
Related
I am using Python, and I use threads via the threading.Thread wrapper. In some way, the Python interpreter converts my code into .pyc bytecode with its JIT. (Please provide a reference to a Python bytecode standard; as far as I know, no such standard exists, just as no standard exists for the language itself.)
Then these virtual commands are executed. The real commands for Intel CPUs are x86/x64 instructions, and for ARM CPUs they are AArch64/AArch32 instructions.
My problem: I want to perform an action within the Python programming language that enforces an ordering constraint between memory operations.
What I want to know:
Q1: How can I emit an instruction such as mfence if the Python program is running on an x86/x64 CPU, or something like atomic_thread_fence() in LLVM IR?
Q2: How can I specify that some memory is volatile and should not be kept in a CPU register for optimization purposes?
CPython does not have a JIT - though it may do one day.
So your code is only converted into bytecode, which will be interpreted, and not into actual Intel/etc. machine code.
Additionally, Python has what's known as the GIL - Global Interpreter Lock - meaning that even if you have multiple Python threads in a process, only one of them can be interpreting code at once - though this may also change one day. Threads were frequently useful for doing I/O, because I/O operations are able to happen at the same time, but these days asyncio is a good competitor for doing that.
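For context, a minimal asyncio sketch of overlapping waits on a single thread (asyncio.sleep stands in for real I/O):

    import asyncio

    async def fetch(name, delay):
        await asyncio.sleep(delay)   # stand-in for a network/disk wait
        return name

    async def main():
        # Both waits overlap on one thread - no threads, no GIL contention.
        results = await asyncio.gather(fetch("a", 1), fetch("b", 1))
        print(results)               # ['a', 'b'] after ~1 second, not 2

    asyncio.run(main())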
So, in response to Q1, it doesn't make any sense to "put an mfence in Python code" (or the like).
Instead, what you probably want to do, if you want to enforce ordering constraints between one bit of code being executed and another, is use more high-level strategies, such as Python's threading.Lock, queue.Queue, or their equivalents in the asyncio world.
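For instance, a minimal sketch of enforcing ordering with queue.Queue (the producer/consumer names are made up for illustration):

    import queue
    import threading

    q = queue.Queue()

    def producer():
        for i in range(3):
            q.put(i)        # happens-before the corresponding get()
        q.put(None)         # sentinel: tells the consumer to stop

    def consumer():
        while True:
            item = q.get()  # blocks until the producer has put something
            if item is None:
                break
            print("got", item)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()

The queue's internal locking guarantees that anything the producer did before put() is visible to the consumer after the matching get(), which is exactly the kind of ordering guarantee the question is after.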
As for Q2, are you thinking of the C/C++ keyword volatile? This is frequently mistakenly thought of as a way to make memory access atomic for use in multithreading - it isn't. All that C/C++ volatile does is ensure that memory reads and writes happen exactly as specified rather than being possibly optimised out. What is your use case? There are all sorts of strategies one can use in Python to optimise code, but it's an impossible question to answer without knowing what you're actually trying to do.
Answers to comments
The CPU executes instructions, and somebody has to emit those instructions. By "JIT" I mean the part inside the Python interpreter that emits instructions at the end of the day.
CPython is an interpreter - it does not emit instructions. JITs do emit instructions, but as stated above, CPython does not have a JIT. When you run a Python script, CPython will compile your text-based .py file into bytecode, and then it will spend the rest of its time working through the bytecode and doing what the bytecode says. The only machine instructions being executed are those that are emitted by whoever compiled CPython.
If you compile a Python script to a .pyc and then execute that, CPython will do exactly the same, it will just skip the "compile your text-based .py file into bytecode" part as it's already done it - the result of that step is stored in the .pyc file.
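As an aside, you can inspect the bytecode CPython produces using the standard dis module:

    import dis

    def add(a, b):
        return a + b

    # Prints the bytecode instructions the CPython interpreter loop
    # will step through each time add() is called.
    dis.dis(add)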
I was a bit vague in my naming. Do you mean that in Python, the instruction is re-executed each time the interpreter encounters it?
A real CPU executing machine code will "re-execute" each instruction as it reads it, sure. CPython will do the same thing with Python bytecode. (CPython will execute a whole bunch of real machine code instructions each time it reads one Python instruction.)
Thanks, I have found this notice in https://docs.python.org/3/extending/extending.html: "CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once." OK, so what happens when a Python thread goes into native code via C/C++ bindings? Q1-A: can another Python thread execute during that time? Q1-B: what happens if I create another thread inside the C++ code?
Native code can release the GIL but must lock it again before returning to Python code.
Typically, native code that does some CPU-intensive work or does some I/O that requires waiting would release the GIL while it does that work or waits on that I/O. This allows Python code on another Python thread to run at the same time. But at no point does Python code run on two threads at once. That's why it makes no sense to put native ordering instructions and the like in Python code.
Native code that needs to use native ordering instructions for its own purposes will obviously do that, but that is C/C++ code, not Python code. If you want to know how to put native ordering instructions in a C/C++ Python extension, then look up how to do it in any C/C++ code - the fact that it's a Python extension is irrelevant.
Basically, either write Python code and use high-level strategies like what I mentioned above, or write a native extension in C/C++ and do whatever you need to do in that language.
I need to learn more about the GIL, and it seems there is a good study of the GIL by David Beazley: https://dabeaz.com/python/UnderstandingGIL.pdf. But @Keiji - you may be wrong about Q1: CPython threads seem to be real threads, and if the implementer of a C/C++ extension (which almost all libraries for Python are) decides to release the GIL, it's possible to do so... So Q1 still makes sense...
I've covered this above. Python code can't interact with native code in a way that would require putting native ordering instructions in Python code.
Back to the question - I mean volatile in the C++ sense of preventing compiler optimizations that keep a variable in a register. In C++ it guarantees neither atomicity nor a memory fence. So, regarding volatile: how can I specify it for an integer variable or a user-defined type?
If you want to make something in C/C++ be volatile, use the volatile keyword. If you're writing Python code, it doesn't make any sense to make something volatile.
About Python Threads:
Python threads are tricky, first of all. The interpreter uses real POSIX/WinAPI threads: a threading.Thread is a real OS thread under the hood.
The thread execution model is quite specific and has been described as "cooperative multitasking" by one enthusiast in the field (David Beazley, https://www.dabeaz.com/about.html).
As David Beazley explains in https://www.dabeaz.com/python/UnderstandingGIL.pdf:
When a thread blocks on I/O, it releases the global lock (the GIL) around the system call, so other threads can run while it waits.
Next, the Python VM has periodic "ticks": a CPU-bound thread is forced to a checkpoint roughly every 100 bytecode instructions in the old GIL that Beazley describes (since Python 3.2 this is a time-based switch interval, 5 ms by default).
At each such checkpoint, the running thread releases the GIL and all runnable threads compete to reacquire it.
There is no thread scheduler inside Python itself; which thread runs next after the GIL is released is left to the operating system.
Multithreading in Python can in fact hurt the performance of CPU-bound code.
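A quick way to see this for yourself (a sketch; the countdown name and loop size are arbitrary):

    import threading
    import time

    def countdown(n):
        # Pure CPU-bound work: never releases the GIL while running.
        while n > 0:
            n -= 1

    N = 10_000_000

    start = time.perf_counter()
    countdown(N)
    print("sequential: ", time.perf_counter() - start)

    start = time.perf_counter()
    t1 = threading.Thread(target=countdown, args=(N // 2,))
    t2 = threading.Thread(target=countdown, args=(N // 2,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print("two threads:", time.perf_counter() - start)

On CPython the two-thread version is typically no faster, and often slower, than the sequential one, because the threads merely take turns holding the GIL - this is essentially the experiment Beazley's slides are built around.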
I'm hoping someone can provide some insight as to what's fundamentally different about the Java Virtual Machine that allows it to implement threads nicely without the need for a Global Interpreter Lock (GIL), while Python necessitates such an evil.
Python (the language) doesn't need a GIL (which is why it can perfectly be implemented on JVM [Jython] and .NET [IronPython], and those implementations multithread freely). CPython (the popular implementation) has always used a GIL for ease of coding (esp. the coding of the garbage collection mechanisms) and of integration of non-thread-safe C-coded libraries (there used to be a ton of those around;-).
The Unladen Swallow project, among other ambitious goals, does plan a GIL-free virtual machine for Python -- to quote that site, "In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001)."
The JVM (at least HotSpot) does have a concept similar to the "GIL"; its lock granularity is just much finer. Most of this comes from HotSpot's GCs, which are more advanced.
In CPython it's one big lock (probably not strictly true, but good enough for argument's sake); in the JVM it's more spread about, with different concepts depending on where it is used.
Take a look at, for example, vm/runtime/safepoint.hpp in the HotSpot code, which is effectively a barrier. Once at a safepoint, the entire VM has stopped with regard to Java code, much like the Python VM stops at the GIL.
In the Java world such VM-pausing events are known as "stop-the-world"; at these points only native code that meets certain criteria keeps running, while the rest of the VM has been stopped.
Also, the lack of a coarse lock in Java makes JNI much more difficult to write, as the JVM makes fewer guarantees about its environment for FFI calls - one of the things that CPython makes fairly easy (although not as easy as using ctypes).
There is a comment further down in this blog post, http://www.grouplens.org/node/244, that hints at the reason why it was so easy to dispense with a GIL for IronPython and Jython: CPython uses reference counting, whereas the other two VMs have tracing garbage collectors.
I don't fully understand the exact mechanics of why this is so, but it does sound like a plausible reason.
In this link they have the following explanation:
... "Parts of the Interpreter aren't threadsafe, though mostly because making them all threadsafe by massive lock usage would slow single-threaded extremely (source). This seems to be related to the CPython garbage collector using reference counting (the JVM and CLR don't, and therefore don't need to lock/release a reference count every time). But even if someone thought of an acceptable solution and implemented it, third party libraries would still have the same problems."
Python lacks a JIT/AOT compiler, and in the time frame it was written, multicore processors didn't exist. Alternatively, you could rewrite everything in Julia, which has no GIL, and gain some speed boost over your Python code. Also, Jython kind of sucks: it's slower than both CPython and Java. If you want to stick with Python, consider using parallel-processing libraries; you won't get an instant speed boost, but you can do parallel programming with the right one.
The difference between Java and Python threading is:
Java is designed to lock on resources
Python is designed to lock the thread itself (the GIL)
So Python's implementation performs well on a machine with a single-core processor. That was fine 10-20 years ago; with today's increased computing capacity, the same piece of code performs very badly on a multiprocessor machine.
Is there any hack to disable the GIL and use resource locking in Python (like the Java implementation)?
P.S. My application is currently running on Python 2.7.12. It is compute-intensive, with little I/O or network blocking. Assume that I can't use multiprocessing for my use case.
I think the most straightforward way for you, which will also give you a nice performance increase, is to use Cython.
Cython is a Python superset that compiles Python-like code to C code (which makes use of the CPython API), and from there to an executable. It allows one to optionally type variables, which can then use native C types instead of Python objects - and it also allows one direct control of the GIL.
It supports a with nogil: statement, in which the with block runs with the GIL turned off - if there are other threads running (you use the normal Python threading library), they won't be blocked while code in the marked with block is running.
Just keep in mind that the GIL is there for a reason: it is thanks to it that global complex objects like lists and dictionaries work without the danger of getting into an inconsistent state between threads. But if your "nogil" blocks restrict themselves to local data structures, you should have no problems.
Check out the Cython project - and here is a specific example of turning off the GIL:
https://lbolla.info/blog/2013/12/23/python-threads-cython-gil
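For a flavour of what this looks like, a minimal Cython sketch (the busy_sum name is hypothetical; this is Cython, not plain Python, and must be compiled with cythonize):

    # busy.pyx - compile with Cython, then import from Python as usual
    def busy_sum(long n):
        cdef long i, total = 0
        with nogil:                 # GIL released: other Python threads keep running
            for i in range(n):      # only C-level locals are used inside the block
                total += i
        return total                # the GIL is reacquired when the block exits

Inside the nogil block you may only touch C-level data, not Python objects - which is exactly the "local data structures" restriction mentioned above.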
AFAIK, the use of shared_ptr is often discouraged because of potential bugs caused by careless usage of them (unless you have a really good explanation for significant benefit and carefully checked design).
On the other hand, Python objects seem to be essentially shared_ptrs (ref_count and garbage collection).
I am wondering what makes them work nicely in Python but potentially dangerous in C++. In other words, what are the differences between Python and C++ in dealing with shared_ptr that makes their usage discouraged in C++ but not causing similar problems in Python?
I know e.g. Python automatically detects cycles between objects which prevents memory leaks that dangling cyclic shared_ptrs can cause in C++.
"I know e.g. Python automatically detects cycles" -- that's what makes them work nicely, at least so far as the "potential bugs" relate to memory leaks.
Besides which, C++ programs are more commonly written under tight performance constraints than Python programs (which IMO is a combination of different genuine requirements with some fairly bogus differences in rules-of-thumb, but that's another story). A fairly high proportion of the Python objects I use don't strictly need reference counting, they have exactly one owner and a unique_ptr would be fine (or for that matter a data member of class type). In C++ it's considered (by the people writing the advice you're reading) worth taking the performance advantage and the explicitly simplified design. In Python it's usually not considered a problem, you pay the performance and you keep the flexibility to decide later that it's shared after all without any code change required (other than to take additional references that outlive the original, I mean).
Btw in any language, shared mutable objects have "potential bugs" associated with them, if you lose track of what objects will or won't change when you're not looking at them. I don't just mean race conditions: even in a single-threaded program you need to be aware that C++ Predicates shouldn't change anything and that you (often) can't mutate a container while iterating over it. I don't see this as a difference between C++ and Python, though. Rather, to some extent you should be slightly wary of shared objects in Python too, and when you proliferate references to an object at least understand why you're doing it.
So, on to the list of issues in the question you link to:
cyclic references -- as mentioned, Python rolls its sleeves up, finds them and frees them. For reasons to do with the design and specific uses of the languages, cycle-breaking garbage collection is rather difficult to implement in C++, although not impossible.
creating multiple unrelated shared_ptrs to the same object -- no analog is possible in Python, since the reference-counter isn't open to the user to mess up.
Constructing an anonymous temporary shared pointer -- doesn't arise in Python, there's no risk of a memory leak that way in Python since there's no "gap" in which the object exists but is not yet subject to collection if it becomes unreferenced.
Calling the get() function to get the raw pointer and use it after the pointed-to object goes out of scope -- well, you can mess this up if you're writing Python/C, but not in pure Python.
Passing a reference of or a raw pointer to a shared_ptr should be dangerous too, since it won't increment the internal count -- there's no means in Python to add a reference without the language taking care of the refcount (see the sketch after this list).
we passed 'this' to some thread workers instead of 'shared_from_this' -- in other words, forgot to create a shared_ptr when needed. Can't do this in Python.
most of the predicates you know and love from <functional> don't play nicely with shared_ptr -- Python refcounting is so built in to the runtime (or I suppose to be precise I should say: garbage collection is so built in to the language design) that there are no libraries that fail to cope with it.
Using shared_ptr for really small objects (like char or short) could be an overhead -- this issue exists in Python too, and Python programmers generally don't sweat it. If you need an array of "primitive type" then you can use numpy to reduce the overhead. Sometimes Python programs run out of memory and you need to do something about it, that's life ;-)
Giving out a shared_ptr< T > to this inside a class definition is also dangerous. Use enable_shared_from_this instead -- it may not be obvious, but this is "don't create multiple unrelated shared_ptrs to the same object" again.
You need to be careful when you use shared_ptr in multithread code -- it's possible to create race conditions in Python too, this is part of "shared mutable objects are tricksy".
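As an illustration of the point above about references always being tracked by the language, CPython exposes a read-only view of an object's count:

    import sys

    obj = object()
    print(sys.getrefcount(obj))  # e.g. 2: 'obj' plus the temporary argument reference

    alias = obj                  # binding another name bumps the count automatically
    print(sys.getrefcount(obj))  # one higher than before

    del alias                    # dropping a name decrements it again
    print(sys.getrefcount(obj))

You can observe the count, but there is no pure-Python way to increment or decrement it by hand, so the whole class of "forgot to take a reference" bugs cannot arise.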
Most of this is to do with the fact that in C++ you have to explicitly do something to get refcounting, and you don't get it if you don't ask for it. This provides several opportunities for error that Python doesn't make available to the programmer because it just does it for you. If you use shared_ptr correctly then apart from the existence of libraries that don't co-operate with it, none of these problems comes up in C++ either. Those who are cautious of using it for these reasons are basically saying they're afraid they'll use it incorrectly, or at any rate more afraid than that they'll misuse some alternative. Much of C++ programming is trading different potential bugs off against each other until you come up with a design that you consider yourself competent to execute. Furthermore it has "don't pay for what you don't need" as a design philosophy. Between these two factors, you don't do anything without a really good explanation, a significant benefit, and a carefully checked design. shared_ptr is no different ;-)
AFAIK, the use of shared_ptr is often discouraged because of potential bugs caused by careless usage of them (unless you have a really good explanation for significant benefit and carefully checked design).
I wouldn't agree. The tendency goes towards generally using these smart pointers unless you have a very good reason not to do so.
shared_ptr that makes their usage discouraged in C++ but not causing similar problems in Python?
Well, I don't know about your favourite largish signal-processing framework ecosystem, but GNU Radio uses shared_ptrs for all its blocks, which are the core elements of the GNU Radio architecture. In fact, blocks are classes with private constructors, which are only accessible via a friend make function that returns a shared_ptr. We haven't had problems with this -- and GNU Radio had good reason to adopt such a model. Now, nowhere do users end up using deallocated block objects, and not a single block is leaked. Nice!
Also, we use SWIG and a gateway class for a few C++ types that can't just be represented well as Python types. All this works very well on both sides, C++ and Python. In fact, it works so very well, that we can use Python classes as blocks in the C++ runtime, wrapped in shared_ptr.
Also, we never had performance problems. GNU Radio is a high rate, highly optimized, heavily multithreaded framework.
As the industry trends to "web scale" application architecture (as much as I hate buzz words), I know Python has caught a lot of criticism for how the GIL handles concurrency and becomes a bottleneck. I understand the problem on the surface, but not well enough to know how other procedural languages handle threads under the hood. Does Java have similar problems? C#? Ruby? If not, why hasn't Python adopted the same strategy?
The GIL exists because it's needed (mainly) for CPython's implementation of reference counting - its method of garbage collection. So let's be clear: Python doesn't have a GIL, the reference implementation does, and it's just an implementation detail.
The GIL exists because it makes the implementation simple and fast, and most of the time, it simply doesn't matter. Threading is mainly designed to allow access to slow resources alongside processing, which isn't hindered at all by the GIL.
The only reason that the GIL can be an issue is where one wants to do a lot of parallel computation. In this case, one can make an extension module in C or use the multiprocessing module to side-step the GIL.
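For CPU-bound work, a minimal multiprocessing sketch (the square function and its inputs are arbitrary stand-ins):

    import multiprocessing

    def square(n):
        return n * n  # stands in for real CPU-bound work

    if __name__ == "__main__":
        # Each worker is a separate process with its own interpreter
        # and its own GIL, so the work genuinely runs in parallel.
        with multiprocessing.Pool() as pool:
            print(pool.map(square, range(10)))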
All this means that the GIL really isn't an issue 99.9% of the time, and when it is, it's easily worked around. If you find it really hinders you, then you might want to try Jython, which is implemented on top of the JVM and uses a different method of garbage collection that doesn't require the GIL.
As always, premature optimization is a bad idea - if you develop something and find yourself hurt by the GIL, then there are ways to work around it without much pain. That said, it's highly unlikely you'll find it's a problem in the real-world. It's one of the most overblown things surrounding Python (maybe second only to the whole indentation thing).