How have other languages overcome the limitations of Python's GIL?

As the industry trends toward "web scale" application architecture (as much as I hate buzzwords), I know Python has caught a lot of criticism for how the GIL handles concurrency and becomes a bottleneck. I understand the problem on the surface, but not well enough to know how other procedural languages handle threads under the hood. Does Java have similar problems? C#? Ruby? If not, why hasn't Python adopted the same strategy?

The GIL exists because it's needed (mainly) for CPython's implementation of reference counting - its method of garbage collection. So let's be clear: Python doesn't have a GIL, the reference implementation does, and it's just an implementation detail.
The GIL exists because it makes the implementation simple and fast, and most of the time, it simply doesn't matter. Threading is mainly designed to allow access to slow resources alongside processing, which isn't hindered at all by the GIL.
The only reason that the GIL can be an issue is where one wants to do a lot of parallel computation. In this case, one can make an extension module in C or use the multiprocessing module to side-step the GIL.
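For example, a minimal sketch of that second workaround (the function name and sizes here are purely illustrative): a pool of worker processes, each with its own interpreter and its own GIL, chewing on a pure-Python computation:

    # Side-stepping the GIL for CPU-bound work with multiprocessing.
    from multiprocessing import Pool

    def cpu_heavy(n):
        # Pure-Python loop; under threads this would serialize on the GIL.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool() as pool:  # defaults to one worker process per core
            results = pool.map(cpu_heavy, [2_000_000] * 4)
        print(results)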
All this means that the GIL really isn't an issue 99.9% of the time, and when it is, it's easily worked around. If you find it really hinders you, then you might want to try Jython, which is implemented on top of the JVM and uses a different method of garbage collection that doesn't require the GIL.
As always, premature optimization is a bad idea - if you develop something and find yourself hurt by the GIL, then there are ways to work around it without much pain. That said, it's highly unlikely you'll find it's a problem in the real-world. It's one of the most overblown things surrounding Python (maybe second only to the whole indentation thing).

Why does python not lock only the mutable data?

I'm hoping someone can provide some insight as to what's fundamentally different about the Java Virtual Machine that allows it to implement threads nicely without the need for a Global Interpreter Lock (GIL), while Python necessitates such an evil.
Python (the language) doesn't need a GIL (which is why it can perfectly well be implemented on the JVM [Jython] and .NET [IronPython], and those implementations multithread freely). CPython (the popular implementation) has always used a GIL for ease of coding (especially the coding of the garbage collection mechanisms) and of integration of non-thread-safe C-coded libraries (there used to be a ton of those around;-).
The Unladen Swallow project, among other ambitious goals, does plan a GIL-free virtual machine for Python -- to quote that site, "In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001)."
The JVM (at least HotSpot) does have a similar concept to the "GIL"; it's just much finer in its lock granularity. Most of this comes from HotSpot's GCs, which are more advanced.
In CPython it's one big lock (probably not strictly true, but good enough for argument's sake); in the JVM it's more spread about, with different concepts depending on where it is used.
Take a look at, for example, vm/runtime/safepoint.hpp in the HotSpot code, which is effectively a barrier. Once at a safepoint, the entire VM has stopped with regard to Java code, much like the Python VM stops at the GIL.
In the Java world such VM pausing events are known as "stop-the-world"; at these points only native code that meets certain criteria keeps running, while the rest of the VM has been stopped.
Also, the lack of a coarse lock in Java makes JNI much more difficult to write, as the JVM makes fewer guarantees about its environment for FFI calls - one of the things that CPython makes fairly easy (although not as easy as using ctypes).
There is a comment down below in this blog post http://www.grouplens.org/node/244 that hints at the reason why it was so easy to dispense with a GIL for IronPython and Jython: CPython uses reference counting, whereas the other two VMs have tracing garbage collectors.
I don't fully get the exact mechanics of why this is so, but it does sound like a plausible reason.
In this link they have the following explanation:
... "Parts of the Interpreter aren't threadsafe, though mostly because making them all threadsafe by massive lock usage would slow single-threaded extremely (source). This seems to be related to the CPython garbage collector using reference counting (the JVM and CLR don't, and therefore don't need to lock/release a reference count every time). But even if someone thought of an acceptable solution and implemented it, third party libraries would still have the same problems."
Python lacks a JIT/AOT compiler, and in the time frame it was written, multithreaded processors didn't exist. Alternatively, you could rewrite everything in the Julia language, which lacks a GIL, and gain some speed boost over your Python code. Also, Jython kind of sucks: it's slower than CPython and Java. If you want to stick with Python, consider using parallel plugins; you won't gain an instant speed boost, but you can do parallel programming with the right plugin.

Are Ruby Ractors the Same as Python's MultiProcessing module?

The Ruby 3.0 release has introduced Ractors, and the way they're presented, along with their examples, brings Python's multiprocessing module to mind.
So...
Are Ruby's Ractors just multiple processes in disguise, with the GIL still ruling over the threads?
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
Can Ractors be as fast as C/C++ threads and with low latency?
Thanks
Are Ruby's Ractors just multiple processes in disguise, with the GIL still ruling over the threads?
The Ractor specification does not prescribe any particular implementation strategy. It most certainly does not prescribe that an implementor must use OS processes. In fact, while that would be a pretty simple implementation because the OS does all the hard work for you, it would also be a pretty stupid implementation because Ractors are meant to be light-weight, which OS processes are typically not.
So, I expect that every implementor will choose their own most efficient implementation strategy. For example, I would expect TruffleRuby's and JRuby's implementation to be based on something like Kilim or Project Loom, Opal's implementation to be based on WebWorkers, Realms, and Promises, Artichoke's implementation to be based on Actix, Riker, or Axiom, and maybe MRuby's implementation might even be based on OS processes because of MRuby's focus on simplicity.
Right at this very moment, there does not exist any production-ready implementation of Ractors. In fact, there cannot be a production-ready implementation of Ractors, because the Ractor specification itself is still experimental, and thus not finalized.
The only implementation in existence right now is Koichi Sasada's original prototype which currently ships with YARV 3.0.0. This implementation does not implement Ractors as processes, it implements them as OS threads. YARV does not have a GIL, but it does have a per-Ractor GVL. So, only one thread of a Ractor can run at the same time, but multiple Ractors can each run one thread at the same time.
However, this is not a very optimized implementation, only a prototype. I would expect TruffleRuby's or JRuby's implementation to not have any sort of global lock. They never had one before, and Ractors don't share any data, so there simply is nothing to lock in the first place.
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
This comparison doesn't make much sense. First of all, Ractor is a specification with potentially multiple implementations, whereas to my understanding, Python's multiprocessing module is simply a way of starting multiple Python interpreters.
Secondly, Ractors are a language feature with specific language semantics.
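To make the multiprocessing side of that comparison concrete, here is a quick, purely illustrative sketch showing that multiprocessing workers are separate interpreter processes with distinct PIDs, each carrying its own GIL:

    import os
    from multiprocessing import Pool

    def report(_):
        # Each worker runs in its own OS process, hence its own PID.
        return os.getpid()

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(sorted(set(pool.map(report, range(8)))))  # several distinct PIDs

Ractors, by contrast, are specified only by their sharing semantics, so an implementation is free to use something far lighter than an OS process.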
Can Ractors be as fast as C/C++ threads and with low latency?
It's not quite clear what you mean by this. C (prior to C11) doesn't have threads in the language itself, so asking about C threads doesn't really make sense. C++ has threads, but just like Ractors, they are simply a specification with multiple possible implementations. It will simply depend on the particular implementation of Ractors and C++ threads.
It is certainly possible to implement Ractors using threads. The current YARV prototype is proof of that.
I found an article on FastRuby's website that explains the differences between Ractors and the other concurrency and parallelism features of Ruby.
The whole point was that they're not fast enough YET (30/12/2020) and are lagging behind fork and even threads so far. So the answers so far are:
No
Unfortunately, not YET (30/12/2020)😁
No😐 (Then again, not YET! But I'd really be happy if they finally could)

Python thread parallelization escaping the GIL

I'm rephrasing my question because I think many thought it was the question "does Python have threads". It does, but CPython also has the GIL, which will never schedule more than one thread at any given time. That makes CPython threads useless for CPU-intensive computations.
I need to use threads; process parallelism won't work for me because of the IPC costs (I have large shared objects).
I'm currently using Jython (no GIL) with JyNI so that I can use NumPy, which JyNI now supports. I got this to work, but JyNI is alpha and buggy, and the whole process is slow.
I've read a bunch of old threads. I wonder whether a viable option has appeared since then? I'm forced to use Python 2.7.
Thanks.
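For reference, the CPU-bound behavior the question describes is easy to reproduce; a minimal sketch (timings are machine-dependent, written so it also runs on Python 2.7):

    import threading
    import time

    def burn():
        s = 0
        for i in range(10 ** 7):
            s += i

    start = time.time()
    threads = [threading.Thread(target=burn) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # On CPython the four threads take roughly as long as running the
    # loops serially, because the GIL serializes pure-Python bytecode.
    print(time.time() - start)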
At the moment, Jython is still considerably slower than CPython. Depending on the program and how well the JIT can optimize it, multithreading might or might not pay off. Jython's primary design goal is compatibility before performance; it is mainly intended for glue code, and there is still a lot of potential for efficiency improvements. See e.g. ZipPy for a blazingly fast Python implementation in Java; however, it is experimental and lacks Jython's compatibility level. In a way it represents the opposite design goal.
Now, adding JyNI to Jython does not exactly make it faster, but so far I have found that performance optimization in JyNI would be premature, and usually the Jython part dominates the runtime anyway. Also, e.g. for NumPy, the native numerics workload vastly dominates the glue-code cost.
Finally, note that JyNI must emulate a GIL on the C side. For details have a look at the paper https://arxiv.org/abs/1607.00825. Maybe it will be possible to operate certain extensions without a GIL - it depends on implementation details and on how sensitive an extension is to them. Currently the C-side GIL is mandatory. That's why you might not benefit from Java multithreading when using NumPy. C extensions have the option to explicitly release the GIL, e.g. during computationally intense operations that don't interact with the interpreter. I don't know if NumPy makes use of this.
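For what it's worth, this is easy to probe from Python. A sketch under the assumption that the extension (here NumPy's BLAS-backed dot) releases the GIL around its native work:

    import threading
    import numpy as np

    a = np.random.rand(1500, 1500)

    def work():
        a.dot(a)  # if the GIL is released here, threads can overlap

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

If the four products overlap in wall-clock time, the extension releases the GIL; under Jython+JyNI the emulated C-side GIL may still serialize them, per the paper above.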
JyNI is alpha and buggy
Please make sure to report bugs at the issue tracker.

Python: When to use Threads vs. Multiprocessing

What are some good guidelines to follow when deciding whether to use threads or multiprocessing, in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in any of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course, the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and, I think, enable much cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
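The basic pattern looks like this - a minimal producer/consumer sketch (Python 3 module names; swap in multiprocessing.Process and multiprocessing.Queue for the process-based version):

    import queue
    import threading

    q = queue.Queue()

    def producer():
        for i in range(5):
            q.put(i)
        q.put(None)  # sentinel telling the consumer to stop

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            print("consumed", item)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()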
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back, and it is definitely worth watching. More info here, including a followup talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is not great, or is even worse than on a single core. But if your threads are doing a bunch of I/O (file read/write, DB access, socket read/write, etc.), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by giving each process its own Python interpreter (and its own GIL).

python threading: memory model and visibility

Does python threading expose issues of memory visibility and statement reordering as Java does? Since I can't find any reference to a "Python Memory Model" or anything like that, despite the fact that lots of people are writing multithreaded Python code, I'm guessing that these gotchas don't exist here. No volatile keyword, for instance. But it doesn't seem to be stated explicitly anywhere that, for instance, a change in a variable in one thread is immediately visible to all other threads.
Maybe this stuff is all very obvious to Python programmers, but as a fearful Java programmer, I require a little extra reassurance :)
There is no formal model for Python's threading (hey, after all, there wasn't one for Java's for years... hopefully, one will also eventually be written for Python).
In practice, no Python implementation performs any advanced optimization such as statement reordering or temporarily treating shared variables as thread-local ones - you can count on these semantic constraints even though they are not formally assured.
CPython in particular, as @Rawheiser mentions, uses a global interpreter lock; other implementations (PyPy, IronPython, Jython, ...) do not, so they can use multiple cores effectively with a threading model, while CPython requires multiprocessing for the same purpose. You should therefore not count on the GIL if you want to write code that's portable across Python implementations. In particular, you shouldn't count on the "atomicity" of operations that only happen to be atomic in CPython because of the GIL, such as dictionary accesses - in other Python implementations, multiple threads might modify a dict at once and cause errors unless you protect it with a lock or the like.
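In practical terms, a minimal sketch of the portable approach (the names here are illustrative): guard shared state with an explicit lock rather than relying on CPython's GIL-given atomicity:

    import threading

    shared = {}
    lock = threading.Lock()

    def update(key, value):
        with lock:  # explicit lock instead of relying on atomic dict ops
            shared[key] = shared.get(key, 0) + value

    threads = [threading.Thread(target=update, args=("hits", 1)) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared)  # {'hits': 10} on any Python implementation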
