Why does python not lock only the mutable data? [duplicate] - python

I'm hoping someone can provide some insight as to what's fundamentally different about the Java Virtual Machine that allows it to implement threads nicely without the need for a Global Interpreter Lock (GIL), while Python necessitates such an evil.

Python (the language) doesn't need a GIL (which is why it can be implemented perfectly well on the JVM [Jython] and .NET [IronPython], and those implementations multithread freely). CPython (the popular implementation) has always used a GIL for ease of coding (especially the coding of the garbage-collection mechanisms) and ease of integrating non-thread-safe C-coded libraries (there used to be a ton of those around ;-).
The Unladen Swallow project, among other ambitious goals, does plan a GIL-free virtual machine for Python -- to quote that site, "In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001)."

The JVM (at least HotSpot) does have a concept similar to the GIL; it is just much finer in its lock granularity. Most of this comes from HotSpot's garbage collectors, which are more advanced.
In CPython it's one big lock (probably not entirely true, but good enough for argument's sake); in the JVM it is more spread about, with different concepts depending on where it is used.
Take a look at, for example, vm/runtime/safepoint.hpp in the HotSpot code, which is effectively a barrier. Once at a safepoint the entire VM has stopped with regard to Java code, much like the Python VM stops at the GIL.
In the Java world such VM-pausing events are known as "stop-the-world"; at these points only native code that meets certain criteria keeps running, while the rest of the VM is stopped.
Also, the lack of a coarse lock in Java makes JNI much more difficult to write, since the JVM makes fewer guarantees about its environment for FFI calls. That is one of the things CPython makes fairly easy (although not as easy as using ctypes).

There is a comment further down in this blog post, http://www.grouplens.org/node/244, that hints at why it was so easy to dispense with a GIL for IronPython and Jython: CPython uses reference counting, whereas the other two VMs use garbage collectors.
I don't fully understand the exact mechanics of why this is so, but it does sound like a plausible reason.

In this link they have the following explanation:
... "Parts of the Interpreter aren't threadsafe, though mostly because making them all threadsafe by massive lock usage would slow single-threaded extremely (source). This seems to be related to the CPython garbage collector using reference counting (the JVM and CLR don't, and therefore don't need to lock/release a reference count every time). But even if someone thought of an acceptable solution and implemented it, third party libraries would still have the same problems."

CPython lacks a JIT/AOT compiler, and in the time frame it was written multi-core processors didn't exist. Alternatively, you could rewrite everything in Julia, which has no GIL, and gain some speed boost over your Python code. Also, Jython kind of sucks: it's slower than both CPython and Java. If you want to stick to Python, consider using parallel libraries; you won't gain an instant speed boost, but you can do parallel programming with the right one.

Related

Python thread parallelization escaping the GIL

I'm rephrasing my question because I think many thought it was the question "does Python have threads". It does, but CPython also has the GIL, which never lets more than one thread execute Python bytecode at any given time. That makes CPython threads useless for CPU-intensive computations.
I need to use threads; process parallelism won't work for me because of the IPC costs (I have large shared objects).
I'm currently using Jython (no GIL) with JyNI so that I can use numpy. JyNI is alpha, but it does now support numpy. I got this to work. However, JyNI is alpha and buggy, and the whole process is slow.
I've read a bunch of old threads. I wonder whether there has been a viable option since then? I'm forced to use python 2.7.
Thanks.
At the moment, Jython is still considerably slower than CPython. Depending on the program and how well the JIT can optimize it, multithreading might or might not pay off. Jython's primary design goal is compatibility, before performance. It is mainly intended for glue code and there is still a lot of potential for efficiency improvements. See e.g. zippy for a blazingly fast Python implementation in Java, however it is experimental and lacks Jython's compatibility level. In a way this represents the opposite design goal.
Adding JyNI to Jython does not exactly make it faster, but so far I have found that performance optimization in JyNI would be premature; usually the Jython part dominates the runtime anyway. Also, e.g. for NumPy, the native numerics workload vastly dominates the glue-code cost.
Finally, note that JyNI must emulate a GIL on the C side. For details have a look at the paper https://arxiv.org/abs/1607.00825. Maybe it will be possible to operate certain extensions without a GIL - it depends on implementation details and on how sensitive an extension is to that. Currently the C-side GIL is mandatory. That's why you might not benefit from Java multithreading when using NumPy. C extensions have the option to explicitly release the GIL, e.g. during computationally intense operations that don't interact with the interpreter. I don't know if NumPy makes use of this.
JyNI is alpha and buggy
Please make sure to report bugs at the issue tracker.

How have other languages overcame the limitations of Python's GIL?

As the industry trends to "web scale" application architecture (as much as I hate buzz words), I know Python has caught a lot of criticism for how the GIL handles concurrency and becomes a bottleneck. I understand the problem on the surface, but not well enough to know how other procedural languages handle threads under the hood. Does Java have similar problems? C#? Ruby? If not, why hasn't Python adopted the same strategy?
The GIL exists because it's needed (mainly) for CPython's implementation of reference counting - its method of garbage collection. So let's be clear: Python doesn't have a GIL, the reference implementation does, and it's just an implementation detail.
The GIL exists because it makes the implementation simple and fast, and most of the time, it simply doesn't matter. Threading is mainly designed to allow access to slow resources alongside processing, which isn't hindered at all by the GIL.
The only reason that the GIL can be an issue is where one wants to do a lot of parallel computation. In this case, one can make an extension module in C or use the multiprocessing module to side-step the GIL.
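As a rough illustration of the multiprocessing route (the worker function and pool size below are made up for the example), a CPU-bound loop that would serialize behind the GIL under threading can be spread across cores like this:

from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python arithmetic: with threads this would serialize behind the GIL,
    # but each pool worker is a separate process with its own interpreter and GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [10**6] * 8)
    print(results)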
All this means that the GIL really isn't an issue 99.9% of the time, and when it is, it's easily worked around. If you find it really hinders you, then you might want to try Jython, which is implemented on top of the JVM and uses a different method of garbage collection that doesn't require the GIL.
As always, premature optimization is a bad idea - if you develop something and find yourself hurt by the GIL, then there are ways to work around it without much pain. That said, it's highly unlikely you'll find it's a problem in the real-world. It's one of the most overblown things surrounding Python (maybe second only to the whole indentation thing).

python threading: memory model and visibility

Does python threading expose issues of memory visibility and statement reordering as Java does? Since I can't find any reference to a "Python Memory Model" or anything like that, despite the fact that lots of people are writing multithreaded Python code, I'm guessing that these gotchas don't exist here. No volatile keyword, for instance. But it doesn't seem to be stated explicitly anywhere that, for instance, a change in a variable in one thread is immediately visible to all other threads.
Maybe this stuff is all very obvious to Python programmers, but as a fearful Java programmer, I require a little extra reassurance :)
There is no formal model for Python's threading (hey, after all, there wasn't one for Java's for years... hopefully, one will also eventually be written for Python).
In practice, no Python implementation performs any advanced optimization such as statement reordering or temporarily treating shared variables as thread-local ones -- and you can count on these semantic constraints even though they are not formally assured.
CPython in particular, as @Rawheiser mentions, uses a global interpreter lock; other implementations (PyPy, IronPython, Jython, ...) do not (so they can use multiple cores effectively with a threading model, while CPython requires multiprocessing for the same purpose), so you should not count on that if you want to write code that's portable across Python implementations. (For example, you shouldn't count on the "atomicity" of operations that only happen to be atomic in CPython because of the GIL, such as dictionary accesses -- in other Python implementations, multiple threads might be modifying a dict at once and cause errors unless you protect the dict with a lock or the like.)
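For example, if several threads update a shared dict, the portable approach is an explicit lock around the compound update rather than relying on the GIL's incidental atomicity; a minimal sketch:

import threading

shared_counts = {}
lock = threading.Lock()

def record(key):
    # Guard the read-modify-write: on CPython the GIL happens to make single
    # bytecode operations atomic, but this compound update is not, and other
    # implementations make no such promise at all.
    with lock:
        shared_counts[key] = shared_counts.get(key, 0) + 1

threads = [threading.Thread(target=record, args=("hits",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_counts)   # {'hits': 10}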

Python on a Real-Time Operating System (RTOS)

I am planning to implement a small-scale data acquisition system on an RTOS platform. (Either on a QNX or an RT-Linux system.)
As far as I know, these jobs are performed using C/C++ to get the most out of the system. However, before I blindly jump into coding, I am curious to hear experienced people's opinions on whether it would be feasible and wiser to write everything in Python (from low-level instrument interfacing through a shiny graphical user interface), to mix in C for the timing-critical parts of the design, or to write everything in C without a single line of Python.
Or, at the very least, to wrap the C code with Python to provide easier access to the system.
Which way would you advise me to go? I would be glad if you could point me to similar design cases and further reading as well.
Thank you
NOTE 1: The reason for emphasizing QNX is that we already have a QNX 4.25-based data acquisition system (M300) for our atmospheric measurement experiments. This is a proprietary system and we can't access its internals. Looking further at QNX might be advantageous for us, since 6.4 has a free academic licensing option, comes with Python 2.5, and ships a recent GCC version. I have never tested an RT-Linux system and don't know how comparable it is to QNX in terms of stability and efficiency, but I know that the whole Python ecosystem, and the non-Python tools (like Google Earth) the new system could be built on, mostly works out of the box there.
I've built several all-Python soft real-time (RT) systems, with primary cycle times from 1 ms to 1 second. There are some basic strategies and tactics I've learned along the way:
Use threading/multiprocessing only to offload non-RT work from the primary thread, where queues between threads are acceptable and cooperative threading is possible (no preemptive threads!).
Avoid the GIL. This basically means not only avoiding threading, but also avoiding system calls to the greatest extent possible, especially during time-critical operations, unless they are non-blocking.
Use C modules when practical. Things (usually) go faster with C! But mainly if you don't have to write your own: Stay in Python unless there really is no alternative. Optimizing C module performance is a PITA, especially when translating across the Python-C interface becomes the most expensive part of the code.
Use Python accelerators to speed up your code. My first RT Python project greatly benefited from Psyco (yeah, I've been doing this a while). One reason I'm staying with Python 2.x today is PyPy: Things always go faster with LLVM!
Don't be afraid to busy-wait when critical timing is needed. Use time.sleep() to 'sneak up' on the desired time, then busy-wait during the last 1-10 ms (see the sketch after this list). I've been able to get repeatable performance with self-timing on the order of 10 microseconds. Be sure your Python task is run at max OS priority.
Numpy ROCKS! If you are doing 'live' analytics or tons of statistics, there is NO way to get more work done faster and with less work (less code, fewer bugs) than by using Numpy. Not in any other language I know of, including C/C++. If the majority of your code consists of Numpy calls, you will be very, very fast. I can't wait for the Numpy port to PyPy to be completed!
Be aware of how and when Python does garbage collection. Monitor your memory use, and force GC when needed (see the sketch after this list). Be sure to explicitly disable GC during time-critical operations. All of my RT Python systems run continuously, and Python loves to hog memory. Careful coding can eliminate almost all dynamic memory allocation, in which case you can completely disable GC!
Try to perform processing in batches to the greatest extent possible. Instead of processing data at the input rate, try to process data at the output rate, which is often much slower. Processing in batches also makes it more convenient to gather higher-level statistics.
Did I mention using PyPy? Well, it's worth mentioning twice.
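To make the busy-wait and GC tips above concrete, here is a minimal sketch (written for a current Python 3 with time.monotonic; the 1 ms margin, 10 ms period, and cycle count are illustrative values, not ones from the original system):

import gc
import time

def wait_until(deadline, margin=0.001):
    # Sleep most of the way to `deadline`, then busy-wait the last ~1 ms.
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        if remaining > margin:
            time.sleep(remaining - margin)   # coarse, cheap wait
        # otherwise fall through and spin; the loop itself is the busy-wait

def rt_loop(period=0.01, cycles=100):
    gc.disable()                  # no collector pauses inside the loop
    next_tick = time.monotonic()
    try:
        for _ in range(cycles):
            next_tick += period
            # ... time-critical work goes here ...
            wait_until(next_tick)
    finally:
        gc.enable()
        gc.collect()              # force a collection at a safe moment

rt_loop()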
There are many other benefits to using Python for RT development. For example, even if you are fairly certain your target language can't be Python, it can pay huge benefits to develop and debug a prototype in Python, then use it as both a template and test tool for the final system. I had been using Python to create quick prototypes of the "hard parts" of a system for years, and to create quick'n'dirty test GUIs. That's how my first RT Python system came into existence: The prototype (+Psyco) was fast enough, even with the test GUI running!
-BobC
Edit: Forgot to mention the cross-platform benefits: My code runs pretty much everywhere with a) no recompilation (or compiler dependencies, or need for cross-compilers), and b) almost no platform-dependent code (mainly for misc stuff like file handling and serial I/O). I can develop on Win7-x86 and deploy on Linux-ARM (or any other POSIX OS, all of which have Python these days). PyPy is primarily x86 for now, but the ARM port is evolving at an incredible pace.
I can't speak for every data acquisition setup out there, but most of them spend most of their "real-time operations" waiting for data to come in -- at least the ones I've worked on.
Then when the data does come in, you need to immediately record the event or respond to it, and then it's back to the waiting game. That's typically the most time-critical part of a data acquisition system. For that reason, I would generally say stick with C for the I/O parts of the data acquisition, but there aren't any particularly compelling reasons not to use Python on the non-time-critical portions.
If you have fairly loose requirements -- only needs millisecond precision, perhaps -- that adds some more weight to doing everything in Python. As far as development time goes, if you're already comfortable with Python, you would probably have a finished product significantly sooner if you were to do everything in Python and refactor only as bottlenecks appear. Doing the bulk of your work in Python will also make it easier to thoroughly test your code, and as a general rule of thumb, there will be fewer lines of code and thus less room for bugs.
If you need to specifically multi-task (not multi-thread), Stackless Python might be beneficial as well. It's like multi-threading, but the threads (or tasklets, in Stackless lingo) are not OS-level threads, but Python/application-level, so the overhead of switching between tasklets is significantly reduced. You can configure Stackless to multitask cooperatively or preemptively. The biggest downside is that blocking IO will generally block your entire set of tasklets. Anyway, considering that QNX is already a real-time system, it's hard to speculate whether Stackless would be worth using.
My vote would be to take the as-much-Python-as-possible route -- I see it as low cost and high benefit. If and when you do need to rewrite in C, you'll already have working code to start from.
Generally the reason advanced against using a high-level language in a real-time context is uncertainty -- when you run a routine one time it might take 100us; the next time you run the same routine it might decide to extend a hash table, calling malloc, then malloc asks the kernel for more memory, which could do anything from returning instantly to returning milliseconds later to returning seconds later to erroring, none of which is immediately apparent (or controllable) from the code. Whereas theoretically if you write in C (or even lower) you can prove that your critical paths will "always" (barring meteor strike) run in X time.
Our team have done some work combining multiple languages on QNX and had quite a lot of success with the approach. Using python can have a big impact on productivity, and tools like SWIG and ctypes make it really easy to optimize code and combine features from the different languages.
However, if you're writing anything time-critical, it should almost certainly be written in C. Doing this means you avoid the implicit costs of an interpreted language, like the GIL (Global Interpreter Lock) and contention on many small memory allocations. Both of these things can have a big impact on how your application performs.
Also, Python on QNX tends not to be 100% compatible with other distributions (i.e. there are sometimes modules missing).
One important note: Python for QNX is generally available only for x86.
I'm sure you can compile it for ppc and other archs, but that's not going to work out of the box.

GIL in Python 3.1

Does anybody know the fate of the Global Interpreter Lock in Python 3.1 with regard to C++ multithreading integration?
The GIL is still there in CPython 3.1; the Unladen Swallow project aims (among many other performance boosts) to eventually remove it, but it is still a way from its goals, and is working on 2.6 first with the intent of eventually porting to 3.x for whatever x will be current by the time the 2.y version is considered done. For now, multiprocessing (instead of threading) remains the way of choice for using multiple cores in CPython (IronPython and Jython are fine too, but they don't currently support Python 3, nor do they make C++ integration all that easy either ;-).
Significant changes will occur in the GIL for Python 3.2. Take a look at the What's New for Python 3.2, and the thread that initiated it in the mailing list.
While the changes don't signify the end of the GIL, they herald potentially enormous performance gains.
Update
The general performance gains from the new GIL in 3.2 by Antoine Pitrou were negligible; the work instead focused on improving the contention issues that arise in certain corner cases.
An admirable effort by David Beazley was made to implement a scheduler to significantly improve performance when CPU and IO bound threads are mixed, which was unfortunately shot down.
The Unladen Swallow work was proposed for merging in Python 3.3, but this has been withdrawn due to lack of results in that project. PyPy is now the preferred project and is currently requesting funding to add Python3k support. There's very little chance that PyPy will become the default at present.
Efforts have been made for the last 15 years to remove the GIL from CPython but for the foreseeable future it is here to stay.
The GIL will not affect your code if it does not use Python objects. In NumPy, we release the GIL for computational code (linear algebra calls, etc.), and the underlying code can use multithreading freely (in fact, those are generally third-party libraries which do not know anything about Python).
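For instance (assuming NumPy is installed; the array sizes and thread count are arbitrary), several Python threads calling into NumPy's linear algebra can genuinely overlap, because the heavy BLAS work runs with the GIL released:

import threading
import numpy as np

def worker(n=2000):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    # np.dot dispatches to compiled BLAS code; the GIL is released for the
    # duration, so other Python threads can run in parallel with it.
    np.dot(a, b)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()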
The GIL is a good thing.
Just make your C++ application release the GIL while it is doing its multithreaded work. Python code will continue to run in the other threads, unspoiled. Only acquire the GIL when you have to touch Python objects.
I guess there will always be a GIL.
The reason is performance. Making all the low-level access thread-safe, which means putting a mutex around each hash operation and so on, is heavy. Remember that a simple statement like
self.foo(self.bar, 3, val)
might already involve at least 3 hash-table lookups (if val is a global), and maybe even many more if the method cache is not hot (depending on the inheritance depth of the class).
That's expensive, which is why Java dropped the idea and introduced hash tables that do not use a monitor call, in order to shed its "Java Is Slow" reputation.
As I understand it, the "brainfuck" scheduler was proposed to replace the GIL from Python 3.2.
BFS (Brain Fuck Scheduler)
If the GIL is getting in the way, just use the multiprocessing module. It spawns new processes but uses the threading model and (most of) its API. In other words, you can do process-based parallelism in a thread-like way.
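A small sketch of that thread-like API (the worker function here is just for illustration): swapping threading.Thread for multiprocessing.Process barely changes the code, but each worker runs in its own interpreter with its own GIL:

from multiprocessing import Process, Queue

def square(n, out):
    out.put(n * n)          # same .start()/.join() lifecycle as threading.Thread

if __name__ == "__main__":
    results = Queue()
    workers = [Process(target=square, args=(i, results)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(sorted(results.get() for _ in range(4)))   # [0, 1, 4, 9]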
