Python, Ruby, Haskell - Do they provide true multithreading?


We are planning to write a highly concurrent application in one of the very-high-level programming languages.
1) Do Python, Ruby, or Haskell support true multithreading?
2) If a program contains threads, will a Virtual Machine automatically assign work to multiple cores (or to physical CPUs if there is more than 1 CPU on the mainboard)?
True multithreading = multiple independent threads of execution utilize the resources provided by multiple cores (not by only 1 core).
False multithreading = threads emulate multithreaded environments without relying on any native OS capabilities.

1) Do Python, Ruby, or Haskell support true multithreading?
This has nothing to do with the language. It is a question of the hardware (if the machine only has 1 CPU, it is simply physically impossible to execute two instructions at the same time), the Operating System (again, if the OS doesn't support true multithreading, there is nothing you can do) and the language implementation / execution engine.
Unless the language specification explicitly forbids or enforces true multithreading, this has absolutely nothing whatsoever to do with the language.
All the languages that you mention, plus all the languages that have been mentioned in the answers so far, have multiple implementations, some of which support true multithreading, some don't, and some are built on top of other execution engines which might or might not support true multithreading.
Take Ruby, for example. Here are just some of its implementations and their threading models:
MRI: green threads, no true multithreading
YARV: OS threads, no true multithreading
Rubinius: OS threads, true multithreading
MacRuby: OS threads, true multithreading
JRuby, XRuby: JVM threads, depends on the JVM (if the JVM supports true multithreading, then JRuby/XRuby does, too, if the JVM doesn't, then there's nothing they can do about it)
IronRuby, Ruby.NET: just like JRuby, XRuby, but on the CLI instead of on the JVM
See also my answer to another similar question about Ruby. (Note that that answer is more than a year old, and some of it is no longer accurate. Rubinius, for example, uses truly concurrent native threads now, instead of truly concurrent green threads. Also, since then, several new Ruby implementations have emerged, such as BlueRuby, tinyrb, Ruby Go Lightly, Red Sun and SmallRuby.)
Similar for Python:
CPython: native threads, no true multithreading
PyPy: native threads, depends on the execution engine (PyPy can run natively, or on top of a JVM, or on top of a CLI, or on top of another Python execution engine. Whenever the underlying platform supports true multithreading, PyPy does, too.)
Unladen Swallow: native threads, currently no true multithreading, but fix is planned
Jython: JVM threads, see JRuby
IronPython: CLI threads, see IronRuby
For Haskell, at least the Glorious Glasgow Haskell Compiler supports true multithreading with native threads. I don't know about UHC, LHC, JHC, YHC, HUGS or all the others.
For Erlang, both BEAM and HiPE support true multithreading with green threads.
2) If a program contains threads, will a Virtual Machine automatically assign work to multiple cores (or to physical CPUs if there is more than 1 CPU on the mainboard)?
Again: this depends on the Virtual Machine, the Operating System and the hardware. Also, some of the implementations mentioned above don't even have Virtual Machines.
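To make the "CPython: native threads, no true multithreading" entry above a bit more concrete, here is a minimal sketch, assuming CPython 3.8 or later (for threading.get_native_id). The helper name collect_native_thread_ids is made up for illustration. It only shows that each Python thread is backed by its own OS thread; it does not show that those threads run Python bytecode in parallel, which the GIL prevents in standard CPython builds.

```python
import os
import threading


def collect_native_thread_ids(n_threads=4):
    """Start n_threads Python threads and return the OS-level id of each."""
    ids = []
    ids_lock = threading.Lock()
    # The barrier keeps every thread alive until all have reported, so the
    # OS cannot recycle a finished thread's id for a later one.
    barrier = threading.Barrier(n_threads)

    def worker():
        native_id = threading.get_native_id()  # id assigned by the kernel
        with ids_lock:
            ids.append(native_id)
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


if __name__ == "__main__":
    print("cores visible to the OS:", os.cpu_count())
    print("native thread ids:", collect_native_thread_ids())
```

Whether the OS then places those threads on different cores is up to its scheduler, which is the point of the answer above.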

The Haskell implementation, GHC, supports multiple mechanisms for parallel execution on shared memory multicore. These mechanisms are described in "Runtime Support for Multicore Haskell".
Concretely, the Haskell runtime divides work between N OS threads, distributed over the available compute cores. These N OS threads in turn run M lightweight Haskell threads (sometimes millions of them). In turn, each Haskell thread can take work from a spark queue (there may be billions of sparks).
The runtime schedules work to be executed on separate cores, migrates work, and load balances. The garbage collector is also a parallel one, using each core to collect part of the heap.
Unlike Python or Ruby, there's no global interpreter lock, so for that and other reasons GHC does particularly well on multicore in comparison; see, e.g., Haskell vs. Python on the multicore shootout.

The GHC compiler will run your program on multiple OS threads (and thus multiple cores) if you compile with the -threaded option and then pass +RTS -N<x> -RTS at runtime, where <x> = the number of OS threads you want.

The current version of Ruby 1.9 (YARV, the C-based version) has native threads but suffers from a GIL. As far as I know, Python has the same GIL problem.
However, both Jython and JRuby (mature Java implementations of Python and Ruby, respectively) provide native multithreading, with no green threads and no GIL.
I don't know about Haskell.

Haskell is thread-capable, and in addition you get a pure functional language with no side effects.

For real concurrency, you probably want to try Erlang.

I second the choice of Erlang. Erlang can support distributed, highly concurrent programming out of the box. It does not matter whether you call it "multi-threading" or "multi-processing". Two important elements to consider are the level of concurrency and the fact that Erlang processes do not share state.
No shared state among processes is a good thing.

Haskell is suitable for anything.
Python has the processing module (included in the standard library as multiprocessing since Python 2.6), which (I think, not sure) helps to avoid GIL problems, so it is suitable for anything too.
But in my opinion, the best thing you can do for big and huge projects is to pick the highest-level language possible with a static type system. Today these languages are: OCaml, Haskell, Erlang.
If you want to develop something small, Python is good. But when things get bigger, all of Python's benefits are eaten by myriads of tests.
I haven't used Ruby. I still think that Ruby is a toy language. (Or at least there's no reason to learn Ruby when you know Python; better to read the SICP book.)

Related

Why does python not lock only the mutable data? [duplicate]

I'm hoping someone can provide some insight as to what's fundamentally different about the Java Virtual Machine that allows it to implement threads nicely without the need for a Global Interpreter Lock (GIL), while Python necessitates such an evil.

Python (the language) doesn't need a GIL (which is why it can perfectly be implemented on JVM [Jython] and .NET [IronPython], and those implementations multithread freely). CPython (the popular implementation) has always used a GIL for ease of coding (esp. the coding of the garbage collection mechanisms) and of integration of non-thread-safe C-coded libraries (there used to be a ton of those around;-).
The Unladen Swallow project, among other ambitious goals, does plan a GIL-free virtual machine for Python -- to quote that site, "In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001)."

The JVM (at least HotSpot) does have a similar concept to the "GIL"; it's just much finer in its lock granularity. Most of this comes from the GCs in HotSpot, which are more advanced.
In CPython it's one big lock (probably not quite true, but good enough for argument's sake); in the JVM it's more spread about, with different concepts depending on where it is used.
Take a look at, for example, vm/runtime/safepoint.hpp in the hotspot code, which is effectively a barrier. Once at a safepoint the entire VM has stopped with regard to java code, much like the python VM stops at the GIL.
In the Java world such VM pausing events are known as "stop-the-world". At these points only native code that is bound to certain criteria is free running; the rest of the VM has been stopped.
Also, the lack of a coarse lock in Java makes JNI much more difficult to write, as the JVM makes fewer guarantees about its environment for FFI calls, one of the things that CPython makes fairly easy (although not as easy as using ctypes).

There is a comment further down in this blog post http://www.grouplens.org/node/244 that hints at the reason why it was so easy to dispense with a GIL for IronPython or Jython: CPython uses reference counting, whereas the other 2 VMs have garbage collectors.
I don't get the exact mechanics of why this is so, but it does sound like a plausible reason.

In this link they have the following explanation:
... "Parts of the Interpreter aren't threadsafe, though mostly because making them all threadsafe by massive lock usage would slow single-threaded extremely (source). This seems to be related to the CPython garbage collector using reference counting (the JVM and CLR don't, and therefore don't need to lock/release a reference count every time). But even if someone thought of an acceptable solution and implemented it, third party libraries would still have the same problems."

Python lacks JIT/AOT compilation, and at the time it was written multicore processors didn't exist. Alternatively, you could rewrite everything in Julia, which has no GIL, and gain some speed over your Python code. Also, Jython kind of sucks: it's slower than both CPython and Java. If you want to stick with Python, consider using parallel plugins; you won't gain an instant speed boost, but you can do parallel programming with the right plugin.

Improving Python Threads Performance based on Resource Locking

The difference between Java and Python threads is:
Java is designed to lock on the resources
Python is designed to lock the thread itself (GIL)
So Python's implementation performs better on a machine with a single-core processor. That was fine 10-20 years ago. With increasing computing capacity, the same piece of code performs very badly on a multiprocessor machine.
Is there any hack to disable the GIL and use resource locking in Python (like the Java implementation)?
P.S. My application is currently running on Python 2.7.12. It is compute intensive with less I/O and network blocking. Assume that I can't use multiprocessing for my use case.

I think the most straightforward way for you, which will also give you a nice performance increase, is to use Cython.
Cython is a Python superset that compiles Python-like code to C code (which makes use of the CPython API), and from there to a native extension module. It lets you optionally type variables, which can then use native C types instead of Python objects, and it also gives you direct control of the GIL.
It supports a with nogil: statement in which the with block runs with the GIL released. If there are other threads running (you use the normal Python threading library), they won't be blocked while code is running in the marked with block.
Just keep in mind that the GIL is there for a reason: it is thanks to it that complex global objects like lists and dictionaries work without the danger of getting into an inconsistent state between threads. But if your "nogil" blocks restrict themselves to local data structures, you should have no problems.
Check the Cython project. Here is a specific example of turning off the GIL:
https://lbolla.info/blog/2013/12/23/python-threads-cython-gil
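As a rough illustration of the point about lists staying consistent between threads, here is a small plain-Python sketch (not Cython, and not from the linked post). The function names are hypothetical. It assumes CPython, where a single list.append call is atomic, and it contrasts that with a read-modify-write on a shared counter, which is not guaranteed to be atomic and therefore gets an explicit lock.

```python
import threading


def append_from_threads(n_threads=8, per_thread=10_000):
    """Append to one shared list from several threads with no explicit lock."""
    shared = []

    def worker():
        for i in range(per_thread):
            shared.append(i)  # a single list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(shared)


def count_with_lock(n_threads=8, per_thread=10_000):
    """Increment a shared counter; the read-modify-write takes its own lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(per_thread):
            with lock:  # the interpreter does not make `total += 1` atomic
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(append_from_threads())  # no appends are lost
    print(count_with_lock())      # no increments are lost
```

In other words, the interpreter keeps individual container operations consistent, but anything that spans several operations still needs its own locking, with or without nogil sections.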

Python socket I/O performance compared to other languages

I'm writing a Python program that has to work on Windows. The program has heavy threading and I/O and makes heavy use of sockets to send and receive data from remote locations. Other than that, it does some string manipulation using regular expressions.
My question is: performance wise, is python the best programming language for such a program, compared to for example Java, or C#? Is there another language that would better fit the description above?

Interesting question. The python modules that deal with sockets wrap the underlying OS functionality directly. Therefore, in a given operation, you are not likely to see any speed difference depending on the wrapper language.
Where you will notice speed issues with python is if you are involved in really tight looping, like looking at every character in a stream.
You did not indicate how much data you are sending. Unless you are undertaking a solution that has to maintain a huge volume of I/O, Python will likely do just fine. Implementing nginx or memcached or redis in Python... not as good of an idea.
And as always... benchmark. If it is fast enough, then why change?
PS. you the programmer will likely get it done faster in python!

Your requirements are:
to work on windows;
the program has heavy threading and I/O
it heavily uses sockets in its I/O to send and receive data
it has some string manipulation using regular expressions.
The reason it is hard to say definitively which is the best language for this task is that almost all languages match your requirements.
Windows: all languages of note
Heavy use of threads: C#, Java, C, C++, Haskell, Scala, Clojure, Erlang. Process-based threads or other workarounds: Ruby, Python, and other interpreted languages without true fine-grained concurrency.
Sockets: all languages of note
Regexes: all languages of note
The most interesting constraint is the need to do massive concurrent IO. This means your bottleneck is going to be in context switching, the cost of threads, and whether you can run thread pools on multiple cores. Depending on your scaling, you might want to use a compiled language, and one with lightweight threads, that can use multiple cores easily. That reduces the list to C++, Haskell, Erlang, Java, Scala, etc. You can probably work around the global interpreter lock in Python by using forked processes; it just won't be as fine-grained.

Parallel processing in Python à la Grand Central Dispatch?

Is there a way of doing parallel processing in Python using concepts similar to those of Apple's Grand Central Dispatch? Grand Central Dispatch looks, from the outset, like a nice way of handling parallel processing.
If Python does not have a mostly equivalent module, what are the fundamental concepts behind Grand Central Dispatch that could usefully be implemented in Python?
I don't know much about Grand Central Dispatch, hence this question: I would love to know whether Grand Central Dispatch uses paradigms that (1) are not yet available in Python, and/or (2) could be implemented in Python.

The main problem here is the compiler and OS part of GCD. In order to have GCD running, you need the compiler to understand Blocks. You could create something that works similarly when programming, but it wouldn't have the same performance at all. With GCD you can create and enqueue thousands of Blocks, and still there will only be 2 or 4 threads executing these blocks. If you implement the high-level functionality of Blocks without the compiler accepting them, the only way I see is to use threads for "simulating" Blocks. Then, using thousands of threads in a system with 2 to 4 CPU cores would be a spectacular performance mess, due to context switching and memory use.
Not only do you need the proper compiler extension to support GCD, you also need a proper OS extension to manage the GCD queues where Blocks are enqueued. You need the OS to control how many threads are executing, and when and how many of them to activate when CPU cores are available, for the program using GCD. With GCD, threads and queues are independent. The threads just grab Blocks from queues (light data structures), and from any of them. So it doesn't matter how many blocks there are, because they are only pieces of code and pointers stored somewhere in main memory.
You simply can't implement all these low-level features from Python. And by only implementing the high-level "GCD way of programming", you are going to make veeeery slow programs, or maybe even ones that are impossible to execute on a personal computer.
So first, something like Cython would have to support GCD, and so would the OS that you want to use. Linux has an implementation called libdispatch, available for Debian. But it only implements the compiler portion, so the program starts as many threads as there are cores on the system, minus one. So I think it is still not a good option. Someone should add Linux OS support for GCD, maybe as a kernel module.
Nothing to say about Windows. I really don't know.
So the first natural step should be to add and test support for GCD in Cython, for Mac OS. From there, you could write a native Python library that internally uses the Cython GCD library to offer blocks and queues to normal Python programmers.
Another option could be for the CPython project to embrace this, and for the Python project to add blocks and queues as a native feature of Python. That would be amazing XD

Python does not have an equivalent module, though twisted uses many of the same basic concepts (async APIs, callback-based). The Python multiprocessing module actually uses sub-processes rather than threads and is not particularly equivalent either. The best approach would probably be one similar to that taken by MacRuby, which is to create wrappers for the GCD APIs and use those. Unlike Python, of course, MacRuby was also designed not to have a GIL (Global Interpreter Lock); in Python the GIL will reduce the effectiveness of multi-threading, as the various interpreter threads hit the GIL at different times. Not much to do about that other than redesign the language, I'm afraid.
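The high-level idea described in the answers above (thousands of small blocks queued, a handful of threads draining the queue) can be approximated in plain Python with the standard library's concurrent.futures.ThreadPoolExecutor. This is only a sketch of the programming model, not of GCD itself: there is no compiler or kernel support, the helper name run_blocks is made up for illustration, and under CPython's GIL the blocks will not execute Python bytecode in parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_blocks(blocks, n_workers=4):
    """Run many small callables on a fixed pool of worker threads.

    Returns (results, worker_names): the results in submission order and
    the set of thread names that actually executed a block.
    """
    worker_names = set()
    names_lock = threading.Lock()

    def run_one(block):
        with names_lock:
            worker_names.add(threading.current_thread().name)
        return block()

    # The executor owns one internal queue; idle workers pull the next item.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(run_one, block) for block in blocks]
        results = [f.result() for f in futures]
    return results, worker_names


if __name__ == "__main__":
    # Closures stand in for GCD Blocks; i=i binds the current loop value.
    blocks = [lambda i=i: i * i for i in range(5000)]
    results, workers = run_blocks(blocks)
    print(len(results), "blocks ran on", len(workers), "threads")
```

The number of threads stays bounded by n_workers no matter how many blocks are queued, which is the property the first answer singles out.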

GIL in Python 3.1

Does anybody know the fate of the Global Interpreter Lock in Python 3.1 with regard to C++ multithreading integration?

The GIL is still there in CPython 3.1; the Unladen Swallow project aims (among many other performance boosts) to eventually remove it, but it's still a long way from its goals, and is working on 2.6 first with the intent of eventually porting to 3.x for whatever x will be current by the time the 2.y version is considered to be done. For now, multiprocessing (instead of threading) remains the way of choice for using multiple cores in CPython (IronPython and Jython are fine too, but they don't support Python 3 currently, nor do they make C++ integration all that easy either;-).

Significant changes will occur in the GIL for Python 3.2. Take a look at the What's New for Python 3.2, and the thread that initiated it in the mailing list.
While the changes don't signify the end of the GIL, they herald potentially enormous performance gains.
Update
The general performance gains from Antoine Pitrou's new GIL in 3.2 were negligible; the work instead focused on reducing contention issues that arise in certain corner cases.
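One visible piece of the 3.2 GIL rework is that the interpreter asks the running thread to give up the GIL after a time interval, exposed as sys.getswitchinterval() and sys.setswitchinterval(). The sketch below, assuming CPython 3.2 or later, temporarily changes that interval around a call and restores it afterwards. The helper name with_switch_interval is made up for illustration, and tuning this value is rarely needed in practice.

```python
import sys


def with_switch_interval(seconds, func):
    """Call func() with a temporary GIL switch interval, then restore it."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    print("current interval:", sys.getswitchinterval())
    print("inside helper:", with_switch_interval(0.001, sys.getswitchinterval))
    print("restored:", sys.getswitchinterval())
```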
An admirable effort by David Beazley was made to implement a scheduler to significantly improve performance when CPU and IO bound threads are mixed, which was unfortunately shot down.
The Unladen Swallow work was proposed for merging in Python 3.3, but this has been withdrawn due to lack of results in that project. PyPy is now the preferred project and is currently requesting funding to add Python3k support. There's very little chance that PyPy will become the default at present.
Efforts have been made for the last 15 years to remove the GIL from CPython but for the foreseeable future it is here to stay.

The GIL does not affect code that does not use Python objects. In NumPy, we release the GIL for computational code (linear algebra calls, etc.), and the underlying code can use multithreading freely (in fact, those are generally 3rd-party libraries which do not know anything about Python).
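To see the same effect without installing NumPy, here is a sketch that uses the standard library's hashlib as a stand-in (my substitution, not something the answer mentions). The hashlib documentation states that the GIL is released while hashing data larger than 2047 bytes, so the C hashing code for several large buffers can overlap across threads. The function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_hex(data):
    """Hash one buffer; for large inputs the C code runs without the GIL."""
    return hashlib.sha256(data).hexdigest()


def hash_all_threaded(buffers, n_workers=4):
    """Hash every buffer on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(sha256_hex, buffers))


if __name__ == "__main__":
    # Eight 16 MiB buffers: large enough that hashing dominates the runtime.
    buffers = [bytes([i]) * (16 * 1024 * 1024) for i in range(8)]
    for digest in hash_all_threaded(buffers):
        print(digest)
```

How much wall-clock time this saves depends on the machine and the Python build, so measure before relying on it.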

The GIL is a good thing.
Just make your C++ application release the GIL while it is doing its multithreaded work. Python code will continue to run in the other threads, unspoiled. Only acquire the GIL when you have to touch python objects.

I guess there will always be a GIL.
The reason is performance. Making all the low-level access thread-safe means putting a mutex around each hash operation etc., which is heavy. Remember that a simple statement like
self.foo(self.bar, 3, val)
might already involve at least 3 hashtable lookups (if val is a global), and maybe many more if the method cache is not hot (depending on the inheritance depth of the class).
It's expensive. That's why Java dropped the idea and introduced hashtables which do not use a monitor call, to get rid of its "Java Is Slow" trademark.

As I understand it, the "brainfuck" scheduler will replace the GIL from Python 3.2
BFS brainfuck scheduler

If the GIL is getting in the way, just use the multiprocessing module. It spawns new processes but uses the threading model and (most of the) api. In other words, you can do process-based parallelism in a thread-like way.
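A minimal sketch of that "thread-like" API, assuming a CPU-bound toy workload: multiprocessing.Pool and multiprocessing.pool.ThreadPool expose the same map interface, so switching between threads and processes is a one-name change. The names count_primes_below and map_with are made up for illustration.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def count_primes_below(limit):
    """CPU-bound toy workload: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def map_with(pool_cls, limits, n_workers=4):
    """Run count_primes_below over limits using the given pool class."""
    with pool_cls(n_workers) as pool:
        return pool.map(count_primes_below, limits)


if __name__ == "__main__":
    limits = [50_000] * 8
    print(map_with(ThreadPool, limits))  # threads: serialized by the GIL
    print(map_with(Pool, limits))        # processes: one interpreter each
```

The worker function is defined at module level and the process-based call sits under the __main__ guard, because on platforms that start child processes by spawning a fresh interpreter the main module is imported again in each child.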
