Parallelism in Julia: Native Threading Support

In their arXiv paper, the original authors of Julia mention the following:
2.14 Parallelism.
Parallel execution is provided by a message-based multi-processing system implemented in Julia in the standard library. The language design supports the implementation of such libraries by providing symmetric coroutines, which can also be thought of as cooperatively scheduled threads. This feature allows asynchronous communication to be hidden inside libraries, rather than requiring the user to set up callbacks. Julia does not currently support native threads, which is a limitation, but has the advantage of avoiding the complexities of synchronized use of shared memory.
What do they mean by saying that Julia does not support native threads? What is a native thread?
Do other interpreted languages such as Python or R support this type of parallelism? Is Julia alone in this?

Update
When this question was asked in 2013, Julia indeed had no support for multithreading. Today, however, Julia supports native threading with what has emerged as the best language model for composable thread programming. This model was pioneered by Cilk and Intel's Threading Building Blocks, pushed further at the language level by Go, and is now also used by Julia. The statements in the original answer below about other dynamic languages remain true: they still do not support native threads that execute user code in parallel. The history of adding threading capabilities to Julia progressed in the following high-level stages:
Julia 0.5: support for native threads with an OpenMP-like compute model (i.e. parallel for loops). This functionality was highly limited: no I/O or networking was allowed in parallel code.
Julia 1.3: fully composable high performance M:N threading. This threading model is the same as Go (and Cilk and Intel TBB), where tasks are used to express potential concurrency, and those tasks are run on threads from a pool of native threads by a scheduler.
Julia 1.7: support for migration of tasks between native threads. This allows a task to begin execution on one native thread, get suspended, and then resume on a different thread, allowing full utilization of available compute resources.
Original Answer
"Native threads" are separate contexts of execution, managed by the operating system kernel, accessing a shared memory space and potentially executing concurrently on separate cores. Compare this with separate processes, which may execute concurrently on multiple cores but have separate memory spaces. Making sure that processes interact nicely is easy since they can only communicate with each other via the kernel. Ensuring that threads don't interact in unpredictable, buggy ways is very hard since they can read and write to the same memory in an unrestricted manner.
The R situation is fairly straightforward: R is not multithreaded. Python is a little more complicated: Python does support threading, but due to the global interpreter lock (GIL), no actual concurrent execution of Python code is possible. Other popular open source dynamic languages are in various mixed states with respect to native threading (Ruby: no/kinda/yes?; Node.js: no), but in general, the answer is no, they do not support fully concurrent native threading, so Julia is not alone in this.
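To see the GIL's effect directly, here is a rough sketch (assuming a standard CPython build with the GIL): two CPU-bound threads take about as long as doing the work twice serially, while two processes can actually occupy two cores.

    import time
    import threading
    import multiprocessing

    def burn(n=10_000_000):
        while n:
            n -= 1

    def timed(worker_type):
        workers = [worker_type(target=burn) for _ in range(2)]
        start = time.perf_counter()
        for w in workers: w.start()
        for w in workers: w.join()
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("threads:   %.2fs" % timed(threading.Thread))        # no speedup: the GIL serializes them
        print("processes: %.2fs" % timed(multiprocessing.Process)) # ~2x faster with two free cores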
When we do add shared-memory parallelism to Julia, as we plan to – whether using native threads or multiple processes with shared memory – it will be true concurrency and there will be no GIL preventing simultaneous execution of Julia code. However, this is an incredibly tricky feature to add to a language, as attested by the non-existent or limited support in other very popular, mature dynamic languages. Adding a shared-memory concurrency model is technically difficult, but the real problem is designing a programming model that will allow programmers to make effective use of hardware concurrency in a productive and safe way. This problem is generally unsolved and is a very active area of research and experimentation – there is no "gold standard" to copy. We could just add POSIX threads support, but that programming model is generally considered to be dangerous and incredibly difficult to use correctly and effectively. Go has an excellent concurrency story, but it is designed for writing highly concurrent servers, not for concurrently operating on large data, so it's not at all clear that simply copying Go's model is a good idea for Julia.

Related

Parallel threading python GIL vs Java

I know that Python has a GIL that makes threads unable to run at the same time, so threading is just context switching.
Why is Java different?

1. Threads on the same CPU cannot run in parallel in any language.
2. Does creating a new thread in Java utilize cores on a multi-core machine?
3. Can Python only spawn threads on the same CPU, in contrast to Java?
4. If 1. is the case, then when using more threads than CPUs, does even Java come back to context switching for several of them?
5. If 1. is the case, then how does it differ from multiprocessing, since utilizing multiple cores isn't guaranteed?
6. Isn't the whole point of threading being able to use the same memory space? If Java does run some threads in parallel, how do they really share memory?

Thank you
Why is Java different?
Because it is able to effectively use multiple cores at the same time.
Does creating a new thread in Java utilize cores on a multi-core machine?
Yes.
Can Python only spawn threads on the same CPU, in contrast to Java?
Java can spawn multiple threads which will run on different CPUs. Java is not responsible for the actual thread scheduling; that is handled by the OS, and the OS may reschedule a thread to a different CPU from the one that it started on.
I am not sure about the precise details for Python, but I think the GIL is an implementation detail rather than something that is intrinsic to the language itself1. But either way, in a Python implementation, the GIL means that you would get little performance benefit in spawning threads on multiple cores. As this page says:
"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter."
If 1. is the case, when using more threads than CPUs does it come back to context switching in Java?
It depends. When switching a CPU between threads belonging to different processes, a full context switch is involved. But when switching between threads in the same process, only the (user) registers need to be switched. (The virtual memory registers and caches don't need to be switched / flushed because the threads share the same virtual address space.)
If 1. is the case, then how does it differ from multiprocessing, since utilizing multiple cores isn't guaranteed?
The key difference between multi-threading and multi-processing is that processes do not share any memory. By contrast, one thread in a process can see the memory of all of the others ... modulo issues of when changes are visible.
This difference has a variety of consequences.
Isn't the whole point of threading being able to use the same memory space?
Yes, that is the main point ... when you compare multi-threading with multi-processing.
If Java does run some of them in multiple threads for parallelism ...
Java supports threads for many reasons. Parallelism is only one of those reasons. Others include multiplexing I/O and simplifying certain kinds of programming problem. These other reasons are also relevant to Python.
... how do [Java threads] really share memory?
The hardware deals with the issues of making the physical memory visible to all of the threads, and propagation of changes via the memory caches. It is complicated.
In Java the onus is on the programmer to "do the right thing" when threads make use of shared variables / objects. You need to use volatile variables, or synchronized blocks / methods, or something else that ensures that there is a happens-before chain between a write and a subsequent read. (Otherwise you can get issues with changes not being visible.)
This transfer of responsibility to the programmer allows the compiler to generate code with fewer main memory operations ... and hence that is faster. The downside is that if an application doesn't obey the rules, it is liable to behave in unexpected ways.
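The same "do the right thing" burden exists for Python threads as well. As a rough Python analogue of the discipline described above (a sketch; Java itself is the subject here): an unsynchronized read-modify-write on shared state can lose updates, because += is not atomic even under the GIL, and a lock restores the required ordering.

    import threading

    counter = 0
    lock = threading.Lock()

    def unsafe_inc(n):
        global counter
        for _ in range(n):
            counter += 1          # read, add, write: threads can interleave here

    def safe_inc(n):
        global counter
        for _ in range(n):
            with lock:            # one thread at a time through this section
                counter += 1

    if __name__ == "__main__":
        threads = [threading.Thread(target=safe_inc, args=(100_000,)) for _ in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
        print(counter)  # always 400000; with unsafe_inc it can come up short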
By contrast, in Python the memory model is unspecified, but there is an expectation (by millions of Python programmers) that it will behave in an intuitive fashion; i.e. a shared variable write performed by one thread will immediately be visible to other threads. This is hard to achieve efficiently while also allowing Python threads to run in parallel.
1 - While the GIL is not formally part of the Python spec, the influence of the GIL on the (unspecified!) Python memory model and Python programmers' assumptions make it more than merely an implementation detail. It remains to be seen if Python can successfully evolve into a language where multi-threading can use multiple cores effectively.
Not a complete answer here, but just adding a couple of things that Stephen C didn't already say:
Can Python only spawn threads on the same CPU, in contrast to Java?
That would be an optimization, not an essential fact. There's no reason in principle why Python could not simply allow the OS to schedule its threads on whatever CPU happened to be available at any given time.
OTOH, given that no two Python threads can do significant work at the same time, it potentially could improve performance if the threads all had affinity for the same CPU. (See what Stephen C said about "full context switch" vs. "only the (user) registers".)
Giving user-mode processes control over processor affinity is a relatively new feature in some operating systems. I have no idea of whether or not any Python version actually uses that feature.
If Java does run...multiple threads for parallelism...?
Java doesn't "run multiple threads for parallelism." Your Java program creates multiple threads for whatever reason you happen to want them. Most modern OSs provide threads. Java simply makes that ability available to application programmers in a way that is tightly integrated with the language itself. You are free to use them (or not) however you see fit.

Is Python multiprocessing intensive on resources?

Since CPU-bound parallelization is not achievable in CPython due to the GIL, the official documentation recommends using multiprocessing instead of multithreading.
So, is the use of multiple processes more intensive on resources than multiple threads, compared to the multiprocessing/multithreading performance of other programming languages like Java or C++ which support true parallelization in both?
There is little inherent additional cost to multiprocessing in Python beyond the cost of forking (Unix-like systems) or respawning a process. The expense is when data or state needs to be shared among the processes. This could be anything from the iterable given to Pool.map to the proxies in Manager. As long as those costs are kept low compared to the per-process workload, it's a wash. (Note that Python is usually slower than Java and C++ for other reasons unrelated to multiprocessing.)
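A sketch of that trade-off (the function and numbers are invented for illustration): keep the data crossing the process boundary small relative to the per-task work, and the pool pays off.

    from multiprocessing import Pool

    def heavy(x):
        # plenty of per-item CPU work; pickling one int across the
        # process boundary is negligible by comparison
        total = 0
        for i in range(200_000):
            total += (x * i) % 7
        return total

    if __name__ == "__main__":
        with Pool() as pool:                      # one worker per core by default
            results = pool.map(heavy, range(32))  # cheap arguments, cheap results
        print(sum(results))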

Recommended architecture for telnet-like server (multiprocess? process pools?)

I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8 or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using a separate process for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 Python interpreters may exceed memory limitations. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and then individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (as there's a big single-threaded bottleneck), and this system may introduce substantial overhead if some commands are very trivial but must cross an IPC boundary two times and wait for a free process to get a response.
Use Jython/IronPython and multithreading; lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound and your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads, all running in a single OS thread on top of a reactor.
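For flavor, here is a minimal sketch of the reactor approach using asyncio's modern stream API (the line-oriented command/response protocol is invented for illustration, not the asker's actual protocol):

    import asyncio

    async def handle(reader, writer):
        # one session: read a command line, send a response, repeat
        while line := (await reader.readline()).strip():
            writer.write(b"ok: " + line + b"\n")
            await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()  # idle connections cost almost nothing

    if __name__ == "__main__":
        asyncio.run(main())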
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major bottleneck if you try to do it with something like select, and as I said above, running 10000 threads or processes is not going to work. So, you need a good proactor or reactor for your platform, which is all described above.
Second, a few of those connections will be alive at a time.
First, for simplicity, let's assume it's all CPU-bound. So you will want processes. In particular, you want a pool of N processes, where N is the number of cores. Which you do by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4th of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If they're not that consistent or predictable, you can still almost always pretend they are—measure average CPU time over a bunch of tasks, and use n/avg. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can just twiddle it empirically.
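As a sketch of that sizing rule (the 0.25 CPU fraction is an assumed measurement, and cpu_bound_part is a hypothetical worker function, not from the answer):

    import multiprocessing
    from concurrent.futures import ProcessPoolExecutor

    def cpu_bound_part(command):
        # hypothetical placeholder for the CPU-heavy part of one command
        return sum(ord(c) for c in command)

    if __name__ == "__main__":
        n = multiprocessing.cpu_count()
        cpu_fraction = 0.25                      # measured average, assumed here
        with ProcessPoolExecutor(int(n / cpu_fraction)) as pool:  # i.e. 4*n workers
            print(pool.submit(cpu_bound_part, "some command").result())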
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.

Would limiting a GILed Python program to a single CPU boost performance?

Following up on David Beazley's paper regarding Python and the GIL, would it be a good practice to limit a Python program (CPython with GIL and all) to a single CPU in a Windows-based multi-core system?
Would it boost performance?
UPDATE: Assume multiple threads are used (not sure if it makes a difference)
The paper does indeed imply that limiting a program to a single core would improve performance in that particular case. However, there are a number of concerns that you would need to deal with:
His tests are mainly for compute-intensive threads rather than I/O-bound threads. If the threads you are using often block voluntarily (such as in a web server waiting for a client) then you don't run into GIL issues at all.
The GIL issues deal specifically with threads and not processes. I may be reading your question wrong, but you seem to be asking about restricting all Python programs to a single core. Programs using processes for parallelism don't suffer from GIL issues and restricting them to a single core will make them slower.
The GIL is drastically different in Python 3.2 (as David mentions in this video). The GIL was changed explicitly to deal with such issues. While it still has problems, it no longer has this problem.
In summary, the only time you want to complicate your life by forcing the OS to restrict the program to a single core (see the sketch after this list) is when you are running a:
Multithreaded
Compute Intensive
Lower than Python 3.2
program on a multicore machine.
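If you really are in that narrow case, pinning is a one-liner; a sketch (psutil is a third-party package; on Linux the stdlib call os.sched_setaffinity(0, {0}) does the same thing):

    import psutil

    psutil.Process().cpu_affinity([0])  # restrict this process to CPU 0 only
    # ... now run the multithreaded, compute-intensive, pre-3.2 workload ...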
Bias: for parallel computing involving heavy CPU processing, I much prefer message passing and cooperating processes to thread programming (of course, it depends on the problem).
You shouldn't limit your programs to one core. Beazley was just demonstrating a specific problem that performed poorly under those unique conditions (those conditions being an I/O-bound thread and a CPU-bound thread competing against each other). Ideally you want to avoid those conditions by using a different method (import multiprocessing).
I think the best solution is to put your CPU bound tasks in other processes using the multiprocessing module so that they utilize their own cores, and IO bound tasks in threads (or microthreads/coroutines, if you read his interesting paper on that: http://www.dabeaz.com/coroutines/) since the GIL is best suited for those types of tasks.
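A sketch of that split, with invented task functions (crunch and fetch are illustrative, not from the answer): CPU-bound work goes to a process pool so it gets its own cores, while I/O-bound work stays in threads.

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
    import urllib.request

    def crunch(n):
        # CPU-bound: wants its own core, so it goes to a process
        return sum(i * i for i in range(n))

    def fetch(url):
        # I/O-bound: blocks on the network and releases the GIL, so a thread is fine
        return urllib.request.urlopen(url).read()

    if __name__ == "__main__":
        with ProcessPoolExecutor() as procs, ThreadPoolExecutor(20) as threads:
            cpu_jobs = [procs.submit(crunch, 2_000_000) for _ in range(4)]
            io_jobs = [threads.submit(fetch, "http://example.com") for _ in range(3)]
            print(sum(j.result() for j in cpu_jobs))
            print([len(j.result()) for j in io_jobs])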
Conclusion: Python threads are best suited for IO bound tasks, NOT CPU bound.

Stackless python and multicores?

So, I'm toying around with Stackless Python and a question popped up in my head, maybe this is "assumed" or "common" knowledge, but I couldn't find it actually written anywhere on the stackless site.
Does Stackless Python take advantage of multicore CPUs? In normal Python you have the GIL being constantly present and to make (true) use of multiple cores you need to use several processes, is this true for Stackless also?
Stackless Python does not make use of multiple cores on the machine it runs on.
This is a common misconception about Stackless, as it allows the programmer to take advantage of thread-based programming. For many people these two are closely intertwined, but they are, in fact, two separate things.
Internally, Stackless uses a round-robin scheduler to schedule every tasklet (microthread), but no tasklet can run concurrently with another one. This means that if one tasklet is busy, the others must wait until that tasklet relinquishes control. By default the scheduler will not stop a tasklet and give processor time to another; it is the tasklet's responsibility to schedule itself back at the end of the schedule queue using stackless.schedule(), or to finish its calculations.
All tasklets are thus executed in a sequential manner, even when multiple cores are available.
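A sketch of that cooperative round-robin in action (this requires the Stackless Python interpreter; it will not run on standard CPython):

    import stackless

    def worker(name):
        for i in range(3):
            print(name, i)
            stackless.schedule()  # voluntarily go to the back of the queue

    stackless.tasklet(worker)("a")  # two tasklets on one OS thread:
    stackless.tasklet(worker)("b")  # they interleave but never run in parallel
    stackless.run()                 # round-robin until all tasklets finish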
The reason why Stackless does not have multi-core support is that leaving it out makes microthreads a whole lot easier. And that is just what Stackless is all about:
from the official stackless website
Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits:
Improved program structure.
More readable code.
Increased programmer productivity.
Here is a link to some more information about multiple cores and stackless.
