Python: When to use Threads vs. Multiprocessing

What are some good guidelines to follow when deciding between threads and multiprocessing, in terms of efficiency and code clarity?

Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in any of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)

For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and, I think, enable much cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
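A minimal sketch in the same spirit, using the standard-library queue module (the worker function and the None shutdown sentinel are my own illustrative choices, not from the linked page):

import queue  # named Queue in Python 2
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        if item is None:  # sentinel value: time to shut down
            break
        print("processing", item)
        q.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):
    q.put(i)

q.join()     # block until every queued item has been processed
q.put(None)  # tell the worker to exit
t.join()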
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back, and it is definitely worth watching. More info here, including a followup talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is no better, or even worse, than on a single core. But if your threads are doing a bunch of I/O (file read/write, DB access, socket read/write, etc.), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a separate Python interpreter (and GIL) per process.
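As a rough sketch of that idea (the busy-work function is made up for illustration), a CPU-bound job can be spread over one worker process per core with multiprocessing.Pool:

import multiprocessing

def cpu_bound(n):
    return sum(i * i for i in range(n))  # pure-Python number crunching

if __name__ == "__main__":  # required where processes are spawned, e.g. Windows
    with multiprocessing.Pool() as pool:  # defaults to one worker per core
        print(pool.map(cpu_bound, [10 ** 7] * 4))

Each worker runs in its own interpreter with its own GIL, so the four calls really can run in parallel.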

Related

Are Ruby Ractors the Same as Python's MultiProcessing module?

The Ruby 3.0 release has introduced Ractors, and the way they're presented, along with their examples, brings Python's multiprocessing module to mind.
So...
Are Ruby's Ractors just multiple processes in disguise and the GIL is still ruling over the threads?
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
Can Ractors be as fast as C/C++ threads and with low latency?
Thanks
Are Ruby's Ractors just multiple processes in disguise and the GIL is still ruling over the threads?
The Ractor specification does not prescribe any particular implementation strategy. It most certainly does not prescribe that an implementor must use OS processes. In fact, while that would be a pretty simple implementation because the OS does all the hard work for you, it would also be a pretty stupid implementation because Ractors are meant to be light-weight, which OS processes are typically not.
So, I expect that every implementor will choose their own most efficient implementation strategy. For example, I would expect TruffleRuby's and JRuby's implementation to be based on something like Kilim or Project Loom, Opal's implementation to be based on WebWorkers, Realms, and Promises, Artichoke's implementation to be based on Actix, Riker, or Axiom, and maybe MRuby's implementation might even be based on OS processes because of MRuby's focus on simplicity.
Right at this very moment, there does not exist any production-ready implementation of Ractors. In fact, there cannot be a production-ready implementation of Ractors, because the Ractor specification itself is still experimental, and thus not finalized.
The only implementation in existence right now is Koichi Sasada's original prototype which currently ships with YARV 3.0.0. This implementation does not implement Ractors as processes, it implements them as OS threads. YARV does not have a GIL, but it does have a per-Ractor GVL. So, only one thread of a Ractor can run at the same time, but multiple Ractors can each run one thread at the same time.
However, this is not a very optimized implementation, only a prototype. I would expect TruffleRuby's or JRuby's implementation to not have any sort of global lock. They never had one before, and Ractors don't share any data, so there simply is nothing to lock in the first place.
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
This comparison doesn't make much sense. First of all, Ractor is a specification with potentially multiple implementations, whereas to my understanding, Python's multiprocessing module is simply a way of starting multiple Python interpreters.
Secondly, Ractors are a language feature with specific language semantics.
Can Ractors be as fast as C/C++ threads and with low latency?
It's not quite clear what you mean by this. C had no threads in the language itself until C11 (and even there they are an optional feature), so asking about C threads doesn't make much sense. C++ has threads, but just like Ractors, they are simply a specification with multiple possible implementations. It will simply depend on the particular implementation of Ractors and C++ threads.
It is certainly possible to implement Ractors using threads. The current YARV prototype is proof of that.
I found an article on FastRuby's website that explains the differences between Ractors & other Concurrency & Parallelism features of Ruby.
The whole point was that they're not fast enough YET (30/12/2020) and are lagging behind fork and even threads so far. So the answers so far are:
No
Unfortunately, not YET (30/12/2020)😁
No😐 (Then again, not YET! But I'd really be happy if they finally could)

Does Python multiprocessing module need multi-core CPU?

Do I need a multi-core CPU to take advantage of the Python multiprocessing module?
Also, can someone tell me how it works under the hood?
multiprocessing asks the OS to launch one or more new processes, running the same version of Python and the same version of your script. It can also set up pipes or other ways of sharing data directly between them.
It usually works like magic; when you peek under the hood, sometimes it looks like sausage being made, but you can usually understand the sausage grinders. The multiprocessing docs do a great job explaining things further (they're long, but then there's a lot to explain). And if you need even more under-the-hood knowledge, the docs link to the source, which is pretty readable Python code. If you have a specific question after reading, come back to SO and ask a specific question.
Meanwhile, you can get some of the benefits of multiprocessing without multiple cores.
The main benefit—the reason the module was designed—is parallelism for speed. And obviously, without 4 cores, you aren't going to cut your time down to 25%. But sometimes, you actually can get a bit of speedup even with a single core, especially if that core has "hyperthreading" or similar technologies. I've seen times come down to 80%, or even 60%. More commonly, they'll go up to 108% instead (because you did get a small benefit from hyperthreading, but the overhead cost was higher than the gain). But try it with your code and see.
Meanwhile, you get all of the side benefits:
Concurrency: You can run multiple tasks at once without them blocking each other. Of course threads, asyncio, and other techniques can do this too.
Isolation: You can run multiple tasks at once without the risk of one of them changing data that another one wasn't expecting to change.
Crash protection: If a child task segfaults, only that task is affected. (Well, you still have to be careful of any side-effects—if it crashed in the middle of writing a file that another task expects to be in a consistent shape, you're still in trouble.)
You can also use the multiprocessing module without multiple processes. Sometimes you just want the higher-level API of the module, but you want to use it with threads; multiprocessing.dummy does that. And you can switch back and forth in a couple lines of code to test it both ways. Or you can use the higher-level concurrent.futures.ProcessPoolExecutor wrapper, if its model fits what you want to do. Besides often being simpler, it lets you switch between threads and processes by just changing one word in one line.
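A minimal sketch of that one-word switch with concurrent.futures (the work function is a placeholder):

import concurrent.futures

def work(n):
    return n * n  # placeholder task

if __name__ == "__main__":
    # Change ProcessPoolExecutor to ThreadPoolExecutor (or back)
    # to test the same code with threads instead of processes.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(list(executor.map(work, range(10))))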
Also, redesigning your program around multiprocessing takes you a step closer to further redesigning it as a distributed system that runs on multiple separate machines. It forces you to deal with questions like how your tasks communicate without being able to share everything, without forcing you to deal with further questions like how they communicate without reliable connections.

How to choose between the different concurrency methods available in Python?

There are different ways of doing concurrency in Python; below is a simple list:
process-based: subprocess.Popen, multiprocessing.Process, old-fashioned os.system, os.popen, os.exec*
thread-based: threading.Thread
microthread-based: greenlet
I know the difference between thread-based concurrency and process-based concurrency, and I know some (but not too much) about GIL's impact in CPython's thread support.
For a beginner who wants to implement some level of concurrency, how should one choose between them? Or, what's the general difference between them? Are there any more ways to do concurrency in Python?
I'm not sure if I'm asking the right question, please feel free to improve this question.
The reason all three of these mechanisms exist is that they have different strengths and weaknesses.
First, if you have huge numbers of small, independent tasks, and there's no sensible way to batch them up (typically, this means you're writing a C10k server, but that's not the only possible case), microthreads win hands down. You can only run a few hundred OS threads or processes before everything either bogs down or just fails. So, either you use microthreads, or you give up on automatic concurrency and start writing explicit callbacks or coroutines. This is really the only time microthreads win; otherwise, they're just like OS threads except a few things don't work right.
Next, if your code is CPU-bound, you need processes. Microthreads are an inherently single-core solution; threads in Python generally can't parallelize well because of the GIL; processes get as much parallelism as the OS can handle. So, processes will let your 4-core system run your code 4x as fast; nothing else will. (In fact, you might want to go farther and distribute across separate computers, but you didn't ask about that.) But if your code is I/O-bound, core-parallelism doesn't help, so threads are just as good as processes.
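To make the I/O-bound case concrete, here is a rough sketch (the URL list is a placeholder) where plain threads overlap nicely because each one spends its time blocked on the network, during which the GIL is released:

import threading
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as response:
        print(url, len(response.read()))

urls = ["http://example.com"] * 4  # placeholder URLs
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()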
If you have lots of shared, mutable data, things are going to be tough. Processes require explicitly putting everything into sharable structures, like using multiprocessing.Array in place of list, which gets nightmarishly complicated. Threads share everything automatically—which means there are race conditions everywhere. Which means you need to think through your flow very carefully and use locks effectively. With processes, an experienced developer can build a system that works on all of the test data but has to be reorganized every time you give it a new set of inputs. With threads, an experienced developer can write code that runs for weeks before accidentally and silently scrambling everyone's credit card numbers.
Whichever of those two scares you more—do that one, because you understand the problem better. Or, if it's at all possible, step back and try to redesign your code to make most of the shared data independent or immutable. This may not be possible (without making things either too slow or too hard to understand), but think about it hard before deciding that.
If you have lots of independent data or shared immutable data, threads clearly win. Processes need either explicit sharing (like multiprocessing.Array again) or marshaling. multiprocessing and its third-party alternatives make marshaling pretty easy for the simple cases where everything is picklable, but it's still not as simple as just passing values around directly, and it's also a lot slower.
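For a sense of what that explicit sharing looks like, here is a small sketch using multiprocessing.Array (the "i" typecode and the length of 10 are arbitrary choices for illustration):

import multiprocessing

def increment_all(arr):
    with arr.get_lock():  # the synchronized Array carries its own lock
        for i in range(len(arr)):
            arr[i] += 1

if __name__ == "__main__":
    shared = multiprocessing.Array("i", 10)  # ten C ints in shared memory
    p = multiprocessing.Process(target=increment_all, args=(shared,))
    p.start()
    p.join()
    print(list(shared))  # all ten slots were incremented by the child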
Unfortunately, most cases where you have lots of immutable data to pass around are the exact same cases where you need CPU parallelism, which means you have a tradeoff. And the best answer to this tradeoff may be OS threads on your current 4-core system, but processes on the 16-core system you have in 2 years. (If you organize things around, e.g., multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor, you can trivially switch to Pool or ProcessPoolExecutor later, or even via a runtime configuration switch, and that pretty much solves the problem. But this isn't always possible.)
Finally, if your application inherently requires an event loop (e.g., a GUI app or a network server), pick the framework you like first. Coding with, say, PySide vs. wx, or twisted vs. gevent, is a bigger difference than coding with microthreads vs. OS threads. And, once you've picked the framework, see how much you can take advantage of its event loop where you thought you needed real concurrency. For example, if you need some code to run every 30 seconds, don't start a thread (micro- or OS) for that, ask the framework to schedule it however it wants.
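For instance, with a tkinter GUI (used here purely to illustrate framework scheduling), the event loop's after() method handles the every-30-seconds case with no thread at all:

import tkinter

def every_30_seconds():
    print("periodic work")  # runs on the GUI thread, no locking needed
    root.after(30000, every_30_seconds)  # re-schedule itself in 30 s

root = tkinter.Tk()
root.after(30000, every_30_seconds)
root.mainloop()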

GIL in Python 3.1

Does anybody know the fate of the Global Interpreter Lock in Python 3.1, with regard to C++ multithreading integration?
The GIL is still there in CPython 3.1; the Unladen Swallow project aims (among many other performance boosts) to eventually remove it, but it's still a long way from its goals, and is working on 2.6 first with the intent of eventually porting to 3.x for whatever x will be current by the time the 2.y version is considered to be done. For now, multiprocessing (instead of threading) remains the way of choice for using multiple cores in CPython (IronPython and Jython are fine too, but they don't support Python 3 currently, nor do they make C++ integration all that easy either;-).
Significant changes will occur in the GIL for Python 3.2. Take a look at the What's New for Python 3.2, and the thread that initiated it in the mailing list.
While the changes don't signify the end of the GIL, they herald potentially enormous performance gains.
Update
The new GIL in 3.2 by Antoine Pitrou brought negligible general performance gains; instead, it focused on improving contention issues that arise in certain corner cases.
An admirable effort by David Beazley was made to implement a scheduler to significantly improve performance when CPU and IO bound threads are mixed, which was unfortunately shot down.
The Unladen Swallow work was proposed for merging in Python 3.3, but this has been withdrawn due to lack of results in that project. PyPy is now the preferred project and is currently requesting funding to add Python3k support. There's very little chance that PyPy will become the default at present.
Efforts have been made for the last 15 years to remove the GIL from CPython, but for the foreseeable future it is here to stay.
The GIL will not affect code that does not use Python objects. In NumPy, we release the GIL for computational code (linear algebra calls, etc.), and the underlying code can use multithreading freely (in fact, those are generally 3rd-party libraries which do not know anything about Python).
The GIL is a good thing.
Just make your C++ application release the GIL while it is doing its multithreaded work. Python code will continue to run in the other threads, unspoiled. Only acquire the GIL when you have to touch python objects.
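You can see the same effect from the Python side with ctypes, which (for functions loaded via CDLL) releases the GIL around each foreign call. This is a sketch assuming a POSIX system, where CDLL(None) yields a handle to the already-loaded C library:

import ctypes
import threading
import time

libc = ctypes.CDLL(None)  # POSIX only: the C library's sleep()

start = time.time()
threads = [threading.Thread(target=libc.sleep, args=(2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(time.time() - start)  # about 2 seconds, not 8: the calls overlapped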
I guess there will always be a GIL.
The reason is performance. Making all the low-level access thread-safe (putting a mutex around each hash operation, etc.) is heavy. Remember that a simple statement like
self.foo(self.bar, 3, val)
might already involve at least 3 hashtable lookups (if val is a global), and maybe even many more if the method cache is not hot (depending on the inheritance depth of the class).
It's expensive; that's why Java dropped the idea and introduced hash tables that do not use a monitor call, to get rid of its "Java Is Slow" trademark.
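You can count those lookups yourself with the dis module (a CPython-specific illustration; the class below is made up):

import dis

val = 42  # a global, as in the example above

class C:
    def method(self):
        self.foo(self.bar, 3, val)

dis.dis(C.method)
# The output shows attribute and global loads (LOAD_ATTR / LOAD_METHOD /
# LOAD_GLOBAL, depending on the CPython version) for foo, bar, and val:
# three hash-table lookups for this one statement, each of which a
# GIL-free interpreter would have to make thread-safe somehow.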
As I understand it, a "brainfuck"-style scheduler was proposed to replace the GIL's thread scheduling for Python 3.2:
BFS (Brain Fuck Scheduler)
If the GIL is getting in the way, just use the multiprocessing module. It spawns new processes but exposes the threading model and (most of) the threading API. In other words, you can do process-based parallelism in a thread-like way.
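A quick sketch of that thread-like API, with the same task handed once to a thread and once to a process:

import multiprocessing
import threading

def task(name):
    print("hello from", name)

if __name__ == "__main__":
    # The two classes are constructed and driven identically.
    t = threading.Thread(target=task, args=("a thread",))
    p = multiprocessing.Process(target=task, args=("a process",))
    t.start()
    p.start()
    t.join()
    p.join()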

Easier concurrency building blocks for Python?

It seems that the Python standard library lacks various useful concurrency-related concepts, such as atomic counters and executors, that can be found in e.g. java.util.concurrent. Are there any external libraries that would provide easier building blocks for concurrent Python applications?
Kamaelia, as already mentioned, is aimed at making concurrency easier to work with in Python.
Its original use case was network systems (which are naturally concurrent), and it was developed with the viewpoint "How can we make these systems easier to develop and maintain?"
Since then life has moved on and it is being used in a much wider variety of problem domains, from desktop systems (like whiteboarding applications, database modelling, tools for teaching children to read and write) through to back-end systems for websites (like stuff for transcoding & converting user-contributed images and video for web playback in a variety of scenarios, and SMS / text messaging applications).
The core concept is essentially the same idea as Unix pipelines - except instead of processes you can have python generators, threads, or processes - which are termed components. These communicate over inboxes and outboxes - as many as you like of each, rather than just stdin/stdout/stderr. Also rather than requiring serialised file interfaces, you pass between components fully fledged python objects. Also rather than being limited to pipelines, you can have arbitrary shapes - called graphlines.
You can find a full tutorial (video, slides, downloadable PDF booklet) here:
http://www.kamaelia.org/PragmaticConcurrency
Or the 5 minute version here (O'Reilly ignite talk):
http://yeoldeclue.com/cgi-bin/blog/blog.cgi?rm=viewpost&nodeid=1235690128
The focus of the library is pragmatic development, system safety, and ease of maintenance, though some effort has gone recently into adding syntactic sugar. Like anything, the developers (me and others :-) welcome feedback on improving it.
You can also find more information here:
- http://www.slideshare.net/kamaelian
Primarily, Kamaelia's core (Axon) was written to make my day job easier, and to wrap up best practice (message passing, software transactional memory) in a reusable fashion. I hope it makes your life easier too :-)
Although it may not be immediately obvious, itertools.count is indeed an atomic counter (the only operation on an instance x thereof, spelled next(x), is equivalent to an "atomic ++x" if C had such a concept;-). Edit: at least, this surely holds in CPython; I thought it was part of the Python standard definition but apparently IronPython and Jython disagree (not ensuring thread-safety of count.next in their current implementations) so I may well be wrong!
That is, suppose you currently have a data structure such as:
counters = dict.fromkeys(words_of_interest, 0)
...
if w in counters: counters[w] += 1
and your problem is that the latter increment is not atomic, so if two threads are at the same time dealing with the same word of interest the two increments might interfere (only one would "take", so the counter would be incremented only by one, not by two). Then:
counters = dict((w, itertools.count()) for w in words_of_interest)
...
if w in counters: next(counters[w])
will perform the same operations, but in an atomic way.
(There is unfortunately no obvious, documented way to "extract the current value of the counter", though in fact str(x) does return a string such as 'count(3)' from which the current value can be parsed out again;-).
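Putting the pieces together (the regular expression that reads the value back out of the repr is my own addition, relying on the 'count(3)'-style string described above):

import itertools
import re

counters = dict((w, itertools.count()) for w in ("spam", "ham"))

def saw(word):  # safe to call from multiple threads at once in CPython
    if word in counters:
        next(counters[word])

saw("spam")
saw("spam")
saw("ham")

for w, c in counters.items():
    # repr(c) looks like "count(2)"; parse the current value out of it
    print(w, int(re.search(r"\d+", repr(c)).group()))  # spam 2, ham 1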
Concurrency in Python (at least CPython) and Java are wildly different, at least in part because of the Global Interpreter Lock (GIL). In general, concurrency in Python is achieved not with threads, but processes. See multiprocessing for the "standard" concurrency module.
Also, check out "A Curious Course on Coroutines and Concurrency" for some concurrency techniques that were pretty new to me coming from Java. David Beazley (the author) is a Smart Guy™ when it comes to Python in general, and concurrency in particular.
Kamaelia provides tools for abstracting concurrency to threads or processes, etc.
P-workers creates a "Job-Worker" abstraction over the Python multiprocessing library. It simplifies concurrency with multiprocessing by starting "Workers" that have specific skills/attributes (defined functions) and providing a queue where they receive "Jobs" from. It's somewhat analogous to a thread pool, only with processes instead of threads, so it's better suited to CPU-heavy work. You can also use it to spawn multiple instances of a single application, or even spawn "Workers" that have multiple threads.
