How to convert Python threading code to multiprocessing code?

I need to convert a threading application to a multiprocessing application for multiple reasons (GIL, memory leaks). Fortunately the threads are quite isolated and only communicate via Queue.Queues. This primitive is also available in multiprocessing so everything looks fine. Now before I enter this minefield I'd like to get some advice on the upcoming problems:
How to ensure that my objects can be transferred via the Queue? Do I need to provide some __setstate__?
Can I rely on put returning instantly (like with threading Queues)?
General hints/tips?
Anything worthwhile to read apart from the Python documentation?

Answer to part 1:
Everything that has to pass through a multiprocessing.Queue (or Pipe or whatever) has to be picklable. This includes basic types such as tuples, lists and dicts. Classes are also supported if they are defined at the top level of a module and not too complicated (check the pickle documentation for details). Trying to pass lambdas around will fail, however.
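A quick way to check in advance is to run candidates through pickle.dumps yourself, since that mirrors what the queue does internally (a minimal sketch; the Message class is just an illustrative name):

import pickle

class Message:
    # defined at module top level, so instances are picklable
    def __init__(self, payload):
        self.payload = payload

pickle.dumps(Message([1, 2, 3]))   # works
pickle.dumps((1, 'a', {'k': 2}))   # basic types work too

try:
    pickle.dumps(lambda x: x + 1)  # lambdas cannot be pickled
except pickle.PicklingError as exc:
    print(exc)

If pickle.dumps succeeds, the object can pass through the queue; __getstate__/__setstate__ are only needed when the default pickling of your class is wrong or inefficient.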
Answer to part 2:
A put consists of two parts: It takes a semaphore to modify the queue and it optionally starts a feeder thread. So if no other Process tries to put to the same Queue at the same time (for instance because there is only one Process writing to it), it should be fast. For me it turned out to be fast enough for all practical purposes.
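As an illustration (a minimal sketch, not a benchmark): put returns as soon as the object lands in an internal buffer, and the background feeder thread then pickles it and writes it to the pipe.

import multiprocessing

def consumer(q):
    print(q.get())  # blocks until the feeder thread has delivered the item

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=consumer, args=(q,))
    p.start()
    q.put({'key': 'value' * 1000})  # returns almost immediately
    p.join()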
Partial answer to part 3:
The plain multiprocessing.Queue lacks a task_done method, so it cannot be used as a drop-in replacement directly. (The multiprocessing.JoinableQueue subclass provides it; see the sketch after this list.)
The Queue in the old processing package lacked a qsize method, and the newer multiprocessing version's qsize is only approximate (just keep this in mind).
Since file descriptors are normally inherited on fork, care needs to be taken about closing them in the right processes.
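For the task_done point above, a minimal sketch using multiprocessing.JoinableQueue, which does provide task_done and join:

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        # ... process item here ...
        q.task_done()

if __name__ == '__main__':
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,), daemon=True)
    p.start()
    for item in range(10):
        q.put(item)
    q.join()  # returns once every item has been marked with task_done()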

Related

Does Python multiprocessing module need multi-core CPU?

Do I need a multi-core CPU to take advantage of the Python multiprocessing module?
Also, can someone tell me how it works under the hood?
multiprocessing asks the OS to launch one or more new processes, running the same version of Python and the same version of your script. It can also set up pipes or other ways of sharing data directly between them.
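A minimal sketch of what that looks like (the function and variable names are just for illustration):

import multiprocessing
import os

def child(q):
    q.put('hello from pid %d' % os.getpid())

if __name__ == '__main__':
    q = multiprocessing.Queue()  # a pipe-backed channel between processes
    p = multiprocessing.Process(target=child, args=(q,))
    p.start()                    # asks the OS to launch a new Python process
    print(q.get())               # receive data sent from the child
    p.join()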
It usually works like magic; when you peek under the hood, sometimes it looks like sausage being made, but you can usually understand the sausage grinders. The multiprocessing docs do a great job explaining things further (they're long, but then there's a lot to explain). And if you need even more under-the-hood knowledge, the docs link to the source, which is pretty readable Python code. If you have a specific question after reading, come back to SO and ask a specific question.
Meanwhile, you can get some of the benefits of multiprocessing without multiple cores.
The main benefit—the reason the module was designed—is parallelism for speed. And obviously, without 4 cores, you aren't going to cut your time down to 25%. But sometimes, you actually can get a bit of speedup even with a single core, especially if that core has "hyperthreading" or similar technologies. I've seen times come down to 80%, or even 60%. More commonly, they'll go up to 108% instead (because you did get a small benefit from hyperthreading, but the overhead cost was higher than the gain). But try it with your code and see.
Meanwhile, you get all of the side benefits:
Concurrency: You can run multiple tasks at once without them blocking each other. Of course threads, asyncio, and other techniques can do this too.
Isolation: You can run multiple tasks at once without the risk of one of them changing data that another one wasn't expecting to change.
Crash protection: If a child task segfaults, only that task is affected. (Well, you still have to be careful of any side-effects—if it crashed in the middle of writing a file that another task expects to be in a consistent shape, you're still in trouble.)
You can also use the multiprocessing module without multiple processes. Sometimes you just want the higher-level API of the module, but you want to use it with threads; multiprocessing.dummy does that. And you can switch back and forth in a couple lines of code to test it both ways. Or you can use the higher-level concurrent.futures.ProcessPoolExecutor wrapper, if its model fits what you want to do. Besides often being simpler, it lets you switch between threads and processes by just changing one word in one line.
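For example, a hedged sketch of that one-word switch with concurrent.futures (the square function is just for illustration):

from concurrent.futures import ProcessPoolExecutor  # swap in ThreadPoolExecutor to test with threads

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(10))))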
Also, redesigning your program around multiprocessing takes you a step closer to further redesigning it as a distributed system that runs on multiple separate machines. It forces you to deal with questions like how your tasks communicate without being able to share everything, without forcing you to deal with further questions like how they communicate without reliable connections.

Are classmethods thread safe?

I am working on a class which operates in a multithreaded environment, and looks something like this (with excess noise removed):
import multiprocessing

class B:
    @classmethod
    def apply(cls, item):
        cls.do_thing(item)

    @classmethod
    def do_thing(cls, item):
        'do something to item'

    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(self.apply, list_of_items)
My concern is that two threads might call apply or do_thing at the same time, or that a subclass might try to do something stupid with cls in one of these functions. I could use staticmethod instead of classmethod, but calling do_thing would become a lot more complicated, especially if a subclass reimplements one of these but not the other. So my question is this: Is the above class thread-safe, or is there a potential problem with using classmethods like that?
Whether a method is thread safe or not depends on what the method does.
Working with local variables only is thread safe. But when you change the same non-local variable from different threads, it becomes unsafe.
‘do something to item’ seems to modify only the given object, which is independent from any other object in the list, so it should be thread safe.
If the same object is in the list several times, you may have to think about making the object thread safe. That can be done by using with self.object_scope_lock: in every method which modifies the object.
Anyway, what you are doing here is using processes instead of threads. In this case the objects are pickled and sent through a pipe to the other process; note that with Pool.map, any in-place changes made in the worker stay there, and only the function's return value is pickled and sent back. In contrast to threads, processes do not share memory, so I don't think using a lock in the classmethod would have an effect.
http://docs.python.org/3/library/threading.html?highlight=threading#module-threading
There's no difference between classmethods and regular functions (and instance methods) in this regard. Neither is automagically thread-safe.
If one or more classmethods/methods/functions can manipulate data structures simultaneously from different threads, you'd need to add synchronization protection, typically using threading.Locks.
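For instance, a minimal sketch of protecting a shared counter with threading.Lock (the names are illustrative):

import threading

counter = 0
counter_lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with counter_lock:  # only one thread mutates counter at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # reliably 400000; without the lock, updates can be lost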
Both other answers are technically correct in that the safety of do_thing() depends on what happens inside the function.
But the more precise answer is that the call itself is safe. In other words, if apply() and do_thing() are pure functions, then your code is safe. Any unsafety would be due to them not being pure functions (e.g. relying on or modifying a shared variable during execution).
As shx2 mentioned, classmethods are only "in" a class visually, for grouping. They have no inherent attachment to any instance of the class. Therefore this code is roughly equivalent in functioning:
def apply(item):
    do_thing(item)

def do_thing(item):
    'do something to item'

class B:
    def run(self):
        pool = multiprocessing.Pool()
        for list_of_items in self.data_groups:
            pool.map(apply, list_of_items)
A further note on concurrency given the other answers:
threading.Lock is easy to understand, but should be your last resort. In naive implementations it is often slower than completely linear processing. Your code will usually be faster if you can use things like threading.Event, queue.Queue, or multiprocessing.Pipe to transfer information instead.
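For instance, a minimal sketch of handing data between processes over a multiprocessing.Pipe instead of guarding shared state with a lock:

import multiprocessing

def worker(conn):
    item = conn.recv()   # receive work over the pipe instead of reading shared state
    conn.send(item * 2)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(21)
    print(parent_conn.recv())  # 42
    p.join()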
asyncio is the new hotness in Python 3. It's a bit more difficult to get right, but for I/O-bound work it is often the fastest approach.
If you want a great walkthrough of modern concurrency techniques in Python, check out core developer Raymond Hettinger's Keynote on Concurrency. The whole thing is great, but the downside of locks is highlighted starting at t=57:59.

Is Python cStringIO thread-safe?

As the title says, does Python's cStringIO protect its internal structures for multithreaded use?
Thank you.
Take a look at an excellent piece of work explaining the GIL, then note that cStringIO is written purely in C, and its calls don't release the GIL.
This means that the running thread won't voluntarily switch during read()/write() (with the current virtual machine implementation). (The OS will preempt the thread; however, other Python threads won't be able to acquire the GIL.)
Looking at the source (Python-2.7.1/Modules/cStringIO.c), there is no mention of protecting the internals. When in doubt, look at the source :)
I assume you are talking about the CPython implementation of Python.
In CPython there is a global interpreter lock which means that only a single thread of Python code can execute at a time. Code written in C will therefore also be effectively single threaded unless it explicitly releases the global lock.
What that means is that if you have multiple Python threads all using cStringIO simultaneously there won't be any problem, as only a single call to a cStringIO method can be active at a time and cStringIO never releases the lock. However, if you call it directly from C code which is running outside the locked environment you will have problems. Also, if you do anything more complex than just reading or writing you will have issues, e.g. if you start using seek, as your calls may overlap in unexpected ways.
Also note that some methods such as writelines can invoke Python code from inside the method so in that case you might get other output interleaved inside a single call to writelines.
That is true for most of the standard Python objects: you can safely use objects from multiple threads as the individual operations won't break, but the order in which things happen won't be defined.
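To see the "operations don't break, but order is undefined" behaviour, here is a Python 2 sketch (cStringIO only exists there; the names are illustrative):

import cStringIO
import threading

buf = cStringIO.StringIO()

def writer(tag):
    for i in range(3):
        # each individual write() is atomic under the GIL,
        # but the interleaving across threads is undefined
        buf.write('%s-%d\n' % (tag, i))

threads = [threading.Thread(target=writer, args=(tag,)) for tag in 'AB']
for t in threads:
    t.start()
for t in threads:
    t.join()
print(buf.getvalue())  # all six lines are present, but their order may vary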
It is as "thread-safe" as file operations can be (which means: not much). The Python implementation you're using has a Global Interpreter Lock (GIL), which guarantees that each individual file operation on cStringIO will not be interrupted by another thread. That does not, however, guarantee that concurrent file operations from multiple threads won't be interleaved.
No, it is not currently thread safe.

Python:When to use Threads vs. Multiprocessing

What are some good guidelines to follow when deciding to use threads or multiprocessing when speaking in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in any of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and I think enable a lot cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
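The core of that pattern, as a minimal sketch with queue.Queue (Python 3 naming; multiprocessing.JoinableQueue is the process-side analogue):

import queue
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        # ... do the actual work here ...
        q.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

for item in range(20):
    q.put(item)
q.join()  # blocks until every item has been processed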
For efficiency, the details and outcome may not noticeably affect most people, but for python <= 3.1 the implementation for CPython has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back and it is definitely worth watching. More info here, including a followup talking about significant improvements on this front in python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is not great, or even worse than single core. But if your threads are doing a bunch of i/o (file read/write, DB access, socket read/write, etc), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a separate Python interpreter (and GIL) per process.

Static methods and thread safety

In Python, with all this idea of "everything is an object", where does thread-safety come in?
I am developing a Django website with WSGI. It will run on Linux, which as far as I know uses effective process management, so perhaps we don't need to think about thread-safety much. I am not sure how modules are loaded, and whether their functions are static or not. Any information would be helpful.
Functions in a module are equivalent to static methods in a class. The issue of thread safety arises when multiple threads may be modifying shared data, or even one thread may be modifying such data while others are reading it; it's best avoided by making data be owned by ONE module (accessed via Queue.Queue from others), but when that's not feasible you have to resort to locking and other, more complex, synchronization primitives.
This applies whether the access to shared data happens in module functions, static methods, or instance methods -- and the shared data is such whether it's instance variables, class ones, or global ones (scoping and thread safety are essentially disjoint, except that function-local data is, to a point, intrinsically thread-safe -- no other thread will ever see the data inside a function instance, until and unless the function deliberately "shares" it through shared containers).
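A hedged sketch of that "one owning module" pattern (all names here are illustrative): mutation requests go through a queue, and a single owner thread is the only code that ever touches the data.

import queue
import threading

_requests = queue.Queue()
_data = {}  # owned exclusively by the owner thread below

def _owner():
    while True:
        key, value = _requests.get()
        _data[key] = value  # only this thread ever mutates _data
        _requests.task_done()

threading.Thread(target=_owner, daemon=True).start()

def set_value(key, value):
    # safe to call from any thread; the mutation itself is serialized via the queue
    _requests.put((key, value))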
If you use the multiprocessing module in Python's standard library, instead of the threading module, you may in fact not have to care about "thread safety" -- essentially because NO data is shared among processes... well, unless you go out of your way to change that, e.g. via mmapped files;-).
See the python documentation to better understand the general thread safety implications of Python.
Django itself seems to be thread safe as of 1.0.3, but your code may not be, and you will have to verify that...
My advice would be to simply not worry about that and serve your application with multiple processes instead of multiple threads (for example by using Apache's 'prefork' MPM instead of 'worker').
