Python - Loop parallelisation with joblib


I would like some help understanding exactly what I have done/ why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
import numpy as np
from joblib import Parallel, delayed
def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches
def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1) // 2  # integer division: np.pad needs an integer pad width
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values = 0)
    A = Parallel(n_jobs= 1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with "n_jobs", expecting that increasing this will speed up my function. However as I increase n_jobs, things slow down quite significantly. When running this code without "Parallel", things are slower, until I increase the number of jobs from 1.
Why is this the case? I understood that the more jobs I run, the faster the script would be. Am I using this wrong?
Thanks!

Your problem may be caused by image_pad being a large array. Your code uses joblib's default multiprocessing backend, which creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and broadcast to each worker in the pool, which can lead to serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, either switch to the threading backend (if frame is able to release the global interpreter lock when it is called) or switch to joblib's shared memory approach.
The docs say that joblib provides an automated memmap conversion that could be useful.
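As a rough sketch of the threading route (not the asker's actual pipeline): the example below assumes a small synthetic image_pad and a hypothetical helper, extract_patches, that handles a whole chunk of indices per task so that the per-task overhead is spread over more work. Threads share the parent's memory, so image_pad is not copied. Whether this actually runs faster depends on how much of the work releases the GIL.

```python
import numpy as np
from joblib import Parallel, delayed


def extract_patches(chunk, image_pad, m):
    """Extract the XY patch for every (i, j, k) index triple in one chunk."""
    return [np.float32(image_pad[i:i + m, j:j + m, k]) for i, j, k in chunk]


def patches_threaded(image_pad, indices, m, n_jobs=2, n_chunks=4):
    # Split the index list into a few chunks so each task does more work.
    chunks = np.array_split(np.asarray(indices), n_chunks)
    # Threads see the same image_pad object; nothing is pickled or copied.
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(extract_patches)(chunk, image_pad, m) for chunk in chunks
    )
    # Flatten the per-chunk lists back into one array, in the original order.
    return np.array([patch for chunk_result in results for patch in chunk_result])


# Synthetic stand-in data (an assumption, purely for illustration).
image_pad = np.arange(10 * 10 * 10, dtype=np.float64).reshape(10, 10, 10)
indices = [(0, 0, 0), (1, 2, 3), (2, 2, 2), (3, 1, 0), (4, 4, 4)]
patches = patches_threaded(image_pad, indices, m=3)
```

The chunk count and n_jobs values here are arbitrary; they would need tuning against the real data.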

It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimising and parallelising Python code that iterating through large loops is an inherently slow operation for a Python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However, there are things that can be done.
The Cython and Numba compilers are both designed to optimise code that is written in a C/C++ style (i.e. your case). In particular, Numba's @vectorize decorator allows scalar functions to be applied element-wise to large arrays in parallel (target='parallel').
I don't understand your code enough to give an example of an implementation, but try this! These compilers, used in the correct ways, have brought speed increases of 3000,000% to me for parallel processes in the past!

Related

How to run generator code in parallel?

I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once: we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent; multiprocessing is fine also. It's a similar problem: it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.

In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out.
Maybe joblib's documentation is not explicit enough on this point, since it shows only examples with for loops. As you want to use a generator, I should point out that it does indeed work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed
def my_long_running_job(x):
    # do something with x
    return x  # placeholder so that the function body is valid
# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you still benefit from true parallelism.
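To make the thread suggestion concrete, here is a minimal sketch using only the standard library. It assumes the slow calculation releases the GIL; slow_source and prefetch are hypothetical names, and slow_source merely stands in for the real endless generator. A bounded queue gives the "wait until the queue is not full" behaviour asked for, and nothing is pickled because the thread shares memory with the consumer.

```python
import itertools
import queue
import threading


def slow_source():
    """Hypothetical stand-in for the slow, endless generator."""
    for n in itertools.count():
        yield n * n


def prefetch(source, maxsize=4):
    """Run `source` in a background thread, buffering at most `maxsize` items.

    Assumes `source` never ends, as in the question.
    """
    buffer = queue.Queue(maxsize=maxsize)
    stop = threading.Event()

    def produce():
        for item in source:
            # Retry the put with a timeout so a stop request is noticed
            # even while the buffer is full.
            while not stop.is_set():
                try:
                    buffer.put(item, timeout=0.1)
                    break
                except queue.Full:
                    continue
            if stop.is_set():
                return

    threading.Thread(target=produce, daemon=True).start()
    try:
        while True:
            yield buffer.get()  # blocks until the producer has a result ready
    finally:
        stop.set()  # runs when the consumer closes or drops the generator


first_five = list(itertools.islice(prefetch(slow_source()), 5))
```

If the calculation holds the GIL, the same shape would need a process and an inter-process queue instead, which brings back the transfer cost for large arrays.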

Parallelizing python: multiprocessing vs cython

I want to parallelize an iteration in which many Cython instances are evaluated and the results are stored in a global numpy array:
for cythonInstance in myCythonInstances:
    success = cythonInstance.evaluate(someConstantGlobalVariables,) # very CPU intense
    if success == False:
        break
    globalNumpyArray[instanceSpecificLocation] = cythonInstance.resultVector[:]
The results of the instance evaluations are independent of each other. There is no kind of interaction between the instances, except that the results are written to the same global array, but at fixed, pre-determined and independent locations. If one evaluation fails, the iteration must be stopped.
As far as I understand, there are two possibilities:
1) using the multiprocessing package
2) making a cython function and using prange/openmp.
I have no experience with parallelization at all. Which solution is preferable, or are there also better alternatives? Thank You!

Use Cython if you can:
The prange syntax is pretty similar to range. It lets you take the easy development route of write a Python loop -> convert it to Cython -> convert it to a parallel loop. Hopefully the changes needed each time are small. In contrast, multiprocessing requires you to get the inside of your loop as a function and then set up pools, so it's less immediately familiar.
OpenMP/Cython threading is pretty low overhead. In contrast, the multiprocessing module is relatively high overhead ("processes" are generally slower than "threads").
Multiprocessing is quite restricted in Windows (everything has to be pickleable). This often turns out to be quite a hassle.
There are a few specific circumstances when you should use multiprocessing:
You find you need to get the GIL a lot - multiprocessing doesn't share a GIL so isn't slowed down. If you only need to get the GIL occasionally though then small with gil: blocks in Cython often don't slow you down too much, so try this first.
You need to do a bunch of quite different operations at once (i.e. something that doesn't lend itself to a prange loop because each thread is genuinely running separate code). If this is the case the Cython prange syntax doesn't really fit.
The caveats from looking at your code are that you should avoid using Cython classes if you can. If you can refactor it into a call to a cdef function that would be better (Cython classes will still need the GIL at times). Something similar to the following would work well:
from cython.parallel import prange
import numpy as np
cdef int f(double[:] arr, double loop_specific_parameter, int constant_parameter) nogil:
    # return your boolean to stop the iteration
    # modify arr
    return result
# then elsewhere
cdef int i
cdef double[:,:] output = np.zeros(shape)
for i in prange(len(parameters_to_try),nogil=True):
    result = f(output[i,:],parameters_to_try[i],constant_parameter)
    if result:
        break
The reason I don't really recommend using Cython classes is that 1) you can't create them or index a list of them without the GIL (for reference counting reasons) and 2) Python objects including Cython classes don't seem to be allowed to be thread local. See Cython parallel prange - thread locality? for an example of the issues. (Originally I wasn't aware of the restriction on being thread local.)
The with_gil overhead involved isn't necessarily huge, so if this design makes most sense then try it. Looking at your CPU usage will tell you how well it's parallelizing.
Nb. Most of the pros/cons in this set of answers still apply, even though you're using Cython rather than the Python threading module. The difference is that you can often avoid the GIL in Cython (so some of the disadvantages of using threads are less significant).

I would suggest using joblib with the threading backend. Joblib is a very good tool to parallelize for loops.
Threading is preferred over multiprocessing here, because multiprocessing has a lot of overhead. This would be inappropriate when there are a lot of parallel calculations to be done.
The results are stored in a list, however, which you can then convert back to a numpy array.
from joblib import Parallel, delayed
def sim(x):
    return x**2
if __name__ == "__main__":
    result = Parallel(n_jobs=-1, backend="threading", verbose=5) \
        (delayed(sim)(x) for x in range(10))
    print(result)
result
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Naive multiprocessing in Python with NumPy

Despite the warnings and confused feelings I got from the ton of questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation and updated a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.
Now, I probably didn't get 8x because of the lock required by each callback call, but what I got is encouraging.
I'm trying to find out if this can be improved upon, or if this is a good result. Questions :
I suppose the returned NumPy arrays got pickled?
Were the underlying NumPy buffers copied or just passed by reference?
How can I find out what the bottleneck is? Any particularly useful technique?
Can I improve on that or is such an improvement pretty common in such cases?

I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when passing around NumPy arrays. All you have to do is, instead of:
import numpy as np
foo = np.empty(1000000)
do this:
import sharedmem
foo = sharedmem.empty(1000000)
and off you go passing foo from one process to another, like:
q = multiprocessing.Queue()
...
q.put(foo)
Note however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.
Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock.)
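For comparison, here is a hedged sketch of a different technique from the one in this answer: the standard library's multiprocessing.shared_memory module (available from Python 3.8), not the sharedmem package. To keep it short, both the "owner" and the "attaching" side run in one process here; in real use the second half would run in a worker that receives only the block's name and the array's shape and dtype.

```python
import numpy as np
from multiprocessing import shared_memory

source = np.arange(12, dtype=np.float64).reshape(3, 4)

# Owner side: allocate a shared block and view it as a NumPy array.
shm = shared_memory.SharedMemory(create=True, size=source.nbytes)
shared = np.ndarray(source.shape, dtype=source.dtype, buffer=shm.buf)
shared[:] = source  # one copy into shared memory; no pickling afterwards

# Attaching side (normally another process): open the block by name.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(source.shape, dtype=source.dtype, buffer=attached.buf)
view[0, 0] = 99.0  # a write through one view is visible through the other

value_seen_by_owner = float(shared[0, 0])

# Drop the array views before closing, then release the block.
del view
attached.close()
del shared
shm.close()
shm.unlink()
```

As with sharedmem, the block can outlive a process that exits ungracefully, so the close and unlink calls matter.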

Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.
How I did it
It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, so I'm doing this :
First, on Unix systems, any object you instantiate before spawning a process will be present (and copied) in the context of the process if needed. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.
Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.
Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).
Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.
(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)
Profiling
Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned :
Compared to the built-in (at least in my shell) time command, the time executable (/usr/bin/time in Ubuntu) gives out much more information, including things such as average RSS, context switches, average %CPU,... I run it like this to get everything I can :
$ /usr/bin/time -v python test.py
Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.
I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on/off, you never know when you might need it.
Last but not least, you can get some more or less useful information from a syscall profile (mostly to see if the OS isn't taking ages transferring heaps of data between the processes), by attaching to one of your running Python processes like :
$ sudo strace -c -p <python-process-id>
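The per-process cProfile hook mentioned above could look roughly like the sketch below. The names profiled, work and the worker_<pid>.prof file pattern are assumptions made for illustration; DEBUG_PROFILE is the toggle suggested above. In real use, profiled would be the target of each process you spawn; here it is called directly so the sketch stays self-contained.

```python
import cProfile
import os
import tempfile

DEBUG_PROFILE = True  # toggle profiling on/off


def work(n):
    """Hypothetical CPU-bound task standing in for the real worker body."""
    return sum(i * i for i in range(n))


def profiled(func, *args, profile_dir=None, **kwargs):
    """Call func, saving a cProfile dump named after the current process id."""
    if not DEBUG_PROFILE:
        return func(*args, **kwargs)
    profile_dir = profile_dir or tempfile.gettempdir()
    path = os.path.join(profile_dir, "worker_%d.prof" % os.getpid())
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(func, *args, **kwargs)
    finally:
        profiler.dump_stats(path)  # one file per process, readable with pstats


out_dir = tempfile.mkdtemp()
total = profiled(work, 1000, profile_dir=out_dir)
profile_path = os.path.join(out_dir, "worker_%d.prof" % os.getpid())
```

With multiprocessing, the equivalent call would be something like Process(target=profiled, args=(work, 1000)), and each dump can then be inspected with pstats.Stats(path).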

Will multi-threading necessarily decrease runtime?

If I have a series of CPU-intensive operations, will multi-threading my program necessarily decrease its runtime? What are the trade-offs of doing so? In this case, I'm trying to compute the nullspace of a very large matrix. I'm using Python and, specifically, the numpy package:
from numpy import append, compress, linalg, shape, transpose, zeros
def nullspace(A, eps=1e-15):
    """Computes the null space of the real matrix A."""
    n, m = shape(A)
    if n > m :
        return nullspace(transpose(A), eps)
    _, s, vh = linalg.svd(A)
    s = append(s, zeros(m))[0:m]
    null_mask = (s <= eps)
    null_space = compress(null_mask, vh, axis=0)
    return null_space.tolist()
Also, I would be interested to know just how one would go about multi-threading such a function. Thanks in advance.

Python has the Global Interpreter Lock (GIL), which only allows one thread to interact with the interpreter at a time -- effectively, this means that you can only run one thread of Python at a time. This is a severe disadvantage when trying to run multiple threads.
However, numpy is built on top of a heavily-optimised library for numerical linear algebra called LAPACK. If you install the right version of LAPACK for your system, it will parallelise its computations for you. You can then install numpy on top of your LAPACK, and the numpy calls that are backed by LAPACK (such as linalg.svd) will be parallelised.
This also means that many numpy operations release the GIL, so that you can fire off a long numpy computation in a Python thread and simultaneously execute other Python. Thanks @JFSebastian.
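A small sketch of that last point, assuming several independent matrices rather than one very large one: the SVD calls are handed to a thread pool from the standard library. The matrix sizes and worker count are arbitrary, and whether the threads actually overlap depends on the numpy build and its underlying BLAS/LAPACK library.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def singular_values(matrix):
    """Return only the singular values; the heavy lifting happens inside LAPACK."""
    return np.linalg.svd(matrix, compute_uv=False)


rng = np.random.default_rng(0)
matrices = [rng.standard_normal((50, 50)) for _ in range(4)]

# Each thread spends most of its time inside compiled code,
# so other threads are not blocked by the GIL for long.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(singular_values, matrices))
```

For a single large matrix, as in the question, the parallelism would have to come from the LAPACK build itself rather than from Python threads.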

No. For one thing, CPU-bound programs rarely benefit at all from threading in Python because of the Global Interpreter Lock.
Also, on a single-core machine, threading won't reduce runtime at all.

Usually the GIL is an impediment to getting the benefits of multithreading, except when your calculations are done outside the Python interpreter (for example in C implementations). I'm not sure if this relates to numpy.
If you don't need many threads, have a look at the multiprocessing module: each worker is then a separate system process instead of a Python thread.

Naive and easiest way to decompose independent loop into parallel threads/processes

I have a loop of intensive calculations that are independent of each other, and I want to accelerate them by performing them all in parallel on a multicore processor. What is the easiest way to do that in Python?
Let’s imagine that those calculations have to be summed at the end. How to easily add them to a list or a float variable?
Thanks for all your pedagogic answers and using python libraries ;o)

From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
Edit: Here is a complete example of using the multiprocessing module to perform a simple computation. It uses four processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print(result)
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
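For the numpy route mentioned earlier in this answer, here is a minimal sketch under the assumption that the "intensive calculation" is simply squaring each value and that the results are summed at the end, as the question asks. Both function names are made up for illustration.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Plain Python loop: one interpreter step per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Same result, but the loop runs inside numpy's compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


values = range(1000)
loop_total = sum_of_squares_loop(values)
vectorized_total = sum_of_squares_vectorized(values)
```

Real calculations are rarely this simple, but when the loop body can be expressed as array operations, this is usually worth trying before reaching for multiple processes.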

Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which allows you to use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.
