Parallelizing python: multiprocessing vs cython

I want to parallelize an iteration in which many Cython class instances are evaluated and the results are stored in a global numpy array:
for cythonInstance in myCythonInstances:
    success = cythonInstance.evaluate(someConstantGlobalVariables,) # very CPU intense
    if success == False:
        break
    globalNumpyArray[instanceSpecificLocation] = cythonInstance.resultVector[:]
The results of the instance evaluations are independent of each other. There is no kind of interaction between the instances, except that the results are written to the same global array, but at fixed, pre-determined and independent locations. If one evaluation fails, the iteration must be stopped.
As far as I understand, there are two possibilities:
1) using the multiprocessing package
2) making a cython function and using prange/openmp.
I have no experience with parallelization at all. Which solution is preferable, or are there also better alternatives? Thank You!

Use Cython if you can:
The prange syntax is pretty similar to range. It lets you take the easy development route of writing a Python loop -> converting it to Cython -> converting it to a parallel loop. Hopefully the changes needed each time are small. In contrast, multiprocessing requires you to extract the inside of your loop as a function and then set up pools, so it's less immediately familiar.
OpenMP/Cython threading is pretty low overhead. In contrast, the multiprocessing module is relatively high overhead ("processes" are generally slower than "threads").
Multiprocessing is quite restricted on Windows (everything has to be pickleable). This often turns out to be quite a hassle.
There are a few specific circumstances when you should use multiprocessing:
You find you need to get the GIL a lot - multiprocessing doesn't share a GIL so isn't slowed down. If you only need to get the GIL occasionally though then small with gil: blocks in Cython often don't slow you down too much, so try this first.
You need to do a bunch of quite different operations at once (i.e. something that doesn't lend itself to a prange loop because each thread is genuinely running separate code). If this is the case the Cython prange syntax doesn't really fit.
The caveats from looking at your code are that you should avoid using Cython classes if you can. If you can refactor it into a call to a cdef function that would be better (Cython classes will still need the GIL at times). Something similar to the following would work well:
from cython.parallel import prange
import numpy as np
cdef int f(double[:] arr, double loop_specific_parameter, int constant_parameter) nogil:
    # return your boolean to stop the iteration
    # modify arr
    return result
# then elsewhere
cdef int i, result  # result must be a C type to be assigned without the GIL
cdef double[:,:] output = np.zeros(shape)
for i in prange(len(parameters_to_try), nogil=True):
    result = f(output[i,:], parameters_to_try[i], constant_parameter)
    if result:
        break
The reason I don't really recommend using Cython classes is that 1) you can't create them or index a list of them without the GIL (for reference counting reasons) and 2) Python objects, including Cython classes, don't seem to be allowed to be thread-local. See "Cython parallel prange - thread locality?" for an example of the issues. (Originally I wasn't aware of the restriction on being thread-local.)
The with_gil overhead involved isn't necessarily huge, so if this design makes most sense then try it. Looking at your CPU usage will tell you how well it's parallelizing.
NB: Most of the pros/cons in this set of answers still apply, even though you're using Cython rather than the Python threading module. The difference is that you can often avoid the GIL in Cython (so some of the disadvantages of using threads are less significant).
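For comparison, the multiprocessing route mentioned above (move the loop body into a function, then set up a pool) might look roughly like the sketch below. It rests on several assumptions: `evaluate` is a hypothetical stand-in for the Cython instance's method, results go into a plain list rather than a numpy array, and the `pool_factory` parameter is added here only so the same code can be tried with `multiprocessing.pool.ThreadPool`, which offers the same interface.

```python
from multiprocessing import Pool


def evaluate(param):
    """Hypothetical stand-in for cythonInstance.evaluate(): returns (success, result_vector)."""
    if param < 0:                      # pretend that negative inputs fail
        return False, []
    return True, [param, param * param]


def run_all(params, pool_factory=Pool, n_workers=4):
    """Evaluate all params in parallel and stop collecting at the first failure."""
    results = [None] * len(params)     # one pre-determined slot per task
    with pool_factory(n_workers) as pool:
        # imap yields results in input order, so index i is the slot of params[i].
        for i, (ok, vec) in enumerate(pool.imap(evaluate, params)):
            if not ok:
                # Leaving the with-block terminates the remaining workers.
                return False, results
            results[i] = vec
    return True, results


if __name__ == "__main__":             # guard needed where processes are spawned (e.g. Windows)
    print(run_all([1, 2, 3]))
```

Because only the parent process writes into `results`, no locking is needed; the cost is that each result vector is pickled on its way back from the worker.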

I would suggest using joblib with the threading backend. Joblib is a very good tool for parallelizing for loops.
Threading is preferred over multiprocessing here because multiprocessing has a lot of overhead, which would be inappropriate when there are a lot of parallel calculations to be done.
The results are returned in a list, however, which you can then convert back to a numpy array.
from joblib import Parallel, delayed
def sim(x):
    return x**2
if __name__ == "__main__":
    result = Parallel(n_jobs=-1, backend="threading", verbose=5) \
        (delayed(sim)(x) for x in range(10))
    print(result)
result
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Related

Modify data in shared memory when using Python's ray module

I am currently trying to parallelize some parts of a Python code using the ray module. Unfortunately, ray does not allow modifying data in shared memory by default (at least according to my understanding). This means I would need to perform a numpy.copy() first, which sounds very inefficient to me.
Here is a (probably very inefficient) example:
import numpy as np
import ray
@ray.remote
def mod_arr( arr ):
    arr_cp = np.copy(arr)
    arr_cp += np.ones(arr_cp.shape)
    return arr_cp
ray.init()
arr = np.zeros( (2,3,4) )
arr = ray.get(mod_arr.remote(arr))
If I omit the np.copy() in the function mod_arr() and try to modify arr instead, I get the following error
ValueError: output array is read-only
Am I using ray completely wrong, or is it not the correct tool for my purpose?

Because of Python's GIL, multiple threads cannot execute Python code in parallel. True parallelism is therefore achieved either outside of Python, when a module releases the GIL, or by using multiprocessing.
In multiprocessing, this kind of memory copy is normal. The same holds in pure functional programming, where function arguments are immutable: the solution is to copy memory whenever you have to. This brings a lot of stability at an acceptable performance cost.
Basically, treat these functions as pure functions.

How to run generator code in parallel?

I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent, multiprocessing is fine also. It's a similar problem, it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.

In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking for you.
Maybe joblib's documentation is not explicit enough on this point, as it only shows examples with for loops. Since you want to use a generator, I should point out that it does work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed
def my_long_running_job(x):
    # do something with x
    return x  # placeholder so the function body is valid
# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you avoid having to transfer large numpy arrays between processes and still benefit from true parallelism.
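If the slow calculation does release the GIL, the thread-based idea can be sketched with the standard library alone. The helper below (`prefetch` is a hypothetical name, not a joblib function) runs a generator in a background thread and buffers its values in a bounded queue, so the producer pauses whenever the buffer is full.

```python
import queue
import threading


def prefetch(gen, maxsize=4):
    """Run `gen` in a background thread, buffering at most `maxsize` items."""
    q = queue.Queue(maxsize=maxsize)
    done = object()                    # sentinel marking exhaustion of `gen`

    def worker():
        for item in gen:
            q.put(item)                # blocks while the buffer is full
        q.put(done)

    # Daemon thread: an unfinished (e.g. infinite) producer will not block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()                 # blocks until the producer has a value ready
        if item is done:
            return
        yield item
```

Usage would be `for x in prefetch(generator()): ...`. Items are handed over by reference, so large arrays are not pickled. This sketch does not forward exceptions raised inside the producer; a fuller version would need to.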

Python - Loop parallelisation with joblib

I would like some help understanding exactly what I have done/ why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
import numpy as np
from joblib import Parallel, delayed
def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches
def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1) // 2  # integer division: np.pad needs an integer pad_width
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values = 0)
    A = Parallel(n_jobs= 1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with "n_jobs", expecting that increasing this will speed up my function. However as I increase n_jobs, things slow down quite significantly. When running this code without "Parallel", things are slower, until I increase the number of jobs from 1.
Why is this the case? I understood that the more jobs I run, the faster the script should be. Am I using this wrong?
Thanks!

Maybe your problem arises because image_pad is a large array. In your code, you are using the default multiprocessing backend of joblib. This backend creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and broadcast to each worker in the pool, which can lead to serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, you should switch to the threading backend, if you are able to release the global interpreter lock when calling frame, or switch to the shared memory approach of joblib.
The docs say that joblib provides an automated memmap conversion that could be useful.
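To show why threads avoid the copying cost, here is a minimal sketch that uses `concurrent.futures` from the standard library in place of joblib (a swapped-in technique, not joblib's API). `frame` is reduced to a one-dimensional slice and a plain list stands in for `image_pad`; both are simplifications for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def frame(start, data, m):
    """Simplified 1-D stand-in for the question's patch extraction."""
    return data[start:start + m]


def extract_patches(starts, data, m, n_workers=4):
    # Threads share the caller's memory: `data` is passed by reference, never pickled.
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(partial(frame, data=data, m=m), starts))


image_pad = list(range(1_000_000))     # placeholder for a large array
patches = extract_patches([0, 10, 500_000], image_pad, m=3)
```

As the answer notes, this only pays off if the work inside `frame` releases the GIL; otherwise the threads take turns and the loop runs no faster than the serial version.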

It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimisation and parallelising python code that iterating through large loops is an inherently slow operation for a python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However - there are things that can be done.
The Cython and Numba compilers are both designed to optimise code written in a C/C++-like style (i.e. your case). In particular, Numba's @vectorize decorator lets a scalar function be applied elementwise to large arrays in parallel (target='parallel').
I don't understand your code enough to give an example of an implementation, but try this! These compilers, used in the correct ways, have brought speed increases of 3000,000% to me for parallel processes in the past!

Python 3: Parallel diagonalization of multiple matrices

I am trying to improve the performance of some code of mine that first constructs a 4x4 matrix depending on two indices, diagonalizes this matrix, and then stores the eigenvectors of each matrix in a 4-dimensional array. At the moment I am just going through all the indices serially and storing the eigenvectors in their place in the 4-dimensional array. Now I am wondering whether it is possible to parallelize this a little by using threading or something similar, such that each thread would diagonalize one matrix and then store the result in its place. My question is: what are my limitations in doing this? Would I run into problems when different threads want to write into the resulting 4-dimensional array at the same time, and do I have to use a lock to prevent this? I am sorry if this question is trivial, but by searching I was not able to find anything related, and my knowledge about threading is very limited. A minimal example would be:
from numpy import zeros, ones, pi
from numpy.linalg import eigh as eigh2
L = 8  # example lattice size (assumed; the question does not define L)
spectrum = zeros([L//2,L//2,4,4],complex)
for i in range(0,L//2):
    for j in range(0,L//2):
        k = [-(2 * i*2*pi/L),-(2 * j*2*pi/L)]
        H = ones([4,4],complex)
        energies, states = eigh2(H)
        spectrum[i,j,:,:] = states
Note that I have exchanged the function that constructs the matrix in dependence of k for some constant matrix for sake of brevity.
I would really appreciate any help or pointers to resources how I could implement some parallelizations. Is threading a realistic way of improving the performance?

The short answer is that yes, you probably need locks—but if you can reorganize your problem, that may be a lot better than locking.
The long answer is a bit involved, especially since I don't know how much you already know.
In general, threading doesn't do much good in CPython for CPU-bound code, because of the Global Interpreter Lock, which prevents any threads from interpreting a line (actually, bytecode) of Python if another thread is in the middle of doing so. However, NumPy has code that specifically releases the GIL in certain places to allow threading to work better, so if you're CPU-bound within low-level NumPy algorithms, threading actually can work. The docs are not always clear about which functions do this and which don't, so you may have to test it yourself just to find out if parallelizing will help here. (A quick&dirty way to do this is to hack up a version of your code that just does the computations without storing them anywhere, run it across N threads, and see how many cores are busy while you do it.)
Now, in general, in CPython, locks aren't necessary around certain kinds of operations, including __setitem__ on simple types—but that's because of that same GIL, so it isn't going to help you here. If you have multiple operations all trying to write to the same array, they will need a lock around that array.
But there may be a better way around this. If you can find a way to divide the array into smaller arrays, only one of which is being modified at any given time, you don't need any locks. Or, if you can have the threads return smaller arrays that can be assembled by a single master thread into the final answer, instead of working in-place in the first place, that also works.
But before you go doing that… in some cases, NumPy (or, rather, one of the libraries it's using) is already auto-parallelizing things for you, or could be if you built it differently. Or it could be SIMD-vectorizing things in a way that actually gives more speedup than threading, which you could end up breaking. And so on.
So, make sure you have a properly-optimized NumPy with all the optional prereqs installed before you try anything. Then make sure it's only using one core as-is. Then build a test scaffolding so you can compare different implementations. And then you can try out each lock-based, non-sharing, and non-mutating algorithm you can come up with to see if the parallelism helps more than the extra stuff hurts.
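The "threads return small arrays, one master thread assembles them" variant could be sketched as below. This is only an illustration under assumptions: `build_matrix` is a hypothetical placeholder for the k-dependent matrix (constant here, as in the question), and whether the threads actually overlap depends on how much time is spent inside NumPy with the GIL released.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def build_matrix(i, j, L):
    """Hypothetical placeholder for the k-dependent 4x4 matrix; constant, as in the question."""
    return np.ones((4, 4), dtype=complex)


def diagonalize(ij, L):
    """Diagonalize one matrix and return its indices together with the eigenvectors."""
    i, j = ij
    _, states = np.linalg.eigh(build_matrix(i, j, L))
    return i, j, states


def compute_spectrum(L, n_threads=4):
    n = L // 2
    spectrum = np.zeros((n, n, 4, 4), dtype=complex)
    indices = [(i, j) for i in range(n) for j in range(n)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        # Only the main thread writes to `spectrum`, so no lock is needed.
        for i, j, states in ex.map(lambda ij: diagonalize(ij, L), indices):
            spectrum[i, j] = states
    return spectrum
```

For matrices this small, the per-task overhead may outweigh any gain. `np.linalg.eigh` also accepts a stack of matrices of shape (..., 4, 4), so building all matrices first and making a single call may be worth timing against the threaded version.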

Naive and easiest way to decompose independent loop into parallel threads/processes

I have a loop of intensive calculations that I want to accelerate on a multicore processor. They are independent, so they could all be performed in parallel. What is the easiest way to do that in Python?
Let’s imagine that those calculations have to be summed at the end. How can I easily add them to a list or a float variable?
Thanks for all your pedagogic answers and using python libraries ;o)

From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
Edit: Here is a complete example of using the multiprocessing module to perform a simple computation. It uses four processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print(result)
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
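For the "summed at the end" part of the question, one possible sketch uses `concurrent.futures`: the workers return their partial values and the parent process adds them up. The names `work` and `parallel_sum` are illustrative, and the `executor_cls` parameter is an addition made here so the function can also be run with `ThreadPoolExecutor`.

```python
from concurrent.futures import ProcessPoolExecutor


def work(x):
    """Stand-in for one independent, CPU-intensive calculation."""
    return x * x


def parallel_sum(inputs, executor_cls=ProcessPoolExecutor, workers=4):
    """Run `work` on every input in parallel and sum the results in the parent."""
    with executor_cls(max_workers=workers) as ex:
        # map preserves input order; wrap it in list(...) to keep the individual values.
        return sum(ex.map(work, inputs))


if __name__ == "__main__":
    print(parallel_sum(range(10)))   # sum of the squares of 0..9
```

Summing in the parent avoids any shared mutable state, so no lock or shared variable is needed.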

Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which lets you use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.
