Despite the warnings and the general confusion I picked up from the many questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation, and updated a global NumPy array via the callback parameter. I immediately got a 5x speedup on an 8-core machine.
Now, I probably didn't get 8x because of the lock required by each callback call, but what I got is encouraging.
I'm trying to find out if this can be improved upon, or if this is a good result. Questions:
I suppose the returned NumPy arrays got pickled?
Were the underlying NumPy buffers copied or just passed by reference?
How can I find out what the bottleneck is? Any particularly useful technique?
Can I improve on that or is such an improvement pretty common in such cases?
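For reference, here is roughly the kind of pattern I mean (a minimal sketch using Pool.apply_async with a callback; the real per-image work is replaced by a placeholder):
import numpy as np
from multiprocessing import Pool

image_paths = ["img_%d.png" % i for i in range(100)]   # hypothetical input list
result = np.zeros((1000, 1000), dtype=np.float32)      # global accumulator

def process(path):
    # read-image-do-stuff-return; a placeholder for the real computation
    return np.ones((1000, 1000), dtype=np.float32)

def accumulate(arr):
    # callbacks run in the main process, one call per finished task
    global result
    result += arr

if __name__ == "__main__":
    pool = Pool()                      # one worker per core by default
    for path in image_paths:
        pool.apply_async(process, (path,), callback=accumulate)
    pool.close()
    pool.join()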
I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when passing NumPy arrays around. All you have to do is, instead of:
import numpy as np
foo = np.empty(1000000)
do this:
import sharedmem
foo = sharedmem.empty(1000000)
and off you go passing foo from one process to another, like:
q = multiprocessing.Queue()
...
q.put(foo)
Note however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.
Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock.)
Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.
How I did it
It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, and it goes like this:
First, on Unix systems, any object you instantiate before spawning a process will be visible to the child process, and only actually copied when it is modified. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.
Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.
Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).
Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
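Put together, a stripped-down sketch of this layout looks something like the following (load_image, process_image and combine are placeholders for the real loading, mapping and reduction steps):
import numpy as np
from multiprocessing import Process, Queue

def load_image(path):
    return np.zeros((512, 512, 4), dtype=np.float32)   # placeholder for reading from disk

def process_image(img):
    return img                                          # placeholder for the per-image "map" step

def combine(a, b):
    return a + b                                        # placeholder for the reduction

def worker(paths, queue):
    partial = None
    for path in paths:                                  # each worker loads its own images from disk
        mapped = process_image(load_image(path))
        partial = mapped if partial is None else combine(partial, mapped)
    queue.put(partial)                                  # one intermediate result per worker

if __name__ == "__main__":
    all_paths = ["img_%d.tif" % i for i in range(24)]   # hypothetical input list
    n_workers = 6
    chunks = [all_paths[i::n_workers] for i in range(n_workers)]
    queue = Queue()
    procs = [Process(target=worker, args=(chunk, queue)) for chunk in chunks]
    for p in procs:
        p.start()
    final = None
    for _ in procs:                                     # final reduction in the main process
        partial = queue.get()
        final = partial if final is None else combine(final, partial)
    for p in procs:
        p.join()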
I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.
(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)
Profiling
Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:
Compared to the built-in time command (at least in my shell), the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU, and so on. I run it like this to get everything I can:
$ /usr/bin/time -v python test.py
Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.
I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on/off; you never know when you might need it.
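For instance, something along these lines (a rough sketch; worker stands in for whatever function your processes actually run):
import cProfile
import os

DEBUG_PROFILE = True

def worker(task):
    # ... the real per-task work goes here ...
    return task

def profiled_worker(task):
    if not DEBUG_PROFILE:
        return worker(task)
    profiler = cProfile.Profile()
    result = profiler.runcall(worker, task)
    # one profile per process, written to disk for later inspection with pstats
    profiler.dump_stats("worker_%d.prof" % os.getpid())
    return result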
Last but not least, you can get some more or less useful information from a syscall profile (mostly to check that the OS isn't taking ages transferring heaps of data between the processes) by attaching to one of your running Python processes like:
$ sudo strace -c -p <python-process-id>
Related
I have a 32 cores and 64 threads CPU for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive and involves lots of matrix computations based on NumPy. Currently I use Python's default process pool to execute this task, which will create 64 processes. Will it perform better or worse than 32 processes?
I'm not really sure that Python is suited for computationally intensive multi-threading scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to Numpy, since the heavy part, if I recall correctly, is written in C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply for Numpy. I still think that this is not really the best tool you can use, since effective multi-threading programming is quite hard to get right and there are some other nuances that you can read in the link.
It's impossible to give you an answer, as it will depend on your exact problem and code, but potentially also on your hardware.
Basically, the process for multiprocessing is to split the work into X parts, distribute it to each process, let each process work, and then merge each result.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part roughly the same amount of work (if one part takes 90% of the time and can't be split further, it's useless to have more than 2 processes, as you will always be waiting for that one).
If you can do that, and splitting and merging the work/results doesn't take too long (remember that this is extra work, so it will take extra time), then it can be worthwhile to use more processes.
It is also possible that you can speed up your code by using fewer processes, if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes can even be negative).
Also, remember that on some architectures the memory cache is shared among cores, which can badly affect the performance of multiprocessing.
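To make the split / distribute / merge idea concrete, here is a rough sketch (assuming the work really can be expressed as independent compute(part) calls; compute and the chunking are placeholders):
from multiprocessing import Pool

def compute(part):
    return sum(part)                                      # placeholder for the real per-chunk work

if __name__ == "__main__":
    work = list(range(1000000))
    n_parts = 32                                          # split the work into X parts
    parts = [work[i::n_parts] for i in range(n_parts)]    # roughly equal chunks
    with Pool(processes=32) as pool:
        partial_results = pool.map(compute, parts)        # distribute and let each process work
    result = sum(partial_results)                         # merge each result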
I would like some help understanding exactly what I have done and why my code isn't running as I would expect.
I have started to use joblib to try and speed up my code by running a (large) loop in parallel.
I am using it like so:
from joblib import Parallel, delayed
def frame(indeces, image_pad, m):
    XY_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1]:indeces[1]+m, indeces[2]])
    XZ_Patches = np.float32(image_pad[indeces[0]:indeces[0]+m, indeces[1], indeces[2]:indeces[2]+m])
    YZ_Patches = np.float32(image_pad[indeces[0], indeces[1]:indeces[1]+m, indeces[2]:indeces[2]+m])
    return XY_Patches, XZ_Patches, YZ_Patches

def Patch_triplanar_para(image_path, patch_size):
    Image, Label, indeces = Sampling(image_path)
    n = (patch_size - 1) / 2
    m = patch_size
    image_pad = np.pad(Image, pad_width=n, mode='constant', constant_values=0)

    A = Parallel(n_jobs=1)(delayed(frame)(i, image_pad, m) for i in indeces)
    A = np.array(A)
    Label = np.float32(Label.reshape(len(Label), 1))
    R, T, Y = np.hsplit(A, 3)
    return R, T, Y, Label
I have been experimenting with "n_jobs", expecting that increasing this will speed up my function. However as I increase n_jobs, things slow down quite significantly. When running this code without "Parallel", things are slower, until I increase the number of jobs from 1.
Why is this the case? I understood that the more jobs I run, the faster the script would be. Am I using this wrong?
Thanks!
Maybe your problem is caused because image_pad is a large array. In your code, you are using the default multiprocessing backend of joblib. This backend creates a pool of workers, each of which is a Python process. The input data to the function is then copied n_jobs times and broadcasted to each worker in the pool, which can lead to a serious overhead. Quoting from joblib's docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
This can be problematic for large arguments as they will be reallocated n_jobs times by the workers.
As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
Note: The following only applies with the default "multiprocessing" backend. If your code can release the GIL, then using backend="threading" is even more efficient.
So if this is your case, you should either switch to the threading backend (if you are able to release the global interpreter lock when calling frame) or switch to joblib's shared-memory approach.
The docs say that joblib provides an automated memmap conversion that could be useful.
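For example, something like the following (an untested sketch reusing the names from your code; max_nbytes and mmap_mode control joblib's automatic memmapping of large arrays):
from joblib import Parallel, delayed

# 1) let joblib memmap large inputs: arrays bigger than max_nbytes are dumped
#    to disk and passed to the workers as numpy.memmap objects instead of copies
A = Parallel(n_jobs=4, max_nbytes='1M', mmap_mode='r')(
    delayed(frame)(i, image_pad, m) for i in indeces)

# 2) or, if frame releases the GIL (e.g. it mostly runs compiled numpy code),
#    use threads so image_pad is shared rather than copied
A = Parallel(n_jobs=4, backend='threading')(
    delayed(frame)(i, image_pad, m) for i in indeces)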
It's quite possible that the problem you are running up against is fundamental to the nature of the Python interpreter.
If you read "https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en", you can see from a professional who specialises in optimising and parallelising Python code that iterating through large loops is an inherently slow operation for a Python thread to perform. Therefore, spawning more processes that loop through arrays is only going to slow things down.
However - there are things that can be done.
The Cython and Numba compilers are both designed to optimise code that is similar to C/C++ style (i.e. your case) - in particular, Numba's @vectorize decorator allows a scalar function to be applied element-wise over large arrays in parallel (target='parallel').
I don't understand your code enough to give an example of an implementation, but try this! These compilers, used in the correct ways, have brought speed increases of 3,000,000% for me for parallel processes in the past!
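To give a flavour of what that looks like (a toy example of my own, not your code), Numba compiles a scalar function into a ufunc that is applied across whole arrays, with the work spread over multiple threads:
import numpy as np
from numba import vectorize, float32

@vectorize([float32(float32, float32)], target='parallel')
def scaled_add(a, b):
    # scalar body; Numba applies it element-wise across the input arrays
    return a + 0.5 * b

x = np.ones(10000000, dtype=np.float32)
y = np.ones(10000000, dtype=np.float32)
z = scaled_add(x, y)   # runs in parallel, outside the GIL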
I have a strong background in numeric computation using Fortran and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelizing numeric tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the Python module multiprocessing for this and came up with this generic example:
# test cases
# python parallel_python_example.py 1000 1000
# python parallel_python_example.py 10000 50
import sys
import numpy as np
import time
import multiprocessing
import operator

n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])

# class which contains the large dataset and the computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim)  # define large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim)  # define many random vectors, organized in a matrix

    def dot(self, a, b):  # don't use numpy here, so this runs on a single core only!!
        return sum(p * q for p, q in zip(a, b))

    def __call__(self, ii):  # use __call__ as the computation so it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :])  # compute product of one of the vectors and the matrix
        return self.dot(vector, vector)  # return "length" of the result vector

# initialize data
comp = compute(n_dim, n_vec)

# single core
tt = time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time() - tt
print "Time:", time_single

# multi core
for prc in [1, 2, 4, 10]:  # the 10-process case is there to check that large_matrix is only in main memory once
    tt = time.time()
    pool = multiprocessing.Pool(processes=prc)
    result = pool.map(comp, range(n_vec))
    pool.terminate()
    time_multi = time.time() - tt
    print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc, time_multi, time_single / time_multi)
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre@lot:python> python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre@lot:python> python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is, am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e., don't parallelize within Python, but rely entirely on numpy's optimizations)?
While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key for great number-crunching performance in Python.
In principle however, Python (plus 3rd party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure you will get better performance writing (much!) less code than what you achieved before by doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be explicitly programmed. Communication happens mainly through pipes. Processes talking a lot to each other spend most of their time talking and not number crunching. Hence, multiprocessing is best used in cases where the transmission time for input and output data is small compared to the computing time. There are also tricks: you can, for instance, make use of Linux's fork() behavior and share large amounts of memory (read-only!) among multiple multiprocessing processes without having to pass this data around through pipes (see the sketch after this list). You might want to have a look at https://stackoverflow.com/a/17786444/145400.
Cython has already been mentioned; you can use it in special situations to replace performance-critical parts of your Python program with compiled code.
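Here is a minimal sketch of the fork() trick mentioned above (Linux/Unix only; the array and function names are made up). A large read-only array created before the pool is started is inherited by the workers via copy-on-write instead of being pickled and pushed through pipes:
import numpy as np
from multiprocessing import Pool

large_matrix = np.random.rand(5000, 5000)   # created once, before forking

def row_norm(i):
    # the worker only receives the small index i; it reads large_matrix
    # from the inherited (copy-on-write) memory of the parent process
    row = large_matrix[i]
    return float(np.dot(row, row))

if __name__ == "__main__":
    with Pool(4) as pool:
        norms = pool.map(row_norm, range(large_matrix.shape[0]))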
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP8 when writing Python code :-)) and (b) I think that, especially for number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on the communication overhead.
Spoken generally, you should always try to find a way from within Python to control compiled code to do the heavy work for you. Numpy and SciPy provide great interfaces for that.
Number crunching with Python... You should probably learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as the backend.
From the test results you supplied, it appears that you ran your tests on a two core machine. I have one of those and ran your test code getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.
On my two core machine, approximately 20% of the CPU is absorbed simply in keeping my environment going, so when I see a 1.8 improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work the more cores the better as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.
Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find truly surprising is that I get a nice speed-up; however, depending on the communication tool, after a certain number of processes the performance becomes worse. This is illustrated by the pictures below.
y axis - processing time
x axis - nr of processes
colours - size of each individual (nr of floats)
1) Using multiprocessing module - Pool.map
2) Using mpi - Scatter/Gather
3) Both pictures on top of each other
At first I was thinking that it's hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, it should also be visible in the multiprocessing case, and it's not. My other guess is that MPI communication methods are much less efficient than Python's, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's Hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with Hyperthreading, and needs to schedule more than 4 processes to run on 4 physical cores.
However, what's interesting is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores. It uses one or two fully and the rest only partially; again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code; it's really straightforward (I'm not giving the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example). The code for MPI is a little bit different, because it can't deal with a population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier I made some tests on my i5 machine (2/4 cores) and here is the result:
I also found a machine with 2 xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour. When I run my computation in more processes than physical cores it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently because of the lack of resources.
MPI is actually designed for inter-node communication, i.e. talking to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, when compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed memory usage.
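(As an aside, if you do stay with mpi4py and your data can be packed into numpy arrays, the uppercase Send/Recv methods work directly on the array's buffer and skip the pickling that the lowercase send/recv and scatter/gather perform. A rough sketch:)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.empty(1000000, dtype=np.float64)
if rank == 0:
    data[:] = np.random.rand(1000000)
    comm.Send([data, MPI.DOUBLE], dest=1, tag=0)   # buffer-based, no pickling
elif rank == 1:
    comm.Recv([data, MPI.DOUBLE], source=0, tag=0)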
If your OpenMPI is not configured to use shared memory for intra-node communication, messages will be sent through the kernel's TCP stack and back to get delivered to the other process, which again adds some overhead.
If you only intend to do computations within the same machine, there is no need to use MPI here.
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially of how this is influenced by virtualized machines!
I recommend running the ipc-benchmark on the virtualized machine, and post the results.
If they look anything like this benchmark, it can give you a big insight into the difference between TCP, sockets and pipes.
I wrote a program that calls a function with the following prototype:
def Process(n):
    # the function uses data that is stored as binary files on the hard drive and
    # -- based on the value of 'n' -- scans it using functions from numpy & cython.
    # the function creates new binary files and saves the results of the scan in them.
    #
    # I optimized the running time of the function as much as I could using numpy &
    # cython, and at present it takes about 4hrs to complete one function run on
    # a typical winXP desktop (three years old machine, 2GB memory etc).
My goal is to run this function exactly 10,000 times (for 10,000 different values of 'n') in the fastest & most economical way. Following these runs, I will have 10,000 different binary files with the results of all the individual scans. Note that every function 'run' is independent (meaning there is no dependency whatsoever between the individual runs).
So the question is this: having only one PC at home, it is obvious that it will take me around 4.5 years (10,000 runs x 4 hrs per run = 40,000 hrs ~= 4.5 years) to complete all the runs at home. Yet, I would like to have all the runs completed within a week or two.
I know the solution would involve accessing many computing resources at once. What is the best (fastest / most affordable, as my budget is limited) way to do so? Must I buy a strong server (how much would it cost?), or can I have this run online? In that case, would my proprietary code get exposed by doing so?
In case it helps, every instance of 'Process()' only needs about 500MB of memory. Thanks.
Check out PiCloud: http://www.picloud.com/
import cloud
cloud.call(function)
Maybe it's an easy solution.
Does Process access the data in the binary files directly or do you cache it in memory? Reducing the use of I/O operations should help.
Also, isn't it possible to break Process into separate functions running in parallel? What are the data dependencies inside the function?
Finally, you could give some cloud computing service like Amazon EC2 a try (don't forget to read this for tools), but it won't be cheap (EC2 starts at $0.085 per hour) - an alternative would be going to a university with a computer cluster (they are pretty common nowadays, but it will be easier if you know someone there).
Well, from your description, it sounds like things are IO bound... In which case parallelism (at least on one IO device) isn't going to help much.
Edit: I just realized that you were referring more to full cloud computing, rather than running multiple processes on one machine... My advice below still holds, though.... PyTables is quite nice for out-of-core calculations!
You mentioned that you're using numpy's mmap to access the data. Therefore, your execution time is likely to depend heavily on how your data is structured on the disc.
Memmapping can actually be quite slow in any situation where the physical hardware has to spend most of its time seeking (e.g. reading a slice along a plane of constant Z in a C-ordered 3D array). One way of mitigating this is to change the way your data is ordered to reduce the number of seeks required to access the parts you are most likely to need.
Another option that may help is compressing the data. If your process is extremely IO bound, you can actually get significant speedups by compressing the data on disk (and sometimes even in memory) and decompressing it on-the-fly before doing your calculation.
The good news is that there's a very flexible, numpy-oriented library that's already been put together to help you with both of these. Have a look at pytables.
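To give a rough idea (a sketch only; the file and node names are made up), a compressed, chunked on-disk array with PyTables looks something like this:
import numpy as np
import tables

filters = tables.Filters(complevel=5, complib='blosc')   # fast on-the-fly compression

with tables.open_file('scan_data.h5', mode='w') as h5:
    carr = h5.create_carray(h5.root, 'data', tables.Float32Atom(),
                            shape=(10000, 10000), filters=filters)
    # write in slabs so only a slice is ever held in memory
    for start in range(0, 10000, 1000):
        carr[start:start + 1000] = np.random.rand(1000, 10000).astype(np.float32)

with tables.open_file('scan_data.h5', mode='r') as h5:
    block = h5.root.data[2000:2100]   # reads and decompresses only this slab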
I would be very surprised if tables.Expr doesn't significantly (~1 order of magnitude) outperform your out-of-core calculation using a memmapped array. See here for a nice (though canned) example. From that example: