I have a loop of intensive calculations and I want to accelerate them using the multicore processor: they are independent, so they can all be performed in parallel. What's the easiest way to do that in Python?
Let's imagine that those calculations have to be summed at the end. What's an easy way to collect them into a list or a single float variable?
Thanks in advance for all your pedagogic answers using Python libraries ;o)
From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
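For instance, a minimal vectorized sketch of the same squares computation as in the example below, assuming numpy is installed:

import numpy as np

x = np.arange(10)
squares = x * x        # one vectorized operation, no Python-level loop
total = squares.sum()  # reductions also run in compiled code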
Edit: Here is a complete example of using the multiprocessing module to perform a simple computation. It uses four worker processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print result
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
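To address the follow-up about summing the results: pool.map already returns an ordinary list, so collecting into a list or reducing to a single float is direct. A minimal sketch, reusing f and inputs from the example above:

results = pool.map(f, inputs)  # already a list, in input order
total = sum(results)           # or reduce straight to a single float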
Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which lets you use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use, as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.
Related
I have a CPU with 32 cores and 64 threads for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive and involves lots of matrix computations based on NumPy. Currently, I use the default Python process pool to execute this task, which creates 64 processes. Will it perform better or worse than with 32 processes?
I'm not really sure that Python is suited for computationally intensive multi-threading scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to NumPy, since, if I recall correctly, the heavy parts are written in C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right, and there are some other nuances that you can read about in the link.
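For reference, a minimal local-mode sketch of the Spark suggestion, assuming pyspark is installed (local[8] requests eight local workers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[8]").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())  # work is distributed over the 8 workers
spark.stop()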
It's impossible to give you a definitive answer, as it will depend on your exact problem and code, but potentially also on your hardware.
Basically, the procedure for multiprocessing is to split the work into X parts, distribute them to the processes, let each process work, and then merge the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part at roughly the same amount of work (if one part takes 90% of the time and cannot be split, it's useless to have more than 2 processes, as you will always be waiting for that one).
If you can do that, and splitting and merging the work/results doesn't take too long (remember that this is extra work, so it takes extra time), then it can pay off to use more processes.
It is also possible that you can speed up your code by using fewer processes, if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes can even be negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect the performance of multiprocessing.
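To make the split/distribute/merge procedure concrete, a minimal sketch (the square-and-sum workload is a hypothetical stand-in for real work):

from multiprocessing import Pool

def work(chunk):
    # each process handles one roughly equal-sized part
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    data = range(1000000)
    n_parts = 4
    chunks = [data[i::n_parts] for i in range(n_parts)]  # split the work
    with Pool(n_parts) as pool:
        partials = pool.map(work, chunks)                # distribute and compute
    total = sum(partials)                                # merge the results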
I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using it.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent.futures; multiprocessing is fine too. It's a similar problem: it's not obvious how to use it inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
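For concreteness, one possible sketch of the bounded "queue of futures" idea using concurrent.futures (slow_calculation is a hypothetical stand-in for the real work; note that results arrive out of submission order):

from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def slow_calculation(seed):  # hypothetical stand-in for the real work
    return seed * seed

def parallel_generator(max_in_flight=8):
    with ProcessPoolExecutor() as executor:
        pending = set()
        seed = 0
        while True:
            # keep a bounded number of tasks in flight instead of mapping forever
            while len(pending) < max_in_flight:
                pending.add(executor.submit(slow_calculation, seed))
                seed += 1
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()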
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for you.
Maybe joblib's documentation is not explicit enough on this point, showing only examples with for loops, but since you want to use a generator I should point out that it indeed works with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    pass

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of transferring large numpy arrays between processes, and you still benefit from true parallelism.
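As for the multiprocessing.Array part of the question: a minimal sketch, assuming a fixed-size float64 array, is to view the shared buffer through numpy on both sides so the data itself is never pickled:

import numpy as np
from multiprocessing import Array, Process

def producer(shared, shape):
    # wrap the shared buffer as a numpy array: no pickling, no copying
    buf = np.frombuffer(shared.get_obj()).reshape(shape)
    buf[:] = np.random.rand(*shape)  # the slow calculation would go here

if __name__ == '__main__':
    shape = (1000, 1250)                      # about 10 MB of float64
    shared = Array('d', shape[0] * shape[1])  # 'd' means C double (float64)
    p = Process(target=producer, args=(shared, shape))
    p.start()
    p.join()
    result = np.frombuffer(shared.get_obj()).reshape(shape)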
I have a strong background in numerical computation using Fortran and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelization for numeric tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the Python module multiprocessing for this and came up with this generic example:
#test cases
#python parallel_python_example.py 1000 1000
#python parallel_python_example.py 10000 50
import sys
import numpy as np
import time
import multiprocessing

n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])

#class which contains a large dataset and a computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim)  # define a large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim)  # define many random vectors, organized as rows of a matrix

    def dot(self, a, b):  # don't use numpy here, to ensure the work runs on a single core only!!
        return sum(p * q for p, q in zip(a, b))

    def __call__(self, ii):  # use __call__ for the computation so it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :])  # compute the product of one of the vectors and the matrix
        return self.dot(vector, vector)  # return the squared "length" of the result vector

#initialize data
comp = compute(n_dim, n_vec)

#single core
tt = time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time() - tt
print "Time:", time_single

#multi core
for prc in [1, 2, 4, 10]:  # the 10-process case is there to check that large_matrix is in main memory only once
    tt = time.time()
    pool = multiprocessing.Pool(processes=prc)
    result = pool.map(comp, range(n_vec))
    pool.terminate()
    time_multi = time.time() - tt
    print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc, time_multi, time_single / time_multi)
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre@lot:python> python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre@lot:python> python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is: am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e., don't parallelize within Python, but rely entirely on numpy's optimizations)?
While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key for great number-crunching performance in Python.
In principle, however, Python (plus 3rd-party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance writing (much!) less code than what you achieved before by doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!
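A quick way to check what your numpy build links against (for MKL/OpenMP-backed builds the thread count is typically controlled via the OMP_NUM_THREADS environment variable):

import numpy as np

np.show_config()  # prints the BLAS/LAPACK backends (e.g. MKL, OpenBLAS) numpy was built against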
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be explicitly programmed. Communication happens mainly through pipes. Processes talking a lot to each other spend most of their time talking and not number crunching. Hence, multiprocessing is best used in cases when the transmission time for input and output data is small compared to the computing time. There are also tricks, you can for instance make use of Linux' fork() behavior and share large amounts of memory (read-only!) among multiple multiprocessing processes without having to pass this data around through pipes. You might want to have a look at https://stackoverflow.com/a/17786444/145400.
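A minimal sketch of that fork() trick, assuming Linux (with the spawn start method used on Windows, each worker would re-create the array instead of inheriting it):

import numpy as np
from multiprocessing import Pool

# created before the Pool forks, so every worker inherits this array
# via copy-on-write instead of receiving a copy through a pipe
shared = np.random.rand(5000, 1000)

def row_sq_norm(i):
    # workers read the inherited array; only the index i is pickled
    return float(np.dot(shared[i], shared[i]))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        norms = pool.map(row_sq_norm, range(shared.shape[0]))
    print(max(norms))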
Cython has already been mentioned, you can use it in special situations and replace performance-critical code parts in your Python program with compiled code.
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP 8 when writing Python code :-)), and (b) I think that, especially regarding number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on the communication overhead.
Generally speaking, you should always try to find a way, from within Python, to let compiled code do the heavy work for you. Numpy and SciPy provide great interfaces for that.
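For instance, a sketch of the vectorized equivalent of the benchmark above: a single BLAS-backed call replaces the entire pure-Python dot loop, and with an MKL/OpenBLAS build of numpy it already runs on multiple threads:

import numpy as np

n_dim, n_vec = 10000, 50
large_matrix = np.random.rand(n_dim, n_dim)
many_vectors = np.random.rand(n_vec, n_dim)

products = np.dot(many_vectors, large_matrix)        # all vector-matrix products at once
lengths = np.einsum('ij,ij->i', products, products)  # squared length of each result vector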
Number crunching with Python... You should probably learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization, using OpenMP as the backend.
From the test results you supplied, it appears that you ran your tests on a two core machine. I have one of those and ran your test code getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.
On my two-core machine, approximately 20% of the CPU is absorbed simply in keeping my environment going, so when I see a 1.8x improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work, the more cores the better, as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.
while True:
    Number = len(SomeList)
    OtherList = array([None] * Number)
    for i in xrange(Number):
        OtherList[i] = (NumPy array calculation using only the i-th element of arrays Array_1, Array_2, and Array_3)
The 'Number' elements of OtherList and of the other arrays can be calculated separately.
However, as the program is time-dependent, we cannot proceed with further work until all 'Number' elements have been processed.
Will multiprocessing be a good solution for this operation?
I need to speed this process up as much as possible.
If there is a better approach, please suggest the code.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginner's guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are generic guidelines, in no particular order:
measure the performance of your code and find the hot spots. Your code might spend longer loading input data than on all the calculations. Set your goal; define what trade-offs are acceptable
check with automated tests that you get the expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity. An O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatically looping) calculations to replace the explicit loops of the Python-only solution
rewrite the places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping the C extensions
minimize memory allocation and data copying. Make it cache friendly
explore whether multiple threads might be useful in your case, e.g., cython.parallel.prange(); release the GIL (see the sketch after this list)
compare with a multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel
iterate
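On the threads point, a minimal sketch, assuming the heavy work happens inside numpy calls (which release the GIL while the compiled code runs), so plain threads can genuinely overlap:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

arrays = [np.random.rand(2000, 2000) for _ in range(8)]

def spectral_norm(a):
    # the LAPACK work inside releases the GIL, so threads overlap
    return np.linalg.norm(a, 2)

with ThreadPoolExecutor(max_workers=4) as ex:
    norms = list(ex.map(spectral_norm, arrays))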
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and de-serializing the results is offset by the gains.
The best way to know whether there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed via IPC (which is what multiprocessing does). See the sketch below.
You use a work queue and add tasks to it as soon as they become available. This way, you can make sure there is always more work waiting when a process finishes a task.
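A minimal sketch combining both suggestions (the element-wise formula is hypothetical; one None sentinel per worker shuts the queue down cleanly):

import multiprocessing as mp
import numpy as np

def worker(task_q, result_q, Array_1, Array_2, Array_3):
    # the constant arrays arrive once, as args, not with every task
    for i in iter(task_q.get, None):  # None is the stop sentinel
        result_q.put((i, Array_1[i] * Array_2[i] + Array_3[i]))

if __name__ == '__main__':
    Array_1, Array_2, Array_3 = (np.random.rand(1000) for _ in range(3))
    task_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q, Array_1, Array_2, Array_3))
             for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(1000):
        task_q.put(i)      # add tasks as soon as they become available
    for _ in procs:
        task_q.put(None)   # one sentinel per worker
    results = [result_q.get() for _ in range(1000)]
    for p in procs:
        p.join()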
This time I'm facing a "design" problem. Using Python, I have to implement a mathematical algorithm which uses 5 parameters. To find the best combination of these 5 parameters, I used a 5-layer nested loop to enumerate all possible combinations in a given range. The time it takes to finish turned out to be far beyond my expectation. So I think it's time to use multithreading...
The task at the core of the nested loops is calculation and saving. In the current code, the result of every calculation is appended to a list, and the list is written to a file at the end of the program.
Since I don't have much experience with multithreading in any language, not to mention Python, I would like to ask for some hints on what the structure for this problem should be. Namely, how should the calculations be assigned to the threads dynamically, and how should the threads save their results and later combine them into one file? I hope the number of threads can be adjustable.
Any illustration with code will be very helpful.
thank you very much for your time, I appreciate it.
Update (2nd day):
Thanks for all the helpful answers. I now know that what I need is multiprocessing instead of multithreading. I always confused these two concepts, because I thought that if a program is multithreaded, the OS would automatically run it on multiple processors when available.
I will find time to have some hands-on with multiprocessing tonight.
You can try using jug, a library I wrote for very similar problems. Your code would then look something like this:
from jug import TaskGenerator

evaluate = TaskGenerator(evaluate)

results = []
for p0 in [1, 2, 3]:
    for p1 in xrange(10):
        for p2 in xrange(10, 20):
            for p3 in [True, False]:
                for p4 in xrange(100):
                    results.append(evaluate(p0, p1, p2, p3, p4))
Now you could run as many processes as you'd like (even across a network if you have access to a computer cluster).
Multithreading in Python won't win you anything in this kind of problem, since Python doesn't execute threads in parallel (it uses them for I/O concurrency, mostly).
You want multiprocessing instead, or a friendly wrapper for it such as joblib:
from joblib import Parallel, delayed

# n_jobs=-1 == use all available processors
results = Parallel(n_jobs=-1)(delayed(evaluate)(*x) for x in enum_combinations())
print best_of(results)
Here, enum_combinations would enumerate all combinations of your five parameters; you can likely implement it by putting a yield at the bottom of your nested loops.
joblib distributes the combinations over multiple worker processes, taking care of some load balancing.
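For completeness, a sketch of enum_combinations built exactly as suggested, reusing the placeholder parameter ranges from the jug example above:

def enum_combinations():
    # yields each combination of the five parameters as one tuple
    for p0 in [1, 2, 3]:
        for p1 in xrange(10):
            for p2 in xrange(10, 20):
                for p3 in [True, False]:
                    for p4 in xrange(100):
                        yield (p0, p1, p2, p3, p4)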
Assuming this is a calculation-heavy problem (and thus CPU-bound), multi-threading won't help you much in Python due to the GIL.
What you can, however, do is split the calculation across multiple processes to take advantage of extra CPU cores. The easiest way to do this is with the multiprocessing library.
There are a number of examples for how to use multiprocessing on the docs page for it.