I'm trying to understand how to do distributed processing with ipyparallel and a Jupyter notebook, so I ran some tests and got odd results.
import numpy as np
from ipyparallel import Client

rc = Client()
dview = rc[:]
bview = rc.load_balanced_view()
%px import numpy as np  # import numpy on the engines as well

print(len(dview))
print(len(bview))

data = [np.random.rand(10000)] * 4
%time np.sin(data)
%%time #45.7ms
results = dview.map(np.sin, data)
results.get()
%%time #110ms
dview.push({'data': data})
%px results = np.sin(data)
results
%%time #4.9ms
results = np.sin(data)
results
%%time #93ms
results = bview.map(np.sin, data)
results.get()
What explains the overhead? Is the task I/O-bound in this case, so that a single core simply does it better?
I tried larger arrays and still got better times without parallel processing.
Thanks for the advice!
The problem seems to be the I/O. push sends the whole data set to every node. I am not sure about the map function, but most likely it splits the data into chunks that are sent to the nodes, so smaller chunks mean faster processing. The load balancer most likely sends both the data and the task twice to the same node, which significantly hurts performance.
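As a rough illustration of that point, here is a minimal sketch (assuming the same rc, dview and data as in the question, and numpy already imported on the engines) that uses scatter/gather so that each engine receives only its own slice instead of the full data set:

# Send each engine only its slice of `data`, compute there, and gather the results.
dview.scatter('chunk', data, block=True)
dview.execute('local_result = [np.sin(x) for x in chunk]', block=True)
results = dview.gather('local_result', block=True)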
And how did you manage to send the data in 40 ms? I am used to the HTTP protocol, where the handshake alone takes about a second. For me, 40 ms over the network is lightning fast.
EDIT (about the long times, ~40 ms):
On a local network a ping time of 1-10 ms is considered normal. Taking into account that you first need to do a handshake (at least 2 signals) and only then send the data (at least 1 signal) and wait for the response (another signal), you are already talking about ~20 ms just for connecting two computers. Of course, you can try to bring the ping time down to 1 ms and then use a faster MPI protocol, but as I understand it that does not improve the situation dramatically: it is only about one order of magnitude faster.
Therefore the general recommendation is to use larger jobs. For example, the fairly fast Dask distributed framework (faster than Celery, according to benchmarks) recommends task durations of more than 100 ms; otherwise the framework's overhead starts to outweigh the execution time and the benefits of parallelization disappear. See "Efficiency" in the Dask Distributed documentation.
I have to run a piece of code that takes about 3 hours to complete, 100 times, which amounts to a total computation time of 300 hours. The code is expensive: it involves computing Laplacians and contour points and outputting plots. I also have access to a computing cluster, which gives me 100 cores at once. So I wondered whether I could run my 100 jobs on 100 individual cores at once (no communication between cores) and reduce the total computation time to somewhere around 3 hours.
My online reading took me to multiprocessing.Pool, and I ended up using Pool.apply_async(). I quickly realized that the average completion time per task was way over 3 hours, the completion time of a single task prior to parallelizing. Further research suggested it might be an overhead issue one has to deal with when parallelizing, but I don't think that can account for an 8-fold increase in computation time.
I was able to reproduce the issue with a simple, shareable piece of code:
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
    '''Junk to just waste computing power!'''
    arr = np.ones(shape=(10000, 10000)) * x
    val1 = arr * arr
    val2 = np.sqrt(val1)
    val3 = np.roll(arr, -1) - np.roll(arr, 1)
    plt.subplot(121)
    plt.imshow(arr)
    plt.subplot(122)
    plt.imshow(val3)
    plt.savefig('f.png')
    plt.close()
    print('Done')

N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
    pool.apply_async(f, [i])
pool.close()
pool.join()
print('Time taken = ', time.time() - t0)
The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:
N=1 || Time taken = 7.964005947113037
N=7 || Time taken = 40.3266499042511.
Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.
[Question 1] Is this all a case of overhead (whatever it entails!)?
[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?
For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.
Thank you in advance for your help and input!
I suspect that (perhaps due to memory overhead) your program is simply not CPU-bound when running multiple copies, so process parallelism is not speeding it up as much as you might expect.
An easy way to test this is just to have something else do the parallelism and see if anything is improved.
With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.
$ for i in $(seq 8) ; do python paralleltest & done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286
I also tried this with all the multiprocessing machinery removed and just a single call to f(0), and the results were quite similar. So Python's multiprocessing is doing what it can, but this job is constrained by more than just CPU.
When I replaced the code with something less memory-intensive (math.factorial(1400000)), the parallelism looked a lot better: 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.
This occurs at least partly because you're already using most of your CPU in just one process. NumPy is generally built to take advantage of multiple threads without the need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside NumPy functions; otherwise the comparatively slower Python will dominate the time taken in between NumPy calls. When you add tasks in parallel while the CPU is already fully loaded, each one runs slower. See this question on adjusting the number of threads NumPy uses, and possibly try setting it to only 1 thread to convince yourself that this is the case; one way to do that is sketched below.
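One minimal sketch of capping NumPy's internal threading for such a test; which of these environment variables actually matters depends on the BLAS your NumPy is linked against, so treat the list as an assumption rather than exhaustive:

import os
# These must be set before numpy is imported.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np
arr = np.ones((10000, 10000))
val = np.sqrt(arr * arr)   # now limited to one thread, giving a fair single-core baseline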
I find that multiprocessing.Pool() does not behave as expected in my case below. Could anyone explain why it behaves this way and, if possible, how to improve the performance? Here is a simplistic piece of code:
import numpy as np
import multiprocessing
from itertools import repeat

def group_data_by_runID(args):
    data, runID = args
    return data[data[:, 0].astype(int) == runID, :]

%%time
DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])
runIDs = [0,1,2]*10000000
pool = multiprocessing.Pool(40)
list(pool.map(group_data_by_runID, zip(repeat(DATA), runIDs)))
As you can see in the above code, I intended to use 40 cores (56 cores and far more than enough memory are available on this system) to run the code; it took 1 min 31 s. Then I used:
list(map(group_data_by_runID, zip(repeat(DATA), runIDs)))
It took 2 min 33 s. So 40 cores gave less than a twofold speed-up, which is very weird to me. I also notice that even when I ask for 40 cores, it sometimes doesn't actually run on 40 cores, as can be seen in htop.
Where did I go wrong? And how can I improve the speed? Please note that the actual data is much larger.
Maybe there are still many people like me who are confused by the performance of multiprocessing in Python: sometimes you achieve a performance gain and sometimes you even get worse performance. Thus I decided to answer this question myself, based on my own experience with multiprocessing.
There may be an overhead to using multiprocessing if your input data is large, because the data is copied and sent across to the different processes, as juanpa commented above. This overhead can be very significant. However, we can still get a huge performance gain by chopping the input data into small chunks and letting each process handle one chunk, as sketched below.
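A minimal sketch of that chunking idea for the example above; the chunk size and the group_chunk helper are my own assumptions, not part of the original code:

import numpy as np
import multiprocessing

DATA = np.array([[0,1],[0,2],[0,3],[0,4],[1,5],[1,6],[1,7],[1,8],[2,9],[2,10],[2,11],[2,12]])
runIDs = [0, 1, 2] * 10000000

def group_chunk(args):
    # Hypothetical helper: one task handles a whole chunk of runIDs,
    # so DATA travels once per chunk instead of once per runID.
    data, id_chunk = args
    ids = data[:, 0].astype(int)
    return [data[ids == runID, :] for runID in id_chunk]

if __name__ == '__main__':
    chunk_size = 100000  # assumed value; tune so each task runs well over ~100 ms
    chunks = [runIDs[i:i + chunk_size] for i in range(0, len(runIDs), chunk_size)]
    with multiprocessing.Pool(40) as pool:
        nested = pool.map(group_chunk, [(DATA, c) for c in chunks])
    results = [r for chunk_result in nested for r in chunk_result]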
Another scenario where a significant performance gain can be achieved is when there is no input data to send, for example when reading data from tens or hundreds of files.
Although multiprocessing can boost the speed, most of the effort should still be spent on the algorithm itself, which fundamentally determines the efficiency of the code.
I have implemented a map function that parses strings into an XML tree, traverses this tree and extracts some information: a lot of if-then-else logic, no additional I/O code.
The speedup we got from Dask was not very satisfying, so we took a closer look at the raw execution performance on a single but large item (a 580 MB XML string) in a single partition.
Here's the code:
import time

def timedMap(x):
    start = time.time()
    # computation, no IO or access to Dask variables, no threading or multiprocessing
    ...
    return time.time() - start
print("Direct Execution")
print(timedMap(xml_string))
print("Dask Distributed")
import dask
from dask.distributed import Client
client = Client(threads_per_worker=1, n_workers=1)
print(client.submit(timedMap, xml_string).result())
client.close()
print("Dask Multiprocessing")
import dask.bag as db
bag = db.from_sequence([xml_string], npartitions=1)
print(bag.map(timedMap).compute()[0])
The output (that is, the time measured inside timedMap, without the before/after overhead) is:
Direct Execution
54.59478211402893
Dask Distributed
91.79525017738342
Dask Multiprocessing
86.94628381729126
I've repeated this many times. It seems that Dask not only has an overhead for communication and task management; the individual computation steps are also significantly slower.
Why is the computation inside Dask so much slower? I suspected the profiler and increased the profiling interval from 10 to 1000 ms, which knocked off 5 seconds. But still... Then I suspected memory pressure, but the worker doesn't reach its limit, not even 50% of it.
In terms of overhead (measuring total times submit+result and map+compute), Dask adds 18 seconds for the distributed case and 3 seconds for the multiprocessing case. I'm happy to pay this overhead, but I don't like that the actual computation takes so much longer. Do you have a clue why this would be the case? What can I tweak to speed this up?
Note: when I double the compute load, all durations roughly double as well.
All the best,
Boris
You shouldn't expect any speedup because you have only a single partition.
You should expect some slowdown because you're moving 500MB across processes in Python. I recommend that you read the following: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
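A minimal sketch of what that best-practices entry suggests, reusing timedMap from the question; the file path and the load_and_time helper are placeholders, not from the original post:

from dask.distributed import Client

client = Client(threads_per_worker=1, n_workers=1)

# Option 1: load the data inside the task, so the large string never
# travels from the client to the worker.
def load_and_time(path):           # hypothetical helper
    with open(path) as fh:
        xml_string = fh.read()
    return timedMap(xml_string)

print(client.submit(load_and_time, "data.xml").result())   # placeholder path

# Option 2: if the string already lives on the client, scatter it once and
# submit against the resulting future, so repeated submits reuse the copy
# that is already on the worker.
xml_future = client.scatter(xml_string)
print(client.submit(timedMap, xml_future).result())

client.close()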
I have a strong background in numerical computation using Fortran and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to Python since it is much more fun (at least for me) to develop with, but parallelization for numerical tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the Python module multiprocessing for this and came up with this generic example:
#test cases
#python parallel_python_example.py 1000 1000
#python parallel_python_example.py 10000 50
import sys
import numpy as np
import time
import multiprocessing
import operator

n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])

#class which contains a large dataset and a computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim)  #define large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim)  #define many random vectors, organized in a matrix
    def dot(self, a, b):  #don't use numpy, to run on a single core only!!
        return sum(p*q for p, q in zip(a, b))
    def __call__(self, ii):  # use __call__ as the computation so it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :])  #compute the product of one of the vectors and the matrix
        return self.dot(vector, vector)  # return "length" of the result vector

#initialize data
comp = compute(n_dim, n_vec)

#single core
tt = time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time() - tt
print "Time:", time_single

#multi core
for prc in [1, 2, 4, 10]:  #the 10-process case is there to check that large_matrix exists only once in main memory
    tt = time.time()
    pool = multiprocessing.Pool(processes=prc)
    result = pool.map(comp, range(n_vec))
    pool.terminate()
    time_multi = time.time() - tt
    print "Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc, time_multi, time_single/time_multi)
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre#lot:python>python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre#lot:python>python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is: am I misusing the multiprocessing module here? Or is this just the way it goes with Python (i.e. don't parallelize within Python, but rely entirely on numpy's optimizations)?
While there is no general answer to your question (in the title), I think it is fair to say that multiprocessing alone is not the key to great number-crunching performance in Python.
In principle, however, Python (plus 3rd-party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance by writing (much!) less code than what you achieved by doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself against Intel MKL and OpenMP (or maybe a sysadmin at your facility has already done so). That way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably still underestimated. Get your hands on a properly compiled numpy and scipy! A quick check is sketched below.
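A quick, minimal way to see which BLAS/LAPACK your installation is actually linked against (nothing here is specific to the poster's setup):

import numpy as np
np.show_config()   # lists the BLAS/LAPACK libraries numpy was built against (e.g. MKL, OpenBLAS)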
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be programmed explicitly, and it happens mainly through pipes. Processes that talk a lot to each other spend most of their time talking rather than number crunching. Hence, multiprocessing is best used in cases where the transmission time of the input and output data is small compared to the computing time. There are also tricks: you can, for instance, make use of Linux's fork() behavior and share large amounts of (read-only!) memory among multiple multiprocessing processes without having to pass the data around through pipes. You might want to have a look at https://stackoverflow.com/a/17786444/145400; a sketch of that idea follows.
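A minimal sketch of that fork() trick, under the assumption of a Unix system where multiprocessing uses fork (the names are illustrative, not from the question): the large array is created at module level before the pool starts, so the children inherit the same physical pages and only a small index travels through the pipe.

import numpy as np
import multiprocessing

# Created before the Pool: on Linux the workers inherit this via fork(),
# so it exists only once in physical memory as long as it stays read-only.
LARGE_MATRIX = np.random.rand(10000, 1000)

def row_norm(ii):
    # Only the integer index `ii` is sent through the pipe; the matrix
    # itself is read from the inherited (copy-on-write) memory.
    row = LARGE_MATRIX[ii, :]
    return float(np.dot(row, row))

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    norms = pool.map(row_norm, range(LARGE_MATRIX.shape[0]))
    pool.close()
    pool.join()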
Cython has already been mentioned; you can use it in special situations and replace performance-critical parts of your Python program with compiled code.
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP 8 when writing Python code :-)) and (b) I think that, especially for number crunching, the right solution depends on the problem. You have already observed in your benchmark what I outlined above: in the context of multiprocessing, it is especially important to keep an eye on the communication overhead.
Generally speaking, you should always try to find a way, from within Python, to have compiled code do the heavy work for you. Numpy and SciPy provide great interfaces for that; a tiny comparison is sketched below.
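For instance, a minimal sketch (reusing only the array shapes from the question, not its code) of letting numpy's compiled, possibly multi-threaded BLAS do what the hand-written dot() does in pure Python:

import numpy as np
import time

n_dim, n_vec = 1000, 1000
large_matrix = np.random.rand(n_dim, n_dim)
many_vectors = np.random.rand(n_vec, n_dim)

t0 = time.time()
# One BLAS call computes all n_vec products at once (equivalent to the
# question's dot(large_matrix, many_vectors[ii, :]) for every ii) ...
products = many_vectors.dot(large_matrix)            # shape (n_vec, n_dim)
# ... and the squared "lengths" are a simple vectorized reduction.
lengths = np.einsum('ij,ij->i', products, products)
print("numpy time:", time.time() - t0)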
Number crunching with Python... You should probably learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as the backend.
From the test results you supplied, it appears that you ran your tests on a two-core machine. I have one of those and ran your test code, getting similar results. What these results show is that there is little benefit to running more processes than you have cores, for numerical applications that lend themselves to parallel computation.
On my two-core machine, approximately 20% of the CPU is absorbed simply by keeping my environment going, so when I see a 1.8x improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work the more cores the better, as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.
Lately I've observed a weird effect when measuring the performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially, with the exception of the evaluation. After all evolutionary operators have been applied, all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (Python ones). Before the evaluation, a data set is scattered either by MPI's scatter or Python's Pool.map; then comes the parallel evaluation, and later the data comes back through MPI's gather or again through the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find truly surprising is that I do get a nice speed-up; however, depending on the communication tool, after a certain number of processes the performance becomes worse. This is illustrated by the figures described below.
[Figures: processing time (y axis) vs. number of processes (x axis); colours indicate the size of each individual (number of floats). 1) Using the multiprocessing module's Pool.map. 2) Using MPI Scatter/Gather. 3) Both plots overlaid.]
At first I thought it was hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (the 4 physical cores). However, that should then also be visible in the multiprocessing case, and it isn't. Another guess of mine was that the MPI communication methods are much less effective than Python's, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with hyperthreading and needs to schedule more than 4 processes to run on the 4 physical cores.
However, what is interesting is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores: it uses one or two to the maximum and the rest only partially; again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code, it's really straightforward (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed; if someone is really interested in the problem and ready to install DEAP, I can prepare a short example). The code for MPI is a little bit different, because it can't deal with the population container (which inherits from list). There is some overhead, of course, but nothing major. Apart from the code I show below, the rest of it is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...

for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier, I ran some tests on my i5 machine (2/4 cores), and here is the result:
I also found a machine with two Xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour: when I run my computation on more processes than there are physical cores, it starts getting worse. I believe this is because processes on the same physical core can't execute concurrently due to the lack of resources.
MPI is actually designed for inter-node communication, i.e. for talking to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, compared to e.g. threading.
mpi4py makes a copy of every message, since it is targeted at distributed-memory usage.
If your OpenMPI is not configured to use shared memory for intra-node communication, the message will be sent through the kernel's TCP stack, and back, to be delivered to the other process, which again adds some overhead.
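As a rough sketch (the transport names are an assumption about your Open MPI build: releases of the 1.4.x era call the shared-memory transport sm, newer ones vader, and the script name here is a placeholder), you can force shared memory for intra-node traffic and check what your installation actually provides:

# Use only the loopback and shared-memory transports for ranks on one node;
# on newer Open MPI replace "sm" with "vader".
mpirun --mca btl self,sm -np 4 python evolution.py

# List the BTL transport components available in this Open MPI installation.
ompi_info | grep btl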
If you only intend to do computations within the same machine, there is no need to use mpi here.
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make some sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially of how this is influenced by virtualized machines!
I recommend running the ipc-benchmark on the virtualized machine and posting the results.
If they look anything like this benchmark, it can give you great insight into the difference between tcp, sockets and pipes.