It may be a really stupid question, but I didn't find any doc which perfectly answers that question. I'm trying to familiarise with the multiprocessing library on python try to paraglide task using multiprocessing.Pool.
I initiate the number of processes in my Pool with:
Pool(processes=nmbr_of_processes).
The thing is I don't understand exactly how this number of process reduce the work duration time. I wrote a script to evaluate it.
def test_operation(y):
sum = 0
for x in range(1000):
sum += y*x
def main():
time1 = time.time()
p = mp.Pool(processes=2)
result = p.map(test_operation, range(100000))
p.close()
p.join()
print('Parallel tooks {} seconds'.format(time.time() - time1))
final = list()
time2 = time.time()
for y in range(100000):
final.append(test_operation(y))
print('Serial tooks {} seconds'.format(time.time() - time2))
The thing is, when I'm using 2 processes with mp.Pool(processes=2) I get typically:
Parallel took 5.162384271621704 seconds
Serial took 9.853888034820557 seconds
And if I'm using more processes, like p = mp.Pool(processes=4)
I get:
Parallel took 6.404058218002319 seconds
Serial took 9.667300701141357 seconds
I'm working on a MacMini DualCore i7 3Ghz. I know I can't reduce the work duration time to half the time it took with a serial work. But I can't understand why adding more processes increase work duration time compared to a work with 2 processes. And if there is an optimal number of processes to start depending of the cpu, what would it be ?
The thing to note here is that this applies to CPU-bound tasks; your code is heavy on CPU usage. The first thing to do is check how many theoretical cores you have:
import multiprocessing as mp
print(mp.cpu_count())
For CPU-bound tasks like this, there is no benefit to be gained by creating a pool with more workers than theoretical cores. If you don't specify the size of the Pool, it will default back to this number. However, this neglects something else; your code is not the only thing that your OS has to run.
If you launch as many processes as theoretical cores, the system has no choice but to interrupt your processes periodically simply to keep running, so you're likely to get a performance hit. You can't monopolise all cores. The general rule-of-thumb here is to have a pool size of cpu_count() - 1, which leaves the OS a core free to use on other processes.
I was surprised to find that other answers I found don't mention this general rule; it seems to be confined to comments etc. However, your own tests show that it is applicable to the performance in your case so is a reasonable heuristic to determine pool size.
Related
I've got an "embarrassingly parallel" problem running on python, and I thought I could use the concurrent.futures module to parallelize this computation. I've done this before successfully, and this is the first time I'm trying to do this on a computer that's more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of processors spool up to 100%. (I'm talking about the "equalizer bars" at the top of the terminal:
Screenshot of htop
My question is — if I'm only spinning out 10 workers, why do ALL of my processors seem to be being used at maximum capacity? My working theory is that the 10 workers I call are "reserved," and the other processors just jump in to help out... if someone else were to run something else and ask for some processing power (but NOT including my 10 requested workers), my other tasks would back off and give them back. But... this isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
a = sum(list(range(10 ** arg)))
print(a)
return a
def main():
import concurrent.futures
from random import randrange
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
# Submit the tasks
futures = []
for i in range(100):
random_argument = randrange(5, 7)
futures.append(executor.submit(expensive_function, random_argument))
# Monitor your progress:
num_results = len(futures)
for k, _ in enumerate(concurrent.futures.as_completed(futures)):
print(f'********** Completed {k + 1} of {num_results} simulations **********')
if __name__ == '__main__':
main()
due to the GIL a single proccess can have only 1 thread executing python bytecode at a given time, so if you have 10 processes you should have 10 threads (and therefore cores) executing python bytecode at a given time, however this is not the full story.
the expensive_function is ambiguous, python can create 10 worker processes, and therefore you can only have 10 cores executing python code at a given time (+ main process) (due to GIL), however, if expensive_function is doing some sort of multithreading using an external C module (which doesn't have to abide to the GIL), then each of the 10 processes can have Y threads working in parallel and therefore you'll have a total of 10*Y cores being utilized at a given time, for example your code might be running 6 threads externally on each of the 10 processes for a total of 60 threads running concurrently on 60 cores.
however this doesn't really answer your question, so the main answer is, workers is the number of processes (cores) that can execute python bytecode at a given time (with a strong emphasis on "python bytecode"), wheres tasks is the total number of tasks that will be executed by your workers, and when any worker finishes the task at hand, it will start another task.
from multiprocessing import Pool
def op1(data):
return [data[elem] + 1 for elem in range(len(data))]
data = [[elem for elem in range(20)] for elem in range(500000)]
import time
start_time = time.time()
re = []
for data_ in data:
re.append(op1(data_))
print('--- %s seconds ---' % (time.time() - start_time))
start_time = time.time()
pool = Pool(processes=4)
data = pool.map(op1, data)
print('--- %s seconds ---' % (time.time() - start_time))
I get a much slower run time with pool than I get with for loop. But isn't pool supposed to be using 4 processors to do the computation in parallel?
Short answer: Yes, the operations will usually be done on (a subset of) the available cores. But the communication overhead is large. In your example the workload is too small compared to the overhead.
In case you construct a pool, a number of workers will be constructed. If you then instruct to map given input. The following happens:
the data will be split: every worker gets an approximately fair share;
the data will be communicated to the workers;
every worker will process their share of work;
the result is communicated back to the process; and
the main process groups the results together.
Now splitting, communicating and joining data are all processes that are carried out by the main process. These can not be parallelized. Since the operation is fast (O(n) with input size n), the overhead has the same time complexity.
So complexitywise even if you had millions of cores, it would not make much difference, because communicating the list is probably already more expensive than computing the results.
That's why you should parallelize computationally expensive tasks. Not straightforward tasks. The amount of processing should be large compared to the amount of communicating.
In your example, the work is trivial: you add 1 to all the elements. Serializing however is less trivial: you have to encode the lists you send to the worker.
There are a couple of potential trouble spots with your code, but primarily it's too simple.
The multiprocessing module works by creating different processes, and communicating among them. For each process created, you have to pay the operating system's process startup cost, as well as the python startup cost. Those costs can be high, or low, but they're non-zero in any case.
Once you pay those startup costs, you then pool.map the worker function across all the processes. Which basically adds 1 to a few numbers. This is not a significant load, as your tests prove.
What's worse, you're using .map() which is implicitly ordered (compare with .imap_unordered()), so there's synchronization going on - leaving even less freedom for the various CPU cores to give you speed.
If there's a problem here, it's a "design of experiment" problem - you haven't created a sufficiently difficult problem for multiprocessing to be able to help you.
As others have noted, the overhead that you pay to facilitate multiprocessing is more than the time-savings gained by parallelizing across multiple cores. In other words, your function op1() does not require enough CPU resources to see performance gain from parallelizing.
In the multiprocessing.Pool class, the majority of this overheard is spent serializing and deserializing data before the data is shuttled between the parent process (which creates the Pool) and the children "worker" processes.
This blog post explores, in greater detail, how expensive pickling (serializing) can be when using the multiprocessing.Pool module.
from multiprocessing import Pool
def op1(data):
return [data[elem] + 1 for elem in range(len(data))]
data = [[elem for elem in range(20)] for elem in range(500000)]
import time
start_time = time.time()
re = []
for data_ in data:
re.append(op1(data_))
print('--- %s seconds ---' % (time.time() - start_time))
start_time = time.time()
pool = Pool(processes=4)
data = pool.map(op1, data)
print('--- %s seconds ---' % (time.time() - start_time))
I get a much slower run time with pool than I get with for loop. But isn't pool supposed to be using 4 processors to do the computation in parallel?
Short answer: Yes, the operations will usually be done on (a subset of) the available cores. But the communication overhead is large. In your example the workload is too small compared to the overhead.
In case you construct a pool, a number of workers will be constructed. If you then instruct to map given input. The following happens:
the data will be split: every worker gets an approximately fair share;
the data will be communicated to the workers;
every worker will process their share of work;
the result is communicated back to the process; and
the main process groups the results together.
Now splitting, communicating and joining data are all processes that are carried out by the main process. These can not be parallelized. Since the operation is fast (O(n) with input size n), the overhead has the same time complexity.
So complexitywise even if you had millions of cores, it would not make much difference, because communicating the list is probably already more expensive than computing the results.
That's why you should parallelize computationally expensive tasks. Not straightforward tasks. The amount of processing should be large compared to the amount of communicating.
In your example, the work is trivial: you add 1 to all the elements. Serializing however is less trivial: you have to encode the lists you send to the worker.
There are a couple of potential trouble spots with your code, but primarily it's too simple.
The multiprocessing module works by creating different processes, and communicating among them. For each process created, you have to pay the operating system's process startup cost, as well as the python startup cost. Those costs can be high, or low, but they're non-zero in any case.
Once you pay those startup costs, you then pool.map the worker function across all the processes. Which basically adds 1 to a few numbers. This is not a significant load, as your tests prove.
What's worse, you're using .map() which is implicitly ordered (compare with .imap_unordered()), so there's synchronization going on - leaving even less freedom for the various CPU cores to give you speed.
If there's a problem here, it's a "design of experiment" problem - you haven't created a sufficiently difficult problem for multiprocessing to be able to help you.
As others have noted, the overhead that you pay to facilitate multiprocessing is more than the time-savings gained by parallelizing across multiple cores. In other words, your function op1() does not require enough CPU resources to see performance gain from parallelizing.
In the multiprocessing.Pool class, the majority of this overheard is spent serializing and deserializing data before the data is shuttled between the parent process (which creates the Pool) and the children "worker" processes.
This blog post explores, in greater detail, how expensive pickling (serializing) can be when using the multiprocessing.Pool module.
I have been playing around with multiprocessing problem and notice my algorithm is slower when I parallelizes it than when it is single thread.
In my code I don't share memory.
And I'm pretty sure my algorithm (see code), which is just nested loops is CPU bound.
However, no matter what I do. The parallel code runs 10-20% slower on all my computers.
I also ran this on a 20 CPUs virtual machine and single thread beats multithread every times (even slower up there than my computer, actually).
from multiprocessing.dummy import Pool as ThreadPool
from multi import chunks
from random import random
import logging
import time
from multi import chunks
## Product two set of stuff we can iterate over
S = []
for x in range(100000):
S.append({'value': x*random()})
H =[]
for x in range(255):
H.append({'value': x*random()})
# the function for each thread
# just nested iteration
def doStuff(HH):
R =[]
for k in HH['S']:
for h in HH['H']:
R.append(k['value'] * h['value'])
return R
# we will split the work
# between the worker thread and give it
# 5 item each to iterate over the big list
HChunks = chunks(H, 5)
XChunks = []
# turn them into dictionary, so i can pass in both
# S and H list
# Note: I do this because I'm not sure if I use the global
# S, will it spend too much time on cache synchronizatio or not
# the idea is that I dont want each thread to share anything.
for x in HChunks:
XChunks.append({'H': x, 'S': S})
print("Process")
t0 = time.time()
pool = ThreadPool(4)
R = pool.map(doStuff, XChunks)
pool.close()
pool.join()
t1 = time.time()
# measured time for 4 threads is slower
# than when i have this code just do
# doStuff(..) in non-parallel way
# Why!?
total = t1-t0
print("Took", total, "secs")
There are many related question opened, but many are geared toward code being structured incorrectly - each worker being IO bound and such.
You are using multithreading, not multiprocessing. While many languages allow threads to run in parallel, python does not. A thread is just a separate state of control, i.e. it holds it own stack, current function, etc. The python interpreter just switches between executing each stack every now and then.
Basically, all threads are running on a single core. They will only speed up your program when you are not CPU bound.
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
Multithreading is usually slower than single threading if you are CPU bound. This is because the work and processing resources stay the same, but you add overhead for managing the threads, e.g. switching between them.
How to fix this: instead of using from multiprocessing.dummy import Pool as ThreadPool do multiprocessing.Pool as ThreadPool.
You might want to read up on the GIL, the Global Interpreter Lock. It's what prevents threads from running in parallel (that and implications on single threaded performance). Python interpreters other than CPython may not have the GIL and be able to run multithreaded on several cores.
I used multiprocessing in Python to run my code in parallel, like the following,
result1 = pool.apply_async(set1, (Q, n))
result2 = pool.apply_async(set2, (Q, n))
set1 and set2 are two independent function and this code is in a while loop.
Then I test the running time, if I run my code in sequence, the for particular parameter, it is 10 seconds, however, when I run in parallel, it only took around 0.2 seconds. I used time.clock() to record the time. Why the running time decreased so much, for intuitive thinking of parallel programming, shouldn't be the time in parallel be between 5 seconds to 10 seconds? I have no idea how to analyze this in my report... Anyone can help? Thanks
To get a definitive answer, you need to show all the code and say which operating system you're using.
My guess: you're running on a Linux-y system, so that time.clock() returns CPU time (not wall-clock time). Then you run all the real work in new, distinct processes. The CPU time consumed by those doesn't show up in the main program's time.clock() results at all. Try using time.time() instead for a quick sanity check.