Python Multiprocessing

I used multiprocessing in Python to run my code in parallel, like the following:
result1 = pool.apply_async(set1, (Q, n))
result2 = pool.apply_async(set2, (Q, n))
set1 and set2 are two independent functions, and this code is inside a while loop.
Then I tested the running time. For a particular parameter, running in sequence takes 10 seconds, but running in parallel only took around 0.2 seconds. I used time.clock() to record the time. Why did the running time decrease so much? Intuitively, shouldn't the parallel time be somewhere between 5 and 10 seconds? I have no idea how to analyze this in my report... Can anyone help? Thanks

To get a definitive answer, you need to show all the code and say which operating system you're using.
My guess: you're running on a Linux-y system, where time.clock() returns CPU time (not wall-clock time), and all the real work happens in new, distinct processes. The CPU time consumed by those child processes doesn't show up in the main program's time.clock() results at all. Try using time.time() instead as a quick sanity check.
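On Python 3.8+, time.clock() is gone; time.process_time() is the per-process CPU clock and time.time() the wall clock. Here's a minimal sketch of the effect described above (the busy-loop workload and its size are invented for illustration): the parent's CPU clock barely moves while a child process does all the work.

```python
import subprocess
import sys
import time

def measure():
    """Run CPU-heavy work in a child process; time it from the parent."""
    wall_start = time.time()
    cpu_start = time.process_time()  # parent's CPU time only
    # The child burns CPU; its cycles never appear on the parent's CPU clock.
    subprocess.run(
        [sys.executable, "-c", "i = 0\nwhile i < 20_000_000: i += 1"],
        check=True,
    )
    wall = time.time() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

if __name__ == "__main__":
    wall, cpu = measure()
    print(f"wall: {wall:.2f}s, parent CPU: {cpu:.2f}s")
```

The parent spends almost all of its time blocked waiting, so its CPU time is a tiny fraction of the wall time, which is exactly the asymmetry the question stumbled on.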

Related

Difference between a "worker" and a "task" for concurrent.futures.ProcessPoolExecutor

I've got an "embarrassingly parallel" problem running on python, and I thought I could use the concurrent.futures module to parallelize this computation. I've done this before successfully, and this is the first time I'm trying to do this on a computer that's more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of my processors spool up to 100%. (I'm talking about the "equalizer bars" at the top of the htop display.)
[Screenshot of htop]
My question is: if I'm only spinning up 10 workers, why do ALL of my processors seem to be running at maximum capacity? My working theory is that the 10 workers I ask for are "reserved," and the other processors just jump in to help out; if someone else asked for processing power (excluding my 10 requested workers), those extra tasks would back off and hand it over. But that isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
    a = sum(list(range(10 ** arg)))
    print(a)
    return a

def main():
    import concurrent.futures
    from random import randrange
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Submit the tasks
        futures = []
        for i in range(100):
            random_argument = randrange(5, 7)
            futures.append(executor.submit(expensive_function, random_argument))
        # Monitor your progress:
        num_results = len(futures)
        for k, _ in enumerate(concurrent.futures.as_completed(futures)):
            print(f'********** Completed {k + 1} of {num_results} simulations **********')

if __name__ == '__main__':
    main()
Due to the GIL, a single process can have only one thread executing Python bytecode at a given time. So if you have 10 processes, you should have at most 10 threads (and therefore cores) executing Python bytecode at once. However, that is not the full story.
The behaviour of expensive_function matters here. Python creates 10 worker processes, so only 10 cores can be executing Python bytecode at a given time (plus the main process), because of the GIL. However, if expensive_function does some multithreading of its own through an external C module (which does not have to abide by the GIL), then each of the 10 processes can have Y threads working in parallel, for a total of 10*Y busy cores. For example, your code might be running 6 native threads in each of the 10 processes, for a total of 60 threads running concurrently on 60 cores.
That still doesn't directly answer your question, so here is the main point: workers is the number of processes (and hence cores) that can be executing Python bytecode at a given time (with a strong emphasis on "Python bytecode"), whereas tasks is the total number of units of work that your workers will execute; whenever a worker finishes its current task, it starts another one.
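To make the worker/task distinction concrete, here is a small sketch (the task function and the counts are invented): 20 tasks are funnelled through at most 2 worker processes, which you can verify by recording each task's process ID.

```python
import concurrent.futures
import os

def task(_):
    # Report which worker process executed this task.
    return os.getpid()

def run_tasks(n_tasks=20, n_workers=2):
    # Many tasks, few workers: each worker runs task after task.
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(task, range(n_tasks)))

if __name__ == "__main__":
    pids = run_tasks()
    print(f"{len(pids)} tasks ran on {len(set(pids))} worker process(es)")
```

All 20 results come back, but the set of distinct PIDs never exceeds the number of workers.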

multiprocessing - Influence of number on process on processing time

It may be a really stupid question, but I didn't find any documentation that answers it directly. I'm trying to familiarise myself with the multiprocessing library in Python and to parallelize a task using multiprocessing.Pool.
I initiate the number of processes in my Pool with:
Pool(processes=nmbr_of_processes).
The thing is, I don't understand exactly how this number of processes reduces the working time. I wrote a script to evaluate it:
import time
import multiprocessing as mp

def test_operation(y):
    total = 0
    for x in range(1000):
        total += y * x

def main():
    time1 = time.time()
    p = mp.Pool(processes=2)
    result = p.map(test_operation, range(100000))
    p.close()
    p.join()
    print('Parallel took {} seconds'.format(time.time() - time1))
    final = list()
    time2 = time.time()
    for y in range(100000):
        final.append(test_operation(y))
    print('Serial took {} seconds'.format(time.time() - time2))
The thing is, when I'm using 2 processes with mp.Pool(processes=2) I get typically:
Parallel took 5.162384271621704 seconds
Serial took 9.853888034820557 seconds
And if I'm using more processes, like p = mp.Pool(processes=4)
I get:
Parallel took 6.404058218002319 seconds
Serial took 9.667300701141357 seconds
I'm working on a Mac Mini with a dual-core i7 at 3 GHz. I know I can't reduce the working time to half of the serial time. But I can't understand why adding more processes increases the working time compared to using 2 processes. And if there is an optimal number of processes to start depending on the CPU, what would it be?
The thing to note here is that this applies to CPU-bound tasks; your code is heavy on CPU usage. The first thing to do is check how many theoretical cores you have:
import multiprocessing as mp
print(mp.cpu_count())
For CPU-bound tasks like this, there is no benefit to be gained by creating a pool with more workers than theoretical cores. If you don't specify the size of the Pool, it will default back to this number. However, this neglects something else; your code is not the only thing that your OS has to run.
If you launch as many processes as theoretical cores, the system has no choice but to interrupt your processes periodically simply to keep running, so you're likely to get a performance hit. You can't monopolise all cores. The general rule-of-thumb here is to have a pool size of cpu_count() - 1, which leaves the OS a core free to use on other processes.
I was surprised to find that other answers don't mention this rule of thumb; it seems to be confined to comments and the like. However, your own tests show that it applies to the performance in your case, so it is a reasonable heuristic for determining pool size.
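Here is a sketch of that heuristic applied to code like the above (the workload mirrors test_operation but returns its total, which is my addition, not part of the original):

```python
import multiprocessing as mp

def work(y):
    # Same shape of CPU-bound loop as test_operation, but returning the total.
    total = 0
    for x in range(1000):
        total += y * x
    return total

def run_pool():
    # Leave one logical core free for the OS, per the rule of thumb.
    n_workers = max(1, mp.cpu_count() - 1)
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(work, range(100))

if __name__ == "__main__":
    results = run_pool()
    print(len(results), results[1])  # 100 499500
```

On the asker's dual-core machine this heuristic would yield a pool of 1, which matches the observation that 2 processes beat 4: past the core count, extra workers only add scheduling overhead.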

python multiprocessing, cpu-s and cpu cores

I was trying out Python 3 multiprocessing on a machine that has 8 CPUs, each with four cores (information from /proc/cpuinfo). I wrote a little script with a useless function, and I use time to see how long it takes to finish.
from multiprocessing import Pool, cpu_count

def f(x):
    for i in range(100000000):
        x * x
    return x * x

with Pool(8) as p:
    a = p.map(f, [x for x in range(8)])
#~ f(5)
Calling f() without multiprocessing takes about 7s (time's "real" output). Calling f() 8 times with a pool of 8, as seen above, takes around 7s again. If I call it 8 times with a pool of 4, I get around 13.5s; so there's some overhead in starting the script, but it runs roughly twice as long. So far so good. Now here comes the part that I do not understand. If there are 8 CPUs, each with 4 cores, then if I call it 32 times with a pool of 32, as far as I can see it should run for around 7s again, but it takes 32s, which is actually slightly longer than running f() 32 times on a pool of 8.
So my question is: is multiprocessing unable to make use of the cores, or do I misunderstand something about cores, or is it something else?
Simplified and short: CPUs and cores are hardware that your computer has. On this hardware there is an operating system, the middleman between the hardware and the programs running on the computer. Running programs are allotted CPU time by the OS. One of those programs is the Python interpreter, which runs your .py scripts, so part of your computer's CPU time goes to the interpreter processes running your program. The speed you see depends on your hardware, on what the program does, and on how CPU time is divided between all the processes competing for it.
How is CPU time allotted? The OS hands it out in small increments, switching between all runnable processes. That is also why the whole computer slows down when one program misbehaves.
More processes does not mean more access to hardware; it only means more processes sharing the CPU time that the hardware provides.
More processes does mean more workhorses, though.
You see this in practice in your code: you increase the number of workhorses to the point where the available CPU time is divided among so many processes that all of them slow down.
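A quick way to see how much real parallelism is available (assuming os.cpu_count() reports your logical CPUs) is to cap the pool at the logical CPU count; anything beyond that only buys time-slicing:

```python
import os

logical_cpus = os.cpu_count()             # logical CPUs the OS can schedule onto
requested = 32                            # what the question asked for
pool_size = min(requested, logical_cpus)  # workers that can truly run at once
print(f"{logical_cpus} logical CPUs -> pool of {pool_size}")
```

On the asker's machine with 8 logical CPUs, a pool of 32 still only has 8 cores to run on, which is why it is no faster than a pool of 8.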

Why single python process's cpu usage can be more than 100 percent?

Because of the GIL, I thought a multi-threaded Python process could only have one thread running at a time, and thus its CPU usage could not exceed 100 percent.
But I found that the code below occupies 950% CPU in top.
import threading
import time

def f():
    while 1:
        pass

for i in range(10):
    t = threading.Thread(target=f)
    t.setDaemon(True)
    t.start()

time.sleep(60)
This is not a same question as Python interpreters uses up to 130% of my CPU. How is that possible?. In that question, the OP said he was doing I/O intensive load-testing which may release the GIL. But in my program, there is no I/O operation.
Tests run on CPython 2.6.6.
I think in this case the CPU is busy doing thread switches instead of actual work. In other words, the thread switching is using all the CPUs, while the Python loop body runs too fast to cause observable CPU usage on its own. I tried adding some real calculations, as below, and the CPU usage dropped to around 200%. And if you add more calculations, I believe the CPU usage will get very close to 100%.
def f():
    x = 1
    while 1:
        y = x * 2
One reason could be the method you're using to measure that 950%. There's a number called (average) load which is perhaps not what one would expect before reading the documentation.
The load is the (average) number of threads that are either in the running or the runnable state (queued for CPU time). If, as in your example, you have ten busy-looping threads, then while one thread is running, the other nine are in the runnable state (queued for a time slot).
The load is an indication of how many cores you could have made use of, or of how much CPU power your program wants to use (not necessarily the CPU power it actually gets).
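On Unix-like systems you can read that number directly with os.getloadavg() (not available on Windows); a minimal sketch:

```python
import os

# 1-, 5- and 15-minute load averages: the mean number of threads/processes
# that were running or runnable (queued for a CPU) over each window.
one_min, five_min, fifteen_min = os.getloadavg()
print(f"load averages: {one_min:.2f} {five_min:.2f} {fifteen_min:.2f}")
```

With ten busy-looping threads alive, the 1-minute figure would climb toward 10 even though only one thread holds the GIL at any instant.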

multiprocessing does not work

I am working on Ubuntu 12 with 8 CPUs, as reported by the System Monitor.
The testing code is:
import multiprocessing as mp

def square(x):
    return x**2

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    pool.map(square, range(100000000))
    pool.close()
    # for i in range(100000000):
    #     square(i)
The problem is:
1) All the workload seems to be scheduled to just one core, which gets close to 100% utilization, even though several processes are started. Occasionally all the workload migrates to another core, but it is never distributed among them.
2) Running without multiprocessing is faster:
for i in range(100000000):
    square(i)
I have read similar questions on Stack Overflow, like Python multiprocessing utilizes only one core, but still got no applicable result.
The function you are using is way too short (i.e. it doesn't take enough time to compute), so you spend all your time in the synchronization between processes, which has to happen serially (so it might as well run on a single processor). Try this:
import multiprocessing as mp

def square(x):
    for i in range(10000):
        j = i**2
    return x**2

if __name__ == '__main__':
    # pool = mp.Pool(processes=4)
    # pool.map(square, range(1000))
    # pool.close()
    for i in range(1000):
        square(i)
You will see that suddenly the multiprocessing works well: it takes ~2.5 seconds to complete, while it takes 10s without it.
Note: If using Python 2, you might want to replace all the range calls with xrange.
Edit: I replaced time.sleep with a CPU-intensive but useless calculation.
Addendum: In general, for multi-CPU applications, you should try to make each CPU do as much work as possible without returning to the same process. In a case like yours, this means splitting the range into almost equal-sized lists, one per CPU, and sending them to the various CPUs.
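pool.map can do that splitting for you through its chunksize parameter (a sketch; the sizes here are arbitrary): larger chunks mean fewer, bigger messages between the processes, and therefore less synchronization overhead per item.

```python
import multiprocessing as mp

def square(x):
    return x ** 2

def run(chunksize):
    with mp.Pool(processes=4) as pool:
        # Each worker receives `chunksize` inputs per message instead of one.
        return pool.map(square, range(10000), chunksize=chunksize)

if __name__ == "__main__":
    out = run(chunksize=1000)
    print(out[:3])  # [0, 1, 4]
```

With chunksize=1000 the 10000 inputs travel in only 10 messages, so almost all of the time goes into computing rather than into inter-process traffic.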
When you do:
pool.map(square, range(100000000))
it has to create a list with 100000000 elements before invoking the map function, and this is done by a single process. That's why you see a single core working.
Use a generator instead, so each core can pop a number out of it and you should see the speedup:
pool.map(square, xrange(100000000))
It isn't sufficient simply to import the multiprocessing library to make use of multiple processes to schedule your work. You actually have to create processes too!
Your work is currently scheduled to a single core because you haven't done so, and so your program is a single process with a single thread.
Naturally, when you start a new process simply to square one number, you are going to get slower performance. The overhead of process creation makes sure of that, so your process pool will very likely take longer than a single-process run.
