I was trying out Python 3 multiprocessing on a machine that has 8 CPUs, each with four cores (information is from /proc/cpuinfo). I wrote a little script with a useless function, and I use time to see how long it takes to finish.
from multiprocessing import Pool, cpu_count

def f(x):
    for i in range(100000000):
        x * x
    return x * x

if __name__ == '__main__':
    with Pool(8) as p:
        a = p.map(f, [x for x in range(8)])
    #~ f(5)
Calling f() without multiprocessing takes about 7 s (time's "real" output). Calling f() 8 times with a pool of 8, as seen above, takes around 7 s again. If I call it 8 times with a pool of 4, I get around 13.5 s; so there's some overhead in starting the script, but it runs twice as long. So far so good. Now here comes the part that I do not understand. If there are 8 CPUs, each with 4 cores, then if I call it 32 times with a pool of 32, as far as I can see it should run for around 7 s again, but it takes 32 s, which is actually slightly longer than running f() 32 times on a pool of 8.
So my question is: is multiprocessing not able to make use of the cores, or do I not understand something about cores, or is it something else?
Simplified and short: CPUs and cores are hardware that your computer has. On this hardware sits an operating system, the middleman between the hardware and the programs running on the computer. The OS allots CPU time to the programs that are running. Each worker your script starts is its own Python process, and each of them gets its own share of that CPU time. How fast things go depends on what hardware you have, what operations you are running, and how CPU time is divided between all these processes.
How is CPU time allotted? Think of it as a loop: the OS hands out small time slices to each runnable process in turn, over and over. This constant switching is also why the whole machine can feel sluggish when a program misbehaves.
More processes do not equal more access to hardware. They equal more workers sharing the same pool of CPU time.
More processes do equal more work horses.
You see this in practice in your code. You increase the number of work horses to the point where the available CPU time is divided between so many processes that all of them slow down.
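The slowdown is easy to measure with a sketch along these lines (burn, the task count, and the pool sizes here are made up for illustration, and the exact timings depend on your machine):

```python
import time
from multiprocessing import Pool, cpu_count

def burn(x):
    # CPU-bound busy work, similar in spirit to the f() in the question
    total = 0
    for i in range(10 ** 6):
        total += i * i
    return x

if __name__ == '__main__':
    n_tasks = 16
    for workers in (1, cpu_count(), 4 * cpu_count()):
        start = time.perf_counter()
        with Pool(workers) as p:
            p.map(burn, range(n_tasks))
        print(f"{workers:3d} workers: {time.perf_counter() - start:.2f}s")
```

Typically the wall time drops until the worker count reaches the core count and then flattens out or gets slightly worse, because beyond that point the extra workers are only splitting the same CPU time between them.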
Related
I've got an "embarrassingly parallel" problem running in Python, and I thought I could use the concurrent.futures module to parallelize the computation. I've done this before successfully, and this is the first time I'm trying it on a computer more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of the processors spool up to 100%. (I'm talking about the "equalizer bars" at the top of the terminal):
Screenshot of htop
My question is: if I'm only spinning up 10 workers, why do ALL of my processors seem to be running at maximum capacity? My working theory is that the 10 workers I request are "reserved," and the other processors just jump in to help out... and if someone else were to run something and ask for processing power (not including my 10 requested workers), my other tasks would back off and give it to them. But this isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
    a = sum(list(range(10 ** arg)))
    print(a)
    return a

def main():
    import concurrent.futures
    from random import randrange

    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Submit the tasks
        futures = []
        for i in range(100):
            random_argument = randrange(5, 7)
            futures.append(executor.submit(expensive_function, random_argument))

        # Monitor your progress:
        num_results = len(futures)
        for k, _ in enumerate(concurrent.futures.as_completed(futures)):
            print(f'********** Completed {k + 1} of {num_results} simulations **********')

if __name__ == '__main__':
    main()
Due to the GIL, a single process can have only one thread executing Python bytecode at a given time, so with 10 processes you should have at most 10 threads (and therefore cores) executing Python bytecode at any given time. However, this is not the full story.
The expensive_function is ambiguous: Python creates 10 worker processes, and therefore you can only have 10 cores executing Python code at a given time (plus the main process), due to the GIL. However, if expensive_function does some sort of multithreading using an external C module (which doesn't have to abide by the GIL), then each of the 10 processes can have Y threads working in parallel, so you'll have a total of 10*Y cores being utilized at a given time. For example, your code might be running 6 threads externally in each of the 10 processes, for a total of 60 threads running concurrently on 60 cores.
However, this doesn't really answer your question, so the main answer is: max_workers is the number of processes (cores) that can execute Python bytecode at a given time (with a strong emphasis on "Python bytecode"), whereas the tasks are the total amount of work that will be executed by those workers; whenever a worker finishes the task at hand, it starts another one.
I have two programs, one written in C and one written in Python. I want to pass a few arguments to the C program from Python, and do it many times in parallel, because I have about 1 million such C calls.
Essentially I did like this:
from subprocess import check_call
import multiprocessing as mp
from itertools import combinations

def run_parallel(f1, f2):
    check_call(f"./c_compiled {f1} {f2} &", cwd='.', shell=True)

if __name__ == '__main__':
    pairs = combinations(fns, 2)  # fns is a list of filenames defined elsewhere
    pool = mp.Pool(processes=32)
    pool.starmap(run_parallel, pairs)
    pool.close()
However, sometimes I get the following errors (though the main process is still running)
/bin/sh: fork: retry: No child processes
Moreover, sometimes the whole program in Python fails with
BlockingIOError: [Errno 11] Resource temporarily unavailable
I found that while it's still running, I can see a lot of processes spawned for my user (around 500), while I have at most 512 available.
This does not happen all the time (it depends on the arguments), but it often does. How can I avoid these problems?
I'd wager you're running up against a process/file descriptor/... limit there.
You can "save" one process per invocation by not using shell=True:
check_call(["./c_compiled", f1, f2], cwd='.')
But it'd be better still to make that C code callable from Python instead of creating processes to do so. By far the easiest way to interface "random" C code with Python is Cython.
"many times in parallel" you can certainly do, for reasonable values of "many", but "about 1 million of such C calls" all running at the same time on the same individual machine is almost surely out of the question.
You can lighten the load by running the jobs without interposing a shell, as discussed in #AKX's answer, but that's not enough to bring your objective into range. Better would be to queue up the jobs so as to run only a few at a time -- once you reach that number of jobs, start a new one only when a previous one has finished. The exact number you should try to keep running concurrently depends on your machine and on the details of the computation, but something around the number of CPU cores might be a good first guess.
Note in particular that it is counterproductive to have more jobs at any one time than the machine has resources to run concurrently. If your processes do little or no I/O then the number of cores in your machine puts a cap on that, for only the processes that are scheduled on a core at any given time (at most one per core) will make progress while the others wait. Switching among many processes so as to attempt to avoid starving any of them will add overhead. If your processes do a lot of I/O then they will probably spend a fair proportion of their time blocked on I/O, and therefore not (directly) requiring a core, but in this case your I/O devices may well create a bottleneck, which might prove even worse than the limitation from number of cores.
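A sketch of that throttling approach, reusing the Pool from the question but sized to the machine. Here echo stands in for ./c_compiled so the example runs anywhere; note there is no shell and no trailing &, so check_call blocks until each child exits and the pool never has more children alive than it has workers:

```python
from multiprocessing import Pool, cpu_count
from subprocess import check_call

def run_one(pair):
    f1, f2 = pair
    # no shell, no '&': this blocks until the child process finishes,
    # so each worker keeps at most one child alive at a time
    check_call(["echo", f1, f2])  # stand-in for ["./c_compiled", f1, f2]

if __name__ == '__main__':
    # stand-in for combinations(fns, 2) from the question
    pairs = [(f"a{i}", f"b{i}") for i in range(100)]
    # roughly one worker per core: jobs queue up in the pool and start
    # only as earlier ones finish
    with Pool(processes=cpu_count()) as pool:
        pool.map(run_one, pairs, chunksize=10)
```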
Suppose I have a function that takes several seconds to compute and I have 8 CPUs (according to multiprocessing.cpu_count()).
What happens when I start this process less than 8 times and more than 8 times?
When I tried this, I observed that even when I started 20 processes they were all running in parallel. I expected them to run 8 at a time, with the others waiting for them to finish.
What happens depends on the underlying operating system.
But generally there are two things: physical CPU cores and threads/processes (all major systems have them). The OS keeps track of this abstract thing called a thread/process (which, among other things, contains code to execute) and maps it to some CPU core for a given time, or until some syscall (for example, network access) is made. If you have more threads than cores (which is usually the case; there are lots of things running in the background of a modern OS), then some of them wait for a context switch to happen (i.e., for their turn).
Now if you have CPU-intensive tasks (hard calculations), then you won't see any benefit from running more than 8 threads/processes. But even if you run more, they won't run one after another. The OS will interrupt a CPU core at some point to allow other threads to run. So the OS mixes the execution: a little bit of this, a little bit of that. That way every thread/process slowly progresses and none of them waits, possibly forever.
On the other hand, if your tasks are IO-bound (for example, you make an HTTP call and wait on the network), then having 20 tasks will increase performance, because when those threads wait for IO the OS puts them on hold and allows other threads to run. Those other threads can do IO as well; that way you achieve a higher level of concurrency.
Because of the GIL, I thought a multi-threaded Python process could only have one thread running at a time, so its CPU usage could not be more than 100 percent.
But I found that the code below can occupy 950% CPU usage in top.
import threading
import time

def f():
    while 1:
        pass

for i in range(10):
    t = threading.Thread(target=f)
    t.setDaemon(True)
    t.start()

time.sleep(60)
This is not a same question as Python interpreters uses up to 130% of my CPU. How is that possible?. In that question, the OP said he was doing I/O intensive load-testing which may release the GIL. But in my program, there is no I/O operation.
Tests run on CPython 2.6.6.
I think in this case the CPU is busy doing thread switches instead of actual work. In other words, the thread switching is using all the CPUs to do its job, but the Python loop code runs too fast to cause observable CPU usage. I tried adding some real calculations, as below, and the CPU usage dropped to around 200%. And if you add more calculations, I believe the CPU usage will get very close to 100%.
def f():
    x = 1
    while 1:
        y = x * 2
One reason could be the method you're using to get to 950%. There's a number called (average) load, which is perhaps not what one would expect before reading the documentation.
The load is the (average) number of threads that are either in the running or runnable state (queued for CPU time). If, as in your example, you have ten busy-looping threads, then while one thread is running, the other nine are in the runnable state (queued for a time slot).
The load is an indication of how many cores you could have made use of, or how much CPU power your program wants to use (not necessarily the CPU power it actually gets to use).
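On Unix-like systems you can read that number directly from Python (a sketch; os.getloadavg is not available on Windows):

```python
import os

# 1-, 5-, and 15-minute load averages: the average number of threads
# that are running or runnable (queued for CPU time), not a percentage
one, five, fifteen = os.getloadavg()
print(one, five, fifteen)
```

With the ten busy-looping threads above, you would expect the 1-minute value to climb towards 10, even though only a handful of cores are actually doing the looping.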
I am using python multiprocessing module in a program. The program looks like this:
processes = [Process(target=func1, args=(host,servers,q)) for x in range(1,i+1)]
The program is designed to create not more than 50 processes at a time, but it gets stuck after creating a random number of processes. Sometimes the program hangs after creating only 4 processes, sometimes at 9 processes and sometimes at 35 processes. 85% of the time, it gets stuck after creating 4 processes.
I have written another dummy program to test the maximum number of processes that can be created on my system. I can successfully create up to 1000 processes. But in the case of this very program, it gets stuck at a random number of processes.
Note: The machine I am working on is an Ubuntu VM on top of a Windows host with an Intel Core i3 processor. I have allocated 4 CPU cores and 1 GB of RAM to the VM.
Please suggest how I can resolve this issue.
Thanks.