I am using the Python multiprocessing module in a program. The program looks like this:
processes = [Process(target=func1, args=(host,servers,q)) for x in range(1,i+1)]
The program is designed to create not more than 50 processes at a time, but it gets stuck after creating a random number of processes. Sometimes the program hangs after creating only 4 processes, sometimes at 9 processes and sometimes at 35 processes. 85% of the time, it gets stuck after creating 4 processes.
I have written another dummy program to test the maximum number of processes that can be created on my system, and I can successfully create up to 1000 processes. But in this particular program, it gets stuck at a random number of processes.
Note: The machine I am working on is an Ubuntu VM on top of a Windows host with an Intel Core i3 processor. I have allocated 4 CPU cores and 1 GB of RAM to the VM.
Please suggest how I can resolve the issue.
Thanks.
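For reference, here is a fuller, runnable sketch of the pattern the snippet above presumably sits in; the body of func1, the sample values, and the start/join/queue handling are my assumptions, not the original code:

from multiprocessing import Process, Queue

def func1(host, servers, q):
    # placeholder for the real worker; it reports a result through the queue
    q.put((host, "done"))

if __name__ == "__main__":
    q = Queue()
    host, servers, i = "localhost", [], 4  # hypothetical values
    processes = [Process(target=func1, args=(host, servers, q)) for x in range(1, i + 1)]
    for p in processes:
        p.start()
    results = [q.get() for _ in processes]  # drain the queue before joining
    for p in processes:
        p.join()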
I've got an "embarrassingly parallel" problem running on python, and I thought I could use the concurrent.futures module to parallelize this computation. I've done this before successfully, and this is the first time I'm trying to do this on a computer that's more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of my processors spool up to 100% (I'm talking about the "equalizer bars" at the top of the htop display):
[Screenshot of htop]
My question is: if I'm only spinning up 10 workers, why do ALL of my processors seem to be running at maximum capacity? My working theory is that the 10 workers I ask for are "reserved," and the other processors just jump in to help out... if someone else were to run something else and ask for some processing power (but NOT from my 10 requested workers), my other tasks would back off and give it back. But this isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
    a = sum(list(range(10 ** arg)))
    print(a)
    return a

def main():
    import concurrent.futures
    from random import randrange

    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Submit the tasks
        futures = []
        for i in range(100):
            random_argument = randrange(5, 7)
            futures.append(executor.submit(expensive_function, random_argument))

        # Monitor your progress:
        num_results = len(futures)
        for k, _ in enumerate(concurrent.futures.as_completed(futures)):
            print(f'********** Completed {k + 1} of {num_results} simulations **********')

if __name__ == '__main__':
    main()
Due to the GIL, a single process can have only 1 thread executing Python bytecode at a given time, so if you have 10 processes you should have at most 10 threads (and therefore cores) executing Python bytecode at a given time. However, this is not the full story.
The expensive_function is ambiguous. Python creates 10 worker processes, so only 10 cores can be executing Python bytecode at a given time (plus the main process), due to the GIL. However, if expensive_function does some sort of multithreading through an external C module (which does not have to abide by the GIL), then each of the 10 processes can have Y threads working in parallel, and you will have a total of 10*Y cores being utilized at a given time. For example, your code might be running 6 threads externally in each of the 10 processes, for a total of 60 threads running concurrently on 60 cores.
However, this doesn't really answer your question, so the main answer is: workers is the number of processes (cores) that can execute Python bytecode at a given time (with a strong emphasis on "Python bytecode"), whereas tasks is the total number of tasks that will be executed by your workers; when any worker finishes the task at hand, it will start another task.
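If the extra load does come from a multithreaded C extension (NumPy's BLAS is a common example, though that is an assumption about what expensive_function actually does), one way to test the theory is to cap each worker's native thread count through the usual environment variables before the pool is created. A minimal sketch:

import os

# These variables are honored by common native-threading backends (OpenMP,
# OpenBLAS, MKL). Set them before the relevant libraries are imported and
# before the worker processes are created, so every worker inherits them and
# stays single-threaded at the native level.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import concurrent.futures

def expensive_function(arg):
    # stand-in for the real work
    return sum(range(10 ** arg))

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(expensive_function, [6] * 100))
    print(len(results))

If total CPU usage then drops to roughly 10 cores' worth, the extra load was coming from library-level threading rather than from Python bytecode.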
I am running a program which is attempting to open 1000 threads using Python's ThreadPoolExecutor, which I have configured to allow a maximum of 1000 threads. On a Windows machine with 4GB of memory, I am able to start ~870 threads before I get a runtime error: can't start new thread. With 16GB of memory, I am able to start ~870 threads as well, though the runtime error, can't start new thread, occurs two minutes later. All threads are running a while loop, which means that they will never complete their tasks. This is the intention.
Why is PyCharm/Windows/Python, whichever may be the culprit, failing to start more than 870 of the 1000 threads I am attempting to start, with that number staying the same despite a significant change in RAM? This leads me to conclude that hardware limitations are not the problem, which leaves me completely and utterly confused.
What could be causing this, and how do I fix it?
It is very hard to say without all the details of your configuration and your code, but my guess is that it's Windows being starved for certain kinds of memory. I suggest looking into the details in this article:
I attempted to duplicate your issue with PyCharm and Python 3.8 on my Linux box, and I was able to make 10000 threads with the code below. Note that I have every thread sleep for quite a while upon creation; otherwise the thread-creation process slows way down as the main thread of execution, which is trying to make the threads, becomes CPU starved. I have 32GB of RAM, but I am able to make 10000 threads with a ThreadPoolExecutor on Linux.
from concurrent.futures import ThreadPoolExecutor
import time

def runForever():
    # Sleep on startup so the main thread, which is still busy creating
    # threads, is not starved of CPU time by the workers.
    time.sleep(10)
    while True:
        for i in range(100):
            a = 10

t = ThreadPoolExecutor(max_workers=10000)
for i in range(10000):
    t.submit(runForever)

print(len(t._threads))
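If the ceiling really is per-thread memory reservation (an assumption, not something I can verify from here), shrinking the stack requested for each new thread sometimes raises it. A sketch using the standard threading.stack_size call; whether 256 KiB is enough for your worker is another assumption to verify:

import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Applies to threads created after this call; the value must be at least 32 KiB.
threading.stack_size(256 * 1024)

def runForever():
    while True:
        time.sleep(60)  # idle forever, like the workers in the question

pool = ThreadPoolExecutor(max_workers=1000)
for _ in range(1000):
    pool.submit(runForever)
print(len(pool._threads))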
I'm using Python's multiprocessing module to open multiple concurrent processes on an 8-core iMac. Something puzzles me in Activity Monitor: if I send 5 processes, each is reported as using close to 100% CPU, but the overall CPU load is about 64% idle. If I send 8 concurrent processes, overall usage gets to about 50%.
Am I interpreting the activity monitor data correctly, meaning that I could efficiently spawn more than 8 processes at once?
Presumably your "8-core" machine supports SMT (Hyperthreading), which means that it has 16 threads of execution or "logical CPUs". If you run 8 CPU-bound processes, you occupy 8 of those 16, which leads to a reported 50% usage / 50% idle.
That's a bit misleading, though, because SMT threads aren't independent cores; they share resources with their "SMT sibling" on the same core. Using only half of them doesn't mean you're only getting half of the available compute power. Depending on the workload, occupying both SMT threads on a core might give you 150% of the throughput of using only one, it might actually be slower due to cache contention, or it might land anywhere in between.
So you'll need to test for yourself; the sweet spot for your application might be anywhere between 8 and 16 parallel processes. The only way to really know is to measure.
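A quick way to find that sweet spot is to time the same batch of work at several pool sizes. A minimal sketch, with a CPU-bound stand-in for the real task:

import time
from multiprocessing import Pool

def work(n):
    # CPU-bound placeholder for the real job
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    jobs = [2_000_000] * 64  # a fixed batch of identical tasks
    for workers in (4, 8, 12, 16):
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(work, jobs)
        print(f"{workers:2d} workers: {time.perf_counter() - start:.2f}s")

Whichever pool size gives the lowest wall-clock time for your actual workload is the one to use.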
Suppose I have a function that takes several seconds to compute and I have 8 CPUs (according to multiprocessing.cpu_count()).
What happens when I start this process fewer than 8 times, and what happens when I start it more than 8 times?
When I tried this, I observed that even when I started 20 processes, they were all running in parallel. I expected them to run 8 at a time, with the others waiting for them to finish.
What happens depends on the underlying operating system.
But generally there are two things: physical CPU cores, and threads/processes (all major systems have them). The OS keeps track of this abstract thing called a thread/process (which, among other things, contains code to execute) and maps it onto some CPU core for a given time slice, or until some syscall (for example, network access) is made. If you have more threads than cores (which is usually the case; there are lots of things running in the background of a modern OS), then some of them wait for a context switch to happen, i.e. for their turn.
Now, if you have CPU-intensive tasks (hard calculations), you won't see any benefit from running more than 8 threads/processes. But even if you run more, they won't simply run one after another: the OS will interrupt a CPU core at some point to let other threads run, mixing the execution (a little bit of this, a little bit of that) so that every thread/process makes slow progress instead of possibly waiting forever.
On the other hand, if your tasks are io-bound (for example, you make an HTTP call and wait for the network), then having 20 tasks will increase performance, because while those threads wait for io the OS puts them on hold and lets other threads run. Those other threads can do io as well, and that way you achieve a higher level of concurrency.
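To see the io-bound case concretely, here is a small sketch in which time.sleep stands in for waiting on the network; 20 "requests" finish in roughly one sleep interval even on a machine with far fewer than 20 cores:

import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    # stand-in for an HTTP call: the thread spends its time waiting, not computing
    time.sleep(1)
    return i

if __name__ == "__main__":
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(fake_request, range(20)))
    print(f"20 io-bound tasks took {time.perf_counter() - start:.2f}s")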
I was trying out Python 3 multiprocessing on a machine that has 8 CPUs, each with four cores (according to /proc/cpuinfo). I wrote a little script with a useless function, and I use time to see how long it takes to finish.
from multiprocessing import Pool, cpu_count

def f(x):
    for i in range(100000000):
        x * x
    return x * x

with Pool(8) as p:
    a = p.map(f, [x for x in range(8)])
#~ f(5)
Calling f() without multiprocessing takes about 7s (time's "real" output). Calling f() 8 times with a pool of 8, as seen above, takes around 7s again. If I call it 8 times with a pool of 4, I get around 13.5s, so there's some overhead in starting the script, but it runs roughly twice as long. So far so good. Now here comes the part that I do not understand. If there are 8 CPUs, each with 4 cores, then calling f() 32 times with a pool of 32 should, as far as I can see, run for around 7s again, but it takes 32s, which is actually slightly longer than running f() 32 times on a pool of 8.
So my question is: is multiprocessing unable to make use of the cores, do I misunderstand something about cores, or is it something else?
Simplified and short: CPUs and cores are the hardware your computer has. On top of that hardware sits the operating system, the middleman between the hardware and the programs running on the computer, and it is the OS that allots CPU time to those programs. Each process in your pool runs its own Python interpreter, and each one is just another program competing for that CPU time. How fast your script runs therefore depends on what hardware you have, what work you are doing, and how the CPU time gets divided among all these processes.
How is CPU time allotted? The OS hands it out in small slices, switching between all runnable processes in turn, and your Python worker processes get their slices along with everything else running on the machine.
Spawning more processes does not give you access to more hardware. It does mean more of the machine's CPU time goes to your application, since you increase the number of processes doing work for it.
More processes do equal more workhorses, though.
You see this in practice in your code: you increase the number of workhorses to the point where the available CPU time is split between so many processes that all of them slow down.
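One way to see how many execution units the OS will actually schedule your workers on, and therefore a sensible upper bound for the pool size, is to ask Python directly. A small sketch; the sched_getaffinity call is Linux-specific:

import os
import multiprocessing

# Logical CPUs the system reports (SMT threads count individually).
print("os.cpu_count():             ", os.cpu_count())
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())

# CPUs this particular process is allowed to run on (Linux only); a Pool
# larger than this number just divides the same CPU time further.
print("usable by this process:     ", len(os.sched_getaffinity(0)))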