Suppose I have a function that takes several seconds to compute and I have 8 CPUs (according to multiprocessing.cpu_count()).
What happens when I start this process less than 8 times and more than 8 times?
When I tried this, I observed that even when I started 20 processes, they all appeared to run in parallel. I expected them to run 8 at a time, with the others waiting for their turn.
What happens depends on the underlying operating system.
But generally there are two things: physical CPU cores, and threads/processes (all major systems have them). The OS keeps track of an abstract entity called a thread/process (which, among other things, carries the code to execute) and maps it onto some CPU core for a slice of time, or until it makes a blocking syscall (a network access, for example). If you have more threads than cores (which is usually the case; lots of things run in the background of a modern OS), then some of them wait for a context switch, i.e., for their turn.
Now, if you have CPU-intensive tasks (heavy calculations), you won't see any benefit from running more than 8 threads/processes. But even if you run more, they won't execute strictly one after another: the OS will interrupt a CPU core at some point to let other threads run, mixing the execution, a little bit of this, a little bit of that. That way every thread/process slowly makes progress instead of possibly waiting forever.
On the other hand, if your tasks are IO-bound (for example, you make an HTTP call and wait on the network), then having 20 tasks can improve throughput: when a thread waits for IO, the OS puts it on hold and lets other threads run, and those threads can start their own IO in the meantime. That way you achieve a higher level of concurrency.
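You can see this for yourself by timing a purely CPU-bound workload with different pool sizes. A minimal sketch (burn and the workload sizes are arbitrary placeholders); on an 8-core machine, 20 workers should finish the batch in roughly the same time as 8, because the cores are the real limit:

import time
import multiprocessing as mp

def burn(n):
    # purely CPU-bound busy work
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    for workers in (8, 20):
        start = time.perf_counter()
        with mp.Pool(processes=workers) as pool:
            pool.map(burn, [2_000_000] * 40)
        print(f'{workers} workers: {time.perf_counter() - start:.2f}s')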
I am using multiprocessing.Pool in Python to schedule around 2500 jobs. I am submitting the jobs like this:
pool = multiprocessing.Pool()
jobs = []
for i in range(2500):
    jobs.append(pool.apply_async(...))  # the actual job function and arguments go here
for j in jobs:
    _ = j.get()
The jobs are such that, after some computation, they go to sleep for a long time, waiting for some event to complete. My expectation was that while they sleep, the other waiting jobs would get scheduled. But that is not happening. The maximum number of jobs scheduled at any one time is around 23 (even though they are all sleeping; ps aux shows state S+), which is more or less the number of cores on the machine. Only after a job finishes and releases a core does another job get scheduled.
My expectation was that all 2500 jobs would be scheduled at once. How do I make Python submit all 2500 jobs at once?
Python's multiprocessing and threading packages use process/thread pools. By default, the number of processes/threads in a pool depends on the hardware concurrency (i.e., typically the number of hardware threads supported by your processor). You can tune this number, but you really should not create too many threads or processes, because they are precious operating system (OS) resources. Note that threads are cheaper than processes on most OSes, but CPython makes threads not very useful (except for IO latency-bound jobs) because of the global interpreter lock (GIL).

Creating 2500 processes/threads will put a lot of pressure on the OS scheduler and slow down the whole system. OSes are designed so that waiting threads are cheap, but frequent wake-ups are clearly expensive. Moreover, the number of processes/threads that can be created on a given platform is bounded; AFAIR, on my old Windows 7 system it was limited to 1024. The biggest problem is that each thread requires a stack, typically initialized to 1-2 MiB, so creating 2500 threads would take 2.5-5.0 GiB of RAM! It is significantly worse for processes. Not to mention that cache misses will be more frequent, resulting in slower execution. Put shortly: do not create 2500 threads or processes; it is far too expensive.
You do not need threads or processes; you need fibers or, more generally, green threads, such as greenlet or eventlet, or gevent coroutines. The last is known to be fast and supports thread pools. Alternatively, you can use Python's asyncio, which is the standard way to deal with this kind of problem.
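For example, with asyncio the 2500 sleeping jobs cost almost nothing, because a waiting coroutine is just an entry in the event loop rather than an OS thread or process. A minimal sketch, with job standing in for your actual work:

import asyncio

async def job(i):
    # ... some computation ...
    await asyncio.sleep(60)  # waiting here occupies no OS thread or process
    return i

async def main():
    # all 2500 jobs really are in flight at once, on a single thread
    return await asyncio.gather(*(job(i) for i in range(2500)))

results = asyncio.run(main())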
pool = multiprocessing.Pool() will use all available cores. If you need a specific number of processes, pass it as an argument:
pool = multiprocessing.Pool(processes=100)
I'm new to using HPC and had some questions about parallelizing code. I have some Python code which I've successfully parallelized using multithreading, and it works great on my personal machine and on a server. However, I just got access to HPC resources at my university. The general section of code I use looks like this:
import os
import concurrent.futures

# Iterate over weathers
print(f'Number of CPU: {os.cpu_count()}')
results = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_weather, i) for i in range(len(weather_list))]
    for f in concurrent.futures.as_completed(futures):
        results.append(f.result())
On my personal machine, when running two weathers (the item I've parallelized) I get the following results:
Number of CPU: 8
done in 29.645377159118652 seconds
On the HPC when running on 1 node with 32 cores, I get the following results:
Number of CPU: 32
done in 86.95256996154785 seconds
So it is running almost twice as slow, as if it were running serially with the same processing overhead. I also tried switching the code to ProcessPoolExecutor, and it was much slower than ThreadPoolExecutor. I'm assuming this must be something to do with data transfer (I've read that multiprocessing across multiple nodes attempts to ship the entire program, packages and all), but as I said, I'm very new to HPC, and the wiki provided by my university leaves much to be desired.
Adding threads will only slow down the execution of CPU-bound Python programs, since so much time is wasted acquiring and releasing locks. Unlike in some other programming languages, threads in Python are bound by a global interpreter lock (GIL), so you may not see the behavior you have seen elsewhere.
As a rule of thumb for Python: multiple processes are good for CPU-bound work and for spreading a workload across multiple cores; threads are great for things like concurrent IO. Threads will not speed up any kind of computational work.
Additionally, running a number of threads equal to the number of CPUs probably does not make sense for your program, particularly because threading won't spread the work across CPU cores anyway. More threads add more overhead in acquiring and releasing the GIL, which is why your execution times grow as you add threads.
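You can check this with a toy benchmark. A sketch, with cpu_task standing in for your real computation; on CPython, the threaded version typically takes about as long as the serial one, or longer:

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    # pure Python computation: it holds the GIL the whole time
    return sum(i * i for i in range(n))

N, JOBS = 2_000_000, 8

start = time.perf_counter()
for _ in range(JOBS):
    cpu_task(N)
print(f'serial:  {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=JOBS) as ex:
    list(ex.map(cpu_task, [N] * JOBS))
print(f'threads: {time.perf_counter() - start:.2f}s')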
Next steps:
Determine whether threading is speeding up the execution of your program at all.
Try running your code single-threaded, then again with a small number of threads. Does it speed up at all?
Chances are, you will find threading is not helping you here.
Explore multiprocessing instead of threading, i.e., use ProcessPoolExecutor instead of ThreadPoolExecutor (see the sketch below).
Keep in mind that processes have their own caveats, just like threads.
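If multiprocessing does help, the change from the question's code is small. A sketch reusing run_weather and weather_list from the question (ProcessPoolExecutor needs the main-module guard on platforms that spawn workers, and the submitted function and its arguments must be picklable):

import concurrent.futures

if __name__ == '__main__':
    results = []
    # worker processes sidestep the GIL, so CPU-bound tasks can run in parallel
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(run_weather, i) for i in range(len(weather_list))]
        for f in concurrent.futures.as_completed(futures):
            results.append(f.result())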
I'm using Python's multiprocessing module to run multiple concurrent processes on an 8-core iMac. Something in Activity Monitor puzzles me: if I start 5 processes, each is reported as using close to 100% CPU, yet overall the CPU is about 64% idle. If I start 8 concurrent processes, it gets to about 50% usage.
Am I interpreting the Activity Monitor data correctly, meaning that I could efficiently spawn more than 8 processes at once?
Presumably your "8-core" machine supports SMT (Hyperthreading), which means that it has 16 threads of execution or "logical CPUs". If you run 8 CPU-bound processes, you occupy 8 of those 16, which leads to a reported 50% usage / 50% idle.
That's a bit misleading, though, because SMT threads aren't independent cores; they share resources with their "SMT sibling" on the same core. Only using half of them doesn't mean you're only getting half of the available compute power. Depending on the workload, occupying both SMT threads might give you 150% of the throughput of using only one, or it might actually be slower due to cache contention, or anywhere in between.
So you'll need to test for yourself; the sweet spot for your application might be anywhere between 8 and 16 parallel processes. The only way to really know is to measure.
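One way to find that sweet spot is to sweep the pool size across the 8-16 range and time your real workload. A rough sketch, with work as a placeholder for your actual task:

import time
import multiprocessing as mp

def work(n):
    # placeholder CPU-bound task; substitute your real function
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == '__main__':
    for nproc in range(8, 17, 2):
        start = time.perf_counter()
        with mp.Pool(processes=nproc) as pool:
            pool.map(work, [2_000_000] * 64)
        print(f'{nproc:2d} processes: {time.perf_counter() - start:.2f}s')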
I have two programs, one written in C and one written in Python. I want to pass a few arguments to the C program from Python and do it many times in parallel, because I have about 1 million such C calls.
Essentially I did something like this:
from subprocess import check_call
import multiprocessing as mp
from itertools import combinations

def run_parallel(f1, f2):
    check_call(f"./c_compiled {f1} {f2} &", cwd='.', shell=True)

if __name__ == '__main__':
    pairs = combinations(fns, 2)  # fns: the list of input files, defined elsewhere
    pool = mp.Pool(processes=32)
    pool.starmap(run_parallel, pairs)
    pool.close()
However, I sometimes get the following error (even though the main process is still running):
/bin/sh: fork: retry: No child processes
Moreover, sometimes the whole Python program fails with:
BlockingIOError: [Errno 11] Resource temporarily unavailable
I found that while it's running, a lot of processes are spawned for my user (around 500), while I have at most 512 available.
This does not happen every time (it depends on the arguments), but it happens often. How can I avoid these problems?
I'd wager you're running up against a process/file descriptor/... limit there.
You can "save" one process per invocation by not using shell=True:
check_call(["./c_compiled", f1, f2], cwd='.')
But it would be better still to make the C code callable from Python instead of spawning a process for every call. By far the easiest way to interface "random" C code with Python is Cython.
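For illustration only, a hypothetical Cython wrapper, assuming the C logic were refactored into a function int compute(const char *f1, const char *f2) declared in c_compiled.h (both names are invented for this sketch):

# wrapper.pyx -- hypothetical Cython bridge to the C code
cdef extern from "c_compiled.h":
    int compute(const char *f1, const char *f2)

def run(f1, f2):
    # calls straight into C: no shell, no fork, no new process
    return compute(f1.encode(), f2.encode())

Compiled with cythonize, run(f1, f2) could then be called from the pool workers directly instead of going through check_call.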
"many times in parallel" you can certainly do, for reasonable values of "many", but "about 1 million of such C calls" all running at the same time on the same individual machine is almost surely out of the question.
You can lighten the load by running the jobs without interposing a shell, as discussed in #AKX's answer, but that's not enough to bring your objective into range. Better would be to queue up the jobs so as to run only a few at a time -- once you reach that number of jobs, start a new one only when a previous one has finished. The exact number you should try to keep running concurrently depends on your machine and on the details of the computation, but something around the number of CPU cores might be a good first guess.
Note in particular that it is counterproductive to have more jobs at any one time than the machine has resources to run concurrently. If your processes do little or no I/O then the number of cores in your machine puts a cap on that, for only the processes that are scheduled on a core at any given time (at most one per core) will make progress while the others wait. Switching among many processes so as to attempt to avoid starving any of them will add overhead. If your processes do a lot of I/O then they will probably spend a fair proportion of their time blocked on I/O, and therefore not (directly) requiring a core, but in this case your I/O devices may well create a bottleneck, which might prove even worse than the limitation from number of cores.
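A sketch of that queued approach, reusing the names from the question: the Pool keeps only processes jobs running at once, and imap_unordered consumes the pair iterator lazily, so the million pairs are never all in flight together:

import os
import multiprocessing as mp
from subprocess import check_call
from itertools import combinations

def run_one(pair):
    f1, f2 = pair
    # list form, no shell=True, as suggested in the other answer
    check_call(["./c_compiled", f1, f2], cwd='.')

if __name__ == '__main__':
    pairs = combinations(fns, 2)  # fns as in the question
    with mp.Pool(processes=os.cpu_count()) as pool:
        for _ in pool.imap_unordered(run_one, pairs, chunksize=64):
            pass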
I have a quad core i7 920 CPU. It is Hyperthreaded, so the computer thinks it has 8 cores.
From what I've read on the interweb, when doing parallel tasks I should use the number of physical cores, not the number of hyperthreaded cores.
So I have done some timings, and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.
Why is this? My example code is too long to post here, but can be found here: https://github.com/jsphon/MTVectorizer
(A chart of the performance was attached here.)
(Intel) hyperthreaded cores act like (up to) two CPUs.
The observation is that a single CPU has a set of resources that would ideally be busy continuously, but that in practice sit idle surprisingly often while the CPU waits for some external event, typically a memory read or write.
By adding a bit of state for another hardware thread (e.g., another copy of the registers plus some extra bookkeeping), the "single" CPU can switch its attention to executing the other thread when the first one blocks. (One can generalize this to N hardware threads, and other architectures have done so; Intel stopped at 2.)
If both hardware threads spend much of their time waiting on such events, the CPU can interleave their processing effectively. Forty nanoseconds for a memory wait is a long time, so if your program fetches lots of memory, I'd expect both hardware threads to look fully effective, i.e., you should get nearly 2x.
If the two hardware threads are doing work that is highly local (e.g., intense computations in just the registers), then internal waits become minimal and the single CPU can't switch fast enough to service both hardware threads as fast as they generate work. In this case, performance will degrade.
I don't recall where I heard it, and I heard it a long time ago: under such circumstances the net effect is more like 1.3x than the idealized 2x. (I expect the SO audience to correct me on this.)
Your application may switch back and forth in its needs depending on which part is running at the moment, so you will get a mix of performance. I'm happy with any speedup I can get.
Ira Baxter has explained your question pretty well, but I want to add one more thing (I can't comment on his answer because I don't have enough rep yet): there is an overhead to switching from one thread to another. This process, called context switching (http://wiki.osdev.org/Context_Switching#Hardware_Context_Switching), requires, at minimum, that the CPU core swap its registers to reflect the new thread's data. The cost is significant for process-level context switches, but quite a bit cheaper for thread-level switches. This means two things:
1) Hyperthreading will never give you the theoretical 2x performance boost, because the cost of context switching is non-trivial. This is also why highly local threads degrade performance, per Ira's answer: frequent context switching multiplies that cost.
2) Eight single-threaded processes will run slower than four double-threaded processes doing the same work. Thus, if you plan on doing multithreaded work, you should make use of Python's thread library or the excellent greenlet library (https://greenlet.readthedocs.org/en/latest/).