I'm new to using HPC and had some questions regarding parallelization of code. I have some python code which I've successfully parallelized using multi-threading which works great on my personal machine and a server. However, I just got access to HPC resources at my university. The general section of code which I use looks like this:
import os
import concurrent.futures

# Iterate over weathers
print(f'Number of CPU: {os.cpu_count()}')
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(run_weather, i) for i in range(len(weather_list))]
    for f in concurrent.futures.as_completed(futures):
        results.append(f.result())
On my personal machine, when running two weathers (the item I've parallelized) I get the following results:
Number of CPU: 8
done in 29.645377159118652 seconds
On the HPC when running on 1 node with 32 cores, I get the following results:
Number of CPU: 32
done in 86.95256996154785 seconds
So it is running almost three times slower, as if the tasks were running serially with the same per-task overhead. I also attempted switching the code to ProcessPoolExecutor, and it was much slower than ThreadPoolExecutor. I'm assuming this must be something to do with data transfer (I've seen that multiprocessing on multiple nodes attempts to pass the entire program, packages and all), but as I said I'm very new to HPC and the wiki provided by my university leaves much to be desired.
Adding threads will only slow down the execution of CPU-bound programs, since so much time is wasted acquiring and releasing locks. Unlike in some other programming languages, threads in Python are bound by a global interpreter lock (GIL), so you may not see the behavior you are used to from those languages.
As a rule of thumb for Python: multiple processes are good for CPU-bound work and spreading workloads across multiple cores. Threads are great for things like concurrent IO. Threads will not speed up any kind of computational work.
Additionally, running a number of threads equal to the number of CPUs probably does not make sense for your program, particularly because threading won't spread work across CPU cores anyway. More threads add more overhead in acquiring and releasing the GIL, which is why your execution times grow as you add threads.
Next steps:
Determine if threading is speeding up the execution of your program at all.
Try running your code single-threaded, then try again with a small number of threads. Does it speed up at all?
Chances are, you will find threading is not helping you at all here.
Explore multiprocessing instead of threading, i.e., use ProcessPoolExecutor instead of ThreadPoolExecutor (see the sketch after this list).
Keep in mind, processes have their own caveats just like threads.
Here is a useful reference to learn more.
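As a minimal sketch of that switch, with run_weather and weather_list as hypothetical stand-ins for the question's own definitions (the __main__ guard matters because process pools re-import the main module on some platforms):

import concurrent.futures

def run_weather(i):
    # Placeholder for the question's CPU-bound per-weather simulation.
    return sum(x * x for x in range(5_000_000)) + i

weather_list = [0, 1]  # stand-in for the question's list of weather scenarios

if __name__ == '__main__':
    results = []
    # Processes sidestep the GIL, so CPU-bound tasks can run on separate cores.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(run_weather, i) for i in range(len(weather_list))]
        for f in concurrent.futures.as_completed(futures):
            results.append(f.result())
    print(results)

One caveat that may matter on the HPC side: arguments and results are pickled between processes, so if run_weather needs large data, the transfer cost can swallow the parallel gains.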
Related
I'm slightly confused about whether multithreading works in Python or not.
I know there have been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience, and have seen others post their own answers and examples here on StackOverflow, that multithreading is indeed possible in Python. So why does everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading only runs in parallel for some IO tasks; for CPU-bound tasks, only one thread runs at a time, even on a multi-core machine.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: any task that tries to get a speed boost from parallel execution using pure Python code will not see a speed-up, as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations), any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code; stick to the multiprocessing module for such tasks, or delegate to a dedicated external library.
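A quick way to see this for yourself is to time the same pure-Python, CPU-bound function under both executors; a rough sketch (timings will vary by machine and core count):

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # Pure-Python arithmetic: holds the GIL for the whole call.
    return sum(i * i for i in range(n))

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [3_000_000] * 4))
    return time.perf_counter() - start

if __name__ == '__main__':
    print('threads:  ', timed(ThreadPoolExecutor))   # roughly the serial time
    print('processes:', timed(ProcessPoolExecutor))  # close to a 4x speed-up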
Yes. :)
You have the low-level thread module and the higher-level threading module. But if you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Threading is allowed in Python; the only problem is that the GIL will make sure just one thread is executing at a time (no parallelism).
So basically, if you want to multi-thread code to speed up calculation, it won't speed things up, as just one thread executes at a time; but if you use it to interact with a database, for example, it will.
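For example, here is a small sketch where time.sleep stands in for waiting on a database reply: because sleeping (like most blocking I/O) releases the GIL, the ten waits overlap:

import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(i):
    time.sleep(1)  # stands in for waiting on a database reply; sleep releases the GIL
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(fake_query, range(10)))
print(f'10 one-second "queries" took {time.perf_counter() - start:.1f}s')  # ~1s, not 10s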
I feel for the poster because the answer is invariably "it depends what you want to do". However, parallel speed-up in Python has always been terrible in my experience, even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2, 4, 8, 16]
for task in tasks:
    tic = time.perf_counter()
    pool = mp.Pool(task)
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    toc = time.perf_counter()
    time1 = toc - tic
    print(f"parallel {time1} tasks {task}")
Many references say that the Python GIL lowers the performance of multi-threaded code on multi-core machines, since each thread needs to acquire the GIL before execution.
In other words, it looks like the GIL in effect turns a multi-threaded Python program into a single-threaded one.
For example:
(1) Thread A gets the GIL, executes for some time, releases the GIL
(2) Thread B gets the GIL, executes for some time, releases the GIL
...
However, after some simple experiments, I found that although the GIL lowers performance, the total CPU usage can exceed 100% on a multi-core machine.
from threading import Thread

def test():
    while 1:
        pass

for i in range(4):
    t = Thread(target=test)
    t.start()
On a 4 core, 8 thread machine, the above program will occupy around 160% CPU usage.
Is there anything I misunderstand? Can two threads execute at exactly the same moment? Or is the CPU usage calculation biased or otherwise wrong?
Thanks
In addition to Dolda2000's answer: Python bytecode will only be executed by a single processor at a time because of the GIL. Only certain C modules (which don't manage Python state) are able to run concurrently.
Threading is more appropriate for I/O-bound applications (I/O releases the GIL, allowing for more concurrency). In other cases, Python multi-threading is slower than serial execution and shows degraded performance. So to exploit all cores and get better performance, use multiprocessing.
There's a very nice explanation of it in this answer, check it out!
In all likelihood, the extra 60% CPU usage you're seeing is simply your various threads fighting over the GIL.
There is, after all, some time spent outside the GIL: the interpreter does work to release and acquire it, and the OS scheduler is busy arbitrating between the threads.
The current Python application that I'm working on needs to utilize 1000+ threads (Python's threading module). Not that any single thread is working at max CPU cycles; this is just a web server load-test app I'm creating, i.e., emulating 200 Firefox clients all logging into a web server and downloading small web components, basically emulating humans that operate in seconds as opposed to microseconds.
So, I was reading through various topics such as "how many threads does Python support on Linux / Windows, etc.", and I saw a lot of varied answers. One user said it's all about memory and that the Linux kernel by default only sets aside 8 MB for threads; if that is exceeded, threads start being killed by the kernel.
One guy stated this is a non-issue for CPython because only one thread runs at a time anyway (because of the GIL), so we can specify a gazillion threads? What's the actual truth on this?
"One thread is running at a time because of the GIL." Well, sort of. The GIL means that only one thread can be executing Python code at a time. However, any number of threads could be doing IO, various other syscalls, or other code that doesn't hold the GIL.
It sounds like your threads will be doing mostly network I/O, and any number of threads can do I/O simultaneously. The GIL competition might be pretty fierce with 1000 threads, but you can always create multiple Python processes and divide the I/O threads between them (i.e., fork a couple times before you start).
"The Linux kernel by default only sets aside 8Meg for threads." I'm not sure where you heard that. Maybe what you actually heard was "On Linux, the default stack size is often 8 MiB," which is true. Each thread will use up 8 MiB of address space for stack (no problem on 64-bit) plus kernel resources for the additional memory maps and the thread process itself. You can change the stack size using the threading.stack_size library function, which helps if you have a lot of threads that don't make deep calls.
>>> import threading
>>> threading.stack_size()            # 0 means the platform default, probably 8 MiB
0
>>> threading.stack_size(64 * 1024)   # set a 64 KiB stack for future threads
0
Others in this thread have suggested using an asynchronous / nonblocking framework. Well, you can do that. However, on the modern Linux kernel, a multithreaded model is competitive with asynchronous (select/poll/epoll) I/O multiplexing techniques. Rewriting your code to use an asynchronous model is a non-trivial amount of work, so I'd only do it if I couldn't get the required performance from a threaded model. If your threads are really trying to simulate human latency (e.g., spend most of their time sleeping), there are a lot of scenarios in which the asynchronous approach is actually slower. I'm not sure if this applies to Python, where the reduced GIL contention alone might merit the switch.
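For what it's worth, in modern Python the asynchronous route no longer requires an external framework; a rough asyncio sketch of the simulated-client pattern, with a one-second sleep standing in for think time plus a small HTTP exchange:

import asyncio
import time

async def fake_client(i):
    await asyncio.sleep(1)  # stands in for think time plus a small HTTP exchange
    return i

async def main():
    start = time.perf_counter()
    # One event loop multiplexes all 1000 "clients" on a single thread.
    results = await asyncio.gather(*(fake_client(i) for i in range(1000)))
    print(f'{len(results)} simulated clients in {time.perf_counter() - start:.1f}s')  # ~1s

asyncio.run(main())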
Both of those are partially true:
Each thread does have a stack, and you can run out of address space for the stack if you create enough threads.
Python also does have something called a GIL, which only allows one Python thread to run at a time. However, once Python code calls into C code, that C code can run while a different Python thread runs. Even so, threads in Python are still native OS threads, and the stack-space limit still applies.
If you're planning on having many connections, rather than using many threads, consider using an asynchronous design. Twisted would probably work well here.
Following up on David Beazley's paper regarding Python and the GIL, would it be good practice to limit a Python program (CPython, with the GIL and all) to a single CPU on a Windows-based multi-core system?
Would it boost performance?
UPDATE: Assume multiple threads are used (not sure if it makes a difference)
The paper does indeed imply that limiting a program to a single core would improve performance in that particular case. However, there are a number of concerns that you would need to deal with:
His tests are mainly for compute intensive threads rather than IO bound threads. If the threads you are using often block voluntarily (such as in a web server waiting for a client) then you don't run into GIL issues at all.
The GIL issues deal specifically with threads and not processes. I may be reading your question wrong, but you seem to be asking about restricting all Python programs to a single core. Programs using processes for parallelism don't suffer from GIL issues and restricting them to a single core will make them slower.
The GIL is drastically different in Python 3.2 (as David mentions in this video). The GIL was changed explicitly to deal with such issues; while it still has problems, it no longer has this one.
In summary, the only time you want to complicate your life by forcing the OS to restrict the program to a single core is when you are running a:
Multithreaded
Compute Intensive
Pre-Python-3.2
program on a multicore machine.
Bias: For parallel computing involving heavy CPU processing, I much prefer message passing and cooperating processes to thread programming (of course, it depends on the problem).
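If you do land in that narrow case and want to try pinning, one hedged sketch uses the third-party psutil package, whose cpu_affinity method works on Windows and Linux (core 0 below is an arbitrary choice):

import psutil  # third-party: pip install psutil

p = psutil.Process()            # the current Python process
print('before:', p.cpu_affinity())
p.cpu_affinity([0])             # restrict scheduling to a single core
print('after:', p.cpu_affinity())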
You shouldn't limit your programs to one core. Beazley was just demonstrating a specific problem that performed poorly under those unique conditions (an IO-bound thread and a CPU-bound thread competing with each other). Ideally you want to avoid those conditions by using a different method (import multiprocessing).
I think the best solution is to put your CPU bound tasks in other processes using the multiprocessing module so that they utilize their own cores, and IO bound tasks in threads (or microthreads/coroutines, if you read his interesting paper on that: http://www.dabeaz.com/coroutines/) since the GIL is best suited for those types of tasks.
Conclusion: Python threads are best suited for IO bound tasks, NOT CPU bound.