Multiprocessing - setting the pool size in a portable manner

Multiprocessing - setting the pool size in a portable manner - python

When creating a multiprocessing.Pool instance with no argument, it is my understanding that the pool size will be assigned the value of os.cpu_count()
On my machine os.cpu_count() == 20
Also:-
from multiprocessing import Pool
with Pool() as pool:
print(len(pool.__dict__['_pool']))
...gives me a value of 20
But here's the problem (if indeed it is a problem). My CPU is a Xeon W-2150B. Thus it is a single CPU with 10 cores where each core is capable of handling 2 concurrent threads (hyper-threading). And so the value of 20 seems to be the number of concurrent threads that the CPU can handle.
However, it seems to me that one wouldn't want to create a pool size that's going to "use up" the CPU's capabilities in its entirety because any other processes running on the same machine would suffer degradation.
So, my thinking is that the pool size should be limited to (number_of_cores - 1)
One way of doing this would be as follows:-
import psutil
from multiprocessing import Pool
maxWorkers = max(2, psutil.cpu_count(logical=False) - 1)
with Pool(maxWorkers) as pool:
print(len(pool.__dict__['_pool']))
...gives a value of 9 which I think is much more reasonable and, potentially, more portable.
Is this a reasonable approach?

It depends on your application.
When you create a Pool in size n, n empty processes are created. Those processes almost doesn't consume CPU until a task is applied to them.
So another approach can be to set the Pool with maximum size and limit the number of running processes with another logic - hold a queue for incoming tasks, and apply them to the pool as long as you don't reach a limit of parallelly running processes.
This approach might be useful if you might want to increase the number of running processes in runtime. Decreasing the number of processes needs more logic.
Notice that even after you apply a task to a process in the pool, the OS still have control over it, so it can be scheduled and interrupted by the OS according to its priority.

Related

Sleeping jobs occupying cores when using python multiprocessing pool

I am using multiprocessing.Pool in python to schedule around 2500 jobs. I am submitting the jobs like this:
pool = multiprocessing.Pool()
for i from 1 to 2500: # pseudocode
jobs.append(pool.apply_async(....))
for j in jobs:
_ = job.get()
The jobs are such, that after some computation, they go to sleep for a long time, waiting for some event to complete. My expectation was that, while they sleep, the other waiting jobs would get scheduled. But it is not happening like that. The maximum number of jobs scheduled at a single time is around 23 (even though they are all sleeping, ps aux shows state S+) which is the more or less the number of cores in the machine. Only after a job finishes and releases a core, another job is getting scheduled.
My expectation was that all 2500 jobs would get scheduled at once. How do I make python submit all 2500 jobs at once?

The multiprocessing and threading package of Python use process/thread pools. By default the number of processes/threads in a pool is dependent of the hardware concurrency (ie. typically the number of hardware threads supported by your processor). You can tune this number but you should really not create too many threads or processes because they are precious resources of the operating system (OS). Note that threads are less expensive than processes for most OS but the CPython makes threads not very useful (except of IO latency-bound jobs) because of the global interpreter lock (GIL). Creating 2500 processes/threads will put a lot of pressure on the OS scheduler and slow does the whole system. OS are design so that waiting threads are not expensive but frequent wake ups will be clearly expensive. Moreover, the number of processes/threads that can be created on a given platform is bounded. AFAIR, on my old Windows 7 system this was limited to 1024. The biggest problem is that each thread requires a stack typically initialized to 1~2 MiB so that creating 2500 threads will takes 2.5~5.0 GiB of RAM! This will be significantly worst for processes. Not to mention cache misses will be more frequent resulting in a slower execution. Thus, put it shortly, do not create 2500 threads or processes: this is too expensive.
You do not need threads or processes, you needs fibers or more generally green threads likes greenlet or eventlet as well as gevent coroutines. The last is known to be fast and supports thread-pools. Alternatively, you can use the recent async feature of Python which is the standard way to deal with such a problem.

pool = multiprocessing.Pool() will use the all available cores. If you need to use a specific number of processes you need to specify it as argument.
pool = multiprocessing.Pool(processes=100)

Difference between a "worker" and a "task" for concurrent.futures.ProcessPoolExecutor

I've got an "embarrassingly parallel" problem running on python, and I thought I could use the concurrent.futures module to parallelize this computation. I've done this before successfully, and this is the first time I'm trying to do this on a computer that's more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of processors spool up to 100%. (I'm talking about the "equalizer bars" at the top of the terminal:
Screenshot of htop
My question is — if I'm only spinning out 10 workers, why do ALL of my processors seem to be being used at maximum capacity? My working theory is that the 10 workers I call are "reserved," and the other processors just jump in to help out... if someone else were to run something else and ask for some processing power (but NOT including my 10 requested workers), my other tasks would back off and give them back. But... this isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
a = sum(list(range(10 ** arg)))
print(a)
return a
def main():
import concurrent.futures
from random import randrange
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
# Submit the tasks
futures = []
for i in range(100):
random_argument = randrange(5, 7)
futures.append(executor.submit(expensive_function, random_argument))
# Monitor your progress:
num_results = len(futures)
for k, _ in enumerate(concurrent.futures.as_completed(futures)):
print(f'********** Completed {k + 1} of {num_results} simulations **********')
if __name__ == '__main__':
main()

due to the GIL a single proccess can have only 1 thread executing python bytecode at a given time, so if you have 10 processes you should have 10 threads (and therefore cores) executing python bytecode at a given time, however this is not the full story.
the expensive_function is ambiguous, python can create 10 worker processes, and therefore you can only have 10 cores executing python code at a given time (+ main process) (due to GIL), however, if expensive_function is doing some sort of multithreading using an external C module (which doesn't have to abide to the GIL), then each of the 10 processes can have Y threads working in parallel and therefore you'll have a total of 10*Y cores being utilized at a given time, for example your code might be running 6 threads externally on each of the 10 processes for a total of 60 threads running concurrently on 60 cores.
however this doesn't really answer your question, so the main answer is, workers is the number of processes (cores) that can execute python bytecode at a given time (with a strong emphasis on "python bytecode"), wheres tasks is the total number of tasks that will be executed by your workers, and when any worker finishes the task at hand, it will start another task.

Maximum pool size when using ThreadPool Python

I am using ThreadPool to achieve multiprocessing. When using multiprocessing, pool size limit should be equivalent to number of CPU cores.
My question- When using ThreadPool, should the pool size limit be number of CPU cores?
This is my code
from multiprocessing.pool import ThreadPool as Pool
class Subject():
def __init__(self, url):
#rest of the code
def func1(self):
#returns something
if __name__=="__main__":
pool_size= 11
pool= Pool(pool_size)
objects= [Subject() for url in all_my_urls]
for obj in objects:
pool.apply_async(obj.func1, ())
pool.close()
pool.join()
What should be the maximum pool size be?
Thanks in advance.

You cannot use threads for multiprocessing, you can only achieve multithreading. Multiple threads cannot run concurrently in a single Python process because of the GIL and so multithreading is only useful if they are running IO heavy work (e.g. talking to the Internet) where they spend a lot of time waiting, rather than CPU heavy work (e.g. maths) which constantly occupies a core.
So if you have many IO heavy tasks running at once then having that many threads will be useful, even if it's more than the the number of CPU cores. A very large number threads will eventually have a negative impact on performance, but until you actually measure a problem don't worry. Something like 100 threads should be fine.

No, you needn't limit the thread pool size be same as the number of CPU cores. If you are using it in IO high throughput situation, you could adjust your thread pool size to a suitable number which help you get the highest IO throughput, and if increasing the threads number, you couldn't get higher IO throughput number.
(I found that the maximum thread number I could set for threadpool is only around 9000, if higher, the Python3.6 reports error, and Google let me visit your question)

Python ThreadPool from multiprocessing.pool cannot ultilize all CPUs

I have some string processing job in Python. And I wish to speed up the job
by using a thread pool. The string processing job has no dependency to each
other. The result will be stored into a mongodb database.
I wrote my code as follow:
thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
pool.apply_async(_process, [single_string ])
pool.close()
pool.join()
def _process(s):
# Do staff, pure python string manipulation.
# Save the output to a database (pyMongo).
I try to run the code in a Linux machine with 8 CPU cores. And it turns out
that the maximum CPU usage can only be around 130% (read from top), when I
run the job for a few minutes.
Is my approach correct to use a thread pool? Is there any better way to do so?

You might check using multiple processes instead of multiple threads. Here is a good comparison of both options. In one of the comments it is stated that Python is not able to use multiple CPUs while working with multiple threads (due to the Global interpreter lock). So instead of using a Thread pool you should use a Process pool to take full leverage of your machine.

Perhaps _process isn't CPU bound; it might be slowed by the file system or network if you're writing to a database. You could see if the CPU usage rises if you make your process truly CPU bound, for example:
def _process(s):
for i in xrange(100000000):
j = i * i

Multiprocessing vs Threading Python [duplicate]

This question already has answers here:
What are the differences between the threading and multiprocessing modules?
(6 answers)
Closed 3 years ago.
I am trying to understand the advantages of multiprocessing over threading. I know that multiprocessing gets around the Global Interpreter Lock, but what other advantages are there, and can threading not do the same thing?

Here are some pros/cons I came up with.
Multiprocessing
Pros
Separate memory space
Code is usually straightforward
Takes advantage of multiple CPUs & cores
Avoids GIL limitations for cPython
Eliminates most needs for synchronization primitives unless if you use shared memory (instead, it's more of a communication model for IPC)
Child processes are interruptible/killable
Python multiprocessing module includes useful abstractions with an interface much like threading.Thread
A must with cPython for CPU-bound processing
Cons
IPC a little more complicated with more overhead (communication model vs. shared memory/objects)
Larger memory footprint
Threading
Pros
Lightweight - low memory footprint
Shared memory - makes access to state from another context easier
Allows you to easily make responsive UIs
cPython C extension modules that properly release the GIL will run in parallel
Great option for I/O-bound applications
Cons
cPython - subject to the GIL
Not interruptible/killable
If not following a command queue/message pump model (using the Queue module), then manual use of synchronization primitives become a necessity (decisions are needed for the granularity of locking)
Code is usually harder to understand and to get right - the potential for race conditions increases dramatically

The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads.

Threading's job is to enable applications to be responsive. Suppose you have a database connection and you need to respond to user input. Without threading, if the database connection is busy the application will not be able to respond to the user. By splitting off the database connection into a separate thread you can make the application more responsive. Also because both threads are in the same process, they can access the same data structures - good performance, plus a flexible software design.
Note that due to the GIL the app isn't actually doing two things at once, but what we've done is put the resource lock on the database into a separate thread so that CPU time can be switched between it and the user interaction. CPU time gets rationed out between the threads.
Multiprocessing is for times when you really do want more than one thing to be done at any given time. Suppose your application needs to connect to 6 databases and perform a complex matrix transformation on each dataset. Putting each job in a separate thread might help a little because when one connection is idle another one could get some CPU time, but the processing would not be done in parallel because the GIL means that you're only ever using the resources of one CPU. By putting each job in a Multiprocessing process, each can run on its own CPU and run at full efficiency.

Python documentation quotes
The canonical version of this answer is now at the dupliquee question: What are the differences between the threading and multiprocessing modules?
I've highlighted the key Python documentation quotes about Process vs Threads and the GIL at: What is the global interpreter lock (GIL) in CPython?
Process vs thread experiments
I did a bit of benchmarking in order to show the difference more concretely.
In the benchmark, I timed CPU and IO bound work for various numbers of threads on an 8 hyperthread CPU. The work supplied per thread is always the same, such that more threads means more total work supplied.
The results were:
Plot data.
Conclusions:
for CPU bound work, multiprocessing is always faster, presumably due to the GIL
for IO bound work. both are exactly the same speed
threads only scale up to about 4x instead of the expected 8x since I'm on an 8 hyperthread machine.
Contrast that with a C POSIX CPU-bound work which reaches the expected 8x speedup: What do 'real', 'user' and 'sys' mean in the output of time(1)?
TODO: I don't know the reason for this, there must be other Python inefficiencies coming into play.
Test code:
#!/usr/bin/env python3
import multiprocessing
import threading
import time
import sys
def cpu_func(result, niters):
'''
A useless CPU bound function.
'''
for i in range(niters):
result = (result * result * i + 2 * result * i * i + 3) % 10000000
return result
class CpuThread(threading.Thread):
def __init__(self, niters):
super().__init__()
self.niters = niters
self.result = 1
def run(self):
self.result = cpu_func(self.result, self.niters)
class CpuProcess(multiprocessing.Process):
def __init__(self, niters):
super().__init__()
self.niters = niters
self.result = 1
def run(self):
self.result = cpu_func(self.result, self.niters)
class IoThread(threading.Thread):
def __init__(self, sleep):
super().__init__()
self.sleep = sleep
self.result = self.sleep
def run(self):
time.sleep(self.sleep)
class IoProcess(multiprocessing.Process):
def __init__(self, sleep):
super().__init__()
self.sleep = sleep
self.result = self.sleep
def run(self):
time.sleep(self.sleep)
if __name__ == '__main__':
cpu_n_iters = int(sys.argv[1])
sleep = 1
cpu_count = multiprocessing.cpu_count()
input_params = [
(CpuThread, cpu_n_iters),
(CpuProcess, cpu_n_iters),
(IoThread, sleep),
(IoProcess, sleep),
]
header = ['nthreads']
for thread_class, _ in input_params:
header.append(thread_class.__name__)
print(' '.join(header))
for nthreads in range(1, 2 * cpu_count):
results = [nthreads]
for thread_class, work_size in input_params:
start_time = time.time()
threads = []
for i in range(nthreads):
thread = thread_class(work_size)
threads.append(thread)
thread.start()
for i, thread in enumerate(threads):
thread.join()
results.append(time.time() - start_time)
print(' '.join('{:.6e}'.format(result) for result in results))
GitHub upstream + plotting code on same directory.
Tested on Ubuntu 18.10, Python 3.6.7, in a Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), SSD: Samsung MZVLB512HAJQ-000L7 (3,000 MB/s).
Visualize which threads are running at a given time
This post https://rohanvarma.me/GIL/ taught me that you can run a callback whenever a thread is scheduled with the target= argument of threading.Thread and the same for multiprocessing.Process.
This allows us to view exactly which thread runs at each time. When this is done, we would see something like (I made this particular graph up):
+--------------------------------------+
+ Active threads / processes +
+-----------+--------------------------------------+
|Thread 1 |******** ************ |
| 2 | ***** *************|
+-----------+--------------------------------------+
|Process 1 |*** ************** ****** **** |
| 2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
+ Time --> +
+--------------------------------------+
which would show that:
threads are fully serialized by the GIL
processes can run in parallel

The key advantage is isolation. A crashing process won't bring down other processes, whereas a crashing thread will probably wreak havoc with other threads.

As mentioned in the question, Multiprocessing in Python is the only real way to achieve true parallelism. Multithreading cannot achieve this because the GIL prevents threads from running in parallel.
As a consequence, threading may not always be useful in Python, and in fact, may even result in worse performance depending on what you are trying to achieve. For example, if you are performing a CPU-bound task such as decompressing gzip files or 3D-rendering (anything CPU intensive) then threading may actually hinder your performance rather than help. In such a case, you would want to use Multiprocessing as only this method actually runs in parallel and will help distribute the weight of the task at hand. There could be some overhead to this since Multiprocessing involves copying the memory of a script into each subprocess which may cause issues for larger-sized applications.
However, Multithreading becomes useful when your task is IO-bound. For example, if most of your task involves waiting on API-calls, you would use Multithreading because why not start up another request in another thread while you wait, rather than have your CPU sit idly by.
TL;DR
Multithreading is concurrent and is used for IO-bound tasks
Multiprocessing achieves true parallelism and is used for CPU-bound tasks

Another thing not mentioned is that it depends on what OS you are using where speed is concerned. In Windows processes are costly so threads would be better in windows but in unix processes are faster than their windows variants so using processes in unix is much safer plus quick to spawn.

Other answers have focused more on the multithreading vs multiprocessing aspect, but in python Global Interpreter Lock (GIL) has to be taken into account. When more number (say k) of threads are created, generally they will not increase the performance by k times, as it will still be running as a single threaded application. GIL is a global lock which locks everything out and allows only single thread execution utilizing only a single core. The performance does increase in places where C extensions like numpy, Network, I/O are being used, where a lot of background work is done and GIL is released. So when threading is used, there is only a single operating system level thread while python creates pseudo-threads which are completely managed by threading itself but are essentially running as a single process. Preemption takes place between these pseudo threads. If the CPU runs at maximum capacity, you may want to switch to multiprocessing.
Now in case of self-contained instances of execution, you can instead opt for pool. But in case of overlapping data, where you may want processes communicating you should use multiprocessing.Process.

MULTIPROCESSING
Multiprocessing adds CPUs to increase computing power.
Multiple processes are executed concurrently.
Creation of a process is time-consuming and resource intensive.
Multiprocessing can be symmetric or asymmetric.
The multiprocessing library in Python uses separate memory space, multiple CPU cores, bypasses GIL limitations in CPython, child processes are killable (ex. function calls in program) and is much easier to use.
Some caveats of the module are a larger memory footprint and IPC’s a little more complicated with more overhead.
MULTITHREADING
Multithreading creates multiple threads of a single process to increase computing power.
Multiple threads of a single process are executed concurrently.
Creation of a thread is economical in both sense time and resource.
The multithreading library is lightweight, shares memory, responsible for responsive UI and is used well for I/O bound applications.
The module isn’t killable and is subject to the GIL.
Multiple threads live in the same process in the same space, each thread will do a specific task, have its own code, own stack memory, instruction pointer, and share heap memory.
If a thread has a memory leak it can damage the other threads and parent process.
Example of Multi-threading and Multiprocessing using Python
Python 3 has the facility of Launching parallel tasks. This makes our work easier.
It has for thread pooling and Process pooling.
The following gives an insight:
ThreadPoolExecutor Example
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
with urllib.request.urlopen(url, timeout=timeout) as conn:
return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
ProcessPoolExecutor
import concurrent.futures
import math
PRIMES = [
112272535095293,
112582705942171,
112272535095293,
115280095190773,
115797848077099,
1099726899285419]
def is_prime(n):
if n % 2 == 0:
return False
sqrt_n = int(math.floor(math.sqrt(n)))
for i in range(3, sqrt_n + 1, 2):
if n % i == 0:
return False
return True
def main():
with concurrent.futures.ProcessPoolExecutor() as executor:
for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
print('%d is prime: %s' % (number, prime))
if __name__ == '__main__':
main()

Threads share the same memory space to guarantee that two threads don't share the same memory location so special precautions must be taken the CPython interpreter handles this using a mechanism called GIL, or the Global Interpreter Lock
what is GIL(Just I want to Clarify GIL it's repeated above)?
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
For the main question, we can compare using Use Cases, How?
1-Use Cases for Threading: in case of GUI programs threading can be used to make the application responsive For example, in a text editing program, one thread can take care of recording the user inputs, another can be responsible for displaying the text, a third can do spell-checking, and so on. Here, the program has to wait for user interaction. which is the biggest bottleneck. Another use case for threading is programs that are IO bound or network bound, such as web-scrapers.
2-Use Cases for Multiprocessing: Multiprocessing outshines threading in cases where the program is CPU intensive and doesn’t have to do any IO or user interaction.
For More Details visit this link and link or you need in-depth knowledge for threading visit here for Multiprocessing visit here

Process may have multiple threads. These threads may share memory and are the units of execution within a process.
Processes run on the CPU, so threads are residing under each process. Processes are individual entities which run independently. If you want to share data or state between each process, you may use a memory-storage tool such as Cache(redis, memcache), Files, or a Database.

As I learnd in university most of the answers above are right. In PRACTISE on different platforms (always using python) spawning multiple threads ends up like spawning one process. The difference is the multiple cores share the load instead of only 1 core processing everything at 100%. So if you spawn for example 10 threads on a 4 core pc, you will end up getting only the 25% of the cpus power!! And if u spawn 10 processes u will end up with the cpu processing at 100% (if u dont have other limitations). Im not a expert in all the new technologies. Im answering with own real experience background

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.