I am using multiprocessing.Pool in python to schedule around 2500 jobs. I am submitting the jobs like this:
pool = multiprocessing.Pool()
jobs = []
for i in range(2500):  # pseudocode
    jobs.append(pool.apply_async(....))
for job in jobs:
    _ = job.get()
The jobs are such that, after some computation, they go to sleep for a long time, waiting for some event to complete. My expectation was that, while they sleep, the other waiting jobs would get scheduled. But it is not happening like that. The maximum number of jobs scheduled at a single time is around 23 (even though they are all sleeping; ps aux shows state S+), which is more or less the number of cores in the machine. Only after a job finishes and releases a core does another job get scheduled.
My expectation was that all 2500 jobs would get scheduled at once. How do I make python submit all 2500 jobs at once?
Python's multiprocessing and threading packages use process/thread pools. By default, the number of processes/threads in a pool depends on the hardware concurrency (i.e. typically the number of hardware threads supported by your processor). You can tune this number, but you should really not create too many threads or processes, because they are precious operating-system (OS) resources. Note that threads are cheaper than processes on most OSs, but CPython makes threads not very useful (except for IO latency-bound jobs) because of the global interpreter lock (GIL).

Creating 2500 processes/threads will put a lot of pressure on the OS scheduler and slow down the whole system. OSs are designed so that waiting threads are not expensive, but frequent wake-ups clearly are. Moreover, the number of processes/threads that can be created on a given platform is bounded; AFAIR, on my old Windows 7 system it was limited to 1024. The biggest problem is that each thread requires a stack, typically initialized to 1-2 MiB, so creating 2500 threads will take 2.5-5.0 GiB of RAM! This will be significantly worse for processes, not to mention that cache misses will be more frequent, resulting in slower execution. Put shortly: do not create 2500 threads or processes; it is far too expensive.
You do not need threads or processes; you need fibers or, more generally, green threads such as greenlet or eventlet, or gevent coroutines. The last is known to be fast and supports thread pools. Alternatively, you can use Python's async/await feature, which is the standard way to deal with such a problem.
pool = multiprocessing.Pool() will use all available cores. If you need a specific number of processes, you must pass it as an argument:
pool = multiprocessing.Pool(processes=100)
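As a minimal sketch of the async approach (an assumption here: the long waits can be expressed as awaitable operations, simulated below with asyncio.sleep), all 2500 jobs can be in flight at once inside a single process:

```python
import asyncio

async def job(i):
    # Some computation, then a long wait; asyncio.sleep stands in for
    # waiting on a real event (a socket, a subprocess, a timer, ...).
    await asyncio.sleep(0.01)
    return i * i

async def main():
    # All 2500 tasks are scheduled at once; while one sleeps, others run.
    return await asyncio.gather(*(job(i) for i in range(2500)))

if __name__ == "__main__":
    print(len(asyncio.run(main())))
```

The sleeping jobs cost only a coroutine object each, not an OS thread or process, which is why this scales to thousands of concurrent waiters.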
Related
I've got an "embarrassingly parallel" problem running on python, and I thought I could use the concurrent.futures module to parallelize this computation. I've done this before successfully, and this is the first time I'm trying to do this on a computer that's more powerful than my laptop. This new machine has 32 cores / 64 threads, compared to 2/4 on my laptop.
I'm using a ProcessPoolExecutor object from the concurrent.futures library. I set the max_workers argument to 10, and then submit all of my jobs (of which there are maybe 100s) one after the other in a loop. The simulation seems to work, but there is some behaviour I don't understand, even after some intense googling. I'm running this on Ubuntu, and so I use the htop command to monitor my processors. What I see is that:
10 processes are created.
Each process requests > 100% CPU power (say, up to 600%)
A whole bunch of processes are created as well. (I think these are "tasks", not processes. When I type SHIFT+H, they disappear.)
Most alarmingly, it looks like ALL of my processors spool up to 100% (I'm talking about the "equalizer bars" at the top of the terminal):
Screenshot of htop
My question is — if I'm only spinning out 10 workers, why do ALL of my processors seem to be being used at maximum capacity? My working theory is that the 10 workers I call are "reserved," and the other processors just jump in to help out... if someone else were to run something else and ask for some processing power (but NOT including my 10 requested workers), my other tasks would back off and give them back. But... this isn't what "creating 10 processes" intuitively feels like to me.
If you want a MWE, this is roughly what my code looks like:
def expensive_function(arg):
    a = sum(list(range(10 ** arg)))
    print(a)
    return a

def main():
    import concurrent.futures
    from random import randrange
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Submit the tasks
        futures = []
        for i in range(100):
            random_argument = randrange(5, 7)
            futures.append(executor.submit(expensive_function, random_argument))
        # Monitor your progress:
        num_results = len(futures)
        for k, _ in enumerate(concurrent.futures.as_completed(futures)):
            print(f'********** Completed {k + 1} of {num_results} simulations **********')

if __name__ == '__main__':
    main()
Due to the GIL, a single process can have only one thread executing Python bytecode at a given time, so if you have 10 processes you should have 10 threads (and therefore cores) executing Python bytecode at a given time. However, this is not the full story.
expensive_function is ambiguous: Python creates 10 worker processes, so (due to the GIL) you can only have 10 cores executing Python bytecode at a given time (plus the main process). However, if expensive_function does some sort of multithreading through an external C module (which does not have to abide by the GIL), then each of the 10 processes can have Y threads working in parallel, for a total of 10*Y cores being utilized at a given time. For example, your code might be running 6 threads externally in each of the 10 processes, for a total of 60 threads running concurrently on 60 cores.
This doesn't fully answer your question, though, so the main answer is: workers is the number of processes (cores) that can execute Python bytecode at a given time (with a strong emphasis on "Python bytecode"), whereas tasks is the total number of tasks that will be executed by your workers; whenever a worker finishes the task at hand, it starts another one.
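If the extra load does come from a multithreaded native library (an assumption on my part, e.g. numpy linked against OpenBLAS or MKL), you can usually pin each worker to a single thread by setting the relevant environment variables before that library is first imported. A minimal sketch:

```python
import os

# Must be set before the native library (e.g. numpy) is first imported,
# because BLAS/OpenMP runtimes read these variables at load time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import concurrent.futures

def expensive_function(arg):
    # With the limits above, each worker stays on one core even if it
    # calls into an OpenMP- or BLAS-backed library.
    return sum(range(10 ** arg))

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        print(list(executor.map(expensive_function, [3, 4])))
```

With this in place, htop should show roughly max_workers busy cores rather than all of them.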
(I don't know if this should be asked here at SO, or one of the other stackexchange)
When doing heavy I/O-bound tasks, e.g. API calls or database fetches, I wonder: since Python only uses one process for multithreading, can we create even more threads by combining multiprocessing and multithreading, like the pseudocode below,

for process in processes:
    for thread in threads:
        fetch_api_results(thread)
or does Python do this automatically?
I do not think there would be any point doing this: spinning up a new process has a relatively high cost, and spinning up a new thread has a pretty high cost. Serialising tasks to those threads or processes costs again, and synchronising state costs...again.
What I would do if I had two sets of problems:
I/O bound problems (e.g. fetching data over a network)
CPU bound problems related to those I/O bound problems
is to combine multiprocessing with asyncio. This has a much lower overhead: we only have one thread and pay for the scheduler alone (no serialisation between coroutines), and it doesn't involve spinning up a gazillion processes (each of which uses around as much virtual memory as the parent process) or threads (each of which still uses a fair chunk of memory).
However, I would not use asyncio within the multiprocessing workers; I'd use asyncio in the main process, and offload CPU-intensive tasks to a pool of worker processes when needed.
I suspect you probably can use threading inside multiprocessing, but it is very unlikely to bring you any speed boost.
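A minimal sketch of the combination described above (the function names and the 0.01 s sleep standing in for real network I/O are my own illustrative choices): asyncio handles the I/O concurrency in the main process, and a ProcessPoolExecutor handles the CPU-bound follow-up work.

```python
import asyncio
import concurrent.futures

def cpu_heavy(n):
    # CPU-bound step; runs in a worker process, so it bypasses the GIL.
    return sum(i * i for i in range(n))

async def fetch(url):
    # Stand-in for real network I/O (e.g. an aiohttp request).
    await asyncio.sleep(0.01)
    return len(url)

async def pipeline(urls):
    loop = asyncio.get_running_loop()
    # I/O concurrency comes from asyncio; CPU parallelism from the pool.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        sizes = await asyncio.gather(*(fetch(u) for u in urls))
        return await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_heavy, s) for s in sizes)
        )

if __name__ == "__main__":
    print(asyncio.run(pipeline(["a", "bb", "ccc"])))
```

Note that only picklable arguments can cross the process boundary, which is the serialisation cost mentioned earlier; it is paid once per CPU task rather than per I/O operation.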
When creating a multiprocessing.Pool instance with no argument, it is my understanding that the pool size will be assigned the value of os.cpu_count()
On my machine os.cpu_count() == 20
Also:-
from multiprocessing import Pool

with Pool() as pool:
    print(len(pool.__dict__['_pool']))
...gives me a value of 20
But here's the problem (if indeed it is a problem). My CPU is a Xeon W-2150B. Thus it is a single CPU with 10 cores where each core is capable of handling 2 concurrent threads (hyper-threading). And so the value of 20 seems to be the number of concurrent threads that the CPU can handle.
However, it seems to me that one wouldn't want to create a pool size that's going to "use up" the CPU's capabilities in its entirety because any other processes running on the same machine would suffer degradation.
So, my thinking is that the pool size should be limited to (number_of_cores - 1)
One way of doing this would be as follows:-
import psutil
from multiprocessing import Pool

maxWorkers = max(2, psutil.cpu_count(logical=False) - 1)
with Pool(maxWorkers) as pool:
    print(len(pool.__dict__['_pool']))
...gives a value of 9 which I think is much more reasonable and, potentially, more portable.
Is this a reasonable approach?
It depends on your application.
When you create a Pool of size n, n idle processes are created. Those processes consume almost no CPU until a task is applied to them.
So another approach is to create the Pool at maximum size and limit the number of running processes with additional logic: hold a queue of incoming tasks, and apply them to the pool as long as you haven't reached your limit of concurrently running processes.
This approach can be useful if you want to increase the number of running processes at runtime; decreasing the number of processes needs more logic.
Notice that even after you apply a task to a process in the pool, the OS still has control over it, so it can be scheduled and interrupted by the OS according to its priority.
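One way to sketch the queue-feeding idea (the helper names here are my own): keep the pool at full size but cap the number of in-flight tasks with a semaphore that each completing task releases.

```python
import multiprocessing
import threading

def work(x):
    # Must be a top-level function so the pool can pickle it.
    return x * x

def run_limited(tasks, limit):
    # At most `limit` tasks are in flight at once; the rest wait in
    # this loop until a completion callback frees a slot.
    sem = threading.BoundedSemaphore(limit)
    results = []
    with multiprocessing.Pool() as pool:
        def release(_):
            sem.release()
        for t in tasks:
            sem.acquire()
            results.append(
                pool.apply_async(work, (t,),
                                 callback=release, error_callback=release)
            )
        return [r.get() for r in results]

if __name__ == "__main__":
    print(run_limited(range(5), 2))
```

Raising the limit at runtime is then just a matter of releasing extra semaphore slots, which matches the "increase in runtime" point above.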
Suppose I have a function that takes several seconds to compute and I have 8 CPUs (according to multiprocessing.cpu_count()).
What happens when I start this process less than 8 times and more than 8 times?
When I tried this, I observed that even when I started 20 processes they all ran in parallel. I expected them to run 8 at a time, with the others waiting for them to finish.
What happens depends on the underlying operating system.
But generally there are two things: physical CPU cores and threads/processes (all major systems have them). The OS keeps track of this abstract thing called a thread/process (which, among other things, contains code to execute) and maps it to some CPU core for a given time slice, or until some syscall (for example, network access) is made. If you have more threads than cores (which is usually the case; lots of things run in the background of a modern OS), then some of them wait for a context switch to happen (i.e. for their turn).
Now if you have CPU-intensive tasks (heavy calculations), then you won't see any benefit from running more than 8 threads/processes. But even so, they won't run one after another: the OS will interrupt a CPU core at some point to allow other threads to run, mixing the execution (a little bit of this, a little bit of that). That way every thread/process slowly progresses and none waits indefinitely.
On the other hand, if your tasks are I/O-bound (for example, you make an HTTP call and wait for the network), then having 20 tasks will increase performance: when those threads wait for I/O, the OS puts them on hold and allows other threads to run. Those other threads can do I/O as well, and that way you achieve a higher level of concurrency.
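The I/O-bound case is easy to demonstrate (a small sketch; time.sleep stands in for a blocking network call): 20 threads each "waiting on the network" for 0.1 s finish in roughly 0.1 s total, not 2 s.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    # Stand-in for a blocking network call; the GIL is released
    # while a thread sleeps, just as it is during real socket waits.
    time.sleep(0.1)
    return 1

def run(n_tasks, n_workers):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        total = sum(ex.map(fake_io, range(n_tasks)))
    return total, time.perf_counter() - start

if __name__ == "__main__":
    print(run(20, 20))
```

Running the same 20 tasks with a single worker would take about 2 s, which is the "higher level of concurrency" described above.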
I have a python (2.6.5 64-bit, Windows 2008 Server R2) app that launches worker processes. The parent process puts jobs in a job queue, from which workers pick them up. Similarly it has a results queue. Each worker performs its job by querying a server. CPU usage by the workers is low.
When the number of workers grows, CPU usage on the servers actually shrinks. The servers themselves are not the bottleneck, as I can load them up further from other applications.
Anyone else seen similar behavior? Is there an issue with python multiprocessing queues when a large number of processes are reading or writing to the same queues?
Two different ideas for performance constraints:
The bottleneck is the workers fighting each other and the parent for access to the job queue.
The bottleneck is connection rate-limits (syn-flood protection) on the servers.
Gathering more information:
Profile the amount of work done: tasks completed per second, use this as your core performance metric.
Use packet capture to view the network activity for network-level delays.
Have your workers document how long they wait for access to the job queue.
Possible improvements:
Have your workers use persistent connections if available/applicable (e.g. HTTP).
Split the tasks into multiple job queues fed to pools of workers.
It's hard to say exactly what is going on without all the details.
However, remember that the real concurrency is bounded by the actual number of hardware threads. If the number of processes launched is much larger than the actual number of hardware threads, at some point the context-switching overhead will be more than the benefit of having more concurrent processes.
Creating a new thread is a very expensive operation.
One of the simplest ways of controlling a lot of parallel network connections is to use stackless threads with support for asynchronous sockets. Python has great support and a bunch of libraries for that.
My favorite one is gevent, which has a great and completely transparent monkey-patching utility.
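A minimal sketch of that approach, assuming gevent is installed (pip install gevent); monkey.patch_all() makes the stdlib's blocking calls cooperative, so thousands of greenlets cost far less than OS threads:

```python
# Assumes gevent is installed; patch before importing anything blocking.
from gevent import monkey
monkey.patch_all()

import time
import gevent

def fetch(i):
    # Stand-in for a blocking network call; after patch_all(),
    # time.sleep yields to other greenlets instead of blocking.
    time.sleep(0.05)
    return i

# Spawn 100 "workers"; all of them wait concurrently.
jobs = [gevent.spawn(fetch, i) for i in range(100)]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(len(results))
```

The same pattern scales to many thousands of concurrent connections, which is where real threads or processes would hit the limits discussed earlier.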