Efficiently select the number of multiprocessing workers - python

I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps in the analysis in parallel in order to save time. However, I've run into the issue that when I test my code, I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=<process number>).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and taking some fraction of it. With the fraction set to 1/2, an 8-core system would use 4 workers, an 80-core system would use 40, and so on.
However, because of the extreme difference in the core-to-memory ratio across systems, this often leads either to lockups, when too many workers are created for the available memory, or to wasted time, when the number of workers isn't maximized.
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?
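One way to combine the two constraints above, sketched below only as an illustration (psutil is a third-party library, and PER_WORKER_GB is a placeholder for the peak memory you would have to measure for one worker of your own analysis), is to cap the pool by whichever limit is tighter, CPU count or available memory:

import multiprocessing
import psutil  # third-party; used here to query available memory

PER_WORKER_GB = 4  # assumed peak memory of one worker; measure this for your workload

def pick_worker_count(cpu_fraction=0.5):
    by_cpu = max(1, int(multiprocessing.cpu_count() * cpu_fraction))
    available_gb = psutil.virtual_memory().available / 1024**3
    by_memory = max(1, int(available_gb // PER_WORKER_GB))
    return min(by_cpu, by_memory)  # respect the tighter of the two limits

if __name__ == "__main__":
    with multiprocessing.Pool(processes=pick_worker_count()) as pool:
        pass  # pool.map(slow_step, work_items) goes here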

Related

Multiprocessing pool size - cpu_count or cpu_count/2?

I'm running Python scripts that do batch data processing on fairly large AWS instances (48 or 96 vCPUs). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (take a file path, return True/False). I/O and memory don't seem to be limiting.
I've had variable performance where sometimes the best speed comes from pool size = number of vCPU, sometimes number of vCPU/2, and sometimes vCPU*some multiple around 2-4. These are for different kinds of jobs, on different instances, so it would be hard to benchmark all of them.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reason for those numbers:
number of vCPU: It is reasonable; we use all the cores.
number of vCPU/2: It is also reasonable, as we sometimes have twice as many logical cores as physical cores, but the extra logical cores won't actually speed your program up, so we just use vCPU/2.
vCPU*some multiple around 2-4: It is reasonable for some IO-intensive tasks. For these kinds of tasks, the process is not occupying the core all the time, so we can schedule other tasks during IO operations.
Now let's analyze the situation. I guess you are running on a server, which might be a VPS. In that case, there is no difference between logical and physical cores, because a vCPU is just an abstract computation resource provided by the VPS provider; you cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or is just a simple controller, then you don't need to allocate a whole core to it, which means you don't need to subtract one.
Based on your situation, I would suggest the number of vCPUs. But you still need to decide based on the real situation you meet. The critical rule is:
Maximize resource usage (use as many cores as you can) and minimize resource competition (too many processes will compete for the same resources, which slows the whole program down).
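As a rough sketch of that rule, the vCPU and vCPU*2-4 cases could be encoded as below, leaving out the vCPU/2 case since on a VPS logical and physical cores are not distinguishable anyway (the io_bound flag and the multiplier of 3 are illustrative choices, not fixed constants):

import multiprocessing

def suggested_pool_size(io_bound=False, io_multiplier=3):
    n_vcpu = multiprocessing.cpu_count()
    if io_bound:
        # workers spend much of their time waiting on I/O, so oversubscribing is fine
        return n_vcpu * io_multiplier
    # CPU-bound: one worker per vCPU maximizes usage without competition
    return n_vcpu

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=suggested_pool_size(io_bound=False))
    pool.close()
    pool.join()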
There are many rules of thumb that you may follow, depending on the task, as you have already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for the logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end: for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which can badly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever doing batch processing, and you should get quasi-linear speedups with respect to the number of physical cores.
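For instance, a sketch of that setup (which variable actually matters depends on the BLAS backend your NumPy/SciPy build links against, and heavy_step is a stand-in for a real pipeline stage); the variables must be set before the numerical libraries are imported:

import os

# Pin BLAS/OpenMP to one thread per process so pool workers don't oversubscribe cores.
# This must happen before numpy/scipy are imported.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import psutil
from multiprocessing import Pool

def heavy_step(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.linalg.norm(a @ a))

if __name__ == "__main__":
    with Pool(processes=psutil.cpu_count(logical=False)) as pool:
        results = pool.map(heavy_step, range(16))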

Parallel processing of huge number of small tasks

I have a requirement where I have to process a huge number (in the millions) of small CPU-intensive tasks in parallel, where each task takes around 10 s. If I go with multiprocessing or multithreading I would need a huge number of thread/process instances. How do I go about solving this so that it takes a minimal amount of time?
The most common pattern for this is to scale out horizontally. If you have 1,000,000 tasks at 10 sec/task, that is 10,000,000 seconds for a single CPU to process, or 166,667 minutes (2,778 hours, about 116 days). On a multi-core machine with 4 cores that cuts it down to roughly 29 days (a rough estimate; you may need one core to handle queues), and 64 cores would be ~116 days / 64 ≈ 1.8 days.
If single machine performance doesn't meet your criteria you can scale out to multiple machines. All major cloud services offer queueing systems to make this easy/possible:
Amazon SQS
RabbitMQ
Kafka
NSQ
Etc
[Diagram: producers pushing messages onto a shared queue from which consumers pull tasks. Image: https://anwaarlabs.files.wordpress.com/2014/04/messaging-queue.png, property of https://anwaarlabs.wordpress.com/2014/04/28/message-queue-part-3-jms-domains/]
Instead of being limited to a single machine, each machine shares a connection to a global queue from which they (the consumers in the image) can pull tasks, allowing you to scale out to as many CPU cores as you need.
For CPU-bound tasks multithreading is a poor choice because of the GIL. If you only have a couple of million items, it may reduce complexity to use Python multiprocessing with a multiprocessing queue and scale up on a single machine (i.e. rent a 64-core machine from a cloud provider and finish in a couple of days). The strategy (scale up a single machine vs. scale out to multiple machines) depends on your workload size, performance constraints, and cost constraints.
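If the single-machine route is enough, a minimal sketch (process_item and the task range are placeholders for the real work) is to feed the items to a pool in chunks so that queue overhead stays negligible next to the ~10 s tasks:

from multiprocessing import Pool, cpu_count

def process_item(item):
    # placeholder for the ~10 s CPU-intensive task
    return item * item

if __name__ == "__main__":
    tasks = range(1_000_000)  # stand-in for the real task list
    with Pool(processes=cpu_count()) as pool:
        # chunksize batches many tasks per worker round-trip, cutting IPC overhead
        for result in pool.imap_unordered(process_item, tasks, chunksize=100):
            pass  # aggregate or store results here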

Python multithreaded execution - not scalable

I need to run a very CPU- and memory-intensive Python calculation (Monte Carlo-like). I benchmarked execution on my development machine, where I can only run one core due to memory (up to 9 GB per thread).
I attempted to run the same thing on a server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing the number of threads increases the runtime per core quite dramatically: 8 threads instead of 4 run 3 times longer on each core. Performance Monitor shows at most 8 × 9 GB in use, far below the available maximum.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that
1. Time is spent in the function itself, in three CPU-expensive steps.
2. Of these, the first (random drawings and conversion to events) and the last (a C++ module for aggregation) are much less sensitive to the problem (runtime increases by up to a factor of 2). The second step, containing Python matrix algebra including the scipy.linalg.blas.dgemm function, can be 6 times more expensive when I run more cores. It is not the step that consumes most of the memory (step 1 is; after step 1 no more than 5 GB is in use).
3. If I manually run the same pieces from separate DOS boxes, I see identical behaviour.
I need the calculation time to scale in order to improve performance, but it doesn't. Am I missing something? Python memory limitations? Something specific to Windows Server 2008? A BLAS overload problem?
You are missing information about the GIL. In CPython, threading does not give you additional performance for CPU-bound code; it only lets the calculation proceed while time-consuming IO operations are waiting in another thread.
To get a performance speedup, your function needs to release the GIL. That means it cannot be pure Python; it has to be Cython/C/C++ with the proper configuration.
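A quick way to see the effect for yourself (a toy sketch only; the pure-Python busy loop stands in for an actual calculation) is to time the same CPU-bound function under a thread pool and a process pool:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # pure-Python loop: holds the GIL for its entire runtime
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers=4, n=5_000_000):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(busy, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial, GIL-bound
    print("processes:", timed(ProcessPoolExecutor))  # scales with physical cores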

Best allocation of resources for multiprocessed command running within python

I've developed a tool that requires the user to provide the number of CPUs available to run it.
As part of the program, the tool calls HMMER (hmmer - http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) which itself is quite slow and needs multiple CPUs to run.
I'm confused about the most efficient way to spread the CPUs considering how many CPUs the user has specified.
For instance, assuming the user gave N cpus, I could run
N HMMER jobs with 1 CPU each
N/2 jobs with 2 CPUs each
etc..
My current solution is to arbitrarily open a pool of size N/5 and call HMMER with 5 CPUs in each process in the pool:
import multiprocessing

# N is the user-supplied CPU count; run_scan and jobs are described below
pool = multiprocessing.Pool(processes=N // 5)  # integer division (N/5 is a float in Python 3)
pool.map_async(run_scan, tuple(jobs))
pool.close()
pool.join()
where run_scan calls HMMER and jobs holds all the command line arguments for each HMMER job as dictionaries.
The program is very slow and I was wondering if there was a better way to do this.
Thanks
Almost always, parallelization comes at some cost in efficiency, but the cost depends strongly on the specifics of the computation, so I think the only way to answer this question is a series of experiments.
(I'm assuming memory or disk I/O isn't an issue here; don't know much about HMMER, but the user's guide doesn't mention memory at all in the requirements section.)
Run the same job on one core (--cpu 1), then two cores, four, six, ..., and see how long it takes. That will give you an idea of how well the jobs get parallelized. Used CPU time = runtime * number of cores should remain constant.
Once you notice a below-linear correlation between runtime and the number of cores devoted to the job, that's when you start to run multiple jobs in parallel. Say you have 24 cores, and a job takes 240 seconds on a single core, 118 seconds on two cores, 81 seconds on three, 62 seconds on four, but a hardly faster 59 seconds on five (instead of the expected 48 s): then you should run 6 jobs in parallel with 4 cores each.
You might see a sharp decline at about n_cores/2: some computations don't work well with Hyperthreading, and the number of cores is effectively half of what the CPU manufacturer claims.
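A sketch of that experiment follows (the hmmsearch command line and input files are placeholders for whatever HMMER invocation you actually run, and the 0.8 efficiency cut-off is an arbitrary choice you would tune):

import subprocess
import time

N = 24  # total cores made available by the user

def time_job(n_cores):
    # run one representative HMMER job with a given --cpu setting
    start = time.perf_counter()
    subprocess.run(
        ["hmmsearch", "--cpu", str(n_cores), "profile.hmm", "seqs.fasta"],
        check=True, stdout=subprocess.DEVNULL,
    )
    return time.perf_counter() - start

baseline = time_job(1)
cores_per_job = 1
for n in range(2, N + 1):
    efficiency = baseline / (time_job(n) * n)  # 1.0 means perfect linear scaling
    if efficiency < 0.8:                       # scaling has dropped off
        break
    cores_per_job = n

print(f"Run {max(1, N // cores_per_job)} jobs in parallel with {cores_per_job} cores each")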

Python Multiprocessing: #cores versus #CPU's

It seems to me that using the Python multiprocessing Pool.map as described here parallelizes the process to some extent between different cores of one CPU, but I have the feeling that there is no speed-up reflecting additional CPUs on a computer. If that's right, is there a way to effectively use the "number of CPUs times number of cores in each CPU"?
(Admittedly, I may be wrong, because my experiments are based on a virtual Amazon cloud machine with 16 virtual CPUs, and I know it's not a "real computer".)
More exactly, by default the number of processes will be the number of cores presented by the OS. If the computer has more than one CPU, the OS should present the total number of cores to Python. But anyway, you can always force the number of processes to a smaller value if you do not want to use all the resources of the machine (if it is running a background server, for example), or to a higher value if the task is not CPU-bound but IO-bound, for example.
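A minimal sketch of that default and the override (nothing workload-specific here):

import os
from multiprocessing import Pool

if __name__ == "__main__":
    print(os.cpu_count())            # total logical cores across all CPU sockets
    with Pool() as pool:             # default: one worker per logical core
        print(pool.map(abs, range(-4, 4)))
    with Pool(processes=4) as pool:  # or force a smaller (or larger) pool
        print(pool.map(abs, range(-4, 4)))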
