It seems to me that using the Python multiprocessing Pool.map as described here parallelizes the work to some extent across the cores of a single CPU, but I have the feeling that there is no additional speed-up from having more than one CPU in the computer. If that's right, is there a way to effectively use the "number of CPUs times number of cores per CPU"?
(Admittedly, I may be wrong, because my experiments are based on a virtual Amazon cloud machine with 16 virtual CPUs, and I know it's not a "real computer".)
More precisely, by default the number of processes will be the number of cores presented by the OS. If the computer has more than one CPU, the OS should present the total number of cores to Python. In any case, you can always force the number of processes to a smaller value if you do not want to use all the resources of the machine (if it is also running a background server, for example), or to a higher value if the task is not CPU bound but IO bound, for example.
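As a minimal sketch of what this looks like in code (the crunch function below is just a stand-in for a real task):

import os
from multiprocessing import Pool

def crunch(x):
    return x * x  # stand-in for a CPU-bound task

if __name__ == "__main__":
    n = os.cpu_count() or 1
    print("cores presented by the OS:", n)

    # default: one worker per core presented by the OS
    with Pool() as pool:
        print(pool.map(crunch, range(8)))

    # explicit: fewer workers, leaving headroom for other services ...
    with Pool(processes=max(1, n - 2)) as pool:
        pool.map(crunch, range(8))

    # ... or more workers than cores for an IO-bound task
    with Pool(processes=n * 4) as pool:
        pool.map(crunch, range(8))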
Related
I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps of the analysis in parallel in order to save time. However, I've run into the issue that when I test my code I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=<process number>).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and dividing that by some fraction. So, set to 1/2, an 8-core system would use 4 workers, an 80-core system would use 40, and so on.
However, because of the extreme differences in core-to-memory ratio across these systems, this often leads either to lockups, when too many workers are created for the available memory, or to wasted time, when the number of workers isn't maximized.
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?
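One common approach is to cap the worker count by both the core count and the available memory. A sketch only: PER_WORKER_BYTES below is an assumed per-worker peak that you would have to measure for your own analysis steps.

import multiprocessing
import psutil  # third-party: pip install psutil

PER_WORKER_BYTES = 4 * 1024**3  # assumed ~4 GB peak per worker; measure this for your steps

def pick_worker_count(core_fraction=1.0):
    by_cores = max(1, int(multiprocessing.cpu_count() * core_fraction))
    by_memory = max(1, psutil.virtual_memory().available // PER_WORKER_BYTES)
    return int(min(by_cores, by_memory))

if __name__ == "__main__":
    n = pick_worker_count()
    print("using", n, "workers")
    # with multiprocessing.Pool(processes=n) as pool:
    #     pool.map(slow_step, work_items)  # slow_step / work_items are placeholders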
I want to write a Python script, containing many Numba njit-ted functions with parallel=True, that uses all the cores I request on a cluster.
On the cluster, I can only request the number of cores I want to use, via #SBATCH -n no_of_cores_you_want.
At the moment, having something like:
#SBATCH -n 150
NUMBA_NUM_THREADS=100 python main.py
makes main.py report that numba.config.NUMBA_DEFAULT_NUM_THREADS=20 and numba.config.NUMBA_NUM_THREADS=100.
My explanation for this is that, judging by its specs, a node on the cluster is composed of 20 single-threaded cores.
How can I make main.py use all the cores the cluster gives me? I stress that I want main.py to be run only once, not multiple times; the aim is for that single run to make use of all the available cores (located on multiple separate nodes).
(NUMBA_NUM_THREADS is 100 because if I set it to 150, a Slurm error appears. It can probably be higher than 100, but it has to be less than 150.)
A computing cluster is not just a bag of cores; it is far more complex. To begin with, a modern mainstream cluster is basically a set of computing nodes interconnected by a network. Each node contains one or more microprocessors. Each microprocessor has many cores (typically dozens nowadays), and each core can have multiple hardware threads (typically 2). Nodes have their own memory, and a process cannot access the memory of a remote node (unless the hardware supports it or some software abstracts it away); this is called distributed memory. The cores of a node share the same main memory. Note that in practice the access is generally not uniform (NUMA): some cores often have faster access to some parts of the main memory (and if you do not care about that, your application can scale poorly).
This means you need to write a distributed application in order to use all the cores of a cluster. MPI is a good way to write such an application. Numba does not support distributed memory, only shared memory, so you can only use one computing node with it. Note that writing distributed applications is not trivial. Note also that you can mix MPI code with Numba.
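For illustration, here is a minimal sketch of how the two can be mixed, assuming mpi4py is installed and the script is launched with mpirun or srun: each MPI rank handles its own chunk of data (distributed memory), while Numba's parallel=True uses the cores within that rank's node (shared memory).

from mpi4py import MPI
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def local_sum(a):
    s = 0.0
    for i in prange(a.shape[0]):  # Numba threads use the cores of the local node
        s += a[i]
    return s

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

chunk = np.random.rand(1_000_000)  # each rank works on its own piece of the data
partial = local_sum(chunk)

total = comm.reduce(partial, op=MPI.SUM, root=0)  # combine results across nodes
if rank == 0:
    print("total over all ranks:", total)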
By the way, please consider optimizing your application before moving to multiple nodes. Not only is it often simpler, it is also less expensive, uses less energy, and keeps your application easier to maintain (debugging distributed applications is tricky).
Also note that using more threads than the cores available on a node causes over-subscription, which often results in severe performance degradation. If your application is well optimized, hardware threads should not improve performance and can even slow your application down.
I'm running Python scripts that do batch data processing on fairly large AWS instances (48 or 96 vCPUs). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (take a file path, return True/False). I/O and memory don't seem to be limiting.
I've had variable performance where sometimes the best speed comes from pool size = number of vCPU, sometimes number of vCPU/2, and sometimes vCPU*some multiple around 2-4. These are for different kinds of jobs, on different instances, so it would be hard to benchmark all of them.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reason for those numbers:
number of vCPU: It is reasonable, we use all the cores.
number of vCPU/2: It is also reasonable, as we sometimes have twice as many logical cores as physical cores. But logical cores won't actually speed your program up, so we just use vCPU/2.
vCPU*some multiple around 2-4: It is reasonable for some IO-intensive tasks. For these kinds of tasks, the process is not occupying the core all the time, so we can schedule some other tasks during IO operations.
So now let's analyze the situation. I guess you are running on a server, which might be a VPS. In this case, there is no difference between logical and physical cores, because a vCPU is just an abstract computation resource provided by the VPS provider; you cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or let's say it is just a simple controller, then you don't need to allocate a whole core for it, which means you don't need to subtract one.
Based on your situation, I would suggest the number of vCPUs, but you still need to decide based on the actual situation you encounter. The critical rule is:
Maximize resource usage (use as many cores as you can) while minimizing resource competition (too many processes will compete for the same resources and slow the whole program down).
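Expressed as a tiny sketch (the multiplier is the heuristic from the list above, not a hard rule, and should be tuned for your own jobs):

import multiprocessing

def pool_size(io_bound=False, io_multiplier=3):
    n_vcpu = multiprocessing.cpu_count()  # number of vCPUs on the instance
    return n_vcpu * io_multiplier if io_bound else n_vcpu

if __name__ == "__main__":
    print("CPU-bound pool size:", pool_size())
    print("IO-bound pool size:", pool_size(io_bound=True))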
There are many rules of thumb that you may follow, depending on the task, as you already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end: for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which can badly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever doing batch processing, and you should get quasi-linear speedups w.r.t. the number of physical cores.
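A sketch of that setup (the work function and matrix size are only illustrations): pin the BLAS backend to one thread per worker before NumPy is imported, then size the pool by physical cores.

import os
os.environ["MKL_NUM_THREADS"] = "1"       # if your NumPy links against MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # if it links against OpenBLAS

import numpy as np
import psutil
from multiprocessing import Pool

def work(seed):
    # each worker now does single-threaded linear algebra
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.linalg.norm(a @ a.T))

if __name__ == "__main__":
    n_workers = psutil.cpu_count(logical=False) or 1  # physical cores only
    with Pool(processes=n_workers) as pool:
        results = pool.map(work, range(32))
    print(len(results), "results")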
I am running a Moving Average and SARIMA model for time series forecasting on my machine which has 12 cores.
The moving average model takes 25 min to run on a single core. By using the multiprocessing module, I was able to bring down the running time to ~4 min (by using 8 out of 12 cores). On checking the results of the "top" command, one can easily see that multiprocessing is actually using the 8 cores and the behaviour is as expected.
Moving Average (1 core) -> [screenshot: CPU usage for Moving Average on 1 core]
Moving Average (8 cores) -> [screenshot: CPU usage for Moving Average on 8 cores]
I then ran the same routines using the SARIMA model, first without multiprocessing. To my surprise, it automatically used all the cores/distributed the work across all cores. Unlike the Moving Average model (Image 1), where the CPU usage was 100% for the single process and ~800% cumulatively when using 8 cores, here the CPU usage of the single process fluctuated between 1000% and 1200% (i.e. all 12 cores). As expected, the multiprocessing module didn't help me much in this case and the results were far worse.
SARIMA (1 core) -> [screenshot: CPU usage for SARIMA on 1 core]
SARIMA (8 cores) -> [screenshot: CPU usage for SARIMA on 8 cores] (instead of one process using 1200% as in the previous case, some processes go over 100%)
My question is: why is the OS automatically distributing work across different cores in the case of the SARIMA model, while I have to do it explicitly (using multiprocessing) for the Moving Average model? Can it be due to the style in which the Python program is written?
Some other info:
I am using http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html for SARIMA Tuning.
I am using a process/queue technique to parallelise the code.
SARIMA is taking 9 hrs on 1 core (maxing out at 1200% as shown in the images above) and more than 24 hrs if I use multiprocessing.
I am new to stackoverflow and will be happy to supplement any other information required. Please let me know if anything is not clear.
Updated:
I raised an issue on the official repo of the pyramid package and the author replied. It can be accessed here: https://github.com/alkaline-ml/pmdarima/issues/301
The obvious reason is that SARIMA was developed to work across multiple CPU cores, whereas Moving Average does not have that functionality implemented. It has nothing to do with your style of writing the code; it is just that the package authors developed the package code in two different ways, i.e.
No native multiprocessing support for Moving Average and
Native multiprocessing support for SARIMA
One more point to correct in your understanding: the OS is not automatically distributing work across different cores in the case of SARIMA. The SARIMA package code itself distributes the work across the CPU cores, since its authors developed it to support and work with multiple cores.
Update:
Your hunch is that client-level multiprocessing combined with the native multiprocessing should perform better, but in practice it does not. This is because:
Since SARIMA's native multiprocessing already takes up all the cores of your CPU, there is no computation power left over for your client-level multiprocessing to speed anything up.
This is how Linux (or any OS) works: when there is no CPU time left for a process (in your case, the client-level multiprocessing processes), the OS places that process in a queue until a CPU becomes available to it. Since your client-level multiprocessing processes are queued and not actively running, because there is no CPU left at all, the whole job stalls. You can check the Linux documentation to verify this.
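If you do still want to combine your own worker pool with a model whose internals are natively multi-threaded, one common workaround (a sketch, not something proposed in the answer above; the function and data are placeholders) is to cap the number of threads each worker's numerical libraries may use, for example with threadpoolctl:

from multiprocessing import Pool
from threadpoolctl import threadpool_limits  # third-party: pip install threadpoolctl

def fit_one_model(series):
    with threadpool_limits(limits=1):  # one BLAS/OpenMP thread per worker
        return sum(series)             # stand-in for fitting one SARIMA model

if __name__ == "__main__":
    all_series = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-in data
    with Pool(processes=3) as pool:
        print(pool.map(fit_one_model, all_series))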
I am facing a problem utilizing the processor cores to the maximum with multiprocessing on 64-bit Python.
There are 2 places in the code where I use multiprocessing. The code is actually quite resource-intensive. I've tried using the "map" function on a process pool as per the documentation with simple code samples, and it kind of worked: I saw processor utilization spikes going up to 99%. This one, though, stays at around 20%; it uses all the cores, but somehow it is not "pushing" them to the max. It has now been running for around 24 hours with this low utilization. Obviously there is a lot of computational work to do, but something is keeping it at a low pace. It seems like Windows may be limiting it (I have no other software running on the server, though). What can it be?
with Pool() as p: # should calculate the number of processors automatically
p.map(fdas_and_digitize, Reports)
and
with Pool() as p: # should calculate the number of processes automatically
# put permutations of 2 reports there
p.map(PnL, files_periods) # this executes in parallel now
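For reference, the same pattern written with an explicit worker count and a chunksize looks like the sketch below (PnL here is a stand-in for the real function in the pastebin); batching many small tasks per worker via chunksize is the usual first knob when a pool looks busy but the cores are not.

import os
from multiprocessing import Pool

def PnL(pair):
    # stand-in for the real PnL() from the pastebin
    return sum(pair)

if __name__ == "__main__":
    files_periods = [(i, i + 1) for i in range(10_000)]  # stand-in data
    workers = os.cpu_count() or 1
    chunk = max(1, len(files_periods) // (workers * 4))
    with Pool(processes=workers) as p:
        results = p.map(PnL, files_periods, chunksize=chunk)
    print(len(results), "results")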
There are on average 260 Python processes in memory at once (they finish and new ones are created).
Here's the full source code
https://pastebin.com/PnPFUxeL
Windows Server 2016, 64 bit, 24 cores
C:\Users\Administrator>python --version
Python 3.7.1