I have a requirement where I have to process a huge number (millions) of small CPU-intensive tasks in parallel, where each task takes around 10 s. If I go with multiprocessing or multithreading I would need a huge number of thread/process instances. How do I go about solving this so that it takes a minimal amount of time?
The most common pattern for this is to scale out horizontally. If you have 1,000,000 tasks at 10 sec/task, that is 10,000,000 seconds of CPU time for a single core, or 166,667 minutes (2,778 hours, or roughly 116 days). On a 4-core machine that drops to about 29 days (a rough estimate; you may need one core just to handle queues). With 64 cores it is about 116 days / 64 ≈ 1.8 days.
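For the sake of checking the arithmetic, a few lines of Python reproduce these estimates:

tasks = 1_000_000
seconds_per_task = 10
total_cpu_seconds = tasks * seconds_per_task       # 10,000,000 s of CPU time
for cores in (1, 4, 64):
    days = total_cpu_seconds / cores / 86_400      # 86,400 seconds per day
    print(f"{cores} cores: ~{days:.1f} days")      # ~115.7, ~28.9, ~1.8 days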
If single-machine performance doesn't meet your criteria, you can scale out to multiple machines. There are many queueing systems (most of them available as managed services on the major clouds) that make this easy:
Amazon SQS
RabbitMQ
Kafka
NSQ
Etc
(Diagram: producers publishing to a shared message queue that is pulled from by multiple consumers. Image from https://anwaarlabs.wordpress.com/2014/04/28/message-queue-part-3-jms-domains/)
Instead of being limited to a single machine, each machine shares a connection to a global queue from which they (the consumers in the image) pull tasks, allowing you to scale out to as many CPU cores as you need.
For CPU-bound tasks multithreading is a poor choice because of the GIL. If you only have a couple of million items, it may reduce complexity to use Python multiprocessing with a multiprocessing queue and scale up on a single machine (i.e., rent a 64-core machine from a cloud provider and process everything in a couple of days). The strategy (scale up a single machine vs. scale out to multiple machines) depends on your workload size, performance constraints, and cost constraints.
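For illustration, a minimal single-machine sketch of this pattern (the busy-loop worker is a toy stand-in for your real 10-second task):

import multiprocessing

def process_task(n):
    # toy stand-in for one CPU-bound task
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    tasks = range(10_000, 10_100)  # stand-in for the real task inputs
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # chunksize cuts inter-process communication overhead for many small tasks
        results = pool.map(process_task, tasks, chunksize=16)
    print(len(results), "tasks done")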
Related
I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps of the analysis in parallel in order to save time. However, I've run into the issue that, when I test my code, I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=<process number>).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and dividing that by some fraction. So, set to 1/2, an 8-core system would use 4, an 80-core system would use 40, etc.
However, because of the extreme difference in core-to-memory ratio across systems, this often leads to lockups, when too many workers are created for the amount of memory, or wasted time, when the number of workers isn't maximized.
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?
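One possible approach (a sketch, not a definitive answer): cap the worker count by both the core count and the available memory, where memory_per_worker_gb is a figure you would measure for your own workload:

import multiprocessing
import psutil  # third-party: pip install psutil

def choose_worker_count(memory_per_worker_gb, reserve_gb=2.0):
    # Limit workers by cores and by how many worker footprints fit in free RAM.
    cores = multiprocessing.cpu_count()
    available_gb = psutil.virtual_memory().available / 1024**3
    by_memory = int((available_gb - reserve_gb) // memory_per_worker_gb)
    return max(1, min(cores, by_memory))

# e.g. Pool(processes=choose_worker_count(memory_per_worker_gb=1.5))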
Background: I have a huge DataFrame with 40 million rows. I have to run some functions on some columns. The loops were taking too long, so I decided to go with Multiprocessing.
CPU: 8 cores 16 threads
RAM: 128 GB
Question: How many chunks should I break the data into? And how many workers are appropriate for this dataset?
p.s. I found out that when I set max_workers = 15, all threads run at 100%, but if I change max_workers to 40, they drop to 40%.
Thank you!
There are three types of parallel workloads: IO-intensive, CPU-intensive, and mixed IO/CPU-intensive.
If your workers are running a CPU-intensive task, you can increase the number of workers (up to the number of cores) to get better performance.
But if the task is IO-intensive, increasing the worker count will have little effect.
You seem to be working on a mixed IO/CPU-intensive task.
So if you increase the number of workers, you will get good results until the workers start competing for the IO resource (the hard disk).
Beyond that point, on a local machine, it is not a good choice to increase the worker count further.
You could use Hadoop on GCP or AWS for this kind of work.
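To make the chunk-and-pool pattern from the DataFrame question concrete, here is a minimal sketch (the column name and the transform are invented for illustration); a common rule of thumb is a small multiple of one chunk per worker:

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # stand-in for the real per-column functions
    return chunk["value"] * 2

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.arange(1_000_000)})  # toy stand-in for the 40M-row frame
    n_workers = 8                                # e.g. the number of physical cores
    chunks = np.array_split(df, n_workers * 4)   # a few chunks per worker evens out the load
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        results = pd.concat(ex.map(process_chunk, chunks))
    print(len(results))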
I'm running python scripts that do batch data processing on fairly large aws instances (48 or 96 vCPU). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (take a file path, return True/False). I/O and memory don't seem to be limiting.
I've had variable performance where sometimes the best speed comes from pool size = number of vCPU, sometimes number of vCPU/2, and sometimes vCPU*some multiple around 2-4. These are for different kinds of jobs, on different instances, so it would be hard to benchmark all of them.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reason for those numbers:
number of vCPU: It is reasonable, we use all the cores.
number of vCPU/2: It is also reasonable, as we sometimes have twice as many logical cores as physical cores. Those extra hyperthreaded logical cores add little for CPU-bound work, so we just use vCPU/2.
vCPU*some multiple around 2-4: It is reasonable for some IO-intensive tasks. For these kinds of tasks, the process is not occupying the core all the time, so we can schedule some other tasks during IO operations.
Now let's analyze your situation. I guess you are running on a server, which might be a VPS. In that case there is no difference between logical and physical cores, because a vCPU is just an abstract computation resource provided by the VPS provider; you cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or let's say it is just a simple controller, then you don't need to reserve a whole core for it, which means you don't need to subtract one.
Based on your situation, I would suggest using the number of vCPUs. But you still need to decide based on the real situation you encounter. The critical rule is:
Maximize resource usage (use as many cores as you can) and minimize resource competition (too many processes will compete for resources, which slows the whole program down).
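As a small sketch of that rule (the is_io_bound flag and the oversubscription factor are assumptions you would tune for your own workload):

import multiprocessing

def pick_pool_size(is_io_bound=False, io_factor=3):
    n = multiprocessing.cpu_count()  # on a VPS this is the vCPU count
    # CPU-bound: one worker per vCPU; IO-bound: oversubscribe by a small factor
    return n * io_factor if is_io_bound else n

# e.g. multiprocessing.Pool(processes=pick_pool_size())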
There are many rules of thumb that you may follow, depending on the task, as you already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for the logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end: for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which can badly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever you do batch processing, and you should get quasi-linear speedups with respect to the number of physical cores.
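A sketch of that setup, assuming a NumPy/BLAS-backed workload (the environment variables must be set before the numeric libraries are imported; process_item is a placeholder):

import os
# Keep BLAS single-threaded so the process pool scales with the physical cores.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np   # imported after setting the env vars
import psutil
from multiprocessing import Pool

def process_item(seed):
    # placeholder numeric workload
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((200, 200))
    return float(np.linalg.norm(a @ a))

if __name__ == "__main__":
    n_physical = psutil.cpu_count(logical=False)
    with Pool(processes=n_physical) as pool:
        results = pool.map(process_item, range(32))
    print(len(results), "items processed on", n_physical, "physical cores")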
I am running a Moving Average and SARIMA model for time series forecasting on my machine which has 12 cores.
The moving average model takes 25 min to run on a single core. By using the multiprocessing module, I was able to bring down the running time to ~4 min (by using 8 out of 12 cores). On checking the results of the "top" command, one can easily see that multiprocessing is actually using the 8 cores and the behaviour is as expected.
(Screenshots: CPU usage for the Moving Average model on 1 core and on 8 cores.)
I ran the same routines using the SARIMA model, first without multiprocessing. To my surprise, it automatically used all the cores/distributed the work across all cores. Unlike the Moving Average model (Image 1), where the CPU usage was 100% for the single process and ~800% cumulatively when using 8 cores, here the CPU usage of the single process fluctuated between 1000% and 1200% (i.e., all 12 cores). As expected, the multiprocessing module didn't help me much in this case and the results were far worse.
(Screenshots: CPU usage for SARIMA on 1 core and on 8 cores; instead of one process using 1200%, several processes each go over 100%.)
My question is: why is the OS automatically distributing work across different cores in the case of the SARIMA model, while I have to do it explicitly (using multiprocessing) for the Moving Average model? Can it be due to the style of writing the Python program?
Some other info:
I am using http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html for SARIMA Tuning.
I am using a process-queue technique to parallelise the code.
SARIMA takes 9 hours with a single process (maxing out at 1200% as shown in the images above) and more than 24 hours if I use multiprocessing.
I am new to Stack Overflow and will be happy to supplement any other information required. Please let me know if anything is not clear.
Updated:
I had raised an issue on the official repo of pyramid package and the author had replied. The same can be accessed here: https://github.com/alkaline-ml/pmdarima/issues/301
The obvious reason is that SARIMA is developed to work on multiple CPU cores, whereas the Moving Average model does not have that functionality implemented. It has nothing to do with your style of writing the code. The package authors simply developed the code in two different ways, i.e.:
No native multiprocessing support for Moving Average and
Native multiprocessing support for SARIMA
One more point to correct in your understanding: the OS is not automatically distributing work across cores in the case of SARIMA. The SARIMA package code is what distributes the work across the CPU cores, since its authors developed it to support and work with multiple cores.
Update:
Your hunch is that combining client-level multiprocessing with the native multiprocessing should perform better, but in practice it does not. This is because:
Since SARIMA's native multiprocessing is itself taking up all the cores of your CPU, there is no computing power left for your client-level multiprocessing to use to speed things up.
This is how Linux (or any OS) works: when there is no CPU time left for a process (in your case, the client-level multiprocessing processes), the OS places that process in a queue until a CPU becomes available to it. Because your client-level processes sit in the queue instead of actively running, the whole job stalls. You can refer to the Linux scheduler documentation to verify this.
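If you still want to keep the client-level pool, a common workaround (a sketch, not something from the pmdarima documentation) is to limit each worker's native thread pools so the two levels don't fight over the same cores, for example via environment variables set before the numeric libraries load:

import os
# Limit native (BLAS/OpenMP) threading inside each worker so that the
# client-level processes don't oversubscribe the CPU.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

from multiprocessing import Pool

def fit_one_series(series_id):
    # placeholder: fit one SARIMA model for this series here
    return series_id

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        results = pool.map(fit_one_series, range(100))
    print(len(results), "series fitted")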
I've developed a tool that requires the user to provide the number of CPUs available to run it.
As part of the program, the tool calls HMMER (hmmer - http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) which itself is quite slow and needs multiple CPUs to run.
I'm confused about the most efficient way to spread the CPUs considering how many CPUs the user has specified.
For instance, assuming the user gave N cpus, I could run
N HMMER jobs with 1 CPU each
N/2 jobs with 2 CPUs each
etc..
My current solution is to arbitrarily open a pool of size N/5, then call HMMER with 5 CPUs in each process in the pool:
import multiprocessing

# run_scan and jobs are defined elsewhere in the program
pool = multiprocessing.Pool(processes=N // 5)  # integer division; Pool needs an int
pool.map_async(run_scan, tuple(jobs))
pool.close()
pool.join()
where run_scan calls HMMER and jobs holds all the command line arguments for each HMMER job as dictionaries.
The program is very slow and I was wondering if there was a better way to do this.
Thanks
Almost always, parallelization comes at some cost in efficiency, but the cost depends strongly on the specifics of the computation, so I think the only way to answer this question is a series of experiments.
(I'm assuming memory or disk I/O isn't an issue here; don't know much about HMMER, but the user's guide doesn't mention memory at all in the requirements section.)
Run the same job on one core (--cpu 1), then on two cores, four, six, ..., and see how long it takes. That will give you an idea of how well the jobs parallelize. Ideally, used CPU time (= runtime × number of cores) should remain constant.
Once you notice a below-linear correlation between runtime and number of cores devoted to the job, that's when you start to run multiple jobs in parallel. Say you have 24 cores, and a job takes 240 seconds on a single core, 118 seconds on two cores, 81 seconds on three cores, 62 seconds on four, but a hardly faster 59 seconds for five cores (instead of the expected 48 secs), you should run 6 jobs in parallel with 4 cores each.
You might see a sharp decline at about n_cores/2: some computations don't work well with Hyperthreading, and the number of cores is effectively half of what the CPU manufacturer claims.
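A rough sketch of that experiment, assuming the jobs are hmmscan invocations with the --cpu flag (db.hmm and query.fasta are hypothetical arguments; adapt the command to whatever run_scan actually calls):

import subprocess
import time

JOB_ARGS = ["db.hmm", "query.fasta"]   # hypothetical HMMER database and query

for k in (1, 2, 4, 6, 8):
    start = time.time()
    subprocess.run(["hmmscan", "--cpu", str(k)] + JOB_ARGS, check=True)
    elapsed = time.time() - start
    # While scaling is good, CPU time (elapsed * k) stays roughly constant.
    print(f"--cpu {k}: {elapsed:.1f} s wall, ~{elapsed * k:.1f} s CPU")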