I've developed a tool that requires the user to provide the number of CPUs available to run it.
As part of the program, the tool calls HMMER (hmmer - http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) which itself is quite slow and needs multiple CPUs to run.
I'm confused about the most efficient way to spread HMMER jobs over the CPUs, given how many the user has specified.
For instance, assuming the user gave N CPUs, I could run
N HMMER jobs with 1 CPU each,
N/2 jobs with 2 CPUs each,
etc.
My current solution arbitrarily picks a pool size of N/5, opens a pool, and then calls HMMER with 5 CPUs in each process in the pool:
import multiprocessing

pool = multiprocessing.Pool(processes=N // 5)  # integer division so Pool gets an int
pool.map_async(run_scan, tuple(jobs))
pool.close()
pool.join()
where run_scan calls HMMER and jobs holds all the command line arguments for each HMMER job as dictionaries.
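For illustration, a minimal sketch of what such a run_scan might look like (the hmmscan invocation and the job-dict keys here are assumptions, not the actual code):

import subprocess

def run_scan(job):
    # 'job' is assumed to be a dict like {"hmm_db": ..., "seq_file": ..., "out": ...};
    # the real dictionaries in 'jobs' may use different keys.
    cmd = [
        "hmmscan",
        "--cpu", "5",              # matches the 5 CPUs given to each pool worker
        "--tblout", job["out"],    # machine-readable hit table
        job["hmm_db"],
        job["seq_file"],
    ]
    return subprocess.run(cmd, check=True).returncode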
The program is very slow and I was wondering if there was a better way to do this.
Thanks
Almost always, parallelization comes at some cost in efficiency, but the cost depends strongly on the specifics of the computation, so I think the only way to answer this question is with a series of experiments.
(I'm assuming memory or disk I/O isn't an issue here; don't know much about HMMER, but the user's guide doesn't mention memory at all in the requirements section.)
Run the same job on one core (--cpu 1), then two cores, four, six, ..., and see how long it takes. That will give you an idea of how well the job parallelizes: total CPU time used (runtime * number of cores) should remain roughly constant.
Once the speedup becomes clearly sub-linear in the number of cores devoted to the job, that's when you start running multiple jobs in parallel. Say you have 24 cores, and a job takes 240 seconds on a single core, 118 seconds on two cores, 81 seconds on three, 62 seconds on four, but a hardly faster 59 seconds on five cores (instead of the expected 48 seconds): then you should run 6 jobs in parallel with 4 cores each.
You might see a sharp decline at about n_cores/2: some computations don't work well with Hyperthreading, and the number of cores is effectively half of what the CPU manufacturer claims.
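If it helps, a rough sketch of that scaling experiment (I'm assuming hmmscan here; the database and sequence file names are placeholders):

import subprocess
import time

HMM_DB = "Pfam-A.hmm"        # placeholder profile database
SEQ_FILE = "proteins.fasta"  # placeholder query sequences

for n in (1, 2, 4, 6, 8):
    start = time.perf_counter()
    subprocess.run(
        ["hmmscan", "--cpu", str(n), "-o", "/dev/null", HMM_DB, SEQ_FILE],
        check=True,
    )
    wall = time.perf_counter() - start
    # If scaling were perfect, wall * n (core-seconds) would stay constant.
    print(f"--cpu {n}: {wall:.0f}s wall, ~{wall * n:.0f} core-seconds")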
Related
I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps in the analysis in parallel in order to save time. However, I've run into the issue that when I test my code, I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=<process number>).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and multiplying it by some fraction. So, set to 1/2, an 8-core system would use 4 workers, an 80-core system would use 40, etc.
However, because of the extreme difference in core-to-memory ratio across systems, this often leads to lockups (when too many workers are created for the available memory) or wasted time (when the number of workers isn't maximized).
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?
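One way to express that idea (a sketch only) is to cap the worker count by both cores and available memory; the per-worker memory figure below is an assumption you would have to measure for your own workload:

import os
import psutil  # third-party; pip install psutil

MEM_PER_WORKER_GB = 4  # assumed per-worker peak; measure this for your own analysis

def pick_worker_count(mem_per_worker_gb=MEM_PER_WORKER_GB):
    by_cpu = os.cpu_count() or 1
    by_mem = int(psutil.virtual_memory().available // (mem_per_worker_gb * 1024**3))
    return max(1, min(by_cpu, by_mem))

# e.g. multiprocessing.Pool(processes=pick_worker_count())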
Background: I have a huge DataFrame with 40 million rows. I have to run some functions on some columns. The loops were taking too long, so I decided to go with Multiprocessing.
CPU: 8 cores 16 threads
RAM: 128 GB
Question: How many chunks should I break the data into, and how many workers are appropriate for this dataset?
p.s. I found out that when I set max_workers = 15, all threads run at 100%. But if I change max_workers to 40, they drop to 40%.
Thank you!
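For context, a minimal sketch of the kind of setup being described (the column name, the per-chunk function, and the chunking factor are made up):

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk: pd.DataFrame) -> pd.Series:
    # stand-in for the real per-column function
    return chunk["some_column"].astype(str).str.len()

def run(df: pd.DataFrame, n_workers: int = 8) -> pd.Series:
    step = max(1, len(df) // (n_workers * 4))  # a few chunks per worker
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        return pd.concat(executor.map(process_chunk, chunks))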
There are three types of parallel workloads: I/O-intensive, CPU-intensive, and mixed I/O- and CPU-intensive.
If your workers are running a CPU-intensive task, then you can increase the number of workers to get better performance.
But if the task is I/O-intensive, increasing them will have no effect.
You seem to be working on a mixed I/O- and CPU-intensive task, so increasing the number of workers will give good results only until the workers start competing for the I/O resource (the hard disk).
So on a local machine it is not a good choice to keep increasing the worker count; you could use Hadoop on GCP or AWS for this kind of work.
I'm running python scripts that do batch data processing on fairly large aws instances (48 or 96 vCPU). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (take a file path, return True/False). I/O and memory don't seem to be limiting.
I've had variable performance where sometimes the best speed comes from pool size = number of vCPU, sometimes number of vCPU/2, and sometimes vCPU*some multiple around 2-4. These are for different kinds of jobs, on different instances, so it would be hard to benchmark all of them.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reason for those numbers:
number of vCPU: It is reasonable, we use all the cores.
number of vCPU/2: It is also reasonable, as there are sometimes twice as many logical cores as physical cores. But the extra logical cores usually won't speed a CPU-bound program up much, so we just use vCPU/2.
vCPU*some multiple around 2-4: It is reasonable for some IO-intensive tasks. For these kinds of tasks, the process is not occupying the core all the time, so we can schedule some other tasks during IO operations.
So now let's analyze the situation, I guess you are running on a server which might be a VPS. In this case, there is no difference between logical cores and physical cores, because vCPU is just an abstract computation resource provided by the VPS provider. You cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or let's say it is just a simple controller, then you don't need to allocate a whole core for it, which means you don't need to subtract one.
Based on your situation, I would suggest using the number of vCPUs. But you still need to decide based on what you actually observe. The critical rule is:
Maximize resource usage (use as many cores as you can) and minimize resource competition (too many processes will compete for resources, which slows the whole program down).
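In code, that suggestion is roughly the following (the worker function is a placeholder):

import multiprocessing

def worker(item):
    return item * item  # placeholder for the real CPU-bound work

if __name__ == "__main__":
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # chunksize batches tasks per round trip, reducing IPC overhead
        results = pool.map(worker, range(1000), chunksize=50)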
There are many rules of thumb that you may follow, depending on the task, as you already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for the logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end: for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which may significantly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever doing batch processing, and you should get quasi-linear speedups w.r.t. the number of physical cores.
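For example, a sketch of that setup (the worker function is a placeholder; the environment variables are set before NumPy/BLAS is imported so each process stays single-threaded):

import os
# Pin BLAS to one thread per process so the pool's workers don't oversubscribe cores.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np
import psutil
from multiprocessing import Pool

def work(seed):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((500, 500))
    return float(np.linalg.slogdet(a @ a.T)[1])  # placeholder BLAS-heavy work

if __name__ == "__main__":
    with Pool(processes=psutil.cpu_count(logical=False)) as pool:
        results = pool.map(work, range(32))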
I have to run a prediction using prophet on a certain number of series (several thousands).
prophet works fine but does not make use of multiple CPUs. Each prediction takes around 36 seconds (actually the whole function, which also does some data preprocessing and post-processing). If I run sequentially (for just 15 series as a test) it takes 540 seconds to complete. Here is the code:
for i in G:
    predictions = list(make_prediction(i, c, p))
where G is an iterator that returns one series at a time (capped at 15 series for the test), and c and p are two dataframes (the function only reads from them).
I then tried with joblib:
predictions = Parallel(n_jobs=5)( delayed(make_prediction)(i, c, p) for i in G)
Time taken 420 seconds.
Then I tried Pool:
p = Pool(5)
predictions = list(p.imap(make_prediction2, G))
p.close()
p.join()
Since with imap I can pass just one argument, I call a wrapper function (make_prediction2) that calls make_prediction(i, c, p). Time taken: 327 seconds.
Eventually I tried ray:
I decorated make_prediction with @ray.remote and called:
predictions = ray.get([make_prediction.remote(i, c, p) for i in G])
Time taken: 340 seconds! I also tried making c and p Ray objects with c_ray = ray.put(c) (and the same for p) and passing c_ray and p_ray to the function, but I could not see any improvement in performance.
I understand the overhead required by forking, and admittedly the time taken by the function is not huge, but I'd expect better performance (a less than 40% gain, at best, from using 5 times as many CPUs does not seem amazing), especially from Ray. Am I missing something or doing something wrong? Any idea how to get better performance?
Let me point out that RAM usage is under control. Each process, in the worst case, uses less than 2 GB of memory, so less than 10 GB in total with 26 GB available.
Python 3.7.7, Ray 0.8.5, joblib 0.14.1 on Mac OS X 10.14.6, i7 with 4 physical cores and 8 threads (cpu_count() = 8; with 5 CPUs working, no throttling reported by pmset -g thermlog).
PS: Increasing the test size to 50 series, the performance improves, especially with Ray, meaning that the overhead of forking is significant in such a small test. I will run a longer test to get a more accurate performance metric, but I guess I won't go far beyond 50%, a value that seems consistent with other posts I have seen, where a 90% reduction required 50 CPUs.
**************************** UPDATE ***********************************
Increasing the number of series to 100, ray hasn't shown good performance (maybe I'm missing something in its implementation). Pool, instead, using an initializer function and imap_unordered, did better. There is a tangible overhead caused by forking and preparing each process environment, but I got really weird results:
a single CPU took 2685 seconds to complete the job (100 series);
using pool as described above with 2 CPUs took 1808 seconds (-33% and it seems fine);
using the same configuration with 4 CPUs took 1582 seconds (-41% compared to a single CPU, but just -12.5% compared to the 2-CPU job).
Doubling the number of CPUs for just a 12% decrease in time? Even worse, using 5 CPUs takes nearly the same time (please note that 100 can be evenly divided by 1, 2, 4 and 5, so the queues are always filled)! No throttling, no swap; the machine has 8 cores and plenty of unused CPU power even while running the test, so no bottlenecks. What's wrong?
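For illustration, a sketch of the Pool-with-initializer-and-imap_unordered setup mentioned above (make_prediction, c, p and G are from the question; the helper names and worker count are mine):

from multiprocessing import Pool

_c = None
_p = None

def _init(c, p):
    # Runs once per worker: keep the shared dataframes as module globals
    # so they are not re-pickled for every task.
    global _c, _p
    _c, _p = c, p

def _worker(i):
    # make_prediction is the function from the question
    return make_prediction(i, _c, _p)

if __name__ == "__main__":
    with Pool(processes=4, initializer=_init, initargs=(c, p)) as pool:
        predictions = list(pool.imap_unordered(_worker, G))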
I have a requirement where I have to process a huge number (in the millions) of small CPU-intensive tasks in parallel, where each task takes around 10 seconds. If I go with multiprocessing or multithreading, I would need a huge number of threads/processes. How do I go about solving this so that it takes a minimal amount of time?
The most common pattern for this is to scale out horizontally. If you have 1,000,000 tasks @ 10 sec/task, that's 10,000,000 seconds for a single CPU to process, or 166,667 minutes (2,778 hours, or 116 days). On a 4-core machine that cuts it down to about 29 days (a rough estimate; you may need one core to handle queues...). 64 cores would be ~116 days / 64 = 1.8 days.
If single machine performance doesn't meet your criteria you can scale out to multiple machines. All major cloud services offer queueing systems to make this easy/possible:
Amazon SQS
RabbitMQ
Kafka
NSQ
Etc
Instead of being limited to a single machine, each machine shares a connection to a global queue from which consumers pull tasks, allowing you to scale out to as many CPU cores as you need. (Diagram: producers and consumers sharing a message queue - https://anwaarlabs.files.wordpress.com/2014/04/messaging-queue.png, from https://anwaarlabs.wordpress.com/2014/04/28/message-queue-part-3-jms-domains/.)
For CPU-bound tasks, multithreading in Python is a poor choice because of the GIL. If you only have a couple of million items, it may reduce complexity to use Python multiprocessing with a multiprocessing queue and scale up on a single machine (i.e., rent a 64-core machine from a cloud provider and process the backlog in a couple of days, per the estimate above). The strategy (scale up a single machine vs. scale out to multiple machines) depends on your workload size, performance constraints, and cost constraints.
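A minimal single-machine sketch of that approach (the task function is a placeholder for the ~10-second CPU-bound work):

from concurrent.futures import ProcessPoolExecutor
import os

def task(n: int) -> int:
    # placeholder for the real CPU-bound task
    total = 0
    for i in range(10_000_000):
        total += (n + i) % 7
    return total

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # chunksize batches many small tasks per IPC round trip,
        # which matters when there are millions of items
        for result in executor.map(task, range(1_000_000), chunksize=256):
            pass  # consume / persist results here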