I'm running Python scripts that do batch data processing on fairly large AWS instances (48 or 96 vCPUs). multiprocessing.Pool() works nicely: the workers have minimal communication with the main process (take a file path, return True/False). I/O and memory don't seem to be limiting.
I've seen variable performance: sometimes the best speed comes from a pool size equal to the number of vCPUs, sometimes vCPUs/2, and sometimes vCPUs times some multiple around 2-4. These are different kinds of jobs on different instances, so it would be hard to benchmark them all.
Is there a rule of thumb for what size pool to use?
P.S. multiprocessing.cpu_count() returns a number that seems to be equal to the number of vCPU. If that is consistent, I'd like to pick some reasonable multiple of cpu_count and just leave it at that.
The reasons for those numbers:
Number of vCPUs: reasonable, since we use all the cores.
Number of vCPUs/2: also reasonable, as sometimes there are twice as many logical cores as physical cores, and logical cores won't actually speed a CPU-bound program up, so we just use vCPUs/2.
vCPUs * some multiple around 2-4: reasonable for I/O-intensive tasks. For these, a process does not occupy its core all the time, so other work can be scheduled while it waits on I/O (see the sketch below).
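As a rough illustration of the I/O-bound case (the worker function and file paths below are hypothetical), oversubscribing the pool looks like this:
import os
from multiprocessing import Pool

def convert(path):
    # Mostly waiting on disk/network, so the core is idle much of the time.
    ...
    return True

if __name__ == "__main__":
    n_workers = (os.cpu_count() or 1) * 3  # oversubscribe for I/O-heavy work
    with Pool(n_workers) as pool:
        results = pool.map(convert, ["file1.dat", "file2.dat"])  # hypothetical paths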
Now let's analyze your situation. I guess you are running on a server, which might be a VPS. In that case there is no difference between logical and physical cores, because a vCPU is just an abstract computation resource provided by the VPS provider; you cannot really touch the underlying physical cores.
If your main process is not computation-intensive, or is just a simple controller, then you don't need to allocate a whole core for it, which means you don't need to subtract one.
Based on your situation, I would suggest the number of vCPUs, but you still need to decide based on the actual workload you run. The critical rule is:
Maximize resource usage (use as many cores as you can) while minimizing resource competition (too many processes will compete for resources and slow the whole program down).
There are many rules of thumb that you may follow, depending on the task, as you already figured out:
Number of physical cores
Number of logical cores
Number of physical or logical cores minus one (supposedly reserving one core for logic and control)
To avoid counting logical cores instead of physical ones, I suggest using the psutil library:
import psutil
psutil.cpu_count(logical=False)
As for what to use in the end, for numerically intensive applications I tend to go with the number of physical cores. Bear in mind that some BLAS implementations use multithreading by default, which can badly hurt the scalability of data-parallel pipelines. Set MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 (depending on your BLAS backend) as environment variables whenever doing batch processing, and you should see quasi-linear speedups with respect to the number of physical cores.
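A minimal sketch of that setup (the process_file function and the file list are hypothetical; the environment variables must be set before NumPy is imported so the BLAS backend picks them up):
import os

# Must be set before numpy (and its BLAS backend) is imported.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

from multiprocessing import Pool
import numpy as np
import psutil

def process_file(path):
    # Hypothetical per-file work: load, crunch with NumPy, return a flag.
    data = np.load(path)
    return bool(data.mean() > 0)

if __name__ == "__main__":
    n_workers = psutil.cpu_count(logical=False) or os.cpu_count()
    with Pool(n_workers) as pool:
        results = pool.map(process_file, ["part0.npy", "part1.npy"])  # hypothetical paths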
Related
I read the quoted passage on Dask's website and wonder what it means. I have extracted the relevant part below for ease of reference:
A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).
While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.
You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.
Does it mean that the same chunk will co-exist on the mother node (or process, or thread?) and the child node? Is it not necessary to have the same chunk twice?
PS: I don't quite understand the difference among node, process and thread so I just put all of them there.
Answering this part:
I don't quite understand the difference among node, process and thread so I just put all of them there.
a node is a computer machine. This can be a physical box somewhere, with a CPU, disks, etc. In the cloud, you likely have a "virtual machine" that runs on physical hardware you don't get to know about, but it still runs a single operating system kernel. Communication between nodes is via the network.
a container (you didn't ask about this) is an isolated runtime on a node which takes up a specified amount of memory and CPU resources from the node (also called the "host") but shares the disk, network and GPU. Communication between containers is via the network, whether they are on the same node or not (it will be faster if they are). Kubernetes and YARN are examples of container frameworks. There may be several containers per node.
a process is a running executable thing. It may be within a container or not. It has its own isolated memory. A node will be running many processes, but a container typically runs one. dask-scheduler, dask-worker and your client session (ipython, jupyter, python...) are examples of processes. Dask processes communicate with other processes on the same machine using networking primitives (still needs serialisation of data), although other possibilities exist.
threads are multiple execution points that might exist within a process. They share memory, so don't need to copy anything between themselves, but not all operations in python can run in parallel on threads, because of the "interpreter lock", which exists to make the single-threaded case safer and faster.
For dask, the number of cores you can use is n_threads * n_processes. If you weight this more towards threads, you are more efficient on memory; if you weight it more towards processes, you get more parallelism for pure-Python (GIL-bound) work. Which is best depends on your workload.
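For example, with dask.distributed you can set that balance explicitly when creating a local cluster (the 4 x 2 split below is just an illustration, not a recommendation):
from dask.distributed import Client, LocalCluster

# 4 worker processes x 2 threads each = 8 cores used in total.
# More threads per worker is lighter on memory; more worker processes
# gives more parallelism for GIL-bound (pure Python) work.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)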
In many cases, a dask graph will involve many more chunks than there are threads. This warning is noting that multiple of these chunks per worker might be in memory at the same time. For example, in the job:
import dask.array

avg = dask.array.random.random(
    size=(1000, 1000, 1000), chunks=(10, 1000, 1000)
).mean().compute()
there are 100 chunks, each of which is ~80MB in size (10 * 1000 * 1000 float64 values), and you should anticipate roughly 80MB * nworkers * 2 to be in memory at once.
I want to make a python script which includes many numba njitted functions with parallel=True to use all the cores I request on a cluster.
On the cluster, I can only request the number of cores I want to use, via #SBATCH -n no_of_cores_you_want.
At the moment, having something like:
#SBATCH -n 150
NUMBA_NUM_THREADS=100 python main.py
makes main.py report that numba.config.NUMBA_DEFAULT_NUM_THREADS=20 and numba.config.NUMBA_NUM_THREADS=100.
My explanation for this is that, looking at its specs, a node on the cluster is composed of 20 single-threaded cores.
How can I make main.py use all the cores the cluster gives me? I stress that I want main.py to run only once, not multiple times; the aim is for that single run to make use of all the available cores (located on multiple separate nodes).
(NUMBA_NUM_THREADS is set to 100 because setting it to 150 produces a Slurm error. It could probably be higher than 100, but it has to be less than 150.)
A computing cluster is not just a bag of cores; it is far more complex. To begin with, a modern mainstream cluster is basically a set of computing nodes interconnected with a network. Each node contains one or more microprocessors. Each microprocessor has many cores (typically dozens nowadays). Each core can have multiple hardware threads (typically 2). Nodes have their own memory, and a process cannot access the memory of a remote node (unless the hardware supports it or some software abstracts it away). This is called distributed memory. Cores of a node share the same main memory. Note that in practice access is generally not uniform (NUMA): some cores often have faster access to some parts of main memory (and if you do not care about that, your application can scale poorly).
This means you need to write a distributed application in order to use all the cores of a cluster. MPI is a good way to write such an application. Numba does not support distributed memory, only shared memory, so you can only use one computing node with it. Note that writing a distributed application is not trivial. Note also that you can mix MPI code with Numba (see the sketch below).
By the way, please consider optimizing your application before using multiple nodes. It is often simpler, less expensive, uses less energy, and makes your application easier to maintain (debugging distributed applications is tricky).
Also note that using more threads than available cores on a node causes over-subscription, which often results in severe performance degradation. If your application is well optimized, hardware threads should not improve performance and can even slow your application down.
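As a rough sketch of the MPI + Numba combination mentioned above (assuming mpi4py and Numba are installed; the kernel and array sizes are made up), each rank runs a parallel Numba kernel on its own slice of the data:
import numpy as np
from mpi4py import MPI
from numba import njit, prange

@njit(parallel=True)
def heavy_kernel(x):
    # Uses the Numba threads available on this node.
    out = np.empty_like(x)
    for i in prange(x.shape[0]):
        out[i] = np.sin(x[i]) ** 2
    return out

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 splits the data; every rank processes its own piece with Numba threads.
chunks = np.array_split(np.random.rand(1_000_000), size) if rank == 0 else None
local = comm.scatter(chunks, root=0)
results = comm.gather(heavy_kernel(local), root=0)
You would launch this with srun or mpirun so that the ranks are spread over the allocated nodes, while Numba's threads parallelize the work within each node.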
I have a 32-core, 64-thread CPU for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive, involving lots of matrix computations based on NumPy. Currently I use the Python default process pool to execute this task, which creates 64 processes. Will that perform better or worse than 32 processes?
I'm not really sure that Python is suited for computationally intensive multi-threading scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for I/O-bound tasks. I'm not sure whether that applies to NumPy, since the heavy part, if I recall correctly, is written in C/C++.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right, and there are some other nuances that you can read about in the link.
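To illustrate the local-mode Spark suggestion above (the worker count and the per-file function are arbitrary placeholders):
from pyspark.sql import SparkSession

# local[16] runs Spark on this machine with 16 worker threads.
spark = SparkSession.builder.master("local[16]").appName("batch").getOrCreate()
sc = spark.sparkContext

def process(path):
    ...  # hypothetical per-file work
    return True

paths = ["part0.npy", "part1.npy"]  # hypothetical inputs
results = sc.parallelize(paths, numSlices=len(paths)).map(process).collect()
spark.stop()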
It's impossible to give you a definitive answer, as it depends on your exact problem and code, but potentially also on your hardware.
Basically, the approach for multiprocessing is to split the work into X parts, distribute them to the processes, let each process work, and then merge the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part roughly the same amount of work (if one part takes 90% of the time and you can't split it further, it's useless to have more than 2 processes, as you will always be waiting for that one).
If you can, and splitting and merging the work/results doesn't take too long (remember that this is additional work, so it takes extra time), then it can be worthwhile to use more processes.
It is also possible that you can speed up your code by using fewer processes if you spend too much time splitting/merging the work/results (sometimes the speed-up obtained by using more processes is negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect the performance of multiprocessing.
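Since the answer is workload-specific, a quick empirical check is usually the simplest way to decide. A minimal sketch (the work function is a placeholder; the chunksize argument batches items per task to reduce the split/merge overhead):
import time
from multiprocessing import Pool

def work(x):
    # Placeholder for the real per-item computation.
    return sum(i * i for i in range(10_000))

if __name__ == "__main__":
    items = list(range(2_000))
    for n in (16, 32, 64):
        start = time.perf_counter()
        with Pool(n) as pool:
            pool.map(work, items, chunksize=len(items) // (n * 4))
        print(n, "processes:", time.perf_counter() - start, "s")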
It seems to me that using Python's multiprocessing Pool.map as described here parallelizes the work to some extent across the cores of one CPU, but I have the feeling that there is no speed-up reflecting additional CPUs in the machine. If that's right, is there a way to effectively use "number of CPUs times number of cores per CPU"?
(Admittedly, I may be wrong, because my experiments are based on a virtual Amazon cloud machine with 16 virtual CPUs, and I know it's not a "real computer".)
More exactly, by default the number of processes will be the number of cores presented by the OS. If the computer has more than one CPU, the OS presents the total number of cores to Python. In any case, you can always force the number of processes to a smaller value if you do not want to use all the machine's resources (if it is also running a background server, for example), or to a higher value if the task is I/O bound rather than CPU bound.
Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find truly surprising is that I do get a nice speed-up; however, depending on the communication tool, after a certain number of processes the performance becomes worse. This is illustrated by the plots below.
[Plots not shown] y axis: processing time; x axis: number of processes; colours: size of each individual (number of floats). Plot 1: multiprocessing Pool.map. Plot 2: MPI Scatter/Gather. Plot 3: both plots overlaid.
At first I thought it was hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it isn't. My other guess is that the MPI communication methods are much less efficient than Python's, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with hyperthreading and needs to schedule more than 4 processes onto the 4 physical cores.
However, what's interesting is that when I use MPI, htop shows complete utilization of all 8 logical cores, which suggests the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores; it uses one or two to the maximum and the rest only partially, and again I have no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code, it's really straightforward (I'm not giving the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed; if someone is really interested in the problem and ready to install DEAP I can prepare a short example). The code for MPI is a little bit different, because it can't deal with a population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code shown below, the rest is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...
for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
from itertools import chain
from mpi4py import MPI

def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier, I ran some tests on my i5 machine (2/4 cores), and here is the result:
I also found a machine with two Xeons (2x 6/12 cores) and repeated the benchmark:
Now I have three examples of the same behaviour: when I run my computation in more processes than there are physical cores, performance starts getting worse. I believe this is because processes on the same physical core can't execute concurrently due to the lack of resources.
MPI is actually designed for inter-node communication, i.e. to talk to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, when compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed memory usage.
If your Open MPI is not configured to use shared memory for intra-node communication, the message will be sent through the kernel's TCP stack and back to be delivered to the other process, which again adds some overhead.
If you only intend to do computations within the same machine, there is no need to use MPI here.
Some of this is discussed in this thread.
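As a side note on the per-message copies mentioned above: mpi4py's lowercase methods (scatter, gather) pickle arbitrary Python objects, while the uppercase, buffer-based variants (Scatter, Gather) operate directly on contiguous buffers such as NumPy arrays and avoid some of that overhead. A rough sketch, assuming NumPy data:
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Lowercase scatter: pickles arbitrary Python objects (extra copies).
pieces = [list(range(5)) for _ in range(size)] if rank == 0 else None
mine = comm.scatter(pieces, root=0)

# Uppercase Scatter: works on contiguous buffers (e.g. NumPy arrays), no pickling.
sendbuf = np.arange(size * 5, dtype="d") if rank == 0 else None
recvbuf = np.empty(5, dtype="d")
comm.Scatter(sendbuf, recvbuf, root=0)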
Update
The ipc-benchmark project tries to make some sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially how this plays out on virtualized machines!
I recommend running the ipc-benchmark on your virtualized machine and posting the results.
If they look anything like this benchmark, it can give you real insight into the difference between TCP, sockets and pipes.