Number of MKL threads not affecting performance of numpy mean

I am trying to optimize the number of MKL threads used when a call is made to numpy.mean() (my numpy is built against the MKL library). The number of threads can be controlled at runtime using mkl.set_num_threads(n) from the mkl-service library. This does set the number of threads correctly, as the CPU usage in htop confirms, but I am bewildered to find that it has no impact on the runtime. Consider this trial code, where tmp is a (12, 384, 320) array:
for j in range(1000):
    out = np.mean(tmp, axis=(0))
With a single thread this takes roughly 21 seconds, and it takes the same amount of time with more threads. CPU consumption goes up with more threads, but performance does not improve. I also checked this by averaging over the last dimension instead, to make the averaging more cache-efficient.
Any ideas on why this might be happening?

MKL Summary Statistics functions run in a single thread for input problems this small. Threading is switched on when the problem size exceeds roughly 10 K elements.
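One way to check whether a size threshold like this shows up on your machine is to time np.mean for several array sizes and thread counts. The following is only a sketch: the helper name time_mean and the second array shape are made up for this example, and the thread count is only applied when the mkl-service module can be imported.

```python
import time
import numpy as np

try:
    import mkl  # mkl-service; usually only present with MKL-based NumPy builds
except ImportError:
    mkl = None


def time_mean(shape, n_threads, repeats=20):
    """Return the seconds spent on `repeats` calls of np.mean over axis 0."""
    if mkl is not None:
        mkl.set_num_threads(n_threads)
    tmp = np.random.rand(*shape)
    start = time.perf_counter()
    for _ in range(repeats):
        np.mean(tmp, axis=0)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The first shape is the one from the question, the second is an
    # arbitrary larger one chosen for comparison.
    for shape in [(12, 384, 320), (12, 768, 640)]:
        for n in (1, 2, 4):
            print(shape, n, round(time_mean(shape, n), 4))
```

If the timings stay flat across thread counts for every size, the reduction is probably not being threaded at all in your build.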

Related

Multiprocessing large convolutions using Scipy no speed up

I am using scipy.signal.correlate to perform large 2D convolutions. I have a large number of arrays that I want to operate on, and so naturally thought that multiprocessing.Pool could help. However, the following simple setup (on a 4-core CPU) provides no benefit.
import multiprocessing as mp
import numpy as np
from scipy import signal
arrays = [np.ones([500, 500])] * 100
kernel = np.ones([30, 30])
def conv(array, kernel):
    return (array, kernel, signal.correlate(array, kernel, mode="valid", method="fft"))
pool = mp.Pool(processes=4)
results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]
Changing the process count to any of {1, 2, 3, 4} gives approximately the same time (2.6 s ± 0.2 s).
What could be going on?

I think the problem is in this line:
results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]
Pool.apply is a blocking operation, running conv on each element in the array and waiting before going to the next element, so even though it looks like you are multiprocessing nothing is actually being distributed. You need to use pool.map instead to get the behavior you are looking for:
from functools import partial
conv_partial = partial( conv, kernel=kernel )
results = pool.map( conv_partial, arrays )

What could be going on?
First I thought that the reason is that the underlying FFT implementation is already parallelized. While the Python interpreter is single threaded, the C code that may be called from Python may be able to fully utilize your CPU.
However, the underlying FFT implementation of scipy.signal.correlate seems to be fftpack in NumPy (translated from Fortran that was written in 1985), which appears to be single-threaded judging from the fftpack page.
Indeed, I ran your script and got considerably higher CPU usage. With four cores on my computer and four processes I got a speedup of ~2 (for 2000x2000 arrays and using the code changes from Raw Dawg).
The overhead from creating the processes and communicating with the processes eats up parts of the benefits of more efficient CPU usage.
You can still try other optimizations: tune the array sizes; compute only a small part of the correlation (if you don't need it all); if the kernel is the same all the time, compute the FFT of the kernel only once and reuse it (this would require implementing part of scipy.signal.correlate yourself); try single precision instead of double; or do the correlation on the graphics card with CUDA.
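The idea of computing the kernel FFT only once can be sketched with numpy.fft. This is a minimal illustration and not what scipy.signal.correlate does internally; the name make_correlator is made up for this example, and it assumes real-valued 2D arrays that all share one shape.

```python
import numpy as np


def make_correlator(kernel, array_shape):
    """Precompute the kernel FFT and return a 'valid' cross-correlation function."""
    k0, k1 = kernel.shape
    a0, a1 = array_shape
    # Shape of the full linear convolution; zero-padding to it avoids wrap-around.
    full_shape = (a0 + k0 - 1, a1 + k1 - 1)
    # Correlation equals convolution with the kernel flipped along both axes.
    kernel_fft = np.fft.rfft2(kernel[::-1, ::-1], s=full_shape)

    def correlate_valid(array):
        array_fft = np.fft.rfft2(array, s=full_shape)
        full = np.fft.irfft2(array_fft * kernel_fft, s=full_shape)
        # Keep only the part where the kernel lies completely inside the array.
        return full[k0 - 1:a0, k1 - 1:a1]

    return correlate_valid


if __name__ == "__main__":
    arrays = [np.ones((500, 500))] * 10
    correlate = make_correlator(np.ones((30, 30)), (500, 500))
    results = [correlate(arr) for arr in arrays]
    print(results[0].shape)
```

Each call then needs one forward and one inverse transform instead of two forward transforms and one inverse.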

How do I know if my Embarrassingly Parallel Task is Suitable for GPU?

Are we saying that a task that requires fairly light computation per row on a huge number of rows is fundamentally unsuited to a GPU?
I have some data processing to do on a table where the rows are independent, so it is embarrassingly parallel. I have a GPU, so... match made in heaven? It is quite similar to this example, which calculates a moving average for each entry per row (rows are independent).
import numpy as np
from numba import guvectorize
@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count
arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))
Like this example, my processing for each row is not heavily mathematical. Rather it is looping across the row and doing some sums, assignments and other bits and pieces with some conditional logic thrown in.
I have tried using guvectorize from the Numba library to run this on an Nvidia GPU. It works fine, but I'm not getting a speedup.
Is this type of task suited to a GPU in principle? That is, if I go deeper into Numba and start tweaking the threads, blocks and memory management, or the algorithm implementation, should I, in theory, get a speedup? Or is this kind of problem fundamentally unsuited to the architecture?
The answers below seem to suggest it is unsuited but I am not quite convinced yet.
numba - guvectorize barely faster than jit
And numba guvectorize target='parallel' slower than target='cpu'

Your task is obviously memory-bound, but that doesn't mean you cannot profit from a GPU. It is, however, probably less straightforward than for a CPU-bound task.
Let's look at common configuration and do some math:
CPU-RAM memory bandwidth of ca. 24GB/s
CPU-GPU transfer bandwidth of ca. 8GB/s
GPU-RAM memory bandwidth of ca. 180GB/s
Let's assume we need to transfer 24 GB of data to complete the task, so we will have the following optimal times (whether and how to achieve these times is another question!):
1. scenario: CPU only, time = 24GB / 24GB/s = 1 second.
2. scenario: data must be transferred from CPU to GPU (24GB / 8GB/s = 3 seconds) and processed there (24GB / 180GB/s = 0.13 seconds), which leads to 3.1 seconds.
3. scenario: data is already on the device, so only 24GB / 180GB/s = 0.13 seconds are needed.
As one can see, there is potential for a speed-up, but only in the 3rd scenario, when your data is already on the GPU device.
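The arithmetic for the three scenarios can be reproduced in a few lines. The bandwidth figures are the rough ones assumed above, not measurements, and the helper name transfer_seconds is made up for this sketch.

```python
def transfer_seconds(gigabytes, bandwidth_gb_per_s):
    """Best-case time to move `gigabytes` at the given bandwidth."""
    return gigabytes / bandwidth_gb_per_s


DATA_GB = 24.0
CPU_RAM_BW = 24.0     # GB/s, CPU <-> RAM (assumed)
CPU_TO_GPU_BW = 8.0   # GB/s, host -> device transfer (assumed)
GPU_RAM_BW = 180.0    # GB/s, GPU <-> device memory (assumed)

cpu_only = transfer_seconds(DATA_GB, CPU_RAM_BW)
copy_then_gpu = (transfer_seconds(DATA_GB, CPU_TO_GPU_BW)
                 + transfer_seconds(DATA_GB, GPU_RAM_BW))
gpu_resident = transfer_seconds(DATA_GB, GPU_RAM_BW)

print("CPU only:            %.2f s" % cpu_only)
print("copy to GPU + run:   %.2f s" % copy_then_gpu)
print("data already on GPU: %.2f s" % gpu_resident)
```

Plugging in the bandwidths of your own hardware shows quickly whether the host-to-device copy alone already costs more than the CPU-only run.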
However, achieving the maximal bandwidth is a quite challenging enterprise.
For example, when processing the matrix row-wise, on CPU, you would like your data to be in the row-major-order (C-order) in order to get the most out of the L1-cache: while reading a double you actually get 8 doubles loaded into the cache and you don't want them to be evicted from the cache, before you could process the remaining 7.
On GPU, on the other hand, you want the memory accesses to be coalesced, e.g. thread 0 should access address 0, thread 1 - address 1 and so on. For this to work, the data must be in column-major-order (Fortran-order).
There is another thing to be considered: the way you test the performance. Your test array is only about 2MB large and thus small enough for the L3 cache. The bandwidth of the L3 cache depends on the number of cores used for the calculation, but will be at least around 100GB/s - not much slower than GPU and probably much faster when parallelized on CPU.
You need a bigger dataset to not get fooled by cache behavior.
A somewhat off-topic remark: your algorithm is not very robust from the numerical point of view.
Suppose the window width is 3, as in your example, but there are about 10**4 elements in a row. Then the value for the last element is the result of about 10**4 additions and subtractions, each of which adds a rounding error, compared to only three additions if done "naively". That is quite a difference.
Of course, it might not be significant (for 10 elements in a row, as in your example), but it might also bite you one day...
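To see the effect, the running-sum update from the question can be compared with a version that re-sums every window. This is a plain NumPy sketch without Numba; the function names are made up here, and it assumes each row has at least window_width elements.

```python
import numpy as np


def move_mean_running(a, window_width):
    """Moving mean with a running sum, as in the question (errors accumulate)."""
    out = np.empty(len(a))
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count
    return out


def move_mean_naive(a, window_width):
    """Moving mean that re-sums each window, so errors do not carry over."""
    out = np.empty(len(a))
    for i in range(len(a)):
        window = a[max(0, i - window_width + 1):i + 1]
        out[i] = window.sum() / len(window)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Values of very different magnitude make rounding errors more visible.
    row = rng.random(10**4) * 10.0 ** rng.integers(-8, 8, size=10**4)
    diff = np.abs(move_mean_running(row, 3) - move_mean_naive(row, 3))
    print("largest difference:", diff.max())
```

The naive version does more arithmetic per element, so this is a trade-off between speed and accuracy, not a free improvement.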

Numexpr detecting number of threads less than number of cores

I am using numexpr for simple array addition on a remote cluster. My computer has 8 cores and the remote cluster has 28 cores. Numexpr documentation says that "During initialization time Numexpr sets this number to the number of detected cores in the system" But the cluster gives this output.
detect_number_of_cores() = 28
detect_number_of_threads()=8
However, when I set the number of threads manually to something else (set_num_threads(20)), the array operation seems to run faster, yet detect_number_of_threads() still gives 8 as output.
Is this a bug?

I'm not sure how numexpr works internally when detect_number_of_threads is called, but maybe it reads the number of threads available to OpenMP rather than the number of threads that was set locally.

Is the multiprocessing module of python the right way to speed up large numeric calculations?

I have a strong background in numeric computation using FORTRAN and parallelization with OpenMP, which I found easy enough to use on many problems. I switched to PYTHON since it is much more fun (at least for me) to develop with, but parallelization for numeric tasks seems much more tedious than with OpenMP. I'm often interested in loading large (tens of GB) data sets into main memory and manipulating them in parallel while keeping only a single copy of the data in main memory (shared data). I started to use the PYTHON module MULTIPROCESSING for this and came up with this generic example:
# test cases
# python parallel_python_example.py 1000 1000
# python parallel_python_example.py 10000 50
import sys
import numpy as np
import time
import multiprocessing
import operator
n_dim = int(sys.argv[1])
n_vec = int(sys.argv[2])
# class which contains large dataset and computationally heavy routine
class compute:
    def __init__(self, n_dim, n_vec):
        self.large_matrix = np.random.rand(n_dim, n_dim)  # define large random matrix
        self.many_vectors = np.random.rand(n_vec, n_dim)  # define many random vectors which are organized in a matrix
    def dot(self, a, b):  # dont use numpy to run on single core only!!
        return sum(p * q for p, q in zip(a, b))
    def __call__(self, ii):  # use __call__ as computation such that it can be handled by multiprocessing (pickle)
        vector = self.dot(self.large_matrix, self.many_vectors[ii, :])  # compute product of one of the vectors and the matrix
        return self.dot(vector, vector)  # return "length" of the result vector
# initialize data
comp = compute(n_dim, n_vec)
# single core
tt = time.time()
result = [comp(ii) for ii in range(n_vec)]
time_single = time.time() - tt
print("Time: %s" % time_single)
# multi core
for prc in [1, 2, 4, 10]:  # the 20 case is there to check that the large_matrix is only once in the main memory
    tt = time.time()
    pool = multiprocessing.Pool(processes=prc)
    result = pool.map(comp, range(n_vec))
    pool.terminate()
    time_multi = time.time() - tt
    print("Time using %2i processes. Time: %10.5f, Speedup:%10.5f" % (prc, time_multi, time_single / time_multi))
I ran two test cases on my machine (64bit Linux using Fedora 18) with the following results:
andre@lot:python>python parallel_python_example.py 10000 50
Time: 10.3667809963
Time using 1 processes. Time: 15.75869, Speedup: 0.65785
Time using 2 processes. Time: 11.62338, Speedup: 0.89189
Time using 4 processes. Time: 15.13109, Speedup: 0.68513
Time using 10 processes. Time: 31.31193, Speedup: 0.33108
andre@lot:python>python parallel_python_example.py 1000 1000
Time: 4.9363951683
Time using 1 processes. Time: 5.14456, Speedup: 0.95954
Time using 2 processes. Time: 2.81755, Speedup: 1.75201
Time using 4 processes. Time: 1.64475, Speedup: 3.00131
Time using 10 processes. Time: 1.60147, Speedup: 3.08242
My question is, am I misusing the MULTIPROCESSING module here? Or is this the way it goes with PYTHON (i.e. don't parallelize within python but rely totally on numpy's optimizations)?

While there is no general answer to your question (in the title), I think it is valid to say that multiprocessing alone is not the key for great number-crunching performance in Python.
In principle, however, Python (+ 3rd party modules) is awesome for number crunching. Find the right tools and you will be amazed. Most of the time, I am pretty sure, you will get better performance while writing (much!) less code than you did before when doing everything manually in Fortran. You just have to use the right tools and approaches. This is a broad topic. A few random things that might interest you:
You can compile numpy and scipy yourself using Intel MKL and OpenMP (or maybe a sys admin in your facility already did so). This way, many linear algebra operations will automatically use multiple threads and get the best out of your machine. This is simply awesome and probably underestimated so far. Get your hands on a properly compiled numpy and scipy!
multiprocessing should be understood as a useful tool for managing multiple more or less independent processes. Communication among these processes has to be explicitly programmed. Communication happens mainly through pipes. Processes talking a lot to each other spend most of their time talking and not number crunching. Hence, multiprocessing is best used in cases when the transmission time for input and output data is small compared to the computing time. There are also tricks, you can for instance make use of Linux' fork() behavior and share large amounts of memory (read-only!) among multiple multiprocessing processes without having to pass this data around through pipes. You might want to have a look at https://stackoverflow.com/a/17786444/145400.
Cython has already been mentioned, you can use it in special situations and replace performance-critical code parts in your Python program with compiled code.
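The fork-based sharing mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it relies on the "fork" start method (available on Linux and other POSIX systems, not on Windows), the data must be created before the pool is started, and the names init_data, squared_length and run_parallel are made up for this example.

```python
import multiprocessing as mp
import numpy as np

# Module-level references: forked workers inherit them from the parent,
# so the arrays are not pickled or sent through pipes.
LARGE_MATRIX = None
MANY_VECTORS = None


def init_data(n_dim, n_vec, seed=0):
    """Create the shared (read-only) data in the parent process."""
    global LARGE_MATRIX, MANY_VECTORS
    rng = np.random.default_rng(seed)
    LARGE_MATRIX = rng.random((n_dim, n_dim))
    MANY_VECTORS = rng.random((n_vec, n_dim))
    return LARGE_MATRIX, MANY_VECTORS


def squared_length(ii):
    """Only the integer index travels to the worker, not the data."""
    vector = MANY_VECTORS[ii] @ LARGE_MATRIX
    return float(vector @ vector)


def run_parallel(n_proc):
    ctx = mp.get_context("fork")  # POSIX only
    with ctx.Pool(processes=n_proc) as pool:
        return pool.map(squared_length, range(len(MANY_VECTORS)))


if __name__ == "__main__":
    init_data(1000, 200)
    print(run_parallel(4)[:3])
```

The workers must treat the arrays as read-only; anything they write stays in their own process and is not visible to the parent.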
I did not comment on the details of your code, because (a) it is not very readable (please get used to PEP8 when writing Python code :-)) and (b) I think especially regarding number crunching it depends on the problem what the right solution is. You have already observed in your benchmark what I have outlined above: in the context of multiprocessing, it is especially important to have an eye on the communication overhead.
Spoken generally, you should always try to find a way from within Python to control compiled code to do the heavy work for you. Numpy and SciPy provide great interfaces for that.
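As an illustration of letting compiled code do the heavy work, the per-vector loop from the question can be replaced by a single matrix product. This is a sketch, not a drop-in replacement for the question's class; the function names are made up here, and the pure-Python version is kept only to check that both compute the same numbers.

```python
import numpy as np


def lengths_pure_python(large_matrix, many_vectors):
    """The question's computation: one slow generator-based dot per vector."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    result = []
    for vec in many_vectors:
        vector = dot(large_matrix, vec)  # sum over rows of row * vec[i]
        result.append(dot(vector, vector))
    return result


def lengths_numpy(large_matrix, many_vectors):
    """Same result with one matrix product and a row-wise sum of squares."""
    projected = many_vectors @ large_matrix
    return np.einsum("ij,ij->i", projected, projected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    matrix = rng.random((1000, 1000))
    vectors = rng.random((1000, 1000))
    print(lengths_numpy(matrix, vectors)[:3])
```

With a NumPy build linked against a threaded BLAS, the matrix product may already use several cores without any multiprocessing code.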

Number crunching with Python... You probably should learn about Cython. It is an intermediate language between Python and C. It is tightly interfaced with numpy and has support for parallelization using OpenMP as backend.

From the test results you supplied, it appears that you ran your tests on a two core machine. I have one of those and ran your test code getting similar results. What these results show is that there is little benefit to running more processes than you have cores for numerical applications that lend themselves to parallel computation.
On my two core machine, approximately 20% of the CPU is absorbed simply in keeping my environment going, so when I see a 1.8 improvement running two processes I am confident that all the available cycles are being used for my work. Basically, for parallel numerical work the more cores the better as this raises the percentage of the computer that is available to do your work.
The other posters are entirely correct in pointing you at Numpy, Scipy, Cython etc. Basically you first need to make your computation use as few cycles as possible and then use multiprocessing in some form to find more cycles to apply to your problem.

Parallel application in python becomes much slower when using mpi rather than multiprocessing module

Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find to be truly surprising is that I acquire a nice speed-up, however depending on a communication tool, after a certain threshold of processes, the performance becomes worse. It can be illustrated by the pictures below.
y axis - processing time
x axis - nr of processes
colours - size of each individual (nr of floats)
1) Using multiprocessing module - Pool.map
2) Using mpi - Scatter/Gather
3) Both pictures on top of each other
At first I was thinking that it's hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it's not. My other guess is that the mpi communication methods are much less effective than the python ones, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's Hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with Hyperthreading and needs to schedule more than 4 processes to run on 4 physical cores.
However, what's interesting is that when I use mpi, htop shows complete utilization of all 8 logical cores, which suggests that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores. It uses one or two fully and the rest only partially; again, I have no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code, it's really straightforward (I'm not giving the entire code not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP I can prepare a short example). The code for MPI is a little bit different, because it can't deal with a population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest of it is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop
# ...
self.pool = Pool(8)
# ...
for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
from itertools import chain
from mpi4py import MPI
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]
def chain_list(lst):
    return list(chain.from_iterable(lst))
def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem
for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier I made some tests on my i5 machine (2/4 cores) and here is the result:
I also found a machine with 2 xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour. When I run my computation in more processes than physical cores it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently because of the lack of resources.

MPI is actually designed for inter-node communication, i.e. for talking to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, when compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed memory usage.
If your OpenMPI is not configured to use shared memory for intra-node communication, each message will be sent through the kernel's TCP stack and back to get delivered to the other process, which again adds some overhead.
If you only intend to do computations within the same machine, there is no need to use mpi here.
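To get a feeling for the per-message copy, you can time a pickle round trip of one package of individuals, since mpi4py's lowercase scatter and gather serialize Python objects with pickle. No MPI is involved in this sketch; it is only a stand-in for the serialization part of a message, and the helper name pickle_round_trip is made up for this example.

```python
import pickle
import time


def pickle_round_trip(obj):
    """Serialize and deserialize obj; return (copy, payload size, seconds)."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    copy = pickle.loads(payload)
    return copy, len(payload), time.perf_counter() - start


if __name__ == "__main__":
    # A package of 1000 "individuals", each a list of 1000 Python floats.
    package = [[float(i + j) for j in range(1000)] for i in range(1000)]
    _, n_bytes, seconds = pickle_round_trip(package)
    print("payload: %d bytes, round trip: %.4f s" % (n_bytes, seconds))
```

Comparing this time with the cost of evaluating one package gives a rough idea of how much of each generation is spent on communication rather than computation.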
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make some sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially how this influences virtualized machines!
I recommend running the ipc-benchmark on the virtualized machine and posting the results.
If they look anything like this benchmark, they can give you a lot of insight into the difference between tcp, sockets and pipes.
