I'm trying to put together a simple example to illustrate the benefits of using numba.prange for myself and some colleagues, but I'm unable to get a decent speedup. I coded up a simple 1D diffusion solver, which essentially loops over a long array, combining elements i+1, i, and i-1, and writing the results into element i of a second array. This should be a pretty perfect use case for a parallel for loop, similar to OpenMP in Fortran or C.
My complete example is included below:
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C
# Run once to just-in-time compile
C = diffusion(1)
# Check timing
%timeit C = diffusion(100)
When running with parallel=False, this takes about 2 seconds, and with parallel=True it takes about 1.5 seconds. I'm running on a MacBook Pro with 4 physical cores, and Activity Monitor reports around 100% CPU usage without parallelisation and around 700% with it.
I would have expected something closer to a factor 4 speedup. Am I doing something wrong?
The poor scalability certainly comes from the saturation of the RAM, which is shared by all cores on a desktop machine. Indeed, your code is memory-bound, and the memory throughput is pretty limited on modern machines compared to the computing power of CPUs (or GPUs). As a result, 1 or 2 cores are generally enough to saturate the RAM on most desktop machines (a few more cores are needed on computing servers).
On a 10-core Intel Xeon processor with 40~43 GiB/s of RAM bandwidth, the code takes 1.32 seconds in parallel and 2.56 seconds sequentially. This means only 2 times faster with 10 cores. That being said, the parallel loop reads the full C array once per time step and also reads+writes the full C_ array once per time step (x86 processors need to read written memory by default due to the write-allocate cache policy). The C[:] = C_ copy does the same thing. This means (2*3)*(8*10e6)*100/1024**3 = 44.7 GiB of RAM are read/written in only 1.32 seconds in parallel, resulting in a 33.9 GiB/s memory throughput, reaching 80% of the RAM bandwidth (very good for this use-case).
To speed up this code, you need to read/write less data from/to RAM and compute data as much as possible in cache. The first thing to do is to use a double-buffering approach with two views so as to avoid a very expensive copy (see the sketch below). Another optimization is to try to do multiple time steps in parallel at the same time. This is theoretically possible using a complex trapezoidal tiling strategy, but it is very tricky to implement in practice, especially in Numba. High-performance stencil libraries should do that for you. Such optimizations should improve not only the sequential execution but also the scalability of the resulting parallel code.
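Here is a minimal sketch of the double-buffering idea, based on the question's code (not a tuned implementation): the two buffers are swapped between time steps, so the full C[:] = C_ copy disappears.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def diffusion_swap(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 10000000)
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    C_ = C.copy()  # copy once so the (untouched) boundary values match
    for j in range(Nt):
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C, C_ = C_, C  # swap the references; no data is moved
    return C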
Related
Can I use the cores in my GPU to accelerate this problem and speed it up? If so, how do I do it? At around 10 trillion, a single thread on my CPU just can't do it, which is why I want to accelerate it with my GPU. I am also interested in seeing any multi-threaded CPU answers, but I really want to see it done on the GPU. Ideally I'd like the answers to be as simple as possible.
My code:
y = 0
for x in range(1, 10000000000):
    y += 1/x
print(y)
Yes, this operation can be done on a GPU using a basic parallel reduction. In fact, it is well suited to GPUs since it is embarrassingly parallel and makes heavy use of floating-point/integer operations.
Note that the partial sums of such a basic series have an AFAIK well-known analytical approximation (as pointed out by @jfaccioni), so you should prefer analytical solutions, which are generally far cheaper to compute. Note also that client-side GPUs are not great at efficiently computing 64-bit floating-point (FP) numbers, so you should generally use 32-bit ones to see a speed-up, at the expense of a lower precision. That being said, server-side GPUs can compute 64-bit FP numbers efficiently, so the best solution depends on the hardware you actually have.
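For illustration, here is a sketch of that analytical shortcut: the partial sum H_n = 1 + 1/2 + ... + 1/n is approximately ln(n) + γ (the Euler-Mascheroni constant), so the whole computation collapses to a handful of floating-point operations.

import math

# Euler-Mascheroni constant (not provided as a named constant by the math module)
EULER_GAMMA = 0.5772156649015329

def harmonic_approx(n):
    # Asymptotic expansion: H_n ~ ln(n) + gamma + 1/(2n) - 1/(12n^2)
    return math.log(n) + EULER_GAMMA + 1/(2*n) - 1/(12*n**2)

# Same quantity as the loop in the question (x from 1 to 9_999_999_999)
print(harmonic_approx(10_000_000_000 - 1))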
Nvidia GPUs are usually programmed with CUDA, which is pretty low-level compared to basic pure-Python code. There are Python wrappers and higher-level libraries, but in this case most are not efficient, since they will cause unnecessary memory loads/stores or other overheads. AFAIK, PyCUDA and Numba are likely the best tools for this so far. If your GPU is not an Nvidia GPU, then you can use libraries based on OpenCL (as CUDA is not really well supported on non-Nvidia GPUs yet).
Numba supports high-level reductions, so this can be done quite easily (note that Numba uses CUDA internally, so you need an Nvidia GPU):
from numba import cuda

# "out" is an array with 1 FP item that must be initialized to 0.0
@cuda.jit
def vec_add(out):
    x = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bdx = cuda.blockDim.x
    i = bx * bdx + x
    if i < 10_000_000_000 - 1:
        # Each thread adds its term 1/(i+1) of the series to out[0]
        cuda.atomic.add(out, 0, 1.0 / (i + 1))
This is only the GPU kernel, not the whole code. For more information about how to run it, please read the documentation. In general, one needs to take care of data transfers, allocations, kernel dependencies, streams, etc. Keep in mind that GPUs are hard to program (efficiently). This kernel is simple but clearly not optimal, especially on old GPUs without hardware atomic acceleration units. To write a faster kernel, you need to perform a local reduction using an inner loop (see the sketch below). Also note that C++ is much better suited for writing efficient kernel code, especially with libraries like CUB (based on CUDA), which supports iterators and high-level, flexible, efficient primitives.
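As a rough sketch of that local-reduction idea (the kernel name, the grid-stride pattern and the launch configuration are my own choices, not code from the question): each thread accumulates many terms in a register and issues a single atomic add at the end.

from numba import cuda

@cuda.jit
def partial_harmonic(out, n):
    start = cuda.grid(1)        # global thread index
    stride = cuda.gridsize(1)   # total number of threads in the grid
    acc = 0.0
    # Grid-stride loop: each thread sums a strided subset of the terms
    for x in range(start + 1, n, stride):
        acc += 1.0 / x
    cuda.atomic.add(out, 0, acc)  # one atomic per thread instead of one per term

A launch like partial_harmonic[1024, 256](out_dev, 10_000_000_000), with out_dev holding a single zero-initialized float64 on the device, would then leave the result in out_dev[0] (the grid and block sizes here are arbitrary).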
Note that Numba can also be used to implement fast parallel CPU codes. Here is an example:
import numba as nb

@nb.njit('float64(int64)', fastmath=True, parallel=True)
def compute(limit):
    y = 0.0
    for x in nb.prange(1, limit):
        y += 1 / x
    return y

print(compute(10000000000))
This takes only 0.6 seconds on my 10-core CPU machine. CPU codes have the benefit of being simpler, easier to maintain, more portable and more flexible, despite possibly being slower.
(Nvidia CUDA) GPUs can be used with the proper modules installed. However, a standard multiprocessing solution (not multithreading, since this is a computation-only task) is easy enough to achieve and reasonably efficient:
import multiprocessing as mp

def inv_sum(start):
    return sum(1/x for x in range(start, start+1000000))

def main():
    pool = mp.Pool(mp.cpu_count())
    result = sum(pool.map(inv_sum, range(1, 1000000000, 1000000)))
    print(result)

if __name__ == "__main__":
    main()
I didn't dare to test-run it on one trillion, but one billion on my 8-core i5 laptop runs in about 20 seconds.
I have to run a prediction using prophet on a certain number of series (several thousands).
prophet works fine but does not make use of multiple CPUs. Each prediction takes around 36 seconds (actually the whole function, which also does some data preprocessing and post-processing). If I run sequentially (for just 15 series, to make a test) it takes 540 seconds to complete. Here is the code:
for i in G:
    predictions = list(make_prediction(i, c, p))
where G is an iterator that returns one series at a time (capped for the test at 15 series), and c and p are two dataframes (the function only reads them).
I then tried with joblib:
predictions = Parallel(n_jobs=5)(delayed(make_prediction)(i, c, p) for i in G)
Time taken 420 seconds.
Then I tried Pool:
p = Pool(5)
predictions = list(p.imap(make_prediction2, G))
p.close()
p.join()
Since with map I can pass just one parameter, I call a wrapper (make_prediction2) that calls make_prediction(i, c, p) with c and p fixed. Time taken: 327 seconds.
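Roughly, the wrapper can also be avoided with functools.partial; a sketch, assuming make_prediction's second and third parameters are actually named c and p:

from functools import partial
from multiprocessing import Pool

# Bind the read-only dataframes once; the pool then only passes each series.
make_prediction_cp = partial(make_prediction, c=c, p=p)

with Pool(5) as pool:
    predictions = list(pool.imap(make_prediction_cp, G))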
Eventually I tried ray:
I decorated make_prediction with @ray.remote and called:
predictions = ray.get([make_prediction.remote(i, c, p) for i in G])
Time taken: 340 seconds! I also tried to make c and p ray objects with c_ray = ray.put(c) (and the same for p) and passing c_ray and p_ray to the function, but I could not see any improvement in performance.
I understand the overhead required by forking, and the time taken by the function is actually not huge, but I'd expect better performance (a less than 40% gain, at best, while using 5 times as many CPUs does not seem amazing), especially from ray. Am I missing something or doing something wrong? Any idea how to get better performance?
Let me point out that RAM usage is under control. Each process, in the worst scenario, uses less than 2 GB of memory, so less than 10 GB in total, with 26 GB available.
Python 3.7.7, Ray 0.8.5, joblib 0.14.1 on Mac OS X 10.14.6, i7 with 4 physical cores and 8 threads (cpu_count() = 8; with 5 CPUs working, no throttle reported by pmset -g thermlog).
PS: Increasing the test size to 50 series, the performance improves, especially with ray, meaning that the overhead of forking is relevant in such a small test. I will run a longer test to get a more accurate performance metric, but I guess I won't get far from a 50% reduction, a value that seems consistent with other posts I have seen, where they used 50 CPUs to get a 90% reduction.
**************************** UPDATE ***********************************
Increasing the number of series to 100, ray hasn't shown good performance (maybe I'm missing something in its implementation). Pool, instead, using an initializer function and imap_unordered, did better (a sketch of that setup follows the results below). There is a tangible overhead caused by forking and preparing each process environment, but I got really weird results:
a single CPU took 2685 seconds to complete the job (100 series);
using pool as described above with 2 CPUs took 1808 seconds (-33% and it seems fine);
using the same configuration with 4 CPUs took 1582 seconds (-41% compared to a single CPU, but just -12.5% compared to the 2-CPU job).
Doubling the number of CPUs for just a 12% decrease in time? Even worse, using 5 CPUs takes nearly the same time (please note that 100 can be evenly divided by 1, 2, 4 and 5, so the queues are always filled)! No throttle, no swap; the machine has 8 cores and plenty of unused CPU power even while running the test, so no bottlenecks. What's wrong?
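For reference, the initializer + imap_unordered setup is roughly of this shape (the helper names and sharing c and p via module globals are illustrative, not my exact code):

from multiprocessing import Pool

_c = None
_p = None

def _init(c, p):
    # Runs once per worker process: store the shared dataframes as globals
    global _c, _p
    _c, _p = c, p

def _work(i):
    return make_prediction(i, _c, _p)

def run_parallel(G, c, p, n_jobs=4):
    with Pool(n_jobs, initializer=_init, initargs=(c, p)) as pool:
        # imap_unordered yields each result as soon as a worker finishes it
        return list(pool.imap_unordered(_work, G))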
I need to run a very CPU- and memory-intensive Python calculation (Monte Carlo-like). I benchmarked execution on my development machine; I can run only one core due to memory (up to 9 GB per thread).
I attempted to run the same via the server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing the number of threads increases the runtime per core, quite dramatically: 8 threads instead of 4 run 3 times longer each. The performance monitor shows 9 x 8 GB max, far below the maximum available.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that
1. Time is spent in the function itself, in three CPU-expensive steps.
2. Of these, the first (random drawings and conversion to events) and the last (a C++ module for aggregation) are much less sensitive to the problem (run time increases by up to a factor of 2). The second step, containing Python matrix algebra including the scipy.linalg.blas.dgemm function, can be 6 times more expensive when I run more cores. It does not consume most of the memory (step 1 does; after step 1 it is no more than 5 GB).
3. If I manually run the same pieces from different DOS boxes, I get identical behaviour.
I need the calculation time to scale in order to improve performance, but I cannot get it to. Am I missing something? Python memory limitations? Something specific to Windows Server 2008? A BLAS thread-oversubscription problem?
You are missing information about the GIL. In CPython, threading does not give you additional performance for CPU-bound work; it only allows computation to proceed while some time-consuming I/O operations are waiting in another thread.
To get a performance speedup, your function needs to release the GIL. That means it cannot be pure Python; it has to be Cython/C/C++ with the proper configuration.
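As an illustration of the idea (using Numba's nogil flag here instead of Cython, purely as an example): once the hot function releases the GIL, a plain thread pool can actually use several cores.

import numpy as np
from numba import njit
from concurrent.futures import ThreadPoolExecutor

@njit(nogil=True)  # the compiled function drops the GIL while it runs
def heavy_kernel(a):
    s = 0.0
    for i in range(a.size):
        s += a[i] * a[i]
    return s

chunks = [np.random.rand(5_000_000) for _ in range(8)]
heavy_kernel(chunks[0])  # warm-up compilation

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(heavy_kernel, chunks))
print(sum(results))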
I am trying to optimize the number of MKL library threads that are used when a call is made to numpy.mean() (I am using numpy built against the MKL library). The number of threads can be dynamically controlled at runtime using mkl.set_num_threads(n) from the mkl-service library. While this does correctly set the number of threads, and in fact this is verified in the CPU usage with htop, I am bewildered to find that it doesn't have any impact on the runtime. Consider this trial code, where tmp is a (12, 384, 320) array:
for j in range(1000):
    out = np.mean(tmp, axis=0)
With a single thread this takes roughly 21 seconds, and it takes the same amount of time if I use a greater number of threads. The CPU consumption does go up with more threads, but there is no performance improvement. I also verified this issue by averaging over the last dimension to make the averaging more cache-efficient.
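The whole trial looks roughly like this (a sketch; the thread counts are arbitrary and mkl here is the mkl-service package):

import time
import mkl
import numpy as np

tmp = np.random.rand(12, 384, 320)

for n in (1, 2, 4, 8):
    mkl.set_num_threads(n)      # takes effect immediately for MKL-backed calls
    start = time.perf_counter()
    for j in range(1000):
        out = np.mean(tmp, axis=0)
    print(n, "threads:", round(time.perf_counter() - start, 2), "s")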
Any ideas on why this might be happening?
MKL Summary Statistics functions work in 1 thread for such small input problem sizes. Threading is switched on when the problem size is greater than roughly 10K elements.
Are we saying that a task that requires fairly light computation per row on a huge number of rows is fundamentally unsuited to a GPU?
I have some data processing to do on a table where the rows are independent, so it is embarrassingly parallel. I have a GPU, so... match made in heaven? It is something quite similar to this example, which calculates a moving average for each entry per row (rows are independent):
import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))
Like this example, my processing for each row is not heavily mathematical. Rather it is looping across the row and doing some sums, assignments and other bits and pieces with some conditional logic thrown in.
I have tried using guvectorize in the Numba library to assign this to an Nvidia GPU. It works fine, but I'm not getting a speedup.
Is this type of task suited to a GPU in principle? That is, if I go deeper into Numba and start tweaking the threads, blocks and memory management, or the algorithm implementation, should I, in theory, get a speed-up? Or is this kind of problem fundamentally unsuited to the architecture?
The answers below seem to suggest it is unsuited but I am not quite convinced yet.
numba - guvectorize barely faster than jit
And numba guvectorize target='parallel' slower than target='cpu'
Your task is obviously memory-bound, but that doesn't mean you cannot profit from a GPU; however, it is probably less straightforward than for a CPU-bound task.
Let's look at a common configuration and do some math:
CPU-RAM memory bandwidth of ca. 24GB/s
CPU-GPU transfer bandwidth of ca. 8GB/s
GPU-RAM memory bandwidth of ca. 180GB/s
Let's assume we need to transfer 24 GB of data to complete the task, so we will have the following optimal times (whether and how to achieve these times is another question!):
1. scenario: CPU only. Time = 24GB / 24GB/s = 1 second.
2. scenario: Data must be transferred from CPU to GPU (24GB / 8GB/s = 3 seconds) and processed there (24GB / 180GB/s = 0.13 seconds), leading to about 3.1 seconds.
3. scenario: Data is already on the device, so only 24GB / 180GB/s = 0.13 seconds are needed.
As one can see, there is potential for a speed-up, but only in the third scenario, when your data is already on the GPU device.
However, achieving the maximal bandwidth is a quite challenging enterprise.
For example, when processing the matrix row-wise on the CPU, you would like your data to be in row-major order (C-order) in order to get the most out of the L1 cache: while reading one double you actually get 8 doubles loaded into the cache, and you don't want them to be evicted before you have processed the remaining 7.
On the GPU, on the other hand, you want the memory accesses to be coalesced, e.g. thread 0 should access address 0, thread 1 address 1, and so on. For this to work, the data must be in column-major order (Fortran order).
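A small illustration of the two layouts (the array shape is taken from the question; np.asfortranarray is just one way to get a column-major copy):

import numpy as np

arr_c = np.arange(2000000, dtype=np.float64).reshape(200000, 10)  # C-order (row-major)
arr_f = np.asfortranarray(arr_c)  # same values, column-major (Fortran) layout
print(arr_c.flags['C_CONTIGUOUS'], arr_f.flags['F_CONTIGUOUS'])   # True True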
There is another thing to be considered: the way you test the performance. Your test array is only about 16 MB (2,000,000 doubles) and thus small enough to fit largely in the L3 cache. The bandwidth of the L3 cache depends on the number of cores used for the calculation, but it will be at least around 100GB/s, not much slower than the GPU and probably much faster when parallelized on the CPU.
You need a bigger dataset to not get fooled by cache behavior.
A somewhat off-topic remark: your algorithm is not very robust from the numerical point of view.
If the window width were 3, as in your example, but there were about 10**4 elements in a row, then for the last element the value would be the result of about 10**4 additions and subtractions, each of which adds a rounding error to the value; compared to only three additions if done "naively", that is quite a difference.
Of course, it might not be significant (for 10 elements per row, as in your example), but it might also bite you one day...
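To make that concrete, here is a small sketch comparing the running-sum approach from the kernel above with a naive per-window mean on a longer row (the row length and window width are arbitrary); the maximum difference is tiny for float64, but it is nonzero and grows with the row length.

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000)
w = 3

# Running-sum moving mean, as in the guvectorize kernel above
out_running = np.empty_like(a)
asum = 0.0
for i in range(w):
    asum += a[i]
    out_running[i] = asum / (i + 1)
for i in range(w, len(a)):
    asum += a[i] - a[i - w]
    out_running[i] = asum / w

# Naive moving mean: each window is summed from scratch
out_naive = np.array([a[max(0, i - w + 1):i + 1].mean() for i in range(len(a))])

print(np.abs(out_running - out_naive).max())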