Hi, I am using numba to write some kernels with the @cuda.jit decorator. I have 8 CPU threads, each calling a kernel on one of 2 GPU devices (cpu_idx % len(cuda.gpus), to be specific).
I believe each CPU thread is compiling the kernel, which takes a lot of time relative to the time the kernel needs to process an entire image. Ideally the kernel should be compiled only once for all the CPU threads to use. But I can't initialize any CUDA GPU code before forking with multiprocessing.Pool, because CUDA doesn't like that.
So is there a way to pre-compile CUDA kernels? I don't want just-in-time compilation.
You can use Eager Compilation to pre-compile the CUDA kernel for a given signature: http://numba.pydata.org/numba-doc/latest/user/jit.html?highlight=eager#eager-compilation
Note that the CUDA backend only supports a single signature for Eager Compilation (whereas the CPU target supports multiple), and that eagerly compiled kernels don't check types when you call the kernel. This results in a faster launch, but opens up the possibility of unnoticed user error by allowing argument types to mismatch between the compiled kernel and what is passed in.
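For illustration, a minimal sketch of an eagerly compiled kernel (the kernel body and the float32 array types are just assumptions, not taken from your code):

```python
# Minimal sketch of Eager Compilation for the CUDA target; the kernel body
# and the float32 array types are illustrative assumptions.
from numba import cuda

@cuda.jit("void(float32[:], float32[:])")  # compiled immediately for this one signature
def scale(out, img):
    i = cuda.grid(1)
    if i < img.size:
        out[i] = 2.0 * img[i]
```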
However, noting from your question that you're using multiple processes rather than threads, you will still have an issue as you'd have to pre-compile and then fork, which CUDA still wouldn't like. Is it possible for you to use threads in your application with the threading module instead?
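If threads are an option, a rough sketch might look like the following, just to show the structure; the kernel itself is made up, and the device selection simply mirrors the cpu_idx % len(cuda.gpus) mapping from your question:

```python
# Hedged sketch: a thread pool instead of multiprocessing.Pool, so the kernel
# is compiled once in the parent process and shared by all worker threads.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from numba import cuda

@cuda.jit("void(float32[:])")                      # eagerly compiled once, up front
def double(img):
    i = cuda.grid(1)
    if i < img.size:
        img[i] *= 2.0

def worker(cpu_idx, image):
    cuda.select_device(cpu_idx % len(cuda.gpus))   # same GPU mapping as in your question
    d_image = cuda.to_device(image)
    double[(image.size + 255) // 256, 256](d_image)
    return d_image.copy_to_host()

images = [np.ones(1024, dtype=np.float32) for _ in range(8)]  # placeholder inputs
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, range(8), images))
```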
A side question: if the runtime of your application / GPU kernel is so short that the compilation time is the dominating factor, does using a GPU make sense for your application? Once you have processed one image in each thread, even when JIT compiling, there won't be a compilation overhead for that thread again - for subsequent launches, the kernel will be retrieved from the cache. If you were processing many images, the compilation time would become negligible in the overall runtime.
Related
I use ray and torch in my code and set one CPU core for each ray remote actor to compute gradients (using the torch package). But I find that the CPU utilization of the actor can go up to 300% at times. This seems to be impossible, since the actor is supposed to use only one CPU core.
I want to know if the actor is actually using more CPU resources, since torch may open one or more threads to compute the gradient.
My OS is Win10 and my CPU is a Ryzen 5600H. Thanks.
Ray currently does not automatically pin the actor to specific CPU cores and prevent it from using other CPU cores. So what you're seeing makes sense.
It is possible to use a library like psutil to pin the actor to a specific core and prevent it from using other cores. This can be helpful if you have many parallel tasks/actors that are all multi-threaded and competing with each other for resources (e.g., because they use pytorch or numpy).
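A rough sketch of what that could look like (the actor class and its gradient method are illustrative, not from your code):

```python
# Hedged sketch: pin a Ray actor's process to one core with psutil and cap
# torch's intra-op threads; the actor and the gradient computation are made up.
import psutil
import ray
import torch

@ray.remote(num_cpus=1)
class GradientWorker:
    def __init__(self, core_id):
        psutil.Process().cpu_affinity([core_id])  # restrict this worker to a single core
        torch.set_num_threads(1)                  # keep torch from spawning extra compute threads

    def gradient(self, values):
        x = torch.tensor(values, requires_grad=True)
        (x ** 2).sum().backward()                 # toy gradient computation
        return x.grad.numpy()

ray.init()
workers = [GradientWorker.remote(core_id=i) for i in range(4)]
grads = ray.get([w.gradient.remote([1.0, 2.0, 3.0]) for w in workers])
```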
I have written a piece of scientific code in python, mainly using the numpy library (especially Fast Fourier Transforms), and a bit of Cython. Nothing in CUDA or anything GPU related that I am aware of. There is no graphic interface, everything runs in the terminal (I'm using WSL2 on Windows). The whole code is mostly about number crunching, nothing fancy at all.
When I run my program, I see that CPU usage is ~ 100% (to be expected of course), but GPU usage also rises, to around 5%.
Is it possible that part of the work gets offloaded automatically to the GPU? How else can I explain this small but consistent increase in GPU usage?
Thanks for the help
No, there is no automatic offloading in Numpy, at least not with the standard Numpy implementation. Note that some specific FFT libraries can use the GPU, but the standard implementation of Numpy uses its own implementation of FFT called PocketFFT, based on FFTPack, which does not use the GPU. Cython does not perform any automatic implicit GPU offloading either. The code needs to do that explicitly/manually.
No GPU offloading is performed automatically, because GPUs are not faster than CPUs for all tasks, and offloading data to the GPU is expensive, especially with small arrays (due to the relatively high latency of the PCI bus and of kernel calls in such cases). Moreover, this is hard to do efficiently even in cases where the GPU could theoretically be faster.
The 5% GPU usage is relative to the current frequency of the GPU, which is often configured to be adaptive. For example, my discrete Nv-1660S GPU is currently running at 300 MHz, while it can automatically reach 1.785 GHz. Actually using 5% of a GPU running at 17% of its maximum frequency just for the 2D rendering of a terminal is entirely possible. On my machine, printing lines in a for loop at 10 FPS in a Windows terminal takes 6% of my GPU, still running at low frequency (0-1% without running anything).
If you want to check the frequency of your GPU and its load, there are plenty of tools for that, from vendor tools often installed with the driver to software like GPU-Z on Windows. For Nvidia GPUs, you can list the processes currently using your GPU with nvidia-smi (it should be rocm-smi on AMD GPUs).
I have some Python code I am trying to accelerate using CUDA. I have used the @jit decorator. How do I know if the code is actually being run on the GPU? Is there any way to check/verify that?
You should use numba.cuda.jit in order to run your jitted function on the GPU. Moreover, the function should be written in the manner of a CUDA kernel (http://numba.pydata.org/numba-doc/0.30.1/cuda/kernels.html). While the function runs, a GPU monitor (for example nvidia-smi on Linux) can be used to see the GPU load and to check that the GPU is involved in the calculation.
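For illustration, a minimal kernel and launch might look as follows (the data and launch configuration are just assumptions):

```python
# Minimal sketch of a numba.cuda.jit kernel; the data and launch
# configuration are illustrative assumptions.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

data = np.zeros(1_000_000, dtype=np.float32)
d_data = cuda.to_device(data)                    # explicit copy to the GPU
threads = 256
blocks = (data.size + threads - 1) // threads
add_one[blocks, threads](d_data)                 # launch: [blocks per grid, threads per block]
print(d_data.copy_to_host()[:5])                 # nvidia-smi should list this process while it runs
```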
I've researched this topic quite a bit and can't seem to come to a conclusion.
So I know OpenCL can be used for parallel processing using both the GPU and CPU (in contrast to CUDA). Since I want to do parallel processing with GPU and CPU, would it be better to use Multiprocessing module from python + PyOpenCL/PyCUDA for parallel processing or just use PyOpenCL for both GPU and CPU parallel programming?
I'm pretty new to this, but intuitively I would imagine the multiprocessing module from Python to be the best possible way to do CPU parallel processing in Python.
Any help or direction would be much appreciated
I don't know if you already got your answer, but keep in mind that GPUs are designed for floating point operations, and executing a complete Python process on one could be slower than you might expect from a GPU.
Anyway, since you are new to parallel processing, you should start with the multiprocessing module, because GPU programming, and the OpenCL library itself, are difficult to learn when you have no base.
You may take a look here https://philipwfowler.github.io/2015-01-13-oxford/intermediate/python/04-multiprocessing.html
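For a first step, plain multiprocessing can look like this (the work function is just a stand-in for your own computation):

```python
# Minimal multiprocessing sketch for CPU-side parallelism; the work function
# is a stand-in for a heavier numeric task.
from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(work, range(10)))
```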
Does anybody use "n_jobs" of sklearn classes? I work with sklearn in Anaconda 3.4 64 bit. My Spyder version is 2.3.8. My script can't finish its execution after setting the "n_jobs" parameter of some sklearn class to a non-zero value. Why is this happening?
Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as argument.
Taken from the sklearn documentation:
The problem is that Python multiprocessing does a fork system call without following it with an exec system call for performance reasons. Many libraries like (some versions of) Accelerate / vecLib under OSX, (some versions of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others), manage their own internal thread pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many threads while only the main thread state has been forked. It is possible to change the libraries to make them detect when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in master since 0.2.10) and we contributed a patch to GCC's OpenMP runtime (not yet reviewed).
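For reference, n_jobs is simply passed to the tool in question, e.g. as below (the estimator, parameter grid and data are only an example; newer sklearn versions import GridSearchCV from sklearn.model_selection):

```python
# Hedged sketch of using n_jobs with GridSearchCV; the estimator, grid and
# data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    n_jobs=2,  # > 1 spawns worker processes, which is where the fork issue above comes in
)
search.fit(X, y)
print(search.best_params_)
```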