Automatic GPU offloading in Python

I have written a piece of scientific code in Python, mainly using the NumPy library (especially Fast Fourier Transforms), plus a bit of Cython. Nothing in CUDA or anything GPU-related, as far as I am aware. There is no graphical interface; everything runs in the terminal (I'm using WSL2 on Windows). The whole code is mostly number crunching, nothing fancy at all.
When I run my program, I see that CPU usage is ~100% (to be expected, of course), but GPU usage also rises, to around 5%.
Is it possible that part of the work gets offloaded automatically to the GPU? How else can I explain this small but consistent increase in GPU usage?
Thanks for the help

No, there is no automatic offloading in NumPy, at least not with the standard NumPy implementation. Note that some specific FFT libraries can use the GPU, but standard NumPy uses its own FFT implementation, PocketFFT (derived from FFTPACK), which does not use the GPU. Cython does not perform any implicit GPU offloading either: the code needs to do that explicitly/manually.
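For illustration, here is a minimal sketch of what explicit offloading looks like, assuming the optional CuPy package and a CUDA-capable GPU (neither of which is part of NumPy):

    # Minimal sketch: explicit GPU offloading with CuPy (assumes CuPy and
    # a CUDA GPU are available). Nothing like this happens implicitly.
    import numpy as np
    import cupy as cp

    x = np.random.rand(1 << 20).astype(np.complex64)

    y_cpu = np.fft.fft(x)        # CPU: NumPy's PocketFFT

    x_gpu = cp.asarray(x)        # explicit host -> device copy over PCIe
    y_gpu = cp.fft.fft(x_gpu)    # GPU: cuFFT under the hood
    y_back = cp.asnumpy(y_gpu)   # explicit device -> host copy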
No GPU offloading is performed automatically because GPUs are not faster than CPUs for all tasks, and moving data to the GPU is expensive, especially with small arrays (due to the relatively high latency of the PCIe bus and of kernel calls in such cases). Moreover, offloading is hard to do efficiently even in cases where the GPU could theoretically be faster.
The 5% GPU usage is relative to the current frequency of the GPU, which is usually configured adaptively. For example, my discrete Nvidia GTX 1660S GPU is currently running at 300 MHz, while it can automatically reach 1.785 GHz. Actually using 5% of a GPU running at 17% of its maximum frequency just for the 2D rendering of a terminal is entirely possible. On my machine, printing lines in a loop at 10 FPS in a Windows terminal takes 6% of my GPU, still running at low frequency (0-1% without running anything).
If you want to check the frequency and load of your GPU, there are plenty of tools for that, from vendor tools often installed with the driver to software like GPU-Z on Windows. For Nvidia GPUs, you can list the processes currently using your GPU with nvidia-smi (the equivalent is rocm-smi on AMD GPUs).
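As a rough illustration (assuming an Nvidia GPU and that nvidia-smi is on your PATH), you can query the current clock, the maximum clock and the utilisation from Python:

    import subprocess

    # Query current SM clock, maximum SM clock and GPU utilisation.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=clocks.sm,clocks.max.sm,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())   # e.g. "300 MHz, 1785 MHz, 5 %"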

Related

Pre-compile numba cuda kernels (non-jit)

Hi, I am using Numba to write some kernels with the @cuda.jit decorator. I have 8 CPU threads, each calling a kernel on one of 2 GPU devices (cpu_idx % len(cuda.gpus), to be specific).
I believe each CPU thread is compiling the kernel, which takes a lot of time relative to the time it takes the kernel to process an entire image. Ideally it should be compiled only once and shared by all the CPU threads. But I can't initialize any CUDA GPU code before forking with multiprocessing.Pool because CUDA doesn't like that.
So is there a way to pre-compile CUDA kernels? I don't want just-in-time compilation.
You can use Eager Compilation to pre-compile the CUDA kernel for a given signature: http://numba.pydata.org/numba-doc/latest/user/jit.html?highlight=eager#eager-compilation
Note that the CUDA backend only supports a single signature for Eager Compilation (whereas the CPU target supports multiple), and that eager-compiled kernels don't check types when you call the kernel. This results in a faster launch, but opens up the possibility of unnoticed user error by allowing argument types to mismatch between the compiled kernel and what is passed in.
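A minimal sketch of an eager-compiled kernel (the kernel body and names here are made up for illustration):

    import numpy as np
    from numba import cuda

    # Giving the signature up front triggers eager compilation at
    # definition time instead of on the first call.
    @cuda.jit("void(float32[:], float32[:])")
    def scale(out, x):
        i = cuda.grid(1)
        if i < x.size:
            out[i] = 2.0 * x[i]

    x = np.arange(1024, dtype=np.float32)
    d_x = cuda.to_device(x)
    d_out = cuda.device_array_like(d_x)
    scale[(x.size + 255) // 256, 256](d_out, d_x)   # no compile at first launch
    result = d_out.copy_to_host()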
However, noting from your question that you're using multiple processes rather than threads, you will still have an issue as you'd have to pre-compile and then fork, which CUDA still wouldn't like. Is it possible for you to use threads in your application with the threading module instead?
A side question: if the runtime of your application / GPU kernel is so short that the compilation time is the dominating factor, does using a GPU make sense for your application? Once you have processed one image in each thread, even when JIT compiling, there won't be a compilation overhead for that thread again - for subsequent launches, the kernel will be retrieved from the cache. If you were processing many images, the compilation time would become negligible in the overall runtime.

Lagging System or a possible bug in TensorFlow?

I am currently doing R&D in TensorFlow (CPU version), but I am unable to decide on the basic system requirements for training on large datasets, or maybe I have stumbled upon a possible bug in the TensorFlow library.
The official TensorFlow documentation nowhere suggests any specific system requirements for building and running TensorFlow programs. From what I can understand, if it can run on Windows, Linux and Mac, along with Android, iOS and even embedded systems like the Raspberry Pi, I suppose there should not be any particular hardware requirement.
However, during my initial research I tried running the TensorFlow Seq2Seq model (translating English to French, https://www.tensorflow.org/tutorials/seq2seq), where the training and test datasets end up taking around 7-8 GB of disk space initially and 20-22 GB in total. Once the translate.py script is executed, it ends up choking the memory, pushing memory and disk utilization to 98% and 100% respectively.
My current system runs Windows 8.1 64-bit, with a Core i5-5200U clocked at 2.2 GHz, 8 GB of RAM and around 70 GB of free space on the HDD (specifically allotted for TensorFlow usage). But even after letting my system run for 7-8 hours (with no other application running), it got stuck multiple times, usually after memory utilization peaks at around 100% once the datasets have been tokenized.
Though I am not sure, I suppose the TensorFlow learning graph is being built in RAM, and once it expands to fill nearly all available memory, the program ends up in an unending loop waiting for memory to be freed so the graph can grow further.
So it all boils down to 3 questions:
Does TensorFlow use RAM for building and saving the learning graph? If so, can it choke in a similar fashion?
From a business perspective, is there a minimum hardware requirement for training such a system?
If it is not a system requirement issue, could this be a bug in the TensorFlow library that pushes it into an unending loop waiting for memory to be cleared?
Update
After running the Python script continuously for over 30 hours, the process seems to have been stuck at the same place ("Reading development and training data") for the past 14 hours. Refer to the image below for further investigation:
As soon as I was about to shut the program down, it started responding again; I waited another 15-20 minutes and finally got the answer from the OS itself: it was indeed low RAM that was causing the problem. Attaching a screen grab of the Windows alert about the system running low on memory for reference, in case anyone gets caught in the same situation.
UPDATE
I tried a VM instance on Google Cloud Platform. This machine had 2 Intel Xeon CPUs, each running at 2.23 GHz, with 13 GB of RAM and 50 GB of storage. But the result was the same, even though the application was utilising more than 10.5 GB of RAM. It seems this tutorial script needs a very powerful system, probably with at least 32 GB of RAM, to run to completion. I might look into writing/arranging my own dataset now. However, a possible future enhancement would be to use persistent storage (HDD/SSD) to build the graph instead of RAM, so as to avoid choking the memory.

Theano for GPU without use of CUDA or using a CUDA workaround

I have an Intel graphics card (Intel(R) HD Graphics 520) and am on Windows 10; as far as I know, I can't use CUDA unless I have an NVIDIA GPU. The goal is to use Theano's GPU capabilities (for deep learning, which is why I need GPU power).
Is there a workaround that somehow allows me to use CUDA with my current GPU?
If not, is there another API that I can use with my current GPU for Theano (in Python 2.7)?
Or, as a last option, is there another language entirely, such as Java, with an API that allows for GPU use?
Figuring this out would be very helpful, because even though I have just started with deep learning, I will probably get to the point where I need GPU parallel processing power to get results without waiting days at a minimum.
In order:
No. You must have a supported NVIDIA GPU to use CUDA.
As pointed out in the comments, there is an alternative backend for Theano which uses OpenCL and which might work on your GPU (see the sketch after this answer)
Intel supports OpenCL on your GPU, so any language bindings for the OpenCL APIs, or libraries with built-in OpenCL support, would be a possible solution in this case
[This answer has been assembled from comments and added as a community wiki entry in order to get it off the unanswered queue for the CUDA tag].
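For the OpenCL route, here is a hedged sketch of how Theano's libgpuarray backend is typically selected; the exact device string, and whether it works at all on Intel HD Graphics, depend on your Theano/libgpuarray versions, so treat this as an assumption to verify locally:

    import os
    # Must be set before Theano is imported; "opencl0:0" selects the first
    # OpenCL platform/device seen by the libgpuarray backend.
    os.environ["THEANO_FLAGS"] = "device=opencl0:0,floatX=float32"

    import theano
    import theano.tensor as T

    x = T.matrix("x")
    f = theano.function([x], T.exp(x))   # compiled for the device chosen above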

Can normal algos run on PyOpenGL?

I want to write an algorithm that would benefit from the GPU's superior hashing capability over the CPU.
Is PyOpenGL the answer? I don't want to use drawing tools, but simply run a "vanilla" python script ported to the GPU.
I have an ATI/AMD GPU if that means anything.
Is PyOpenGL the answer?
No. At least not in the way you expect. If your GPU supports OpenGL 4.3, you could use compute shaders in OpenGL, but those are not written in Python
but simply run a "vanilla" python script ported to the GPU.
That's not how GPU computing works. You have to write the shaders or compute kernels in a special language: either OpenCL, OpenGL compute shaders or, specific to NVIDIA, CUDA.
Python would then just deliver the framework for getting the GPU computation running.
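For example, with PyOpenCL (assuming the AMD OpenCL driver and the pyopencl package are installed), the Python side only sets up buffers and launches a kernel written in OpenCL C:

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # The kernel itself is OpenCL C, not Python.
    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)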

Python execution speed: laptop vs desktop

I am running a program that does simple data processing:
parses text
populates dictionaries
calculates some functions over the resulting data
The program only uses CPU, RAM, and HDD:
run from Windows command line
input/output to the local hard drive
nothing displayed on or printed to screen
no networking
The same program is run on:
desktop: Windows 7, i7-930 CPU overclocked @ 3.6 GHz (with matching memory speed), Intel X25-M SSD
laptop: Windows XP, Intel Core 2 Duo T9300 @ 2.5 GHz, 7200 rpm HDD
The CPU frequency is 1.44 times higher, and the HDD benchmark score is 4 times higher (PassMark Disk Mark). I found the program runs roughly 1.66 times faster on the desktop. So, apparently, the CPU is the bottleneck.
It seems there is only about a 15% benefit from the i7 vs. Core 2 Duo architecture (most of the performance boost is due to the raw CPU frequency). Is there anything I can do in the code to increase the benefit of the newer architecture?
EDIT: forgot to mention that I use ActivePython 3.1.2 if that matters.
Increasing hardware performance in most cases automatically benefits user applications. The much-maligned GIL, however, means that you may not be able to take advantage of multiple cores with CPython unless you design your program to do so via the various multiprocessing modules/libraries.
A related SO discussion: Does python support multiprocessor/multicore programming?
A related collation of solutions on python wiki: http://wiki.python.org/moin/ParallelProcessing
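A minimal sketch of the multiprocessing approach (the chunking and the word-count body are hypothetical stand-ins for your parsing/dictionary step):

    from multiprocessing import Pool

    def process_chunk(lines):
        # Stand-in for the parse / dictionary-building step.
        counts = {}
        for line in lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    if __name__ == "__main__":
        # Toy input split into independent chunks, one per worker process.
        text = ["alpha beta", "beta gamma", "gamma alpha"] * 1000
        chunks = [text[i::4] for i in range(4)]
        pool = Pool(processes=4)          # separate processes sidestep the GIL
        partial = pool.map(process_chunk, chunks)
        pool.close()
        pool.join()
        # Merge the per-process dictionaries.
        total = {}
        for counts in partial:
            for word, n in counts.items():
                total[word] = total.get(word, 0) + n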
Split your processing into multiple threads. Your particular i7 should be able to support up to 8 threads in parallel.
Consider repeating the comparison on regular HDDs: that SSD could well account for a substantial part of the performance difference, depending on caching and the nature of the data.
