I reconized numpy can link with blas, and I thought of why not using gpu accelerated blas library.
Did anyone use to do so?
Update (2014-05-22)
AMD has produced a beta release of AMD Core Math Library (ACML) version 6.0 that can offload FFT and BLAS functions to a GPU by using clMath internally. The announcement is here: ACML Beta 6.0 Release Leverages the Power of Heterogeneous Compute. The caveat here is that input data must be transferred from CPU to GPU and output data returned to the CPU on each BLAS or FFT call. Therefore, AMD has a bunch of scripts for tuning when a problem is large enough that ACML will use the GPU instead of the CPU.
For the sake of completeness, I'll also mention that Nvidia supports similar functionality with its nvBLAS library but that relies on cuBLAS and CUDA so it won't work on anything but Nvidia GPUs.
Original answer
Unfortunately, AMD's GPU accelerate BLAS library cannot directly link to Numpy or any other application expecting a standard CPU-based BLAS library. The reason is that existing GPU BLAS libraries all require one to first copy the matrices to the GPU before calling the BLAS functions. This requires that someone modify Numpy to do this copying.
Edit: CLyther looks like it can replace some of what Numpy does and converts everything to OpenCL. See here: http://srossross.github.io/Clyther/for_numpy_users.html
If memory servers, pyCuda at least, probably also pyOpenCL can work with numPy
Related
I have written a piece of scientific code in python, mainly using the numpy library (especially Fast Fourier Transforms), and a bit of Cython. Nothing in CUDA or anything GPU related that I am aware of. There is no graphic interface, everything runs in the terminal (I'm using WSL2 on Windows). The whole code is mostly about number crunching, nothing fancy at all.
When I run my program, I see that CPU usage is ~ 100% (to be expected of course), but GPU usage also rises, to around 5%.
Is it possible that a part of the work gets offloaded automatically to the GPU? How else can I explain this small but consistent increase in GPU usage ?
Thanks for the help
No, there is no automatic offloading in Numpy, at least not with the standard Numpy implementation. Note that some specific FFT libraries can use the GPU, but the standard implementation of Numpy uses its own implementation of FFT called PocketFFT based on FFTPack that do not use the GPU. Cython do not perform any automatic implicit GPU offloading. The code need to do that explicitly/manually.
No GPU offloading are automatically performed because GPUs are not faster than CPUs for all tasks and offloading data to the GPU is expensive, especially with small arrays (due to the relatively high-latency of the PCI bus and kernel calls in such a case). Moreover, this is hard to do efficiently even in case where the GPUs could be theoretically faster.
The 5% GPU usage is relative to the frequency of the GPU which is often configured to use an adaptative frequency. For example my discrete Nv-1660S GPU frequency is currently 300 MHz while it can automatically reach 1.785 GHz. Using actually 5% of a GPU running at a 17% of its maximum frequency with a 2D rendering of a terminal is totally possible. On my machine, printing lines in a for loop at 10 FPS in a Windows terminal takes 6% of my GPU still running at low-frequency (0-1% without running anything).
If you want to check the frequency of your GPU and the load there are plenty of tools for that starting from vendor tools often installed with the driver to softwares like GPU-z on Windows. For Nvidia GPU, you can list the processes currently using you GPU with nvidia-smi (it should be rocm-smi on AMD GPUs).
Today i was asking to me if is possible to do matrix calculation using gpu instead cpu because i know that a gpu is designed to do them faster then a cpu.
I searched on the net and i found notices about the matrix calculation using gpu with different python's libraries but my question is exists a documentation that descibes how should we write code to comunicate with a gpu.
I'm asking that because i want to develop my own one to better understand how gpu work and to try something different.
Thanks to all.
I solved that problem with OpenCL
OpenCL is a standard library that the vendor of the GPU's implement by their own. Like NVIDIA support openCL and other features thanks to CUDA library.
Here a good guide to get start
I have an Intel Graphics Card (Intel(R) HD Graphics 520, also am on Windows 10) and as far as I know I can't use CUDA unless I have a NVIDIA GPU. The purpose is to use Theano's GPU capabilities (for deep learning which is why I need GPU power).
Is there a workaround that somehow allows me to use CUDA with my current GPU?
If not is there another API that I can use with my current GPU for Theano (in Python 2.7)?
Or as a last option, using another language entirely, such as Java that has an API that allows for GPU use that I can use?
Figuring this out would be very helpful, because even though I just started with deep learning, I will probably get to the point where I need GPU parallel processing power to get results without waiting days at a minimum.
In order:
No. You must have a supported NVIDIA GPU to use CUDA.
As pointed out in comments, there is an alternative backend for Theano which uses OpenCL and which might work on your GPU
Intel support OpenCL on your GPU, so any language bindings for the OpenCL APIs, or libraries with in-built OpenCL would be a possible solution in this case
[This answer has been assembled from comments and added as a community wiki entry in order to get it off the unanswered queue for the CUDA tag].
I implemented a recursive autoencoder with Theano and tested it on both Linux and Windows. It tooks ~3 hours, 2.3G memory on Linux, while ~9 hours, 0.5G memory on Windows. config.allow_gc=True for both cases.
It could be a Python issue, as discussed in the thread: Why is python so much slower on windows?
Is there any specific setting in Theano that could slow things down on Windows as well?
Thanks,
Ya
It could be that they use different BLAS librairies. From memory, autoencoder bottleneck is the matrix product, that call BLAS. Different BLAS implementation can have up to 10x speed difference.
So check if you used the same BLAS. I would recommand to install python via EPD/Canopy or Anaconda python packages. There not free version link to a good blas and Theano reuse it. The now free version is free for academic.
I really want to know how to utilize multi-core processing for matrix multiplication on numpy/pandas.
What I'm trying is here:
M = pd.DataFrame(...) # super high dimensional square matrix.
A = M.T.dot(M)
This takes huge processing time because of many sums of products, and I think it's straightforward to use multithreading for huge matrix multiplication. So, I was googling carefully, but I can't find how to do that on numpy/pandas. Do I need to write multi thread code manually with some python built-in threading library?
In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subroutines. You need to:
Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
Have a NumPy compiled to use such an implementation.
Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later where these have been relaxed).
A few caveats apply:
Older versions of OpenBLAS, when compiled with GCC, runs into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The reason is a bug (or lack of a feature) in GCC. A patch has been submitted but not included in the mainline sources yet.
The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.
As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.
First of all I would also propose to convert to bumpy arrays and use numpys dot function. If you have access to an MKL build which is more or less the fastest implementation at the moment, you should try to set the environment variable OMP_NUM_THREADS. This should activate the other cores of your system. On my MAC it seems to work properly. In addition I would try to use np.einsum which seems to be faster than np.dot
But pay attention! If you have compiled an multithreaded library that is using OpenMP for parallelisation (like MKL), you have to consider, that the "default gcc" on all apple systems is not gcc, it is Clang/LLVM and Clang ist not able to build with OpenMP support at the moment, except you use the OpenMP trunk which is still experimental. So you have to install the intel compiler or any other that supports OpenMP