Python CUDA parallelize multiple SVDs of small matrices

I've seen a similar post on Stack Overflow which tackles the problem in C++: Parallel implementation for multiple SVDs using CUDA
I want to do exactly the same in Python. Is that possible? I have multiple matrices (approximately 8000, each of size 15x3), and I want to decompose each of them using the SVD. This takes years on a CPU. My computer has an NVIDIA GPU installed. I have already looked at several libraries such as numba, pycuda, scikit-cuda and cupy, but didn't find a way to implement my plan with them. I would be very glad for some help.

CuPy gives access to cuSOLVER, including a batched SVD:
https://docs.cupy.dev/en/stable/reference/generated/cupy.linalg.svd.html
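As a rough sketch of what the batched call could look like for the sizes in the question: the code below assumes a CuPy version whose cupy.linalg.svd accepts stacked input of shape (batch, M, N), and it falls back to NumPy (whose np.linalg.svd accepts the same stacked layout) when CuPy or a usable GPU is missing. The random data and the variable names are placeholders, not part of the original answer.

```python
import numpy as np

# Use CuPy when it is installed and a CUDA device is usable; otherwise fall
# back to NumPy, which accepts the same stacked (batch, M, N) input.
try:
    import cupy as xp
    xp.zeros(1)  # forces CUDA initialisation so a missing GPU fails here
except Exception:
    xp = np

# Placeholder data: 8000 random 15x3 matrices stacked along the first axis.
rng = np.random.default_rng(0)
batch = xp.asarray(rng.standard_normal((8000, 15, 3)))

# A single call decomposes every matrix in the stack.
u, s, vt = xp.linalg.svd(batch, full_matrices=False)

# Sanity check: U * diag(s) @ V^T should reproduce each input matrix.
recon = (u * s[:, None, :]) @ vt
max_err = float(xp.abs(recon - batch).max())
print(u.shape, s.shape, vt.shape, max_err)
```

Even the NumPy fallback avoids a Python-level loop over the 8000 matrices, so it may be worth timing that variant on the CPU before moving the data to the GPU.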

Related

How to make python run using Windows GPU?

I've been trying to improve the performance of my python scripts and would like to run some using my computer's built-in GPU. However, my computer is Windows 10 and its GPU is not CUDA compatible. From what I've seen, it seems that the GPU must be CUDA compatible in order for it to run python scripts. Is there any way to utilize my GPU for said purposes? If not, are there other programming languages in which I can do this?

A GPU is a processing unit designed for graphics. It most likely won't help unless your script draws polygons, transfers data, or works on massive data sets. The closest you can get is importing a module (depending on your needs) that uses C++ to interact with the GPU through something like OpenCL, or coding the interactions yourself (much more complicated).
To answer your second question, C++ or C# should work with your GPU.
Please specify what script you are trying to run for more detail.
Good luck!

implementing python code on GPU from spyder

As far as I know, with tf.device('/GPU') can be used to run TensorFlow code on the GPU. Is there a similar way to run arbitrary Python code on the GPU (CUDA), or should I use pycuda?

For parallel processing in Python you need an intermediate library or package that sits between your code and the GPU/CPU and handles the parallel execution. Some popular packages are pycuda and numba. If you want to do GPU programming using simple Python syntax without using other frameworks like TensorFlow, then take a look at this.

Matrix calculation using gpu

Today I was wondering whether it is possible to do matrix calculations on a GPU instead of a CPU, because I know that a GPU is designed to do them faster than a CPU.
I searched the net and found information about matrix calculation on the GPU with different Python libraries, but my question is: does documentation exist that describes how to write code that communicates with a GPU?
I'm asking because I want to develop my own library to better understand how GPUs work and to try something different.
Thanks to all.

I solved that problem with OpenCL.
OpenCL is a standard that each GPU vendor implements on its own. For example, NVIDIA supports OpenCL alongside its CUDA library.
Here is a good guide to get started.

Prebuilt numpy with BLAS/ATLAS?

I'm implementing a real-time LMS algorithm, and numpy.dot takes more time than my sampling time, so I need numpy to be faster (my arrays are 1D and 100 elements long).
I've read about building numpy with ATLAS and such, but I have never done such a thing and spent all day trying to do it, with zero success...
Can someone explain why there aren't builds with ATLAS included? Can anyone provide me with one? Is there any other way to speed up the dot product?
I've tried numba and scipy.linalg.gemm_dot, but neither of them seemed to speed things up.
My system is Windows 8.1 with an Intel processor.

If you download the official binaries, they should come linked with ATLAS. If you want to make sure, check the output of np.show_config(). The problem is that ATLAS (Automatically Tuned Linear Algebra Software) tries many different combinations and algorithms at compile time and keeps the best. So, when you run a precompiled ATLAS, you are running it optimised for a computer different from yours.
So, your options to improve dot are:
1. Compile ATLAS yourself. On Windows it may be a bit challenging, but it is doable. Note: you must use the same compiler that was used to compile Python. That is, if you decide to go for MinGW, you need to get a Python compiled with MinGW, or build it yourself.
2. Try Christoph Gohlke's NumPy. It is linked against MKL, which is much faster than ATLAS (and does all the optimisations at run time).
3. Try Continuum Analytics' Conda with Accelerate (also linked with MKL). It costs money, unless you are an academic. On Linux, Conda is slower than the system Python because they have to use an old compiler for compatibility purposes; I don't know if that is the case on Windows.
4. Use Linux. Your Python life will be much easier, and setting up the system to compile things is very easy. Setting up Cython is simple too, and then you can compile your whole algorithm and probably get a further speed-up.
The note regarding Cython is valid for Windows too; it is just more difficult to get it working. I tried a few years ago (when I used Windows) and failed after a few days; I don't know if the situation has improved.
Alternative:
You are computing the dot product of two vectors, so np.dot is probably not the most efficient way. I would try spelling it out as (vec1*vec2).sum() (this could be very good for Numba, since it is an expression it can actually optimise) or using numexpr:
ne.evaluate('sum(vec1 * vec2)')
Numexpr will also parallelise the expression automatically.
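Whether the spelled-out form actually beats np.dot depends on the machine and the NumPy build, so it is worth measuring. A minimal timing sketch for 100-element vectors, using only NumPy and the standard library (the vectors are random placeholders and the repetition count is arbitrary):

```python
import timeit
import numpy as np

# Placeholder data: two random vectors of length 100, as in the question.
rng = np.random.default_rng(0)
vec1 = rng.standard_normal(100)
vec2 = rng.standard_normal(100)

# Both expressions compute the same dot product.
dot_result = np.dot(vec1, vec2)
sum_result = (vec1 * vec2).sum()

# Time each variant over many repetitions; the count is arbitrary.
n = 10000
t_dot = timeit.timeit(lambda: np.dot(vec1, vec2), number=n)
t_sum = timeit.timeit(lambda: (vec1 * vec2).sum(), number=n)

print(f"np.dot:            {t_dot / n * 1e6:.2f} us per call")
print(f"(vec1*vec2).sum(): {t_sum / n * 1e6:.2f} us per call")
```

For vectors this short, per-call overhead tends to dominate the arithmetic, so the numbers mostly show which call path is cheaper on your installation.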

Multithreading on numpy/pandas matrix multiplication?

I really want to know how to utilize multi-core processing for matrix multiplication on numpy/pandas.
What I'm trying is here:
M = pd.DataFrame(...) # super high dimensional square matrix.
A = M.T.dot(M)
This takes a huge amount of processing time because of the many sums of products, and I think multithreading should be straightforward to apply to a huge matrix multiplication. I googled carefully, but I can't find how to do that in numpy/pandas. Do I need to write multithreaded code manually with one of Python's built-in threading libraries?

In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subprograms. You need to:
1. Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
2. Have a NumPy compiled to use such an implementation.
3. Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later, where these have been relaxed).
A few caveats apply:
1. Older versions of OpenBLAS, when compiled with GCC, run into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The reason is a bug (or lack of a feature) in GCC. A patch has been submitted but not included in the mainline sources yet.
2. The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.
As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.
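A small sketch of the checks described above, using only NumPy. The matrix is a random placeholder; with a DataFrame, the array would come from something like df.to_numpy(dtype=np.float64), and the result could be wrapped in a DataFrame again afterwards.

```python
import numpy as np

# Print which BLAS/LAPACK libraries this NumPy build is linked against.
np.show_config()

# Placeholder for the real data. A float64, C-contiguous array is the kind
# of input NumPy can hand straight to the BLAS matrix-multiply routine.
rng = np.random.default_rng(0)
M = np.ascontiguousarray(rng.standard_normal((500, 200)), dtype=np.float64)

# Same product as M.T.dot(M) in the question, computed on plain arrays.
A = M.T.dot(M)
print(A.shape, A.dtype)
```

Whether the multiplication then runs on several cores depends on the linked BLAS, which is what the show_config() output helps to identify.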

First of all, I would also propose converting to NumPy arrays and using NumPy's dot function. If you have access to an MKL build, which is more or less the fastest implementation at the moment, you should try setting the environment variable OMP_NUM_THREADS. This should activate the other cores of your system. On my Mac it seems to work properly. In addition, I would try np.einsum, which seems to be faster than np.dot.
But pay attention! If you compile a multithreaded library that uses OpenMP for parallelisation (like MKL), you have to consider that the "default gcc" on all Apple systems is not gcc; it is Clang/LLVM, and Clang is not able to build with OpenMP support at the moment, unless you use the OpenMP trunk, which is still experimental. So you have to install the Intel compiler or any other compiler that supports OpenMP.
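A minimal sketch of both suggestions, assuming the variable is set before NumPy is first imported in the process (otherwise the BLAS thread pool may already be configured). The thread count of 4, the matrix size and the repetition count are arbitrary placeholders, and whether np.einsum is faster than np.dot should be measured rather than assumed.

```python
import os

# Must be set before NumPy (and its BLAS) is first loaded to take effect;
# the value 4 is an arbitrary placeholder.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import timeit
import numpy as np

# Placeholder matrix with k rows and n columns.
rng = np.random.default_rng(0)
M = rng.standard_normal((400, 300))

# M^T M computed two ways: (M^T M)[i, j] = sum over k of M[k, i] * M[k, j].
A_dot = M.T.dot(M)
A_einsum = np.einsum("ki,kj->ij", M, M)

# Time both variants on this machine; the repetition count is arbitrary.
t_dot = timeit.timeit(lambda: M.T.dot(M), number=20)
t_einsum = timeit.timeit(lambda: np.einsum("ki,kj->ij", M, M), number=20)
print(f"dot: {t_dot:.4f} s, einsum: {t_einsum:.4f} s for 20 runs")
```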
