Writing kernel functions using PyOpenCL with AVX2 or AVX512?

Writing kernel functions using PyOpenCL with AVX2 or AVX512? - python

Am I restricted to using AVX2 or AVX512 depending on what family type my CPU is (if it is AVX2 or AVX512)?
I am writing an openCL program in Python using the PyOpenCL package, and I want to optimize the AVX2 SIMD technology. I know AVX2 is 256-bit instruction and AVX512 is 512-bit, so when I write my kernel function should I only use double4 variables in order to implement AVX2-style instruction? And vice-versa, double8 variables for AVX-512 style?
And my next question is: Am I restricted to what my CPU type supports? If it supports AVX-256, will I not be able to run double8 variables parallelization in my kernel function?
Sorry if my question is confusing because I am still in the process of learning this.
Thanks

It probably makes more sense to answer your questions in reverse order:
Am I restricted to what my CPU type supports? If it supports AVX-256, will I not be able to run double8 variables parallelization in my kernel function?
No, all OpenCL implementations which support double floating point types also allow you to write code using double8 types. What that gets compiled down to is entirely up to the implementation though.
If your CPU supports AVX-512, and your OpenCL implementation does too, there's a good chance it'll attempt to emit AVX-512 instructions. If CPU or implementation only support AVX2, it will internally probably attempt to break down your code into operating on each half of your double8 individually.
when I write my kernel function should I only use double4 variables in order to implement AVX2-style instruction? And vice-versa, double8 variables for AVX-512 style?
Your first reference for this sort of question should always be the OpenCL optimisation manual for your specific OpenCL implementation. For Intel's CPU runtime, this seems to be the relevant resource.
Depending on what your code does, the OpenCL implementation may be able to autovectorise your code even if your kernel uses scalar types, operating on arrays of doubles, assuming you've submitted a suitable number of work items.
If your code is naturally representable by vector types such as double8 and double4, go ahead and use them. The implementation will be able to split code using double8 into instructions using double4 internally as I've mentioned. You may find that this causes more register pressure, so using larger than necessary types can be slightly counterproductive. If you only want to write one variant of the code, go for the bigger vectors though - again, if the code can naturally be represented that way. If you have to go through contortions to make it fit, chances are you won't gain much.
If care about a few percent performance difference, you'll need to perform detailed profiling and try lots of different approaches anyway. It always heavily depends on your specific code, so it's very difficult to give general advice.

Related

Solving large sparse linear system of quations Python vs Matlab [duplicate]

I want to compute magnetic fields of some conductors using the Biot–Savart law and I want to use a 1000x1000x1000 matrix. Before I use MATLAB, but now I want to use Python. Is Python slower than MATLAB ? How can I make Python faster?
EDIT:
Maybe the best way is to compute the big array with C/C++ and then transfering them to Python. I want to visualise then with VPython.
EDIT2: Which is better in my case: C or C++?

You might find some useful results at the bottom of this link
http://wiki.scipy.org/PerformancePython
From the introduction,
A comparison of weave with NumPy, Pyrex, Psyco, Fortran (77 and 90) and C++ for solving Laplace's equation.
It also compares MATLAB and seems to show similar speeds to when using Python and NumPy.
Of course this is only a specific example, your application might be allow better or worse performance. There is no harm in running the same test on both and comparing.
You can also compile NumPy with optimized libraries such as ATLAS which provides some BLAS/LAPACK routines. These should be of comparable speed to MATLAB.
I'm not sure if the NumPy downloads are already built against it, but I think ATLAS will tune libraries to your system if you compile NumPy,
http://www.scipy.org/Installing_SciPy/Windows
The link has more details on what is required under the Windows platform.
EDIT:
If you want to find out what performs better, C or C++, it might be worth asking a new question. Although from the link above C++ has best performance. Other solutions are quite close too i.e. Pyrex, Python/Fortran (using f2py) and inline C++.
The only matrix algebra under C++ I have ever done was using MTL and implementing an Extended Kalman Filter. I guess, though, in essence it depends on the libraries you are using LAPACK/BLAS and how well optimised it is.
This link has a list of object-oriented numerical packages for many languages.
http://www.oonumerics.org/oon/

NumPy and MATLAB both use an underlying BLAS implementation for standard linear algebra operations. For some time both used ATLAS, but nowadays MATLAB apparently also comes with other implementations like Intel's Math Kernel Library (MKL). Which one is faster by how much depends on the system and how the BLAS implementation was compiled. You can also compile NumPy with MKL and Enthought is working on MKL support for their Python distribution (see their roadmap). Here is also a recent interesting blog post about this.
On the other hand, if you need more specialized operations or data structures then both Python and MATLAB offer you various ways for optimization (like Cython, PyCUDA,...).
Edit: I corrected this answer to take into account different BLAS implementations. I hope it is now a fair representation of the current situation.

The only valid test is to benchmark it. It really depends on what your platform is, and how well the Biot-Savart Law maps to Matlab or NumPy/SciPy built-in operations.
As for making Python faster, Google's working on Unladen Swallow, a JIT compiler for Python. There are probably other projects like this as well.

As per your edit 2, I recommend very strongly that you use Fortran because you can leverage the available linear algebra subroutines (Lapack and Blas) and it is way simpler than C/C++ for matrix computations.
If you prefer to go with a C/C++ approach, I would use C, because you presumably need raw performance on a presumably simple interface (matrix computations tend to have simple interfaces and complex algorithms).
If, however, you decide to go with C++, you can use the TNT (the Template Numerical Toolkit, the C++ implementation of Lapack).
Good luck.

If you're just using Python (with NumPy), it may be slower, depending on which pieces you use, whether or not you have optimized linear algebra libraries installed, and how well you know how to take advantage of NumPy.
To make it faster, there are a few things you can do. There is a tool called Cython that allows you to add type declarations to Python code and translate it into a Python extension module in C. How much benefit this gets you depends a bit on how diligent you are with your type declarations - if you don't add any at all, you won't see much of any benefit. Cython also has support for NumPy types, though these are a bit more complicated than other types.
If you have a good graphics card and are willing to learn a bit about GPU computing, PyCUDA can also help. (If you don't have an nvidia graphics card, I hear there is a PyOpenCL in the works as well). I don't know your problem domain, but if it can be mapped into a CUDA problem then it should be able to handle your 10^9 elements nicely.

And here is an updated "comparison" between MATLAB and NumPy/MKL based on some linear algebra functions:
http://dpinte.wordpress.com/2010/03/16/numpymkl-vs-matlab-performance/
The dot product is not that slow ;-)

I couldn't find much hard numbers to answer this same question so I went ahead and did the testing myself. The results, scripts, and data sets used are all available here on my post on MATLAB vs Python speed for vibration analysis.
Long story short, the FFT function in MATLAB is better than Python but you can do some simple manipulation to get comparable results and speed. I also found that importing data was faster in Python compared to MATLAB (even for MAT files using the scipy.io).

I would also like to point out that Python (+NumPy) can easily interface with Fortran via the F2Py module, which basically nets you native Fortran speeds on the pieces of code you offload into it.

Where is #cupy.fuse cupy python decorator documented?

I've seen some demos of #cupy.fuse which is nothing short of a miracle for GPU programming using Numpy syntax. The major problem with cupy is that each operation like adding is a full kernel launch, then kernel free. SO a series of adds and multiplies, for example, pay a lot of kernel pain. (
This is why one might be better off using numba #jit)
#cupy.fuse() appears to fix this by merging all the operations inside the function to a single kernel creating a dramatic lowering of the launch and free costs.
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
Will cupy.fuse aggressively inline any python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
this enhancement log hints at this but doesn't say if composed functions are in same kernel or simply just allowed when called functions are also decorated.
https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with #fuse. I'm thinking that might impair the inlining not aid it since it might be rendering those functions into a non-fusable (maybe non-python) form.
If not, could I get automatic inlining by first decorating the function with #numba.jit then subsequently decorating with #fuse. Or would again the #jit render the resulting python in a non-fusable form?
What breaks #fuse? What are the pitfalls? is #fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py

SOME) ANSWERS: I have found answers to some of these questions that I'm positing here
questions:
fusing kernels is such a huge advance I don't understand when I would ever not want to use #fuse. isn't it always better? When is
it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
I'm wondering about it's composability
will #fuse inline the functions it finds within the decorated function?
Answer: Looking at timings, and nvvm markings it looks like it does pull in subroutines and fuse them into the kernel. So dividing things into subroutines rather than monolithic code will work with fuse.
I see that a bug fix in the release notes says that it can now handle calling other functions decorated with #fuse. But this does
not say if their kernels are fused or remain separate.
Answer: Looking at NVVM output it appears they are joined. It's hard to say is there is some residual overhead, but the timing doesn't show significant overheads indicating two separate kernels. The key thing is that it now works. As of cupy 4.1 you could not call a fused function from a fused function as the return types were wrong. But since 5.1 you can. However you do not need to decorate those functions. It just works whether you do or do not.
Why isn't it documented?
Answer: It appears to have some bugs and some incomplete functionality. The code also advises the API for it is subject to change.
However this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium size arrays. So it would be nice if even this alpha version were documented.

Python and C++ performance comparison

In a lecture I've encountered the following problem:
Given a simple program which computes the sum of a column in a large data set, performance of a python and a c++ implementation are being compared. The main bottleneck should be reading the data. The computation itself is rather simple. On first execution, the python version is about 2 times slower than c++ which makes sense.
Then on the second execution, the c++ program speeds up from 4 seconds to 1 second because apparently the "first execution is I/O bound, second is CPU bound". This still makes sense since probably the file contents were cached omitting the slow reading from disk.
However, the python implementation did not speed up at all on the second run, despite the warm cache. I know python is slow, but is it that slow? Does this mean that executing this simple computation in python is slower than reading about .7 GB from disk?
If this is always the case, I'm wondering why the biggest deep learning frameworks I know (PyTorch, tensorflow) have python apis. For real time object detection for example, it must be slower to parse the input (read frames from a video, maybe preprocess) to the network and to interpret the output, than performing the forward propagation itself on a gpu.
Have I misunderstood something? Thank you.

That's not so easy to answer without implementation details, but in general, python is known for it's much less cache friendliness, because you mostly haven't the option to low-level optimize cache behaviour in python. However, this isn't always correct. You propably can optimize the cache friendliness in python directly, or you use parts of c++ code for critical sections. But always consider, that you can just optimize your code better in C++. So if you have really critical code parts, where you want to achieve every percent of speed and effiency, you should use C++. That's the reason, that many programs use both, C++ for raw performance things and python for a nice interface and program structure.

What size of Clyther overhead?

I'm thinking about using Clyther for a high performance task. It is exciting to write OpenCL kernels using only python, but I'm wondering about the performance gap.
What are tasks that Clyther is good at? Bad at? Are Clyther-generated kernels good or not?
Is it possible to find some benchmarks?

As the documentation states, the main entry points for CLyther are its clyther.task and clyther.kernel decorators - once a function is decorated with one of these the function will be compiled to OpenCL when called.
CLyther is a compiler of a subset of the Python language. It compiles your Python subset code into OpenCL, so the actual run time of the kernel will not (or should not) differ much between interfaces to OpenCL. The actual overhead of CLyther (as with all interfaces with Python) comes from calling the OpenCL functions, or the moving of data between CLyther/Python and OpenCL.
Benchmarks showing CLyther's performance are available in the documentation. The source tarball contains the C++ and FORTRAN edition of the benchmarked program, a Laplace equation solver, so you can use them to reproduce the benchmark results yourself.
Personally, I believe that you can use CLyther effectively on the majority of problems in need of OpenCL computation.

Mixing cython and turbogears 2.1

Is it possible to integrate Cython and TG2? I have one computation (written in python) which is heavily numerical and would largly benefit from rewriting into C or cython.

Without you having additionally specificity in your question and not knowing what exactly you mean by 'integrate', all I can offer is that cython provides a fairly simple way of (often dramatically) speeding up certain code written in python either via static typing or calling external c/c++ libraries. If there is only a single numerical calculation that can be written in cython and then called from within TG2, then this is a good candidate for using cython. Your mileage will vary though depending on how much of it can be written in something that translates to pure C, versus something that relies heavily on the Python C-API.
Some (many actually) numerical calculations are also amenable to the type computations that numpy excels at, so if you haven't tried it, that may be another option.
In general though if you want a detailed answer, you should put an equivalent amount of detail in the question.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.