What size of Clyther overhead?

What size of Clyther overhead? - python

I'm thinking about using Clyther for a high performance task. It is exciting to write OpenCL kernels using only python, but I'm wondering about the performance gap.
What are tasks that Clyther is good at? Bad at? Are Clyther-generated kernels good or not?
Is it possible to find some benchmarks?

As the documentation states, the main entry points for CLyther are its clyther.task and clyther.kernel decorators - once a function is decorated with one of these the function will be compiled to OpenCL when called.
CLyther is a compiler of a subset of the Python language. It compiles your Python subset code into OpenCL, so the actual run time of the kernel will not (or should not) differ much between interfaces to OpenCL. The actual overhead of CLyther (as with all interfaces with Python) comes from calling the OpenCL functions, or the moving of data between CLyther/Python and OpenCL.
Benchmarks showing CLyther's performance are available in the documentation. The source tarball contains the C++ and FORTRAN edition of the benchmarked program, a Laplace equation solver, so you can use them to reproduce the benchmark results yourself.
Personally, I believe that you can use CLyther effectively on the majority of problems in need of OpenCL computation.

Related

Solving large sparse linear system of quations Python vs Matlab [duplicate]

I want to compute magnetic fields of some conductors using the Biot–Savart law and I want to use a 1000x1000x1000 matrix. Before I use MATLAB, but now I want to use Python. Is Python slower than MATLAB ? How can I make Python faster?
EDIT:
Maybe the best way is to compute the big array with C/C++ and then transfering them to Python. I want to visualise then with VPython.
EDIT2: Which is better in my case: C or C++?

You might find some useful results at the bottom of this link
http://wiki.scipy.org/PerformancePython
From the introduction,
A comparison of weave with NumPy, Pyrex, Psyco, Fortran (77 and 90) and C++ for solving Laplace's equation.
It also compares MATLAB and seems to show similar speeds to when using Python and NumPy.
Of course this is only a specific example, your application might be allow better or worse performance. There is no harm in running the same test on both and comparing.
You can also compile NumPy with optimized libraries such as ATLAS which provides some BLAS/LAPACK routines. These should be of comparable speed to MATLAB.
I'm not sure if the NumPy downloads are already built against it, but I think ATLAS will tune libraries to your system if you compile NumPy,
http://www.scipy.org/Installing_SciPy/Windows
The link has more details on what is required under the Windows platform.
EDIT:
If you want to find out what performs better, C or C++, it might be worth asking a new question. Although from the link above C++ has best performance. Other solutions are quite close too i.e. Pyrex, Python/Fortran (using f2py) and inline C++.
The only matrix algebra under C++ I have ever done was using MTL and implementing an Extended Kalman Filter. I guess, though, in essence it depends on the libraries you are using LAPACK/BLAS and how well optimised it is.
This link has a list of object-oriented numerical packages for many languages.
http://www.oonumerics.org/oon/

NumPy and MATLAB both use an underlying BLAS implementation for standard linear algebra operations. For some time both used ATLAS, but nowadays MATLAB apparently also comes with other implementations like Intel's Math Kernel Library (MKL). Which one is faster by how much depends on the system and how the BLAS implementation was compiled. You can also compile NumPy with MKL and Enthought is working on MKL support for their Python distribution (see their roadmap). Here is also a recent interesting blog post about this.
On the other hand, if you need more specialized operations or data structures then both Python and MATLAB offer you various ways for optimization (like Cython, PyCUDA,...).
Edit: I corrected this answer to take into account different BLAS implementations. I hope it is now a fair representation of the current situation.

The only valid test is to benchmark it. It really depends on what your platform is, and how well the Biot-Savart Law maps to Matlab or NumPy/SciPy built-in operations.
As for making Python faster, Google's working on Unladen Swallow, a JIT compiler for Python. There are probably other projects like this as well.

As per your edit 2, I recommend very strongly that you use Fortran because you can leverage the available linear algebra subroutines (Lapack and Blas) and it is way simpler than C/C++ for matrix computations.
If you prefer to go with a C/C++ approach, I would use C, because you presumably need raw performance on a presumably simple interface (matrix computations tend to have simple interfaces and complex algorithms).
If, however, you decide to go with C++, you can use the TNT (the Template Numerical Toolkit, the C++ implementation of Lapack).
Good luck.

If you're just using Python (with NumPy), it may be slower, depending on which pieces you use, whether or not you have optimized linear algebra libraries installed, and how well you know how to take advantage of NumPy.
To make it faster, there are a few things you can do. There is a tool called Cython that allows you to add type declarations to Python code and translate it into a Python extension module in C. How much benefit this gets you depends a bit on how diligent you are with your type declarations - if you don't add any at all, you won't see much of any benefit. Cython also has support for NumPy types, though these are a bit more complicated than other types.
If you have a good graphics card and are willing to learn a bit about GPU computing, PyCUDA can also help. (If you don't have an nvidia graphics card, I hear there is a PyOpenCL in the works as well). I don't know your problem domain, but if it can be mapped into a CUDA problem then it should be able to handle your 10^9 elements nicely.

And here is an updated "comparison" between MATLAB and NumPy/MKL based on some linear algebra functions:
http://dpinte.wordpress.com/2010/03/16/numpymkl-vs-matlab-performance/
The dot product is not that slow ;-)

I couldn't find much hard numbers to answer this same question so I went ahead and did the testing myself. The results, scripts, and data sets used are all available here on my post on MATLAB vs Python speed for vibration analysis.
Long story short, the FFT function in MATLAB is better than Python but you can do some simple manipulation to get comparable results and speed. I also found that importing data was faster in Python compared to MATLAB (even for MAT files using the scipy.io).

I would also like to point out that Python (+NumPy) can easily interface with Fortran via the F2Py module, which basically nets you native Fortran speeds on the pieces of code you offload into it.

Writing kernel functions using PyOpenCL with AVX2 or AVX512?

Am I restricted to using AVX2 or AVX512 depending on what family type my CPU is (if it is AVX2 or AVX512)?
I am writing an openCL program in Python using the PyOpenCL package, and I want to optimize the AVX2 SIMD technology. I know AVX2 is 256-bit instruction and AVX512 is 512-bit, so when I write my kernel function should I only use double4 variables in order to implement AVX2-style instruction? And vice-versa, double8 variables for AVX-512 style?
And my next question is: Am I restricted to what my CPU type supports? If it supports AVX-256, will I not be able to run double8 variables parallelization in my kernel function?
Sorry if my question is confusing because I am still in the process of learning this.
Thanks

It probably makes more sense to answer your questions in reverse order:
Am I restricted to what my CPU type supports? If it supports AVX-256, will I not be able to run double8 variables parallelization in my kernel function?
No, all OpenCL implementations which support double floating point types also allow you to write code using double8 types. What that gets compiled down to is entirely up to the implementation though.
If your CPU supports AVX-512, and your OpenCL implementation does too, there's a good chance it'll attempt to emit AVX-512 instructions. If CPU or implementation only support AVX2, it will internally probably attempt to break down your code into operating on each half of your double8 individually.
when I write my kernel function should I only use double4 variables in order to implement AVX2-style instruction? And vice-versa, double8 variables for AVX-512 style?
Your first reference for this sort of question should always be the OpenCL optimisation manual for your specific OpenCL implementation. For Intel's CPU runtime, this seems to be the relevant resource.
Depending on what your code does, the OpenCL implementation may be able to autovectorise your code even if your kernel uses scalar types, operating on arrays of doubles, assuming you've submitted a suitable number of work items.
If your code is naturally representable by vector types such as double8 and double4, go ahead and use them. The implementation will be able to split code using double8 into instructions using double4 internally as I've mentioned. You may find that this causes more register pressure, so using larger than necessary types can be slightly counterproductive. If you only want to write one variant of the code, go for the bigger vectors though - again, if the code can naturally be represented that way. If you have to go through contortions to make it fit, chances are you won't gain much.
If care about a few percent performance difference, you'll need to perform detailed profiling and try lots of different approaches anyway. It always heavily depends on your specific code, so it's very difficult to give general advice.

Julia for image processing and speech recognition

I recently stumbled upon Julia Language and I was surprised to see their claims. It claims to be many folds faster than languages like Python, which I'm currently using for machine learning algorithms on speech recogniton.
Their claim for example Fibonacci sequence is about 30 times faster than in Python.
I dont understand what is that it makes it so much faster. However what I really want to know is whether someone has actually verified this. If that is the case I would move from Python to Julia Language for the speech recognizer I am building as I am hitting the slow threshold of my response and cannot afford to make it anymore slower.
Also are there any projects on github(or anywhere) where I can find Julia projects which do heavy number crunching like image processing, speech recognition etc.
I searched upon some other link such as this one , but could not decide effectively. I can run the programs and verify their claims, but if someone has already done so, it would be helpful to me.

It's not that surprising; Python made a number of decisions that make it a fairly slow language, and Guido van Rossum, its creator, says "It is usually much more effective to take that one piece and replace that one function or module with a little bit of code you wrote in C or C++ rather than rewriting your entire system in a faster language, because for most of what you're doing, the speed of the language is irrelevant." As a general rule, any language that is concerned with speed will be faster than Python: C, C++, Ada, Java, Scala, Clojure, a number of other languages, all show up more then an order of magnitude faster in typical implementations than Python in benchmarks. Unless the authors of Julia have completely failed in their attempts to make a faster language, it will be faster than Python.

I dont understand what is that it makes it so much faster.
People often assume Julia is fast due to some sort of superior JIT compiler and that Python could be optimized in the same way since they are both dynamic languages.
However it is actually all due to clever language design choices. All functions in Julia are multiple dispatch unlike Python functions. That means at runtime Julia can pick one particular code implementation for every possible combination of arguments. Since types are immutable in Julia the compiled machine code handling a particular case of argument combinations can be cached.
In Julia all libraries have also been designed with type stability in mind, meaning for a given set of argument types a function will always return objects of the same type. E.g. so if you do a mathematical operation on two integers and it returns and integers, then most of the time the JIT compiler will know that it ALWAYS returns an integer and thus don't have to create lots of code to check for every possible case.
If you are uncertain about the performance of Julia, I suggest you actually look at some small code examples and play with the code generator. With #code_native, you can see the assembly code Julia generates for a function. You will find it is possible to emit basically the same code as a C compiler would do. Quite dynamic and complex looking code in Julia surprisingly often just turns into a few assembly code instructions.
The thing with Julia is that you can't just look at the performance of some random library existing doing what you want. It could be a quick port or not well optimized. It is easy to write slow Julia code if you have no idea what you are doing. The key thing is that Julia gives you quite good tools for writing highly optimized code.
If you need to you can typically manage to get performance close to C. But of course you have to spend some time optimizing your code. Julia has functions which allows you to analyze your functions and tell you were you did something causing potential performance problems.
Fortunately Julia performance benefits from lots of small functions, so you can typically analyze performance issues in very small isolated cases.

Regarding packages, there is a link to all registered "external packages" from the homepage. You may find some of the things you want there.

Object Tracking: MATLAB vs. Python Numpy

I will soon be starting a final year Engineering project, consisting of the real-time tracking of objects moving on a 2D-surface. The objects will be registered by my algorithm using feature extraction.
I am trying to do some research to decide whether I should use MATLAB or use Python Numpy (Numerical Python). Some of the factors I am taking into account:
1.) Experience
I have reasonable experience in both, but perhaps more experience in image processing using Numpy. However, I have always found MATLAB to be very intuitive and easy to pick up.
2.) Real-Time abilities
It is very important that my choice be able to support the real-time acquisition of video data from an external camera. I found this link for MATLAB showing how to do it. I am sure that the same would be possible for Python, perhaps using the OpenCV library?
3.) Performance
I have heard, although never used, that MATLAB can easily split independent calculations across multiple cores. I should think that this would be very useful, and I am not sure whether the same is equally simple for Numpy?
4.) Price
I know that there is a cost associated with MATLAB, but I will be working at a university and thus will have access to full MATLAB without any cost to myself, so price is not a factor.
I would greatly appreciate any input from anyone who has done something similar, and what your experience was.
Thanks!

Python (with NumPy, SciPy and MatPlotLib) is the new Matlab. So I strongly recommend Python over Matlab.
I made the change over a year ago and I am very happy with the results.
Here it is a short pro/con list for Python and Matlab
Python pros:
Object Oriented
Easy to write large and "real" programs
Open Source (so it's completely free to use)
Fast (most of the heavy computation algorithms have a python wrapper to connect with C libraries e.g. NumPy, SciPy, SciKits, libSVM, libLINEAR)
Comfortable environment, highly configurable (iPython, python module for VIM, ...)
Fast growing community of Python users. Tons of documentation and people willing to help
Python cons:
Could be a pain to install (especially some modules in OS X)
Plot manipulation is not as nice/easy as in Matlab, especially 3D plots or animations
It's still a script language, so only use it for (fast) prototyping
Python is not designed for multicore programming
Matlab pros:
Very easy to install
Powerful Toolboxes (e.g. SignalProcessing, Systems Biology)
Unified documentation, and personalized support as long as you buy the licence
Easy to have plot animations and interactive graphics (that I find really useful for running experiments)
Matlab cons:
Not free (and expensive)
Based on Java + X11, which looks extremely ugly (ok, I accept I'm completely biased here)
Difficult to write large and extensible programs
A lot of Matlab users are switching to Python :)

I would recommend python.
I switched from MATLAB -> python about 1/2 way through my phd, and do not regret it. At the most simplistic, python is a much nicer language, has real objects, etc.
If you expect to be doing any parts of your code in c/c++ I would definitely recommend python. The mex interface works, but if your build gets complicated/big it starts to be a pain and I never sorted out how to effectively debug it. I also had great difficulty with mex+allocating large blocks interacting with matlab's memory management (my inability to fix that issue is what drove me to switch).
As a side note/self promotion, I have Crocker-Grier in c++ (with swig wrappers) and pure python.

If you're experienced with both languages it's not really a decision criterion.
Matlab has problems coping with real time settings especially since most computer vision algorithms are very costly. This is the advantage of using a tried and tested library such as OpenCV where many of the algorithms you'll be using are efficiently implemented. Matlab offers the possibility of compiling code into Mex-files but that is a lot of work.
Matlab has parallel for loops parfor which makes multicore processing easy (or at least easier). But the question is if that will suffice to get real-time speeds.
No comment.
The main advantage of Matlab is that you'll obtain a running program very quickly due to its good documentation. But I found that code reusability is bad with Matlab unless you put a heavy emphasis on it.
I think the final decision has to be if you have to/can run your algorithm real-time which I doubt in Matlab, but that depends on what methods you're planning to use.

Others have made a lot of great comments (I've opined on this topic before in another answer https://stackoverflow.com/a/5065585/392949) , but I just wanted to point out that Python has a number of really excellent tools for parallel computing/splitting up work across multiple cores. Here's a short and by no means comprehensive list:
IPython Parallel toolkit: http://ipython.org/ipython-doc/dev/parallel/index.html
mpi4py: https://code.google.com/p/mpi4py
The multiprocessing module in the standard library: http://docs.python.org/library/multiprocessing.html
pyzmq: http://zeromq.github.com/pyzmq/ (what the IPython parallel toolkit is based on)
parallel python (pp): http://www.parallelpython.com/
Cython's wrapping of openmp: http://docs.cython.org/src/userguide/parallelism.html
You will also probably find cython to be much to be a vastly superior tool compared to what Matlab has to offer if you ever need to interface external C-libraries or write C-extensions, and it has excellent numpy support built right in.
There is a list with a bunch of other options here:
http://wiki.python.org/moin/ParallelProcessing

best way to extend python / numpy performancewise

As there are multitude of ways to write binary modules for python, i was hopping those of you with experience could advice on the best approach if i wish to improve the performance of some segments of the code as much as possible.
As i understand, one can either write an extension using the python/numpy C-api, or wrap some already written pure C/C++/Fortran function to be called from the python code.
Naturally, tools like Cython are the easiest way to go, but i assume that writing the code by hand gives better control and provide better performance.
The question, and it may be to general, is which approach to use. Write a C or C++ extension? wrap external C/C++ functions or use callback to python functions?
I write this question after reading chapter 10 in Langtangen's "Python scripting for computational science" where there is a comparison of several methods to interface between python and C.

I would say it depends on your skills/experience and your project.
If this is very ponctual and you are profficient in C/C++ and you have already written python wrapper, then write your own extension and interface it.
If you are going to work with Numpy on other project, then go for the Numpy C-API, it's extensive and rather well documented but it is also quite a lot of documentation to process.
At least I had a lot of difficulty processing it, but then again I suck at C.
If you're not really sure go Cython, far less time consuming and the performance are in most cases very good. (my choice)
From my point of view you need to be a good C coder to do better than Cython with the 2 previous implementation, and it will be much more complexe and time consuming.
So are you a great C coder ?
Also it might be worth your while to look into pycuda or some other GPGPU stuff if you're looking for performance, depending on your hardware of course.

A good comparison of several different approaches can be found here. I have tried both cython, and wrapping my own fortran code using f2py. I found that f2py was the better way to go for my purposes. This was partly influenced by the fact that I understand fortran, but honestly modern dialects such as fortran 90 will look reasonably similar to python code using numpy and shouldn't be that hard to pick up.
With cython you start with the slow, pure python code, then you have to go through a tedious process of instrumenting your code, finding out where all the calls to the python API are, and entering the relevant cython keywords at the right places to turn it into faster C code. With fortran, you just write normal code and you are already getting full compiled speed without going a messy iterative process.
In addition, certain array operations in cython still result in slow calls to the Python API, particularly those involved in slice operations. In constrast, arrays in fortran are native types which the compiler understand and can optimise. Having said that, cython is advancing pretty rapidly, so this may change in the future.
The biggest downside I have found with f2py is that it doesn't support arrays of derived types (analogous to numpy's recarray). There was some hope that fwrap would be a replacement for f2py which would solve this issues, but it appears to be on the backburner at the moment. Incidentally it is based on cython.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.