Prebuilt numpy with BLAS/ATLAS?

Prebuilt numpy with BLAS/ATLAS? - python

I'm implementing a real-time LMS algorithm, and numpy.dot takes more time than my sampling time, so I need numpy to be faster (my matrices are 1D and 100 long).
I've read about building numpy with ATLAS and such, but never done such thing and spent all my day trying to do it, with zero succes...
Can someone explain why there aren't builds with ATLAS included? Can anyone provide me with one? Is there any other way to speed up dot product?
I've tried numba, and scipy.linalg.gemm_dot but none of them seemed to speed things up.
my system is Windows8.1 with Intel processor

If you download the official binaries, they should come linked with ATLAS. If you want to make sure, check the output of np.show_config(). The problem is that ATLAS (Automatically Tuned Linear Algebra System) checks many different combinations and algorithms, and keeps the best at compile time. So, when you run a precompiled ATLAS, you are running it optimised for a computer different than yours.
So, your options to improve dot are:
Compile ATLAS yourself. On Windows it may be a bit challenging, but it is doable. Note: you must use the same compiler used to compile Python. That is, if you decide to go for MinGW, you need to get Python compiled with MinGW, or build it yourself.
Try Christopher Gohlke's Numpy. It is linked against MKL, that is much faster than ATLAS (and does all the optimisations at run time).
Try Continuum analytics' Conda with accelerate (also linked with MKL). It costs money, unless you are an academic. In Linux, Conda is slower than system python because they have to use an old compiler for compatibility purposes; I don't know if that is the case on Windows.
Use Linux. Your Python life will be much easier, setting up the system to compile stuff is very easy. Also, setting up Cython is simple too, and then you can compile your whole algorithm, and probably get further speed up.
The note regarding Cython is valid for Windows too, it is just more difficult to get it working. I tried a few years ago (when I used Windows), and failed after a few days; I don't know if the situation has improved.
Alternative:
You are doing the dot product of two vectors. Then, np.dot is probably not the most efficient way. I would give a shot to spell it out in plain Python (vec1*vec2).sum() (could be very good for Numba, this expression it can actually optimise) or using numexpr:
ne.evaluate(`sum(vec1 * vec2)`)
Numexpr will also parallelise the expression automatically.

Related

Optimizing heavy computations in Python code for package to be distributed (with Numba or Cython)

I have a Python package that I'm distributing and I need to include in it a function that does some heavy computation that I can't find programmed in Numpy as Scipy (namely, I need to include a function to compute a variogram with two variables, also called a cross-variogram).
Since these have to be calculated for arrays of over 20000 elements, I need to optimize the code. I have successfully optimized the code (very easily) using Numba and I'm also trying to optimize it using Cython. From what I've read, there is little difference on the final run-time with both, just the steps change.
The problem is: optimizing this code on my computer is relatively easy, but I don't know how to include the code and its optimized (compiled) version in my github package for other users.
I'm thinking I'm going to have to put only the python/cython source code and tweak the setup.py around so it re-compiles in for every user that install the package. If that is the case, I'm not sure if I should use Numba or Cython since Numba seems so much easier to use (at least from by experience) but is such a hassle to install (I don't want to force my users to install anaconda!).
To sum up, two questions:
1 Should this particular piece of code indeed be re-compiled in every user's computer?
2 If so, it is more portable to use Numba or Cython? If not, should I just provide the .so I compiled in my computer?

Multithreading on numpy/pandas matrix multiplication?

I really want to know how to utilize multi-core processing for matrix multiplication on numpy/pandas.
What I'm trying is here:
M = pd.DataFrame(...) # super high dimensional square matrix.
A = M.T.dot(M)
This takes huge processing time because of many sums of products, and I think it's straightforward to use multithreading for huge matrix multiplication. So, I was googling carefully, but I can't find how to do that on numpy/pandas. Do I need to write multi thread code manually with some python built-in threading library?

In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subroutines. You need to:
Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
Have a NumPy compiled to use such an implementation.
Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later where these have been relaxed).
A few caveats apply:
Older versions of OpenBLAS, when compiled with GCC, runs into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The reason is a bug (or lack of a feature) in GCC. A patch has been submitted but not included in the mainline sources yet.
The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.
As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.

First of all I would also propose to convert to bumpy arrays and use numpys dot function. If you have access to an MKL build which is more or less the fastest implementation at the moment, you should try to set the environment variable OMP_NUM_THREADS. This should activate the other cores of your system. On my MAC it seems to work properly. In addition I would try to use np.einsum which seems to be faster than np.dot
But pay attention! If you have compiled an multithreaded library that is using OpenMP for parallelisation (like MKL), you have to consider, that the "default gcc" on all apple systems is not gcc, it is Clang/LLVM and Clang ist not able to build with OpenMP support at the moment, except you use the OpenMP trunk which is still experimental. So you have to install the intel compiler or any other that supports OpenMP

Python's freeze.py doesn't install on Windows

I have been looking for the freeze.py utility which is supposed to come bundled with Python 3 in a Python 3.3 Windows install (albeit with distribute and pip installed) and haven't found it. The utility can be downloaded directly out of the Python svn repository here, but I'm wondering: does freeze come with a standard Windows Python 3 install?

It looks like Windows binary installations of Python don't come with the freeze tool. And there's apparently a good reason for this. According to the freeze README in the source tree:
Under Windows 95 or NT, you must use the -p option and point it to the top of the Python source tree.
If you read the whole section, it comes down to this: On Windows, freeze only works if you've built Python from source, and have the resulting tree sitting around to be used for freezing. So, there's no good reason to give you freeze in binary installations.
Meanwhile, I probably should have asked this in the first place, but… are you sure you want freeze in the first place?
The freeze utility is very out of date (you might have guessed that from the README talking about requiring VC++ 5.0, Windows 95 or NT 4.0, etc.). It also never worked that well on Windows (as you can tell from the documentation describing it as a utility "… to compile executables for Unix systems"). And there's just a lot of things it can't handle, or handles badly. At this point should probably be considered more as example code than as a useful tool.
There are a number of third-party alternatives out there: cx_freeze, py2exe, PyInstaller, etc. If you search PyPI for "freeze" (and other terms that seem reasonable), you will find a bunch of these alternatives. If your goal is to create a standalone executable out of your Python script (which, btw, freeze can never do on Windows anyway), experiment with a few of these and pick the one you like best.
If your goal is something different, the right tool will be different—you might be better off using venv or just zipping up a user site-packages directory or creating a local PyPI server.
In the comments, you said:
What I was actually looking for is a tool to convert Python code to C code. Apparently, that's impossible.
It's not impossible, it's just not what freeze (or its successors/competitors) does. Cython compiles almost a strict superset of Python to C code, although it's C code that uses Python runtime objects (except where you explicitly statically declare variables and functions with C types). If C++ is an acceptable alternative to C, Shed Skin compiles a restricted subset of Python 2.6 (using native C++ objects, and using type inference so you don't have to statically declare your types).
The question is why you want to compile Python code to C.
If you're looking to optimize some slow code, Cython is great at speeding up small pieces of bottleneck code. It takes a bit of effort (deciding what to move to Cython, what static type declarations to put in, etc.), but the curve of payoff to effort is pretty solid. Shed Skin takes a lot less effort—if it works, it just speeds up everything, automatically—but it also means you can't write a lot of idiomatic Python code in the first place. But really, before looking at either, you should consider PyPy, a complete implementation of Python 2.7.3 (and hopefully 3.3 soon) in a JIT-compiling interpreter, that often offers similar speedups, with pretty much no tradeoffs at all. Or, alternatively, you may just need to rewrite slow code to take advantage of already-optimized libraries (numpy instead of mapping over lists, itertools instead of explicit loops, lxml instead of html.parse, …).
If you're looking to write Python code that can interact directly with C code, without all the headaches of ctypes (or manually building Python bindings), Cython scores again. Cython code can effectively natively call both Python code and C code, and the compiler makes it all work like magic.
If you're looking to get C code that you can read, maintain, and improve on… there, you're out of luck. And this one may actually be impossible. Idiomatic Python code is just so different from idiomatic C code that it's hard to imagine how you could translate one into the other.
If you're wondering what the underlying problem is:
As far as I can tell, freeze makes a lot of assumptions about how things are laid out. It should be enough to have any Python installation that can build C extension modules and embedding apps, but it's not, because freeze goes under the covers and expects that building to work in specific ways. A standard binary installation on almost every *nix platform ends up looking like what freeze expects,* but a standard binary installation on Windows looks completely different.
It's not impossible to hack things up using Windows symlinks (at least if you have Vista or later and a drive with a modern version of NTFS) to get everything organized the way freeze expects (I found a blog where someone did that with 2.7.1…), but really, I don't think it's worth trying. It will be a lot of work (especially if you're just learning this stuff), and there's no guarantee you won't immediately run into another problem.
* This isn't actually true. On a Mac, both Apple's pre-installed Python and the binary installers at python.org actually give you the files organized as a Mac framework—but they provide a bunch of symlinks that simulate the traditional layout, which is good enough. On most linux distros, and many other platforms, the binary python package doesn't include any of the development files at all—but once you install an add-on binary package named something like python-devel, then you've got the right layout. Anyway, none of this matters to you, because if you wanted to learn about dpkg dependencies or framework builds you wouldn't be using Windows, right?

Performance differences between python from package and python compiled from source

I would like to know if there are any documented performance differences between a Python interpreter that I can install from an rpm (or using yum) and a Python interpreter compiled from sources (with a priori well set flags for compilations).
I am using a Redhat 6.3 machine as Django/Apache/Mod_WSGI production server. I have already properly compiled everything in different setups and in different orders. However, I usually keep the build-dev dependencies on such machine. For some various ego-related (and more or less practical) reasons, I would like to use Python-2.7.3. By default, Redhat comes with Python-2.6.6. I think I could go with it but it would hurt me somehow (I would have to drop and find a replacement for a few libraries and my ego).
However, besides my ego and dependencies, I would like to know what would be the impact in terms of performance for a Django server.

If you compile with the exact same flags that were used to compile the RPM version, you will get a binary that's exactly as fast. And you can get those flags by looking at the RPM's spec file.
However, you can sometimes do better than the pre-built version. For example, you can let the compiler optimize for your specific CPU, instead of for "general 386 compatible" (or whatever the RPM was optimized for). Of course if you don't know what you're doing (or are doing it on purpose), it's always possible to build something slower than the pre-built version, too.
Meanwhile, 2.7.3 is faster in a few areas than 2.6.6. Most of them usually won't affect you, but if they do, they'll probably be a big win.
Finally, for the vast majority of Python code, the speed of the Python interpreter itself isn't relevant to your overall performance or scalability. (And when it is, you probably want to try PyPy, Jython, or IronPython to replace CPython.) This is especially true for a WSGI service. If you're not doing anything slow, Apache will probably be the bottleneck. If you are doing anything slow, it's probably something I/O bound and well outside of Python's control (like reading files).
Ultimately, the only way you can know how much gain you get is by trying it both ways and performance testing. But if you just want a rule of thumb, I'd say expect a 0% gain, and be pleasantly surprised if you get lucky.

Speeding up Parts of Existing Python App with PyPy or Shedskin

I am looking to bring speed improvements to an existing application and I'm looking for advice on my possible options. The application is written in Python, uses wxPython, and is packaged with py2exe (I only target windows platforms). Parts of the application are computationally intensive and run too slowly in interpreted Python. I am not familiar with C, so porting parts of the code over is not really an option for me.
So my question is basically do I have a clear picture of my options as I outline below, or am I approaching this from the wrong direction?
Running with pypy: Today I started experimenting with Pypy - the results are exciting, in that I can run large parts of the code from the pypy interpreter and I'm seeing 5x+ speed improvements with no code changes. However, if I understand correctly, (a) Pypy with wxpython support is still a work in progress, and (b) I cannot compile it down to an exe for distribution anyway. So unless I'm mistaken, this seems like a no-go for me? There's no way to package things up so parts of it are executed with pypy?
Converting code to RPython, translating with pypy So the next option seems to be actually rewriting parts of the code to the pypy restricted language, which seems like a pretty large job. But if I do that, parts of the code can then be compiled to an executable (?) and then I can access the code through ctypes (?).
Other restricted options Shedskin seems to be a popular alternative here, does this fit my requirements better? Other options seem to be Cpython, Psyco, and Unladen, but they are all superseded or no longer maintained.

Using PyPy indeed rules out py2exe and similar tools, at least until one is ported (AFAIK there is no active work on that). Still, as PyPy binaries do not need to be installed, you might get away with a more complicated distribution that includes both your Python source code and a PyPy binary+stdlib and uses a small wrapper (batch file, executable) to ease launching. I can't comment on whether WxPython on PyPy is mature enough to be used, but perhaps someone on pypy-dev, wxpython-dev or either one's IRC channel can give a recommendation if you describe your situation.
Translating your code into RPython does not seem viable to me. The translation toolchain is not really a tool for general purpose development, and producing a C dll for embedding/ctypes seems nontrivial. Also, RPython code really is low-level, making your Python code restricted enough may amount to rewriting half of it.
As for other restricted options: You seem to mix up CPython (the original Python interpreter written in C) with Cython (a compiler for a Python-like language that emits C code suitable for CPython extension modules). Both projects are active. I'm not very familiar with Shedskin, but it seems to be a tool for developing whole programs, with little or no interaction with non-restricted Python code. Cython seems a much better fit: Although it requires manual type annotations and lower-level code to achieve really good performance, it's trivial to use from Python: The very purpose of the project is producing extension modules.

I would definitely look into Cython, I've been playing with it some and have seen speedups of ~100x over pure python. Use the profile module to find the bottlenecks first. Usually the loops are the biggest chances to increase speed when going to Cython. You should also look into seeing if you can use array/vector operations in Numpy instead of loops, if so that can also give extreme performance boosts. For instance:
a = range(1000000)
for i in range(len(a)):
a[i] += 5
is slow, real slow. On the other hand:
a = numpy.arange(10000000)
a = a +5
is fast, real fast.

Correction: shedskin can be used to generare extention modules, as well as whole programs.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.