I want to know how to use multi-core processing for matrix multiplication with numpy/pandas.
Here is what I'm trying:
M = pd.DataFrame(...) # a very high-dimensional square matrix
A = M.T.dot(M)
This takes a huge amount of processing time because of the many sums of products, and multithreading seems like a natural fit for large matrix multiplication. I searched carefully, but I can't find how to do that with numpy/pandas. Do I need to write multithreaded code manually with Python's built-in threading library?
In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subprograms. You need to:
Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
Have a NumPy compiled to use such an implementation.
Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later where these have been relaxed).
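For instance, you can check both requirements from an interactive session. A minimal sketch (the exact show_config() output depends on how your NumPy was built):

import numpy as np

# Shows which BLAS/LAPACK libraries NumPy was linked against;
# look for sections mentioning openblas, atlas or mkl.
np.show_config()

# float64 operands (the default for random matrices) are dispatched
# to the BLAS gemm routine, which is the multithreaded path.
M = np.random.rand(2000, 2000)
A = M.T.dot(M)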
A few caveats apply:
Older versions of OpenBLAS, when compiled with GCC, run into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The reason is a bug (or a missing feature) in GCC. A patch has been submitted but not yet included in the mainline sources.
The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.
As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.
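A minimal sketch of that round trip, assuming M is a square DataFrame as in the question:

import numpy as np
import pandas as pd

M = pd.DataFrame(np.random.rand(1000, 1000))

# Do the multiplication on the underlying NumPy arrays so the
# BLAS routine is definitely used...
A_values = np.dot(M.values.T, M.values)

# ...then wrap the result back into a labelled DataFrame.
A = pd.DataFrame(A_values, index=M.columns, columns=M.columns)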
First of all, I would also propose converting to NumPy arrays and using NumPy's dot function. If you have access to an MKL build, which is more or less the fastest implementation at the moment, you should try setting the environment variable OMP_NUM_THREADS. This should activate the other cores of your system. On my Mac it seems to work properly. In addition, I would try np.einsum, which seems to be faster than np.dot.
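A minimal sketch of both suggestions; note that OMP_NUM_THREADS must be read before the threaded library starts up, so set it before importing NumPy (or in the shell that launches Python):

import os
os.environ['OMP_NUM_THREADS'] = '4'  # use 4 threads; must precede the import

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

c1 = np.dot(a, b)                  # BLAS gemm, multithreaded under MKL/OpenBLAS
c2 = np.einsum('ij,jk->ik', a, b)  # the same matrix product written as an einsum

assert np.allclose(c1, c2)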
But pay attention! If you have compiled a multithreaded library that uses OpenMP for parallelisation (like MKL), you have to consider that the "default gcc" on all Apple systems is not GCC; it is Clang/LLVM, and Clang is not able to build with OpenMP support at the moment, unless you use the OpenMP trunk, which is still experimental. So you have to install the Intel compiler or any other compiler that supports OpenMP.
Related
I got some speedup in my code when I linked my numpy against MKL. It's still not fast enough, so we are considering using Cython. The approach I have in mind is to use CythonGSL to perform the expensive functions in Cython using GSL's BLAS functions. However, there's a chance this is a waste of time, because numpy is already making MKL do some of its work.
However, I don't know how much is being done by MKL, and exactly what. The expensive bits of my code are np.sums and np.dots. I suspect that with MKL linked the code is already as optimized as it can be, but I'm not sure. So can someone who knows about numpy + MKL's behavior tell me whether I'm probably wasting my time with a Cython implementation?
Don't do it! There are zero gains to be made by going to GSL for BLAS operations; it is just linked to some other BLAS implementation depending on how you built it. What does the code look like, and why do you think it is slow? Have a look here in the meantime:
Benchmarking (python vs. c++ using BLAS) and (numpy)
People make all sorts of assumptions about why things are fast or slow. It usually becomes obvious where the problem is when you see the code and realize that it might be doing unnecessary copying of matrices, etc.
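As an illustration of the copying point (a hypothetical example, not the asker's code): both lines below compute the same product, but the first allocates a fresh output array on every call, while the second reuses a preallocated buffer via dot's out= argument, which can matter inside a tight loop:

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
out = np.empty((1000, 1000))

c = np.dot(a, b)       # allocates a new result array on every call
np.dot(a, b, out=out)  # writes into the preallocated buffer instead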
I have a Python package that I'm distributing, and I need to include in it a function that does some heavy computation that I can't find implemented in NumPy or SciPy (namely, I need to include a function to compute a variogram with two variables, also called a cross-variogram).
Since these have to be calculated for arrays of over 20000 elements, I need to optimize the code. I have successfully optimized the code (very easily) using Numba, and I'm also trying to optimize it using Cython. From what I've read, there is little difference in the final run time between the two; only the steps change.
The problem is: optimizing this code on my computer is relatively easy, but I don't know how to include the code and its optimized (compiled) version in my GitHub package for other users.
I'm thinking I'll have to include only the Python/Cython source code and tweak setup.py so that it recompiles for every user who installs the package (a sketch of this appears below). If that is the case, I'm not sure whether I should use Numba or Cython, since Numba seems so much easier to use (at least in my experience) but is such a hassle to install (I don't want to force my users to install Anaconda!).
To sum up, two questions:
1. Should this particular piece of code indeed be recompiled on every user's computer?
2. If so, is it more portable to use Numba or Cython? If not, should I just provide the .so I compiled on my computer?
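For the Cython route, a common pattern is to ship the .pyx source together with the generated .c file and let setup.py build the extension at install time, falling back to the pre-generated C when the user has no Cython. A minimal sketch (the names cross_variogram and my_package are placeholders):

from distutils.core import setup
from distutils.extension import Extension

try:
    # Use Cython to regenerate the C code if the user has it installed...
    from Cython.Build import cythonize
    extensions = cythonize([Extension('cross_variogram',
                                      ['cross_variogram.pyx'])])
except ImportError:
    # ...otherwise compile the pre-generated C file shipped with the package.
    extensions = [Extension('cross_variogram', ['cross_variogram.c'])]

setup(
    name='my_package',
    version='0.1',
    ext_modules=extensions,
)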
There is a project called JyNI that is supposed to let you run NumPy in Jython. However, I haven't found an explanation anywhere of how to get NumPy into Jython. I've tried pip install numpy (which works for normal Python 3.4.3), but it gives an error about a missing py3k module. Does anybody have more information about this?
JyNI does state NumPy support as its main goal, but cannot deliver it yet while it is still in alpha state.
However, until it is mature enough you can use NumPy via
JEP (https://github.com/mrj0/jep) or
JPY (https://github.com/bcdev/jpy).
Alternatively you can use a Java numerical library for your computation, e.g. one of these:
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are Java libs that do numerical processing natively, backed by BLAS or LAPACK (i.e. the same backends NumPy uses), so the performance should more or less equal that of NumPy. However, they don't feature as nice a multiarray implementation as NumPy does, as far as I know.
If you need NumPy indirectly to fulfill the dependencies of some other framework, these solutions won't do it out of the box. If the dependencies are only marginal, you can maybe rewrite/substitute the corresponding calls based on one of the named projects. Otherwise you'll have to wait for JyNI...
If you can get some framework running on Jython this way, please consider making your work publicly available, ideally as a fork of the framework.
I'm implementing a real-time LMS algorithm, and numpy.dot takes more time than my sampling time, so I need numpy to be faster (my matrices are 1D and 100 elements long).
I've read about building numpy against ATLAS and such, but I have never done such a thing and spent all day trying to, with zero success...
Can someone explain why there aren't builds with ATLAS included? Can anyone provide me with one? Is there any other way to speed up the dot product?
I've tried numba, and scipy.linalg.gemm_dot, but neither of them seemed to speed things up.
My system is Windows 8.1 with an Intel processor.
If you download the official binaries, they should come linked with ATLAS. If you want to make sure, check the output of np.show_config(). The problem is that ATLAS (Automatically Tuned Linear Algebra Software) tries many different combinations and algorithms at compile time and keeps the best ones. So when you run a precompiled ATLAS, you are running it optimised for a computer different from yours.
So, your options to improve dot are:
Compile ATLAS yourself. On Windows it may be a bit challenging, but it is doable. Note: you must use the same compiler used to compile Python. That is, if you decide to go for MinGW, you need to get Python compiled with MinGW, or build it yourself.
Try Christoph Gohlke's NumPy. It is linked against MKL, which is much faster than ATLAS (and does all the optimisations at run time).
Try Continuum Analytics' Conda with Accelerate (also linked against MKL). It costs money, unless you are an academic. On Linux, Conda is slower than the system Python because they have to use an old compiler for compatibility purposes; I don't know if that is the case on Windows.
Use Linux. Your Python life will be much easier, and setting up the system to compile stuff is very easy. Also, setting up Cython is simple, and then you can compile your whole algorithm and probably get a further speed-up.
The note regarding Cython is valid for Windows too; it is just more difficult to get it working. I tried a few years ago (when I used Windows) and failed after a few days; I don't know if the situation has improved.
Alternative:
You are doing the dot product of two vectors. In that case, np.dot is probably not the most efficient way. I would give a shot to spelling it out in plain Python, (vec1 * vec2).sum() (this could be very good for Numba, which can actually optimise that expression), or to using numexpr:
import numexpr as ne
result = ne.evaluate('sum(vec1 * vec2)')
Numexpr will also parallelise the expression automatically.
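A rough way to compare the three candidates on your own machine; a sketch (at this vector size the Python call overhead dominates, so measure rather than assume):

import timeit

setup = '''
import numpy as np
import numexpr as ne
vec1 = np.random.rand(100)
vec2 = np.random.rand(100)
'''

# Time each approach on 100-element vectors, 100000 calls each.
print(timeit.timeit('np.dot(vec1, vec2)', setup=setup, number=100000))
print(timeit.timeit('(vec1 * vec2).sum()', setup=setup, number=100000))
print(timeit.timeit("ne.evaluate('sum(vec1 * vec2)')", setup=setup, number=100000))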
We have some Java code that we want to use together with new code we plan to write in Python, hence our interest in Jython. However, we also want to use the numpy and pandas libraries to do complex statistical analysis in this Python code.
Is it possible to call numpy and pandas from Jython?
Keep an eye on JyNI, which is at version alpha.2 as of March 2014.
Not directly.
One option, which I've used in the past, is to use jsonrpclib (which works for both) to communicate between Python and Jython. There's even a built-in server, which makes things quite simple. You'll just need to figure out whether the gains from using numpy are worth the additional overhead. A sketch follows.
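A minimal sketch of that setup; the method name dot and the port are placeholders. On the CPython side, where numpy is available:

from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer
import numpy as np

def dot(vec1, vec2):
    # JSON carries plain lists, so convert at the boundary and return
    # a plain float that the JSON encoder can serialise.
    return float(np.dot(np.asarray(vec1), np.asarray(vec2)))

server = SimpleJSONRPCServer(('localhost', 8080))
server.register_function(dot)
server.serve_forever()

And on the Jython side:

import jsonrpclib

proxy = jsonrpclib.Server('http://localhost:8080')
print(proxy.dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # -> 32.0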
Especially if you don't want to use raw NumPy but other Python frameworks that depend on it, JyNI will be the way to go once it is mature. However, it is not yet capable of importing NumPy.
Until then you can use NumPy from Java by embedding CPython. See the Numpy4J project for this (I didn't test it myself, though).
You can't use numpy from Jython at this time. But if you're willing to use CPython instead of Jython, there are some open-source Java projects that work with numpy (and presumably pandas):
Jep
jpy
JyNI