Python FFTN slow in comparison to MATLAB

Python FFTN slow in comparison to MATLAB - python

Dear stackoverflow community!
In a previous stackoverflow question, I mentioned that python's np.fft.fftn() routine seems somehow slow compared to MATLAB, provided that the datacubes are rather big (grids of dimension 512x512x1921, datatype float) (see Comparatively slow python numpy 3D Fourier Transformation). I think that MATLAB adopts the FFTW algorithm and could therefore be faster (~5s compared to ~185s (time.time())), so I was suggested to try pyFFTW for a time reduction.
The problem now is that at my work place python packages are implemented via anaconda for a large number of computers and the pyFFTW package cannot be easily integrated therein. There's somehow also a problem that long datatypes are not recognized and therefore compilation does not work at all. pyFFTW also conflicts with the internal FFTW implementation. Even if somehow installed, it would be overriden by the next update of the system.
I'm however not sure whether the different algorithm alone would explain the difference in computation time. As already written in the previous questions, I really need these FFTs for my work.
Another issue concerns the striding of the output array of np.fft.fftn(), which is switched to FORTRAN structure automatically (which again is the opposite of the default). This causes low performace when operating on the output in combination with C-strided grids (see Python numpy.fft changes strides).
So as a follow-up to my original questions, I want to ask you:
(MAIN) What other reasons might there for python to be so much slower? What can be done about it? I'd like to stay with python if possible and rather not switch to MATLAB just because of such things...
(SIDE) Is there any keyword to preserve striding? Using scipy is not a good option and copying the array to a new one to get the strides correctly also seems an unnecessarily complicated step requiring additional computation time.
Thanks for the help!

Related

Solving large sparse linear system of quations Python vs Matlab [duplicate]

I want to compute magnetic fields of some conductors using the Biot–Savart law and I want to use a 1000x1000x1000 matrix. Before I use MATLAB, but now I want to use Python. Is Python slower than MATLAB ? How can I make Python faster?
EDIT:
Maybe the best way is to compute the big array with C/C++ and then transfering them to Python. I want to visualise then with VPython.
EDIT2: Which is better in my case: C or C++?

You might find some useful results at the bottom of this link
http://wiki.scipy.org/PerformancePython
From the introduction,
A comparison of weave with NumPy, Pyrex, Psyco, Fortran (77 and 90) and C++ for solving Laplace's equation.
It also compares MATLAB and seems to show similar speeds to when using Python and NumPy.
Of course this is only a specific example, your application might be allow better or worse performance. There is no harm in running the same test on both and comparing.
You can also compile NumPy with optimized libraries such as ATLAS which provides some BLAS/LAPACK routines. These should be of comparable speed to MATLAB.
I'm not sure if the NumPy downloads are already built against it, but I think ATLAS will tune libraries to your system if you compile NumPy,
http://www.scipy.org/Installing_SciPy/Windows
The link has more details on what is required under the Windows platform.
EDIT:
If you want to find out what performs better, C or C++, it might be worth asking a new question. Although from the link above C++ has best performance. Other solutions are quite close too i.e. Pyrex, Python/Fortran (using f2py) and inline C++.
The only matrix algebra under C++ I have ever done was using MTL and implementing an Extended Kalman Filter. I guess, though, in essence it depends on the libraries you are using LAPACK/BLAS and how well optimised it is.
This link has a list of object-oriented numerical packages for many languages.
http://www.oonumerics.org/oon/

NumPy and MATLAB both use an underlying BLAS implementation for standard linear algebra operations. For some time both used ATLAS, but nowadays MATLAB apparently also comes with other implementations like Intel's Math Kernel Library (MKL). Which one is faster by how much depends on the system and how the BLAS implementation was compiled. You can also compile NumPy with MKL and Enthought is working on MKL support for their Python distribution (see their roadmap). Here is also a recent interesting blog post about this.
On the other hand, if you need more specialized operations or data structures then both Python and MATLAB offer you various ways for optimization (like Cython, PyCUDA,...).
Edit: I corrected this answer to take into account different BLAS implementations. I hope it is now a fair representation of the current situation.

The only valid test is to benchmark it. It really depends on what your platform is, and how well the Biot-Savart Law maps to Matlab or NumPy/SciPy built-in operations.
As for making Python faster, Google's working on Unladen Swallow, a JIT compiler for Python. There are probably other projects like this as well.

As per your edit 2, I recommend very strongly that you use Fortran because you can leverage the available linear algebra subroutines (Lapack and Blas) and it is way simpler than C/C++ for matrix computations.
If you prefer to go with a C/C++ approach, I would use C, because you presumably need raw performance on a presumably simple interface (matrix computations tend to have simple interfaces and complex algorithms).
If, however, you decide to go with C++, you can use the TNT (the Template Numerical Toolkit, the C++ implementation of Lapack).
Good luck.

If you're just using Python (with NumPy), it may be slower, depending on which pieces you use, whether or not you have optimized linear algebra libraries installed, and how well you know how to take advantage of NumPy.
To make it faster, there are a few things you can do. There is a tool called Cython that allows you to add type declarations to Python code and translate it into a Python extension module in C. How much benefit this gets you depends a bit on how diligent you are with your type declarations - if you don't add any at all, you won't see much of any benefit. Cython also has support for NumPy types, though these are a bit more complicated than other types.
If you have a good graphics card and are willing to learn a bit about GPU computing, PyCUDA can also help. (If you don't have an nvidia graphics card, I hear there is a PyOpenCL in the works as well). I don't know your problem domain, but if it can be mapped into a CUDA problem then it should be able to handle your 10^9 elements nicely.

And here is an updated "comparison" between MATLAB and NumPy/MKL based on some linear algebra functions:
http://dpinte.wordpress.com/2010/03/16/numpymkl-vs-matlab-performance/
The dot product is not that slow ;-)

I couldn't find much hard numbers to answer this same question so I went ahead and did the testing myself. The results, scripts, and data sets used are all available here on my post on MATLAB vs Python speed for vibration analysis.
Long story short, the FFT function in MATLAB is better than Python but you can do some simple manipulation to get comparable results and speed. I also found that importing data was faster in Python compared to MATLAB (even for MAT files using the scipy.io).

I would also like to point out that Python (+NumPy) can easily interface with Fortran via the F2Py module, which basically nets you native Fortran speeds on the pieces of code you offload into it.

Expokit realization on Python

I am looking for a Pythonic realization of Expokit, which is a software package that provides matrix exponential routines for small dense or very large sparse matrices, real or complex, i.e. it finds
w(t) = exp(t*A)*v
This package had been realized in Fortran and Matlab and can be found here https://www.maths.uq.edu.au/expokit/
I have found a python wrapper expokitpy
https://github.com/weinbe58/expokitpy and a Krylov subspace methods package KryPy https://github.com/andrenarchy/krypy. Both seem to be relevant, however neither of them goes with good enough documentation (for me) to do time-evolution.
Does somebody have a working solution with the packages mentioned above or similar?

In case this is still useful to someone, it looks like there was an effort to incorporate expokit within scipy which has now stalled and is looking for somebody to finish. Though here are some instructions to compile with Fortran and then run via Python, with good results.
It seems also to have been adopted by slepc4py, which is then used by quimb, which seems useful if you need it for quantum information (or just use its expm and expm_multiply methods).

Seasonal-Trend-Loess Method for Time Series in Python

Does anyone know if there is a Python-based procedure to decompose time series utilizing STL (Seasonal-Trend-Loess) method?
I saw references to a wrapper program to call the stl function
in R, but I found that to be unstable and cumbersome from the environment set-up perspective (Python and R together). Also, link was 4 years old.
Can someone point out something more recent (e.g. sklearn, spicy, etc.)?

I haven't tried STLDecompose but I took a peek at it and I believe it uses a general purpose loess smoother. This is hard to do right and tends to be inefficient. See the defunct STL-Java repo.
The pyloess package provides a python wrapper to the same underlying Fortran that is used by the original R version. You definitely don't need to go through a bridge to R to get this same functionality! This package is not actively maintained and I've occasionally had trouble getting it to build on some platforms (thus the fork here). But once built, it does work and is the fastest one you're likely to find. I've been tempted to modify it to include some new features, but just can't bring myself to modify the Fortran (which is pre-processed RATFOR - very assembly-language like Fortran, and I can't find a RATFOR preprocessor anywhere).
I wrote a native Java implementation, stl-decomp-4j, that can be called from python using the pyjnius package. This started as a direct port of the original Fortran, refactored to a more modern programming style. I then extended it to allow quadratic loess interpolation and to support post-decomposition smoothing of the seasonal component, features that are described in the original paper but that were not put into the Fortran/R implementation. (They apparently are in the S-plus implementation, but few of us have access to that.) The key to making this efficient is that the loess smoothing simplifies when the points are equidistant and the point-by-point smoothing is done by simply modifying the weights that one is using to do the interpolation.
The stl-decomp-4j examples include one Jupyter notebook demonstrating how to call this package from python. I should probably formalize that as a python package but haven't had time. Quite willing to accept pull requests. ;-)
I'd love to see a direct port of this approach to python/numpy. Another thing on my "if I had some spare time" list.

Here you can find an example of Seasonal-Trend decomposition using LOESS (STL), from statsmodels.
Basicaly it works this way:
from statsmodels.tsa.seasonal import STL
stl = STL(TimeSeries, seasonal=13)
res = stl.fit()
fig = res.plot()

There is indeed:
https://github.com/jrmontag/STLDecompose
In the repo you will find a jupyter notebook for usage of the package.

RSTL is a Python port of R's STL: https://github.com/ericist/rstl. It works pretty well except it is 3~5 times slower than R's STL according to the author.
If you just want to get lowess trend line, you can just use Statsmodels' lowess function
https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html.

sympy compiling functions with large matrices

I have been using sympy to work with systems of differential equations. I write the equations symbolically, use autowrap to compile them through cython, and then pass the resulting function to the scipy ODE solver. One of the major benefits of doing this is that I can solve for the jacobian symbolically using the sympy jacobian function, compile it, and it to the ODE solver as well.
This has been working great for systems of about 30 variables. Recently I tried doing it with 150 variables, and what happened was that I ran out of memory when compiling the jacobian function. This is on Windows with anaconda and the microsoft Visual C++ 14 tools for python. Basically during compilation of the jacobian, which is now a 22000-element vector, memory usage during the linking step went up to about 7GB (on my 8GB laptop) before finally crashing out.
Does someone have some suggestions before I go and try on a machine with more memory? Are other operating systems or other C compilers likely to improve the situation?
I know lots of people do this type of work, so if there's an answer, it will be beneficial to a good chunk of the community.
Edit: response to some of Jonathan's comments:
Yes, I'm fully aware that this is an N^2 problem. The jacobian is a matrix of all partial derivatives, so it will have size N^2. There is no real way around this scaling. However, a 22000-element array is not nearly at the level that would create a memory problem during runtime -- I only have the problem during compilation.
Basically there are three levels that we can address this at.
1) solve the ODE problem without the jacobian, or somehow split up the jacobian to not have a 150x150 matrix. That would address the very root, but it certainly limits what I can do, and I'm not yet convinced that it's impossible to compile the jacobian function
2) change something about the way sympy automatically generates C code, to split it up into multiple chunks, use more functions for intermediate expressions, to somehow make the .c file smaller. People with more sympy experience might have some ideas on this.
3) change something about the way the C is compiled, so that less memory is needed.
I thought that by posting a separate question more oriented around #3 (literal referencing of large array -- compiler out of memory) , I would get a different audience answering. That is in fact exactly what happened. Perhaps the answer to #3 is "you can't" but that's also useful information.

Following a lot of the examples posted at http://www.sympy.org/scipy-2017-codegen-tutorial/ I was able to get this to compile.
The key things were
1) instead of using autowrap, write the C code directly with more control over it. Among other things, this allows passing the argument list as a vector instead of expanding it. This took some effort to get working (setting up the compiler flags through distutils, etc, etc) but in the end it worked well. Having the repo from the course linked above as an example helped a lot.
2) using common subexpression elimination (sympy.cse) to dramatically reduce the size of the expressions for the jacobian elements.
(1) by itself didn't do that much to help in this case (although I was able to use it to vastly improve performance of smaller models). The code was still 200 MB instead of the original 300 MB. But combining it with (2) (cse) I was able to get it down to a meager 1.7 MB (despite 14000 temporary variables).
The cse takes about 20-30 minutes on my laptop. After that, it compiles quickly.

Efficient Matrix-Vector Multiplication: Multithreading directly in Python vs. using ctypes to bind a multithreaded C function

I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 gb (3000^2 by 500).
Some info:
The matrix is stored in HDF5 format. It's Matlab output. It's dense so no sparsity savings there.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking MH Criteria, saving accepted proposals, monitoring the burnout, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have ctype stuff going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program is run mainly on linux, but occasionally on OSX and Windows. Cross-platform capabilities without too much headache is a must.
Right now I have a single-thread implementation where the python code reads in the matrix a few thousand lines at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?

Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to have a machine with enough memory to keep the whole matrix multiplication and the result in RAM. It would save lots of time if you read the matrix only once.
Replacing the harddisk with an SSD would also help, because that can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But only really if a single process cannot use all available disk read bandwith.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.