I need to efficiently solve large nonsymmetric generalized eigenvalue/eigenvector problems.
A x = lambda B x
A, B - general real matrices
A - dense
B - mostly sparse
x - the eigenvector
lambda - the eigenvalue
Could someone help me by:
Informing me whether the nonsymmetric generalized eigenvalue/eigenvector problem is known to be parallelizable (and, if so, what are some good algorithms and libraries implementing them);
Telling me whether ScaLAPACK is an option for dense nonsymmetric eigenproblems;
Suggesting some good computational alternatives for experimenting with both sparse matrices and linear-algebra algorithms;
Suggesting an alternative linear-algebra formulation that I could use (if there is no simple routine call, perhaps there is a good solution that is not so simple).
I tested code efficiency using MATLAB, Python and C. MATLAB is said to have built-in LAPACK functionality. I used Intel-distributed Python, with SciPy and NumPy linked against the Intel MKL LAPACK and BLAS libraries. I also used C code linked against the Intel MKL LAPACK and BLAS libraries.
I was able to check that for non-generalized eigenvalue problems, the code ran in parallel: I had as many threads as physical cores in my machine. That told me that LAPACK uses parallel code in certain routines (either LAPACK itself or the optimized versions shipped with MATLAB and the Intel oneAPI MKL libraries).
When I started to run generalized eigenvalue routines, I observed that the code ran with only one thread. I tested this in MATLAB and in Intel-distributed Python.
I'd like to investigate this further, but first I need to know whether it is even theoretically possible to run generalized nonsymmetric eigendecompositions in parallel.
I've seen that SciPy has routines for the reduction of a pair of general matrices to Hessenberg/upper-triangular form. It seems that, starting from Hessenberg form, eigenvalue/eigenvector problems are computationally easier.
Hessenberg reduction of a single matrix runs in parallel, but Hessenberg-triangular reduction of a pair of matrices runs sequentially on one thread (tested in Python/SciPy). Again I hit a wall, which raises the question: is this problem parallelizable?
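For reference, here is a minimal sketch of the dense SciPy calls I have been timing (the matrix size and the random data are only illustrative):

import numpy as np
from scipy import linalg

n = 2000
A = np.random.rand(n, n)
B = np.random.rand(n, n)  # in my application B is mostly sparse, but these routines treat it as dense

# full generalized eigendecomposition A x = lambda B x (QZ-based)
w, v = linalg.eig(A, B)

# explicit QZ (generalized Schur) decomposition of the pair (A, B)
AA, BB, Q, Z = linalg.qz(A, B)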
Another possible source of optimization for my problem is that one of the matrices is dense and the other is mostly sparse. I'm still not sure how to exploit this. Are there good implementations of sparse matrices and state-of-the-art linear-algebra algorithms that work well together?
Thank you very much for any help, including pointers to books and scientific papers!
For MKL-provided routines, you can run in parallel mode by using the /Qmkl compiler option (Intel compilers) or in sequential mode using the /Qmkl:sequential option. For more details on the linking options, you can refer to the oneMKL developer guide: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-onemkl.html
Here is the link that lists the MKL sparse solver routines: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines.html
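If you are calling MKL through NumPy/SciPy rather than from your own C code, a rough way to check which BLAS/LAPACK you are linked against and to control the thread count at runtime is sketched below (threadpoolctl is an assumption on my part: it is a separate third-party package, not part of MKL):

import os
os.environ.setdefault("MKL_NUM_THREADS", "8")  # must be set before MKL is first loaded

import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against

from threadpoolctl import threadpool_info, threadpool_limits
print(threadpool_info())  # reports the detected MKL/OpenBLAS pools and their thread counts

with threadpool_limits(limits=1, user_api="blas"):
    # any LAPACK-backed call in this block runs single-threaded, handy for A/B timing
    w = np.linalg.eigvals(np.random.rand(500, 500))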
Related
I want to compute the magnetic fields of some conductors using the Biot–Savart law, and I want to use a 1000x1000x1000 matrix. Previously I used MATLAB, but now I want to use Python. Is Python slower than MATLAB? How can I make Python faster?
EDIT:
Maybe the best way is to compute the big array with C/C++ and then transfer it to Python. I want to visualise it with VPython.
EDIT2: Which is better in my case: C or C++?
You might find some useful results at the bottom of this link
http://wiki.scipy.org/PerformancePython
From the introduction,
A comparison of weave with NumPy, Pyrex, Psyco, Fortran (77 and 90) and C++ for solving Laplace's equation.
It also compares MATLAB and seems to show speeds similar to Python with NumPy.
Of course this is only a specific example; your application might allow better or worse performance. There is no harm in running the same test on both and comparing.
You can also compile NumPy with optimized libraries such as ATLAS, which provides some BLAS/LAPACK routines. These should be of comparable speed to MATLAB.
I'm not sure if the NumPy downloads are already built against it, but I think ATLAS will tune the libraries to your system if you compile NumPy yourself.
http://www.scipy.org/Installing_SciPy/Windows
The link has more details on what is required under the Windows platform.
EDIT:
If you want to find out what performs better, C or C++, it might be worth asking a new question, although from the link above C++ has the best performance. Other solutions are quite close too, i.e. Pyrex, Python/Fortran (using f2py) and inline C++.
The only matrix algebra under C++ I have ever done was using MTL and implementing an Extended Kalman Filter. I guess, though, that in essence it depends on the libraries you are using (LAPACK/BLAS) and how well optimised they are.
This link has a list of object-oriented numerical packages for many languages.
http://www.oonumerics.org/oon/
NumPy and MATLAB both use an underlying BLAS implementation for standard linear algebra operations. For some time both used ATLAS, but nowadays MATLAB apparently also comes with other implementations like Intel's Math Kernel Library (MKL). Which one is faster by how much depends on the system and how the BLAS implementation was compiled. You can also compile NumPy with MKL and Enthought is working on MKL support for their Python distribution (see their roadmap). Here is also a recent interesting blog post about this.
On the other hand, if you need more specialized operations or data structures then both Python and MATLAB offer you various ways for optimization (like Cython, PyCUDA,...).
Edit: I corrected this answer to take into account different BLAS implementations. I hope it is now a fair representation of the current situation.
The only valid test is to benchmark it. It really depends on what your platform is, and how well the Biot-Savart Law maps to Matlab or NumPy/SciPy built-in operations.
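If you do benchmark it, a minimal sketch along these lines is usually enough to see where you stand; the vectorized cross product here is only a stand-in for your actual Biot-Savart kernel, and the array size is arbitrary:

import time
import numpy as np

n = 2_000_000                  # number of (field point, current element) pairs
r = np.random.rand(n, 3)       # displacement vectors
dl = np.random.rand(n, 3)      # current element vectors

t0 = time.perf_counter()
num = np.cross(dl, r)                              # dl x r
dB = num / np.linalg.norm(r, axis=1)[:, None]**3   # Biot-Savart-like integrand, constants omitted
t1 = time.perf_counter()
print(f"vectorized kernel: {t1 - t0:.3f} s for {n} elements")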
As for making Python faster, Google's working on Unladen Swallow, a JIT compiler for Python. There are probably other projects like this as well.
As per your edit 2, I recommend very strongly that you use Fortran because you can leverage the available linear algebra subroutines (Lapack and Blas) and it is way simpler than C/C++ for matrix computations.
If you prefer to go with a C/C++ approach, I would use C, because you presumably need raw performance on a presumably simple interface (matrix computations tend to have simple interfaces and complex algorithms).
If, however, you decide to go with C++, you can use the TNT (the Template Numerical Toolkit, the C++ implementation of Lapack).
Good luck.
If you're just using Python (with NumPy), it may be slower, depending on which pieces you use, whether or not you have optimized linear algebra libraries installed, and how well you know how to take advantage of NumPy.
To make it faster, there are a few things you can do. There is a tool called Cython that allows you to add type declarations to Python code and translate it into a Python extension module in C. How much benefit this gets you depends a bit on how diligent you are with your type declarations - if you don't add any at all, you won't see much of any benefit. Cython also has support for NumPy types, though these are a bit more complicated than other types.
If you have a good graphics card and are willing to learn a bit about GPU computing, PyCUDA can also help. (If you don't have an nvidia graphics card, I hear there is a PyOpenCL in the works as well). I don't know your problem domain, but if it can be mapped into a CUDA problem then it should be able to handle your 10^9 elements nicely.
And here is an updated "comparison" between MATLAB and NumPy/MKL based on some linear algebra functions:
http://dpinte.wordpress.com/2010/03/16/numpymkl-vs-matlab-performance/
The dot product is not that slow ;-)
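If you want to sanity-check this on your own machine, here is a minimal timing sketch (the size is arbitrary; the product is dispatched to whatever BLAS your NumPy is linked against, e.g. MKL or ATLAS):

import time
import numpy as np

n = 2000
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.perf_counter()
C = A @ B                      # handed off to the underlying BLAS (dgemm)
t1 = time.perf_counter()
print(f"{n}x{n} matrix product: {t1 - t0:.3f} s")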
I couldn't find many hard numbers to answer this same question, so I went ahead and did the testing myself. The results, scripts, and data sets used are all available in my post on MATLAB vs Python speed for vibration analysis.
Long story short, the FFT function in MATLAB is faster than Python's, but you can do some simple manipulation to get comparable results and speed. I also found that importing data was faster in Python than in MATLAB (even for MAT files, using scipy.io).
I would also like to point out that Python (+NumPy) can easily interface with Fortran via the F2Py module, which basically nets you native Fortran speeds on the pieces of code you offload into it.
I have been trying different sparse solvers available in Python 3 and comparing their performance against each other and against Octave and Matlab. I have chosen both direct and iterative approaches; I will explain this in more detail below.
To generate a proper sparse matrix with a banded structure, a Poisson problem is solved using finite elements with square grids of N=250, N=500 and N=1000. This results in a matrix of dimensions A = N^2 x N^2 and a vector b = N^2 x 1, i.e., the largest system has a million unknowns. If anyone is interested in replicating my results, I have uploaded the matrices A and the vectors b at the following link (it will expire in 30 days): Get systems used here. The matrices are stored as triplets I,J,V, i.e. the first two columns are the row and column indices, respectively, and the third column contains the values corresponding to those indices. Note that some values in V which are nearly zero are left in on purpose. Still, the banded structure is visible after a "spy" matrix command in both Matlab and Python.
For comparison, I have used the following solvers:
Matlab and Octave, direct solver: The canonical x=A\b.
Matlab and Octave, pcg solver: The preconditioned conjugate gradient, pcg(A,b,1e-5,size(b,1)) (no preconditioner is used).
Scipy (Python), direct solver: linalg.spsolve(A, b), where A is first converted to csr_matrix format.
Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)
Scipy (Python), UMFPACK solver: spsolve(A, b) using from scikits.umfpack import spsolve. This solver is apparently available (only?) under Linux, since it makes use of libsuitesparse [Timothy Davis, Texas A&M]. In Ubuntu, it has to be installed first with sudo apt-get install libsuitesparse-dev.
Furthermore, the aforementioned Python solvers are tested on:
Windows.
Linux.
Mac OS.
Conditions:
Timing is done right before and after the solution of the systems. I.e., the overhead for reading the matrices is not considered.
Timing is done ten times for each system, and an average and a standard deviation are computed.
Hardware:
Windows and Linux: Dell, Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 32 GB DDR4 RAM.
Mac OS: MacBook Pro Retina mid 2014, Intel(R) quad-core(TM) i7 @ 2.2GHz, 16 GB DDR3 RAM.
Results:
Observations:
Matlab's A\b is the fastest despite running on an older computer.
There are notable differences between the Linux and Windows versions; see for instance the direct solver at NxN=1e6. This is despite Linux running under Windows (WSL).
The SciPy solvers show a huge scatter: if the same system is solved several times, one of the runs can take more than twice as long as the others.
The fastest option in Python can be nearly four times slower than Matlab running on more limited hardware. Really?
If you want to reproduce the tests, I leave here very simple scripts.
For matlab/octave:
IJS=load('KbN1M.txt');
b=load('FbN1M.txt');
I=IJS(:,1);
J=IJS(:,2);
S=IJS(:,3);
Neval=10;
tsparse=zeros(Neval,1);
tsolve_direct=zeros(Neval,1);
tsolve_sparse=zeros(Neval,1);
tsolve_pcg=zeros(Neval,1);
for i=1:Neval
    tic
    A=sparse(I,J,S);
    tsparse(i)=toc;
    tic
    x=A\b;
    tsolve_direct(i)=toc;
    tic
    x2=pcg(A,b,1e-5,size(b,1));
    tsolve_pcg(i)=toc;
end
save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg
For python:
import time
from scipy import sparse as sp
from scipy.sparse import linalg
import numpy as np
from scikits.umfpack import spsolve, splu #NEEDS LINUX
b=np.loadtxt('FbN1M.txt')
triplets=np.loadtxt('KbN1M.txt')
I=triplets[:,0]-1
J=triplets[:,1]-1
V=triplets[:,2]
I=I.astype(int)
J=J.astype(int)
NN=int(b.shape[0])
Neval=10
time_sparse=np.zeros((Neval,1))
time_direct=np.zeros((Neval,1))
time_conj=np.zeros((Neval,1))
time_umfpack=np.zeros((Neval,1))
for i in range(Neval):
    t = time.time()
    A = sp.coo_matrix((V, (I, J)), shape=(NN, NN))
    A = sp.csr_matrix(A)
    time_sparse[i, 0] = time.time() - t

    t = time.time()
    x = linalg.spsolve(A, b)
    time_direct[i, 0] = time.time() - t

    t = time.time()
    x2, info = sp.linalg.cg(A, b, x0=None, tol=1e-05)  # cg returns (solution, convergence flag); newer SciPy names the tolerance rtol
    time_conj[i, 0] = time.time() - t

    t = time.time()
    x3 = spsolve(A, b)  # UMFPACK spsolve, ONLY IN LINUX
    time_umfpack[i, 0] = time.time() - t

np.savetxt('pythonlinux_n1M_tsparse.txt', time_sparse, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt', time_direct, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt', time_conj, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt', time_umfpack, fmt='%.18f')
Is there a way to further improve sparse solution times using Python, or at least to reach a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for Python, but I believe it will not get much better than the UMFPACK choice. Suggestions are very welcome.
P.S. I am aware of previous posts, e.g. scipy slow sparse matrix solver
Issues using the scipy.sparse.linalg linear system solvers
How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg?
But I think none is as comprehensive as this one, highlighting even more issues between operating systems when using Python libraries.
EDIT_1:
I have added a new plot with results using the QR solver from Intel MKL via a Python wrapper, as suggested in the comments. This is, however, still behind Matlab's performance.
To do this, one needs to add:
from sparse_dot_mkl import sparse_qr_solve_mkl
and
sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
to the scripts provided in the original post. The ".astype(np.float32)" can be omitted, in which case the performance gets slightly worse (about 10%) for this system.
I will try to answer my own question. To do so, I tried an even more demanding example, with a matrix of size (N,N) of about half a million by half a million and the corresponding vector of size (N,1). This matrix, however, is much less sparse (denser) than the one provided in the question. Stored in ASCII it takes about 1.7 GB, compared to about 0.25 GB for the matrix of the original example (despite the latter being larger in dimension); its sparsity pattern is shown in the original post.
Then I tried to solve Ax=b again using Matlab, Octave and Python, with the aforementioned direct solvers from SciPy, the Intel MKL wrapper, and UMFPACK from Tim Davis.
My first surprise is that both Matlab and Octave could solve the system with A\b (which is not necessarily a direct solver, since it chooses the best solver based on the characteristics of the matrix; see Matlab's documentation for x=A\b). However, Python's linalg.spsolve, the MKL wrapper and UMFPACK all threw out-of-memory errors on Windows and Linux. On Mac, linalg.spsolve somehow computed a solution, and although the performance was very poor, it never threw memory errors. I wonder whether memory is handled differently depending on the OS; it seems to me that the Mac swapped memory to the hard drive rather than keeping everything in RAM. The performance of the CG solver in Python was rather poor compared to Matlab's. However, one can get a huge improvement in the Python CG solver if A=0.5*(A+A') is computed first (provided, obviously, that the system is symmetric). Using a preconditioner in Python did not help: I tried the sp.linalg.spilu method together with sp.linalg.LinearOperator to compute a preconditioner, but the performance was rather poor. In Matlab, one can use an incomplete Cholesky decomposition.
For the out-of-memory problem, the solution was to compute an LU decomposition A = LU and then solve two nested triangular systems, y = L\b followed by x = U\y, so that Ax = b.
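As a rough illustration of this factor-then-solve approach, here is a minimal sketch using SciPy's SuperLU interface (scipy.sparse.linalg.splu); scikits.umfpack exposes a similar splu. It assumes A is a square SciPy sparse matrix and b a dense right-hand side:

from scipy.sparse.linalg import splu

lu = splu(A.tocsc())   # one sparse LU factorization of A (SuperLU wants CSC)
x = lu.solve(b)        # forward and backward triangular solves reusing the factors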
I report here the minimum solution times:
Matlab mac, A\b = 294 s.
Matlab mac, PCG (without preconditioner) = 17.9 s.
Matlab mac, PCG (with incomplete Cholesky preconditioner) = 9.8 s.
Scipy mac, direct = 4797 s.
Octave, A\b = 302 s.
Octave, PCG (without preconditioner) = 28.6 s.
Octave, PCG (with incomplete Cholesky preconditioner) = 11.4 s.
Scipy, PCG (without A=0.5*(A+A')) = 119 s.
Scipy, PCG (with A=0.5*(A+A')) = 12.7 s.
Scipy, LU decomposition using UMFPACK (Linux) = 3.7 s total.
So the answer is YES, there are ways to improve the solution times in SciPy. The use of the wrappers for UMFPACK (Linux) or the Intel MKL QR solver is highly recommended if the memory of the workstation allows it. Otherwise, performing A=0.5*(A+A') prior to using the conjugate gradient solver can have a positive effect on solution performance when dealing with symmetric systems.
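For completeness, the symmetrization trick mentioned above is only a couple of lines in SciPy; this is a sketch under the assumption that A is a SciPy sparse matrix that is symmetric up to assembly/round-off noise:

from scipy.sparse.linalg import cg

A_sym = 0.5 * (A + A.T)                     # enforce exact symmetry
x, info = cg(A_sym, b, x0=None, tol=1e-05)  # info == 0 means convergence; newer SciPy names the tolerance rtol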
Let me know if someone would be interested in having this new system, so I can upload it.
I was wondering whether there is a way to calculate the first few eigenvectors of a very large sparse matrix in TensorFlow, hoping that it might be faster than SciPy's wrapper of ARPACK, which doesn't seem to support parallel computing, at least as far as I have noticed.
I believe you should rather look into petsc4py or slepc4py. They are Python bindings of PETSc (Portable, Extensible Toolkit for Scientific Computation) and SLEPc (Scalable Library for Eigenvalue Problem Computations).
PETSc and SLEPc support MPI, and therefore petsc4py and slepc4py do too.
I believe you will find useful examples in the slepc4py examples.
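Here is a minimal sketch of what an slepc4py eigensolve looks like, assuming you already have your matrix as a SciPy CSR matrix named A_csr (the serial matrix construction and the option values are my assumptions; adapt the problem type and the number of eigenpairs to your case):

import sys
import slepc4py
slepc4py.init(sys.argv)
from petsc4py import PETSc
from slepc4py import SLEPc

# build a PETSc matrix from the CSR arrays (serial construction; a truly
# distributed setup would fill only the locally owned rows on each MPI rank)
A = PETSc.Mat().createAIJ(size=A_csr.shape,
                          csr=(A_csr.indptr, A_csr.indices, A_csr.data))

eps = SLEPc.EPS().create()
eps.setOperators(A)
eps.setProblemType(SLEPc.EPS.ProblemType.HEP)  # Hermitian/symmetric; use NHEP otherwise
eps.setDimensions(nev=6)                       # ask for the first few eigenpairs
eps.setFromOptions()
eps.solve()

vr, wr = A.createVecs()
vi, wi = A.createVecs()
for i in range(eps.getConverged()):
    print(i, eps.getEigenpair(i, vr, vi))

Running the script under MPI (e.g. mpirun -n 4 python your_script.py) is what actually gives you the parallelism.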
I am prototyping numerical algorithms for linear programming and matrix manipulation with very large (100,000 x 100,000) very sparse (0.01% fill) complex (a+b*i) matrices with symmetric structure and asymmetric values. I have been happily using MATLAB for seven years, but have been receiving suggestions to switch to Python since it is open source.
I understand that there are many different Python numeric packages available, but does Python have any limits for handling these types of matrices and solving linear optimization problems in real time at high speed? Does Python have a sparse complex matrix solver comparable in speed to MATLAB's backslash A\b operator? (I have written Gaussian and LU codes, but A\B is always at least 5 times faster than anything else that I have tried and scales linearly with matrix size.)
Probably your sparse solvers were slower than A\b at least in part due to the interpreter overhead of MATLAB scripts. Internally, MATLAB uses UMFPACK's multifrontal solver for the lu() function and the A\b operator (see the UMFPACK manual).
You should try the scipy.sparse package with scipy.sparse.linalg for the assortment of solvers available. In particular, the spsolve() function has an option to call the UMFPACK routine instead of the built-in SuperLU solver.
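A minimal sketch of that call, under the assumption that A is your complex sparse matrix and b the right-hand side (routing to UMFPACK requires the scikit-umfpack package to be installed):

import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

x = spsolve(sp.csc_matrix(A), b, use_umfpack=True)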
... solving linear optimization problems in real time at high speed?
Since you have time constraints you might want to consider iterative solvers instead of direct ones.
You can get an idea of the performance of SuperLU implementation in spsolve and iterative solvers available in SciPy from another post on this site.
Is anyone aware of an implemented version (perhaps using scipy/numpy) of parallel exact matrix diagonalization (equivalently, finding the eigensystem)? If it helps, my matrices are symmetric and sparse. I would hate to spend a day reinventing the wheel.
EDIT:
My matrices are at least 10,000x10,000 (but, preferably, at least 20 times larger). For now, I only have access to a 4-core Intel machine (with hyperthreading, so two hardware threads per core) at ~3.0GHz with 12GB of RAM. I may later have access to a 128-core node at ~3.6GHz/core with 256GB of RAM, so a single machine with multiple cores should do it (for my other parallel tasks, I have been using multiprocessing). I would prefer the algorithms to scale well.
I do need exact diagonalization, so the scipy.sparse routines are not good for me (I tried them; they didn't work well). I have been using numpy.linalg.eigh (and I see only a single core doing all the computation).
Alternatively (to the original question): is there an online resource where I can find out more about compiling SciPy so as to ensure parallel execution?
For finding eigenvalues/eigenvectors of a symmetric sparse matrix, you may use scipy.sparse.linalg.eigsh. It uses ARPACK behind the scenes, and there are parallel ARPACK implementations. AFAIK, SciPy can be compiled against one of those if your current installation uses the serial version.
However, this is not a good answer if you need all eigenvalues and eigenvectors of the matrix, as the sparse version uses the Lanczos algorithm, which is intended for computing only a few eigenpairs.
If your matrix is not overwhelmingly large, then just use numpy.linalg.eigh. It uses LAPACK or BLAS and may use parallel code internally.
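A minimal sketch of the two routes (the matrix names and the number of requested eigenpairs are placeholders):

import numpy as np
from scipy.sparse.linalg import eigsh

# A_sparse: symmetric scipy.sparse matrix; a few extreme eigenpairs via ARPACK/Lanczos
vals, vecs = eigsh(A_sparse, k=6, which="SA")

# A_dense: symmetric NumPy array; full spectrum via LAPACK, which may run multithreaded
w, v = np.linalg.eigh(A_dense)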
If you end up rolling your own, please note that SciPy/NumPy does all the heavy lifting with different highly optimized linear algebra packages, not in pure Python. Due to this the performance and degree of parallelism depends heavily on the libraries your SciPy/NumPy installation is compiled with.
(Your question does not reveal if you just want to have parallel code running on several processors, or on several computers. Also, the size of your matrix has a big impact on the best method. So, this answer may be completely off-the-mark.)