Is there a way to further improve sparse solution times using Python?

I have been trying different sparse solvers available in Python 3 and comparing their performance against each other and against Octave and Matlab. I have chosen both direct and iterative approaches; I explain these in more detail below.
To generate a proper sparse matrix with a banded structure, a Poisson problem is solved with finite elements on square grids of N=250, N=500 and N=1000. This results in a matrix A of dimension N^2 x N^2 and a vector b of dimension N^2 x 1, i.e., the largest system is a million by a million. If you are interested in replicating my results, I have uploaded the matrices A and the vectors b at the following link (it will expire in 30 days): Get systems used here. The matrices are stored as triplets I, J, V, i.e. the first two columns are the row and column indices, respectively, and the third column holds the values corresponding to those indices. Note that some values in V that are nearly zero are left in on purpose. Still, the banded structure is preserved, as a "spy" command confirms in both Matlab and Python.
For comparison, I have used the following solvers:
Matlab and Octave, direct solver: The canonical x=A\b.
Matlab and Octave, pcg solver: The preconditioned conjugate gradient, pcg(A,b,1e-5,size(b,1)) (no preconditioner is used).
Scipy (Python), direct solver: linalg.spsolve(A, b), where A is previously converted to csr_matrix format.
Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)
Scipy (Python), UMFPACK solver: spsolve(A, b) using from scikits.umfpack import spsolve. This solver is apparently available (only?) under Linux, since it makes use of SuiteSparse [Timothy Davis, Texas A&M]. On Ubuntu, it first has to be installed with sudo apt-get install libsuitesparse-dev.
Furthermore, the aforementioned Python solvers are tested on:
Windows.
Linux.
Mac OS.
Conditions:
Timing is done right before and after the solution of the systems, i.e., the overhead of reading the matrices is not considered.
Timing is repeated ten times for each system, and an average and a standard deviation are computed.
Hardware:
Windows and Linux: Dell, Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz (2.59GHz), 32 GB DDR4 RAM.
Mac OS: MacBook Pro Retina mid-2014, Intel(R) quad-core i7 @ 2.2GHz, 16 GB DDR3 RAM.
Results:
Observations:
Matlab's A\b is the fastest despite running on an older computer.
There are notable differences between the Linux and Windows versions. See, for instance, the direct solver at NxN=1e6. This is despite Linux running under Windows (WSL).
The Scipy solvers show a huge scatter: if the same solve is run several times, one of the runs can take more than twice as long as the others.
The fastest option in Python can be nearly four times slower than Matlab running on more limited hardware. Really?
If you want to reproduce the tests, I leave here very simple scripts.
For matlab/octave:
% Load the system stored as triplets (I, J, V) and the right-hand side.
IJS = load('KbN1M.txt');
b   = load('FbN1M.txt');
I = IJS(:,1);
J = IJS(:,2);
S = IJS(:,3);

Neval = 10;                        % number of repetitions for timing
tsparse       = zeros(Neval,1);    % time to assemble the sparse matrix
tsolve_direct = zeros(Neval,1);    % time for the direct solver A\b
tsolve_sparse = zeros(Neval,1);    % (not used below)
tsolve_pcg    = zeros(Neval,1);    % time for the pcg solver

for i = 1:Neval
    tic
    A = sparse(I,J,S);             % assemble sparse matrix from triplets
    tsparse(i) = toc;

    tic
    x = A\b;                       % direct solve
    tsolve_direct(i) = toc;

    tic
    x2 = pcg(A,b,1e-5,size(b,1));  % conjugate gradient, no preconditioner
    tsolve_pcg(i) = toc;
end

save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg
For python:
import time
import numpy as np
from scipy import sparse as sp
from scipy.sparse import linalg
from scikits.umfpack import spsolve, splu  # NEEDS LINUX

# Load the right-hand side and the matrix stored as triplets (I, J, V).
b = np.loadtxt('FbN1M.txt')
triplets = np.loadtxt('KbN1M.txt')
I = (triplets[:, 0] - 1).astype(int)  # 1-based indices -> 0-based
J = (triplets[:, 1] - 1).astype(int)
V = triplets[:, 2]
NN = int(b.shape[0])

Neval = 10
time_sparse  = np.zeros((Neval, 1))   # sparse-matrix assembly
time_direct  = np.zeros((Neval, 1))   # scipy direct solver (SuperLU)
time_conj    = np.zeros((Neval, 1))   # conjugate gradient
time_umfpack = np.zeros((Neval, 1))   # UMFPACK direct solver

for i in range(Neval):
    t = time.time()
    A = sp.coo_matrix((V, (I, J)), shape=(NN, NN))
    A = sp.csr_matrix(A)
    time_sparse[i, 0] = time.time() - t

    t = time.time()
    x = linalg.spsolve(A, b)
    time_direct[i, 0] = time.time() - t

    t = time.time()
    x2 = sp.linalg.cg(A, b, x0=None, tol=1e-05)
    time_conj[i, 0] = time.time() - t

    t = time.time()
    x3 = spsolve(A, b)  # ONLY IN LINUX (scikits.umfpack)
    time_umfpack[i, 0] = time.time() - t

np.savetxt('pythonlinux_n1M_tsparse.txt', time_sparse, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt', time_direct, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt', time_conj, fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt', time_umfpack, fmt='%.18f')
Is there a way to further improve sparse solution times using Python, or at least get within a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for Python, but I believe it will not get much better than the UMFPACK option. Suggestions are very welcome.
P.S. I am aware of previous posts, e.g. scipy slow sparse matrix solver
Issues using the scipy.sparse.linalg linear system solvers
How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg?
But I think none is as comprehensive as this one, highlighting even more issues between operating systems when using Python libraries.
EDIT_1:
I have added a new plot with results using the QR solver from Intel MKL via a Python wrapper, as suggested in the comments. It is, however, still behind Matlab's performance.
To do this, one needs to add:
from sparse_dot_mkl import sparse_qr_solve_mkl
and
sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
to the scripts provided in the original post. The ".astype(np.float32)" can be omitted, but then the performance gets slightly worse (about 10%) for this system.
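For reference, this is roughly how that call can be timed in the same way as the other solvers; a minimal sketch assuming A (in CSR format) and b have already been built as in the Python script above, and that the sparse_dot_mkl package is installed:

import time
import numpy as np
from sparse_dot_mkl import sparse_qr_solve_mkl

# A (scipy CSR matrix) and b (NumPy vector) are assumed to come from the script above.
t = time.time()
x_qr = sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
print('MKL QR solve: %.3f s' % (time.time() - t))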

I will try to answer myself. To provide an answer, I tried an even more demanding example: a matrix of size (N,N) of about half a million by half a million, with the corresponding (N,1) vector. This one, however, is much less sparse (more dense) than the one provided in the question. The matrix stored in ASCII is about 1.7 GB, compared to about 0.25 GB for the one from the example (despite its "size" being larger). See its shape here.
Then I tried to solve Ax=b again with Matlab, Octave and Python, using the aforementioned direct solvers from Scipy, the Intel MKL wrapper, and UMFPACK from Tim Davis.
My first surprise was that both Matlab and Octave could solve the system with A\b, which is not necessarily a direct solver, since A\b chooses the best solver based on the characteristics of the matrix (see Matlab's documentation of x=A\b). However, Python's linalg.spsolve, the MKL wrapper and UMFPACK all threw out-of-memory errors on Windows and Linux. On Mac, linalg.spsolve somehow computed a solution, and although its performance was very poor, it never threw memory errors. I wonder whether memory is handled differently depending on the OS; it seems to me that the Mac swapped memory to the hard drive rather than serving it from RAM.
The performance of the CG solver in Python was rather poor compared to Matlab. However, the CG solver in Python improves enormously if A=0.5*(A+A') is computed first (assuming, obviously, that the system is symmetric). Using a preconditioner in Python did not help: I tried sp.linalg.spilu together with sp.linalg.LinearOperator to compute a preconditioner, but the performance was rather poor. In Matlab, one can instead use the incomplete Cholesky decomposition. A minimal sketch of both attempts follows.
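This is a minimal sketch of the symmetrization step and the spilu/LinearOperator preconditioner attempt, with a small synthetic symmetric system standing in for the real matrix:

import numpy as np
from scipy import sparse as sp
from scipy.sparse import linalg

# Small symmetric, diagonally dominant stand-in for the finite-element matrix.
n = 5000
R = sp.random(n, n, density=1e-3, format='csr')
A = R + R.T + 10.0 * sp.identity(n)
b = np.ones(n)

# Enforce exact symmetry (removes round-off asymmetry) before calling CG.
A_sym = 0.5 * (A + A.T)
x_cg, info = linalg.cg(A_sym, b, tol=1e-5)  # newer SciPy versions rename tol to rtol

# ILU preconditioner via spilu + LinearOperator (this is the attempt that performed poorly here).
ilu = linalg.spilu(A_sym.tocsc())
M = linalg.LinearOperator(A_sym.shape, matvec=ilu.solve)
x_pcg, info_pcg = linalg.cg(A_sym, b, tol=1e-5, M=M)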
For the out-of-memory problem, the solution was to use an LU decomposition and solve two nested triangular systems: for Ax=b with A=LL', first solve y=L\b and then x=L'\y.
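A minimal sketch of that factor-once-then-substitute approach, using SciPy's splu interface (the splu imported from scikits.umfpack in the script above can be used the same way); A and b are assumed to be the already-loaded system:

from scipy.sparse import linalg

# Factor once, then reuse the factors: the solve reduces to two triangular substitutions.
lu = linalg.splu(A.tocsc())  # A is the (large) sparse matrix loaded earlier
x = lu.solve(b)              # forward/backward substitution with the stored factors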
Here are the minimum solution times:
Matlab mac, A\b = 294 s.
Matlab mac, PCG (without preconditioner) = 17.9 s.
Matlab mac, PCG (with incomplete Cholesky preconditioner) = 9.8 s.
Scipy mac, direct = 4797 s.
Octave, A\b = 302 s.
Octave, PCG (without preconditioner) = 28.6 s.
Octave, PCG (with incomplete Cholesky preconditioner) = 11.4 s.
Scipy, PCG (without A=0.5(A+A')) = 119 s.
Scipy, PCG (with A=0.5(A+A')) = 12.7 s.
Scipy, LU decomposition using UMFPACK (Linux) = 3.7 s total.
So the answer is YES: there are ways to improve the solution times in Scipy. The use of the wrappers for UMFPACK (Linux) or the Intel MKL QR solver is highly recommended if the memory of the workstation allows it. Otherwise, computing A=0.5*(A+A') before using the conjugate gradient solver can have a positive effect on solution performance when dealing with symmetric systems.
Let me know if someone would be interested in having this new system, so I can upload it.

Related

parallel algorithm for generalized nonsymmetric eigenproblems

I need to efficiently solve large nonsymmetric generalized eigenvalue/eigenvector problems.
A x = lambda B x
A, B - general real matrices
A - dense
B - mostly sparse
x - the eigenvector
lambda - the eigenvalue
Could someone help me by:
Informing me whether the nonsymmetric generalized eigenvalue/eigenvector problem is known to be parallelizable (and what good algorithms and libraries implement it, if any);
Telling me whether ScaLAPACK is an alternative for dense nonsymmetric eigenproblems;
Suggesting some good computational alternatives to test the use of both sparse matrices and linear-algebra algorithms;
Suggesting an alternative linear algebra construction that I could use (if there is no simple routine call, perhaps there is a good solution that is not so simple).
I tested code efficiency using Matlab, Python and C. Matlab is said to have built-in LAPACK functionality. I used Intel-provided Python, with scipy and numpy linked to the Intel MKL LAPACK and BLAS libraries. I also used C code linked to the Intel MKL LAPACK and BLAS libraries.
I was able to check that for non-generalized eigenvalue problems, the code ran in parallel: I had as many threads as physical cores on my machine. That told me that LAPACK uses parallel code in certain routines (either LAPACK itself or the optimized versions shipped within Matlab and the Intel MKL oneAPI libraries).
When I started to run generalized eigenvalue routines, I observed that the code ran with only one thread. I tested this in Matlab and in Python as distributed by Intel.
I'd like to investigate this further, but first I need to know if it's possible even in theory to run generalized nonsymmetric eigen decompositions in parallel.
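For illustration (not part of the original benchmark), the comparison described above boils down to timing the standard versus the generalized problem, for example:

import time
import numpy as np
import scipy.linalg as la

n = 2000                      # arbitrary illustrative size
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t = time.time()
w_std = la.eigvals(A)         # standard nonsymmetric problem (QR algorithm)
print('standard eig:    %.1f s' % (time.time() - t))

t = time.time()
w_gen = la.eigvals(A, B)      # generalized problem (QZ algorithm)
print('generalized eig: %.1f s' % (time.time() - t))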
I've seen that scipy has routines for the reduction of a pair of general matrices to a pair of Hessenberg/upper-triangular matrices. It seems that from Hessenberg form, eigenvalue/eigenvector problems are computationally easier.
Hessenberg reduction for a single matrix runs in parallel, but Hessenberg reduction for a pair of matrices runs only sequentially with one thread (tested in Python/scipy). And again, I hit a wall. Which raises the question: is this problem parallelizable?
Another source of optimization for my problem is that one of the matrices is dense and the other is mostly sparse. I'm still not sure how to exploit this. Are there good implementations of sparse matrices and state-of-the-art linear algebra algorithms that work well together?
Thank you very much for any help supplied! Including books and scientific papers.
For MKL-provided routines, you can run them either in parallel mode by using the /Qmkl compiler option (Intel compilers) or in sequential mode using the /Qmkl:sequential option. For more details regarding the linking options, refer to the MKL developer reference manual: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-onemkl.html
Here is the link that lists the MKL routines that work with sparse matrices: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines.html

Limits on Complex Sparse Linear Algebra in Python

I am prototyping numerical algorithms for linear programming and matrix manipulation with very large (100,000 x 100,000) very sparse (0.01% fill) complex (a+b*i) matrices with symmetric structure and asymmetric values. I have been happily using MATLAB for seven years, but have been receiving suggestions to switch to Python since it is open source.
I understand that there are many different Python numeric packages available, but does Python have any limits for handling these types of matrices and solving linear optimization problems in real time at high speed? Does Python have a sparse complex matrix solver comparable in speed to MATLAB's backslash A\b operator? (I have written Gaussian and LU codes, but A\B is always at least 5 times faster than anything else that I have tried and scales linearly with matrix size.)
Probably your sparse solvers were slower than A\b at least in part due to the interpreter overhead of MATLAB scripts. Internally, MATLAB uses UMFPACK's multifrontal solver for the lu() function and the A\b operator (see the UMFPACK manual).
You should try the scipy.sparse package with scipy.sparse.linalg for the assortment of solvers available. In particular, the spsolve() function has an option to call the UMFPACK routine instead of the built-in SuperLU solver.
... solving linear optimization problems in real time at high speed?
Since you have time constraints you might want to consider iterative solvers instead of direct ones.
You can get an idea of the performance of SuperLU implementation in spsolve and iterative solvers available in SciPy from another post on this site.
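As an illustration of both routes, here is a sketch with a small, well-conditioned complex sparse matrix standing in for the 100,000 x 100,000 problem (the use_umfpack option requires scikit-umfpack to be installed):

import numpy as np
from scipy import sparse as sp
from scipy.sparse import linalg

# Banded complex test matrix, diagonally dominant so both solvers behave well.
n = 20000
main = (4.0 + 1.0j) * np.ones(n)
off = (-1.0 + 0.5j) * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format='csc')
b = np.random.rand(n) + 1j * np.random.rand(n)

# Direct: SuperLU by default; use_umfpack=True switches to UMFPACK when available.
x_direct = linalg.spsolve(A, b, use_umfpack=True)

# Iterative: BiCGSTAB handles complex nonsymmetric systems.
x_iter, info = linalg.bicgstab(A, b)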

performance in linear algebra with python

Benchmarks of different languages and related questions are everywhere on the Internet. However, I still cannot figure out whether I should switch to C in my program.
Basically, the most time-consuming part of my program involves a lot of matrix inversions and matrix multiplications. I have several plans:
stick with numpy.
use C with LAPACK/BLAS.
rewrite my python program and change the most time consuming part into C and then use python to call C.
I know numpy is just a wrapper around LAPACK/BLAS. So will 2 or 3 be substantially (500%) faster than 1?
I was just about to ask a very similar question when I saw yours. I have tested this question from various directions. For quite some time I have been trying to beat the numpy.dot function with my own code.
I have large complex matrices, and their multiplication is the primary bottleneck of my program. I have tested the following methods:
simple C code.
Cython code with various optimizations, using cblas.
Python 32-bit and 64-bit versions; the 64-bit version is 1.5-2 times faster than the 32-bit one.
Anaconda's MKL implementation, but no luck there either.
einsum for the matrix multiplication.
Python 3 and Python 2.7 perform the same; Python 3's @ operator is also the same.
numpy.dot(a,b,c) is marginally faster than c=numpy.dot(a,b).
By far, numpy.dot is the best. It beat every other method, sometimes marginally (einsum) but mostly significantly. A small timing sketch of these variants follows below.
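A small illustrative timing of those variants (arbitrary size, complex matrices as above):

import time
import numpy as np

n = 1000
a = np.random.rand(n, n) + 1j * np.random.rand(n, n)
b = np.random.rand(n, n) + 1j * np.random.rand(n, n)
c = np.empty((n, n), dtype=np.complex128)  # preallocated output for np.dot(a, b, c)

t = time.time(); np.dot(a, b);                  print('np.dot           %.2f s' % (time.time() - t))
t = time.time(); np.dot(a, b, c);               print('np.dot with out  %.2f s' % (time.time() - t))
t = time.time(); np.einsum('ij,jk->ik', a, b);  print('np.einsum        %.2f s' % (time.time() - t))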
During my research I came across an article, Ultrafast matrix multiplication, which says that Apple's AltiVec implementation can multiply a 2500x2500 matrix in less than a second. On my PC with a 4th-generation Intel Core i3 at 2.3 GHz and 4 GB of RAM, it took 73 seconds using numpy.dot, hence I am still searching for a faster implementation on a PC.

Parallel exact matrix diagonalization with Python

Is anyone aware of an implemented version (perhaps using scipy/numpy) of parallel exact matrix diagonalization (equivalently, finding the eigensystem)? If it helps, my matrices are symmetric and sparse. I would hate to spend a day reinventing the wheel.
EDIT:
My matrices are at least 10,000x10,000 (but, preferably, at least 20 times larger). For now, I only have access to a 4-core Intel machine (with hyperthreading, so 2 threads per core), ~3.0GHz each, with 12GB of RAM. I may later have access to a 128-core node at ~3.6GHz/core with 256GB of RAM, so a single machine with multiple cores should do it (for my other parallel tasks, I have been using multiprocessing). I would prefer the algorithms to scale well.
I do need exact diagonalization, so the scipy.sparse routines are not good for me (tried them, they didn't work well). I have been using numpy.linalg.eigh (and I see only a single core doing all the computations).
Alternatively (to the original question): is there an online resource where I can find out more about compiling SciPy so as to ensure parallel execution?
For symmetric sparse matrix eigenvalue/eigenvector finding, you may use scipy.sparse.linalg.eigsh. It uses ARPACK behind the scenes, and there are parallel ARPACK implementations. AFAIK, SciPy can be compiled with one if your scipy installation uses the serial version.
However, this is not a good answer if you need all eigenvalues and eigenvectors of the matrix, as the sparse version uses the Lanczos algorithm.
If your matrix is not overwhelmingly large, then just use numpy.linalg.eigh. It uses LAPACK or BLAS and may use parallel code internally.
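A minimal sketch contrasting the two routes (small placeholder sizes; a 1-D Laplacian stencil stands in for the sparse symmetric matrix):

import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import eigsh

n = 2000
# Sparse symmetric test matrix: 1-D Laplacian stencil.
A = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)], offsets=[-1, 0, 1], format='csr')

# Lanczos/ARPACK: only a few eigenpairs (here, the six closest to zero via shift-invert).
vals_few, vecs_few = eigsh(A, k=6, sigma=0)

# Full exact diagonalization needs the dense matrix and LAPACK (possibly threaded).
vals_all, vecs_all = np.linalg.eigh(A.toarray())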
If you end up rolling your own, please note that SciPy/NumPy does all the heavy lifting with different highly optimized linear algebra packages, not in pure Python. Due to this the performance and degree of parallelism depends heavily on the libraries your SciPy/NumPy installation is compiled with.
(Your question does not reveal if you just want to have parallel code running on several processors, or on several computers. Also, the size of your matrix has a big impact on the best method. So, this answer may be completely off-the-mark.)

MATLAB twice as fast as Numpy

I am an engineering grad student currently making the transition from MATLAB to Python for the purposes of numerical simulation. I was under the impression that for basic array manipulation, Numpy would be as fast as MATLAB. However, for two different programs I have written, it appears that MATLAB is a little under twice as fast as Numpy. The test code I am using for Numpy (Python 3.3) is:
import numpy as np
import time
a = np.random.rand(5000,5000,3)
tic = time.time()
a[:,:,0] = a[:,:,1]
a[:,:,2] = a[:,:,0]
a[:,:,1] = a[:,:,2]
toc = time.time() - tic
print(toc)
Whereas for MATLAB 2012a I am using:
a = rand(5000,5000,3);
tic;
a(:,:,1) = a(:,:,2);
a(:,:,3) = a(:,:,1);
a(:,:,2) = a(:,:,3);
toc
The algorithm I am using is the one used on a NASA website comparing Numpy and MATLAB. The website shows that Numpy surpasses MATLAB in terms of speed for this algorithm. Yet my results show a 0.49 s run time for Numpy and a 0.29 s run time for MATLAB. I have also run a Gauss-Seidel solver in both Numpy and Matlab and get similar results (16.5 s vs. 9.5 s).
I am brand new to Python and am not extremely literate in terms of programming. I am using the WinPython 64 bit Python distribution but have also tried Pythonxy to no avail.
One thing I have read which should improve performance is building Numpy using MKL. Unfortunately I have no idea how to do this on Windows. Do I even need to do this?
Any suggestions?
That comparison ends up being apples to oranges due to caching, because it is more efficient to transfer or do work on contiguous chunks of memory. This particular benchmark is memory bound, since no computation at all is done, and thus the percentage of cache hits is key to achieving good performance.
Matlab lays the data out in column-major order (Fortran order), so a(:,:,k) is a contiguous chunk of memory, which is fast to copy.
Numpy defaults to row-major order (C order), so in a[:,:,k] there are big jumps between elements, and that slows down the memory transfer. Actually, the data layout can be chosen: on my laptop, creating the array with a = np.asfortranarray(np.random.rand(5000,5000,3)) leads to a 5x speed-up (1 s vs 0.19 s).
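For concreteness, the layout effect can be measured directly by timing the same slice assignments on a C-ordered array and on a Fortran-ordered copy (sizes as in the question):

import time
import numpy as np

def bench(a):
    # The three slice assignments from the question.
    t = time.time()
    a[:, :, 0] = a[:, :, 1]
    a[:, :, 2] = a[:, :, 0]
    a[:, :, 1] = a[:, :, 2]
    return time.time() - t

a_c = np.random.rand(5000, 5000, 3)   # C order (default): a[:,:,k] is strided
a_f = np.asfortranarray(a_c)          # Fortran order: a[:,:,k] is contiguous

print('C order:       %.3f s' % bench(a_c))
print('Fortran order: %.3f s' % bench(a_f))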
This result should be very similar both for numpy-MKL and plain numpy because MKL is a fast LAPACK implementation and here you're not calling any function that uses it (MKL definitely helps when solving linear systems, computing dot products...).
I don't really know what's going on in the Gauss-Seidel solver, but some time ago I wrote an answer to a question titled Numpy running at half the speed of MATLAB that talks a little bit about MKL, FFT and Matlab's JIT.
You are attempting to recreate the NASA experiment; however, you have changed many of the variables. For example:
Your hardware and operating system is different (www.nccs.nasa.gov/dali_front.html)
Your Python version is different (2.5.3 vs 3.3)
Your MATLAB version is different (2008 vs 2012)
Assuming the NASA results are correct, the difference in results is due to one or more of these changed variables. I recommend you:
Retest with the SciPy prebuilt binaries.
Research if any improvements were made to MATLAB relative to this type of calculation.
Also, you may find this link useful.
