performance in linear algebra with python

performance in linear algebra with python - python

Benchmarks of different languages and related questions are everywhere on the Internet. However, I still cannot figure out an answer of whether I should switch to C in my program.
Basically, The most time consuming part in my program involves a lot of matrix inverse and matrix multiplication. I have several plans:
stick with numpy.
use C with LAPACK/BLAS.
rewrite my python program and change the most time consuming part into C and then use python to call C.
I know numpy is just something wrapped around LAPACK/BLAS. So will 2 or 3 be substantially(500%) faster than 1?

I just wanted to ask a very similar question when i saw yours. I have tested this question from various directions. From quite some time I am trying to beat numpy.dot function by my code.
I have large complex matrices and their multiplication is the primary bottleneck of my program. I have tested following methods
simple c code.
cython code with various optimizations, using cblas.
python 32 bit and 64 bit versions and found that 64 bit version is 1.5-2 times faster than the 32 bit.
ananconda's MKL implementation but no luck there also.
einsum for the matrix multiplication
python 3 and python 2.7 are same python 3 # operator is also same
numpy.dot(a,b,c) is marginally faster than c=numpy.dot(a,b)
by far the numpy.dot is the best. It beat every other method, sometimes marginally (einsum) but mostly significantly.
During my research i come across one article namely
Ultrafast matrix multiplication which tells that apple's altivec implementation can multiply 2500x2500 matrix in less than a second. On my PC with intel core i3 4th generation 2.3 GHZ 4 gb ram it took 73 seconds using numpy.dot hence I am still searching for faster implementation on PC.

Related

Is there a way to further improve sparse solution times using python?

I have been trying different sparse solvers available in Python 3 and comparing the performance between them and also against Octave and Matlab. I have chosen both direct and iterative approaches, I will explain this more in detail below.
To generate a proper sparse matrix, with a banded structure, a Poisson's problem is solved using finite elements with squared grids of N=250, N=500 and N=1000. This results in dimensions of a matrix A=N^2xN^2 and a vector b=N^2x1, i.e., the largest NxN is a million. If one is interested in replicating my results, I have uploaded the matrices A and the vectors b in the following link (it will expire en 30 days) Get systems used here. The matrices are stored in triplets I,J,V, i.e. the first two columns are the indices for the rows and columns, respectively, and the third column are the values corresponding to such indices. Observe that there are some values in V, which are nearly zero, are left on purpose. Still, the banded structure is preserved after a "spy" matrix command in both Matlab and Python.
For comparison, I have used the following solvers:
Matlab and Octave, direct solver: The canonical x=A\b.
Matlab and Octave, pcg solver: The preconditioned conjugated gradient, pcg solver pcg(A,b,1e-5,size(b,1)) (not preconditioner is used).
Scipy (Python), direct solver: linalg.spsolve(A, b) where A is previously formatted in csr_matrix format.
Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)
Scipy (Python), UMFPACK solver: spsolve(A, b) using from scikits.umfpack import spsolve. This solver is apparently available (only?) under Linux, since it make use of the libsuitesparse [Timothy Davis, Texas A&M]. In ubuntu, this has to first be installed as sudo apt-get install libsuitesparse-dev.
Furthermore, the aforementioned python solvers are tested in:
Windows.
Linux.
Mac OS.
Conditions:
Timing is done right before and after the solution of the systems. I.e., the overhead for reading the matrices is not considered.
Timing is done ten times for each system and an average and a standard deviation is computed.
Hardware:
Windows and Linux: Dell intel (R) Core(TM) i7-8850H CPU #2.6GHz 2.59GHz, 32 Gb RAM DDR4.
Mac OS: Macbook Pro retina mid 2014 intel (R) quad-core(TM) i7 2.2GHz 16 Gb Ram DDR3.
Results:
Observations:
Matlab A\b is the fastest despite being in an older computer.
There are notable differences between Linux and Windows versions. See for instance the direct solver at NxN=1e6. This is despite Linux is running under windows (WSL).
One can have a huge scatter in Scipy solvers. This is, if the same solution is run several times, one of the times can just increase more than twice.
The fastest option in python can be nearly four times slower than the Matlab running in a more limited hardware. Really?
If you want to reproduce the tests, I leave here very simple scripts.
For matlab/octave:
IJS=load('KbN1M.txt');
b=load('FbN1M.txt');
I=IJS(:,1);
J=IJS(:,2);
S=IJS(:,3);
Neval=10;
tsparse=zeros(Neval,1);
tsolve_direct=zeros(Neval,1);
tsolve_sparse=zeros(Neval,1);
tsolve_pcg=zeros(Neval,1);
for i=1:Neval
tic
A=sparse(I,J,S);
tsparse(i)=toc;
tic
x=A\b;
tsolve_direct(i)=toc;
tic
x2=pcg(A,b,1e-5,size(b,1));
tsolve_pcg(i)=toc;
end
save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg
For python:
import time
from scipy import sparse as sp
from scipy.sparse import linalg
import numpy as np
from scikits.umfpack import spsolve, splu #NEEDS LINUX
b=np.loadtxt('FbN1M.txt')
triplets=np.loadtxt('KbN1M.txt')
I=triplets[:,0]-1
J=triplets[:,1]-1
V=triplets[:,2]
I=I.astype(int)
J=J.astype(int)
NN=int(b.shape[0])
Neval=10
time_sparse=np.zeros((Neval,1))
time_direct=np.zeros((Neval,1))
time_conj=np.zeros((Neval,1))
time_umfpack=np.zeros((Neval,1))
for i in range(Neval):
t = time.time()
A=sp.coo_matrix((V, (I, J)), shape=(NN, NN))
A=sp.csr_matrix(A)
time_sparse[i,0]=time.time()-t
t = time.time()
x=linalg.spsolve(A, b)
time_direct[i,0] = time.time() - t
t = time.time()
x2=sp.linalg.cg(A, b, x0=None, tol=1e-05)
time_conj[i,0] = time.time() - t
t = time.time()
x3 = spsolve(A, b) #ONLY IN LINUX
time_umfpack[i,0] = time.time() - t
np.savetxt('pythonlinux_n1M_tsparse.txt',time_sparse,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt',time_direct,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt',time_conj,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt',time_umfpack,fmt='%.18f')
Is there a way to further improve sparse solution times using python? or at least be in a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for python, but I belive it will not get much better than the UMFPACK choice. Suggestions are very welcome.
P.S. I am aware of previous posts, e.g. scipy slow sparse matrix solver
Issues using the scipy.sparse.linalg linear system solvers
How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg?
But I think none is as comprehensive as this one, highlighting even more issues between operative systems when using python libraries.
EDIT_1:
I add a new plot with results using the QR solver from intel MKL using a python wrapper as suggested in the comments. This is, however, still behind Matlab's performance.
To do this, one needs to add:
from sparse_dot_mkl import sparse_qr_solve_mkl
and
sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
to the scripts provided in the original post. The ".astype(np.float32)" can be omitted, and the performance gets slighlty worse (about 10 %) for this system.

I will try to answer to myself. To provide an answer, I tried an even more demanding example, with a matrix of size of (N,N) of about half a million by half a million and the corresponding vector (N,1). This, however, is much less sparse (more dense) than the one provided in the question. This matrix stored in ascii is of about 1.7 Gb, compared to the one of the example, which is of about 0.25 Gb (despite its "size" is larger). See its shape here,
Then, I tried to solve Ax=b using again Matlab, Octave and Python using the aforementioned the direct solvers from scipy, the intel MKL wrapper, the UMFPACK from Tim Davis.
My first surprise is that both Matlab and Octave could solve the systems using the A\b, which is not for certain that it is a direct solver, since it chooses the best solver based on the characteristics of the matrix, see Matlab's x=A\b. However, the python's linalg.spsolve , the MKL wrapper and the UMFPACK were throwing out-of-memory errors in Windows and Linux. In mac, the linalg.spsolve was somehow computing a solution, and alghouth it was with a very poor performance, it never through memory errors. I wonder if the memory is handled differently depending on the OS. To me, it seems that mac swapped memory to the hard drive rather than using it from the RAM. The performance of the CG solver in Python was rather poor, compared to the matlab. However, to improve the performance in the CG solver in python, one can get a huge improvement in performance if A=0.5(A+A') is computed first (if one obviously, have a symmetric system). Using a preconditioner in Python did not help. I tried using the sp.linalg.spilu method together with sp.linalg.LinearOperator to compute a preconditioner, but the performance was rather poor. In matlab, one can use the incomplete Cholesky decomposition.
For the out-of-memory problem the solution was to use an LU decomposition and solve two nested systems, such as Ax=b, A=LL', y=L\b and x=y\L'.
I put here the min. solution times,
Matlab mac, A\b = 294 s.
Matlab mac, PCG (without conditioner)= 17.9 s.
Matlab mac, PCG (with incomplete Cholesky conditioner) = 9.8 s.
Scipy mac, direct = 4797 s.
Octave, A\b = 302 s.
Octave, PCG (without conditioner)= 28.6 s.
Octave, PCG (with incomplete Cholesky conditioner) = 11.4 s.
Scipy, PCG (without A=0.5(A+A'))= 119 s.
Scipy, PCG (with A=0.5(A+A'))= 12.7 s.
Scipy, LU decomposition using UMFPACK (Linux) = 3.7 s total.
So the answer is YES, there are ways to improve the solution times in scipy. The use of the wrappers for UMFPACK (Linux) or intel MKL QR solver is highly recommended, if the memmory of the workstation allows it. Otherwise, performing A=0.5(A+A') prior to using the conjugate gradient solver can have a positive effect in the solution performance if one is dealing with symmetric systems.
Let me know if someone would be interested in having this new system, so I can upload it.

sympy compiling functions with large matrices

I have been using sympy to work with systems of differential equations. I write the equations symbolically, use autowrap to compile them through cython, and then pass the resulting function to the scipy ODE solver. One of the major benefits of doing this is that I can solve for the jacobian symbolically using the sympy jacobian function, compile it, and it to the ODE solver as well.
This has been working great for systems of about 30 variables. Recently I tried doing it with 150 variables, and what happened was that I ran out of memory when compiling the jacobian function. This is on Windows with anaconda and the microsoft Visual C++ 14 tools for python. Basically during compilation of the jacobian, which is now a 22000-element vector, memory usage during the linking step went up to about 7GB (on my 8GB laptop) before finally crashing out.
Does someone have some suggestions before I go and try on a machine with more memory? Are other operating systems or other C compilers likely to improve the situation?
I know lots of people do this type of work, so if there's an answer, it will be beneficial to a good chunk of the community.
Edit: response to some of Jonathan's comments:
Yes, I'm fully aware that this is an N^2 problem. The jacobian is a matrix of all partial derivatives, so it will have size N^2. There is no real way around this scaling. However, a 22000-element array is not nearly at the level that would create a memory problem during runtime -- I only have the problem during compilation.
Basically there are three levels that we can address this at.
1) solve the ODE problem without the jacobian, or somehow split up the jacobian to not have a 150x150 matrix. That would address the very root, but it certainly limits what I can do, and I'm not yet convinced that it's impossible to compile the jacobian function
2) change something about the way sympy automatically generates C code, to split it up into multiple chunks, use more functions for intermediate expressions, to somehow make the .c file smaller. People with more sympy experience might have some ideas on this.
3) change something about the way the C is compiled, so that less memory is needed.
I thought that by posting a separate question more oriented around #3 (literal referencing of large array -- compiler out of memory) , I would get a different audience answering. That is in fact exactly what happened. Perhaps the answer to #3 is "you can't" but that's also useful information.

Following a lot of the examples posted at http://www.sympy.org/scipy-2017-codegen-tutorial/ I was able to get this to compile.
The key things were
1) instead of using autowrap, write the C code directly with more control over it. Among other things, this allows passing the argument list as a vector instead of expanding it. This took some effort to get working (setting up the compiler flags through distutils, etc, etc) but in the end it worked well. Having the repo from the course linked above as an example helped a lot.
2) using common subexpression elimination (sympy.cse) to dramatically reduce the size of the expressions for the jacobian elements.
(1) by itself didn't do that much to help in this case (although I was able to use it to vastly improve performance of smaller models). The code was still 200 MB instead of the original 300 MB. But combining it with (2) (cse) I was able to get it down to a meager 1.7 MB (despite 14000 temporary variables).
The cse takes about 20-30 minutes on my laptop. After that, it compiles quickly.

Efficient Matrix-Vector Multiplication: Multithreading directly in Python vs. using ctypes to bind a multithreaded C function

I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 gb (3000^2 by 500).
Some info:
The matrix is stored in HDF5 format. It's Matlab output. It's dense so no sparsity savings there.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking MH Criteria, saving accepted proposals, monitoring the burnout, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have ctype stuff going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program is run mainly on linux, but occasionally on OSX and Windows. Cross-platform capabilities without too much headache is a must.
Right now I have a single-thread implementation where the python code reads in the matrix a few thousand lines at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?

Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to have a machine with enough memory to keep the whole matrix multiplication and the result in RAM. It would save lots of time if you read the matrix only once.
Replacing the harddisk with an SSD would also help, because that can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But only really if a single process cannot use all available disk read bandwith.

What could be wrong with this matrix multiplication benchmark (Matlab vs Numpy)?

Here's the code I wrote for comparing performance of numpy vs Matlab. It just measures the average time taken for matrix multiplication (1701x576 matrix M1 * 576x576 matrix M2).
Matlab version : (M1 is (1701x576) while M2 is (576x576) matrix)
function r = benchmark(M1,M2)
total_time=0;
for i=1:4
for j=1:1500
tic;
a=M1*M2;
tim=toc;
total_time =total_time+tim;
end
end
avg_time = total_time/4
r=avg_time
end
Python version :
def benchmark():
iters = range(1500)
for i in range(4):
for j in iters:
tic = time.time()
a=M1.dot(M2);
toc = time.time() - tic
t_time=t_time+toc;
return t_time/4
Matlab version takes almost ~18.2s , while Python takes ~19.3s . Ive repeated this test multiple times , and Matlab was always performing better than Python (even if smaller difference) in all cases . My understanding is Numpy uses efficient and compiled code for vector operations , and is supposed to be faster than Matlab.
Then , why could Matlab perform faster than Numpy ? The test was done on a 32 core machine .
Where did I go wrong ? or is this expected for Numpy to be slower than Matlab.
Are there ways to improve the performance for Python ?
Edit : Updated the matlab code to fix the loop index/return value error . The error was the result of me trying to edit the names in the snipper to make it presentable just before posting(a bad idea everytime :) ).

[edited to remove the mention of loops; that was my mistake]
Couple things--
First, the multicore nature of the machine doesn't really matter unless you're explicitly using those extra cores (or linking NumPy against a BLAS library that uses multiple cores -- thanks #ali_m). If you're not, it'll run about as fast on a 32-core machine as it will on a 4-core machine (assuming the clock speeds of the cores themselves are roughly equal).
Second, using purely off-the-shelf Matlab vs off-the-shelf NumPy, Matlab generally beats out NumPy. This is a very general statement, though; YMMV. Also, speaking of Matlab, there does indeed appear to be a bug in the looping indices.
Third, this may not be the best benchmark for performance; there may be some unseen caching issues taking place under the hood that aren't obvious. A better one would be to randomly generate the matrices on-the-fly in each iteration and multiply them, but even this could be problematic depending on the random number generator.

There are two major issues I can see with the test.
The first is that you are using global variable lookup in Python while you are using local variable lookup in MATLAB. Global variable lookup in Python is relatively slow. Making sure the variables are local like they are in MATLAB will affect the performance.
The second is that you are re-doing the same calculation over and over. MATLAB has a JIT for loops and numpy has a cache for calculations, both of which can reduce the time for repeated calculations.
So to make the comparisons more equal and reliable, you should create new, random matrices each time through the loop. This will prevent caching and JIT from messing up your results, and will make sure the variables are all local.

There is a bug in your Matlab code. It appears that you are using the same loop control variable in nested loops.
The outer loop actually only runs once.
Edit: The outer loop actually runs the correct number of times. The two loop control variables seem to be independent.

MATLAB twice as fast as Numpy

I am an engineering grad student currently making the transition from MATLAB to Python for the purposes of numerical simulation. I was under the impression that for basic array manipulation, Numpy would be as fast as MATLAB. However, it appears for two different programs I write that MATLAB is a little under twice as fast as Numpy. The test code I am using for Numpy (Python 3.3) is:
import numpy as np
import time
a = np.random.rand(5000,5000,3)
tic = time.time()
a[:,:,0] = a[:,:,1]
a[:,:,2] = a[:,:,0]
a[:,:,1] = a[:,:,2]
toc = time.time() - tic
print(toc)
Whereas for MATLAB 2012a I am using:
a = rand(5000,5000,3);
tic;
a(:,:,1) = a(:,:,2);
a(:,:,3) = a(:,:,1);
a(:,:,2) = a(:,:,3);
toc
The algorithm I am using is the one used on a NASA website comparing Numpy and MATLAB. The website shows that Numpy surpasses MATLAB in terms of speed for this algorithm. Yet my results show a 0.49 s simulation time for Numpy and a 0.29 s simulation time for MATLAB. I also have run a Gauss-Seidel solver on both Numpy and Matlab and I get similar results (16.5 s vs. 9.5 s)
I am brand new to Python and am not extremely literate in terms of programming. I am using the WinPython 64 bit Python distribution but have also tried Pythonxy to no avail.
One thing I have read which should improve performance is building Numpy using MKL. Unfortunately I have no idea how to do this on Windows. Do I even need to do this?
Any suggestions?

That comparison ends up being apples to oranges due to caching, because it is more efficient to transfer or do some work on contiguous chunks of memory. This particular benchmark is memory bound, since in fact no computation is done, and thus the percentage of cache hits is key to achieve good performance.
Matlab lays the data in column-major order (Fortran order), so a(:,:,k) is a contiguous chunk of memory, which is fast to copy.
Numpy defaults to row-major order (C order), so in a[:,:,k] there are big jumps between elements and that slows down the memory transfer. Actually, the data layout can be chosen. In my laptop, creating the array with a = np.asfortranarray(np.random.rand(5000,5000,3)) leds to a 5x speed up (1 s vs 0.19 s).
This result should be very similar both for numpy-MKL and plain numpy because MKL is a fast LAPACK implementation and here you're not calling any function that uses it (MKL definitely helps when solving linear systems, computing dot products...).
I don't really know what's going on on the Gauss Seidel solver, but some time ago I wrote an answer to a question titled Numpy running at half the speed of MATLAB that talks a little bit about MKL, FFT and Matlab's JIT.

You are attempting to recreate the NASA experiment, however you have changed many of the variables. For example:
Your hardware and operating system is different (www.nccs.nasa.gov/dali_front.html)
Your Python version is different (2.5.3 vs 3.3)
Your MATLAB version is different (2008 vs 2012)
Assuming the NASA results are correct, the difference in results is due to one or more of these changed variables. I recommend you:
Retest with the SciPy prebuilt binaries.
Research if any improvements were made to MATLAB relative to this type of calculation.
Also, you may find this link useful.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.