I want to generate a random matrix of shape (1e7, 800). But I find numpy.random.rand() becomes very slow at this scale. Is there a quicker way?
A simple way to do that is to write a multi-threaded implementation using Numba:
import numba as nb
import numpy as np
@nb.njit('float64[:,:](int_, int_)', parallel=True)
def genRandom(n, m):
    res = np.empty((n, m))
    # Parallel loop over the rows
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res
This is 6.4 times faster than np.random.rand() on my 6-core machine.
Note that using 32-bit floats may speed up the computation a bit, at the cost of lower precision.
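As a rough illustration of the 32-bit note, here is a minimal sketch that uses NumPy's Generator API rather than the Numba function above (that swap is my assumption, not part of the answer). It only shows how to request float32 samples and how much memory they take; the shape is reduced so it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generator.random accepts a dtype, so float32 samples can be drawn directly
x32 = rng.random((1000, 800), dtype=np.float32)
x64 = rng.random((1000, 800))  # default dtype is float64

# float32 halves the memory footprint compared with float64
print(x32.dtype, x32.nbytes)
print(x64.dtype, x64.nbytes)
```

Whether this is also faster depends on the machine, so it is worth timing on the target hardware.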
Numba is a good option. Another option that might work well is dask.array, which creates lazy blocks of numpy arrays and performs the computations on those blocks in parallel. On my machine I get a factor of 2 improvement in speed (for a 1e6 x 1e3 matrix, since I don't have enough memory for the full size).
rows = 10**6
cols = 10**3
import dask.array as da
x = da.random.random(size=(rows, cols)).compute() # takes about 5 seconds
# import numpy as np
# x = np.random.rand(rows, cols) # takes about 10 seconds
Note that .compute() at the end only brings the computed array into memory. In general you can keep exploiting dask's parallel computations to get much faster calculations (which can also scale beyond a single machine); see the docs.
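The block-wise idea can also be sketched with plain NumPy and threads, without dask. This is only an illustration under assumptions: the function name parallel_random, the worker count and the seed are hypothetical, and the speed-up depends on how many cores are available. Each worker gets its own independent generator and fills one block of rows in place.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def parallel_random(n, m, n_workers=4, seed=0):
    """Fill an (n, m) float64 array with uniform samples using several threads."""
    out = np.empty((n, m))
    # One independent generator per worker, all derived from a single seed
    child_seeds = np.random.SeedSequence(seed).spawn(n_workers)
    generators = [np.random.default_rng(s) for s in child_seeds]
    # Split the rows into contiguous blocks, one block per worker
    bounds = np.linspace(0, n, n_workers + 1).astype(int)

    def fill(k):
        # Generator.random writes straight into the row block through `out=`
        generators[k].random(out=out[bounds[k]:bounds[k + 1]])

    with ThreadPoolExecutor(n_workers) as pool:
        list(pool.map(fill, range(n_workers)))
    return out


x = parallel_random(10_000, 80)
print(x.shape)
```

Because every block is written by a fixed generator, the result is reproducible for a given seed and worker count.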
An attempt to compare the answers given so far:
I wrote a script, compiled from the answers already given (by SultanOrazbayev and Jérôme Richard), that contains one function for each of the numba, dask and numpy approaches and measures the time each one takes for arrays of different sizes.
The code
import dask.array as da
import matplotlib.pyplot as plt
import numba as nb
import timeit
import numpy as np
@nb.njit('float64[:,:](int_, int_)', parallel=True)
def nmb(n, m):
    res = np.empty((n, m))
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res
def nmp(n, m):
    return np.random.random((n, m))
def dask(n, m):
    return da.random.random(size=(n, m)).compute()
if __name__ == '__main__':
    data = []
    for i in range(1, 16):
        dmm = 2 ** i
        # Time the Numba version
        s_nmb = timeit.default_timer()
        nmb(dmm, dmm)
        e_nmb = timeit.default_timer()
        # Time the NumPy version
        s_nmp = timeit.default_timer()
        nmp(dmm, dmm)
        e_nmp = timeit.default_timer()
        # Time the Dask version
        s_dask = timeit.default_timer()
        dask(dmm, dmm)
        e_dask = timeit.default_timer()
        data.append([
            dmm,
            e_nmb - s_nmb,
            e_nmp - s_nmp,
            e_dask - s_dask
        ])
    data = np.array(data)
    plt.plot(data[:, 0], data[:, 1], "-r", label="Numba")
    plt.plot(data[:, 0], data[:, 2], "-g", label="Numpy")
    plt.plot(data[:, 0], data[:, 3], "-b", label="Dask")
    plt.xlabel("Number of elements on each axis")
    plt.ylabel("Time spent (s)")
    plt.legend()
    plt.show()
The result
I'm applying an integration function using scipy.integrate to two 2D arrays. Here's the example:
from scipy import integrate
import numpy as np
def integrate_lno2(top, bottom, peak_height, peak_width):
    return integrate.quad(lambda x: np.exp(-np.power(x - peak_height, 2)/(2*np.power(peak_width, 2))), top, bottom)[0]
# change row and col to test speed
row = 100; col = 100; peak_height=300; peak_width=60
top = np.linspace(100, 200, row*col).reshape(row, col)
bottom = np.linspace(800, 900, row*col).reshape(row, col)
res = np.zeros((row, col))
for i in range(row):
    for j in range(col):
        res[i, j] = integrate_lno2(top[i, j], bottom[i, j], peak_height, peak_width)
As the 2D arrays grow, the for loop becomes slow. I have found the numba integrand example, but it doesn't accept the upper and lower limits.
Like in this previous answer, you can use Numba to speed up the lambda calls, which are very slow due to large Numpy overheads (Numpy is not optimized to operate on scalars and is very slow at it). Even better: you can tell Numba to generate a C function that Scipy can call directly with very little overhead (since it almost completely removes the overhead of the slow CPython interpreter). You can also precompute the division by a variable and convert it into a multiplication (faster).
Here is the resulting code:
import numba as nb
import numpy as np
import scipy as sp
import scipy.integrate
# change row and col to test speed
row = 100; col = 100; peak_height=300; peak_width=60
# Precomputed so that the integrand multiplies instead of dividing
factor = -1.0 / (2 * np.power(peak_width, 2))
@nb.cfunc('float64(float64)')
def compute_numba(x):
    return np.exp(np.power(x - peak_height, 2) * factor)
# Wrap the compiled C function so that Scipy can call it directly
compute_c = sp.LowLevelCallable(compute_numba.ctypes)
def integrate_lno2(top, bottom):
    return sp.integrate.quad(compute_c, top, bottom)[0]
top = np.linspace(100, 200, row*col).reshape(row, col)
bottom = np.linspace(800, 900, row*col).reshape(row, col)
res = np.zeros((row, col))
for i in range(row):
    for j in range(col):
        res[i, j] = integrate_lno2(top[i, j], bottom[i, j])
The computing loop is roughly 100 times faster on my machine.
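For this particular integrand there is also a different route that avoids numerical quadrature altogether; it is not the Numba approach above. The integrand is a Gaussian, so the integral from a to b equals peak_width * sqrt(pi/2) * (erf((b - peak_height) / (peak_width * sqrt(2))) - erf((a - peak_height) / (peak_width * sqrt(2)))). The sketch below evaluates that formula on whole arrays; the function name gaussian_integral is hypothetical, and the approach only applies while the integrand keeps this form.

```python
import numpy as np
from scipy import integrate, special


def gaussian_integral(a, b, peak_height, peak_width):
    """Integral of exp(-(x - peak_height)**2 / (2 * peak_width**2)) from a to b."""
    scale = peak_width * np.sqrt(2.0)
    return peak_width * np.sqrt(np.pi / 2.0) * (
        special.erf((b - peak_height) / scale)
        - special.erf((a - peak_height) / scale)
    )


row, col, peak_height, peak_width = 100, 100, 300, 60
top = np.linspace(100, 200, row * col).reshape(row, col)
bottom = np.linspace(800, 900, row * col).reshape(row, col)

# One vectorized call replaces the double loop over quad()
res = gaussian_integral(top, bottom, peak_height, peak_width)

# Spot-check a single entry against scipy.integrate.quad
ref = integrate.quad(
    lambda x: np.exp(-(x - peak_height) ** 2 / (2 * peak_width ** 2)),
    top[0, 0], bottom[0, 0],
)[0]
print(res[0, 0], ref)
```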
I'm trying to learn more about using shared memory to improve the performance of some CUDA kernels in Numba. To that end I looked at the matrix multiplication example in the Numba documentation and tried to implement it to see the gain.
This is my test implementation. I'm aware that the example in the documentation has some issues, which I learned about from here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x # blocks per grid
    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]
        # Wait until all threads finish preloading
        cuda.syncthreads()
        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]
        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)
a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)
s = timer()
cuda.synchronize()
matmul[bpg,tpb](a_in, b_in, c_out1);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)
s = timer()
cuda.synchronize()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)
The execution times of the two kernels above are essentially the same; in fact, matmul was faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.
Thanks,
Bruno.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That triggered other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.
I'm trying to use cupy with GPU (NVIDIA K6000 with 2880 cores) to accelerate some for-loops over several 3-dimensional arrays in Python; as shown by the example code below. I define a function that takes the 3D arrays as input argument, performs the for-loops, and outputs an array. I run two cases and compare the execution times: (1) jit-compiled function containing for-loop processing two numpy arrays (A_cpu, B_cpu), in parallel on 4 CPU cores, and (2) the for-loop is called on two cupy arrays (A_gpu, B_gpu), on the 2880 GPU cores. I get the same values from each case; however, the cupy GPU execution is significantly slower than the numpy CPU execution; as shown in the results below. I am not able to use array broadcasting (A[i, :, :]) to eliminate any of the {i, j, k} for-loops, because array indexing is not uniform across operations.
I should add that the math operations in the actual code are much more complex and involve around 12 different 3D arrays; for the sake of brevity, I only show the array-indexing interdependence here. Also, running the actual code with @jit(parallel=True) significantly accelerates it on a 4-core CPU compared to non-parallel jit, which is why I suspect it should be amenable to even more massive parallelization on a 2880-core GPU.
The example Python code is below:
import numpy as np
import cupy as cp
from time import time
from numba import cuda, jit, prange
M = 100
N = 100
P = 10
A_cpu = np.random.rand(M, N, P)
B_cpu = np.random.rand(M, N, P)
A_gpu = cp.asarray(A_cpu)
B_gpu = cp.asarray(B_cpu)
print('Being CPU test:')
@jit(parallel=True)
def cpu_loop(A_cpu, B_cpu):
    for i in prange(1, M-1):
        for j in prange(1, N-1):
            for k in prange(1, P-1):
                A_cpu[i, j, k] = (A_cpu[i-1, j-1, k+1]*2 - A_cpu[i, j+1, k-1]/2) / (B_cpu[i+1, j+1, k-1]*2 - B_cpu[i, j-1, k+1]/2)
    return A_cpu
start = time()
cpu_res = cpu_loop(A_cpu, B_cpu)
end = time()
cpu_total_time = end - start
print('CPU result = ', cpu_res[M//2, N//2, P//2])
print('CPU total_time = {:2.3f} seconds'.format(cpu_total_time))
print('Begin GPU test:')
def gpu_loop(A_gpu, B_gpu):
    for i in prange(1, M-1):
        for j in prange(1, N-1):
            for k in prange(1, P-1):
                A_gpu[i, j, k] = (A_gpu[i-1, j-1, k+1]*2 - A_gpu[i, j+1, k-1]/2) / (B_gpu[i+1, j+1, k-1]*2 - B_gpu[i, j-1, k+1]/2)
    return A_gpu
start = time()
gpu_res = gpu_loop(A_gpu, B_gpu)
end = time()
gpu_total_time = end - start
print('GPU result = ', gpu_res[M//2, N//2, P//2])
print('GPU total_time = {:2.3f} seconds'.format(gpu_total_time))
The output of above code is below:
Being CPU test:
CPU result = 208.3083524104198
CPU total_time = 2.117 seconds
Begin GPU test:
GPU result = 208.3083524104198
GPU total_time = 33.462 seconds
Any insights or advice is greatly appreciated in advance!
My ultimate goal is to accelerate the computation of a matrix-vector product in Python, potentially by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~ 0.05), and the vector x is 15k elements and dense, and I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.
My current non-GPU “optimization” is to represent A as a scipy.sparse.csc_matrix object, and then simply computing A.dot(x), but I was hoping to speed this up on a VM with a couple NVIDIA GPUs attached, and using only Python if possible (i.e. not writing out the detailed kernel functions by hand). I’ve succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, or scikit-cuda, or anaconda’s accelerate package, but there’s not a ton of information so it’s hard to know where to begin.
I don’t need greatly detailed instructions, but if anyone has solved this before and could provide a “big picture” roadmap for the simplest way of doing this, or has an idea of the sort of speed up a sparse GPU-based matrix-vector product would have over scipy’s sparse algorithms, that would be very helpful.
Another alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is nice) and (for me at least) it turned out to be much easier to install than pycuda.
The new code would look something like this:
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu
A = some_sparse_matrix #(scipy.sparse.csr_matrix)
x = some_dense_vector #(numpy.ndarray)
A_gpu = csr_gpu(A) #moving A to the gpu
x_gpu = cp.array(x) #moving x to the gpu
for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu) #back to numpy object for fast indexing
UPDATE: cupy now supports AMD GPUs as well through their ROCm stack, so that's an added bonus
As pointed out in comments, NVIDIA ship the cuSPARSE library which includes functions for sparse matrix products with dense vectors.
Numba now has Python bindings for the cuSparse library via the pyculib package.
Thanks for the suggestions.
I managed to get pyculib’s csrmm (matrix multiplication for compressed sparse row formatted matrices) operation to work using the following (using 2 NVIDIA K80 GPUs on Google Cloud Platform), but unfortunately wasn’t able to achieve a speedup.
I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn’t figure out any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:
import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time
def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""
    # args to csrmm call
    trans_a = 'N' # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0] # num rows in a
    n = b.shape[1] # num cols in b, c
    k = a_sparse.shape[1] # num cols in a
    nnz = len(a_sparse.data) # num nonzero in a
    alpha = 1 # no scaling
    descr_a = sp.matdescr( # matrix descriptor
        indexbase=0, # 0-based indexing
        matrixtype='G', # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data # csr data
    csr_row_ptr_a = a_sparse.indptr # csr row pointers
    csr_col_ind_a = a_sparse.indices # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = b.shape[0]
    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()
    return c, toc - tic
# run benchmark
COUNT = 20
N = 5000
P = 0.1
print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(np.float32)
sp = Sparse()
# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))
# pyculib sparse
c, t = spmv_cuda(a_sparse, b, sp, COUNT)
print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))
which yields the output:
Constructing objects...
scipy sparse matrix multiplication took 0.05158638954162598 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Testing pyculib sparse matrix multiplication...
pyculib sparse matrix multiplication took 0.12598299980163574 seconds
c = [ 122.29483032 127.83659363 128.75003052 130.6912384 124.98326111]
As you can see, pyculib is more than twice as slow, even though the matrix multiplication is on the GPU. Again, probably because of overhead involved in transferring data to/from GPU at each iteration.
An alternative solution I found, however, was to use Andreas Kloeckner’s pycuda library, which yielded a 50x speed up!
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time
def spmv_cuda(a_sparse, b, count):
    dtype = a_sparse.dtype
    m = a_sparse.shape[0]
    print('moving objects to GPU...')
    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)
    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)
    print('executing spmv operation...\n')
    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()
    return d_c.get(), toc - tic
# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32
print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(DTYPE)
# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()
print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))
# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))
# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))
which yields this output:
numpy dense matrix multiplication took 0.2290663719177246 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
scipy sparse matrix multiplication took 0.24468040466308594 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
moving objects to GPU...
executing spmv operation...
pycuda sparse matrix multiplication took 0.004545450210571289 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Note 1: pycuda requires the following dependencies:
pymetis: install using: pip install pymetis
nvcc: install using: sudo apt install nvidia-cuda-toolkit
Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you’ll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.
Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out-of-the-box.
After installing CUDA and tensorflow-gpu (a couple of involved but straightforward tutorials are here and here), you can use tensorflow's SparseTensor class and sparse_tensor_dense_matmul function as follows:
import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time
Make sure GPU is detected:
gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n {}'.format(gpus))
Output:
GPU DEVICES:
['/device:GPU:0']
Benchmarks:
from scipy.sparse import csr_matrix
ITERS = 30
N = 20000
P = 0.1 # matrix density
Using scipy:
np.random.seed(0)
a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N)
tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()
elapsed = toc - tic
print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
Scipy spmv product took 0.06693172454833984 seconds per iteration.
Using GPU-enabled tensorflow:
with tf.device('/device:GPU:0'):
    np.random.seed(0)
    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0
    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape
    a_sparse = tf.SparseTensor(indices, values, dense_shape)
    b = tf.constant(np.random.rand(N, 1))
    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()
    elapsed = toc - tic
    print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
GPU spmv product took 0.0011811971664428711 seconds per iteration.
Quite a nice speed-up, it turns out.
It takes 0.02 seconds for Matlab to compute the inverse of a diagonal matrix using the sparse command.
P = diag(1:10000);
P = sparse(P);
tic;
A = inv(P);
toc
However, for the Python code it takes forever - several minutes.
import numpy as np
import time
startTime = time.time()
P = np.diag(range(1,10000))
A = np.linalg.inv(P)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
I tried to use Scipy.sparse module but it did not help. The running time dropped, but only to 40 seconds.
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
startTime = time.time()
P = np.diag(range(1,10000))
P_sps = sps.coo_matrix(P)
A = spsl.inv(P_sps)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
Is it possible to run the code as fast as it runs in Matlab?
Here is the answer. When you run inv in Matlab on a sparse matrix, Matlab checks different properties of the matrix to optimize the calculation. For a sparse diagonal matrix, you can run the following code to see what Matlab is doing:
n = 10000;
a = diag(1:n);
a = sparse(a);
I = speye(n,n);
spparms('spumoni',1);
ainv = inv(a);
spparms('spumoni',0);
Matlab will print the following:
sp\: bandwidth = 0+1+0.
sp\: is A diagonal? yes.
sp\: do a diagonal solve.
So matlab is inverting only the diagonal.
How does Scipy invert the matrix??
Here we have the code:
...
from scipy.sparse.linalg import spsolve
...
def inv(A):
    """
    Some comments...
    """
    I = speye(A.shape[0], A.shape[1], dtype=A.dtype, format=A.format)
    Ainv = spsolve(A, I)
    return Ainv
and spsolve
# Cover the case where b is also a matrix
Afactsolve = factorized(A)
tempj = empty(M, dtype=int)
x = A.__class__(b.shape)
for j in range(b.shape[1]):
    xj = Afactsolve(squeeze(b[:, j].toarray()))
    w = where(xj != 0.0)[0]
    tempj.fill(j)
    x = x + A.__class__((xj[w], (w, tempj[:len(w)])),
                        shape=b.shape, dtype=A.dtype)
i.e., scipy factorizes A and then solves a set of linear systems whose right-hand sides are the coordinate vectors (the columns of the identity matrix). Arranging all the solutions as the columns of a matrix gives the inverse of the initial matrix.
If Matlab is taking advantage of the diagonal structure of the matrix but scipy is not (of course scipy also uses the structure of the matrix, but less efficiently, at least for this example), Matlab should be much faster.
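The factorize-then-solve procedure can be sketched on a tiny matrix. This is a simplified illustration, not scipy's actual implementation: the 3x3 matrix is made up, and the result is assembled as a dense array instead of a sparse one.

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import factorized

# Small made-up example matrix in CSC format
A = sps.csc_matrix(np.array([[4.0, 0.0, 1.0],
                             [0.0, 2.0, 0.0],
                             [1.0, 0.0, 3.0]]))

solve = factorized(A)  # factorize A once and reuse it for every solve
n = A.shape[0]
columns = []
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = 1.0                # j-th coordinate vector
    columns.append(solve(e_j))  # j-th column of the inverse

A_inv = np.column_stack(columns)
print(A_inv)
```

Every column costs one solve, so a 10000 x 10000 matrix needs 10000 solves even when it is diagonal.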
EDIT
To be sure, as @P.Escondido proposed, we will try a minor modification of the matrix to trace the Matlab procedure when the matrix is not diagonal:
n = 10000; a = diag(1:n); a = sparse(a); ainv = sparse(n,n);
spparms('spumoni',1);
a(100,10) = 500; a(10,1000) = 200;
ainv = inv(a);
spparms('spumoni',0);
It prints out the following:
sp\: bandwidth = 90+1+990.
sp\: is A diagonal? no.
sp\: is band density (0.00) > bandden (0.50) to try banded solver? no.
sp\: is A triangular? no.
sp\: is A morally triangular? yes.
sp\: permute and solve.
sp\: sprealloc in sptsolve: 10000 10000 10000 15001
How about splu()? It's faster, but it needs a dense array and returns a dense array:
Create a random matrix:
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
from numpy.random import randint
N = 1000
i = np.arange(N)
j = np.arange(N)
v = np.ones(N)
i2 = randint(0, N, N)
j2 = randint(0, N, N)
v2 = np.random.rand(N)
i = np.concatenate((i, i2))
j = np.concatenate((j, j2))
v = np.concatenate((v, v2))
A = sps.coo_matrix((v, (i, j)))
A = A.tocsc()
%time B = spsl.inv(A)
calculate inverse matrix by splu():
%%time
lu = spsl.splu(A)
eye = np.eye(N)
B2 = lu.solve(eye)
check the result:
np.allclose(B.todense(), B2)
Here is the %time output:
inv: 2.39 s
splu: 193 ms
You are withholding crucial information from your software: the fact that the matrix is diagonal makes it super easy to invert. You simply invert each element of its diagonal:
P = np.diag(range(1,10000))
A = np.diag(1.0/np.arange(1,10000))
Of course, this is only valid for diagonal matrices...
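The same idea works in sparse storage, which avoids building two dense 10000 x 10000 arrays. A minimal sketch, assuming the matrix really is diagonal with no zero entries on the diagonal; the use of scipy.sparse.diags here is my choice for illustration, not something from the answer above.

```python
import numpy as np
import scipy.sparse as sps

n = 10000
d = np.arange(1, n + 1, dtype=float)

# Sparse diagonal matrix: only the n diagonal entries are stored
P = sps.diags(d, format="csr")

# Inverting a diagonal matrix means inverting each diagonal entry
A = sps.diags(1.0 / P.diagonal(), format="csr")

# Sanity check: the diagonal of P @ A should be all ones
print(np.allclose((P @ A).diagonal(), 1.0))
```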
If you time only the inversion, leaving the construction of the matrices out of the timed region, the result will be better:
import numpy as np
import time
import scipy.sparse as sps
import scipy.sparse.linalg as spsl
P = np.diag(range(1,10000))
P_sps = sps.coo_matrix(P)
startTime = time.time()
A = spsl.inv(P_sps)
runningTime = (time.time()-startTime)/60
print("The script was running for %f minutes" % runningTime)
Now you can compare with your matlab script.