I'm trying to learn more about using shared memory to improve performance in some CUDA kernels in Numba. To that end, I was looking at the matrix multiplication example in the Numba documentation and implemented it to measure the gain.
This is my test implementation. I'm aware that the example in the documentation has some issues that I followed from here, so I copied the fixed example code.
from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

size = 1024 * 4
tpbx, tpby = 16, 16
tpb = (tpbx, tpby)
bpgx, bpgy = int(size / tpbx), int(size / tpby)
bpg = (bpgx, bpgy)

a_in = cuda.to_device(np.arange(size * size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size * size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)

s = timer()
cuda.synchronize()
matmul[bpg, tpb](a_in, b_in, c_out1)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)

c_host1 = c_out1.copy_to_host()
print(c_host1)

s = timer()
cuda.synchronize()
fast_matmul[bpg, tpb](a_in, b_in, c_out2)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)

c_host2 = c_out2.copy_to_host()
print(c_host2)
The execution times of the above kernels are essentially the same; in fact, matmul was even faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.
Thanks,
Bruno.
I made a performance mistake in the code I put in that other answer. I've now fixed it. In a nutshell this line:
tmp = 0.
caused numba to create a 64-bit floating point variable tmp. That caused other arithmetic in the kernel to be promoted from 32-bit floating point to 64-bit floating point. That promotion is inconsistent with the rest of the arithmetic and also inconsistent with the intent of the demonstration in the other answer. This error affects both kernels.
When I change it in both kernels to
tmp = float32(0.)
both kernels get noticeably faster, and on my GTX960 GPU, your test case shows that the shared code runs about 2x faster than the non-shared code (but see below).
The non-shared kernel also has a performance issue related to memory access patterns. Similar to the indices swap in that other answer, for this particular scenario only, we can rectify this problem simply by reversing the assigned indices:
j, i = cuda.grid(2)
in the non-shared kernel. This allows that kernel to perform approximately as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change, the non-shared kernel's performance is much worse.
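Putting both changes together, the adjusted non-shared kernel might look like the following sketch (this is simply the matmul kernel from the question with the two fixes described above applied; nothing else is changed):

from numba import cuda, float32

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B (naive, non-shared)."""
    j, i = cuda.grid(2)           # reversed assignment, for a better memory access pattern
    if i < C.shape[0] and j < C.shape[1]:
        tmp = float32(0.)         # 32-bit accumulator, avoids promotion to float64
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp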
Related
Lately I've been trying to get into GPU programming in Python using the Numba library. I have been reading up on it on their website using the tutorial there, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing, I noticed that matrices whose dimensions are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work).
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

#%%

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.
There are arguably at least two errors in that posted code:
This can't possibly be a correct range check:
if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid should not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.
It is illegal to use cuda.syncthreads() in conditional code if all the threads in the block cannot participate in that statement. The return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes not whole-number divisible by the threadblock size.
Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty + i * TPB) < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx + i * TPB) < B.shape[0]:
            sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp

#%%

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note that the use of and here in the bounds tests is correct. Testing whether a set of indices is in-bounds is different, in a boolean sense, from testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index being out-of-bounds is disqualifying.)
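As a small plain-Python illustration of that boolean point (made-up shapes and index ranges, not part of the kernel):

rows, cols = 4, 4
for x in range(6):
    for y in range(6):
        in_bounds = (x < rows) and (y < cols)       # both indices must be in-bounds
        out_of_bounds = (x >= rows) or (y >= cols)  # either index out-of-bounds disqualifies
        assert in_bounds == (not out_of_bounds)     # De Morgan's law relates the two tests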
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

#%%

x_h = np.arange(115).reshape([5, 23])
y_h = np.ones([23, 7])
z_h = np.zeros([5, 7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

# TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0], y_h.shape[0])
grid_x_max = max(x_h.shape[1], y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x and y (and usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
Again, no claims of being defect-free. Furthermore, I'm sure "more optimal" code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
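For example, a minimal sketch (assuming cupy is installed; the sizes mirror the test case in the question) in which the whole multiplication collapses to a single library call:

import cupy as cp

size = 1024 * 4
a = cp.arange(size * size, dtype=cp.float32).reshape(size, size)
b = cp.ones((size, size), dtype=cp.float32)
c = a @ b                          # dispatched to a tuned GPU GEMM routine
cp.cuda.Stream.null.synchronize()  # wait for the GPU before timing or printing
print(c[:2, :2])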
EDIT: As discussed here, OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit floating point variable makes an important performance difference.
I'm trying to use cupy with a GPU (NVIDIA K6000 with 2880 cores) to accelerate some for-loops over several 3-dimensional arrays in Python, as shown by the example code below. I define a function that takes the 3D arrays as input arguments, performs the for-loops, and outputs an array. I run two cases and compare the execution times: (1) a jit-compiled function containing the for-loop processing two numpy arrays (A_cpu, B_cpu) in parallel on 4 CPU cores, and (2) the for-loop called on two cupy arrays (A_gpu, B_gpu) on the 2880 GPU cores. I get the same values from each case; however, the cupy GPU execution is significantly slower than the numpy CPU execution, as shown in the results below. I am not able to use array broadcasting (A[i, :, :]) to eliminate any of the {i, j, k} for-loops, because array indexing is not uniform across operations.
I should add that the math operations in the actual code are much more complex and involve around 12 different 3D arrays; however, for the sake of brevity here I just showed the array-indexing interdependence. Also, running the actual code with @jit(parallel=True) significantly accelerates the code on a 4-core CPU compared to non-parallel jit, which is why I suspect it should be amenable to even more massive parallelization on the 2880-core GPU.
The example Python code is below:
import numpy as np
import cupy as cp
from time import time
from numba import cuda, jit, prange

M = 100
N = 100
P = 10

A_cpu = np.random.rand(M, N, P)
B_cpu = np.random.rand(M, N, P)
A_gpu = cp.asarray(A_cpu)
B_gpu = cp.asarray(B_cpu)

print('Being CPU test:')

@jit(parallel=True)
def cpu_loop(A_cpu, B_cpu):
    for i in prange(1, M-1):
        for j in prange(1, N-1):
            for k in prange(1, P-1):
                A_cpu[i, j, k] = (A_cpu[i-1, j-1, k+1]*2 - A_cpu[i, j+1, k-1]/2) / (B_cpu[i+1, j+1, k-1]*2 - B_cpu[i, j-1, k+1]/2)
    return A_cpu

start = time()
cpu_res = cpu_loop(A_cpu, B_cpu)
end = time()
cpu_total_time = end - start

print('CPU result = ', cpu_res[M//2, N//2, P//2])
print('CPU total_time = {:2.3f} seconds'.format(cpu_total_time))

print('Begin GPU test:')

def gpu_loop(A_gpu, B_gpu):
    for i in prange(1, M-1):
        for j in prange(1, N-1):
            for k in prange(1, P-1):
                A_gpu[i, j, k] = (A_gpu[i-1, j-1, k+1]*2 - A_gpu[i, j+1, k-1]/2) / (B_gpu[i+1, j+1, k-1]*2 - B_gpu[i, j-1, k+1]/2)
    return A_gpu

start = time()
gpu_res = gpu_loop(A_gpu, B_gpu)
end = time()
gpu_total_time = end - start

print('GPU result = ', gpu_res[M//2, N//2, P//2])
print('GPU total_time = {:2.3f} seconds'.format(gpu_total_time))
The output of the above code is below:
Being CPU test:
CPU result = 208.3083524104198
CPU total_time = 2.117 seconds
Begin GPU test:
GPU result = 208.3083524104198
GPU total_time = 33.462 seconds
Any insights or advice is greatly appreciated in advance!
I'm currently working on a project aimed at finding blurred regions using the Walsh-Hadamard transform. The basic idea is to extract a local patch pixel-wise and apply the Walsh-Hadamard transform to that local patch. To do the Walsh-Hadamard transform, I generate the Hadamard matrix H beforehand and compute H×T(local_patch)×H_transpose. This operation costs 5 ms per pixel, which is time consuming. I'm wondering whether there is some technique to speed up the matrix multiplication process in numpy/Python, or some other fast Walsh-Hadamard transform technique to replace the H×T×H' computation. Any help would be appreciated.
for i in range(h):
    for j in range(w):
        local_patch_gray = gray_pad[i:i+patch_size, j:j+patch_size]
        local_patch_gray = local_patch_gray[1:, 1:]  # extract 2^n×2^n part
        local_patch_blur = blur_pad[i:i+patch_size, j:j+patch_size]
        local_patch_blur = local_patch_blur[1:, 1:]
        patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
        blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
        num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
        denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
        if denomi == 0:
            blur_map[i, j] = 0
            continue
        blur_map[i, j] = num / denomi
It sounds like this is a job for Numba; check out their 5-minute starting guide.
In short, Numba compiles a function to machine code on its first call, so that every subsequent call of the same function runs at compiled speed. Numba also has options which can make those calls even faster. The options that will pertain to your example are likely fastmath and parallel.
As a starting point, here's what your new numba function might look like:
from numba import njit
import numpy as np

# The surrounding variables (gray_pad, blur_pad, H, p, patch_size, h, w, blur_map)
# are assumed to be defined as in the question; H and the patches should be float arrays.
@njit(fastmath=True, parallel=True)
def lightning_fast_numba_function(i, j, gray_pad, blur_pad, H, p, patch_size):
    # Extract the local patches and keep the 2^n×2^n part (contiguous copies for np.dot)
    local_patch_gray = np.ascontiguousarray(gray_pad[i:i+patch_size, j:j+patch_size][1:, 1:])
    local_patch_blur = np.ascontiguousarray(blur_pad[i:i+patch_size, j:j+patch_size][1:, 1:])
    # Walsh-Hadamard transform of each patch
    patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
    blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
    # p-norm ratio; return 0 when the denominator vanishes
    num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
    denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
    if denomi == 0:
        return 0.0
    return num / denomi

for i in range(h):
    for j in range(w):
        blur_map[i, j] = lightning_fast_numba_function(i, j, gray_pad, blur_pad, H, p, patch_size)
Another option you may consider is using np.nditer instead of range. But don't hesitate to cross-check options against NumPy's iteration docs.
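For reference, a minimal np.nditer loop that recovers the (i, j) indices might look like this (a sketch over a small dummy array, not your actual data):

import numpy as np

blur_map = np.zeros((3, 4))
it = np.nditer(blur_map, flags=['multi_index'])
for _ in it:
    i, j = it.multi_index  # row/column of the current element
    # ... extract the (i, j) patch and fill blur_map[i, j] here ...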
Lastly, I noticed that the Wikipedia article for your algorithm has a section on a fast version, with Python code. You might find it useful.
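For completeness, the classic in-place butterfly form of the fast Walsh-Hadamard transform looks roughly like this (a sketch for 1-D inputs whose length is a power of two; applying it along the rows and then the columns of a patch can replace the two matrix products, assuming H is the unnormalized, natural-ordered Hadamard matrix):

import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform of a 1-D array whose length is a power of two."""
    a = np.asarray(a, dtype=float).copy()
    n = len(a)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j] = x + y      # butterfly: sum
                a[j + h] = x - y  # butterfly: difference
        h *= 2
    return a

print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # transform of a length-8 example input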
My ultimate goal is to accelerate the computation of a matrix-vector product in Python, potentially by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~ 0.05), and the vector x is 15k elements and dense, and I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.
My current non-GPU “optimization” is to represent A as a scipy.sparse.csc_matrix object and then simply compute A.dot(x), but I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e. not writing out the detailed kernel functions by hand). I’ve succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, or scikit-cuda, or anaconda’s accelerate package, but there’s not a ton of information so it’s hard to know where to begin.
I don’t need greatly detailed instructions, but if anyone has solved this before and could provide a “big picture” roadmap for the simplest way of doing this, or has an idea of the sort of speed up a sparse GPU-based matrix-vector product would have over scipy’s sparse algorithms, that would be very helpful.
Another alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is nice), and (for me at least) it turned out to be much easier to install than pycuda.
The new code would look something like this:
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu

A = some_sparse_matrix  # (scipy.sparse.csr_matrix)
x = some_dense_vector   # (numpy.ndarray)

A_gpu = csr_gpu(A)   # moving A to the gpu
x_gpu = cp.array(x)  # moving x to the gpu

for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu)  # back to numpy object for fast indexing
UPDATE: CuPy now supports AMD GPUs as well through their ROCm stack, so that's an added bonus.
As pointed out in comments, NVIDIA ship the cuSPARSE library which includes functions for sparse matrix products with dense vectors.
Numba now has Python bindings for the cuSparse library via the pyculib package.
Thanks for the suggestions.
I managed to get pyculib’s csrmm (matrix multiplication for compressed sparse row formatted matrices) operation to work using the following (using 2 NVIDIA K80 GPUs on Google Cloud Platform), but unfortunately wasn’t able to achieve a speedup.
I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn’t figure out any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:
import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time

def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = b.shape[0]

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic

# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(np.float32)

sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse
c, t = spmv_cuda(a_sparse, b, sp, COUNT)
print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))
which yields the output:
Constructing objects...
scipy sparse matrix multiplication took 0.05158638954162598 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Testing pyculib sparse matrix multiplication...
pyculib sparse matrix multiplication took 0.12598299980163574 seconds
c = [ 122.29483032 127.83659363 128.75003052 130.6912384 124.98326111]
As you can see, pyculib is more than twice as slow, even though the matrix multiplication is on the GPU. Again, this is probably because of the overhead involved in transferring data to/from the GPU at each iteration.
An alternative solution I found, however, was to use Andreas Kloeckner’s pycuda library, which yielded a 50x speed up!
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time

def spmv_cuda(a_sparse, b, count):

    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')
    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)
    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')
    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()

    return d_c.get(), toc - tic

# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()

print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))
which yields this output:
numpy dense matrix multiplication took 0.2290663719177246 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
scipy sparse matrix multiplication took 0.24468040466308594 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
moving objects to GPU...
executing spmv operation...
pycuda sparse matrix multiplication took 0.004545450210571289 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Note 1: pycuda requires the following dependencies:
pymetis: install using: pip install pymetis
nvcc: install using: sudo apt install nvidia-cuda-toolkit
Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you’ll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.
Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out-of-the-box.
After installing CUDA and tensorflow-gpu (a couple of involved but straightforward tutorials are here and here), you can use tensorflow's SparseTensor class and sparse_tensor_dense_matmul function as follows:
import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time
Make sure GPU is detected:
gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n {}'.format(gpus))
Output:
GPU DEVICES:
['/device:GPU:0']
Benchmarks:
from scipy.sparse import csr_matrix
ITERS = 30
N = 20000
P = 0.1 # matrix density
Using scipy:
np.random.seed(0)

a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N)

tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()
elapsed = toc - tic
print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
Scipy spmv product took 0.06693172454833984 seconds per iteration.
Using GPU-enabled tensorflow:
with tf.device('/device:GPU:0'):
    np.random.seed(0)

    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0
    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape
    a_sparse = tf.SparseTensor(indices, values, dense_shape)
    b = tf.constant(np.random.rand(N, 1))

    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()
    elapsed = toc - tic
    print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
GPU spmv product took 0.0011811971664428711 seconds per iteration.
Quite a nice speed-up, it turns out.
I have a piece of code that uses NumbaPro to write a simple kernel that squares the contents of two arrays of size 41724, adds them together, and stores the result in another array. All the arrays have the same size and are float32. The code is below:
import numpy as np
from numba import *
from numbapro import cuda

@cuda.jit('void(float32[:],float32[:],float32[:])')
def square_add(a, b, c):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x

    i = tx + bx * bw

    # Since the length of a is 41724 and the total
    # threads is 41*1024 = 41984, this check is necessary
    if (i > len(a)):
        return
    else:
        c[i] = a[i]*a[i] + b[i]*b[i]

a = np.array(range(0, 41724), dtype=np.float32)
b = np.array(range(41724, 83448), dtype=np.float32)
c = np.zeros(shape=(1, 41724), dtype=np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c, copy=False)

# Launch the kernel; Gridsize = (1,41), Blocksize = (1,1024)
square_add[(1, 41), (1, 1024)](d_a, d_b, d_c)

c = d_c.copy_to_host()
print c
print len(c[0])
The values I get when I print the result of the operation (the array c) are completely different from what I get when I do the exact same thing in a Python terminal.
I do not know what I am doing wrong here.
There are two problems here.
The first is that you are specifying a block and grid dimension for your CUDA kernel launch which is incompatible with the indexing scheme you have chosen to use in the kernel.
This:
square_add[(1,41),(1,1024)](d_a,d_b,d_c)
launches a two-dimensional grid where all the threads have the same block and thread index in x, and vary only in y. This implies that
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
i = tx + bx * bw
will yield i=0 for every thread. If you change the kernel launch to this:
square_add[(41,1),(1024,1)](d_a,d_b,d_c)
you will find that the indexing will work correctly.
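As an aside (not required for the fix), with the corrected launch configuration the same index can be obtained with Numba's built-in helper, so the arithmetic need not be written by hand:

i = cuda.grid(1)  # equivalent to cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x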
The second is that c has been declared as a two-dimensional array, but the kernel function signature declares it as a one-dimensional array. Under some circumstances, the numbapro runtime should detect this and raise an error.
I was able to get your example to work correctly like this:
import numpy as np
from numba import *
from numbapro import cuda
#cuda.jit('void(float32[:],float32[:],float32[:,:])')
def square_add(a,b,c):
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
i = tx + bx * bw
if (i<len(a)):
c[0,i] = a[i]*a[i] + b[i]*b[i]
a = np.array(range(0,41724),dtype=np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c, copy=False)
square_add[(41,1),(1024,1)](d_a,d_b,d_c)
c = d_c.copy_to_host()
print(c)
print(c.shape)
[Note I am using Python 3, so this uses new style print statements]
$ ipython numbatest.py
numbapro:1: ImportWarning: The numbapro package is deprecated in favour of the accelerate package. Please update your code to use equivalent functions from accelerate.
[[ 1.74089216e+09 1.74097562e+09 1.74105907e+09 ..., 8.70371021e+09
8.70396006e+09 8.70421094e+09]]
(1, 41724)