I'm trying to use NumbaPro to write a simple matrix-vector multiplication, shown below:
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0
    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()
e = time()
tcuda = e - s
but I'm getting the following error:
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H
I don't understand why the device-to-host copy is failing. Please help.
Your code has multiple problems.
The B and C vectors are Nx1 2D matrices, not 1D vectors, but the type signature of your kernel lists them as "float32[:]", i.e. 1D vectors. The kernel also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
Your kernel assumes a 2D grid but only uses one column of it, meaning many threads do the same computation and overwrite each other.
There is no execution configuration given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!).
Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is best suited to the case where the number of rows in the matrix is large. A kernel optimized for a short and wide matrix would need another approach (parallel reductions). In practice, though, I would use CUBLAS sgemv (which is also exposed in NumbaPro) instead.
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000
n = 100

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    # One thread per matrix row
    row = cuda.grid(1)
    if row < m:
        total = 0.0
        for i in range(n):
            total += A[row, i] * b[i]
        c[row] = total

A = np.array(np.random.random((m, n)), dtype=np.float32)
# b needs one entry per column of A, c one entry per row
B = np.array(np.random.random(n), dtype=np.float32)
C = np.empty(m, dtype=np.float32)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)

# Launch ceil(m / 512) blocks of 512 threads each
cu_matrix_vector[(m + 511) // 512, 512](dA, dB, dC)

dC.to_host()
print(C)
e = time()
tcuda = e - s
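One way to sanity-check a kernel like the one above is a CPU reference in plain NumPy. The sketch below is not NumbaPro code: `blocks_per_grid` and `matvec_rowwise` are hypothetical helper names, and the outer loop only mimics the one-thread-per-row layout so that the result can be compared against `A.dot(b)`.

```python
import numpy as np

def blocks_per_grid(n_rows, threads_per_block=512):
    """Ceiling division: enough blocks to cover every row."""
    return (n_rows + threads_per_block - 1) // threads_per_block

def matvec_rowwise(A, b):
    """CPU emulation of the kernel: each 'thread' owns one output row."""
    m, n = A.shape
    c = np.zeros(m, dtype=A.dtype)
    for row in range(m):          # on the GPU, each row would be its own thread
        acc = 0.0
        for i in range(n):        # loop over the columns of A
            acc += A[row, i] * b[i]
        c[row] = acc
    return c

rng = np.random.default_rng(0)
A = rng.random((300, 100)).astype(np.float32)
b = rng.random(100).astype(np.float32)   # one entry per column of A

c = matvec_rowwise(A, b)
print(blocks_per_grid(100000))           # number of 512-thread blocks for m = 100000
print(np.allclose(c, A.dot(b), rtol=1e-4))
```

The same comparison against `A.dot(B)` can be applied to the array copied back from the device, assuming the host copy is available as a NumPy array.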
Is it possible to accelerate the following nested loop in Python (possibly with CUDA or parallel process)? The order of the elements that are appended to outputList[i] doesn't matter.
Currently, the code takes a long time to complete. Which part is slowing down the code? Is it the append() or the calculation f = c*sin(i*pi/180) + (1/c)*cos(i*pi/180)?
from math import sqrt, sin, cos, pi

N = 640*480
outputList = [[] for i in range(N)]

def foo(a, b):  # a, b are always integers
    c = sqrt(a**2 + b**2)
    for deg in range(360):
        f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)
        if f < 1:
            outputList[a].append(f)

if __name__ == '__main__':
    for x in range(N):
        for y in range(60):
            foo(x, y)
As talonmies mentioned, your first step should be using numpy.
import math
import numpy as np

def foo(a, b):
    c = math.hypot(a, b)
    degs = np.arange(360, dtype='f8')
    rads = np.deg2rad(degs)
    fs = c * np.sin(rads) + (1/c)*np.cos(rads)
    below1 = fs[fs < 1.]
    outputList[a].extend(below1)
There are fancier methods to do the sin + cos computation, but I've kept it simple for now. Also, depending on your use case, change outputList to a 2D numpy array or a list of 1D numpy arrays.
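Going one step further, the inner loop over `b` can also be folded into the array expression with broadcasting. This is only a sketch under a few assumptions: `foo_row` is a hypothetical name, it returns the values for one `a` instead of appending to `outputList`, and it skips the pair `a = b = 0`, where `c` is zero and `1/c` is undefined.

```python
import numpy as np

RADS = np.arange(360) * np.pi / 180      # same angle grid as the loop version

def foo_row(a, n_b=60):
    """Return every f < 1 for one `a`, covering b = 0 .. n_b - 1 at once."""
    b = np.arange(n_b)
    c = np.hypot(a, b)[:, None]          # shape (n_b, 1), broadcasts over angles
    c = c[c[:, 0] > 0]                   # drop rows with c == 0 (only a = b = 0)
    fs = c * np.sin(RADS) + (1 / c) * np.cos(RADS)   # shape (rows kept, 360)
    return fs[fs < 1.0]                  # flattened, b-major then angle order

values = foo_row(3)
print(values.shape, bool(values.max() < 1.0))
```

Each call now does 60 x 360 evaluations in one NumPy expression, so the Python-level loop only runs over `a`.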
I'm working on an algorithm (for some sort of PCA) and I test its performance according to a metric. It is evaluated by Monte Carlo i.e. averaging over several samples for a choice of parameter.
Initially, my code consisted of simple loops over parameters and samples, with NumPy functions for the algorithm.
I heard about the Numba package as a way to speed up my code. I found that I get a 5x speed improvement just by rewriting the NumPy part to fit Numba's standards and adding the @njit decorator.
However, I tested the same improved version on another PC (both are laptops with Intel i7's) and found out that I get no speed improvement on this second PC.
I am dumbfounded about why this could be the case.
edit:
More details about the two PCs.
PC 1:
Python 3.7.6
Numba 0.48.0
Intel Core i7-8565U CPU @ 1.80GHz (4 cores)
RAM 8 GB
PC 2:
Python 3.7.4
Numba 0.49.0
Intel Core i7-8750H CPU @ 2.20GHz (6 cores)
RAM 16.0 GB
And here is a minimal reproducible example.
import numpy as np
from numba import jit
import matplotlib.pyplot as plt
import time

@jit(nopython=True)
def multivariate_normal(m,C):
    L = np.linalg.cholesky(C)
    X = np.random.randn(m.shape[0])
    return(L @ X + m)

@jit(nopython=True)
def gendata(p, d, k, n, sig_s):
    H = np.random.randn(d,k)
    u, s, vh = np.linalg.svd(H, full_matrices=False)
    H = u @ vh
    rest = np.zeros((p-d, k))
    U0 = np.concatenate((H,rest),axis=0)
    X = np.empty((p,n))
    for i in range(n):
        b = multivariate_normal(np.zeros(p),np.identity(p))
        s = multivariate_normal(np.zeros(k),(sig_s**2) * np.identity(k))
        x = U0 @ s + b
        X[:,i] = x
    return((X,U0))

@jit(nopython=True)
def geomspace(a,b,n):
    seq = np.linspace(np.log(a),np.log(b),n)
    return(np.exp(seq))

@jit(nopython=True)
def Afe(U,U0):
    '''
    Average Fraction of Energy of subspace U based on original subspace U0.
    '''
    Oorth,r = np.linalg.qr(U)
    afe = np.trace(Oorth.T @ U0 @ U0.T @ Oorth) / np.trace(U0 @ U0.T)
    return(afe)

@jit(nopython=True)
def sim2a():
    p = 50
    d = int(0.99 * p)
    k = 5
    sig_s = np.sqrt(10)
    ns = geomspace(3*k,3*p,10).astype(np.int32)
    echs = 50
    mc_afe_pca = np.empty(ns.shape[0])
    mc_afe_sphpca = np.empty(ns.shape[0])
    for n in range(ns.shape[0]):
        res_pca = np.empty(echs)
        res_sphpca = np.empty(echs)
        for e in range(echs):
            #print(n,e)
            X,U0 = gendata(p,d,k,ns[n],sig_s)
            sX = np.diag(X.T @ X)
            #PCA
            u, sig, vh = np.linalg.svd(X.T, full_matrices=False)
            Upca = vh.T[:,:k]
            res_pca[e]= Afe(Upca,U0)
            #Spherical PCA
            Xn = X / np.sqrt(sX)
            u, sig, vh = np.linalg.svd(Xn.T, full_matrices=False)
            Usph = vh.T[:,:k]
            res_sphpca[e] = Afe(Usph,U0)
        mc_afe_pca[n] = np.mean(res_pca)
        mc_afe_sphpca[n] = np.mean(res_sphpca)
    return((mc_afe_pca,mc_afe_sphpca))

#first run for compilation before measurements
mc_afe_pca,mc_afe_sphpca = sim2a()

#Graph
p = 50
k = 5
ns = geomspace(3*k,3*p,10).astype(np.int32)
plt.figure(figsize=(8,4))
plt.plot(ns,mc_afe_pca,'+-',label="PCA")
plt.plot(ns,mc_afe_sphpca,'x-',label="Spherical PCA")
plt.xlabel("n")
plt.ylabel("AFE")
plt.legend()
plt.plot()

#Measure
def sim2atimer():
    start = time.time()
    a,b= sim2a()
    end = time.time()
    print(end - start)

sim2atimer()

# Now, replace @jit by #@jit and re-execute the code to compare.
# The option parallel=True was removed from @jit()
In fact, while working on this example, I had the idea of removing the parallel=True option from the @jit decorator (the option had been triggering a warning that it was not being used), and that fixed the discrepancy.
I still don't know why only one PC was affected, though.
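When comparing two machines like this, it can help to time the function the same way on both: one warm-up call so that compilation is excluded, then several timed runs. The helper below is a sketch of that idea using only the standard library; `benchmark` and `workload` are hypothetical names, and `workload` merely stands in for a compiled function such as `sim2a`.

```python
import time

def benchmark(fn, repeats=5):
    """Call fn once as a warm-up, then return the best of `repeats` timings (s)."""
    fn()                                  # warm-up: any JIT compilation happens here
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def workload():
    # Stand-in for sim2a(); any zero-argument callable works.
    return sum(i * i for i in range(10_000))

print(f"best of 5: {benchmark(workload):.6f} s")
```

Running it once with the decorators enabled and once with them commented out gives two comparable numbers per machine.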
I am trying to invert a banded sparse matrix (that is, solve a linear system with it) as efficiently as possible so that I can incorporate this into my real-time system. I am generating sparse banded matrices that represent a convolution operation. Currently, I am using spsolve from the scipy.sparse.linalg library. I found that solve_banded from the scipy.linalg library should be a better fit. However, solve_banded requires (l, u), the number of non-zero lower and upper diagonals, and ab, the banded matrix stored as an array of shape (l + u + 1, M). I am not sure how to convert my code so that I can use solve_banded. Any help in this regard is highly appreciated.
import numpy as np
from scipy import linalg
import math
import time
from scipy.sparse import spdiags
from scipy.sparse.linalg import spsolve

def ABC(deg, fc, N):
    r"""Generate sparse-banded matrices
    """
    omc = 2*math.pi*fc
    t = ((1-math.cos(omc))/(1+math.cos(omc)))**deg
    p = 1
    for k in np.arange(deg):
        p = np.convolve(p,np.array([-1,1]),'full')
    P = spdiags(np.kron(p,np.ones((N,1))).T, np.arange(deg+1), N-deg, N)
    B = P.T.dot(P)
    q = np.sqrt(t)
    for k in np.arange(deg):
        q = np.convolve(q,np.array([1,1]),'full')
    Q = spdiags(np.kron(q,np.ones((N,1))).T, np.arange(deg+1), N-deg, N)
    C = Q.T.dot(Q)
    A = B + C
    return A,B,C

if __name__ == '__main__':
    mu = 0.1
    deg = 3
    wc = 0.1
    for i in np.arange(1,7,1):
        # some dense random vector
        x = np.random.rand(10**i,1)
        # generate sparse banded matrices
        A,_,C = ABC(deg, wc, 10**i)
        # another banded matrix
        G = mu*A.dot(A.T) + C.dot(C.T)
        # SCIPY SPSOLVE
        st = time.time()
        y = spsolve(G,x)
        et = time.time()
        print("SCIPY SPSOLVE: N = ", 10**i, "Time taken: ", et-st)
Results
SCIPY SPSOLVE: N = 10 Time taken: 0.0
SCIPY SPSOLVE: N = 100 Time taken: 0.0
SCIPY SPSOLVE: N = 1000 Time taken: 0.015689611434936523
SCIPY SPSOLVE: N = 10000 Time taken: 0.020943641662597656
SCIPY SPSOLVE: N = 100000 Time taken: 0.16722917556762695
SCIPY SPSOLVE: N = 1000000 Time taken: 1.7254831790924072
Solved it using solveh_banded from the scipy.linalg library. It is a very fast way to solve extremely large sparse banded systems when the matrix is symmetric and positive definite.
from scipy.linalg import solveh_banded

def sp_inv(A, x):
    A = A.toarray()
    N = np.shape(A)[0]
    D = np.count_nonzero(A[0,:])
    ab = np.zeros((D,N))
    for i in np.arange(1,D):
        ab[i,:] = np.concatenate((np.diag(A,k=i),np.zeros(i,)),axis=None)
    ab[0,:] = np.diag(A,k=0)
    y = solveh_banded(ab,x,lower=True)
    return y
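One thing to watch in the function above is `A.toarray()`, which allocates a dense N x N array; for N = 10**6 in float64 that is roughly 8 TB. A possible alternative, assuming `A` is a symmetric SciPy sparse matrix whose half-bandwidth is known, is to read the diagonals directly with `A.diagonal(k)`. The helper name `to_lower_banded` and the small tridiagonal test matrix are illustrative only, not part of the original code.

```python
import numpy as np
from scipy.linalg import solveh_banded
from scipy.sparse import diags

def to_lower_banded(A, bandwidth):
    """Pack the main and lower diagonals of sparse A into solveh_banded's
    lower form, where ab[k, j] == A[j + k, j]."""
    n = A.shape[0]
    ab = np.zeros((bandwidth + 1, n))
    for k in range(bandwidth + 1):
        ab[k, :n - k] = A.diagonal(-k)   # the k-th subdiagonal has n - k entries
    return ab

# Small symmetric positive definite test matrix: tridiagonal [-1, 2, -1].
n = 8
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
x = np.arange(1.0, n + 1)

y = solveh_banded(to_lower_banded(A, bandwidth=1), x, lower=True)
print(np.allclose(A @ y, x))
```

Memory use is then proportional to (bandwidth + 1) x N rather than N x N.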
Is there any way to multiply a 2D sparse matrix by a 3D numpy array please?
For example I have this function
def myFun(x, p):
    r = 2
    out = x * np.log(p) + r * np.log(1-p)
    return out
where x is an array of dimension 3500, 90 and p another array with dimensions 3500, 90, 70. At the moment both x and p are dense arrays and I am just broadcasting when I call the function:
out = myFun(x[..., None], p)
However, array x is quite sparse: only 7% of its elements are non-zero. On the other hand, p doesn't have any zeros, only floats between zero and one.
I am hoping that with a sparse matrix (probably from scipy.sparse) I will see a speed improvement. However, I do not know how to do this operation or whether it would be more efficient.
I am using python 3.
Many thanks
You can exploit the sparseness of x using the where keyword.
def sprse(x, p):
    r = 2
    out = x * np.log(p, where=x.astype(bool)) + r * np.log(1-p)
    return out

from timeit import timeit

x = np.random.uniform(-13, 1, (3500, 90, 1)).clip(0, None)
p = np.random.random((3500, 90, 70))

assert np.all(sprse(x, p)==myFun(x, p))

def f():
    return myFun(x, p)

print(timeit(f, number=3))

def f():
    return sprse(x, p)

print(timeit(f, number=3))
Sample run:
5.171174691990018
3.2122434769989923
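One caveat with `where=`: when no `out=` array is passed, NumPy leaves the masked-out entries of the result uninitialized, so `x * np.log(p, where=...)` multiplies zero by whatever happens to be in memory, which can yield NaN if that value is inf or NaN. A more cautious variant pre-fills the output with zeros. The name `sprse_safe` is hypothetical, and the arrays are scaled down from the question's shapes to keep the example quick.

```python
import numpy as np

def sprse_safe(x, p, r=2.0):
    """Same formula as myFun, but log(p) is only evaluated where x != 0."""
    logp = np.zeros_like(p)                 # masked-out entries stay exactly 0
    np.log(p, out=logp, where=(x != 0))     # mask broadcasts from (..., 1) to p.shape
    return x * logp + r * np.log(1 - p)

rng = np.random.default_rng(0)
x = rng.uniform(-13, 1, (35, 9, 1)).clip(0, None)   # mostly zeros, as in the question
p = rng.uniform(0.01, 0.99, (35, 9, 70))

dense = x * np.log(p) + 2.0 * np.log(1 - p)
print(np.allclose(sprse_safe(x, p), dense))
```

The extra `np.zeros_like` costs one allocation, so the timing should be measured again before preferring this form.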
You can try the following implementation. For such a simple function this may look like overkill, but I had trouble getting numexpr to work with Intel SVML (otherwise I would prefer numexpr). This solution should take about 0.07 s per call on a quad-core i7 and should scale quite well to more cores. Please also note that the first call has a compilation overhead of about 0.5 s.
Installing Intel SVML
import numpy as np
import numba as nb

x = np.random.uniform(-13, 1, (3500, 90, 1)).clip(0, None)
p = np.random.random((3500, 90, 70))

@nb.njit(parallel=True,fastmath=True)
def nb_myFun_sp(x, p):
    out=np.empty(p.shape,p.dtype)
    r = 2.
    for i in nb.prange(p.shape[0]):
        for j in range(p.shape[1]):
            if x[i,j,0]!=0.:
                x_=x[i,j,0]
                for k in range(p.shape[2]):
                    out[i,j,k] = x_ * np.log(p[i,j,k]) + r * np.log(1.-p[i,j,k])
            else:
                for k in range(p.shape[2]):
                    out[i,j,k] = r * np.log(1.-p[i,j,k])
    return out

@nb.njit(parallel=True,fastmath=True)
def nb_myFun(x, p):
    out=np.empty(p.shape,p.dtype)
    r = 2.
    for i in nb.prange(p.shape[0]):
        for j in range(p.shape[1]):
            x_=x[i,j,0]
            for k in range(p.shape[2]):
                out[i,j,k] = x_ * np.log(p[i,j,k]) + r * np.log(1.-p[i,j,k])
    return out
I am playing around with Numba to see how much faster I can make Python+NumPy code. My test function computes the pairwise Euclidean distances of n points in three-dimensional space. I am getting a speedup of 2 orders of magnitude with Numba. If I comment out the lines where I store the distances in an array (i.e. distance[i, j] = d and distance[j, i] = d), I get a speedup of 6 orders of magnitude with Numba. So basically, the computations are lightning fast but accessing the array which holds the results is slow. Is there a way to speed up array access?
NumPy and Numba functions
import numpy as np
from numba import jit, float64, void
def pairwise_distance_numpy(distance, point):
    numPoints = point.shape[0]
    for i in range(numPoints):
        for j in range(0, i):
            d = 0.0
            for k in range(3):
                tmp = point[i, k] - point[j, k]
                d += tmp*tmp
            d = d**0.5
            distance[i, j] = d
            distance[j, i] = d

pairwise_distance_numba = jit(void(float64[:,:], float64[:,:]), nopython=True)(pairwise_distance_numpy)
Benchmark script
import numpy as np
from time import time
from pairwise_distance import pairwise_distance_numpy as pd_numpy
from pairwise_distance import pairwise_distance_numba as pd_numba
n = 1000
point = np.random.rand(n, 3)
distance = np.empty([n, n], dtype=np.float64)
pd_numpy(distance, point)
t = time()
pd_numpy(distance, point)
dt_numpy = time() - t
print('Numpy elapsed time: ', dt_numpy)
pd_numba(distance, point)
t = time()
pd_numba(distance, point)
dt_numba = time() - t
print('Numba Elapsed time: ', dt_numba)
print('Numba speedup: ', dt_numpy/dt_numba)
It seems Numba simply optimized the calculations away, since the result is no longer stored anywhere (judging from your code and your comment confirming this).
Array access in NumPy should be pretty fast in most cases!
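For a benchmark to measure real work, the computed distances have to end up somewhere observable. As a point of comparison for the loop version, here is a sketch of a vectorized NumPy baseline using broadcasting; `pairwise_distance_broadcast` is a hypothetical name, and unlike the original function it returns the matrix (including a zero diagonal) instead of filling a preallocated array.

```python
import numpy as np

def pairwise_distance_broadcast(point):
    """Return the full (n, n) Euclidean distance matrix for (n, 3) points."""
    diff = point[:, None, :] - point[None, :, :]          # shape (n, n, 3)
    return np.sqrt(np.einsum("ijk,ijk->ij", diff, diff))  # sum of squares over k

rng = np.random.default_rng(0)
point = rng.random((200, 3))
distance = pairwise_distance_broadcast(point)

# Consuming the result (here via a checksum) keeps the work observable.
print(distance.shape, float(distance.sum()))
```

This trades memory for speed: the intermediate `diff` array holds n x n x 3 values, which matters for large n.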