Hi I'm trying to understand each step of cuda kernel. It will by nice to get all grid indexes that are occupy by data. My code is to add 2 vectors and is written in python numba.
n = 10
x = np.arange(n).astype(np.float32)
y = x + 1
setup number of threads and blocks in grid
threads_per_block = 8
blocks_per_grid = 2
Kernel
def kernel_manual_add(x, y, out):
threads_number = cuda.blockDim.x
block_number = cuda.gridDim.x
thread_index = cuda.threadIdx.x
block_index = cuda.blockIdx.x
grid_index = thread_index + block_index * threads_number
threads_range = threads_number * block_number
for i in range(grid_index, x.shape[0], threads_range):
out[i] = x[i] + y[i]
Initialize kernel:
kernel_manual_add[blocks_per_grid, threads_per_block](x, y, out)
When i try to print out grid_index i get all input indexes 2*8.
How to get grid indexes (10 of them) that are used to compute data?
The canonical way to write your kernel would be something like this
#cuda.jit
def kernel_manual_add(x, y, out):
i = cuda.grid(1)
if i < x.shape[0]:
out[i] = x[i] + y[i]
You must run at least as many threads as there are elements in the input arrays. There is no magic here, you need to calculate grid and block dimensions manually before calling the kernel. See here and here for suggestions.
Related
I am trying to implement a numerical solver for a 1D Harmonic well's ground state using the Metropolis algorithm and the Feynman Path Integral technique in Python. When I run my program, I end with a distribution of the different points that my particle has gone to; this distribution ought to match up with that of a particle trapped in a 1D harmonic well. It does not. I have gone through and rewritten my code; I have checked it to similar code used for the same purpose; it all looks like it should work, yet it doesn't.
In blue is the histogram of my results, with Density set to True; the orange line is the function describing the expected distribution
As can be seen in the image, what I have ended up with is a distribution that isn't dissimilar to what I was expecting, but it isn't the correct distribution. The code I used (see below), is based on Lepage (2005) work on the same topic, although I used a slightly different formula to describe the same physical system.
import numpy as np
import random
import matplotlib.pyplot as plt
time = 4 #time over which we evolve our function
steps = 7 #number of steps we take
epsilon = 3 #the pos & neg bounds of our rand variable
N_cor = 100 #the number of times we need to thermalise our function before we take a path
N_cf = 20000 #the number of paths we take
def S(x, j, t, s): #the action of our potential well
e = t / s
return (1/(2*e))*(x[j] - x[j - 1])**2 + ((x[j] + x[j-1])/2)**2/2
def update(x, t, s, eps):
for j in range(0, s):
old_x = x[j] #old x value
old_Sj = S(x, j, t, s) #original action value
x[j] = x[j] + random.uniform(-eps,eps) #new x value
dS = S(x, j, t, s) - old_Sj #change in action
if dS > 0 and np.exp(-dS) < random.uniform(0,1): #check for Metropolis alg
x[j] = old_x
return x
def gamma(t, s, eps, thermal_num, num_paths):
zeros = np.zeros(s) #our initial path with s steps
gamma_arr = np.empty(0) #our initial empty result
for i in range(0, 10*thermal_num): #thermalisation
zeros = update(zeros, t, s, eps)
for j in range(0, num_paths):
for i in range(0, thermal_num): #thermalising again
zeros = update(zeros, t, s, eps)
gamma_arr = np.append(gamma_arr, zeros) #add new path post thermalising
#print(zeros)
#print(gamma_arr)
return gamma_arr
test = gamma(time, steps, epsilon, N_cor, N_cf)
x = np.arange(-4, 4, 0.1)
y = 1/np.sqrt(np.pi)*np.exp(-(x**2)) #expected result
plt.hist(test, bins= 500, density = True)
plt.plot(x, y)
plt.show()
Consider the following kernel, which counts the number of elements in x which are less than or equal to the corresponding element in y.
#cuda.jit
def count_leq(x, y, out):
i = cuda.grid(1)
shared = cuda.shared.array(1, dtype=DTYPE)
if i < len(x):
shared[0] += x[i] <= y[i]
cuda.syncthreads()
out[0] = shared[0]
However, the increments from each thread are not being saved properly in the shared array.
a = cuda.to_device(np.arange(5)) # [0 1 2 3 4]
b = cuda.to_device(np.arange(5)) # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1)) # [0]
count_leq[1,len(a)](a, b, out)
print(out[0]) # 1.0, but should be 5.0
What am I doing wrong here? I'm confused because cuda.shared.array is shared by all threads in a given block, right? How do I accumulate the increments using the same 1-element array?
I also tried the following, which failed with the same behavior as the above version.
#cuda.jit
def count_leq(x, y, out):
i = cuda.grid(1)
if i < len(x):
out[0] += x[i] <= y[i]
You need to perform an atomic add operation explicitly:
#cuda.jit
def count_leq(x, y, out):
i = cuda.grid(1)
if i < len(x):
cuda.atomic.add(out, 0, x[i] <= y[i])
Atomic adds are optimized on relatively new devices using for example an hardware warp reduction, but the operation tends not to scale when a large number of streaming-multiprocessors perform an atomic operations.
One solution to increase the performance of this kernel is to perform a block reduction of many values assuming the array is large enough. In practice, each thread can sum multiple items and perform one atomic operation in the end. The code should look like this (untested):
# Must be launched with different parameters since
# each threads works on more array items.
# The number of block should be 16 times smaller.
#cuda.jit
def count_leq(x, y, out):
tid = cuda.threadIdx.x
bid = cuda.blockIdx.x
bdim = cuda.blockDim.x
i = (bid * bdim * 16) + tid
s = 0
# Fast general case (far from the end of the arrays)
if i+16*bdim < len(x):
# Thread-local reduction
# This loop should be unrolled
for j in range(16):
idx = i + j * bdim
s += x[idx] <= y[idx]
# Slower corner case (close to end of the arrays: checks are needed)
else:
for j in range(16):
idx = i + j * bdim
if idx < len(x):
s += x[idx] <= y[idx]
cuda.atomic.add(out, 0, s)
Note that 16 is an arbitrary value. It is certainly faster to use a bigger value like 64 for huge array and a smaller value for relatively small arrays.
I have a fairly non trivial function I want to differentiate with autograd but I'm not quite enough of a numpy wizard to figure our how to do it without array assingment.
I also apologize that I had to make this example incredibly contrived and meaningless to be able to run standalone. The actual code I'm working with is for non linear finite elements and is trying to compute the jacobian for a complex non linear system.
import autograd.numpy as anp
from autograd import jacobian
def alpha(x):
return anp.exp(-(x - 10) ** 2) / (x + 1)
def f(x):
# Matrix getting constructed
k = anp.zeros((x.shape[0], x.shape[0]))
# loop over some random 3 dimensional vectors
for element in anp.random.randint(0, x.shape[0], (x.shape[0], 3)):
# select 3 values from x
x_ijk = anp.array([[x[i] for i in element]])
norm = anp.linalg.norm(
x_ijk # anp.vstack((element, element)).transpose()
)
# make some matrix from the element
m = element.reshape(3, 1) # element.reshape(1, 3)
# alpha is an arbitrary differentiable function R -> R
alpha_value = alpha(norm)
# combine m matricies into k scaling by alpha_value
n = m.shape[0]
for i in range(n):
for j in range(n):
k[element[i], element[j]] += m[i, j] * alpha_value
return k # x
print(jacobian(f)(anp.random.rand(10)))
# And course we get an error
# k[element[i], element[j]] += m[i, j] * alpha_value
# ValueError: setting an array element with a sequence.
I don't really understand this message since no type error is happening. I assume it must be from assignment.
After writting the above I made a trivial switch to PyTorch and the code runs just fine. But I would still prefer to use autograd
#pytorch version
import torch
from torch.autograd.gradcheck import zero_gradients
def alpha(x):
return torch.exp(x)
def f(x):
# Matrix getting constructed
k = torch.zeros((x.shape[0], x.shape[0]))
# loop over some random 3 dimensional vectors
for element in torch.randint(0, x.shape[0], (x.shape[0], 3)):
# select 3 values from x
x_ijk = torch.tensor([[1. if n == e else 0 for n in range(len(x))] for e in element]) # x
norm = torch.norm(
x_ijk # torch.stack((torch.tanh(element.float() + 4), element.float() - 4)).t()
)
m = torch.rand(3, 3)
# alpha is an arbitrary differentiable function R -> R
alpha_value = alpha(norm)
n = m.shape[0]
for i in range(n):
for j in range(n):
k[element[i], element[j]] += m[i, j] * alpha_value
print(k)
return k # x
x = torch.rand(4, requires_grad=True)
print(x, '\n')
y = f(x)
print(y, '\n')
grads = []
for val in y:
val.backward(retain_graph=True)
grads.append(x.grad.clone())
zero_gradients(x)
if __name__ == '__main__':
print(torch.stack(grads))
In Autograd, and in JAX, you are not allowed to perform array indexing assignments. See the JAX gotchas for a partial explanation of this.
PyTorch allows this functionality. If you want to run your code in autograd, you'll have to find a way to remove the offending line k[element[i], element[j]] += m[i, j] * alpha_value. If you are okay with running your code in JAX (which has essentially the same syntax as autograd, but more features), then it looks like jax.ops could be helpful for performing this sort of indexing assignment.
I am trying to implement a finite difference approximation to solve the Heat Equation, u_t = k * u_{xx}, in Python using NumPy.
Here is a copy of the code I am running:
## This program is to implement a Finite Difference method approximation
## to solve the Heat Equation, u_t = k * u_xx,
## in 1D w/out sources & on a finite interval 0 < x < L. The PDE
## is subject to B.C: u(0,t) = u(L,t) = 0,
## and the I.C: u(x,0) = f(x).
import numpy as np
import matplotlib.pyplot as plt
# parameters
L = 1 # legnth of the rod
T = 10 # terminal time
N = 10
M = 100
s = 0.25
# uniform mesh
x_init = 0
x_end = L
dx = float(x_end - x_init) / N
x = np.arange(x_init, x_end, dx)
x[0] = x_init
# time discretization
t_init = 0
t_end = T
dt = float(t_end - t_init) / M
t = np.arange(t_init, t_end, dt)
t[0] = t_init
# Boundary Conditions
for m in xrange(0, M):
t[m] = m * dt
# Initial Conditions
for j in xrange(0, N):
x[j] = j * dx
# definition of solution u(x,t) to u_t = k * u_xx
u = np.zeros((N, M+1)) # array to store values of the solution
# Finite Difference Scheme:
u[:,0] = x**2 #initial condition
for m in xrange(0, M):
for j in xrange(1, N-1):
if j == 1:
u[j-1,m] = 0 # Boundary condition
elif j == N-1:
u[j+1,m] = 0
else:
u[j,m+1] = u[j,m] + s * ( u[j+1,m] -
2 * u[j,m] + u[j-1,m] )
print u, #t, x
plt.plot(u, t)
#plt.show()
I think my code is working properly and it is producing an output. I want to plot the output of the solution u versus t (my time vector). If I can plot the graph then I am able to check if my numerical approximation agrees with the expected phenomena for the Heat Equation. However, I am getting the error that "x and y must have same first dimension". How can I correct this issue?
An additional question: Am I better off attempting to make an animation with matplotlib.animation instead of using matplotlib.plyplot ???
Thanks so much for any and all help! It is very greatly appreciated!
Okay so I had a "brain dump" and tried plotting u vs. t sort of forgetting that u, being the solution to the Heat Equation (u_t = k * u_{xx}), is defined as u(x,t) so it has values for time. I made the following correction to my code:
print u #t, x
plt.plot(u)
plt.show()
And now my programming is finally displaying an image. And here it is:
It is absolutely beautiful, isn't it?
I have currently made quite a large code in python and when I run it, it takes about 3 minutes for it to make the full calculation. Eventually, I want to increase my N to about 400 and change my m in the for loop to an even larger number - This would probably take hours to calculate which I want to cut down.
It's steps 1-6 that take a long time.
When attempting to run this with cython (I.E. importing pyximport then importing my file)
I get the following error FDC.pyx:49:19: 'range' not a valid cython language construct and
FDC.pyx:49:19: 'range' not a valid cython attribute or is being used incorrectly
from physics import *
from operator import add, sub
import pylab
################ PRODUCING CHARGES AT RANDOM IN r #############
N=11 #Number of point charges
x = zeros(N,float) #grid
y = zeros(N,float)
i=0
while i < N: #code to produce values of x and y within r
x[i] = random.uniform(0,1)
y[i] = random.uniform(0,1)
if x[i] ** 2 + y[i] ** 2 <= 1:
i+=1
print x, y
def r(x,y): #distance between particles
return sqrt(x**2 + y**2)
o = 0; k = 0; W=0 #sum of energy for initial charges
for o in range(0, N):
for k in range(0, N):
if o==k:
continue
xdist=x[o]-x[k]
ydist=y[o]-y[k]
W+= 0.5/(r(xdist,ydist))
print "Initial Energy:", W
##################### STEPS 1-6 ######################
d=0.01 #fixed change in length
charge=(x,y)
l=0; m=0; n=0
prevsW = 0.
T=100
for q in range(0,100):
T=0.9*T
for m in range(0, 4000): #steps 1 - 6 in notes looped over
xRef = random.randint(0,1) #Choosing x or y
yRef = random.randint(0,N-1) #choosing the element of xRef
j = charge[xRef][yRef] #Chooses specific axis of a charge and stores it as 'j'
prevops = None #assigns prevops as having no variable
while True: #code to randomly change charge positions and ensure they do not leave the disc
ops =(add, sub); op=random.choice(ops)
tempJ = op(j, d)
#print xRef, yRef, n, tempJ
charge[xRef][yRef] = tempJ
ret = r(charge[0][yRef],charge[1][yRef])
if ret<=1.0:
j=tempJ
#print "working", n
break
elif prevops != ops and prevops != None: #!= is 'not equal to' so that if both addition and subtraction operations dont work the code breaks
break
prevops = ops #####
o = 0; k = 0; sW=0 #New energy with altered x coordinate
for o in range(0, N):
for k in range(0, N):
if o==k:
continue
xdist = x[o] - x[k]
ydist = y[o] - y[k]
sW+=0.5/(r( xdist , ydist ))
difference = sW - prevsW
prevsW = sW
#Conditions:
p=0
if difference < 0: #accept change
charge[xRef][yRef] = j
#print 'step 5'
randomnum = random.uniform(0,1) #r
if difference > 0: #acceptance with a probability
p = exp( -difference / T )
#print 'step 6', p
if randomnum >= p:
charge[xRef][yRef] = op(tempJ, -d) #revert coordinate to original if r>p
#print charge[xRef][yRef], 'r>p'
#print m, charge, difference
o = 0; k = 0; DW=0 #sum of energy for initial charges
for o in range(0, N):
for k in range(0, N):
if o==k:
continue
xdist=x[o]-x[k]
ydist=y[o]-y[k]
DW+= 0.5/(r(xdist,ydist))
print charge
print 'Final Energy:', DW
################### plotting circle ###################
# use radians instead of degrees
list_radians = [0]
for i in range(0,360):
float_div = 180.0/(i+1)
list_radians.append(pi/float_div)
# list of coordinates for each point
list_x2_axis = []
list_y2_axis = []
# calculate coordinates
# and append to above list
for a in list_radians:
list_x2_axis.append(cos(a))
list_y2_axis.append(sin(a))
# plot the coordinates
pylab.plot(list_x2_axis,list_y2_axis,c='r')
########################################################
pylab.title('Distribution of Charges on a Disc')
pylab.scatter(x,y)
pylab.show()
What is taking time seems to be this:
for q in range(0,100):
...
for m in range(0, 4000): #steps 1 - 6 in notes looped over
while True: #code to randomly change charge positions and ensure they do not leave the disc
....
for o in range(0, N): # <----- N will be brought up to 400
for k in range(0, N):
....
....
....
....
100 x 4000 x (while loop) + 100 x 4000 x 400 x 400 = [400,000 x while loop] + [64,000,000,000]
Before looking into a faster language, maybe there is a better way to build your simulation?
Other than that, you will likely have immediate performance gains if you:
- shift to numpy arrays i/o python lists.
- Use xrange i/o range
[edit to try to answer question in the comments]:
import numpy as np, random
N=11 #Number of point charges
x = np.random.uniform(0,1,N)
y = np.random.uniform(0,1,N)
z = np.zeros(N)
z = np.sqrt(x**2 + y**2) # <--- this could maybe replace r(x,y) (called quite often in your code)
print x, y, z
You also could look into all the variables that are assigned or recalculated many many times inside your main loop (the one described above), and pull all that outside the loop(s) so it is not repeatedly assigned or recalculated.
for instance,
ops =(add, sub); op=random.choice(ops)
maybe could be replaced by
ops = random.choice(add, sub)
Lastly, and here I am out on a limb because I've never used it myself, but it might be a little bit simpler for you to use a package like Numba or Jit as opposed to cython; they allow you to decorate a critical part of your code and have it precompiled prior to execution, with none or very minor changes.