CudaAPIError 716 when trying to copy data from gpu - python

I'm learning Numba and CUDA Python. I've been following a set of YouTube tutorials and have (I believe) understood the principles. My issue is with copying computed values back from my GPU. I use the following line to do this:
aVals = retVal.copy_to_host()
I've also tried using this line:
retVal.copy_to_host( aVals[:] )
Neither work and both give the same error:
numba.cuda.cudadrv.driver.CudaAPIError: [716] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
I'm reasonably confident the above lines are the issue, since if I comment them out the code runs without errors. Is there some underlying issue I'm overlooking when copying an array from GPU to CPU? Have I screwed up my arrays somewhere?
There's a lot of messing around in my code, but here's a bare-bones version:
import numpy as np
import time
from math import sin, cos, tan, sqrt, pi, floor
from numba import vectorize, cuda

num_lines = 1000000  # assumed value; the original snippet leaves this undefined

@cuda.jit('void(double[:],double[:],double[:],double)')
def CalculatePreValues(retVal, ecc, incl, ke):
    i = cuda.grid(1)
    if i >= ecc.shape[0]:
        return
    retVal[i] = (ke/ecc[i])**(2/3)

def main():
    eccen = np.ones(num_lines, dtype=np.float32)
    inclin = np.ones(num_lines, dtype=np.float32)
    ke = 0.0743669161
    aVals = np.zeros(eccen.shape[0])
    start = time.time()
    retVal = cuda.device_array(aVals.shape[0])
    ecc = cuda.to_device(eccen)
    inc = cuda.to_device(inclin)
    threadsPerBlock = 256
    numBlocks = int((ecc.shape[0]+threadsPerBlock-1)/threadsPerBlock)
    CalculatePreValues[numBlocks, threadsPerBlock](retVal, ecc, inc, ke)
    aVals = retVal.copy_to_host()
    preCalcTime = time.time() - start
    print("Precalculation took %s seconds" % preCalcTime)
    print(aVals.shape[0])

if __name__ == '__main__':
    main()

There are several points to make here.
Firstly, the source of the error you are seeing is a runtime error coming from the kernel execution. If I run a hacky "fixed" version of your code using cuda-memcheck, I see this:
$ cuda-memcheck python ./error.py
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
========= at 0x00000178 in cudapy::__main__::CalculatePreValues$241(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
========= by thread (255,0,0) in block (482,0,0)
========= Address 0x7061317f8 is out of bounds
The reason is that the bounds checking in your kernel is broken:
if i > ecc.shape[0]:
    return
should be
if i >= ecc.shape[0]:
    return
When the question was updated to include an MCVE, it became evident that there was another issue. The kernel signature specifies double for all the arrays:
@cuda.jit('void(double[:],double[:],double[:],double)')
                ^^^^^^    ^^^^^^    ^^^^^^
but the arrays were actually created with float elements (i.e. np.float32):
eccen = np.ones(num_lines, dtype=np.float32)
inclin = np.ones(num_lines, dtype=np.float32)
                                  ^^^^^^^^^^
This is a mismatch. The kernel indexes the arrays with the 8-byte stride of a double, but the arrays were allocated with 4-byte float32 elements, so the reads run past the end of the allocations; this is exactly the out-of-bounds access cuda-memcheck reports.
The solution is either to create the arrays with dtype=np.float64, or to change the array types in the signature to float32:
@cuda.jit('void(float32[:],float32[:],float32[:],float64)')
to eliminate the out-of-bounds indexing.
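For reference, here is a minimal sketch combining both fixes, with the arrays and the kernel signature agreeing on float32 (num_lines is an assumed placeholder, not a value from the original post):

import numpy as np
from numba import cuda

num_lines = 1000000  # assumed problem size

@cuda.jit('void(float32[:],float32[:],float32[:],float64)')
def CalculatePreValues(retVal, ecc, incl, ke):
    i = cuda.grid(1)
    if i >= ecc.shape[0]:  # >= guards the extra threads of the partial last block
        return
    retVal[i] = (ke/ecc[i])**(2/3)

eccen = np.ones(num_lines, dtype=np.float32)
inclin = np.ones(num_lines, dtype=np.float32)
ke = 0.0743669161

retVal = cuda.device_array(num_lines, dtype=np.float32)  # dtype now matches the signature
ecc = cuda.to_device(eccen)
inc = cuda.to_device(inclin)

threadsPerBlock = 256
numBlocks = (num_lines + threadsPerBlock - 1) // threadsPerBlock
CalculatePreValues[numBlocks, threadsPerBlock](retVal, ecc, inc, ke)
aVals = retVal.copy_to_host()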

Related

Have trouble using numba atomic operation functions (cuda.atomic.compare_and_swap)

I am trying to use Numba to write CUDA kernels for my code, and I want to use an atomic operation in part of it, so I wrote a test kernel to see how cuda.atomic.compare_and_swap works. Its documentation (shown as a screenshot in the original post) says it conditionally assigns a value to the first element of a 1D array if the current value matches the supplied one:
from numba import cuda
import numpy as np

@cuda.jit
def atomicCAS(N, out1):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx >= N:
        return
    A = out1[idx:]
    cuda.atomic.compare_and_swap(A, idx, 0)

N = 1024
out1 = np.arange(N)
out1 = np.zeros(N)
dout1 = cuda.to_device(out1)
tpb = 32
bpg = int(np.ceil(N/tpb))
atomicCAS[bpg, tpb](N, dout1)
hout1 = dout1.copy_to_host()
Then I got this error:
TypingError: Invalid use of Function(<class 'numba.cuda.stubs.atomic.compare_and_swap'>) with argument(s) of type(s): (array(float64, 1d, A), int64, Literal[int](0))
* parameterized
In definition 0:
All templates rejected with literals.
In definition 1:
All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<class 'numba.cuda.stubs.atomic.compare_and_swap'>)
[2] During: typing of call at /home/qinyu/test.py (20)
This is pretty naive code, and I think I am feeding in the right types of variables, but I got this TypingError. It worked fine with the other atomic operations in Numba; this is the only one that does not work for me. Can somebody help me figure out the problem, or is there an alternative way to do this? Thanks!
The key in the error message is this:
array(float64, 1d, A), int64, Literal[int](0))
CUDA atomicCAS only supports integer types. You cannot pass a floating point type.
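As a minimal sketch of the fix (assuming the goal is just to exercise compare_and_swap), allocate the array with an integer dtype such as np.int32 so the atomic operates on a supported type:

from numba import cuda
import numpy as np

@cuda.jit
def atomicCAS(N, out1):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx >= N:
        return
    A = out1[idx:]                           # view starting at idx; the CAS acts on A[0]
    cuda.atomic.compare_and_swap(A, idx, 0)  # store 0 into A[0] if A[0] == idx

N = 1024
out1 = np.arange(N, dtype=np.int32)  # integer dtype instead of float64
dout1 = cuda.to_device(out1)
tpb = 32
bpg = (N + tpb - 1) // tpb
atomicCAS[bpg, tpb](N, dout1)
hout1 = dout1.copy_to_host()  # every element matched its own index, so all end up 0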

Cython in Colab: "cdef struct Rectangle" (line 9) identified as syntax error

I have this piece of code showcased in the latest live stream of Siraj Raval:
%laod_ext Cython
%%cython
# memory management helper for Cython
from cymem.cymem import Pool
# good old python
from random import random

# The cdef statement is used to declare C variables, types, and functions
cdef struct Rectangle:
    # C variables
    float w
    float h

# the "*" is the pointer operator, it gives the value stored at a particular address
# this saves memory and runs faster, since we don't have to duplicate the data
cdef int check_rectangles_cy(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    # C arrays contain no size information => we need to state it explicitly
    for rectangle in rectangles[:n_rectangles]:
        if rectangle.w * rectangle.h > threshold:
            n_out += 1
    return n_out

# python uses garbage collection instead of manual memory management,
# which means developers can freely create objects
# and Python's memory manager will periodically look for any
# objects that are no longer referenced by their program
# this overhead makes demands on the runtime environment (slower),
# so manual memory management is better
def main_rectangles_fast():
    cdef int n_rectangles = 10000000
    cdef float threshold = 0.25
    # The Pool object will save memory addresses internally,
    # then free them when the object is garbage collected
    cdef Pool mem = Pool()
    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    n_out = check_rectangles_cy(rectangles, n_rectangles, threshold)
    print(n_out)
When run, this code outputs the following error:
File "ipython-input-18-25198e914011", 'line 9'
cdef struct Rectangle:
SyntaxError: invalid syntax
What have I written wrong?
Link to the video: https://www.youtube.com/watch?v=giF8XoPTMFg
You had a typo. Replace
%laod_ext Cython
by
%load_ext Cython
and things should work as expected.
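For completeness, a sketch of how the two magics are meant to be laid out in a notebook: %load_ext goes in its own cell, and the %%cython cell magic must be the first line of the cell containing the Cython code:

# cell 1: load the extension (note the spelling)
%load_ext Cython

%%cython
# cell 2: everything below the %%cython cell magic is compiled by Cython
cdef struct Rectangle:
    float w
    float h
print(sizeof(Rectangle))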

assignment into arbitrary array locations using cython. assignment speed depends on value?

I'm seeing some bizarre behavior in cython code. I'm writing code to compute a forward Kalman filter, but I have a state transition model which has many 0s in it, so it would be nice to be able to calculate only certain elements of the covariance matrices.
So to test this out, I wanted to fill individual array elements using Cython. To my surprise, I found that:
1. writing output to specific array locations (function fill(...)) is very slow compared to just assigning the result to a scalar variable every time (function nofill(...), which essentially throws the results away), and
2. setting C=0.1 or C=31 does not affect how long nofill(...) takes to run, but the latter choice for C makes fill(...) run twice as slowly.
This is baffling to me. Can anyone explain why I'm seeing this?
Code:
################# file way_too_slow.pyx
from libc.math cimport sin

# Setting C=0.1 or 31 doesn't affect the performance of calling nofill(...),
# but it makes fill(...) slower. I have no clue why.
cdef double C = 0.1

# This function just throws away its output.
def nofill(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            d = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here

# Same computation, but this function keeps its output.
# However: it is MUCH slower than nofill(...).
def fill(double[::1] x, double[::1] y, double[::1] out, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double *p_o = &out[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            p_o[i] = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here
The above code is called by the Python program
#################### run_way_too_slow.py
import numpy as _N  # needed for _N.random.randn and _N.empty below
import way_too_slow as _wts
import time as _tm

N = 80000
x = _N.random.randn(N)
y = _N.random.randn(N)
out = _N.empty(N)

t1 = _tm.time()
_wts.nofill(x, y, N)
t2 = _tm.time()
_wts.fill(x, y, out, N)
t3 = _tm.time()

print "nofill() ET: %.3e" % (t2-t1)
print "fill() ET: %.3e" % (t3-t2)
print "fill() is slower by factor %.3f" % ((t3-t2)/(t2-t1))
The Cython was compiled using the following setup.py file:
################# setup.py
from distutils.core import setup, Extension
from distutils.sysconfig import get_python_inc
from distutils.extension import Extension
from Cython.Distutils import build_ext

incdir = [get_python_inc(plat_specific=1)]
libdir = ['/usr/local/lib']
cmdclass = {'build_ext' : build_ext}

ext_modules = Extension("way_too_slow",
                        ["way_too_slow.pyx"],
                        include_dirs=incdir,  # include_dirs for Mac
                        library_dirs=libdir)

setup(
    name="way_too_slow",
    cmdclass=cmdclass,
    ext_modules=[ext_modules]
)
Here is a typical output of running "run_way_too_slow.py" using C=0.1
>>> exf("run_way_too_slow.py")
nofill() ET: 6.700e-05
fill() ET: 6.409e-04
fill() is slower by factor 9.566
A typical run with C=31.
>>> exf("run_way_too_slow.py")
nofill() ET: 6.795e-05
fill() ET: 1.566e-03
fill() is slower by factor 23.046
As we can see:
1. Assigning into specified array locations is quite slow compared to assigning to a double.
2. For some reason, the assignment speed seems to depend on what operation was done in the calculation, which makes no sense to me.
Any insight would be greatly appreciated.
Two things explain your observations:
A: In the first version, nothing actually happens: the C compiler is clever enough to see that the whole loop has no effect at all outside of the function, and optimizes it away.
To force the computation to be executed, you must make the result d visible outside the function, for example via:
cdef double d=0
....
d+=....
return d
It might still be faster than the writing-to-an-array version because of the fewer costly memory accesses, but now you will see a slowdown when changing the value of C.
B: sin is a complicated function, and how long it takes to calculate depends on its argument. For very small arguments the argument itself can be returned, but for bigger arguments a much longer Taylor series must be evaluated. The cost of tanh behaves the same way: it is also calculated via different approximations/Taylor series depending on the argument, and the most important point is that the time needed depends on the argument.
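A minimal sketch of suggestion A applied to nofill(...) (my illustration, not the answerer's exact code): accumulate into d and return it, so the loop has an observable result and cannot be optimized away:

from libc.math cimport sin

cdef double C = 0.1

def nofill_sum(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double d = 0
    with nogil:
        for 0 <= i < N:
            # accumulating into d makes the loop's result observable,
            # so the C compiler can no longer remove the loop
            d += ((x[i] + y[i])*3 + x[i] - y[i]) + sin(x[i]*C)
    return d

With this version the dependence on C should show up in the timings too, as the answer predicts.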

Cython: Pass a 2D array from Python to a C and retrieve it

I am trying to build a wrapper for a camera driver written in C, using Cython. I am new to Cython (started two weeks ago). After a lot of struggle, I could successfully develop wrappers for structures and 1D arrays, but now I am stuck with 2D arrays.
One of the camera's C APIs takes a 2D array pointer as input and assigns the captured image to it. This function needs to be called from Python, and the output image needs to be processed/displayed in Python. After going through the Cython docs and various Stack Overflow posts, I ended up with more confusion. I could not figure out how to pass 2D arrays between Python and C. The driver API looks (somewhat) like this:
driver.h
void assign_values2D(double **matrix, unsigned int row_size, unsigned int column_size);
c_driver.pxd
cdef extern from "driver.h":
    void assign_values2D(double **matrix, unsigned int row_size, unsigned int column_size)
test.pyx
from c_driver import assign_values2D
import numpy as np
cimport numpy as np
cimport cython
from libc.stdlib cimport malloc, free
import ctypes

@cython.boundscheck(False)
@cython.wraparound(False)
def assignValues2D(np.ndarray[np.double_t, ndim=2, mode='c'] mat):
    row_size, column_size = np.shape(mat)
    cdef np.ndarray[double, ndim=2, mode="c"] temp_mat = np.ascontiguousarray(mat, dtype=ctypes.c_double)
    cdef double **mat_pointer = <double **>malloc(row_size * sizeof(double*))
    if not mat_pointer:
        raise MemoryError
    try:
        for i in range(row_size):
            mat_pointer[i] = &temp_mat[i, 0]
        assign_values2D(<double **> &mat_pointer[0], row_size, column_size)
        return np.array(mat)
    finally:
        free(mat_pointer)
test_camera.py
b = np.zeros((5,5), dtype=np.float)  # sample code
print "B Before = "
print b
assignValues2D(b)
print "B After = "
print b
When compiled, it gives the error:
Error compiling Cython file:
------------------------------------------------------------
...
    if not mat_pointer:
        raise MemoryError
    try:
        for i in range(row_size):
            mat_pointer[i] = &temp_mat[i, 0]
                             ^
------------------------------------------------------------
test.pyx:120:21: Cannot take address of Python variable
In fact, the above code was taken from a stack-overflow post. I have tried several other ways but none of them are working. Please let me know how I can get the 2D image into Python. Thanks in advance.
You need to type i:
cdef int i
(Alternatively you can type row_size and it also works.)
Once it knows that i is an int, it can work out the type that indexing temp_mat gives, and so the & operator works.
Normally Cython is pretty good about figuring out the type of loop variables like i, but I think the issue here is that it can't deduce the type of row_size, so it decides it can't deduce the type of i, since i is deduced from range(row_size). Because of that, it can't deduce the type of temp_mat[i, 0].
I suspect you also want to change the return statement to return np.array(temp_mat): the code you have will likely work most of the time, but occasionally np.ascontiguousarray will have to make a copy, and then mat won't be changed.
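Putting both suggestions into the question's function gives roughly the following sketch (driver.h and assign_values2D are the question's camera API, assumed unchanged; the cimport assumes the declaration lives in c_driver.pxd):

from c_driver cimport assign_values2D  # cimport pulls in the cdef extern declaration
import numpy as np
cimport numpy as np
cimport cython
from libc.stdlib cimport malloc, free
import ctypes

@cython.boundscheck(False)
@cython.wraparound(False)
def assignValues2D(np.ndarray[np.double_t, ndim=2, mode='c'] mat):
    cdef int i  # typed loop variable, so &temp_mat[i, 0] now compiles
    row_size, column_size = np.shape(mat)
    cdef np.ndarray[double, ndim=2, mode="c"] temp_mat = np.ascontiguousarray(mat, dtype=ctypes.c_double)
    cdef double **mat_pointer = <double **>malloc(row_size * sizeof(double*))
    if not mat_pointer:
        raise MemoryError
    try:
        for i in range(row_size):
            mat_pointer[i] = &temp_mat[i, 0]
        assign_values2D(mat_pointer, row_size, column_size)
        return np.array(temp_mat)  # return the array the C code actually wrote into
    finally:
        free(mat_pointer)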

Passing struct with pointer members to OpenCL kernel using PyOpenCL

Let's suppose I have a kernel to compute the element-wise sum of two arrays. Rather than passing a, b, and c as three parameters, I make them structure members as follows:
typedef struct
{
    __global uint *a;
    __global uint *b;
    __global uint *c;
} SumParameters;

__kernel void compute_sum(__global SumParameters *params)
{
    uint id = get_global_id(0);
    params->c[id] = params->a[id] + params->b[id];
    return;
}
There is information on structures if you RTFM of PyOpenCL [1], and others have addressed this question too [2] [3] [4]. But none of the OpenCL struct examples I've been able to find have pointers as members.
Specifically, I'm worried about whether host/device address spaces match, and whether host/device pointer sizes match. Does anyone know the answer?
[1] http://documen.tician.de/pyopencl/howto.html#how-to-use-struct-types-with-pyopencl
[2] Struct Alignment with PyOpenCL
[3] http://enja.org/2011/03/30/adventures-in-opencl-part-3-constant-memory-structs/
[4] http://acooke.org/cute/Somesimple0.html
No, there is no guarantee that address spaces match. For the basic types (float, int, ...) there are alignment requirements (section 6.1.5 of the standard), and you have to use the cl_type names of the OpenCL implementation when programming in C (PyOpenCL does this job under the hood, I'd say).
For the pointers it's even simpler, due to this mismatch. The very beginning of section 6.9 of the standard v1.2 (section 6.8 for version 1.1) states:
Arguments to kernel functions declared in a program that are pointers
must be declared with the __global, __constant or __local qualifier.
And in the point p.:
Arguments to kernel functions that are declared to be a struct or
union do not allow OpenCL objects to be passed as elements of the
struct or union.
Note also the point d.:
Variable length arrays and structures with flexible (or unsized)
arrays are not supported.
So there is no way to make your kernel run as described in your question, and that's why you haven't been able to find any examples of OpenCL structs with pointers as members.
I can still propose a workaround that takes advantage of the fact that the kernel is compiled just in time. It still requires that you pack your data properly, that you pay attention to alignment, and that the size doesn't change during the execution of the program. I honestly would go for a kernel taking 3 buffers as arguments but, anyhow, here it is.
The idea is to use the preprocessor option -D, as in the following Python example:
Kernel:
typedef struct {
    uint a[SIZE];
    uint b[SIZE];
    uint c[SIZE];
} SumParameters;

kernel void foo(global SumParameters *params){
    int idx = get_global_id(0);
    params->c[idx] = params->a[idx] + params->b[idx];
}
Host code:
import numpy as np
import pyopencl as cl
def bar():
mf = cl.mem_flags
ctx = cl.create_some_context()
queue = cl.CommandQueue(self.ctx)
prog_f = open('kernels.cl', 'r')
#a = (1, 2, 3), b = (4, 5, 6)
ary = np.array([(1, 2, 3), (4, 5, 6), (0, 0, 0)], dtype='uint32, uint32, uint32')
cl_ary = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=ary)
#Here should compute the size, but hardcoded for the example
size = 3
#The important part follows using -D option
prog = cl.Program(ctx, prog_f.read()).build(options="-D SIZE={0}".format(size))
prog.foo(queue, (size,), None, cl_ary)
result = np.zeros_like(ary)
cl.enqueue_copy(queue, result, cl_ary).wait()
print result
And the result:
[(1L, 2L, 3L) (4L, 5L, 6L) (5L, 7L, 9L)]
I don't know the answer to my own question, but there are 3 workarounds I can come up with off the top of my head. I consider Workaround 3 the best option.
Workaround 1: We only have 3 parameters here, so we could just make a, b, and c kernel parameters. But I've read there's a limit on the number of parameters you can pass to a kernel, and I personally like to refactor any function that takes more than 3-4 arguments to use structs (or, in Python, tuples or keyword arguments). So this solution makes the code harder to read, and doesn't scale.
Workaround 2: Dump everything in a single giant array. Then the kernel would look like this:
typedef struct
{
    uint ai;
    uint bi;
    uint ci;
} SumParameters;

__kernel void compute_sum(__global SumParameters *params, __global uint *data)
{
    uint id = get_global_id(0);
    data[params->ci + id] = data[params->ai + id] + data[params->bi + id];
    return;
}
In other words, instead of using pointers, use offsets into a single array. This looks an awful lot like the beginnings of implementing my own memory model, and it feels like it's reinventing a wheel that exists somewhere in PyOpenCL, or OpenCL, or both.
Workaround 3: Make setter kernels. Like this:
__kernel void set_a(__global SumParameters *params, __global uint *a)
{
    params->a = a;
    return;
}
and ditto for set_b, set_c. Then execute these kernels with worksize 1 to set up the data structure. You still need to know how big a block to allocate for params, but if it's too big, nothing bad will happen (except a little wasted memory), so I'd say just assume the pointers are 64 bits.
This workaround's performance is probably awful (I imagine a kernel call has enormous overhead), but fortunately that shouldn't matter too much for my application: my kernel is going to run for seconds at a time, it's not a graphics thing that has to run at 30-60 fps, so the time taken by the extra kernel calls to set parameters should end up being a tiny fraction of my workload, no matter how high the per-kernel-call overhead is.
