How to initialise a GPU array using memory address in pycuda? - python

I have C++ code that produces an image array as output in GPU memory. I want to do further processing and image analytics on it using PyCUDA, so I am trying to construct a GPUArray around it.
For testing purposes, I created an array in C++ as:
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
int* dev_a = nullptr;
cudaMalloc((void**)&dev_a, arraySize * sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
printf("dev_a : %p\n", dev_a);
Suppose the GPU memory address printed here is '0x7f3454800000'. I am using this address to create the GPUArray as follows:
from pycuda.gpuarray import GPUArray
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda.driver import PointerHolderBase

drv.init()

class Holder(PointerHolderBase):
    def __init__(self):
        super().__init__()
        self.gpudata = '0x7f5954800000'

    def get_pointer(self):
        return self.gpudata

    def __int__(self):
        return self.__index__()

    # without an __index__ method, arithmetic calls to the GPUArray backed by
    # this pointer fail; not sure why, this needs to return some integer, apparently
    def __index__(self):
        return self.gpudata

array = GPUArray((1, 5), dtype=np.int32, gpudata=Holder())
print(array.get())
When I run the code, I get this error:
File "array_test.py", line 43, in <module>
print(array.get())
File "/home/govindam/anaconda3/envs/tf_c2/lib/python3.6/site-packages/pycuda/gpuarray.py", line 305, in get
_memcpy_discontig(ary, self, async_=async_, stream=stream)
File "/home/govindam/anaconda3/envs/tf_c2/lib/python3.6/site-packages/pycuda/gpuarray.py", line 1309, in _memcpy_discontig
drv.memcpy_dtoh(dst, src.gpudata)
TypeError: No registered converter was able to produce a C++ rvalue of type unsigned long long from this Python object of type st
How do I pass the memory address when creating the GPUArray so that I avoid copying from GPU to CPU and back to GPU again?
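The traceback points at the problem: drv.memcpy_dtoh needs the device pointer as an integer (an unsigned long long), but the holder stores it as a string. A minimal sketch of a fix, assuming the allocation was made in the same process and CUDA context (a raw address printed by a separate C++ process is not valid in a different process):
from pycuda.gpuarray import GPUArray
from pycuda.driver import PointerHolderBase
import numpy as np
import pycuda.autoinit

class Holder(PointerHolderBase):
    def __init__(self, addr):
        super().__init__()
        self.gpudata = addr  # must be a plain int, e.g. int('0x7f5954800000', 16)

    def get_pointer(self):
        return self.gpudata

    def __int__(self):
        return self.gpudata

    def __index__(self):
        return self.gpudata

# illustrative address only; substitute the value your C++ code actually prints
array = GPUArray((1, 5), dtype=np.int32, gpudata=Holder(0x7f5954800000))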

Related

Generating single random number in pyCuda kernel

I have seen many ways to generate an array of random numbers, but I want to generate a single random number. Is there a function like rand() in C++? I don't want a series of random numbers; I just need to generate one random number inside the kernel. Is there any built-in function for this? I have tried the code below, but it is not working.
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray
code = """
#include <curand_kernel.h>
__device__ float getRand()
{
curandState_t s;
curand_init(clock64(), 123456, 0, &s);
return curand_uniform(&s);
}
__global__ void myRand(float *values)
{
values[0] = getRand();
}
"""
mod = SourceModule(code)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
The errors look like this:
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_poisson.h(548): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(69): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(78): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(86): error: this declaration may not have extern "C" linkage
30 errors detected in the compilation of "kernel.cu".
The basic problem is that, by default, PyCUDA silently applies C linkage to all code compiled in a SourceModule. As the error shows, cuRAND requires C++ linkage, so getRand can't have C linkage.
You can fix this either by changing these two lines:
mod = SourceModule(code)
myRand = mod.get_function("myRand")
to
mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("_Z6myRandPf")
This disables C linkage, but it means you need to supply the C++ mangled name to the get_function call. You will need to look at the verbose compiler output or compile the code outside of PyCUDA to get that name (for example, with Godbolt).
Alternatively you can modify the code like this:
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray
code = """
#include <curand_kernel.h>
__device__ float getRand()
{
curandState_t s;
curand_init(clock64(), 123456, 0, &s);
return curand_uniform(&s);
}
extern "C" {
__global__ void myRand(float *values)
{
values[0] = getRand();
}
}
"""
mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
This leaves the kernel with C linkage, but doesn't touch the device function that uses cuRAND.
Alternatively, you can import random in Python and use random.randint() to generate a random number in a specified range, e.g. random.randint(0, 50).
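For completeness, a quick illustration of that suggestion (note this runs on the host CPU; it cannot be called inside a CUDA kernel):
import random
value = random.randint(0, 50)  # single random integer in [0, 50], generated on the host
print(value)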

In Numba, how to copy an array into constant memory when targeting CUDA?

I have a sample code that illustrates the issue:
import numpy as np
from numba import cuda, types
import configs

def main():
    arr = np.empty(0, dtype=np.uint8)
    stream = cuda.stream()
    d_arr = cuda.to_device(arr, stream=stream)
    kernel[configs.BLOCK_COUNT, configs.THREAD_COUNT, stream](d_arr)

@cuda.jit(types.void(
    types.Array(types.uint8, 1, 'C'),
), debug=configs.CUDA_DEBUG)
def kernel(d_arr):
    arr = cuda.const.array_like(d_arr)

if __name__ == "__main__":
    main()
When I run this code with cuda-memcheck, I get:
numba.errors.ConstantInferenceError: Failed in nopython mode pipeline (step: nopython rewrites)
Constant inference not possible for: arg(0, name=d_arr)
This seems to indicate that the array I passed in was not a constant, so it could not be copied to constant memory. Is that the case? If so, how can I copy to constant memory an array that was given to a kernel as input?
You don't copy an array that was given to the kernel as input into constant memory.
That type of input array is already on the device, and device code cannot write to constant memory.
Constant memory can only be written to from host code, and the constant syntax expects the array to be a host array.
Here is an example:
$ cat t32.py
import numpy as np
from numba import cuda, types, int32, int64

a = np.ones(3, dtype=np.int32)

@cuda.jit
def generate_mutants(b):
    c_a = cuda.const.array_like(a)
    b[0] = c_a[0]

if __name__ == "__main__":
    b = np.zeros(3, dtype=np.int32)
    generate_mutants[1, 1](b)
    print(b)
$ python t32.py
[1 0 0]
$
Note that the implementation of constant memory in Numba CUDA has some behavioral differences compared to what is possible with CUDA C/C++; this issue highlights some of them.

pyCUDA reduction doesn't work

I am using reduction code basically exactly like the examples in the docs. The code below should return 2^3 + 2^3 = 16, but it instead returns 9. What did I do wrong?
import numpy
import pycuda.reduction as reduct
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule as module
newzeros = [{1,2,3},{4,5,6}]
gpuSum = reduct.ReductionKernel(numpy.uint64, neutral="0", reduce_expr="a+b", map_expr="1 << x[i]", arguments="int* x")
mylengths = pycuda.gpuarray.to_gpu(numpy.array(map(len,newzeros),dtype = "uint64",))
sumfalse = gpuSum(mylengths).get()
print sumfalse
I just figured it out: the argument list used when defining the kernel should be unsigned long *x, not int *x. I was using 64-bit integers everywhere else, and the type mismatch corrupted the result.
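For reference, a corrected kernel definition along those lines (a minimal sketch; only the arguments string changes from the code above):
gpuSum = reduct.ReductionKernel(numpy.uint64, neutral="0",
                                reduce_expr="a+b", map_expr="1 << x[i]",
                                arguments="unsigned long *x")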

Force NumPy ndarray to take ownership of its memory in Cython

Following this answer to "Can I force a numpy ndarray to take ownership of its memory?" I attempted to use the Python C API function PyArray_ENABLEFLAGS through Cython's NumPy wrapper and found it is not exposed.
The following attempt to expose it manually (this is just a minimal example reproducing the failure)
from libc.stdlib cimport malloc
import numpy as np
cimport numpy as np
np.import_array()

ctypedef np.int32_t DTYPE_t

cdef extern from "numpy/ndarraytypes.h":
    void PyArray_ENABLEFLAGS(np.PyArrayObject *arr, int flags)

def test():
    cdef int N = 1000
    cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
    cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, np.NPY_INT32, data)
    PyArray_ENABLEFLAGS(arr, np.NPY_ARRAY_OWNDATA)
fails with a compile error:
Error compiling Cython file:
------------------------------------------------------------
...
def test():
    cdef int N = 1000
    cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
    cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, np.NPY_INT32, data)
    PyArray_ENABLEFLAGS(arr, np.NPY_ARRAY_OWNDATA)
                        ^
------------------------------------------------------------
/tmp/test.pyx:19:27: Cannot convert Python object to 'PyArrayObject *'
My question: Is this the right approach to take in this case? If so, what am I doing wrong? If not, how do I force NumPy to take ownership in Cython, without going down to a C extension module?
You just have some minor errors in the interface definition. The following worked for me:
from libc.stdlib cimport malloc
import numpy as np
cimport numpy as np
np.import_array()

ctypedef np.int32_t DTYPE_t

cdef extern from "numpy/arrayobject.h":
    void PyArray_ENABLEFLAGS(np.ndarray arr, int flags)

cdef data_to_numpy_array_with_spec(void * ptr, np.npy_intp N, int t):
    cdef np.ndarray[DTYPE_t, ndim=1] arr = np.PyArray_SimpleNewFromData(1, &N, t, ptr)
    PyArray_ENABLEFLAGS(arr, np.NPY_OWNDATA)
    return arr

def test():
    N = 1000
    cdef DTYPE_t *data = <DTYPE_t *>malloc(N * sizeof(DTYPE_t))
    arr = data_to_numpy_array_with_spec(data, N, np.NPY_INT32)
    return arr
This is my setup.py file:
from distutils.core import setup, Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("_owndata", ["owndata.pyx"])]
setup(cmdclass={'build_ext': build_ext}, ext_modules=ext_modules)
Build with python setup.py build_ext --inplace. Then verify that the data is actually owned:
import _owndata
arr = _owndata.test()
print arr.flags
Among others, you should see OWNDATA : True.
And yes, this is definitely the right way to deal with this, since numpy.pxd does exactly the same thing to export all the other functions to Cython.
@Stefan's solution works for most scenarios, but is somewhat fragile. NumPy uses PyDataMem_NEW/PyDataMem_FREE for memory management, and it is an implementation detail that these calls are mapped to the usual malloc/free plus some memory tracing (I don't know what effect Stefan's solution has on the memory tracing; at least it doesn't seem to crash).
There are also more esoteric cases possible, in which free from the NumPy library doesn't use the same memory allocator as malloc in the Cython code (linked against different runtimes, for example, as in this GitHub issue or this SO post).
The right tool to pass/manage ownership of the data is PyArray_SetBaseObject.
First we need a Python object that is responsible for freeing the memory. I'm using a self-made cdef class here (mostly for logging/demonstration purposes), but there are obviously other possibilities as well:
%%cython
from libc.stdlib cimport free

cdef class MemoryNanny:
    cdef void* ptr  # set to NULL by "constructor"

    def __dealloc__(self):
        print("freeing ptr=", <unsigned long long>(self.ptr))  # just for debugging
        free(self.ptr)

    @staticmethod
    cdef create(void* ptr):
        cdef MemoryNanny result = MemoryNanny()
        result.ptr = ptr
        print("nanny for ptr=", <unsigned long long>(result.ptr))  # just for debugging
        return result
...
Now we use a MemoryNanny object as a sentinel for the memory, which gets freed as soon as the parent NumPy array is destroyed. The code is a little awkward, because PyArray_SetBaseObject steals the reference, which Cython does not handle automatically:
%%cython
...
from cpython.object cimport PyObject
from cpython.ref cimport Py_INCREF
cimport numpy as np

# needed to initialize PyArray_API in order to be able to use it
np.import_array()

cdef extern from "numpy/arrayobject.h":
    # a little bit awkward: the reference to obj will be stolen
    # using PyObject* to signal that Cython cannot handle it automatically
    int PyArray_SetBaseObject(np.ndarray arr, PyObject *obj) except -1  # -1 means there was an error

cdef array_from_ptr(void * ptr, np.npy_intp N, int np_type):
    cdef np.ndarray arr = np.PyArray_SimpleNewFromData(1, &N, np_type, ptr)
    nanny = MemoryNanny.create(ptr)
    Py_INCREF(nanny)  # a reference will get stolen, so prepare nanny
    PyArray_SetBaseObject(arr, <PyObject*>nanny)
    return arr
...
And here is an example of how this functionality can be called:
%%cython
...
from libc.stdlib cimport malloc

def create():
    cdef double *ptr = <double*>malloc(sizeof(double)*8)
    ptr[0] = 42.0
    return array_from_ptr(ptr, 8, np.NPY_FLOAT64)
which can be used as follows:
>>> m = create()
nanny for ptr= 94339864945184
>>> m.flags
...
OWNDATA : False
...
>>> m[0]
42.0
>>> del m
freeing ptr= 94339864945184
with results/output as expected.
Note: the resulting array doesn't really own the data (i.e. flags reports OWNDATA : False), because the memory is owned by the memory nanny, but the result is the same: the memory gets freed as soon as the array is deleted (because nobody holds a reference to the nanny anymore).
MemoryNanny doesn't have to guard a raw C pointer; it can guard anything else, for example a std::vector:
%%cython -+
from libcpp.vector cimport vector

cdef class VectorNanny:
    # automatically default-initialized/destructed by Cython:
    cdef vector[double] vec

    @staticmethod
    cdef create(vector[double]& vec):
        cdef VectorNanny result = VectorNanny()
        result.vec.swap(vec)  # swap and not copy
        return result

# for testing:
def create_vector(int N):
    cdef vector[double] vec
    vec.resize(N, 2.0)
    return VectorNanny.create(vec)
The following test shows that the nanny works:
nanny=create_vector(10**8) # top shows additional 800MB memory are used
del nanny # top shows, this additional memory is no longer used.
The latest Cython versions allow you to do this with minimal syntax, albeit with slightly more overhead than the lower-level solutions suggested above.
numpy_array = np.asarray(<np.int32_t[:10, :10]> my_pointer)
https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#coercion-to-numpy
This alone does not pass ownership.
Notably, a Cython array is generated by this call, via array_cwrapper. This generates a cython.array without allocating memory. The cython.array uses the stdlib.h malloc and free by default, so it is expected that you use the default malloc as well, rather than any special CPython/NumPy allocators.
free is only called if ownership is set for this cython.array, which by default it is only when it allocates the data itself. For our case, we can manually set it via:
my_cyarr.free_data = True
So to return a 1D array, it would be as simple as:
from cython.view cimport array as cvarray
# ...
cdef cvarray cvarr = <np.int32_t[:N]> data
cvarr.free_data = True
return np.asarray(cvarr)

Unrolling a trivially parallelizable for loop in python with CUDA

I have a for loop in python that I want to unroll onto a GPU. I imagine there has to be a simple solution but I haven't found one yet.
Our function loops over the elements of a numpy array, does some math, and stores the results in another numpy array. Each iteration adds something to this result array. A possible large simplification of our code might look like this:
import numpy as np
a = np.arange(100)
out = np.array([0, 0])
for x in xrange(a.shape[0]):
    out[0] += a[x]
    out[1] += a[x]/2.0
How can I unroll a loop like this in Python to run on a GPU?
The place to start is http://documen.tician.de/pycuda/. The example there is:
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print dest-a*b
You place the part of the code you want to parallelize in the C code segment and call it from Python.
For your example, the size of your data will need to be much bigger than 100 to make it worthwhile. You'll need some way to divide your data into blocks. If you wanted to add 1,000,000 numbers, you could divide the data into 1000 blocks, add up each block in the parallelized code, and then add the partial results together in Python.
Adding things up is not really a natural task for this kind of parallelisation: GPUs tend to perform the same task for each pixel, whereas here you have a single task that needs to operate over many elements.
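That said, for a summation like the one in the question, PyCUDA's gpuarray module already provides reductions that do the block-wise work for you. A minimal sketch of the example above using gpuarray.sum (which returns a single-element array on the GPU):
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(np.arange(100, dtype=np.float32))
out0 = gpuarray.sum(a_gpu).get()        # equivalent of out[0]: sum of a[x]
out1 = gpuarray.sum(a_gpu / 2.0).get()  # equivalent of out[1]: sum of a[x]/2.0
print(out0, out1)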
It might be better to get comfortable with CUDA itself first. A related thread:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
