Generating single random number in pyCuda kernel

Generating single random number in pyCuda kernel - python

I have seen many ways to generate an array of random numbers. but I want to generate a single random number. Is there any function as rand() in c++. I don't want a series of random numbers. I just need to generate a random number inside the kernel. is there any builtin function to generate random numbers? I have tried the given code below, but it not working.
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray
code = """
#include <curand_kernel.h>
__device__ float getRand()
{
curandState_t s;
curand_init(clock64(), 123456, 0, &s);
return curand_uniform(&s);
}
__global__ void myRand(float *values)
{
values[0] = getRand();
}
"""
mod = SourceModule(code)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
Errors are like:
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_poisson.h(548): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(69): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(78): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(86): error: this declaration may not have extern "C" linkage
30 errors detected in the compilation of "kernel.cu".

The basic problem is that, by default, PyCUDA silently applies C linkage to all code compiled in a SourceModule. As the error is showing, cuRand requires C++ linkage, so getRand can't have C linkage.
You can fix this either by changing these two lines:
mod = SourceModule(code)
myRand = mod.get_function("myRand")
to
mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("_Z6myRandPf")
This disables C linkage, but does mean you need to supply the C++ mangled name to the get_function call. You will need to look at the verbose compiler output or compile the code outside of PyCUDA to get that name (for example Godbolt).
Alternatively you can modify the code like this:
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray
code = """
#include <curand_kernel.h>
__device__ float getRand()
{
curandState_t s;
curand_init(clock64(), 123456, 0, &s);
return curand_uniform(&s);
}
extern "C" {
__global__ void myRand(float *values)
{
values[0] = getRand();
}
}
"""
mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
This leaves the kernel with C linkage, but doesn't touch the device function which is using cuRand.

you can import random in python . and use random.randint(). to generate random number in specified range by defining range in function. exrandom.randint(0,50)

Related

How to initialise a GPU array using memory address in pycuda?

I have a c++ code that gives image array output in GPU memory. I want to do further processing and image analytics using pycuda. I am trying to make GPU array as:
For testing purpose, I have created an array in c++ as:
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
int* dev_a = nullptr;
cudaMalloc((void**)&dev_a, arraySize * sizeof(int));
cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice);
printf("dev_a : %p\n", dev_a);
Suppose, here I get GPU memory address as '0x7f3454800000'. I am using this address to create GPUarray as:
from pycuda.gpuarray import GPUArray
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda.driver import PointerHolderBase
drv.init()
class Holder(PointerHolderBase):
def __init__(self):
super().__init__()
self.gpudata = '0x7f5954800000'
def get_pointer(self):
return self.gpudata
def __int__(self):
return self.__index__()
# without an __index__ method, arithmetic calls to the GPUArray backed by this pointer fail
# not sure why, this needs to return some integer, apparently
def __index__(self):
return self.gpudata
array = GPUArray((1,5), dtype=np.int32, gpudata=Holder())
print(array.get())
When I am running the code, I am getting error as:
File "array_test.py", line 43, in <module>
print(array.get())
File "/home/govindam/anaconda3/envs/tf_c2/lib/python3.6/site-packages/pycuda/gpuarray.py", line 305, in get
_memcpy_discontig(ary, self, async_=async_, stream=stream)
File "/home/govindam/anaconda3/envs/tf_c2/lib/python3.6/site-packages/pycuda/gpuarray.py", line 1309, in _memcpy_discontig
drv.memcpy_dtoh(dst, src.gpudata)
TypeError: No registered converter was able to produce a C++ rvalue of type unsigned long long from this Python object of type st
How to give the memory address while creating GPUarray so that I am avoid copying from GPU to CPU and again to GPU?

Call c++ template function from cython

Currently, I am learning how to call c++ template function from Cython. I have a .h file named 'cos_doubles.h' The file is as follows:
#ifndef _COS_DOUBLES_H
#define _COS_DOUBLES_H
#include <math.h>
template <typename T, int ACCURACY>
void cos_doubles(T * in_array, T * out_array, int size)
{
int i;
for(i=0;i<size;i++){
out_array[i] = in_array[i] * 2;
}
}
#endif
Indeed the variable ACCURACY does nothing. Now, I want to define a template function in cython that use this cos_doubles function, but only has the typename T as a template. In other words, I want to give the variable ACCURACY a value in my cython code. My .pyx code is some thing like the following
# import both numpy and the Cython declarations for numpy
import numpy as np
cimport numpy as np
cimport cython
# if you want to use the Numpy-C-API from Cython
# (not strictly necessary for this example)
np.import_array()
# cdefine the signature of our c function
cdef extern from "cos_doubles.h":
void cos_doubles[T](T* in_array, T* out_array, int size)
I know this code has errors, because I did not define the variable of ACCURACY in void cos_doubles[T](T* in_array, T* out_array, int size). But I do not know the gramma how to set it. For example, I want to let ACCURACY = 4. Can anyone tell me how to do this ?
One solution I already have is
cdef void cos_doubles1 "cos_doubles<double, 4>"(double * in_array, double * out_array, int size)
cdef void cos_doubles2 "cos_doubles<int, 4>"(int * in_array, int * out_array, int size)
but I do not define two different functions. Is there any better solution?

Numpy Ctypes: Segmentation Fault. Passing arrays via pointer

I am trying to create a python interface to a C function with the following structure: (Full code can be found here)
void get_pi_typed (int *type,
double *x,
double *y,
int *len,
int *typeA,
int *typeB,
double *r_low,
double *r,
int *len_r,
int *inds,
double *rc) {
\*DETAILS LEFT OUT
for (i=0;i<*len_r;i++) {
\*DETAILS LEFT OUT
rc[i] = (double)num_cnt/denom_cnt;
}
}
My Python code looks like this:
import numpy as np
import ctypes as ct
# must be a double array, with single dimension that is contiguous
array_1d_int = np.ctypeslib.ndpointer(dtype=np.int32, ndim=1, flags='CONTIGUOUS')
array_1d_double = np.ctypeslib.ndpointer(dtype=np.double, ndim=1, flags='CONTIGUOUS')
# Load the library as _libspfc.
_libspfc = np.ctypeslib.load_library('../src/libspatialfuncs', '.')
_libspfc.get_pi_typed.argtypes = [array_1d_int,\
array_1d_double,\
array_1d_double,\
ct.c_int,\
ct.c_int,\
ct.c_int,\
array_1d_double,\
array_1d_double,\
ct.c_int,\
ct.c_int,\
array_1d_double,\
]
_libspfc.get_pi_typed.restype = None
def getPiTyped(posmat,typeA=-1,typeB=-1,r=np.array([1.]),rLow=None):
"""
Python equivalent to get_pi_typed.
posmat: a matrix with columns type, x and y
typeA: the "from" type that we are interested in, -1 is wildcard
typeB: the "to" type that we are interested i, -1 is wildcard
r: the series of spatial distances wer are interested in
rLow: the low end of each range....0 by default
"""
if not isinstance(r, np.ndarray): #if it is not a 1D numpy array (for ex a scalar or a list), bring it into that shape
r=np.array(r)
r=r.reshape((-1))
if rLow is None:
rLow = np.zeros_like(r)
if not isinstance(rLow, np.ndarray): #if it is not a 1D numpy array (for ex a scalar or a list), bring it into that shape
rLow=np.array(rLow)
rLow=rLow.reshape((-1))
#prepare output array
rc = np.empty_like(r, dtype=np.double)
_libspfc.get_theta_typed(posmat[:,0],posmat[:,1],posmat[:,2],posmat.shape[0],typeA,typeB,rLow,r,r.shape[0],np.arange(1,r.shape[0]+1),rc)
return rc
However, when I try to run the code I get the following error, which seems to be related to the type conversion of the 1st parameter:
x =np.array([[1.,0.,0.],[1.,1.,0.],[2.,0.5,np.sqrt(.75)]])
sf.getPiTyped(x,1,2,1.5)
ArgumentError: argument 1: <type 'exceptions.TypeError'>: Don't know how to convert parameter 1
I tried many variations of argtypes, as well as to convert posmat[:,0] to int or int32 via .astype, however I always get the same error. What am I doing wrong?
EDIT:
According to the 1st comment below I added .ctypes.data to all array input arguments. The ArgumentError is now gone. However I get a Segmentation Fault, very difficult to investigate because python crashes
EDIT2:
I tried to make the array column-contiguous
posmat=np.ascontiguousarray(np.asfortranarray(posmat))
but I still get the seg fault

The error was highlighted by Warren above, the int arguments had to be passed by reference. Note also that the arrays have to be contiguous. Here is the final code:
import numpy as np
import ctypes as ct
# Load the library as _libspfc.
_libspfc = np.ctypeslib.load_library('../src/libspatialfuncs', '.')
def getPiTyped(posmat,typeA=-1,typeB=-1,r=np.array([1.]),rLow=None):
"""
Python equivalent to get_pi_typed.
posmat: a matrix with columns type, x and y
typeA: the "from" type that we are interested in, -1 is wildcard
typeB: the "to" type that we are interested i, -1 is wildcard
r: the series of spatial distances wer are interested in
rLow: the low end of each range....0 by default
"""
#prepare inputs
# argument 1 to 3: make a copy, so the matrix is C contiguous (already included in astype)
ty=posmat[:,0].astype(np.int32)
x=posmat[:,1].copy()
y=posmat[:,2].copy()
n = ct.c_int(posmat.shape[0])
typeA = ct.c_int(typeA)
typeB = ct.c_int(typeB)
if not isinstance(r, np.ndarray): #if it is not a 1D numpy array (for ex a scalar or a list), bring it into that shape
r=np.array(r)
r=r.reshape((-1))
if rLow is None:
rLow = np.zeros_like(r)
if not isinstance(rLow, np.ndarray): #if it is not a 1D numpy array (for ex a scalar or a list), bring it into that shape
rLow=np.array(rLow)
rLow=rLow.reshape((-1))
rLen=ct.c_int(r.shape[0])
ind=np.arange(1,r.shape[0]+1,dtype=np.int32)
#prepare output array
rc = np.empty_like(r, dtype=np.double)
_libspfc.get_pi_typed(ty,\
x,\
y,\
ct.byref(n),\
ct.byref(typeA),\
ct.byref(typeB),\
rLow,\
r,\
ct.byref(rLen),\
ind,\
rc)
return rc

pyCUDA reduction doesn't work

I am using reduction code basically exactly like the examples in the docs. The code below should return 2^3 + 2^3 = 16, but it instead returns 9. What did I do wrong?
import numpy
import pycuda.reduction as reduct
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule as module
newzeros = [{1,2,3},{4,5,6}]
gpuSum = reduct.ReductionKernel(numpy.uint64, neutral="0", reduce_expr="a+b", map_expr="1 << x[i]", arguments="int* x")
mylengths = pycuda.gpuarray.to_gpu(numpy.array(map(len,newzeros),dtype = "uint64",))
sumfalse = gpuSum(mylengths).get()
print sumfalse

I just figured it out. The argument list used when defining the kernel should be unsigned long *x, not int *x. I was using 64-bit integers everywhere else and it messed it up.

Unrolling a trivially parallelizable for loop in python with CUDA

I have a for loop in python that I want to unroll onto a GPU. I imagine there has to be a simple solution but I haven't found one yet.
Our function loops over elements in a numpy array and does some math storing the result in another numpy array. Each iteration adds some to this result array. A possible large simplification of our code might look something like this:
import numpy as np
a = np.arange(100)
out = np.array([0, 0])
for x in xrange(a.shape[0]):
out[0] += a[x]
out[1] += a[x]/2.0
How can I unroll a loop like this in Python to run on a GPU?

The place to start is http://documen.tician.de/pycuda/ the example there is
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
print dest-a*b
You place the part of the code you want to parallelize in C code segment and call it from python.
For you example the size of your data will need to be much bigger than 100 to make it worth while. You'll need some way to divide your data into block. If you wanted to add 1,000,000 numbers you could divide it into 1000 blocks. Add each block in the parallezed code. Then add the results in python.
Adding things is not really a natural task for this type of parallelisation. GPUs tend to do the same task for each pixel. You have a task which need to operate on multiple pixels.
It might be better to work with cuda first. A related thread is.
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Generating single random number in pyCuda kernel - python

you can import random in python . and use random.randint(). to generate random number in specified range by defining range in function. exrandom.randint(0,50)

Related

How to initialise a GPU array using memory address in pycuda?

Call c++ template function from cython

Numpy Ctypes: Segmentation Fault. Passing arrays via pointer

pyCUDA reduction doesn't work

Unrolling a trivially parallelizable for loop in python with CUDA

Categories

Resources