Passing CuPy CUDA device pointer to pybind11

I am trying to instantiate an array in GPU memory using CuPy and then pass the pointer to this array to C++ using pybind11.
A minimal example of the problem I am running into is shown below.
Python
import demolib #compiled pybind11 library
import cupy as cp
x = cp.ones(100000)
y = cp.ones(100000)
demolib.pyadd(len(x),x.data.ptr,y.data.ptr)
C++/CUDA
#include <iostream>
#include <math.h>
#include <cuda_runtime.h>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;
// Error-checking helper
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
// Simple CUDA kernel (grid-stride loop)
__global__
void cuadd(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
// Simple wrapper function to be exposed to Python
int pyadd(int N, float *x, float *y)
{
    // Run kernel on N elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    cuadd<<<numBlocks, blockSize>>>(N,x,y);
    // Check for launch errors, then wait for the GPU to finish
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
PYBIND11_MODULE(demolib, m) {
    m.doc() = "pybind11 example plugin"; // optional module docstring
    m.def("pyadd", &pyadd, "A function which adds two numbers");
}
The code throws the following error:
GPUassert: an illegal memory access was encountered /home/tbm/cuda/add_pybind.cu 47
I realize that this specific example could be implemented using a CuPy user-defined kernel, but the end goal is to do zero-copy passes of CuPy arrays into a larger codebase, which would be prohibitive to rewrite in that paradigm.
I have also located this GitHub issue, which is the reverse of what I'm trying to do.

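The fix was to change the pointer arguments of pyadd to an integer type and cast those integers to float pointers, as shown below. pybind11 does not treat a Python int as a raw address for a float* parameter, so the kernel was not receiving the device pointers. As pointed out in the comments, this was figured out by referencing another question (unanswered at the time of posting).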
int pyadd(int N, long px, long py)
{
float *x = reinterpret_cast<float*> (px);
float *y = reinterpret_cast<float*> (py);
.
.
.
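The idea behind the fix (an address travels as a plain integer and is reinterpreted as a typed pointer on the receiving side) can be tried out on host memory without a GPU. The following is only a sketch using ctypes: the buffer stands in for a CuPy array, and ctypes.cast plays the role of reinterpret_cast<float*> in the wrapper. None of these names come from the original post.

```python
import ctypes

# Host-side buffer standing in for a CuPy array (assumption: float32 data).
buf = (ctypes.c_float * 4)(1.0, 1.0, 1.0, 1.0)

# Like x.data.ptr in CuPy, this is a plain Python int holding an address.
addr = ctypes.addressof(buf)

# The receiving side reinterprets the integer as a float pointer,
# mirroring reinterpret_cast<float*>(px) in the C++ wrapper.
ptr = ctypes.cast(addr, ctypes.POINTER(ctypes.c_float))
for i in range(4):
    ptr[i] += 1.0  # writes go straight into buf, no copy is made

result = list(buf)
print(result)
```

Note that cp.ones creates float64 arrays by default, so a kernel that takes float* would expect the arrays to be created with dtype=cp.float32.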

Related

I can't get output numbers with ctypes cuda

cuda1.cu
#include <iostream>
using namespace std ;
# define DELLEXPORT extern "C" __declspec(dllexport)
__global__ void kernel(long* answer = 0){
    *answer = threadIdx.x + (blockIdx.x * blockDim.x);
}
DELLEXPORT void resoult(long* h_answer){
    long* d_answer = 0;
    cudaMalloc(&d_answer, sizeof(long));
    kernel<<<10,1000>>>(d_answer);
    cudaMemcpy(&h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
    cudaFree(d_answer);
}
main.py
import ctypes
import numpy as np
add_lib = ctypes.CDLL(".\\a.dll")
resoult= add_lib.resoult
resoult.argtypes = [ctypes.POINTER(ctypes.c_long)]
x = ctypes.c_long()
print("R:",resoult(x))
print("RV: ",x.value)
print("RB: ",resoult(ctypes.byref(x)))
Output in Python: 0
Output in CUDA: 2096
The same approach works without any problems in plain C, but with CUDA I do not get the correct value back. How can I get the correct output value?
Thanks

cudaMemcpy is expecting pointers for dst and src.
In your function resoult, h_answer is a pointer to a long allocated by the caller.
Since it is already the pointer to where the data should be copied, you should use it as is and not take its address with &h_answer.
Therefore you need to change your cudaMemcpy from:
cudaMemcpy(&h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
To:
cudaMemcpy(h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
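The same rule (the destination argument must be the address of the caller's variable, not the address of the pointer variable) can be illustrated from the Python side with ctypes alone. This is a host-only sketch: ctypes.memmove stands in for cudaMemcpy, and src, dst and p_dst are hypothetical names, not part of the original code.

```python
import ctypes

src = ctypes.c_long(2096)    # stands in for the value d_answer points to
dst = ctypes.c_long(0)       # the caller's variable, like x in main.py

# p_dst plays the role of h_answer inside resoult: it already holds
# the address of dst.
p_dst = ctypes.pointer(dst)

# Correct: copy to where the pointer points, like cudaMemcpy(h_answer, ...).
# Passing the address of p_dst itself would overwrite the pointer instead.
ctypes.memmove(p_dst, ctypes.byref(src), ctypes.sizeof(ctypes.c_long))

print(dst.value)
```

On the calling side, resoult returns void, so setting resoult.restype = None avoids printing a meaningless return value; the result is read from x.value after the call.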

Trying to mimic the fsolve function from Python in a cpp script for simple functions

First of all, I'm new to C++. To be honest, I find it very hard to get used to, but for the last week I have been trying to translate a script from Python to C++ because of computation-time requirements.
One of the problems I have is root finding for simple 1D functions.
First, a simple MWE in Python.
Trying to solve ax^2+bx+c = 0:
from scipy import optimize
def f(x,a,b,c):
    return a*x**2+b*x+c
optimize.fsolve(f,-1.0e2,args=(1,1,-1))
which will return either -1.618 or 0.618, depending on the sign of the starting guess.
In C++, from what I have found online, this is a bit more complicated, so I used the GSL root-finding library.
For simple non-polynomial functions it works like a charm, but when second-order terms come in, the choice of endpoints seems to be a problem when you are just looking for a "quick" solution:
#include <stdio.h>
#include <math.h>
#include <functional>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <gsl/gsl_math.h>
#include <gsl/gsl_interp2d.h>
#include <gsl/gsl_spline2d.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_spline.h>
#include <gsl/gsl_integration.h>
#include <gsl/gsl_roots.h>
struct my_f_params { double a; double b; double c; };
double
my_f (double x, void * p)
{
    struct my_f_params * params = (struct my_f_params *)p;
    double a = (params->a);
    double b = (params->b);
    double c = (params->c);
    return (a * x + b) * x + c;
}
double root (struct my_f_params prms, double r)
{
    int status;
    int iter = 0, max_iter = 50;
    const gsl_root_fsolver_type *T;
    gsl_root_fsolver *s;
    double x_lo= -5e0, x_hi=11e0;
    gsl_function F;
    F.function = &my_f;
    F.params = &prms;
    T = gsl_root_fsolver_falsepos;
    s = gsl_root_fsolver_alloc (T);
    gsl_root_fsolver_set (s, &F, x_lo, x_hi);
    do
    {
        iter++;
        gsl_set_error_handler_off();
        status = gsl_root_fsolver_iterate (s);
        r = gsl_root_fsolver_root (s);
        x_lo = gsl_root_fsolver_x_lower (s);
        x_hi = gsl_root_fsolver_x_upper (s);
        status = gsl_root_test_interval (x_lo, x_hi,
                                         0, 0.001);
        printf("%f %f\n",x_lo,x_hi);
    }
    while (status == GSL_CONTINUE && iter < max_iter);
    return r;
}
int main(int argc, char const *argv[])
{
    struct my_f_params params = {1, 1, -1};
    printf("root of x2+x1-1=0 %f\n", root(params,1.25));
    return 0;
}
Now, if the starting x_lo and x_hi enclose both solutions, it does not proceed to find the closest one, and gives this error:
gsl: falsepos.c:74: ERROR: endpoints do not straddle y=0
Default GSL error handler invoked.
Aborted (core dumped)
I tried a lot of things from Google before posting here.
Really really thank you for your time, anything is appreciated!

Take a look at the similar question "error in GSL - root finding", which has a good explanation of how to make this work.
To help, you can use https://www.desmos.com/calculator/zuaqvcvpbz to view the function. In your implementation you can set initial values like this to make it run:
double x_lo = 0.0, x_hi = 1.0;
Adding the following code will help to find appropriate values for x_lo and x_hi:
gsl_set_error_handler_off(); // turn off the default (aborting) error handler
int check = gsl_root_fsolver_set(s, &F, x_lo, x_hi);
if (check == GSL_EINVAL) { // this is the error code you got
    do {
        x_lo += 0.1; // it would be appropriate to check the sign in both
        x_hi -= 0.1; // cases, to make sure interval is adjusted properly
        check = gsl_root_fsolver_set(s, &F, x_lo, x_hi);
    } while (check != 0 && x_lo < x_hi); // stop if the interval collapses
}
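A bracketing solver needs an interval whose endpoints give function values of opposite sign. One way to find such an interval, sketched here in plain Python rather than GSL, is to scan the original range in small steps until the sign changes and then refine within that sub-interval. The helper names find_bracket and bisect are hypothetical, and bisection is used only as a simple stand-in for the false-position solver in the question.

```python
def f(x, a=1.0, b=1.0, c=-1.0):
    # Same quadratic as in the question: a*x^2 + b*x + c
    return (a * x + b) * x + c

def find_bracket(func, lo, hi, steps=200):
    """Scan [lo, hi] in equal steps; return the first sub-interval
    whose endpoints have opposite signs, or None if there is none."""
    width = (hi - lo) / steps
    left, f_left = lo, func(lo)
    for k in range(1, steps + 1):
        right = lo + k * width
        f_right = func(right)
        if f_left == 0.0:
            return left, left          # landed exactly on a root
        if f_left * f_right < 0.0:
            return left, right         # sign change: a root lies inside
        left, f_left = right, f_right
    return None

def bisect(func, lo, hi, tol=1e-9):
    """Plain bisection on an interval that straddles a root."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

bracket = find_bracket(f, -5.0, 11.0)   # the failing interval from the question
root = bisect(f, *bracket)
print(bracket, root)
```

Scanning from the left finds the negative root first; scanning from the right, or starting the scan near the initial guess, would pick the other one.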

Call a Python function from C and consume a 2D Numpy array in C

I'm trying to figure out how to achieve this:
I have a Python script which, in the end, produces a NumPy array (an array of arrays of floats, to be more specific). Everything else is set up properly: I can pass parameters from C to Python, launch Python functions from C, and process returned scalar values in C.
What I'm currently not able to do is return such a NumPy array from a Python function to C.
Could somebody give me a pointer on how to achieve this?
TIA

What you need to look at is inter-process communication (IPC). There are several ways to do it.
You can use one of:
Files (Easy to use)
Shared memory (really fast)
Named pipes
Sockets (slow)
See Wikipedia's IPC page and find the best approach for your needs.

Here's a small working demo example (1D, not 2D array! it's not perfect, adjust to your needs).
# file ./pyscript.py
import numpy as np
# print inline
np.set_printoptions(linewidth=np.inf)
x = np.random.random(10)
print(x)
# [0.52523722 0.29053534 0.95293405 0.7966214 0.77120688 0.22154705 0.29398872 0.47186567 0.3364234 0.38107864]
// file ./demo.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
    int fd[2];
    pipe(fd); // create the pipe
    char buf[4096];
    pid_t pid=fork();
    if (pid==0) { // child process
        dup2(fd[1],1); // redirect stdout into the pipe
        close(fd[0]);
        close(fd[1]);
        char *pyscript = "./pyscript.py";
        char *args[] = {"python3", pyscript, (char*)NULL};
        execv("/usr/bin/python3",args);
        _exit(1); // only reached if execv fails
    }
    else { // parent process
        int status;
        close(fd[1]);
        // read() does not NUL-terminate, so leave room and terminate manually
        ssize_t bytes = read(fd[0], buf, sizeof(buf) - 1);
        if (bytes < 0) bytes = 0;
        buf[bytes] = '\0';
        printf("Python script output: %s\n", buf);
        float a[10];
        int count = 0;
        char *p = (buf[0] == '[') ? buf + 1 : buf; // skip the '[' coming from numpy array output
        while (count < 10) {
            char *end;
            float f = strtof(p, &end); // skips leading whitespace, however many spaces
            if (end == p) break; // no further number found
            a[count++] = f;
            p = end;
        }
        for (int i = 0; i < count; i++) {
            printf("%f\n", a[i]); // float values
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}
Sample output
# cc demo.c
# ./a.out
Python script output: [0.23286839 0.54437959 0.37798547 0.17190732 0.49473837 0.48112695 0.93113395 0.20877592 0.96032973 0.30025713]
0.23286839
0.54437959
0.232868
0.544380
0.377985
0.171907
0.494738
0.481127
0.931134
0.208776
0.960330
0.300257
The array a then holds the desired result as floats.
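Parsing the printed text is fragile for larger or 2D arrays. One alternative, sketched below under stated assumptions, is for the Python side to write a small binary layout to the pipe: two unsigned 32-bit integers (rows, columns) followed by the values as little-endian doubles in row-major order. The layout and the names pack_matrix and unpack_matrix are hypothetical choices for this illustration, not part of the original answer.

```python
import struct

def pack_matrix(rows):
    """Pack equal-length rows as <rows:uint32><cols:uint32><float64 values>."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    flat = [float(v) for row in rows for v in row]   # row-major order
    header = struct.pack("<II", n_rows, n_cols)
    body = struct.pack("<%dd" % len(flat), *flat)
    return header + body

def unpack_matrix(blob):
    """Inverse of pack_matrix; mirrors what the C reader would do."""
    n_rows, n_cols = struct.unpack_from("<II", blob, 0)
    flat = struct.unpack_from("<%dd" % (n_rows * n_cols), blob, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

payload = pack_matrix([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# In the real script the bytes would go to the pipe, for example:
#   sys.stdout.buffer.write(payload)
restored = unpack_matrix(payload)
print(len(payload), restored)
```

On the C side, the reader would take the two uint32_t values first and then read rows * cols doubles into a buffer. For a C-contiguous NumPy array, arr.astype('<f8').tobytes() produces the same row-major body.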

One has to use the PyList API for decoding list objects from Python to C
https://docs.python.org/3.3/c-api/list.html?highlight=m
Solved.

How to wrap C function with a pointer struct argument for Python?

Below is a simple C program square.c, which is a simplification of a complex one:
#include <stdio.h>
#include <stdlib.h>
struct Square {
    float length;
    float width;
};
typedef struct Square *sq;
float calc_area(sq a) {
    float s;
    s = a->length * a->width;
    return s;
}
int main() {
    sq a;
    a = malloc(sizeof(struct Square));
    a->length = 10.0;
    a->width = 3.0;
    printf("%f\n", calc_area(a));
    free(a);
}
I want to wrap the calc_area() function and call it from Python. I am not very familiar with C or Cython, but the difficulty seems to be that the argument of calc_area() is a pointer to a struct.
I tried to use Cython, and below is my effort:
First, I wrote a header file square.h:
#include <stdio.h>
#include <stdlib.h>
struct Square {
float length;
float width;
};
typedef struct Square *sq;
float calc_area(sq a);
Second, I wrote the square_wrapper.pyx as below
cdef extern from 'square.h':
    struct Square:
        float length
        float width
    ctypedef struct Square *sq
    float calc_area(sq a)
def c_area(a):
    return calc_area(sq a)
It seems that ctypedef struct Square *sq is not correct, but I have no idea how to modify it - this is the first time I try to wrap a C program and I cannot find similar examples to my case.
How could I fix this using Cython or any other tools? Thanks in advance!
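As one of the "other tools", ctypes can describe the struct and its pointer type directly in Python. The sketch below is not a Cython fix. It mirrors struct Square and the sq typedef, and uses a pure-Python stand-in for calc_area so that it runs without a compiled library. The library name libsquare.so in the comments is hypothetical.

```python
import ctypes

class Square(ctypes.Structure):
    # Mirrors struct Square from square.h: two C floats, in the same order.
    _fields_ = [("length", ctypes.c_float), ("width", ctypes.c_float)]

# Mirrors: typedef struct Square *sq;
sq = ctypes.POINTER(Square)

def calc_area_standin(a):
    """Pure-Python stand-in for the C calc_area(sq a); dereferences the pointer."""
    return a.contents.length * a.contents.width

square = Square(10.0, 3.0)
area = calc_area_standin(ctypes.pointer(square))
print(area)

# With a real shared library built from square.c (hypothetical name), the
# call would look like this instead:
#   lib = ctypes.CDLL("./libsquare.so")
#   lib.calc_area.argtypes = [sq]
#   lib.calc_area.restype = ctypes.c_float
#   area = lib.calc_area(ctypes.byref(square))
```

Declaring argtypes and restype matters here: without restype set to c_float, ctypes would interpret the returned value as an int.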

Python to C for loop conversion

I have the following python code:
r = range(1,10)
r_squared = []
for item in r:
    print(item)
    r_squared.append(item*item)
How would I convert this code to C? Is there something like a mutable array in C or how would I do the equivalent of the python append?

A simple array in C. Arrays in C are homogeneous:
int arr[10];
int i = 0;
for (i = 0; i < sizeof(arr) / sizeof(arr[0]); i++)
{
    arr[i] = i; // Initializing each element separately
}
Alternatively, try a vector implementation in C (go through this link):
// vector-usage.c
#include <stdio.h>
#include "vector.h"
int main() {
    // declare and initialize a new vector
    Vector vector;
    vector_init(&vector);
    // fill it up with 250 arbitrary values (200 down to -49)
    // this forces the vector to expand its capacity
    int i;
    for (i = 200; i > -50; i--) {
        vector_append(&vector, i);
    }
    // set a value at an arbitrary index
    // this will expand and zero-fill the vector to fit
    vector_set(&vector, 4452, 21312984);
    // print out an arbitrary value in the vector
    printf("Here's the value at 27: %d\n", vector_get(&vector, 27));
    // we're all done playing with our vector,
    // so free its underlying data array
    vector_free(&vector);
}

Arrays in C are mutable by default, in that you can write a[i] = 3, just like Python lists.
However, they're fixed-length, unlike Python lists.
For your problem, that should actually be fine. You know the final size you want; just create an array of that size, and assign to the members.
But of course there are problems for which you do need append.
Writing a simple library for appendable arrays (just like Python lists) is a pretty good learning project for C. You can also find plenty of ready-made implementations if that's what you want, but not in the standard library.
The key is to not use a stack array, but rather memory allocated on the heap with malloc. Keep track of the pointer to that memory, the capacity, and the used size. When the used size reaches the capacity, multiply it by some number (play with different numbers to get an idea of how they affect performance), then realloc. That's just about all there is to it. (And if you look at the CPython source for the list type, that's basically the same thing it's doing.)
Here's an example. You'll want to add some error handling (malloc and realloc can return NULL) and of course the rest of the API beyond append (especially a delete function, which will call free on the allocated memory), but this should be enough to show you the idea:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct {
    int *i;
    size_t len;
    size_t capacity;
} IntArray;
IntArray int_array_make() {
    IntArray a = {
        .i = malloc(10 * sizeof(int)),
        .len = 0,
        .capacity = 10
    };
    return a;
}
void int_array_append(IntArray *a, int value) {
    if (a->len == a->capacity) { // full: grow before writing
        size_t new_capacity = (size_t)(a->capacity * 1.6);
        a->i = realloc(a->i, new_capacity * sizeof(int));
        a->capacity = new_capacity;
    }
    a->i[a->len++] = value;
}
int main(int argc, char *argv[]) {
    IntArray a = int_array_make();
    for (int i = 0; i != 50; i++)
        int_array_append(&a, i);
    for (size_t i = 0; i != a.len; ++i)
        printf("%d ", a.i[i]);
    printf("\n");
    free(a.i); // release the heap buffer
    return 0;
}

C doesn't have a way to dynamically grow an array the way Python does; arrays here have a fixed length.
If you know the size of the array you will be using, you can declare it like this:
int arr[10];
Or, if you want to add memory on the fly (at run time), use malloc together with a structure (for example, a linked list).
