I'm quite new to CUDA/C++ programming and I'm stuck on passing input parameters to the CUDA kernel from the TensorFlow C++ API.
First off I register the following Op:
REGISTER_OP("Op")
.Attr("T: {float, int64}")
.Input("in: T")
.Input("angles: T")
.Output("out: T");
Afterwards I want to pass the second input (angles) through to the CPU/GPU kernel. The following implementation works fine on the CPU but throws an error in Python when I run it on my GPU...
Python Error message:
Process finished with exit code -1073741819 (0xC0000005)
This is how I'm trying to access the value of the input. Note that the input for "angles" is always a single value (float or int):
void Compute(OpKernelContext* context) override {
    ...
    const Tensor &input_angles = context->input(1);
    auto angles_flat = input_angles.flat<float>();
    const float N = angles_flat(0);
    ...
}
I call the CPU/GPU kernels as follows:
...
Functor<Device, T>()(
    context->eigen_device<Device>(),
    static_cast<int>(input_tensor.NumElements()),
    input_tensor.flat<T>().data(),
    output_tensor->flat<T>().data(),
    N);
...
As I said before, running this Op on the CPU works just how I want it to, but when I run it on the GPU I always get the Python error mentioned above... Does someone know how to fix this? I can only guess that with angles_flat(0) I'm dereferencing a device address on the host... So if anybody can help me out here, it would be highly appreciated!!
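From what I've read so far, one way around this might be to pin the scalar angles input to host memory when registering the GPU kernel, so that angles_flat(0) reads a host address inside Compute(). A minimal sketch of what I mean (MyOp is just a placeholder for the actual OpKernel class):

// Sketch: register the GPU kernel with the scalar "angles" input kept in
// host memory; all other inputs/outputs stay in device memory.
REGISTER_KERNEL_BUILDER(
    Name("Op")
        .Device(DEVICE_GPU)
        .TypeConstraint<float>("T")
        .HostMemory("angles"),   // "angles" is now a host-side tensor
    MyOp<GPUDevice, float>);     // MyOp is a placeholder name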
I'm trying to learn CUDA for Python using Numba in a Google Colab Jupyter notebook. To learn how to apply 3D thread allocation for nested loops, I wrote the following kernel:
from numba import cuda as cd

# Kernel to loop over 3D grid
@cd.jit
def grid_coordinate_GPU():
    i = cd.blockDim.x * cd.blockIdx.x + cd.threadIdx.x
    j = cd.blockDim.y * cd.blockIdx.y + cd.threadIdx.y
    k = cd.blockDim.z * cd.blockIdx.z + cd.threadIdx.z
    print(f"[{i},{j},{k}]")
# Grid Dimensions
Nx = 2
Ny = 2
Nz = 2
threadsperblock = (1,1,1)
blockspergrid = (Nx,Ny,Nz)
grid_coordinate_GPU[blockspergrid, threadsperblock]()
The problem I find, however, is that printing the coordinates with a format string does not work. The exact error I get is:
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'str'>) found for signature:
>>> str(int64)
There are 10 candidate implementations:
- Of which 8 did not match due to:
Overload of function 'str': File: <numerous>: Line N/A.
With argument(s): '(int64)':
No match.
- Of which 2 did not match due to:
Overload in function 'integer_str': File: numba/cpython/unicode.py: Line 2394.
With argument(s): '(int64)':
Rejected as the implementation raised a specific error:
NumbaRuntimeError: Failed in nopython mode pipeline (step: native lowering)
NRT required but not enabled
During: lowering "s = call $76load_global.17(kind, char_width, length, $const84.21, func=$76load_global.17, args=[Var(kind, unicode.py:2408), Var(char_width, unicode.py:2409), Var(length, unicode.py:2407), Var($const84.21, unicode.py:2410)], kws=(), vararg=None, varkwarg=None, target=None)" at /usr/local/lib/python3.7/dist-packages/numba/cpython/unicode.py (2410)
raised from /usr/local/lib/python3.7/dist-packages/numba/core/runtime/context.py:19
During: resolving callee type: Function(<class 'str'>)
During: typing of call at <ipython-input-12-4a28d7f41e76> (12)
To solve this I tried a couple of things.
Firstly, I tried to initialise the CUDA simulator by setting the environment variable NUMBA_ENABLE_CUDASIM = 1, following the Numba documentation. This, however, did not change much.
Secondly, I thought that the problem lay in the Jupyter notebook's inability to print the result in the notebook instead of the terminal. I tried to solve this by following this GitHub post, which instructed me to use wurlitzer. This, however, did not do much either.
Lastly, I added cd.synchronize() after the call to the kernel to try to mimic the C++ example I tried to implement in the first place. This sadly did not work either.
It would be amazing if someone could help me out!
The simple solution was to skip the formatted string and just use print(i, j, k) within the kernel instead: Numba's device-side print only supports plain comma-separated arguments, and the f-string needs string allocation on the device, hence the "NRT required but not enabled" error in the traceback.
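For comparison, since the goal was to mimic a C++ example: a CUDA C++ version of the same kernel would look roughly like this (an untested sketch; note that device-side printf there is likewise a plain format string, with no dynamic string building):

#include <cstdio>

// Each thread prints its global 3D coordinate, mirroring the Numba kernel.
__global__ void grid_coordinate_gpu() {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    int k = blockDim.z * blockIdx.z + threadIdx.z;
    printf("[%d,%d,%d]\n", i, j, k);
}

int main() {
    dim3 threadsPerBlock(1, 1, 1);
    dim3 blocksPerGrid(2, 2, 2);  // Nx, Ny, Nz
    grid_coordinate_gpu<<<blocksPerGrid, threadsPerBlock>>>();
    cudaDeviceSynchronize();  // wait for the kernel and flush device printf
    return 0;
}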
So I have a C program that I am running from Python, but I am getting a segmentation fault. When I run the C program alone, it runs fine. The C program interfaces with a fingerprint sensor using the libfprint library.
#include <poll.h>
#include <stdlib.h>
#include <sys/time.h>
#include <stdio.h>
#include <libfprint/fprint.h>
int main() {
    struct fp_dscv_dev **devices;
    struct fp_dev *device;
    struct fp_img **img;
    int r;

    r = fp_init();
    if (r < 0) {
        printf("Error");
        return 1;
    }
    devices = fp_discover_devs();
    if (devices) {
        device = fp_dev_open(*devices);
        fp_dscv_devs_free(devices);
    }
    if (device == NULL) {
        printf("NO Device\n");
        return 1;
    } else {
        printf("Yes\n");
    }
    int caps;
    caps = fp_dev_img_capture(device, 0, img);
    printf("bloody status %i \n", caps);

    // save the fingerprint image to file. ** this is the block that
    // causes the segmentation fault.
    int imrstx;
    imrstx = fp_img_save_to_file(*img, "enrolledx.pgm");
    fp_img_free(*img);
    fp_exit();
    return 0;
}
The Python code:
from ctypes import *
so_file = "/home/arkounts/Desktop/pythonsdk/capture.so"
my_functions = CDLL(so_file)
a=my_functions.main()
print(a)
print("Done")
The capture.so is built and accessed in Python, but when I call it from Python I get a segmentation fault. What could be my problem?
Thanks a lot!
Although I am unfamiliar with libfprint, after taking a look at your code and comparing it with the documentation, I see two issues that can both cause a segmentation fault:
First issue:
According to the documentation of the function fp_discover_devs, NULL is returned on error. On success, a NULL-terminated list is returned, which may be empty.
In the following code, you check for failure/success, but don't check for an empty list:
devices = fp_discover_devs();
if (devices) {
    device = fp_dev_open(*devices);
    fp_dscv_devs_free(devices);
}
If devices is non-NULL, but empty, then devices[0] (which is equivalent to *devices) is NULL. In that case, you pass this NULL pointer to fp_dev_open. This may cause a segmentation fault.
I don't think that this is the reason for your segmentation fault though, because this error in your code would only be triggered if an empty list were returned.
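For completeness, a guard against the empty list could look like this (a sketch to slot into main(); initialising device to NULL also avoids reading it uninitialised when discovery fails entirely):

struct fp_dscv_dev **devices = fp_discover_devs();
struct fp_dev *device = NULL;
if (devices) {
    if (devices[0]) {                      /* list is non-empty */
        device = fp_dev_open(devices[0]);  /* devices[0] == *devices */
    }
    fp_dscv_devs_free(devices);
}
if (device == NULL) {
    printf("NO Device\n");
    return 1;
}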
Second issue:
The last parameter of fp_dev_img_capture should be a pointer to an allocated variable of type struct fp_img *. This tells the function the address of the variable that it should write to. However, with the code
struct fp_img **img;
[...]
caps=fp_dev_img_capture(device,0,img);
you are passing that function a wild pointer, because img does not point to any valid object. This can cause a segmentation fault as soon as the wild pointer is dereferenced by the function or cause some other kind of undefined behavior, such as overwriting other variables in your program.
I suggest you write the following code instead:
struct fp_img *img;
[...]
caps=fp_dev_img_capture(device,0,&img);
Now the third parameter is pointing to a valid object (to the variable img).
Since img is now a single pointer and not a double pointer, you must pass img instead of *img to the functions fp_img_save_to_file and fp_img_free.
This second issue is probably the reason for your segmentation fault. It seems that you were just "lucky" that your program did not segfault as a standalone program.
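Putting the second fix together, the capture section would then read roughly as follows (a sketch, untested against libfprint):

struct fp_img *img = NULL;
int caps = fp_dev_img_capture(device, 0, &img);  /* pass the address of img */
printf("bloody status %i \n", caps);

/* save the fingerprint image to file */
int imrstx = fp_img_save_to_file(img, "enrolledx.pgm");  /* img, not *img */
fp_img_free(img);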
I am using the Python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When running my script on the CPU it does not produce any error messages, but on the GPU Python crashes due to:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code that causes this failure (when I comment it out, it works) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))
h_gathered = tf.transpose(h_gathered, [1, 0, 2])
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)
...
where time is zero-based and incremented after each step, and h_ta is a TensorArray:
h_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    clear_after_read=False,
    element_shape=[batch_size, num_hidden],
    tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray:
self.syntactic_weights_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, this time with the swap_memory parameter of the while loop set to False, and it works on the GPU as well, though I'd really like to know why it does not work with swap_memory=True.
This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's GitHub? Please also include:
A full stack trace (TF built with -c dbg preferable)
A minimal code example to reproduce
Describe whether the issue requires you to be calling backprop.
Whether this is reproducible in TF 1.2 / nightlies / master branch.
And respond here with the link to the GitHub issue?
I am learning XGBoost and want to put together a demonstration with the XGBoost Python API.
When I use the function xgboost.DMatrix with data set to a file and silent set to True, the function still always outputs a message like "[23:28:44] 1441x10 matrix with 11528 entries loaded from file_name". Did I set the parameters wrong? (reference)
This is interesting. The silent value is taken and passed to the wrapper, but it seems like the wrapper doesn't really use it!
The relevant code is here: https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L202
Which says:
int XGDMatrixCreateFromFile(const char *fname,
                            int silent,
                            DMatrixHandle *out) {
  API_BEGIN();
  if (rabit::IsDistributed()) {
    LOG(CONSOLE) << "XGBoost distributed mode detected, "
                 << "will split data among workers";
  }
  *out = new std::shared_ptr<DMatrix>(DMatrix::Load(fname, false, true));
  API_END();
}
i.e. even though silent is an argument, it is not used anywhere in the function... (very weird)
So, it seems right now, if you're using any of the wrappers (Python, R, julia, etc) the silent functionality for DMatrix won't work.
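If the second parameter of DMatrix::Load is indeed the silent flag (an assumption, suggested by the hard-coded false above), the fix would be a one-line change inside XGDMatrixCreateFromFile, sketched here:

*out = new std::shared_ptr<DMatrix>(
    DMatrix::Load(fname, silent != 0, true));  // forward the silent flag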
I would like to learn how to add a new op, so I am following the given tutorial. I made a folder named user_ops, created a zero_out.cc file, and copied in the code given in the tutorial. When I try to compile the op into a dynamic library with g++, errors appear:
zero_out.cc: In lambda function:
zero_out.cc:10:14: error: ‘Status’ has not been declared
return Status::OK();
^
zero_out.cc: At global scope:
zero_out.cc:11:6: error: invalid user-defined conversion from ‘<lambda(tensorflow::shape_inference::InferenceContext*)>’ to ‘tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)’ [-fpermissive]
 });
 ^
zero_out.cc:8:70: note: candidate is: <lambda(tensorflow::shape_inference::InferenceContext*)>::operator void (*)(tensorflow::shape_inference::InferenceContext*)() const
 .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
 ^
zero_out.cc:8:70: note: no known conversion from ‘void (*)(tensorflow::shape_inference::InferenceContext*)’ to ‘tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)’
In file included from zero_out.cc:1:0:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/op.h:252:30: note: initializing argument 1 of ‘tensorflow::register_op::OpDefBuilderWrapper& tensorflow::register_op::OpDefBuilderWrapper::SetShapeFn(tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*))’
 OpDefBuilderWrapper& SetShapeFn(tensorflow::Status (*fn)(tensorflow::shape_inference::InferenceContext*))
Why is that happening? How can I fix it?
Assuming your only problem is the undefined Status type -- and copying and pasting the tutorial code works just fine except for this -- you need to either move the using namespace tensorflow to before the first use of Status, or fully qualify it (as in return tensorflow::Status::OK()).
For example, the REGISTER_OP section could read as follows, if you did the templated version:
REGISTER_OP("ZeroOut")
.Attr("T: {float, int32}")
.Input("to_zero: T")
.Output("zeroed: T")
.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
c->set_output(0, c->input(0));
return tensorflow::Status::OK();
});
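And a sketch of the other option, moving the using-declaration above the registration (the two includes shown are the ones the tutorial's zero_out.cc should already have; adjust to your file):

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;  // placed before REGISTER_OP so Status resolves

REGISTER_OP("ZeroOut")
    .Attr("T: {float, int32}")
    .Input("to_zero: T")
    .Output("zeroed: T")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();  // resolves via the using-declaration
    });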
Seems to me that the TensorFlow tutorial doesn't have the right code.
So I followed the code of this tutorial and it is working perfectly!
I have no clue what it says!