Why is this numba.cuda lookup table implementation failing?

I'm trying to implement a transform which, at some stage, uses a lookup table smaller than 1K. It seems to me that this shouldn't pose a problem for a modern graphics card.
But the code below fails with an unknown error:
from numba import cuda, vectorize
import numpy as np
tmp = np.random.uniform( 0, 100, 1000000 ).astype(np.int16)
tmp_device = cuda.to_device( tmp )
lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)
@cuda.jit(device=True)
def lookup(x):
    return lut[x]
@vectorize("float32(int16)", target="cuda")
def test_lookup(x):
    return lookup(x)
test_lookup(tmp_device).copy_to_host() # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
@cuda.jit(device=True)
def lookup(x):
    return x + lut[1]
Once this error occurs, I am essentially no longer able to utilize the cuda context at all. For instance, allocating a new array via cuda.to_device results in a:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0

The above code is fixed by modifying lookup so that it reads the table through cuda.const.array_like:
@cuda.jit(device=True)
def lookup(x):
    lut_device = cuda.const.array_like(lut)
    return lut_device[x]
I ran multiple variations of the code, including simply touching the lookup table from within this kernel without using its output. Combining this with @talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, I thought that perhaps a shared memory constraint was causing the issue.
The above code makes the whole thing work. However, I still don't understand why in a profound way.
If anyone knows and understands why, please feel free to contribute to this answer.

Related

TensorFlow nullptr check failed on GPU

I am using the python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When running my script on the cpu, it does not produce any error messages, but on the gpu python crashes due to:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code that causes this failure (it works when I comment it out) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))
h_gathered = tf.transpose(h_gathered, [1, 0, 2])
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)
...
where time is zero based and incremented after each step, h_ta is a TensorArray
h_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    clear_after_read=False,
    element_shape=[batch_size, num_hidden],
    tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray
self.syntactic_weights_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, but this time with swap_memory parameter in the while loop set to False and it works on GPU as well, though I'd really like to know why it does not work with swap_memory=True.

This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's github? Please also include:
A full stack trace (TF built with -c dbg preferable)
A minimal code example to reproduce
Describe whether the issue requires you to be calling backprop.
Whether this is reproducible in TF 1.2 / nightlies / master branch.
And respond here with the link to the github issue?

Optimising python function with numba

I am trying to speed up a python function using numba, however I cannot seem to make it compile.
The input for my function is a 27x4 array of type np.int32.
My function is:
@nb.jit(nopython=True)
def edge_profile(input):
    pos = input[:,:3]
    val = input[:,3]
    centre = np.mean(pos,axis=0).astype(np.int32)
    diff = np.absolute(pos-centre).sum(axis=1)
    cell_edge = np.zeros(3)
    for i in range(3):
        idx = np.where(diff==i+1)[0]
        idy = np.where(val[idx]==1)[0]
        cell_edge[i] = len(idy)
    return cell_edge.astype(np.int32)
However, this produces an extremely large error message which I have been unable to use to diagnose the problem. I have tried specifying the input types as follows:
@nb.jit(nb.int32[:](nb.int32[:,:]))
def ...
However, this produces an equally large error message.
I feel that I am probably using some function or feature that is not supported in numba, but I do not know enough about it to identify the problem. Any help would be greatly appreciated.

Numba should work OK so long as you stick to basic lists and arrays in the function you want to speed up. It appears that you are already using functions from numpy that are probably already well optimized, so it's unlikely you will see a speed-up even if you do get it to work. You haven't mentioned what your OS is. Under Ubuntu 14.04 you can get it to work through some steps outlined here.
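To illustrate the "basic lists and arrays" advice, here is a minimal sketch of the same computation written as explicit loops with scalar indexing. It is an illustration under assumptions, not a tested numba port: the name edge_profile_loops is hypothetical, the input is taken as plain (x, y, z, val) tuples so the snippet runs without NumPy or numba, and the centre is truncated toward zero to mirror astype(np.int32).

```python
def edge_profile_loops(rows):
    """Count cells with val == 1 at Manhattan distance 1, 2 and 3 from the centre.

    rows: sequence of (x, y, z, val) integer tuples (hypothetical input layout).
    """
    n = len(rows)
    # Integer centre of the positions, truncated like astype(np.int32).
    centre = [int(sum(r[c] for r in rows) / n) for c in range(3)]
    cell_edge = [0, 0, 0]
    for r in rows:
        dist = abs(r[0] - centre[0]) + abs(r[1] - centre[1]) + abs(r[2] - centre[2])
        # Only distances 1..3 are counted, matching diff == i + 1 for i in 0..2.
        if 1 <= dist <= 3 and r[3] == 1:
            cell_edge[dist - 1] += 1
    return cell_edge


# A full 3x3x3 block with every val set to 1: 6 faces, 12 edges, 8 corners.
grid = [(x, y, z, 1) for x in range(3) for y in range(3) for z in range(3)]
print(edge_profile_loops(grid))  # [6, 12, 8]
```

The same loop structure carries over to a 27x4 array indexed as input[r, c]; whether that version is actually faster than the vectorised one would have to be measured.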

Theano matrix multiplication

I have a piece of code that is supposed to calculate a simple matrix product in Python (using Theano). The matrix that I intend to multiply with is a shared variable.
The example is the smallest example that demonstrates my problem.
I have made use of two helper functions. floatX converts its input to something of type theano.config.floatX.
init_weights generates a random matrix (of type floatX) of given dimensions.
The last line causes the code to crash. In fact, it forces so much output onto the command line that I can't even scroll to the top of it anymore.
So, can anyone tell me what I'm doing wrong?
import numpy
import theano
import theano.tensor as T
def floatX(x):
    return numpy.asarray(x,dtype=theano.config.floatX)
def init_weights(shape):
    return floatX(numpy.random.randn(*shape))
a = init_weights([3,3])
b = theano.shared(value=a,name="b")
x = T.matrix()
y = T.dot(x,b)
f = theano.function([x],y)

This works for me, so my guess is that you have a problem with your BLAS installation. Make sure to use the Theano development version:
http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions
It has better defaults for some configuration options. If that does not fix the problem, look at the error message. The main part comes after the code dump and after the stack trace; that is normally the most useful part.
You can disable direct linking by Theano to BLAS with this Theano flag: blas.ldflags=
This can cause a slowdown, but it is a quick check to confirm that the problem is BLAS.
If you want more help, dump the error message to a text file, put it on the web, and link to it from here.
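One way to apply the blas.ldflags= flag mentioned above is through the THEANO_FLAGS environment variable, which Theano reads when it is first imported. The following is a minimal sketch, assuming the flag is set from inside the script rather than from the shell; it keeps any other flags that are already set and replaces only an existing blas.ldflags entry.

```python
import os

# Must run before the first "import theano": the flags are read at import time.
existing = os.environ.get("THEANO_FLAGS", "")
flags = [f for f in existing.split(",") if f and not f.startswith("blas.ldflags=")]
flags.append("blas.ldflags=")  # empty value: do not link directly against BLAS
os.environ["THEANO_FLAGS"] = ",".join(flags)

# import theano  # import only after the environment variable is set
```

If the crash disappears with this setting, that points at the BLAS installation; the setting can then be removed again once BLAS is fixed, since it may slow things down.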

NotImplementedError at Decorator of NumbaPro (Python)

I am new to NumbaPro in Python. I have the following code, which I want to parallelize over an x,y grid in CUDA (Anaconda Accelerate). However, every time I run it, it gives a NotImplementedError at the decorator line. I am not sure what is wrong; can someone please help me? Many thanks:
@cuda.jit(argtypes=(float64[:,:], float64[:,:,:], float64, float64, float64), device=True)
def computeflow(PMapping2, Array_hist2, Num_f1, p_depth1, image_width1):
    x, y = cuda.grid(2);
    for y in xrange(0,p_depth1):
        for x in xrange(0,image_width1):
            Array_H, bin_edges = numpy.histogram(Array_hist2[y,x,:], bins=Num_f1, range=None, normed=False, weights=None, density=None);
            Array_H = (numpy.imag(numpy.fft.ifft(Array_H,n=1024)));
            Array_H1 = Array_H[0:len(Array_H)/2];
            Array_H1[20:1024] = 0;
            PMapping2[y,x] = numpy.sum(Array_H1);
Mapping1=cuda.to_device(PMapping);
Array_hist1=cuda.to_device(Array_hist);
computeflow[(3,3),(3,3)](PMapping, Array_hist, Num_f, p_depth, image_width);
PMapping1.to_host();

NotImplementedError: offset=203 opcode=2b opname=STORE_SLICE+3
This means that the slice operation a[i:j] = b is not implemented yet.
Looking at the function you're trying to use CUDA on, it looks like you do not fully understand how CUDA works. I suggest you look up some general guides on, for example, CUDA/PyCUDA or OpenCL/PyOpenCL to get a quick grasp of how functions need to be designed for parallelizing on GPUs. It is too big a topic to go through here. The docs for these kinds of things on Continuum's pages are sadly pretty bad, probably because there is still a lot of development going on.
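The slice store in question is Array_H1[20:1024] = 0. A minimal sketch of the usual workaround, assuming an element-by-element loop is acceptable: zero_tail is a hypothetical helper, shown on a plain list so that it runs without NumbaPro. It only addresses the STORE_SLICE error; the numpy.histogram and numpy.fft calls in the kernel would likely still be unsupported on the device.

```python
def zero_tail(values, start):
    """Set values[start:] to 0 in place using scalar stores only."""
    # range() stops at len(values), so a start beyond the end is a no-op,
    # just as an out-of-range slice would be.
    for i in range(start, len(values)):
        values[i] = 0
    return values


spectrum = [1.0] * 32  # made-up data standing in for Array_H1
zero_tail(spectrum, 20)
```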

Invalid pointer when using global variables/lambda functions in python?

I've come across a bit of a strange bug, and I'm really not sure what's causing it.
I have a list containing lambda functions, and I have set this list to be a global variable as shown below. The global variables are then referenced within the "residual" function with the required function inputs.
from numpy import array
from math import exp
from guppy import hpy
hp = hpy()
hp.setrelheap()
rate_array = [[0, "4.91*(10**-22)*(T_m**4)"],
              [1, "1.4*(10**-18)*(T_m**0.928)*exp(-T_m/16200)"]]
global k
k = [0,0]
for i in range(0,2):
    k[i] = (eval('lambda T_m,T_r,Z_initial: ' + rate_array[i][1]))
def residual(t,y,yd):
    Z_initial = 10
    res_0 = -k[1](y[2],y[3], Z_initial)*y[0]
    res_1 = -k[0](y[2],y[3],Z_initial)*y[1]
    return array([res_0,res_1 ])
y = [0,0,0,0]
yd= [0,0,0,0]
t=0
residual(t,y,yd)
print("\nMemory statistics are as follows:\n")
print hp.heap()
Running the residual function seems to give an invalid pointer error or a segmentation fault. Every now and then the code runs, but most of the time it does not. I don't see anything wrong with the code, so I'm not sure what's going on. Is there anything glaringly obvious?
EXTRA: I realize this is a strange way to go about it, but to explain: the "rate_array" contains strings because it is pickled/unpickled depending on whether or not the user wants to edit it.
Also, the residual function is integrated by third-party software that only allows the 3 inputs shown, so I can't just pass the array into the function as an input as normal. I also cannot extend the inputs to include the rate_array, as it won't accept lambda functions as a type.
Normally the lambda functions/expressions could just be in the residual function itself, as extra lines of code, but then it's not possible for the user to edit them before the code runs. It's such a mess!
EDIT: Apologies, I heavily cut down the code to try to make it simpler to explain, but in the process typed stuff that was plain wrong; it is now corrected.
The error code:
*** glibc detected *** python: free(): invalid pointer: 0x0000000002a7aa50 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7fd857ddbb96]
/usr/local/lib/python2.7/dist-packages/guppy/sets/setsc.so(+0xbeeb)[0x7fd841bd1eeb]
/usr/local/lib/python2.7/dist-packages/guppy/sets/setsc.so(+0xbf73)[0x7fd841bd1f73]
python[0x48abf9]
python[0x48a9de]
python(PyEval_EvalFrameEx+0x52c)[0x45fb2c]
python(PyEval_EvalFrameEx+0xcb7)[0x4602b7]
python(PyEval_EvalCodeEx+0x199)[0x467209]
python(PyEval_EvalCode+0x32)[0x4d0242]
python[0x5102bb]
python(PyRun_FileExFlags+0x9a)[0x44a466]
python(PyRun_SimpleFileExFlags+0x2bc)[0x44a97a]
python(Py_Main+0xb36)[0x44b6bc]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fd857d7e76d]
python[0x4ce0ad]
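
The only third-party frames in the backtrace above are in guppy/sets/setsc.so, so one way to narrow the problem down is to run the lambda-building part on its own, without guppy. A minimal sketch, assuming Python 3 and the standard library only: build_rates is a hypothetical helper, the result is returned as a plain list instead of a NumPy array, and the sample y values are made up for illustration.

```python
from math import exp

rate_array = [[0, "4.91*(10**-22)*(T_m**4)"],
              [1, "1.4*(10**-18)*(T_m**0.928)*exp(-T_m/16200)"]]


def build_rates(entries):
    """Compile each rate expression string into a function of (T_m, T_r, Z_initial)."""
    namespace = {"exp": exp}  # names the expression strings are allowed to use
    return [eval("lambda T_m, T_r, Z_initial: " + expr, namespace)
            for _, expr in entries]


k = build_rates(rate_array)  # module-level list, no "global" statement needed


def residual(t, y, yd):
    Z_initial = 10
    res_0 = -k[1](y[2], y[3], Z_initial) * y[0]
    res_1 = -k[0](y[2], y[3], Z_initial) * y[1]
    return [res_0, res_1]


print(residual(0, [1.0, 1.0, 300.0, 0.0], [0, 0, 0, 0]))
```

If this runs cleanly while the full script still crashes, that would suggest looking at guppy rather than at the globals or the lambdas.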
