TensorFlow nullptr check failed on GPU - python

I am using the python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When running my script on the CPU it does not produce any error messages, but on the GPU Python crashes due to:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code that causes this failure (when I comment it out, it works) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))          # all outputs up to the current step
h_gathered = tf.transpose(h_gathered, [1, 0, 2])  # -> [batch_size, time, num_hidden]
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)                  # -> [batch_size, 1, time]
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)  # weighted sum of past outputs
...
where time is zero-based and incremented after each step, and h_ta is a TensorArray
h_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    clear_after_read=False,
    element_shape=[batch_size, num_hidden],
    tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray
self.syntactic_weights_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, this time with the while loop's swap_memory parameter set to False, and it works on the GPU as well, though I'd really like to know why it does not work with swap_memory=True.
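For reference, a minimal sketch of the loop call with that workaround applied (TF 1.x style; the cond/body functions and loop variables below are placeholders, not my actual code):
import tensorflow as tf

max_seq_len = 10
h_ta = tf.TensorArray(dtype=tf.float32, size=max_seq_len)

def cond(time, h_ta):
    return time < max_seq_len

def body(time, h_ta):
    h_ta = h_ta.write(time, tf.fill([4, 8], tf.cast(time, tf.float32)))
    return time + 1, h_ta

# swap_memory=False keeps loop tensors on the GPU instead of swapping them to
# host memory, which appears to be the code path that triggers the check failure.
_, h_ta = tf.while_loop(cond, body, [tf.constant(0), h_ta], swap_memory=False)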

This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's GitHub? Please also include:
A full stack trace (TF built with -c dbg preferable)
A minimal code example to reproduce
Whether the issue requires you to be calling backprop
Whether this is reproducible in TF 1.2 / nightlies / master branch
And respond here with a link to the GitHub issue?

Related

pytorch recover from RuntimeError: CUDA error: device-side assert triggered without restarting script

I suppose everybody who has worked with PyTorch knows the error
RuntimeError: CUDA error: device-side assert triggered
to some extent.
I'm generating a lot of data with GPU code in my script (200k+ long vectors), so it takes a while.
I'm doing this in batches via a generator as I do not have the memory to store all vectors in my GPU at once. The generator has the following structure:
for i in range(0, len(inputs), batch_size):
    try:
        <generate the vectors>
        yield 1, <the vectors>  # Here it was successful
    except RuntimeError:
        print(f'could not generate vectors {i} to {i + batch_size}')
        yield 0, (i, i + batch_size)  # Here the input was malformed
I know that some of the input is malformed to the point that generating vectors from it will fail with a runtime error, and that's fine; it's not even 1% of my dataset. I want to record the indices and deal with them later.
Here's my problem
Once vector creation fails, the GPU is basically bricked and will respond to all requests with the aforementioned error. Validating all input beforehand would be cumbersome and slow. I don't want to do that. I want to skip over all malformed inputs and deal with them later.
My question is
How can I recover the GPU from this bricked state as easily and quickly as possible? All questions that I have found so far ask about fixing the underlying error, which I do not need to do. I just want to get on with generating vectors from my dataset.
One way to do it might be to log your progress through the input batches and restart the process/machine if the GPU gets bricked. If the percentage of malformed inputs is small enough, the cost of resetting the GPU/machine might be negligible. You can have a periodic job which checks whether the run has finished and restarts it if it has not. This is a crude way to approach the problem, but it should work.
For example:
for i in range(0, len(inputs), batch_size):
    try:
        exist = check_if_current_index_has_succeeded_or_failed()
        if exist:
            continue
        else:
            log_current_index()
        <generate the vectors>
        log_success()
        yield 1, <the vectors>  # Here it was successful
    except RuntimeError:
        log_failure()
        print(f'could not generate vectors {i} to {i + batch_size}')
        yield 0, (i, i + batch_size)  # Here the input was malformed
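If you go the restart route, here is a minimal driver sketch along those lines (the worker script name, and the assumption that it logs its progress to a file and exits non-zero once the GPU is stuck, are placeholders rather than your actual setup):
import subprocess
import sys

# Hypothetical outer loop: generate_vectors.py is assumed to skip batches it has
# already logged as done/failed and to exit non-zero once the CUDA context is
# stuck with the device-side assert. Relaunching it gives the worker a fresh context.
while True:
    result = subprocess.run([sys.executable, "generate_vectors.py"])
    if result.returncode == 0:
        break  # the whole dataset has been processed
    print("worker exited abnormally (likely a stuck CUDA context); restarting...")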

Get the number of GPUs used in Tensorflow Distributed in a multi node approach

I am currently trying to compare Horovod and Tensorflow Distributed API.
When using Horovod, I am able to access the total number of GPUs currently in use as follows:
import horovod.tensorflow as hvd
size = hvd.size()
A similar concept is available when using PyTorch Distributed API:
size = int(os.environ["WORLD_SIZE"])
I would like to perform the same operation and obtain the number of GPUs currently in use across multiple GPUs/nodes with the official TF Distributed API.
I can't use the CUDA_VISIBLE_DEVICES environment variable, as it would only work on a single node.
A few findings which answer my question:
Equivalent of hvd.size() (unlike hvd, the distribution strategy must be set up and running first, otherwise you will just get "1"):
==> tf.distribute.get_strategy().num_replicas_in_sync
Equivalent of hvd.rank() (same remark as above, otherwise you will just get "0"):
def get_rank():
    replica_id = tf.distribute.get_replica_context().replica_id_in_sync_group
    if isinstance(replica_id, tf.Tensor):
        return tf.get_static_value(replica_id)
    else:
        return 0
Is TF Distributed running? tf.distribute.has_strategy() => True/False (same remark as above, otherwise you just get False)
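As a concrete illustration, a minimal sketch assuming a recent TF 2.x with MultiWorkerMirroredStrategy (on multiple nodes each worker also needs TF_CONFIG set; older versions expose the strategy under tf.distribute.experimental):
import tensorflow as tf

# The strategy has to exist before you query it; otherwise the default
# (no-op) strategy is returned and num_replicas_in_sync is just 1.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    print("replicas in sync:", strategy.num_replicas_in_sync)
    # Same value via the global accessor used above:
    print("replicas in sync:", tf.distribute.get_strategy().num_replicas_in_sync)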

Debugging Tensorflow hang on global variables initialisation

I'm after advice on how to debug what TensorFlow is struggling with when it hangs.
I have a multi-layer CNN which hangs when global_variables_initializer() is run in the session. I am getting no errors or messages on the console output.
Is there an intelligent way of debugging what TensorFlow is struggling with when it hangs, instead of repeatedly commenting out lines of code that build the graph and re-running to see where it hangs? Would the TensorFlow debugger (tfdbg) help? What options do I have?
Ideally it would be great to just break the current execution and look at a stack trace or similar to see where execution is hanging during the init.
I'm currently running TensorFlow 0.12.1 with Python 3 inside a Jupyter notebook.
I managed to solve the problem. The tip from #amo-ej1 to run in a regular file was a step in the correct direction. This uncovered that the TensorFlow process was being killed with a SIGKILL and returning an error code of 137.
I tried the TensorFlow Debugger (tfdbg), though this did not provide any further details, as the problem was that the graph did not initialize. I started to think the graph structure was incorrect, so I dumped out the graph structure using:
tf.summary.FileWriter('./logs/traing_graph', graph)
I then used TensorBoard to inspect the resultant graph structure data dumped to that directory, and found that the tensor dimensions of the fully connected layer were wrong, having a width of 15 million!
It turned out that one of the configurable parameters of the graph was incorrect: it was picking up the dimension of the layer 2 tensor shape from an incorrectly addressed tf.shape property of the previous layer, which exploded the dimensions of the graph.
There were no OOM error messages in /var/log/system.log, so I am unsure why the graph initialisation caused the Python TensorFlow script process to die.
I fixed the dimensions of the graph and graph initialization worked just fine!
My top tip is to visualise your graph with TensorBoard before initialisation and training, to quickly check that the resultant graph structure you coded is what you expected it to be. You will probably save yourself a lot of time! :-)
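For anyone who wants to do the same check, here is a minimal sketch of the dump step (TF 1.x graph mode as in the question; the tiny graph below is just an example, not my model):
import tensorflow as tf

# Build (part of) your graph as usual.
x = tf.placeholder(tf.float32, [None, 784], name="x")
w = tf.Variable(tf.zeros([784, 10]), name="w")
logits = tf.matmul(x, w, name="logits")

# Dump the graph definition *before* running the initializer, then inspect
# tensor shapes with: tensorboard --logdir ./logs/train_graph
writer = tf.summary.FileWriter('./logs/train_graph', tf.get_default_graph())
writer.close()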
A common methodology to debug TensorFlow is to replace the placeholders and/or variables with numpy arrays wrapped in tf.constant. When you do so you can actually examine the logic of your code by setting breakpoints and seeing the numbers in a "pythonic" way, not just as tensors. It would be much easier to help you if you posted your code here, but here is a dummy example:
with tf.name_scope('scope_name'):
    ### This block is for debug only
    import numpy as np
    batch_size = 20
    sess = tf.Session()
    sess.run(tf.tables_initializer())
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    ### End of first debug block

    ## Replacing placeholders for debug - uncomment the placeholders and comment out the numpy arrays for production mode
    const_a = tf.constant((np.random.rand(batch_size, 26) > 0.85).astype(int), dtype=tf.float32)
    const_b = tf.constant(np.random.randint(0, 20, batch_size * 26).reshape((batch_size, 26)), dtype=tf.float32)
    # real_a_placeholder = tf.log(input_placeholder_dict[A_DATA])
    # real_b_placeholder = tf.log(input_placeholder_dict[B_DATA])

    # dummy operation
    c = const_a - const_b

    # selecting top k - in the sanity check you can see here that you actually get the top items and top values
    top_k = 5
    top_k_values, top_k_indices = tf.nn.top_k(c, k=top_k, sorted=True, name="top_k")

    ## Replacing variables for debug - uncomment the variables and comment out the numpy arrays for production mode
Now, run your code with breakpoints and you have 2 options to see the values in the debugger:
1. sess.run(placeholder_name)
2. using eval - variable_name.eval(session=sess)

What is the alternative of tf.Variable.ref() in Tensorflow version 0.12?

I'm trying to run the open-source code of the A3C reinforcement learning algorithm (A3C code) in order to learn A3C.
However, I got several errors and could fix all of them except one.
In the code, ref(), which is a member function of tf.Variable, is used (1, 2), but in the recent TensorFlow version 0.12rc that function seems to be deprecated.
So I don't know what the best way to replace it is (I don't understand exactly why the author used ref()). When I just changed it to the variable itself (for example, v.ref() to v), there was no error, but the reward did not change. It seems it cannot learn, and I guess it is because the variables are not properly updated.
Please advise me on the proper way to modify the code so that it works.
The new method tf.Variable.read_value() is the replacement for tf.Variable.ref() in TensorFlow 0.12 and later.
The use case for this method is slightly tricky to explain, and is motivated by some caching behavior that causes multiple uses of a remote variable on a different device to use a cached value. Let's say you have the following code:
with tf.device("/cpu:0")
v = tf.Variable([[1.]])
with tf.device("/gpu:0")
# The value of `v` will be captured at this point and cached until `m2`
# is computed.
m1 = tf.matmul(v, ...)
with tf.control_dependencies([m1])
# The assign happens (on the GPU) after `m1`, but before `m2` is computed.
assign_op = v.assign([[2.]])
with tf.control_dependencies([assign_op]):
with tf.device("/gpu:0"):
# The initially read value of `v` (i.e. [[1.]]) will be used here,
# even though `m2` is computed after the assign.
m2 = tf.matmul(v, ...)
sess.run(m2)
You can use tf.Variable.read_value() to force TensorFlow to read the variable again later, and it will be subject to whatever control dependencies are in place. So if you wanted to see the result of the assign when computing m2, you'd modify the last block of the program as follows:
with tf.control_dependencies([assign_op]):
    with tf.device("/gpu:0"):
        # The `read_value()` call will cause TensorFlow to transfer the
        # new value of `v` from the CPU to the GPU before computing `m2`.
        m2 = tf.matmul(v.read_value(), ...)
(Note that, currently, if all of the ops were on the same device, you wouldn't need to use read_value(), because TensorFlow doesn't make a copy of the variable when it is used as the input to an op on the same device. This can cause a lot of confusion—for example when you enqueue a variable to a queue!—and it's one of the reasons that we're working on enhancing the memory model for variables.)
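Applied to an A3C-style setup the change is mechanical. Here is a hypothetical sketch in which local_vars/global_vars stand in for the per-thread and shared network weights (which are not shown in the question):
import tensorflow as tf

local_vars = [tf.Variable(tf.zeros([4]), name="local_w")]
global_vars = [tf.Variable(tf.ones([4]), name="global_w")]

# Wherever the original code used global_v.ref(), read_value() forces a fresh
# read of the variable so the sync op sees the latest global weights.
sync_op = tf.group(*[
    local_v.assign(global_v.read_value())  # was: global_v.ref()
    for local_v, global_v in zip(local_vars, global_vars)
])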

Theano matrix multiplication

I have a piece of code that is supposed to calculate a simple matrix product in Python (using Theano). The matrix that I intend to multiply with is a shared variable.
The example below is the smallest one that demonstrates my problem.
I have made use of two helper functions: floatX converts its input to something of type theano.config.floatX, and init_weights generates a random matrix (of type floatX) of the given dimensions.
The last line causes the code to crash. In fact, it produces so much output on the command line that I can't even scroll to the top of it anymore.
So, can anyone tell me what I'm doing wrong?
import numpy
import theano
import theano.tensor as T

def floatX(x):
    return numpy.asarray(x, dtype=theano.config.floatX)

def init_weights(shape):
    return floatX(numpy.random.randn(*shape))

a = init_weights([3, 3])
b = theano.shared(value=a, name="b")
x = T.matrix()
y = T.dot(x, b)
f = theano.function([x], y)
This works for me, so my guess is that you have a problem with your BLAS installation. Make sure to use the Theano development version:
http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions
It has better defaults for some configurations. If that does not fix the problem, look at the error message: the main part is after the code dump, after the stack trace. That is normally the most useful part.
You can disable Theano's direct linking to BLAS with this Theano flag: blas.ldflags=
This can cause a slowdown, but it is a quick check to confirm that the problem is BLAS.
If you want more help, dump the error message to a text file, put it on the web, and link to it from here.
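Here is a minimal sketch of that quick check, setting the flag from inside the script via THEANO_FLAGS (this must happen before Theano is imported):
import os
os.environ["THEANO_FLAGS"] = "blas.ldflags="  # same as THEANO_FLAGS=blas.ldflags= python script.py

import numpy
import theano
import theano.tensor as T

a = numpy.random.randn(3, 3).astype(theano.config.floatX)
b = theano.shared(value=a, name="b")
x = T.matrix()
f = theano.function([x], T.dot(x, b))
print(f(numpy.eye(3, dtype=theano.config.floatX)))  # if this now runs, BLAS was the culprit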
