Debugging Tensorflow hang on global variables initialisation

I'm after advice on how to work out what TensorFlow is struggling with when it hangs.
I have a multi-layer CNN which hangs when global_variables_initializer() is run in the session. I get no errors or messages on the console output.
Is there an intelligent way of debugging what TensorFlow is struggling with when it hangs, instead of repeatedly commenting out the lines of code that build the graph and re-running to see where it hangs? Would the TensorFlow debugger (tfdbg) help? What options do I have?
Ideally I would just break the current execution and look at a stack trace or similar to see where execution is hanging during the init.
I'm currently running TensorFlow 0.12.1 with Python 3 inside a Jupyter notebook.

I managed to solve the problem. The tip from @amo-ej1 to run the code as a regular file was a step in the right direction. This uncovered that the TensorFlow process was being killed by a SIGKILL and returning an exit code of 137 (128 + 9, where 9 is the signal number of SIGKILL).
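As a hedged illustration of how an exit code such as 137 maps to a signal: shells conventionally report 128 + N when a process dies from signal N, and Python's subprocess module reports the same event as a negative return code. The helper name describe_exit is made up for this sketch, and it assumes a POSIX system.

```python
import signal

def describe_exit(code):
    """Return a human-readable description of a process exit status."""
    if code < 0:
        # subprocess convention: -N means "killed by signal N"
        signum = -code
    elif code > 128:
        # shell convention: 128 + N means "killed by signal N"
        signum = code - 128
    else:
        return "exited with status %d" % code
    try:
        return "killed by " + signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform
        return "exited with status %d" % code

print(describe_exit(137))
```

A SIGKILL cannot be caught by the process itself, which is consistent with the script dying without printing any Python error.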
I tried the TensorFlow debugger (tfdbg), but it did not provide any further details because the problem was that the graph did not initialize. I started to think the graph structure was incorrect, so I dumped out the graph structure using:
tf.summary.FileWriter('./logs/traing_graph', graph)
I then used TensorBoard to inspect the graph structure data written to that directory and found that the tensor dimensions of the fully connected layer were wrong: it had a width of 15 million!
It turned out that one of the configurable parameters of the graph was incorrect. It was picking up the dimension from the layer 2 tensor shape incorrectly, by indexing the wrong element of the previous tensor's shape, and this exploded the dimensions of the graph.
There were no OOM error messages in /var/log/system.log, so I am unsure why the graph initialisation caused the Python TensorFlow process to die.
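A rough back-of-the-envelope estimate shows why a layer that wide is a plausible cause of the kill, even without an OOM message in the logs. The sketch below assumes float32 weights; the input width of 1024 is a made-up figure, since only the 15 million width is known.

```python
def dense_weight_bytes(n_inputs, n_outputs, bytes_per_value=4):
    """Bytes needed for the weight matrix of one fully connected layer."""
    return n_inputs * n_outputs * bytes_per_value

# Hypothetical layer: 1024 inputs feeding a 15-million-wide layer, float32.
size = dense_weight_bytes(1024, 15_000_000)
print("%.1f GiB for the weights alone" % (size / 2**30))
```

Under these assumptions the weight matrix alone runs to tens of gigabytes, more than most machines can allocate.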
I fixed the dimensions of the graph and graph initialization worked just fine!
My top tip is to visualise your graph with TensorBoard before initialisation and training, as a quick check that the graph structure you coded is what you expected it to be. You will probably save yourself a lot of time! :-)
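On the original wish to break into a hang and look at a stack: Python's standard faulthandler module can dump the Python-level tracebacks of all threads after a timeout. The sketch below is a generic illustration, not something this answer used. The helper name, the 0.2 second timeout and the time.sleep stand-in for the hanging call are all assumptions. It writes to a real file because faulthandler needs a file descriptor, which a notebook's sys.stderr may not provide.

```python
import faulthandler
import tempfile
import time

def run_with_watchdog(fn, timeout, log_file):
    """Call fn(); if it is still running after `timeout` seconds,
    dump the tracebacks of all threads to log_file (execution continues)."""
    faulthandler.dump_traceback_later(timeout, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()

with tempfile.TemporaryFile(mode="w+") as log:
    # time.sleep stands in for a slow call such as sess.run(init_op).
    run_with_watchdog(lambda: time.sleep(0.6), timeout=0.2, log_file=log)
    log.seek(0)
    dump = log.read()

print(dump)
```

The dump shows Python frames only, so for a hang inside TensorFlow's native runtime it would point at the calling Python line, not at the native code.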

A common way to debug TensorFlow is to replace the placeholders and/or variables with numpy arrays wrapped in tf.constant. You can then examine the logic of your code by setting breakpoints and seeing actual numbers in a "Pythonic" way, not just tensors. It would be much easier to help if you posted your code here, but here is a dummy example:
import numpy as np
import tensorflow as tf

with tf.name_scope('scope_name'):
    ### This block is for debug only
    batch_size = 20
    sess = tf.Session()
    sess.run(tf.tables_initializer())
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    ### End of first debug block

    ## Replacing placeholders for debug - uncomment the placeholders and comment out the numpy arrays for production mode
    const_a = tf.constant((np.random.rand(batch_size, 26) > 0.85).astype(int), dtype=tf.float32)
    const_b = tf.constant(np.random.randint(0, 20, batch_size * 26).reshape((batch_size, 26)), dtype=tf.float32)
    # real_a_placeholder = tf.log(input_placeholder_dict[A_DATA])
    # real_b_placeholder = tf.log(input_placeholder_dict[B_DATA])

    # dummy operation on the two debug constants
    c = const_a - const_b

    # selecting top k - in the sanity check you can see that you actually get the top items and top values
    top_k = 5
    top_k_values, top_k_indices = tf.nn.top_k(c, k=top_k, sorted=True, name="top_k")
    ## Replacing variables for debug - uncomment the variables and comment out the numpy arrays for production mode
Now run your code with breakpoints. You have two options for seeing the values in the debugger:
1. sess.run(placeholder_name)
2. variable_name.eval(session=sess)
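Since the debug inputs are plain numpy arrays, the expected result can also be computed in numpy and compared with what sess.run returns. This is only a sketch: the helper numpy_top_k is hypothetical, and the fixed seed is there purely to make the run repeatable.

```python
import numpy as np

def numpy_top_k(x, k):
    """Row-wise top-k values and indices, largest first."""
    # Stable sort on the negated values: descending order,
    # with ties broken in favour of the lower index.
    idx = np.argsort(-x, axis=-1, kind="stable")[:, :k]
    vals = np.take_along_axis(x, idx, axis=-1)
    return vals, idx

rng = np.random.RandomState(0)
batch_size = 20
a = (rng.rand(batch_size, 26) > 0.85).astype(np.float32)
b = rng.randint(0, 20, batch_size * 26).reshape((batch_size, 26)).astype(np.float32)
expected_values, expected_indices = numpy_top_k(a - b, k=5)
print(expected_values.shape, expected_indices.shape)
```

If the graph constants are built from the same two arrays, comparing sess.run(top_k_values) against expected_values with np.allclose would be the sanity check.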

Related

No console output using Keras model.fit() function

I'm following this tutorial to perform time series classifications using Transformers with Keras and TensorFlow. I'm using Windows 10 and the PyDev Eclipse plugin. Unfortunately, my program stops and the console output is completely blank every time I run the following code:
n_classes = len(np.unique(y_train))
input_shape = np.array(x_trainScaled).shape[0:]
model = build_model(n_classes,input_shape,head_size=256,num_heads=4,ff_dim=4,num_transformer_blocks=4,mlp_units=[128],mlp_dropout=0.4,dropout=0.25)
model.compile(loss="sparse_categorical_crossentropy",optimizer=keras.optimizers.Adam(learning_rate=1e-4),metrics=["sparse_categorical_accuracy"])
print(model.summary())
callbacks = [keras.callbacks.EarlyStopping(patience=100, restore_best_weights=True)]
model.fit(x_trainScaled,y_train,validation_split=0.2,epochs=200,batch_size=64,callbacks=callbacks)
pathToModel = 'my/path/to/model/'
model.save(pathToModel)
Even previous warnings or print statements are completely erased and I have no idea what's going on. If I comment the model.fit(...) statement out, the program terminates and crashes with an error message resulting from a model.predict(...) call.
Any help is highly appreciated.

The solution was to transform the input data and labels to numpy arrays first. Thus, calling the fit function as follows:
model.fit(np.array(x_trainScaled),np.array(y_train),validation_split=0.2,epochs=200,batch_size=64,callbacks=callbacks)
worked perfectly fine for me, as opposed to:
model.fit(x_trainScaled,y_train,validation_split=0.2,epochs=200,batch_size=64,callbacks=callbacks)
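Why the conversion helps depends on what x_trainScaled and y_train were, which the post does not say. As a hedged sketch, assuming they were plain Python lists or something similar, converting up front and checking the result makes problems such as ragged rows visible before model.fit is reached. The helper name to_model_input is made up for illustration.

```python
import numpy as np

def to_model_input(data, dtype=np.float32):
    """Convert list-like data to a dense numeric array, failing loudly if it cannot be."""
    arr = np.asarray(data)
    if arr.dtype == object:
        # Typically a sign of ragged rows or mixed types.
        raise ValueError("data did not convert to a regular numeric array")
    return arr.astype(dtype)

x = to_model_input([[0.1, 0.2], [0.3, 0.4]])
y = to_model_input([0, 1], dtype=np.int64)
print(x.shape, x.dtype, y.shape, y.dtype)
```

Printing the shape and dtype before training is a cheap check when the console otherwise stays blank.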

Do not use tf.reset_default_graph() to clear nested graphs

I have a bunch of functions which each create part of a computation graph. In some of these functions I do
with tf.name_scope("my_scope_name"):
    self._eye_n_components = tf.eye(se...
At the beginning of topmost function I call
tf.reset_default_graph()
and then call those partial functions, which can also call each other.
Unfortunately, I get an error
Error: Do not use tf.reset_default_graph() to clear nested graphs. If
you need a cleared graph, exit the nesting and create a new graph.
Several questions:
1) What is nesting, and how do I "exit nesting"?
2) How do I create a new graph?
3) How can I find out where I am entering the nesting?
4) How do I clear the entire graph so that TensorFlow does not think I am trying to clear a nested one?

This error message is displayed when you call tf.reset_default_graph() in one of the following scenarios:
Inside a with graph.as_default(): block.
Inside a with tf.Session(): block.
Between creating a tf.InteractiveSession and calling sess.close().
Each of these scenarios involves registering a default (and potentially "nested") tf.Graph object, which will be unregistered when you exit the block (or close the tf.InteractiveSession). Resetting the default graph in those scenarios would leave the system in an inconsistent state, so make sure you exit the block (or close the tf.InteractiveSession) before calling tf.reset_default_graph().
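The registration mechanism can be pictured with a toy model in plain Python. This is a simplified sketch of the idea, not TensorFlow's actual implementation. The class DefaultGraphStack and its methods are hypothetical names.

```python
import contextlib

class DefaultGraphStack:
    """Toy model of a stack of registered default graphs."""

    def __init__(self):
        self._stack = []                  # graphs registered by enclosing blocks
        self._global_default = object()   # stand-in for the global default graph

    @contextlib.contextmanager
    def as_default(self, graph):
        self._stack.append(graph)         # register on entering the block
        try:
            yield graph
        finally:
            self._stack.pop()             # unregister on leaving the block

    def reset(self):
        if self._stack:
            # A block is still active: swapping the global default now would
            # leave the stack and the default out of step with each other.
            raise AssertionError("exit the nesting before resetting")
        self._global_default = object()

stack = DefaultGraphStack()
stack.reset()                             # fine: nothing is registered
with stack.as_default("graph_a"):
    try:
        stack.reset()                     # refused: still inside the block
    except AssertionError as err:
        print("refused:", err)
stack.reset()                             # fine again after leaving the block
```

In this picture, "exiting the nesting" simply means leaving every with block (or closing every interactive session) that registered a graph, so that the stack is empty when the reset is requested.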

I solved it by closing the session and loading the neural network model again.
My answers are:
(1) Exit the with ... block or call sess.close().
(2) Load the neural network model (and trained weights), for example:
gd = tf.GraphDef.FromString(open(checkpoint + '_frozen.pb', 'rb').read())
inp, predictions = tf.import_graph_def(gd, return_elements=['input:0', 'MobilenetV2/Predictions/Reshape_1:0'])
(3) When you print out the model, you may see a TensorFlow object such as <VSR.Backend.TF.Framework.Trainer.VSR object at 0x000001E5DA53C898>
(4) I heard about tf.reset_default_graph() and tf.keras.backend.clear_session() from here, but I never got that code to work.

TensorFlow nullptr check failed on GPU

I am using the python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When I run my script on the CPU, it does not produce any error messages, but on the GPU Python crashes with:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code that causes this failure (it works when I comment it out) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))
h_gathered = tf.transpose(h_gathered, [1, 0, 2])
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)
...
where time is zero-based and incremented after each step, and h_ta is a TensorArray
h_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    clear_after_read=False,
    element_shape=[batch_size, num_hidden],
    tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray
self.syntactic_weights_ta = tf.TensorArray(
    dtype=dtype,
    size=max_seq_len,
    tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
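To make the intended computation concrete, the following numpy sketch reproduces the shapes of one loop step outside TensorFlow. All sizes and the random data are invented for illustration, and it assumes that reading one element of the weights array yields a [batch_size, max_seq_len] matrix. Only the sequence of operations mirrors the snippet from the loop body.

```python
import numpy as np

batch_size, num_hidden, max_seq_len = 2, 3, 5    # made-up sizes
time = 4                                          # current step, zero-based

rng = np.random.RandomState(0)
h_past = rng.rand(time, batch_size, num_hidden)   # like h_ta.gather(tf.range(time))
weights = rng.rand(batch_size, max_seq_len)       # like syntactic_weights_ta.read(time)

h_gathered = np.transpose(h_past, [1, 0, 2])      # [batch, time, hidden]
syn_t = weights[:, :time][:, np.newaxis, :]       # [batch, 1, time]
syn_state_t = np.squeeze(np.tanh(np.matmul(syn_t, h_gathered)), axis=1)  # [batch, hidden]

# The same result written as an explicit weighted sum over the past steps.
reference = np.tanh(np.einsum("bt,bth->bh", weights[:, :time], h_gathered))
print(syn_state_t.shape, np.allclose(syn_state_t, reference))
```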
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, but this time with swap_memory parameter in the while loop set to False and it works on GPU as well, though I'd really like to know why it does not work with swap_memory=True.

This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's GitHub? Please also include:
A full stack trace (TF built with -c dbg preferable)
A minimal code example to reproduce
Whether the issue requires you to be calling backprop.
Whether this is reproducible in TF 1.2 / nightlies / master branch.
And could you respond here with the link to the GitHub issue?

Tensorflow session returns as 'closed'

I have successfully ported the CIFAR-10 ConvNet tutorial code for my own images and am able to train on my data and generate Tensorboard outputs etc.
My next step was to implement an evaluation of new data against the model I built. I am now trying to use cifar10_eval.py as a starting point, but I am running into some difficulty.
I should point out that the original tutorial code runs entirely without a problem, including cifar10_eval.py. However, when moving this particular code to my application, I get the following error message (last line).
RuntimeError: Attempted to use a closed Session.
I found this error is thrown by TF's session.py
# Check session.
if self._closed:
    raise RuntimeError('Attempted to use a closed Session.')
I have checked the directories in which all files should reside and be created, and all seems exactly as it should (they mirror perfectly those created by running the original tutorial code). They include a train, eval and data folders, containing checkpoints/events files, events file, and data binaries respectively.
I wonder if you could help pointing out how I can debug this, as I'm sure there may be something in the data flow that got disrupted when transitioning the code. Unfortunately, despite digging deep and comparing to the original, I can't find the source, as they are essentially similar with trivial changes in file names and destination directories only.
EDIT_01:
Debugging step by step, it seems the line that actually throws the error is #106 in the original cifar10_eval.py:
def eval_once(saver, summary_writer, top_k_op, summary_op):
    ...
    with tf.Session() as sess:
        ...
        summary = tf.Summary()
        summary.ParseFromString(sess.run(summary_op)) # <========== line 106
summary_op is created in def evaluate of this same script and passed as an arg to def eval_once.
summary_op = tf.merge_all_summaries()
...
while True:
    eval_once(saver, summary_writer, top_k_op, summary_op)

According to the documentation on Session, a session can be closed either with the .close() method or by using it as a context manager in a with block. I ran find tensorflow/models/image/cifar10 | xargs grep "sess" and did not see any sess.close, so it must be the latter.
I.e., you'll get this error if you do something like this:
with tf.Session() as sess:
    sess.run(..)
sess.run(...) # Attempted to use a closed Session.

It was a simple (but humbling) error in indentation.
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op))
summary.value.add(tag='Precision @ 1', simple_value=precision)
summary_writer.add_summary(summary, global_step)
was outside of the try: block, and of course, no session could be found.
Sigh.

TensorFlow: Node not found

I am training a neural network and have been running this code without any problems but sometimes (twice) I get an error Not Found: FetchOutputs node not found at the line y_1 = sess.run(get_labels(step)) (See below).
get_labels(step) is a function to return the correct labels of my training images which is in a text file.
def get_labels(step):
    with open('labels.txt','r') as fin:
        reader = csv.reader(fin)
        c = [[int(s) for s in row] for i,row in enumerate(reader) if i==step]
        label_numbers = np.array(c)
    # Convert to one-hot vectors
    numpy_label = np.zeros((BATCH_SIZE,5))
    for i in range(BATCH_SIZE):
        numpy_label[i,label_numbers[0][i]-1] = 1
    # Convert to tensor
    y_label = tf.convert_to_tensor(numpy_label,dtype=tf.float32)
    return y_label
This is my main function:
def main():
    # Placeholder for correct labels
    y_label = tf.placeholder(tf.float32,shape=[BATCH_SIZE,5])
    < Other functions etc. >
    sess.run(tf.initialize_all_variables())
    tf.train.start_queue_runners(sess=sess)
    for step in range(1000):
        # Get labels for current batch
        y_1 = sess.run(get_labels(step))
        # Train
        sess.run([train_step],feed_dict={y_label:y_1})
        < Other stuff like writing summaries, saving variables etc. >
    sess.close()
From reading some of the issues on GitHub, I know this has to do with the fact that I call y_1 = sess.run(get_labels(step)) after tf.train.start_queue_runners(sess=sess), but I don't understand:
Why does it work most of the time, but occasionally not?
Is y_1 = sess.run(get_labels(step)) adding or modifying nodes in the graph? I thought I was just running a node get_labels(step) that was already defined in the graph. I tried finalizing the graph before starting the queue runners but that gave me the error that finalized graphs cannot be modified.
What would be the proper way to write the code? Usually I just restart my program and it is fine - but clearly I am not doing it the proper way.
Thank you!
EDIT:
I think it might be important to mention that this happens when I am trying to run a TensorFlow script in a separate screen on a server i.e. I have one screen running a TensorFlow script and now I create a new screen to run a different TensorFlow script. I just started using screens so I might be missing something fundamental about how they work.
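On the question of whether nodes are being added: in graph-mode TensorFlow, tf.convert_to_tensor on a numpy array creates a new constant op, so calling get_labels(step) inside the loop does add a node to the graph on every step. One possible restructuring, offered as a sketch and not as a confirmed fix for the FetchOutputs error, is to have the helper return the one-hot numpy array and pass it straight to feed_dict. The name one_hot_labels_for_step, the in-memory sample data and the batch size of 3 are assumptions for illustration.

```python
import csv
import io
import numpy as np

def one_hot_labels_for_step(fin, step, batch_size, num_classes=5):
    """Read row `step` of a CSV of 1-based labels and return a one-hot float32 array."""
    for i, row in enumerate(csv.reader(fin)):
        if i == step:
            labels = np.array([int(s) for s in row[:batch_size]])
            one_hot = np.zeros((batch_size, num_classes), dtype=np.float32)
            # Labels are 1-based in the file, columns are 0-based.
            one_hot[np.arange(batch_size), labels - 1] = 1.0
            return one_hot
    raise IndexError("no row %d in label file" % step)

# In-memory stand-in for labels.txt: one row of labels per training step.
sample = io.StringIO("1,2,3\n5,4,1\n")
y_1 = one_hot_labels_for_step(sample, step=1, batch_size=3)
print(y_1)
```

In the training loop the result would go directly into sess.run([train_step], feed_dict={y_label: y_1}), with no sess.run call on the labels themselves, so the graph stays unchanged once the queue runners have started.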
