Do not use tf.reset_default_graph() to clear nested graphs - python

I have a bunch of functions, which create portions of computation graph. In some of such functions I do
with tf.name_scope("my_scope_name"):
self._eye_n_components = tf.eye(se...
At the beginning of topmost function I call
tf.reset_default_graph()
and then call those partial functions and also they can call each other.
Unfortunately, I get an error
Error: Do not use tf.reset_default_graph() to clear nested graphs. If
you need a cleared graph, exit the nesting and create a new graph.
Several questions.
1) What is nesting and how to "exit nesting"?
2) How to create new graph?
3) How to catch, where I am entering the nesting?
4) How to clear entire graph so that tensorflow does not think I am trying to clear nested one?

This error message is displayed when you call tf.reset_default_graph() in one of the following scenarios:
Inside a with graph.as_default(): block.
Inside a with tf.Session(): block.
Between creating a tf.InteractiveSession and calling sess.close().
Each of these scenarios involves registering a default (and potentially "nested") tf.Graph object, which will be unregistered when you exit the block (or close the tf.InteractiveSession). Resetting the default graph in those scenarios would leave the system in an inconsistent state, so you should make sure to exit the block (or close the tf.InteractiveSession) before calling tf.reset_default_graph().
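For example, a minimal sketch of the fix, assuming TensorFlow 1.x (the tf.eye call just stands in for one of your partial-graph functions):
import tensorflow as tf

# Wrong: resetting while a default session (and its graph) is still registered.
# with tf.Session() as sess:
#     tf.reset_default_graph()   # raises the "nested graphs" error

# Right: exit the block first, then reset.
with tf.Session() as sess:
    pass  # ... build and run ops here ...
tf.reset_default_graph()          # safe: no session/graph context is active

# Alternative: build into a brand-new graph instead of resetting the default one.
new_graph = tf.Graph()
with new_graph.as_default():
    eye = tf.eye(3)               # ops created here belong to new_graph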

I solved it by closing the session and loading the neural network model again.
My answers are:
(1) Exit the with ... block or call sess.close().
(2) Load the neural network model (and trained weights) like:
gd = tf.GraphDef.FromString(open(checkpoint + '_frozen.pb', 'rb').read())
inp, predictions = tf.import_graph_def(gd, return_elements=['input:0', 'MobilenetV2/Predictions/Reshape_1:0'])
(3) When you print out the model you may see something like: Tensorflow object <VSR.Backend.TF.Framework.Trainer.VSR object at 0x000001E5DA53C898>
(4) I heard about tf.reset_default_graph() and tf.keras.backend.clear_session() from here, but I never got the code to work.

Related

How to get a tensor's value in TensorFlow (without making another session)

I'm looking for a way to get a tensor's value. In most cases the problem would be solved by calling sess.run(target_op). However, I want to know another way. I am editing code downloaded from GitHub, so there's already session-running code there. Without touching the session-running part, is there any way to get a specific tensor's value? In my case, the code is built for getting accuracy for image recognition. While the session runs and does the accuracy evaluation, I also want to get the "prediction" tensor value in the same session, without creating another session. For example, an operation like tf.Print shows a tensor's value through a terminal window without running the session directly (in the first figure we just have to do sess.run(e) to print out the tensor from c).
example of tf.Print
a = tf.constant(5)
b = tf.constant(3)
c = tf.add(a,b)
#print tensor c (which is 8)
d = tf.Print(c,[c])
f = tf.constant(2)
e = tf.multiply(f,d)
sess = tf.Session()
#print operation can be executed without running the session directly
g = sess.run(e)
Like tf.Print, is there any operation that gets a tensor's value without running the session directly? (like the second figure)
example of operation I am looking for
More specifically, what I want is to get the value of a tensor (with actual numbers and arrays, not just the 'tensor' data structure) and pass it to a global variable, so that I can access the value freely even after the session closes. The session only executes the operator located at the end of the graph, while the tensor whose value I want is located in the middle of the graph. With the restriction that I cannot create more sessions than the original code has, is there any way to get the specific tensor's value? (I can't use .eval() or .run() because either needs access to a 'session'; the code I am editing runs via the slim.evaluate_once function, and as the session is bound inside that function, I cannot get at the session.)
There is no reason why you can't just fetch any tensor from the graph, provided you feed in the appropriate feed_dict. For instance, say you want a tensor called biasAdd:0 and your so-called end tensor is called prediction.
Then you can just get this tensor and evaluate it:
tensor = graph.get_tensor_by_name("biasAdd:0")
tensor_value, prediction_value = sess.run([tensor, prediction], ...)
In TensorFlow you have to use run or eval to get a numerical value out of the graph.
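A small self-contained sketch of that idea, reusing the constants from the question above (TF 1.x; the middle tensor c is fetched together with the end op e in the same run call):
import tensorflow as tf

a = tf.constant(5)
b = tf.constant(3)
c = tf.add(a, b)                     # the "middle of the graph" tensor we want
e = tf.multiply(c, tf.constant(2))   # the "end of the graph" op

sess = tf.Session()
# Fetch the intermediate tensor alongside the final op in the SAME run call;
# no extra session is created and the graph executes only once.
middle_value, end_value = sess.run([c, e])
print(middle_value, end_value)       # 8 16
middle_global = middle_value         # a plain numpy value, usable after sess.close()
sess.close()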

Debugging Tensorflow hang on global variables initialisation

I'm after advice on how to debug what on Tensorflow is struggling with when it hangs.
I have a multi-layer CNN which hangs when global_variables_initializer() is run in the session. I am getting no errors or messages on the console output.
Is there an intelligent way of debugging what TensorFlow is struggling with when it hangs, instead of repeatedly commenting out lines of code that build the graph and re-running to see where it hangs? Would the TensorFlow debugger (tfdbg) help? What options do I have?
Ideally it would be great to just to break current execution and look at some stack or similar to see where the execution is hanging during the init.
I'm currently running Tensorflow 0.12.1 with Python 3 inside a Jupyter notebook.
I managed to solve the problem. The tip from #amo-ej1 to run in a regular file was a step in the correct direction. This uncovered that the TensorFlow process was killing itself off with a SIGKILL and returning an error code of 137.
I tried the TensorFlow Debugger (tfdbg), though this did not provide any further details, as the problem was that the graph did not initialize. I started to think the graph structure was incorrect, so I dumped out the graph structure using:
tf.summary.FileWriter('./logs/traing_graph', graph)
I then used TensorBoard to inspect the resultant summary graph structure data dumped out to that directory, and found that the tensor dimensions of the fully connected layer were wrong, having a width of 15 million (!).
It turned out that one of the configurable parameters of the graph was incorrect. It was picking up the dimension of the layer-2 tensor shape incorrectly, from an incorrect reference to the previous layer's tf.shape property, and this exploded the dimensions of the graph.
There were no OOM error messages in /var/log/system.log, so I am unsure why the graph initialisation caused the Python TensorFlow script process to die.
I fixed the dimensions of the graph and graph initialization worked just fine!
My top tip is to visualise your graph with TensorBoard before initialisation and training, to quickly check that the resultant graph structure you coded is what you expected it to be. You will probably save yourself a lot of time! :-)
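For instance, a minimal sketch of that check (the placeholder/variable names here are only illustrative):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784], name="input")
w = tf.Variable(tf.zeros([784, 10]), name="fc_weights")
logits = tf.matmul(x, w, name="logits")

# Dump the graph definition BEFORE running any initializer or training step,
# then launch `tensorboard --logdir ./logs/traing_graph` and inspect the tensor shapes.
writer = tf.summary.FileWriter('./logs/traing_graph', tf.get_default_graph())
writer.close()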
A common methodology for debugging TensorFlow is to replace the placeholders and/or variables with numpy arrays wrapped in tf.constant. When you do so you can actually examine the logic of your code by setting breakpoints and seeing numbers in a "Pythonic" way, not just tensors. It will be much easier to help you if you post your code here, but here is a dummy example:
with tf.name_scope('scope_name'):
    ### This block is for debug only
    import numpy as np
    batch_size = 20
    sess = tf.Session()
    sess.run(tf.tables_initializer())
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    ### End of first debug block
    ## Replacing placeholders for debug - uncomment the placeholders and comment out the numpy constants for production mode
    const_a = tf.constant((np.random.rand(batch_size, 26) > 0.85).astype(int), dtype=tf.float32)
    const_b = tf.constant(np.random.randint(0, 20, batch_size * 26).reshape((batch_size, 26)), dtype=tf.float32)
    # real_a_placeholder = tf.log(input_placeholder_dict[A_DATA])
    # real_b_placeholder = tf.log(input_placeholder_dict[B_DATA])
    # dummy operation
    c = const_a - const_b
    # selecting top k - in the sanity check you can see here that you actually get the top items and top values
    top_k = 5
    top_k_values, top_k_indices = tf.nn.top_k(c, k=top_k, sorted=True, name="top_k")
    ## Replacing variables for debug - uncomment the variables and comment out the numpy constants for production mode
Now, run your code with breakpoints and you have two options to see the values in the debugger:
1. sess.run(placeholder_name)
2. use eval: variable_name.eval(session=sess)

What is the alternative of tf.Variable.ref() in Tensorflow version 0.12?

I'm trying to run the open code of the A3C reinforcement learning algorithm to learn A3C: A3C code.
However, I got several errors and I could fix all of them except one.
In the code, ref(), which is a member function of tf.Variable, is used (1, 2), but in the recent TensorFlow version 0.12rc that function seems to be deprecated.
So I don't know the best way to replace it (I don't understand exactly why the author used ref()). When I just changed it to the variable itself (for example v.ref() to v) there was no error, but the reward does not change. It seems it cannot learn, and I guess it is because the variables are not properly updated.
Please advise me on the proper way to modify the code so that it works.
The new method tf.Variable.read_value() is the replacement for tf.Variable.ref() in TensorFlow 0.12 and later.
The use case for this method is slightly tricky to explain, and is motivated by some caching behavior that causes multiple uses of a remote variable on a different device to use a cached value. Let's say you have the following code:
with tf.device("/cpu:0")
v = tf.Variable([[1.]])
with tf.device("/gpu:0")
# The value of `v` will be captured at this point and cached until `m2`
# is computed.
m1 = tf.matmul(v, ...)
with tf.control_dependencies([m1])
# The assign happens (on the GPU) after `m1`, but before `m2` is computed.
assign_op = v.assign([[2.]])
with tf.control_dependencies([assign_op]):
with tf.device("/gpu:0"):
# The initially read value of `v` (i.e. [[1.]]) will be used here,
# even though `m2` is computed after the assign.
m2 = tf.matmul(v, ...)
sess.run(m2)
You can use tf.Variable.read_value() to force TensorFlow to read the variable again later, and it will be subject to whatever control dependencies are in place. So if you wanted to see the result of the assign when computing m2, you'd modify the last block of the program as follows:
with tf.control_dependencies([assign_op]):
    with tf.device("/gpu:0"):
        # The `read_value()` call will cause TensorFlow to transfer the
        # new value of `v` from the CPU to the GPU before computing `m2`.
        m2 = tf.matmul(v.read_value(), ...)
(Note that, currently, if all of the ops were on the same device, you wouldn't need to use read_value(), because TensorFlow doesn't make a copy of the variable when it is used as the input to an op on the same device. This can cause a lot of confusion—for example when you enqueue a variable to a queue!—and it's one of the reasons that we're working on enhancing the memory model for variables.)

How to use tf.cond in combination with batching operations / queue runners

Situation
I want to train a specific network architecture (a GAN) that needs inputs from different sources during training.
One input source is examples loaded from disk. The other source is a generator sub-network creating examples.
To choose which kind of input to feed to the network I use tf.cond. There is one caveat though that has already been explained: tf.cond evaluates the inputs to both conditional branches even though only one of those will ultimately be used.
Enough setup, here is a minimal working example:
import numpy as np
import tensorflow as tf

BATCH_SIZE = 32

def load_input_data():
    # Normally this data would be read from disk
    data = tf.reshape(np.arange(10 * BATCH_SIZE, dtype=np.float32), shape=(10 * BATCH_SIZE, 1))
    return tf.train.batch([data], BATCH_SIZE, enqueue_many=True)

def generate_input_data():
    # Normally this data would be generated by a much bigger sub-network
    return tf.random_uniform(shape=[BATCH_SIZE, 1])

def main():
    # A bool to choose between loaded or generated inputs
    load_inputs_pred = tf.placeholder(dtype=tf.bool, shape=[])

    # Variant 1: Call "load_input_data" inside tf.cond
    data_batch = tf.cond(load_inputs_pred, load_input_data, generate_input_data)

    # Variant 2: Call "load_input_data" outside tf.cond
    #loaded_data = load_input_data()
    #data_batch = tf.cond(load_inputs_pred, lambda: loaded_data, generate_input_data)

    init_op = tf.initialize_all_variables()

    with tf.Session() as sess:
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(threads)

        # Get generated input data
        data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: False})
        print(data_batch_values)

        # Get input data loaded from disk
        data_batch_values = sess.run(data_batch, feed_dict={load_inputs_pred: True})
        print(data_batch_values)

if __name__ == '__main__':
    main()
Problem
Variant 1 does not work at all since the queue runner threads don't seem to run. print(threads) outputs something like [<Thread(Thread-1, stopped daemon 140165838264064)>, ...].
Variant 2 does work and print(threads) outputs something like [<Thread(Thread-1, started daemon 140361854863104)>, ...]. But since load_input_data() has been called outside of tf.cond, batches of data will be loaded from disk even when load_inputs_pred is False.
Is it possible to make Variant 1 work, so that input data is only loaded when load_inputs_pred is True and not for every call to session.run()?
If you're using a queue when loading your data and follow it up with a batch input, then this shouldn't be a problem, since you can specify the maximum amount to load or store in the queue.
input = tf.WholeFileReader(somefilelist) # or another way to load data
return tf.train.batch(input,batch_size=10,capacity=100)
See here for more details:
https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html#batch
Also, there's an alternative approach that skips tf.cond completely. Just define two losses: one that follows the data through the autoencoder and discriminator, and the other that follows the data through just the discriminator.
Then it just becomes a matter of calling
sess.run(auto_loss,feed_dict)
or
sess.run(real_img_loss,feed_dict)
In this way the graph will only run through whichever loss was called upon. Let me know if this needs more explanation.
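Here is a minimal runnable sketch of that two-loss idea, reusing the toy inputs from the question above; in a real GAN the reduce_mean stand-ins would be the actual autoencoder/discriminator losses:
import numpy as np
import tensorflow as tf

BATCH_SIZE = 32

# One "loss" depends only on data loaded through the queue, the other only on
# generated data, so fetching one never executes the other's input pipeline.
data = tf.reshape(np.arange(10 * BATCH_SIZE, dtype=np.float32), shape=(10 * BATCH_SIZE, 1))
loaded_batch = tf.train.batch([data], BATCH_SIZE, enqueue_many=True)
generated_batch = tf.random_uniform(shape=[BATCH_SIZE, 1])

real_img_loss = tf.reduce_mean(loaded_batch)   # path through the loaded examples
auto_loss = tf.reduce_mean(generated_batch)    # path through the generator

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(auto_loss))      # does not touch the loading pipeline
    print(sess.run(real_img_loss))  # only this fetch dequeues loaded data
    coord.request_stop()
    coord.join(threads)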
Lastly, I think that to make variant one work you need to do something like this, if you're using preloaded data:
https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#preloaded-data
Otherwise I'm not sure what the issue is to be honest.

Tensorflow session returns as 'closed'

I have successfully ported the CIFAR-10 ConvNet tutorial code for my own images and am able to train on my data and generate Tensorboard outputs etc.
My next step was to implement an evaluation of new data against the model I built. I am now trying to use cifar10_eval.py as a starting point; however, I am running into some difficulty.
I should point out that the original tutorial code runs entirely without a problem, including cifar10_eval.py. However, when moving this particular code to my application, I get the following error message (last line).
RuntimeError: Attempted to use a closed Session.
I found this error is thrown by TF's session.py
# Check session.
if self._closed:
    raise RuntimeError('Attempted to use a closed Session.')
I have checked the directories in which all files should reside and be created, and all seems exactly as it should (they mirror perfectly those created by running the original tutorial code). They include a train, eval and data folders, containing checkpoints/events files, events file, and data binaries respectively.
I wonder if you could help point out how I can debug this, as I'm sure there may be something in the data flow that got disrupted when transitioning the code. Unfortunately, despite digging deep and comparing to the original, I can't find the source, as they are essentially similar, with trivial changes in file names and destination directories only.
EDIT_01:
Debugging step by step, it seems the line that actually throws the error is #106 in the original cifar10_eval.py:
def eval_once(args etc):
    ...
    with tf.Session() as sess:
        ...
        summary = tf.Summary()
        summary.ParseFromString(sess.run(summary_op)) # <========== line 106
summary_op is created in def evaluate of this same script and passed as an arg to def eval_once.
summary_op = tf.merge_all_summaries()
...
while True:
    eval_once(saver, summary_writer, top_k_op, summary_op)
From the documentation on Session, a session can be closed with the .close command or when using it through a context manager in a with block. I ran find tensorflow/models/image/cifar10 | xargs grep "sess" and I don't see any sess.close, so it must be the latter.
I.e., you'll get this error if you do something like this:
with tf.Session() as sess:
    sess.run(..)
sess.run(...) # Attempted to use a closed Session.
It was a simple (but humbling) error in indentation.
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op))
summary.value.add(tag='Precision # 1', simple_value=precision)
summary_writer.add_summary(summary, global_step)
was outside of the try: block, and of course, no session could be found.
Sigh.
