Tensorflow session returns as 'closed' - python

I have successfully ported the CIFAR-10 ConvNet tutorial code for my own images and am able to train on my data and generate Tensorboard outputs etc.
My next step was to implement an evaluation of new data against the model I built. I am trying now to use cifar10_eval.py as a starting point however am running into some difficulty.
I should point out that the original tutorial code runs entirely without a problem, including cifar10_eval.py. However, when moving this particular code to my application, I get the following error message (last line).
RuntimeError: Attempted to use a closed Session.
I found this error is thrown by TF's session.py
# Check session.
if self._closed:
raise RuntimeError('Attempted to use a closed Session.')
I have checked the directories in which all files should reside and be created, and all seems exactly as it should (they mirror perfectly those created by running the original tutorial code). They include a train, eval and data folders, containing checkpoints/events files, events file, and data binaries respectively.
I wonder if you could help pointing out how I can debug this, as I'm sure there may be something in the data flow that got disrupted when transitioning the code. Unfortunately, despite digging deep and comparing to the original, I can't find the source, as they are essentially similar with trivial changes in file names and destination directories only.
EDIT_01:
Debugging step by step, it seems the line that actually throws the error is #106 in the original cifar10_eval.py:
def eval_once(args etc)
...
with tf.Session() as sess:
...
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op)) # <========== line 106
summary_op is created in def evaluate of this same script and passed as an arg to def eval_once.
summary_op = tf.merge_all_summaries()
...
while True:
eval_once(saver, summary_writer, top_k_op, summary_op)

From documentation on Session, a session can be closed with .close command or when using it through a context-manager in with block. I did find tensorflow/models/image/cifar10 | xargs grep "sess" and I don't see any sess.close, so it must be the later.
IE, you'll get this error if you do something like this
with tf.Session() as sess:
sess.run(..)
sess.run(...) # Attempted to use a closed Session.

It was a simple (but humbling) error in indentation.
summary = tf.Summary()
summary.ParseFromString(sess.run(summary_op))
summary.value.add(tag='Precision # 1', simple_value=precision)
summary_writer.add_summary(summary, global_step)
was outside of the try: block, and of course, no session could be found.
Sigh.

Related

TensorFlow How to Initialize Global Step

So I'm trying to run a training session, and when I do I get this error when trying to run my algorithm (when I use tf.train.get_global_step()):
ValueError: global_step is required for exponential_decay.
For some reason, tf.train.get_or_create_global_step() doesn't exist for me, I'm not sure if that's because it's a removed method or what. I updated TensorFlow and everything I'm up to date.
I've dug around the documentation and there's nothing about it. To run I'm using tf.app.run() with a main function.
Is there another way to initialize the global step variable?
Although tf.train.get_or_create_step() is perfectly fine, here is another solution:
g_step = tf.get_variable('global_step', trainable=False, initializer=0)
learning_rate = tf.train.exponential_decay(0.1, g_step)
tf.train.AdamOptimizer(learning_rate).minimize(loss=loss, global_step=g_step)
Create an untrainable variable that initializes with zero and passes it to the Optimizer.
If you need global_step later use tf.train.global_step():
sess = tf.Session()
# Initialize the variable
sess.run(g_step.initializer)
print('global_step: %s' % tf.train.global_step(sess, g_step))
So, the reason this function wasn't showing up was because I actually hadn't been on the newest version of TensorFlow even though it was telling me I was completely up to date.
Seen Here:
So all I did to fix it was uninstall tensorflow, then install from the actual link I don't have it anymore, but a quick google search would suffice.

Why match_filenames_once function returns a local variable

I was trying to understand the mechanism of tensorflow for reading images using queues. I was using the code found here, whom basic parts are:
filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once('D:/Dataset/*.jpg'))
image_reader = tf.WholeFileReader()
image_name, image_file = image_reader.read(filename_queue)
image = tf.image.decode_jpeg(image_file)
with tf.Session() as sess:
tf.global_variables_initializer().run()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
image_tensor = sess.run([image])
print(image_tensor)
which in reality does nothing special. I was getting an error:
OutOfRangeError (see above for traceback): FIFOQueue
'_0_input_producer' is closed and has insufficient elements (requested
1, current size 0)
which lead to search for missing images, wrong folder, wrong glob pattern etc until I discovered that tensorflow basically meant this:
"You need to initialize local variables also"!
Besides the fact that the code seemed to work in the original gist with just this substitution:
tf.initialize_all_variables().run()
instead of
tf.global_variables_initializer().run()
in my code it does not work. It produces the same error. I guess it has changed the implementation of initialize_all_variables() with tensorflow development (I am using 1.3.0), since in here it mentions that it initialize local variables also.
So, the final conclusion I came with was that I should initialize local variables also. And my code worked. The error message is awfully misleading (which did not help at all) but anyway to the main part I am a bit confused why am I getting a local variable by match_filenames_once. In documentation there is no reference about this (I am not sure it should though).
Am I always going to get local from this match_filenames_once? Can I control it somehow?

Do not use tf.reset_default_graph() to clear nested graphs

I have a bunch of functions, which create portions of computation graph. In some of such functions I do
with tf.name_scope("my_scope_name"):
self._eye_n_components = tf.eye(se...
At the beginning of topmost function I call
tf.reset_default_graph()
and then call those partial functions and also they can call each other.
Unfortunately, I get an error
Error: Do not use tf.reset_default_graph() to clear nested graphs. If
you need a cleared graph, exit the nesting and create a new graph.
Several questions.
1) What is nesting and how to "exit nesting"?
2) How to create new graph?
3) How to catch, where I am entering the nesting?
4) How to clear entire graph so that tensorflow does not think I am trying to clear nested one?
This error message is displayed when you call tf.reset_default_graph() in one of the following scenarios:
Inside a with graph.as_default(): block.
Inside a with tf.Session(): block.
Between creating a tf.InteractiveSession and calling sess.close().
Each of these scenarios involves registering a default (and potentially "nested") tf.Graph object, which will be unregistered when you exit the block (or close the tf.InteractiveSession). Resetting the default graph in those scenarios would leave the system in an inconsistent state, so you should ensure to exit the block (or close the tf.InteractiveSession) before calling tf.reset_default_graph().
I solved by closing a session and loading the neural network model again.
My answers are:
(1) Exit with... block or sess.close()
(2) Load neural network model (and trained weight) like:
gd = tf.GraphDef.FromString(open(checkpoint + '_frozen.pb', 'rb').read())
inp, predictions = tf.import_graph_def(gd, return_elements=['input:0', 'MobilenetV2/Predictions/Reshape_1:0'])
(3) When you print out model you may see like Tensorflow object <VSR.Backend.TF.Framework.Trainer.VSR object at 0x000001E5DA53C898>
(4) I heard tf.reset_default_graph() and tf.keras.backend.clear_session() from here, but I never make the code work.

Debugging Tensorflow hang on global variables initialisation

I'm after advice on how to debug what on Tensorflow is struggling with when it hangs.
I have a multi layer CNN which hangs upon global_variables_initializer() is run in the session. I am getting no errors or messages on the console output.
Is there an intelligent way of debugging what Tensorflow is struggling with when it hangs instead of repeatedly commenting out lines of code that makes the graph, and re-running to see where it hangs. Would TensorFlow debugger (tfdbg) help? What options do I have?
Ideally it would be great to just to break current execution and look at some stack or similar to see where the execution is hanging during the init.
I'm currently running Tensorflow 0.12.1 with Python 3 inside a Jupiter notebook.
I managed to solve the problem. The tip from #amo-ej1 to run in a regular file was a step in the correct direction. This uncovered that the tensor flow process was killing itself off with a SIGKILL and returning an error code of 137.
I tried Tensorflow Debugger tfdbg though this did not provide any further details as the problem was the graph did not initialize. I started to think the graph structure was incorrect, so I dumped out the graph structure using:
tf.summary.FileWriter('./logs/traing_graph', graph)
I then used up Tensorboard to inspect the resultant summary graph structure data dumped out the the directory and found that the tensor dimensions of the Fully Connected layer was wrong , having a width of 15million !!?! (wrong)
It turned out that one of the configurable parameters of the graph was incorrect. It was picking the dimension of the layer 2 tensor shape incorrectly from an incorrect addressing the previous tf.shape type property and it exploded the dimensions of the graph.
There were no OOM error messages in /var/log/system.log so I am unsure why the graph initialisation caused the python tensorflow script process to die.
I fixed the dimensions of the graph and graph initialization worked just fine!
My top tip is visualise your graph with Tensorboard before initialisation and training to do a quick check the resultant graph structure you coded it what you expected it to be. You probably will save yourself a lot of time! :-)
A common methodology to debug tensorflow is to replace the placeholders and/or variables with numpy arrays and put them inside tf.const. When you do so you can actually examine the logic of your code by setting a breakpoints and to see numbers in "pythoninc" and not just tensors. It will be much easier to help you if you would post your code here, but here is a dummy example:
with tf.name_scope('scope_name'):
### This block is for debug only
import numpy as np
batch_size = 20
sess = tf.Session()
sess.run(tf.tables_initializer())
init_op = tf.global_variables_initializer()
sess.run(init_op)
### End of first debug block
## Replacing Placeholders for debug - uncomment the placehlolders and comment the numpy arrays to producation mode
const_a = tf.constant((np.random.rand(batch_size, 26) > 0.85).astype(int), dtype=tf.float32)
const_b = tf.constant(np.random.randint(0, 20, batch_size * 26).reshape((batch_size, 26)), dtype=tf.float32)
# real_a_placeholder = tf.log(input_placeholder_dict[A_DATA])
# real_b_placeholder = tf.log(input_placeholder_dict[B_DATA])
# dummy opreation
c = a - b
# selecting top k - in the sanity check you can see here that you actullay get the top items and top values
top_k = 5
top_k_values, top_k_indices = tf.nn.top_k(c,
k=top_k, sorted=True,
name="top_k")
## Replacing Variable for debug - uncomment the variables and comment the numpy arrays to producation mode
Now, run your code with breakpoints and you have 2 options to see the values in the debugger:
1.sess.run(palceholder_name)
2.you can use eval - varaible_name.eval(sessnio=sess)

Tensorflow fail with "Unable to get element from the feed as bytes." when attempting to restore checkpoint

I am using Tensorflow r0.12.
I use google-cloud-ml locally to run 2 different training jobs. In the first job, I find good initial values for my variables. I store them in a V2-checkpoint.
When I try to restore my variables for using them in the second job :
import tensorflow as tf
sess = tf.Session()
new_saver = tf.train.import_meta_graph('../variables_pred/model.ckpt-10151.meta', clear_devices=True)
new_saver.restore(sess, tf.train.latest_checkpoint('../variables_pred/'))
all_vars = tf.trainable_variables()
for v in all_vars:
print(v.name)
I got the following error message :
tensorflow.python.framework.errors_impl.InternalError: Unable to get element from the feed as bytes.
The checkpoint is created with these lines in the first job :
saver = tf.train.Saver()
saver.export_meta_graph(filename=os.path.join(output_dir, 'export.meta'))
saver.save(sess, os.path.join(output_dir, 'export'), write_meta_graph=False)
According to this answer, it could come from the absence of metadata file but I am loading the metadata file.
PS : I use the argument clear_devices=True because the device specifications generated by a launch on google-cloud-ml are quite intricated and I don't need to necessarily get the same dispatch.
The error message was due to the absence of the file named "checkpoint" by inadvertency.
After the reintroduction of this file in the appropriate folder, it appears that the loading of the checkpoint is working.
Sorry for having omitted this key point.
I think the problem could be that when you save the model you set write_meta_graph=False. As a result I don't think you are actually saving the graph so when you try to restore there is no graph to restore. Try setting write_meta_graph=True
The error message was also due to the mistakes in the file named "checkpoint" by inadvertency.
For examples, the folder which contains the models has been moved, but the value of "model_checkpoint_path:" in "checkpoint" still is old path.

Categories

Resources